Re: [gwl-devel] support for containers
From: Ricardo Wurmus
Subject: Re: [gwl-devel] support for containers
Date: Wed, 30 Jan 2019 13:46:49 +0100
User-agent: mu4e 1.0; emacs 26.1
Hi Simon,
> On Wed, 30 Jan 2019 at 00:16, Ricardo Wurmus <address@hidden> wrote:
>
>> Since we don’t hash the data (because it’s expensive) the scripts are
>> “proxies” for the data files. We compute the hashes over the dependent
>> scripts and assume that this is enough to decide whether to recompute
>> data files or to serve them from the cache/store.
>
> Just to be sure I understand your point, let's pick a simple
> example from a genomics pipeline:
> FASTQ -align-> BAM -variant-> VCF
> So, you intend to hash:
> - the data FASTQ
> - the scripts align and variant
> Or only the scripts containing references to inputs (here FASTQ), where
> the reference is a location fixed by the user?
At the moment there is no good way for a user to pass inputs to a workflow,
so I haven’t yet thought about how to handle the user’s input files;
this still needs to be done. The only way a user can currently provide
files as inputs is by writing a process that “generates” the file (even
if it does so by merely accessing the impure file system). That’s
rather inconvenient, and it wouldn’t work in a container where only
declared files are available.

Users should be able to map files to any process input from the command
line (or through a configuration file). For a provided input we should
take into account the hash of some file property: the timestamp and the
name (cheap), or the contents (expensive).
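
As a rough illustration of the two options (not existing GWL code), the
sketch below derives a key for a user-provided input file either from
its name and modification time or from its contents. It assumes that
Guile-Gcrypt's (gcrypt hash) and (gcrypt base16) modules are installed;
the procedure names input-key/cheap and input-key/expensive are made up
for this example.

```scheme
(use-modules (gcrypt hash)          ; sha256, port-sha256 (assumed installed)
             (gcrypt base16)        ; bytevector->base16-string
             (rnrs bytevectors))    ; string->utf8

;; Hypothetical helper: key derived from the file name (as given) and
;; its modification time only.  Cheap, but it misses edits that leave
;; the timestamp unchanged.
(define (input-key/cheap file)
  (let ((st (stat file)))
    (bytevector->base16-string
     (sha256 (string->utf8
              (string-append file ":"
                             (number->string (stat:mtime st))))))))

;; Hypothetical helper: key derived from the file contents.  Expensive
;; for large inputs because every byte has to be read.
(define (input-key/expensive file)
  (bytevector->base16-string
   (call-with-input-file file port-sha256 #:binary #t)))
```

Either key could then be mixed into the chain of script hashes for any
process that consumes the input. The cheap variant trades accuracy for
speed; the expensive one is only worth it when timestamps cannot be
trusted.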
As for hashing the scripts, here’s what I have so far:
--8<---------------cut here---------------start------------->8---
(define (workflow->data-hashes workflow engine)
  "Return an alist associating each of the WORKFLOW's processes with
the hash of all the process scripts used to generate their outputs."
  (define make-script (process->script engine))
  (define graph (workflow-restrictions workflow))

  ;; Compute hashes for chains of scripts.
  (define (kons process acc)
    (let* ((script (make-script process #:workflow workflow))
           (hash (bytevector->u8-list
                  (sha256 (call-with-input-file script get-bytevector-all)))))
      (cons
       (cons process
             (append hash
                     ;; Hashes of processes this one depends on.
                     (append-map (cut assoc-ref acc <>)
                                 (or (assoc-ref graph process) '()))))
       acc)))

  (map (match-lambda
         ((process . hashes)
          (cons process
                (bytevector->base32-string
                 (sha256
                  (u8-list->bytevector hashes))))))
       (fold kons '()
             (workflow-run-order workflow #:parallel? #f))))
--8<---------------cut here---------------end--------------->8---
I.e. for any process we want the hash over the script used for the
current process and for all processes that lead up to the current one.
This gives us a hash string for every process. We can then look up
“${GWL_STORE}/${hash}/output-file-name” — if it exists we use it. The
workflow runner would now also need to ensure that process outputs are
linked to the appropriate GWL_STORE location upon successful execution.
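
The lookup and the linking step could be sketched roughly as follows.
This is only an illustration under assumptions: the procedure names
cached-output and cache-output! are hypothetical, the store directory
is assumed to exist already, and a hard link is used although a
symbolic link or a copy might be preferable when the store lives on a
different file system.

```scheme
;; Hypothetical helpers; the names and the use of hard links are
;; assumptions made for illustration only.

(define (cached-output store hash name)
  "Return the file name of NAME under STORE/HASH if it exists, else #f."
  (let ((file (string-append store "/" hash "/" name)))
    (and (file-exists? file) file)))

(define (cache-output! store hash output)
  "Link the process OUTPUT file into STORE/HASH after a successful run
and return the name of the link."
  (let* ((dir    (string-append store "/" hash))
         (target (string-append dir "/" (basename output))))
    (unless (file-exists? dir)
      (mkdir dir))              ; assumes STORE itself already exists
    (unless (file-exists? target)
      (link output target))     ; hard link; needs the same file system
    target))
```

A runner would call cached-output before executing a process and skip
the process on a hit; on a miss it would run the process and then call
cache-output! for each declared output.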
--
Ricardo