guix-devel

Re: GWL pipelined process composition ?


From: Roel Janssen
Subject: Re: GWL pipelined process composition ?
Date: Thu, 19 Jul 2018 10:15:24 +0200
User-agent: mu4e 1.0; emacs 26.1

zimoun <address@hidden> writes:

> Hi Roel,
>
> Thank you for all your comments.
>
>
>> Maybe we can come up with a convenient way to combine two processes
>> using a shell pipe.  But this needs more thought!
>
> Yes, from my point of view, the classic shell pipe `|` has two strong
> limitations for workflows:
>  1. it does not compose at the 'process' level but at the procedure 'level'
>  2. it cannot deal with two inputs.

Yes, and this strongly suggests that shell pipes are indeed limited to
the procedures *the shell* can combine.  So we can only use them at the
procedure level.  They weren't designed to deal with two (or more)
inputs, and if they were, that would make it vastly more complex.

> As an illustration for the point 1., it appears to me more "functional
> spirit" to write one process/task/unit corresponding to "samtools
> view" and another one about compressing "gzip -c". Then, if you have a
> process that filters some fastq, you can easily reuse the compress
> process, and composes it. For more complicated workflows, such as
> RNAseq or other, the composition seems an advantage.

Maybe we could solve this at the symbolic (programming) level instead.

So if we were to try to avoid using "| gzip -c > ..." all over our code,
we could define a function to wrap this.  Here's a simple example:

(define (with-compressed-output command output-file)
  "Run COMMAND in a shell, piping its standard output through gzip
into OUTPUT-FILE."
  (system (string-append command " | gzip -c > " output-file)))

And then you could use it in a procedure like so:

(define-public A
  (process
    (name "A")
    (package-inputs (list samtools gzip))
    (data-inputs "/tmp/sample.sam")
    (outputs "/tmp/sample.sam.gz")
    (procedure
     #~(with-compressed-output
         (string-append "samtools view " #$data-inputs)
         #$outputs))))

This isn't perfect, because we still need to include “gzip” in the
‘package-inputs’.  It doesn't allow multiple input files, nor does it
split the “gzip” command from the “samtools” command on the process
level.  However, it does allow us to express the idea that we want to
compress the output of a command and save that in a file without having
to explicitly provide the commands to do that.
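
One possible way to take this a step further, as a rough sketch only:
build the pipeline string from a list of commands, so that the
compression step becomes just another element.  The names ‘pipeline’
and ‘with-output-to’ are made up for illustration and are not part of
the GWL.  Like the version above, this does no shell quoting, so file
names containing spaces would break it.

```scheme
;; Hypothetical helpers, not part of the GWL.

(define (pipeline . commands)
  "Join COMMANDS (strings) into a single shell pipeline string."
  (string-join commands " | "))

(define (with-output-to output-file . commands)
  "Run COMMANDS as one shell pipeline, redirecting the final standard
output to OUTPUT-FILE.  Return the status value of 'system'."
  (system (string-append (apply pipeline commands) " > " output-file)))

(define (with-compressed-output command output-file)
  ;; Same behaviour as the two-line definition, expressed via the helpers.
  (with-output-to output-file command "gzip -c"))
```

This still composes at the procedure level only, and “gzip” still has
to be listed in ‘package-inputs’.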

>
> As an illustration for the point 2., I do not do with shell pipe:
>
>   dd if=/dev/urandom of=file1 bs=1024 count=1k
>   dd if=/dev/urandom of=file2 bs=1024 count=2k
>   tar -cvf file.tar file1 file2
>
> or whatever process instead of `dd` which is perhaps not the right
> example here.
> To be clear,
>   process that outputs fileA
>   process that outputs fileB
>   process that inputs fileA *and* fileB
> without write on disk fileA and fileB.

Given the ‘dd’ example, I don't see how that could work without
reinventing the way filesystems work.

> All the best,
> simon

Thanks!

Kind regards,
Roel Janssen


