guix-devel

Re: guix and mirroring dataset


From: zimoun
Subject: Re: guix and mirroring dataset
Date: Thu, 27 May 2021 02:24:30 +0200

Hi,

> Does the guix project and members suggest best guix-ish practices for
> managing on premise mirrors of large file-based data-sets such as
> appear in genomics HPC environments?

From my understanding, it is still “unsolved” and there is no clear
answer.

Basically, the /gnu/store is not designed for managing large datasets,
and a piece is still missing.  On the mailing list gwl-devel@gnu.org, we
have already discussed that point, although nothing concrete came of it,
AFAIU.  We discussed it again recently; see the thread:

<https://yhetil.org/gwl/87r1k2ti7k.fsf@elephly.net/T/#>

Your input is welcome. :-)

> Perhaps a guix-ish response to [Go Get Data (GGD) is a framework
> that facilitates reproducible access to genomic
> data](https://www.nature.com/articles/s41467-021-22381-z) 

AFAIR, Ricardo pointed out this GoGetData.  Personally, I have not yet
looked at the details.

> That would build on GWL?

From my understanding, something is missing between ‘packages’,
‘process’ and ‘workflow’, for instance ‘data’.  And speaking about
genomics, there are two kinds of large data:

 - fixed output (immutable?): think FASTA and FASTQ
 - computed output (mutable?): think BAM and indexes

and it is not clear how to deal with them.  And once that is answered,
how to share them (substitutes)?  Over HTTP, as everyone does, but we
might also want IPFS or something else that would avoid the
mirroring/sync issues.
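
To make the two kinds of data a bit more concrete, here is a minimal
sketch in Python.  It is only an illustration under my own assumptions,
not how Guix or GWL handles data: the names sha256_of,
is_valid_fixed_output and derived_key are hypothetical.  A fixed output
is accepted only when its content hash matches one recorded in advance,
while a computed output is keyed on the hash of its input plus the tool
and version that produced it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks so that
    large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_fixed_output(path: Path, expected_sha256: str) -> bool:
    """Fixed output (think FASTA): the content is known in advance,
    so a mirrored copy is valid only if its hash matches."""
    return sha256_of(path) == expected_sha256.lower()


def derived_key(input_sha256: str, tool: str, version: str) -> str:
    """Computed output (think an index): hypothetical key built from
    the input hash and the tool that produced the result, so that a
    new tool version yields a new key."""
    material = "\n".join([input_sha256, tool, version]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

With such a scheme, where a file is fetched from (HTTP mirror, IPFS, or
something else) would not matter for correctness, since validity is
checked against the hash and not against the location.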

> Use cases would be, e.g. download/sync selected (versions of) genomes
> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
> STAR, GMAP, HiSAT, IGV, BioConductor, etc... 
>
> I see much that addresses analysis workflows, such as
>  -  [Reproducible genomics analysis pipelines with GNU 
> Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
>  - [Scalable Workflows and Reproducible Data Analysis for 
> Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
>  - [PiGx: reproducible genomics analysis pipelines with GNU 
> Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>
> Am I missing similar efforts toward maintaining an up-to-date catalog
> of the genomic resources that such workflows require? 

For now, some are maintained as packages, for instance:

  $ guix search "^r-" hg19 | recsel -C -P name
  r-phastcons100way-ucsc-hg19
  r-bsgenome-hsapiens-ucsc-hg19-masked
  r-txdb-hsapiens-ucsc-hg19-knowngene
  r-bsgenome-hsapiens-ucsc-hg19
  r-snplocs-hsapiens-dbsnp144-grch37
  r-illuminahumanmethylation450kanno-ilmn12-hg19
  r-fdb-infiniummethylation-hg19
  r-copyhelper

which are relatively small.  As another example:

--8<---------------cut here---------------start------------->8---
r-txdb-hsapiens-ucsc-hg38-knowngene total: 91.8 MiB
r-bsgenome-hsapiens-ucsc-hg38 total: 765.2 MiB
r-copyhelper total: 42.9 MiB
--8<---------------cut here---------------end--------------->8---


Hope that helps,
simon


