gwl-devel

Re: Managing data files in workflows


From: Ricardo Wurmus
Subject: Re: Managing data files in workflows
Date: Mon, 03 May 2021 11:18:18 +0200
User-agent: mu4e 1.4.15; emacs 27.2


Konrad Hinsen <konrad.hinsen@fastmail.net> writes:

Hi Ricardo,

We can fix the problem with symlinks by restoring the target of the link instead of the link itself, but I feel that we need to take a step back and consider what this cache is really to be used for.

Indeed, and I have to admit that this isn't clear to me yet. What is it supposed to protect against? Modification of files by other processes of the workflow? Modification of files outside of the workflow? Both?

For the second situation (modification outside of the workflow), I think it would be sufficient to store a checksum, and terminate the workflow with an error if it detects such tampering.
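The checksum idea could be sketched roughly as follows. This is only an illustration in Python, not GWL code: the names `record_checksum`, `verify_checksum`, `TamperedFileError`, and the JSON state file are all hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import json
from pathlib import Path


class TamperedFileError(RuntimeError):
    """Raised when a file no longer matches its recorded checksum."""


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _load_state(state_file: Path) -> dict:
    """Read the recorded checksums, or start empty if none exist yet."""
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def record_checksum(path: Path, state_file: Path) -> None:
    """Remember the current checksum of `path` in the (hypothetical) state file."""
    state = _load_state(state_file)
    state[str(path)] = sha256_of(path)
    state_file.write_text(json.dumps(state, indent=2))


def verify_checksum(path: Path, state_file: Path) -> None:
    """Stop with an error if `path` differs from its recorded checksum."""
    expected = _load_state(state_file).get(str(path))
    if expected is None:
        return  # nothing recorded yet, so there is nothing to compare against
    if sha256_of(path) != expected:
        raise TamperedFileError(f"{path} changed since its checksum was recorded")
```

A workflow runner would call `record_checksum` once a file has been produced and `verify_checksum` before any later process reads it.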

The first situation is more difficult. There are actually two cases:
 1. The workflow intentionally updates files as it proceeds.
 2. The workflow modifies a file by mistake.

Only the workflow author can make the distinction, so this needs some specific input syntax. Case 2 could then again be handled by a simple checksum test for signalling an error.

This leaves case 1, for which the only good solution is to make a copy of the file at the end of each process, and restore it in later runs.

Yes, you are right. On wip-drmaa I changed the cache to never symlink. It either hardlinks or copies. This solves the immediate problem.
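As an illustration of the "hardlink or copy, never symlink" behaviour, here is a minimal Python sketch. It is not the code on wip-drmaa; the function name `link_or_copy` and the choice to resolve symlinks before caching are assumptions made for the example.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(source: Path, destination: Path) -> str:
    """Place `source` at `destination` as a hardlink, or as a copy if linking fails.

    Returns "hardlink" or "copy" to say which method was used.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.exists() or destination.is_symlink():
        destination.unlink()
    # Resolve symlinks first so the cache holds the data itself, never a link to it.
    real_source = source.resolve(strict=True)
    try:
        os.link(real_source, destination)
        return "hardlink"
    except OSError:
        # Hardlinks fail across file systems, for example; fall back to copying.
        shutil.copy2(real_source, destination)
        return "copy"
```

The two methods differ once a file is modified in place: a hardlinked cache entry shares its data with the original and changes with it, whereas a copy does not.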

Yes, the semantics of hardlink/copy differ, but since our assumption is that intermediate files are reproducible, we can ignore this at this point.

I want to make the cache store/restore actions configurable, though, so that you can implement whatever caching method you want (including caching by copying to AWS S3). I’d like to introduce modifiers “immutable” and “mutable”, so that you can write “immutable file "whatever" you "want"” etc. “immutable” would take care of recording hashes and checking previously recorded hashes in a local state directory.
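One way to picture configurable store/restore actions is a small backend interface, sketched here in Python. Everything in it is hypothetical: `CacheBackend`, `CopyBackend`, `InMemoryBackend`, and `run_cached` are illustrative names, not GWL API, and the in-memory class merely stands in for a remote store such as S3 so that the example needs no network access.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Protocol


class CacheBackend(Protocol):
    """Hypothetical interface: any object with these two methods can be a cache."""

    def store(self, key: str, path: Path) -> None: ...

    def restore(self, key: str, path: Path) -> bool: ...


class CopyBackend:
    """Cache files by copying them into a local directory."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir

    def store(self, key: str, path: Path) -> None:
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, self.cache_dir / key)

    def restore(self, key: str, path: Path) -> bool:
        cached = self.cache_dir / key
        if not cached.exists():
            return False
        path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cached, path)
        return True


class InMemoryBackend:
    """Stand-in for a remote object store; keeps file contents in a dict."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def store(self, key: str, path: Path) -> None:
        self.objects[key] = path.read_bytes()

    def restore(self, key: str, path: Path) -> bool:
        if key not in self.objects:
            return False
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(self.objects[key])
        return True


def run_cached(backend: CacheBackend, key: str, output: Path,
               compute: Callable[[Path], None]) -> bool:
    """Restore `output` from the cache, or compute and store it.

    Returns True on a cache hit and False when `compute` had to run.
    """
    if backend.restore(key, output):
        return True
    compute(output)
    backend.store(key, output)
    return False
```

Because `run_cached` only relies on `store` and `restore`, swapping the caching method means passing a different backend object and nothing else.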

--
Ricardo


