
Re: Identical files across subsequent package revisions


From: Miguel Ángel Arruga Vivas
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 23 Dec 2020 20:07:40 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Hi Julien and Simon,

Julien Lepiller <julien@lepiller.eu> writes:

> On 23 December 2020 at 09:07:23 GMT-05:00, zimoun <zimon.toutoune@gmail.com> wrote:
>>Hi,
>>
>>On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas
>><rosen644835@gmail.com> wrote:
>>> Another idea that might fit well into that kind of protocol---with
>>> harder impact on the design, and probably with a high cost on the
>>> runtime---would be the "upgrade" of the deduplication process towards a
>>> content-based file system as Git does[2].  This way a description of
>>> the nar contents (size, hash) could trigger the retrieval only of the
>>> needed files not found in the current store.
>>
>>Is it not related to the Content-Addressed Store, i.e., the «intensional
>>model»?
>>
>>Chap. 6: <https://edolstra.github.io/pubs/phd-thesis.pdf>
Nix RFC:
>><https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md>
>
> I think this is different, because we're talking about sub-element
> content-addressing. The intensional model is about content-addressing
> whole store elements. I think the idea would be to save individual
> files in, say, /gnu/store/.links, and let nar or narinfo files
> describe the files to retrieve. If we are missing some, we'd download
> them, then create hardlinks. This could even help our deduplication I
> think :)

Exactly.  My first approach would be a tree %links-prefix/hash/size,
where all the internal contents of each store item would be hard
linked: essentially Git's approach with some touches here and there
(some of them probably too optimistic; the first approach isn't usually
the best). :-)
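
To make that layout concrete, here is a minimal sketch of the
deduplication side in Python (the real code would of course live in the
Guile daemon); the LINKS_PREFIX location and the helper names are
invented for the example, and the collision handling discussed in [1]
is left out:

import hashlib, os

LINKS_PREFIX = "/gnu/store/.links-by-content"   # hypothetical location

def file_hash(path):
    """Return the SHA-256 of PATH's contents, hex-encoded."""
    digest = hashlib.sha256()
    with open(path, "rb") as port:
        for chunk in iter(lambda: port.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate_file(path):
    """Move PATH under LINKS_PREFIX/<hash>/<size> and hard-link it back."""
    size = os.stat(path).st_size
    link_dir = os.path.join(LINKS_PREFIX, file_hash(path))
    link_path = os.path.join(link_dir, str(size))
    os.makedirs(link_dir, exist_ok=True)
    if not os.path.exists(link_path):
        os.rename(path, link_path)   # the first copy becomes the canonical one
    else:
        os.remove(path)              # same hash and size; a real implementation
                                     # would compare contents first, see [1]
    os.link(link_path, path)         # replace the original with a hard link

def deduplicate_item(item):
    """Deduplicate every regular file of a store item."""
    for directory, _, files in os.walk(item):
        for name in files:
            path = os.path.join(directory, name)
            if not os.path.islink(path):
                deduplicate_file(path)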

- The garbage collection process could check whether any hard link to
  those files remains and remove them otherwise, deleting the hash
  folder when it becomes empty (a sketch of this check follows the
  list).

- The deduplication process could be in charge of moving the files there
  and placing hard links instead (roughly as in the sketch above), but
  hash collisions with the same size are always a possibility, so some
  mechanism is needed to handle these cases[1] and the vulnerabilities
  associated with them.

- The substitution process could retrieve from the server the
  information about the needed files, check which contents are already
  available locally and which ones must be retrieved, and ensure that
  the "generated nar" is the same as the one from the server (a sketch
  of this negotiation also follows the list).  This is closely related
  to the deduplication process and the mechanism used there[2].
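
The garbage-collection check from the first point could be little more
than a link-count test; again this is only a sketch, with the same
invented LINKS_PREFIX as above:

import os

LINKS_PREFIX = "/gnu/store/.links-by-content"   # hypothetical location

def collect_link_tree():
    """Drop links-tree entries whose content is no longer referenced by any
    store item, i.e. whose only remaining hard link is the tree's own."""
    for hash_dir in os.scandir(LINKS_PREFIX):
        for entry in os.scandir(hash_dir.path):
            if os.stat(entry.path).st_nlink <= 1:
                os.remove(entry.path)
        if not os.listdir(hash_dir.path):    # delete the hash folder when empty
            os.rmdir(hash_dir.path)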
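And for the third point, assuming the server can publish, next to the
narinfo, a manifest of (hash, size, relative path) entries for each item
(the manifest format and the fetch_file endpoint below are pure
invention), the client would only download the contents it does not
already have, hard-link everything into place, and finally serialize the
result as a nar to compare it with the hash advertised by the server:

import os
from urllib.request import urlopen

LINKS_PREFIX = "/gnu/store/.links-by-content"   # hypothetical, as above

def fetch_file(base_url, relative_path, destination):
    """Download a single file from the substitute server (imaginary endpoint)."""
    with urlopen(f"{base_url}/file/{relative_path}") as response, \
         open(destination, "wb") as output:
        output.write(response.read())

def substitute_item(base_url, manifest, target):
    """Build the store item at TARGET from MANIFEST, a list of
    (sha256-hex, size, relative-path) tuples, fetching only what is missing."""
    fetched = 0
    for content_hash, size, relative_path in manifest:
        link_path = os.path.join(LINKS_PREFIX, content_hash, str(size))
        if not os.path.exists(link_path):
            os.makedirs(os.path.dirname(link_path), exist_ok=True)
            fetch_file(base_url, relative_path, link_path)
            fetched += 1
        destination = os.path.join(target, relative_path)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        os.link(link_path, destination)      # reuse local content when available
    # The caller would now serialize TARGET as a nar and compare its hash
    # with the one announced in the narinfo.
    return fetched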

Happy hacking!
Miguel

[1] Perhaps using two different hash algorithms instead of one, or
different salts, could be enough for the "common" error case, as a
collision on both at the same time is quite improbable.  Collisions are
possible anyway for contents bigger than the hash size, so a final
fallback to comparing the actual bytes is probably needed.
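Such a check could look roughly like this; the two algorithms and the
chunk size are arbitrary choices for the sketch:

import hashlib, os
from itertools import zip_longest

def digests(path):
    """Hash PATH with two independent algorithms (an arbitrary pair here)."""
    sha, blake = hashlib.sha256(), hashlib.blake2b()
    with open(path, "rb") as port:
        for chunk in iter(lambda: port.read(1 << 20), b""):
            sha.update(chunk)
            blake.update(chunk)
    return sha.digest(), blake.digest()

def same_content(a, b):
    """Cheap checks first; byte-by-byte comparison as the final word."""
    if os.stat(a).st_size != os.stat(b).st_size:
        return False
    if digests(a) != digests(b):
        return False
    with open(a, "rb") as pa, open(b, "rb") as pb:
        chunks_a = iter(lambda: pa.read(1 << 20), b"")
        chunks_b = iter(lambda: pb.read(1 << 20), b"")
        return all(ca == cb for ca, cb in zip_longest(chunks_a, chunks_b))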

[2] The folder could be hashed, even with a temporary salt agreed upon
with the server, to perform an independent, real-time check, but any
issue here has bigger consequences too, as no byte-by-byte comparison is
possible before the actual transmission.
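Something along these lines, where the traversal order and the encoding
are again arbitrary, and the salt is whatever bytes were agreed upon
with the server for this exchange:

import hashlib, os

def salted_tree_hash(directory, salt):
    """Hash the (relative path, content hash) pairs of DIRECTORY in a fixed
    order, mixed with the salt agreed upon with the server (bytes)."""
    digest = hashlib.sha256(salt)
    for root, dirs, files in os.walk(directory):
        dirs.sort()                          # deterministic traversal
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, directory).encode())
            with open(path, "rb") as port:
                digest.update(hashlib.sha256(port.read()).digest())
    return digest.hexdigest()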


