Re: Architecture to reduce download time when pulling multiple packages
From: James R. Haigh (+ML.GNU.Guix subaddress)
Subject: Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
Date: Fri, 13 Oct 2023 19:05:52 +0100
Hi Josh,
At Z-0400=2023-10-13Fri12:36:01, Josh Marshall sent:
> This is to parallelize connections which should never hurt downloading but
> can help. Mirroring would be parallelizing for providing packages, what I
> want to implement is to parallelize obtaining packages. Server side vs
> client side.
If you are going to do something like this, please use a torrent
architecture like BitTorrent or GNUnet. I suggest Aria2c as a very good CLI
download backend: it can be daemonised and sent instructions over a socket to
add, pause, remove downloads, etc., and it supports magnet URLs, including the
existing nontorrent servers (via ‘as’ parameters, iirc.).
I actually implemented this in a local copy of APT Daemon many years
ago (circa 2011), but the change was not accepted upstream to Launchpad
(because I was not on the bleeding edge; I was too slow to keep up with the
upstream development). My fork got forgotten about, because to get the full
benefit the server would have had to have added a BitTorrent Info Hash (BTIH)
to the metadata of each package, along with the MD5, SHA-256, etc. that it
already did (not a big ask, really). That said, without the full benefit of
having the metadata, it did provide immediate benefit and I used it for many
years, not upgrading my Ubuntu 11.04 Natty Narwhal that I was using back then
until I really had to.
The immediate benefit that it provided was exactly as you described: It
allowed parallelisation of nontorrent downloads, be it from the same server or
from multiple mirrors. Iirc., I achieved this by simply passing the download
list to Aria2c in daemon mode. I think I also converted all the HTTP URLs to
‘as’ parameters in magnet links, so that multiple mirrors could be passed using
multiple ‘as’ parameters in each magnet link. Then I simply relied on Aria2c
being amazing at parallelising everything that I had given it! I also
implemented progress updates so that APT Daemon could reflect how far Aria2c
had got.
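As a rough illustration of the approach described above (not the original APT
Daemon patch, which is not shown here), the sketch below builds the JSON-RPC
messages one might send to an Aria2c daemon started with RPC enabled:
`aria2.addUri` with several mirror URLs for one file, and `aria2.tellStatus` to
poll progress. The mirror URLs, the secret and the helper names are
hypothetical, and it passes plain HTTPS URLs rather than magnet links.

```python
import json


def build_add_uri_request(mirror_urls, secret=None, request_id="add-1"):
    """Build an aria2.addUri request for ONE file reachable via several mirrors.

    All URLs in mirror_urls must point at the same file; aria2 can then
    spread the download across them.
    """
    params = []
    if secret is not None:
        # aria2 expects the RPC secret as a leading "token:<secret>" parameter.
        params.append("token:" + secret)
    params.append(list(mirror_urls))
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "aria2.addUri", "params": params}


def build_tell_status_request(gid, secret=None, request_id="status-1"):
    """Build an aria2.tellStatus request asking only for progress fields."""
    params = []
    if secret is not None:
        params.append("token:" + secret)
    params.append(gid)
    params.append(["status", "completedLength", "totalLength"])
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "aria2.tellStatus", "params": params}


def progress_fraction(status):
    """Turn a tellStatus result into a 0.0..1.0 fraction.

    aria2 reports lengths as decimal strings; totalLength can be "0" before
    the size is known.
    """
    total = int(status["totalLength"])
    if total == 0:
        return 0.0
    return int(status["completedLength"]) / total


if __name__ == "__main__":
    # Hypothetical mirrors serving the same package file.
    mirrors = ["https://mirror-a.example/pool/hello_2.10.deb",
               "https://mirror-b.example/pool/hello_2.10.deb"]
    print(json.dumps(build_add_uri_request(mirrors, secret="changeme")))
```

Each package becomes one `aria2.addUri` call, and the front end polls
`aria2.tellStatus` with the returned GID to update its own progress display.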
The way I implemented this using Aria2c and magnet URLs meant that if
additional hashes were known, they could be used as well. So if the server
metadata simply added BTIHs, swarming could occur, which in turn would
massively reduce load on the central servers and allow anyone who wants to be
a mirror to be one simply by seeding indefinitely.
A default share ratio of 1.0 means that no user is a burden on the network,
unless they deliberately change that. Users can donate to the running costs of
the project simply by increasing their share ratio, which adds another means of
contribution that they may find easier than the others.
Anyone keen to keep old packages online can simply seed them
indefinitely, so this is also really great for archival purposes. Even if the
central project loses interest in the old packages and deletes them, anyone
else can keep them up. The hashes ensure that they have not been tampered with.
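To make the tamper check concrete, here is a minimal sketch that compares a
downloaded file against a SHA-256 digest taken from trusted repository
metadata. The function name and the chunk size are arbitrary choices for
illustration. A BitTorrent client does an analogous check per piece using the
hashes covered by the BTIH.

```python
import hashlib
import io


def sha256_matches(stream, expected_hex, chunk_size=1 << 20):
    """Return True if the bytes read from stream hash to expected_hex.

    stream is any binary file-like object, for example open(path, "rb").
    """
    digest = hashlib.sha256()
    # Read in chunks so large packages do not need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


if __name__ == "__main__":
    # Well-known SHA-256 digest of the three bytes b"abc".
    known = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    print(sha256_matches(io.BytesIO(b"abc"), known))
```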
There is also a really cool benefit that occurs, or can occur, on a
LAN. An entire network of computers can all swarm locally with each other, so
each package only needs to come through the metered last-mile bottleneck from
the WAN precisely once, provided that local broadcasting is supported. I think
this requires Avahi, and I seem to remember that Aria2c supports this, but I am
not certain. I don't remember ever getting this bit working, but I also did not
try hard, because it would have required the metadata that I didn't have until
after download. So even if I had got it working, it would not have been
directly useful unless the APT repositories that I was using included the
BTIHs.
So yeah, loads of great benefits to this architecture, and I highly
recommend it: convert all existing URLs to magnet links (can be done
client-side as I did; or server-side); optionally add any additional mirrors as
additional ‘as’ parameters (again client-side or server-side); add ‘btih’
parameters to the magnet links (the BTIH must be included in the server
metadata to get the full benefit of the swarming, but conversion to magnet link
format can be done client-side or server-side); then simply pass all this to a
really good parallelising backend such as Aria2c; then update any progress data
and relay pause, resume, cancel, etc. to the backend.
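The conversion step in that list could look roughly like the sketch below. It
builds a magnet link from a file name, a list of mirror URLs (each as an ‘as’
parameter) and an optional BTIH (as `xt=urn:btih:`). The function name, the
example URLs and the example hash are hypothetical, and whether a given backend
honours ‘as’ parameters is something to verify rather than assume.

```python
from urllib.parse import quote


def build_magnet(name, mirror_urls, btih_hex=None):
    """Build a magnet link for one package file.

    name        -- display name, emitted as the 'dn' parameter
    mirror_urls -- ordinary HTTP(S) URLs, one 'as' parameter each
    btih_hex    -- optional 40-character hex BitTorrent Info Hash
    """
    parts = []
    if btih_hex is not None:
        parts.append("xt=urn:btih:" + btih_hex.lower())
    parts.append("dn=" + quote(name, safe=""))
    for url in mirror_urls:
        # Percent-encode the whole URL so its own '&' and '=' cannot be
        # mistaken for magnet parameter separators.
        parts.append("as=" + quote(url, safe=""))
    return "magnet:?" + "&".join(parts)


if __name__ == "__main__":
    print(build_magnet(
        "hello_2.10.deb",
        ["https://mirror-a.example/pool/hello_2.10.deb",
         "https://mirror-b.example/pool/hello_2.10.deb"],
        btih_hex="0123456789abcdef0123456789abcdef01234567",  # made-up hash
    ))
```

Without `btih_hex` the link still carries every mirror, which matches the
client-side-only case; adding the hash later does not change the rest of the
link.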
One final note, as I am sure that there are a lot of GNUnet fans on
this list, is that I would try Aria2c first to see how well it can work, and
then try GNUnet or whatever else once you have a standard to benchmark against.
Both are Free Software, so no concern there. Aria2c is an all-round download
manager CLI that works with or without swarming, i.e. it is just as good at
HTTPS as it is at BitTorrent, and can do both at the same time. GNUnet has the
advantage of working from SHA-256 iirc., which is generally already included in
the metadata of the repositories of various distributions, but I think it lacks
a lot of other features, as well as the stability and the ecosystem of
alternative backends that the BitTorrent network has.
Of course, there is no harm in including other hashes along with BTIH,
to allow people to experiment with alternative backends, while always ensuring
that what works works well. Another hash that may be useful to include is the
Tiger Tree Hash, which is structurally very similar to BTIH, but stronger,
iirc.
The first thing that the Guix project can do to signal interest in this
architecture is to simply include the BTIH of each package in the repository
metadata. Be it in magnet URL form or not does not matter because the client
can later convert that as needed. The important thing is an authoritative
statement in metadata that this version of this package has this BTIH. Once
that metadata is available, the game is on to implement swarming support, be it
with Aria2c as a backend (as I recommend at least starting with) or otherwise.
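For anyone wondering why the BTIH has to be stated authoritatively rather than
derived by each client, here is a minimal sketch of how a version 1 BTIH is
computed for a single-file torrent: it is the SHA-1 of the bencoded `info`
dictionary, which contains the file name, the length, the piece length and the
SHA-1 of every piece. The helper names and the default piece length are
assumptions for illustration only.

```python
import hashlib


def bencode(value):
    """Minimal bencoder for ints, bytes, str, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def single_file_btih(data, name, piece_length=262144):
    """Return the hex v1 info hash for a torrent holding one file."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "length": len(data),
        "name": name,
        "piece length": piece_length,
        "pieces": pieces,
    }
    return hashlib.sha1(bencode(info)).hexdigest()


if __name__ == "__main__":
    payload = b"example package contents" * 1000
    print(single_file_btih(payload, "hello_2.10.deb"))
    # Same bytes, different piece length: a different BTIH.
    print(single_file_btih(payload, "hello_2.10.deb", piece_length=16384))
```

Because the name and the piece length feed into the hash, two people hashing
the same package with different settings get different BTIHs and end up in
separate swarms, so publishing one value in the repository metadata is what
lets everyone join the same swarm.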
I know that this architecture works well from first-hand experience
with APT Daemon, which is written in Python. The only failure I had with it was
a lack of
upstream support. So I consider it important to first attain the upstream
approval before really investing more time into this. I seem to remember
suggesting this to the Nix project many years ago and didn't get anywhere, and
now I don't have the energy to try to improve upstream projects if they reject
my ideas, so I'll be interested to see whether you have any success with your
attempt to do the same.
Good luck! ;-)
Kind regards,
James.
--
Wealth doesn't bring happiness, but poverty brings sadness.
Sent from Debian with Claws Mail, using email subaddressing as an alternative
to error-prone heuristical spam filtering.
Postal: James R. Haigh, Middle Farm, Vennington, nr. Westbury, nr. Shrewsbury,
Salop, SY5 9RG, Britain