Re: [directory-discuss] Machine readable dump of Free Software Directory
From: Dmitry Marakasov
Subject: Re: [directory-discuss] Machine readable dump of Free Software Directory
Date: Tue, 13 Mar 2018 23:50:02 +0300
User-agent: Mutt/1.9.1 (2017-09-22)
* Dmitry Marakasov (address@hidden) wrote:
> I'd like to support the Free Software Directory in https://repology.org, a
> service which tracks package versions across hundreds of repositories.
> This would both enrich Repology with another source of verified
> information on free software projects and help keep the Directory more up
> to date by detecting outdated information.
>
> I need a machine-readable dump of the FSD for this purpose, as scraping
> and parsing individual wiki pages does not look viable. Something like
> https://directory.fsf.org/wiki/All, but in XML/JSON and with an additional
> version column, would be sufficient. Is something like that possible?
Thanks, I was able to parse the
http://static.fsf.org/nosvn/directory/directory.xml dump.
There are some problems with the data, though:
- It seems there are a lot of entries imported from Debian, which means
incorrect versions (with Debian suffixes, like Beanstalkd 1.10-1
instead of 1.10) and incorrect download locations (ftp.debian.org
instead of upstream). Is there a reliable way to filter these out?
I've tried looking for "Debian_import" in "Submitted_by", but it
doesn't seem to be reliable: after dropping these entries I'm still
seeing Debian suffixes in versions.
- There are a lot of Perl modules which are not distinguishable from
other software. In most repos they have distinct prefixes/suffixes,
e.g. p5-FOO or libFOO-perl, so Repology is able to detect them and
merge them under a single perl:FOO, avoiding clashes. Is it possible
to reliably pick out Perl modules in FSD?
- Assorted garbage: names like "2532 [[file:pipe.png]]gigs", versions
like "Version 1.99"
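For illustration, the kind of cleanup these entries would need might be sketched as below. This is only a sketch under assumptions: the function names and regular expressions are hypothetical and are not taken from the FSD dump or from Repology. Stripping a trailing "-N" would also mangle a legitimate upstream version that happens to end that way, which is one reason it cannot serve as a reliable filter.

```python
import re

# Hypothetical patterns for illustration; they are guesses, not FSD rules.
DEBIAN_REVISION = re.compile(r"^(?P<upstream>.+)-\d+$")   # e.g. "1.10-1"
VERSION_PREFIX = re.compile(r"^version\s+", re.IGNORECASE)  # e.g. "Version 1.99"
WIKI_MARKUP = re.compile(r"\[\[|\]\]")                      # e.g. "[[file:pipe.png]]"


def clean_version(version):
    """Drop a leading 'Version ' and a trailing Debian-style '-N' revision."""
    version = VERSION_PREFIX.sub("", version.strip())
    match = DEBIAN_REVISION.match(version)
    return match.group("upstream") if match else version


def looks_like_garbage_name(name):
    """Flag names that still contain wiki link markup."""
    return bool(WIKI_MARKUP.search(name))
```

With these definitions, clean_version("1.10-1") gives "1.10" and looks_like_garbage_name() flags the "[[file:pipe.png]]" example above, but nothing here tells a Debian revision apart from an upstream version that ends in a dash and digits.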
So the question is whether it's possible to pick only entries
edited and verified by humans, with data conforming to upstream,
and to filter out Perl modules.
I know it's possible to apply some heuristics (e.g. looking for
debian.org and cpan.org in URLs), but I don't really like that approach.
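As a rough sketch of the URL heuristic mentioned above, assuming each entry exposes a download or homepage URL as a plain string (the function name and the returned labels are hypothetical):

```python
from urllib.parse import urlparse


def classify_by_url(url):
    """Guess an entry's origin from the host name of its URL (heuristic only)."""
    host = (urlparse(url).hostname or "").lower()
    if host == "debian.org" or host.endswith(".debian.org"):
        return "debian"   # probably imported from Debian
    if host == "cpan.org" or host.endswith(".cpan.org"):
        return "perl"     # probably a Perl module
    return "other"
```

Matching on the parsed host name rather than on a substring avoids false hits when "debian.org" merely appears in a path, but it remains a guess: an entry with an upstream URL and a Debian-style version would still slip through.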
--
Dmitry Marakasov . 55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
address@hidden ..: jabber: address@hidden http://amdmi3.ru