guix-devel


From: zimoun
Subject: Re: Search improvements (Was: Opposition to new single-letter package name "t")
Date: Tue, 9 Mar 2021 19:37:23 +0100

Hi Tobias,

On Tue, 9 Mar 2021 at 18:14, Tobias Geerinckx-Rice <me@tobias.gr> wrote:

> For most upstreams whether or not dashes were in vogue[0] when
> they named their project is literally arbitrary.  We'd penalise
> many other packages like texlive-todonotes, open{ssh,vpn,*},
> ktexteditor, r-performanceanalytics, qutebrowser, ...  It's not a
> net win.

I am not sure I understand what you mean here.

> If I might pet my own peeve, I think clever heuristics appear
> necessary in part because %package-metrics grossly overscores
> package names.  Rank them *below* synopsis & description--which
> will contain the name anyway--with a metric of 1, maybe 2.  Enough
> to keep the relevant stuff above the irrelevant stuff (python- >
> ruby-, etc.) without distorting things as they do now.

I actually did the math, i.e., I wrote down the scoring function,
something like (to simplify)

  score(package, query) = sum_{term in query} (wS cS + wD cD + wN cN)

where wS, wD, wN are the weights for synopsis, description, name and
cS, cD, cN are the numbers of occurrences.  Then, for example, I
computed the Jacobian and so on, in order to see the relation between
the weights w* and the numbers of occurrences c*.  I also looked at
the condition to have:

  score(package_1, query) = score(package_2, query)

and basically, with the linear relevance as it currently is, the
weights (%package-metrics) are not so bad; you cannot find a really
better heuristic.  Another conclusion is that it really depends on the
number of terms in the query.  Basically, if you type one term, you
know what you are looking for and it is the package name, but you are
not sure of it.  For more terms, the result currently depends strongly
on the quality of the synopsis and description.  For instance, try:

  guix search gnu compiler

and compare the descriptions of all the packages with a relevance
higher than 4 (gcc-toolchain).  Well, with a linear and local scoring
function as it is currently, you cannot improve much, IMHO.  By local,
I mean considering only the words of one package, independently of the
words of the other packages.  That is where TF-IDF [1] comes in.  For
a concrete example, see
<https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html>.
Once you have TF-IDF, the natural scoring is BM25 [2].  Well, it is
included in Xapian and there is a patch by Arun using Xapian as a
backend for "guix search", see
<http://issues.guix.gnu.org/issue/39258#14>.  What is missing is a
good evaluation, i.e., example queries.  I asked for such examples
(what query a user types and what they expect) here
<https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00190.html>
but no one replied, and since I am comfortable enough searching with
Guix and other bugs are more annoying for my workflow, I moved on to
other stuff.
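
To make the "local linear score" versus "TF-IDF / BM25" contrast
concrete, here is a minimal Python sketch.  Everything in it is an
assumption made for illustration: the four toy packages and their
texts are invented, the weights are not the values of
%package-metrics, the tokenizer is naive, and k1 and b are just
commonly used BM25 values.  It is not how "guix search" or Xapian
actually implement scoring.

```python
import math
import re
from collections import Counter

# Toy corpus: invented texts, NOT real Guix package metadata.
PACKAGES = {
    "gcc-toolchain": {
        "synopsis": "Complete GCC tool chain for C/C++ development",
        "description": "GNU compiler collection with the GNU C library and binutils",
    },
    "hello": {
        "synopsis": "Hello, GNU world: an example GNU package",
        "description": "GNU Hello prints a greeting and serves as an example "
                       "of GNU coding standards",
    },
    "tcc": {
        "synopsis": "Tiny and fast C compiler",
        "description": "Small and quick to build C programs",
    },
    "sed": {
        "synopsis": "GNU stream editor",
        "description": "GNU sed is a non-interactive text editor",
    },
}

# Illustrative weights wN, wS, wD (assumed, not %package-metrics).
WEIGHTS = {"name": 4, "synopsis": 3, "description": 2}


def tokens(text):
    """Naive tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fields(name):
    """Token lists for the three searchable fields of one package."""
    meta = PACKAGES[name]
    return {
        "name": tokens(name),
        "synopsis": tokens(meta["synopsis"]),
        "description": tokens(meta["description"]),
    }


def linear_score(name, query):
    """Local linear score: sum over terms of wS*cS + wD*cD + wN*cN."""
    per_field = {f: Counter(t) for f, t in fields(name).items()}
    return sum(WEIGHTS[f] * per_field[f][term]
               for term in tokens(query) for f in WEIGHTS)


def document(name):
    """All fields of one package flattened into a single token list."""
    return [t for toks in fields(name).values() for t in toks]


def idf(term):
    """Inverse document frequency: rare terms weigh more (non-local)."""
    n = len(PACKAGES)
    df = sum(1 for name in PACKAGES if term in document(name))
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)


def bm25_score(name, query, k1=1.2, b=0.75):
    """Okapi BM25 over the flattened document of one package."""
    doc = document(name)
    tf = Counter(doc)
    avg_len = sum(len(document(p)) for p in PACKAGES) / len(PACKAGES)
    norm = k1 * (1.0 - b + b * len(doc) / avg_len)
    return sum(idf(t) * tf[t] * (k1 + 1.0) / (tf[t] + norm)
               for t in tokens(query))


if __name__ == "__main__":
    for scorer in (linear_score, bm25_score):
        ranking = sorted(PACKAGES, key=lambda p: scorer(p, "gnu compiler"),
                         reverse=True)
        print(scorer.__name__, ranking)
```

On this toy corpus, the linear score ranks "hello" first for the
query "gnu compiler" simply because it repeats "GNU" a lot, whereas
BM25 ranks "gcc-toolchain" first because "compiler" is rarer across
the corpus than "gnu" and repeated occurrences saturate.  A real
evaluation would of course need real metadata and real queries.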

For another discussion on the topic, see
<https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00222.html>.


Since 2020, I have read bits about "word embedding" (part of vogue[0]
graph neural nets), and I think it would be a great project: first,
some vogue[0] stats to evaluate how the packages cluster together,
i.e., is emacs-foo closer to emacs-bar or to python-foo?  And second,
depending on the results, implement such an embedding to improve
"guix search".  The first means using Julia (or packaging PyTorch for
Guix ;-)) and the second means an implementation targeting Guile (it
would be awesome to have an equivalent to Zygote [3,4] for Guile).
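
As a very crude stand-in for the "how do packages cluster" question,
one could start with plain bag-of-words vectors and cosine similarity
before going to learned embeddings.  The sketch below is only that: the
three descriptions are invented placeholders for emacs-foo, emacs-bar
and python-foo, and bag-of-words is not a word embedding, so it says
nothing about what a real embedding would find.

```python
import math
import re
from collections import Counter

# Invented descriptions for hypothetical packages (placeholders only).
DESCRIPTIONS = {
    "emacs-foo": "Emacs minor mode for editing foo files",
    "emacs-bar": "Emacs major mode for editing bar buffers",
    "python-foo": "Python library for parsing foo files",
}


def bag_of_words(text):
    """Word-count vector of a text, as a Counter."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(pkg1, pkg2):
    """Similarity of two packages based on their descriptions only."""
    return cosine(bag_of_words(DESCRIPTIONS[pkg1]),
                  bag_of_words(DESCRIPTIONS[pkg2]))


if __name__ == "__main__":
    print("emacs-foo vs emacs-bar: ", similarity("emacs-foo", "emacs-bar"))
    print("emacs-foo vs python-foo:", similarity("emacs-foo", "python-foo"))
```

With these made-up texts, emacs-foo ends up closer to emacs-bar than
to python-foo; whether that holds on the real package collection is
exactly what the stats would have to show.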

0: Not a joke. :-)
1: <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>
2: <https://en.wikipedia.org/wiki/Okapi_BM25>
3: <https://github.com/FluxML/Zygote.jl>
4: <https://arxiv.org/pdf/1810.07951.pdf>


Cheers,
simon


