guix-devel

Improving ‘guix search’ scoring


From: Ludovic Courtès
Subject: Improving ‘guix search’ scoring
Date: Wed, 17 Jul 2019 23:27:47 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)

Hello zimoun!

zimoun <address@hidden> skribis:

> However, some form of TF-IDF [1] should be used to rank the packages
> better when searching.
>
> [1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
>
> For example, I have 10146 package definitions:
>   guix search ' ' | recsel -P name -C | wc -l
>   10146
> and 46 contain the word 'drawing'.
> So the inverse document frequency (using the natural logarithm) is:
>  IDF(drawing) = log(10146 / 46)
>
> Let us consider the first 3 most relevant packages (by the current score).
> The term `drawing` appears:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg | grep -c drawing ;\
>    done
>
> FREQ(drawing, texlive-latex-eepic) = 5
> FREQ(drawing, tuxpaint) = 2
> FREQ(drawing, xfig) = 2
>
> Let us normalize by the length of the document:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg \
>       | recsel -P synopsis,description | wc -w ;\
>    done
>
> LEN(texlive-latex-eepic) = 68
> LEN(tuxpaint) = 60
> LEN(xfig) = 76
>
> Then one definition of the Term-Frequency is:
>
> TF(drawing, texlive-latex-eepic) = 5 / 68
> TF(drawing, tuxpaint) = 2 / 60
> TF(drawing, xfig) = 2 / 76
>
>
> The TF-IDF reads:
>
> TF-IDF(drawing, texlive-latex-eepic) = 5/68 * log(10146/46) = 0.3968
> TF-IDF(drawing, tuxpaint)            = 2/60 * log(10146/46) = 0.1799
> TF-IDF(drawing, xfig)                = 2/76 * log(10146/46) = 0.1420
>
>
> This does not change the current result much, but it shows which
> words are good filters.
>
> Let us consider the word `program` and the package `tuxpaint`.
> The current relevance score is 5 for `program`. The term appears 2
> times (note that `software` appears in the synopsis, which should be
> replaced by `program`).
> The current relevance score is 7 for `drawing`. The term also appears 2 times.
> The difference comes only from the per-field weights.
>
> However, the TF-IDF is totally different:
>
> TF-IDF(drawing, tuxpaint) = 2/60 * log(10146/46)   = 0.1799
> TF-IDF(program, tuxpaint) = 2/60 * log(10146/1056) = 0.0754
>
> So the term `drawing` carries more information than the term `program`
> for the package tuxpaint.

That’s insightful!

I guess computing the TF-IDF could improve the results compared to the
current scoring mechanism.  It would be worth trying to implement it.
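
For reference, here is a minimal sketch of the arithmetic quoted above,
written in Python rather than Guile purely for illustration.  It is not
what ‘guix search’ does today.  The function name `tf_idf`, the table
names, and the hard-coded counts (copied from the quoted message) are
assumptions made for this example.  TF is taken as occurrences divided
by word count and IDF as the natural logarithm of total packages over
matching packages, which is the definition that reproduces the quoted
figures.

```python
import math


def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """Return TF * IDF, with TF = count / length and IDF = ln(N / n)."""
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)  # natural logarithm
    return tf * idf


# Counts copied from the quoted message; nothing here queries Guix.
TOTAL_PACKAGES = 10146
PACKAGES_WITH_TERM = {"drawing": 46, "program": 1056}

# (term, package) -> (occurrences of the term, words in synopsis + description)
OBSERVATIONS = {
    ("drawing", "texlive-latex-eepic"): (5, 68),
    ("drawing", "tuxpaint"): (2, 60),
    ("drawing", "xfig"): (2, 76),
    ("program", "tuxpaint"): (2, 60),
}


def all_scores():
    """Compute the TF-IDF score for every (term, package) pair above."""
    return {
        (term, package): tf_idf(count, length, TOTAL_PACKAGES,
                                PACKAGES_WITH_TERM[term])
        for (term, package), (count, length) in OBSERVATIONS.items()
    }


if __name__ == "__main__":
    for (term, package), score in all_scores().items():
        print(f"TF-IDF({term}, {package}) = {score:.4f}")
```

With these inputs, `drawing` scores more than twice as high as `program`
for tuxpaint even though both terms occur twice, which is the effect
described in the quoted message.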

The bottom line though, as you wrote, is that this all depends on the
quality of synopses and descriptions, and there’s only so much we can
draw from 5-line descriptions.

Thanks,
Ludo’.


