Re: Any faster way to find frequency of words?
From: Eric Abrahamsen
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 09 May 2021 20:37:10 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

Jean Louis <bugs@gnu.support> writes:
> * Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-09 17:57]:
>> Jean Louis <bugs@gnu.support> writes:
>>
>> > I am interested in whether there is a better way in Emacs Lisp to
>> > find the frequency of words.
>> >
>> > The purpose is to create clickable HTML tag clouds, similar to image
>> > tag clouds. But I will invoke Perl from Emacs to generate them. For
>> > that, I have to analyze the text first.
>>
>> Is there any particular improvement you're trying to make?
>
> I am invoking Perl on the fly and producing a clickable HTML tag
> cloud. It would be boring and tiresome to rewrite Perl's module in
> Emacs Lisp, though useful. For now, I would rather just do it on the fly.
>
> As HTML tags are created from text, I need nothing but alphabetical
> characters. The function is invoked rarely.
>
> It is also useful to generate tags for a particular text, which helps
> me to curate WWW pages.

Right, but what I meant was, is there anything wrong with the
implementation you posted?

>> I guess I'd suggest using Emacs syntax parsing functions, i.e.
>> `forward-word' and `buffer-substring'. Then you can fine-tune the
>> definition of words using the local syntax table.
>
> That is also an interesting approach; it could just go over the words
> and enter them into a list.

Yes, and it can help you skip garbage characters that shouldn't count as
words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
characters that aren't word constituents") can be very useful.

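As a rough sketch of that idea (this is not code from the thread: the name `my-buffer-word-frequency' and its MIN-LENGTH argument are made up for illustration), something like the following walks the current buffer using the syntax table and counts words in a hash table:

```elisp
(require 'cl-lib)

(defun my-buffer-word-frequency (&optional min-length)
  "Return a hash table mapping downcased words in the current buffer to counts.
Words shorter than MIN-LENGTH (default 3) are ignored.
This is a hypothetical helper, not part of Emacs."
  (let ((table (make-hash-table :test #'equal))
        (min-length (or min-length 3)))
    (save-excursion
      (goto-char (point-min))
      ;; Skip a run of non-word characters; stop once the buffer is exhausted.
      (while (progn (skip-syntax-forward "^w") (not (eobp)))
        (let* ((start (point))
               ;; Point is on a word constituent, so this moves at least one char.
               (end (progn (skip-syntax-forward "w") (point)))
               (word (downcase (buffer-substring-no-properties start end))))
          (when (>= (length word) min-length)
            (cl-incf (gethash word table 0))))))
    table))
```

What counts as a "word" here is decided entirely by the buffer's syntax table, so the same function should behave differently (and more usefully) in, say, a text-mode buffer than in a programming mode.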
>> > (mapc (lambda (word)
>> >         (when (> (length word) 2)
>> >           (let ((word (downcase word)))
>> >             (if (numberp (gethash word hash))
>> >                 (puthash word (1+ (gethash word hash)) hash)
>> >               (puthash word 1 hash)))))
>>
>> While hash tables are probably best for very large texts, alists are
>> nice because you can use place-setting with a default, simplifying the
>> above to:
>>
>> (cl-incf (alist-get word frequency-alist 0 nil #'equal))
>
> That gave me the idea to use the default argument of `gethash', so I
> have now written it as below, (puthash word (1+ (gethash word hash 0)) hash);
> that is the result of brainstorming here...
> (defun rcd-word-frequency (text &optional length)
>   "Return word frequency as a hash table built from TEXT.
>
> Words shorter than LENGTH are discarded from counting."
>   (let* ((hash (make-hash-table :test 'equal))
>          (text (text-alphabetic-only text))
>          (length (or length 3))
>          (words (split-string text " " t " "))
>          (words (mapcar 'downcase words))
>          (words (mapcar (lambda (word) (when (>= (length word) length) word))
>                         words))
>          (words (delq nil words)))
>     (mapc (lambda (word)
>             (puthash word (1+ (gethash word hash 0)) hash))

I totally forgot that `gethash' has a default argument! So the line
above can just be:

  (cl-incf (gethash word hash 0))

I don't know why, but I really enjoy that.

>           words)
>     hash))
>
> I am not sure whether I should rather collect it into an alist. Maybe
> I could collect it straight into a list ordered by frequency, like:
>
> (("word" 9) ("another" 7) ("more" 3))
>
> That is what I am doing here, to construct a string of the most frequent tags:
>
> (defun rcd-word-frequency-string (text &optional length how-many-words)
>   (let* ((words (rcd-word-frequency text length))
>          (words (hash-to-list words))
>          (number (or how-many-words 20))
>          (frequent (seq-sort (lambda (a b)
>                                (> (cadr a) (cadr b)))
>                              words)))
>     (mapconcat (lambda (a) (car a))
>                (butlast frequent (- (length frequent) number))
>                " ")))

I don't have a `hash-to-list' function, but once you've built your table
it seems like the rest of it is fairly straightforward.

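For what it's worth, a stand-in for `hash-to-list' can be sketched with `maphash'. The names below (`my-hash-to-sorted-list' and `my-top-words-string') are hypothetical, and the sketch assumes the table maps word strings to integer counts, as in your function:

```elisp
(require 'seq)

(defun my-hash-to-sorted-list (table)
  "Return TABLE's entries as a list of (WORD COUNT), most frequent first.
This is a hypothetical helper, not part of Emacs."
  (let (entries)
    ;; Collect every key/value pair into a two-element list.
    (maphash (lambda (word count) (push (list word count) entries)) table)
    ;; `sort' may modify the list in place, which is fine for a fresh list.
    (sort entries (lambda (a b) (> (cadr a) (cadr b))))))

(defun my-top-words-string (table &optional how-many)
  "Return the HOW-MANY (default 20) most frequent words in TABLE, space-separated."
  (mapconcat #'car
             (seq-take (my-hash-to-sorted-list table) (or how-many 20))
             " "))
```

`seq-take' simply returns the whole list when it holds fewer than HOW-MANY entries, so no length arithmetic is needed. Words with equal counts may come out in any order.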