Re: [Txr-users] using txr for web scraping


From: Kaz Kylheku
Subject: Re: [Txr-users] using txr for web scraping
Date: Mon, 19 Oct 2009 12:56:32 -0700

On Mon, Oct 19, 2009 at 5:49 AM, Kai Carver <address@hidden> wrote:
> Hi,
>
> txr looks interesting!
>
>
> I wonder whether it could be adapted for web scraping, if we
> substitute "URL" for "text files".
>
>  @(next SOURCE)
>
> It might be as simple as allowing SOURCE to be a URL in next directives?
>
> Ah, I suppose you could already do web scraping with shell commands?
>
>  @(next "!wget -O - http://www.google.com/")

Hi Kai!

If users can get by with wget, that's great. It means txr doesn't need
sockets, URL handling and HTTP support in the code base.
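
For instance, a quick title scrape could look something like this
(untested sketch; it assumes wget is installed and that the page keeps
its <title> element on a single line):

@(next "!wget -q -O - http://www.google.com/")
@(skip)
@PRE<title>@TITLE</title>@POST
@(output)
title: @TITLE
@(end)

The PRE and POST variables just soak up the surrounding HTML on that
line, since the whole line has to be matched.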

External processing can also give us some computation ability, such as
math. For example, here is a clone of the ``free'' utility based on
Linux's /proc/meminfo. We capture the output of a shell command which
echoes the results of arithmetic expansion; the command line itself is
generated using a quasiliteral.

#!txr -f
@(next "/proc/meminfo")
@(skip)
MemTotal:@/ +/@TOTAL kB
MemFree:@/ +/@FREE kB
Buffers:@/ +/@BUFS kB
Cached:@/ +/@CACHED kB
@(skip)
SwapTotal:@/ +/@SWTOT kB
SwapFree:@/ +/@SWFRE kB
@(next `!echo $(( @TOTAL - @FREE ))`)
@USED
@(next `!echo $(( @USED - @BUFS - @CACHED ))`)
@RUSED
@(next `!echo $(( @FREE + @BUFS + @CACHED ))`)
@RFREE
@(next `!echo $(( @SWTOT - @SWFRE ))`)
@SWUSE
@(output)
               TOTAL         USED         FREE      BUFFERS       CACHED
Mem:    @{TOTAL -12} @{USED  -12} @{FREE  -12} @{BUFS  -12} @{CACHED -12}
+/- buffers/cache:   @{RUSED -12} @{RFREE -12}
Swap:   @{SWTOT -12} @{SWUSE -12} @{SWFRE -12}
@(end)
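
For reference, the /proc/meminfo lines being matched have this general
shape (the numbers are just illustrative, and other fields are omitted):

  MemTotal:      2060516 kB
  MemFree:        541292 kB
  Buffers:         83792 kB
  Cached:         743004 kB
  ...
  SwapTotal:     2104476 kB
  SwapFree:      2104476 kB

The @/ +/ regex in the query absorbs the variable-width padding between
the field name and the number.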

We are going to need Unicode support for handling arbitrary documents
from the web!

I started a Unicode branch for this already.

Using the ``char'' type made initial development easy, but it's a
serious limitation.

> By the way the documentation seems quite good (haven't quite read it all yet).
>
>  http://www.nongnu.org/txr/txr-manpage.html
>
> Small suggestions for improvement of the documentation:
>
> - an example or two for Positive Match would be welcome.
>
> If I understand correctly, the examples could be:
>
>  pattern:      "text @{FOO /[0-9]/}"
>  data:         "text 123 some more text"
>  result:       FOO="123"

Not quite. Firstly, to get the three digits, we need the regex to match
more than one character. (There is no implicit collecting over a regex).

Secondly, it's still necessary to match the rest of the line.

Entire lines must always be matched (but not entire input sources).

We could ignore all material after the @FOO match with the @/.*/
regexp:

  text @{FOO /[0-9]+/}@/.*/

or have a more specific match:

  text @{FOO /[0-9]+/} some more text

which is like

  text @FOO some more text

except that FOO is now constrained to consist of digits.
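
So a complete query for that might read (a sketch; the data.txt name
and the output clause are just for illustration):

@(next "data.txt")
text @{FOO /[0-9]+/} some more text
@(output)
FOO=@FOO
@(end)

Run against a data.txt containing the line ``text 123 some more text'',
this binds FOO to 123.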

>  pattern:      "phone: @{area 3} @{local 8}"
>  data:         "phone: 617 867-5309 ext. 123"
>  result:       area="617", local="867-5309"

Again, there's the need to match right to the end of the line.
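
E.g. something like this (a sketch; REST is only there to consume the
rest of the line):

  phone: @{area 3} @{local 8} ext. @REST

which binds area to 617 and local to 867-5309, with REST soaking up the
trailing 123.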

> - the (until) example (before "The Flatten Directive") shows the same query twice.

Will fix.

> One more question: is there "multi-line mode"? That is, can you match
> a variable across several lines? Something like:

No. The regex engine can match newlines, of course (and you can code
these using \n), but that's of no use because the data never has
newlines in it.

That's a feature to think about.

I'm planning to refactor the implementation so that the matching of
horizontal material (within a line) is more unified with the logic for
vertical processing; that would be a good time to roll in this idea.

>  pattern:      "bla bla. @{FOO /[^.]+/m}"
>  data:         "bla bla. This is a sentence
> on two lines. Bla bla"
>  result:       FOO="This is a sentence
> on two lines"
>
> Or would I need to use a collect and concatenate to do that?

Currently that is the case.
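
Roughly along these lines (untested sketch; it gathers lines up to a
blank line, then joins them with spaces in the output clause, trailing
space and all):

@(collect)
@LINE
@(until)

@(end)
@(output)
@(rep)@LINE @(end)
@(end)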

It would be useful to have a @(freeform) directive, in which some
delimited portion of the input, spanning multiple lines, is treated as
one big string with embedded newlines. The query within @(freeform)
would then be a one-line match for this giant virtual line (without
the usual constraint that it must match that entire line).
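
Hypothetically, Kai's example might then come out something like this
(pure speculation about the syntax at this point):

@(freeform)
bla bla. @{FOO /[^.]+/}
@(end)

where FOO would capture the two physical lines as one string, with the
newline embedded in it.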



