Re: [Txr-users] using txr for web scraping


From: Kai Carver
Subject: Re: [Txr-users] using txr for web scraping
Date: Tue, 20 Oct 2009 23:06:03 +0200

Hi,

> If users can get by with wget, that's great. It means txr doesn't need
> sockets, URL handling and HTTP support in the code base.

It's nice how txr uses the good old Unix tools philosophy!
  http://www.faqs.org/docs/artu/ch01s06.html

I'll try playing with txr and wget (or curl or some wrapper of same
that does caching) to see if it can at least work as proof-of-concept
for extracting or reformatting data from the web. My primary interest
is in Web stuff, though I do plenty of file-munging too, but not such
systems-oriented stuff as shows up in your examples.

Thanks for clearing up those examples for me, I should have tested them first.

And yes, Unicode and multi-line matching would be needed for some web
scraping cases, though I suspect txr can already be useful without them.

> It would be useful to have a @(freeform) directive, in which some
> delimited portion of the input, spanning multiple lines, is treated as
> one big string with embedded newlines. The query within @(freeform)
> would then be a one-line match for this giant virtual line (without
> the usual constraint that it must match that entire line).

OK, if I understand correctly, by default, expressed in Perl terms:
1. you match line by line,
2. you put an implicit ^ and $ around each match,
3. every @variable is a (.*?) capture unless otherwise specified,
4. @(skip) is a bit like @/.*/, but across lines.
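
To make that analogy concrete, here is a rough Python sketch of the four points. It only models the Perl-style reading described above, not how txr is actually implemented, and the names `match_lines` and `SKIP` are made up for illustration.

```python
import re

SKIP = object()  # hypothetical stand-in for @(skip)


def match_lines(patterns, lines):
    """Match patterns against successive input lines (point 1).

    Each pattern must match a whole line, as if wrapped in ^...$ (point 2).
    A SKIP entry lets the next pattern skip over non-matching lines (point 4).
    Returns a dict of named captures, or None if the match fails.
    """
    bindings = {}
    pos = 0
    skipping = False
    for pat in patterns:
        if pat is SKIP:
            skipping = True
            continue
        while pos < len(lines):
            m = re.fullmatch(pat, lines[pos])
            pos += 1
            if m:
                bindings.update(m.groupdict())
                break
            if not skipping:
                return None  # no skip in effect: the very next line had to match
        else:
            return None  # ran out of input before the pattern matched
        skipping = False
    return bindings


lines = ["header", "junk", "root:x:0:0"]

# Point 3: each variable is modelled as a lazy (.*?) named capture.
var_pattern = r"(?P<user>.*?):(?P<rest>.*?)"

with_skip = match_lines([r"header", SKIP, var_pattern], lines)
without_skip = match_lines([r"header", var_pattern], lines)

print(with_skip)     # {'user': 'root', 'rest': 'x:0:0'}
print(without_skip)  # None
```

Without the skip, the second pattern has to match the line "junk" and the whole match fails, which is the line-based behaviour the list describes.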

Maybe to make things more free form and less line-based it would be
enough to just allow something like the m and s modifiers
(PCRE_MULTILINE and PCRE_DOTALL in PCRE)?
  http://perldoc.perl.org/perlre.html#Modifiers
  http://php.net/manual/en/regexp.reference.internal-options.php
Either as directives
  @(opt ms) @# equivalent to your proposed @(freeform) directive?
or as command-line switches?
Actually being able to specify some of the command-line switches
inside the pattern might be useful, if -m specifies modifiers and -d
specifies a delimiter:
  @(opt (m ms) (d |))

So here would be 3 possibilities on how to express a match across several lines:
  pattern:      "bla. @{FOO /[^.]+/ms}"
or
  pattern:      "@(opt msi)|bla. @{FOO /[^.]+/}"
or
  txr -m ms
  pattern:      "bla. @{FOO /[^.]+/}"
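
For reference, this is roughly what the m and s modifiers do in a Perl-style engine, sketched with Python's `re` module (`re.MULTILINE` and `re.DOTALL`). The sample text is invented, and nothing here says how txr would or should implement such options.

```python
import re

text = "bla. first line\nsecond line. tail"

# No flags: '.' does not match a newline, so the capture stops at the line end.
plain = re.search(r"bla\. (?P<FOO>.+)", text)

# s modifier (DOTALL): '.' also matches newlines, so the capture can span lines.
dotall = re.search(r"bla\. (?P<FOO>.+?)\.", text, re.DOTALL)

# A negated class such as [^.] already matches newlines without any flag.
negated = re.search(r"bla\. (?P<FOO>[^.]+)", text)

# m modifier (MULTILINE): '^' matches at the start of every line,
# not only at the start of the whole string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

print(repr(plain.group("FOO")))    # 'first line'
print(repr(dotall.group("FOO")))   # 'first line\nsecond line'
print(repr(negated.group("FOO")))  # 'first line\nsecond line'
print(line_starts)                 # ['bla', 'second']
```

One thing this shows: in these engines the modifiers only change `.`, `^` and `$`. A pattern like /[^.]+/ crosses newlines by itself once the whole input is presented as one string, so the question for txr is mostly whether the input is fed to the regex line by line.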


I have a couple more questions and a few more suggestions, if I may be
so bold and you don't mind my asking... Please ignore them if they are
not useful to you!

1. Is it possible to reference just one element within a list? Neither
@{A[5]} nor @A[5] works...

$ ./txr -c "@(coll)@{A /[^:]+/}@(until) @(end)
@(output)
@{A[5]}
@(end)
" - < /etc/passwd


2. Might there be a way to do one-liners, maybe by specifying a
character that will serve as a line separator instead of newline?

So this silly example that just reformats /etc/passwd:

$ ./txr -c "@(collect)
@(coll)@{A /[^:]+/}@(end)
@(cat A) ...
@(output)
@A
@(end)
@(end)
" - < /etc/passwd

could become, with separator | and some kind of optional by-default -n
collect loop, a one-liner:

$ ./txr -d\| -n -c "@(coll)@{A /[^:]+/}@(end)|@(cat A) ...
|@(output)|@A @(end)|" - < /etc/passwd

Not terribly readable, for sure... But this kind of one-liner may be useful.

If I had 1. and 2., I could use txr to do the kind of one-liner cut allows:

$ cut -d: -f1,6 /etc/passwd | head -3
root:/root
daemon:/usr/sbin
bin:/bin

except txr would be more powerful.
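
For comparison, the collect-then-index operation that 1. and 2. would enable looks something like this in Python. It is only a sketch of the desired behaviour: `cut_fields` is a made-up helper, and the sample lines are typical-looking passwd entries invented to match the cut output above.

```python
def cut_fields(lines, delimiter=":", fields=(1, 6)):
    """Split each line on the delimiter and keep the 1-based fields requested."""
    result = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)  # the collected list, like A
        picked = [parts[i - 1] for i in fields if i <= len(parts)]  # like A[i]
        result.append(delimiter.join(picked))
    return result


# Hypothetical sample input, not taken from a real system.
sample = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
    "bin:x:2:2:bin:/bin:/usr/sbin/nologin",
]

selected = cut_fields(sample)
for row in selected:
    print(row)
```

With that sample input it prints root:/root, daemon:/usr/sbin and bin:/bin, one per line.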

Another common use case I'd like to try is txr for quick web server
log analysis.


Ok sorry, that was probably too long. I'll let you know if I have a
working example of some web scraping.

k



