igraph-help

Re: [igraph] GraphML


From: Tamás Nepusz
Subject: Re: [igraph] GraphML
Date: Fri, 13 Dec 2013 13:33:00 +0100

Hi,  

> It's in GraphML format and over 900MB in size. I let it run overnight  
> and it's still not done. The file contains email content that I don't  
> need - I'm really just after who sent an email to whom. Is there any  
> way to just read this in, and ignore the rest, that might be faster?

I would do some preprocessing on the GraphML file; in particular, remove those 
subtrees from the GraphML file that are within a <data key="body">...</data> 
section. Since GraphML is just plain XML, your best bet is probably some 
command-line XML manipulation tool. I was told that XMLStarlet 
(http://xmlstar.sourceforge.net/download.php) is quite good at such 
manipulations; I haven’t used it personally but a quick glance into its 
documentation shows that you can probably achieve your goal with:

xmlstarlet ed -N ns=http://graphml.graphdrawing.org/xmlns -d "//ns:data[@key='body']" input.graphml

(The above command line may not be entirely correct, but the idea is that you 
select all the "data" elements in the file whose "key" attribute is equal 
to "body" and delete them. The -N option declares the XML namespace in 
which the data element is to be found.)
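
If XMLStarlet is not at hand, the same deletion could be sketched in Python 
with only the standard library. This is a rough sketch, not something that has 
been run on the actual 900MB file: the function name strip_data_key and its 
parameters are made up for illustration, and it assumes the file uses the 
standard GraphML namespace and that the key is literally "body". It parses 
the input incrementally and drops each matching element as soon as it has been 
read, so the email bodies should not pile up in memory, although the rest of 
the graph is still held in memory until it is written out.

```python
import io
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"
DATA_TAG = f"{{{GRAPHML_NS}}}data"


def strip_data_key(source, destination, key="body"):
    """Copy GraphML from source to destination without <data key=...> elements.

    source and destination may be file names or binary file objects.
    (Hypothetical helper; the names are not part of igraph or XMLStarlet.)
    """
    # Serialize GraphML elements without a prefix (default namespace).
    ET.register_namespace("", GRAPHML_NS)

    parents = []  # stack of elements that are currently open
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            parents.append(elem)
        else:
            parents.pop()
            if elem.tag == DATA_TAG and elem.get("key") == key and parents:
                # Detach the element right away so its text can be freed.
                parents[-1].remove(elem)

    ET.ElementTree(root).write(destination, encoding="utf-8",
                               xml_declaration=True)
```

For example, strip_data_key("input.graphml", "output.graphml") would write a 
copy without the body elements, which can then be loaded into igraph as usual.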

Note that the start of the file downloaded from infochimps seems to have some 
metadata at the front; I had to skip the first 1024 bytes to get to the first 
XML tag.
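
Skipping the leading metadata could be done along these lines. Again this is 
only a sketch under assumptions: the helper name copy_from_first_tag is 
invented here, and it assumes that the metadata itself contains no "<" 
character and that the first XML tag starts within the first 64 KiB of the 
file. Rather than hard-coding 1024 bytes, it looks for the first "<" and 
copies everything from there on.

```python
import shutil


def copy_from_first_tag(src_path, dst_path, marker=b"<", probe_size=65536):
    """Copy src_path to dst_path, starting at the first occurrence of marker.

    Returns the number of leading bytes that were skipped.
    (Hypothetical helper for illustration only.)
    """
    with open(src_path, "rb") as src:
        # Only the beginning of the file is searched for the marker.
        head = src.read(probe_size)
        offset = head.find(marker)
        if offset < 0:
            raise ValueError("marker not found at the start of the file")
        src.seek(offset)
        with open(dst_path, "wb") as dst:
            # copyfileobj streams in chunks, so the file is never fully
            # loaded into memory.
            shutil.copyfileobj(src, dst)
    return offset
```

The returned offset is a quick sanity check on how much was skipped.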

T.


