Re: [igraph] Memory-Mapped Graphs

Thanks, Tamas. I'll add a little more color to what I'm doing.

I'm actually parsing the output format of https://pypi.python.org/pypi/meliae (which ends up as a list of json objects that have attributes about python objects in memory, and references to other objects). So, I can't use any of the built in igraph loaders. Instead, I'm using add_vertex for each object as I'm reading it, and the queueing up batches of edges to add in bulk periodically.

The advantage I see of memory mapping is that if we can get to the point where a particular igraph_t (and all of the data it references) is backed by a memory mapped file, then loading it doesn't involve parsing at all. Instead, it's just a matter of mapping the file contents into memory again, and wiring up the pointers to point to the right file offsets. I haven't understood enough of the igraph code yet to know whether reallocs of the vertex and edge lists are well contained, but it should be the case that the knowledge of memory-mapped or not can be contained to a relatively small location.

Also, I don't think it's the case that "you have to load the entire graph into memory at some point". If using a memory mapped file, I think you'd be able to rely on the os to only have the subset of pages resident that you actually need to run the particular graph operations. Now, it may be that some operations will force you to walk through the entire graph, which would introduce some memory pressure as pages are paged in and out, but I don't think that would be worse that the current strategy of having all of the nodes dynamically allocated.

-Cale

On Tue Feb 24 2015 at 10:27:59 AM Tamas Nepusz <address@hidden> wrote:

> I would like to store that graph in a persistent format, and it seems like
> igraph using a memory-mapped file (mmap) would do the trick. Has anyone
> attempted that in the past?
I don't think so, and actually, I'm not convinced about the advantages of this
approach -- the thing is that igraph would have to load the entire graph into
memory at some point anyway, so you cannot avoid reading through the whole
file.

I assume that you are using the edge list format (which is the fastest to load)
and I did some profiling on my machine with a graph the size of yours (but
generated with igraph_erdos_renyi_game). On my machine, it took ~25 seconds to
load the entire graph, and about 38.7% of that time was spent in
igraph_create() which converts an edge list (presented as a list of numbers)
into an igraph_t object. This is the part that cannot be avoided, no matter
what format you use to store your graph on the disk.

The part where we can probably save some time is the calls to fscanf(), getc(),
ungetc() and feof(), which accounted for 31.9% + 12.3% + 6.6% + 5.8% of the time,
respectively. These calls are necessary due to how the nodes are stored in the
file: they are written as strings, which have to be parsed. You could probably
save a lot of time by using a binary graph format and then creating an
appropriate reader and writer for that. I am thinking about something like
this:

first, use 4 bytes to store the number of vertices and maybe one byte to store
whether the graph is directed or undirected
then, for each vertex,
store the number of neighbors in 4 bytes
and then store the indices of the actual neighbors on 4 bytes each

This has the advantage that there is no need to call fscanf(), getc(), ungetc()
and feof() -- you can simply read chunks of four bytes as needed and there is
no need to check for EOF repeatedly because you "know" in advance how many
vertices there will be. It also uses a more compact representation because
n edges originating from the same vertex use only (n+1)*4 bytes, so there will
be less data to read from the disk. Also, due to the way how the edge list is
stored in this case, the edge list will be sorted by source vertices by
default, and this might also save some more time in the call to igraph_create()
later on.

Actually, igraph already contains a reader which is very similar to this; it is
called igraph_read_graph_graphdb(). The only difference is that it handles
undirected graphs only and it stores vertex indices on two bytes so it
scales up to 65535 vertices only. It is probably easy to adapt the source code
of igraph_read_graph_graphdb() to make it use four bytes per vertex and then
you are essentially ready. (And maybe it would be a nice addition to the core
library as well, although it's probably Gabor's call in the end and not mine).

All the best,
--
T.

_______________________________________________
igraph-help mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/igraph-help

From:	Calen Pennington
Subject:	Re: [igraph] Memory-Mapped Graphs
Date:	Tue, 24 Feb 2015 16:23:59 +0000