Hi Gabor,
I tried what you suggested, and I can see the merit to the approach. On my first attempt, I trimmed all clusters smaller than 20 members:
trim <- function(G) {
  cls <- clusters(G)
  # components with fewer than 20 members (0-based component ids)
  smallcls <- which(cls$csize < 20) - 1
  # vertices belonging to those components (0-based vertex ids)
  ids_to_remove <- which(cls$membership %in% smallcls) - 1
  delete.vertices(G, ids_to_remove)
}
I then removed the largest cluster using:
remove_largest <- function(G) {
  cls <- clusters(G)
  maxcsize <- max(cls$csize)
  # vertices in the largest component (0-based ids)
  ids_in_largest <- which(cls$membership %in% (which(cls$csize == maxcsize) - 1)) - 1
  # vertices in every other component
  other_ids <- which(cls$membership %in% (which(cls$csize < maxcsize) - 1)) - 1
  # first element: the largest component only; second: everything else
  list(delete.vertices(G, other_ids), delete.vertices(G, ids_in_largest))
}
and took the second graph in the returned list, which I was able to decompose and run betweenness on:
tween <- function(G, OF) {
  comps <- decompose.graph(G)
  for (i in 1:length(comps)) {
    # write one "id,betweenness" line per vertex of this component
    write(rbind(V(comps[[i]])$id, betweenness(comps[[i]])),
          file=OF, nc=2, sep=",", append=TRUE)
  }
}
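End to end, the chain looks roughly like this (assuming the full graph is already loaded as 'g'; the output file name is just a placeholder):

g2 <- trim(g)                          # drop components with fewer than 20 vertices
parts <- remove_largest(g2)            # parts[[1]] is the giant component, parts[[2]] is the rest
tween(parts[[2]], "betweenness.csv")   # decompose the rest and write id,betweenness pairs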
This gives me betweenness data for the large clusters (but not the small ones or the largest one), or about 200K vertices out of my set of 5M vertices. I would really like to get betweenness measures for the entire dataset, and I think it's within reach. I tried adding an additional step:
partition <- function(G) {
  cls <- clusters(G)
  # 0-based vertex ids, grouped by component id modulo 4
  g0ids <- which(cls$membership %% 4 == 0) - 1
  g1ids <- which(cls$membership %% 4 == 1) - 1
  g2ids <- which(cls$membership %% 4 == 2) - 1
  g3ids <- which(cls$membership %% 4 == 3) - 1
  # each returned graph keeps one group of components and drops the other three
  list(delete.vertices(G, c(g1ids, g2ids, g3ids)),
       delete.vertices(G, c(g0ids, g2ids, g3ids)),
       delete.vertices(G, c(g0ids, g1ids, g3ids)),
       delete.vertices(G, c(g0ids, g1ids, g2ids)))
}
That is, I partitioned the set into four graphs and applied the same process to each, this time using an Amazon EC2 instance with 15GB of physical memory, 4 CPU cores, and a 64-bit Fedora OS. (I ran each of the partitions in a parallel, separate instance of R.) Each partition contains about 600K vertices and 600K edges and consists of about 60K clusters. However, each time I run this, all of the instances terminate independently about four hours later (at slightly different times) with the following error:
Error: protect(): protection stack overflow
Error: protect(): protection stack overflow
Execution halted
The error occurs while decompose.graph is running; the CPU is at 100%, there is no swapping, and there is 10GB of free memory. Do you think this is coming from R or from igraph? Is there an R parameter or igraph parameter I can tune to get around this? Any help would be appreciated.
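(I did come across R's --max-ppsize startup option, which enlarges the pointer protection stack, e.g.

R --max-ppsize=500000 -f run_partition.R   # script name is just a placeholder

but I don't know whether that stack is the one that is overflowing here.)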
My next steps will be to try subdividing into 8 partitions, then 16, until I can complete the run. But of course, each run on EC2 costs $10 or so! :-)
Thanks very much!
Dave
David Hunkins
address@hidden
im: davehunkins
415 336-8965
On Jul 19, 2008, at 2:18 AM, Gabor Csardi wrote:
Hi David,
yes, you're right, decompose.graph is not O(V+E); it is in fact O(c(V+E)), where 'c' is the number of components. I'll correct that.
'clusters' gives back the membership of the vertices in its 'membership' component, so you could use this to create subgraphs. But it does not make much sense, since this is exactly what decompose.graph is doing, so it would be just as slow.
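(Untested, and assuming your graph is in 'g' and the 0-based vertex ids of the current R interface, the glue code would be something like:

cls <- clusters(g)
# 0-based vertex ids, grouped by component
vids <- split(0:(vcount(g) - 1), cls$membership)
# one subgraph per component -- this is what decompose.graph does anyway
comps <- lapply(vids, function(v) subgraph(g, v))

so there is no real gain over decompose.graph.)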
What you can try is to eliminate the trivial components from your graph first, i.e. the ones with one or two vertices, maybe up to ten, and then (if there are far fewer components left) decompose the graph. Remember, however, that you cannot run betweenness on a graph with a hundred thousand vertices or more. Most networks have a giant component, so if you have 5M vertices in the full graph, you might still end up with 1M in the largest component. Check this first with 'clusters'.
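Checking is cheap, something like (with 'g' as your full graph):

cls <- clusters(g)
max(cls$csize)    # size of the largest component
table(cls$csize)  # how many components there are of each size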
I've been working on speeding up betweenness.estimate; it is much better now, but of course I'm still not sure that it is fast enough for your graph. It depends on the graph structure as well, not only on the size of the graph. You can give it another try; here is the new package:
http://cneurocvs.rmki.kfki.hu/igraph/download/igraph_0.6.tar.gz
I think a viable approach could be to
1) eliminate the small clusters from the graph,
2) decompose the remainder into components,
3) run betweenness.estimate on the components, with cutoff=2 or 3.
It is a question, however, whether such a small cutoff is enough.
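Roughly, and assuming the graph is in 'g', the same 0-based vertex ids as before, and the 'cutoff' argument of betweenness.estimate, the three steps would look something like:

cls <- clusters(g)
small <- which(cls$csize < 10) - 1                      # 1) components with fewer than ten vertices
g2 <- delete.vertices(g, which(cls$membership %in% small) - 1)
comps <- decompose.graph(g2)                            # 2) split the remainder into components
btw <- lapply(comps, betweenness.estimate, cutoff=3)    # 3) estimate betweenness per component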
Speeding up decompose.graph has been on the TODO list for a long time; I have given it more priority now.
G.
On Fri, Jul 18, 2008 at 01:31:35PM -0700, David Hunkins wrote:
Hi, I'm working on a large disconnected graph (5M vertices, 10M edges, 500K clusters). I'm trying to reduce the time it takes to compute betweenness for each vertex by breaking the graph up into connected components. decompose.graph does this in a very convenient way, since it returns graph objects that I can run betweenness on:
comps <- decompose.graph(g10k)
for (i in 1:length(comps)) {
  write(rbind(V(comps[[i]])$id, betweenness(comps[[i]])),
        file="outfile", nc=2, sep=",", append=TRUE)
}
However, decompose.graph is very slow compared with clusters, which appears to do the same thing in almost no time. (I can compute no.clusters on my graph in a few seconds, whereas decompose.graph, run on the same graph, does not finish in 24 hours.) The docs for the C functions indicate that 'clusters' and 'decompose.graph' both have O(V + E) time complexity, but I have not found this to be true.
It appears that others have selected 'clusters' to partition large graphs:
http://lists.gnu.org/archive/html/igraph-help/2007-12/msg00046.html
Does anybody have some R 'glue code' that makes clusters return a list of graphs like decompose.graph does? (I'm an R newbie.) Or other suggestions / clarifications?
Thanks,
Dave
David Hunkins
address@hidden
im: davehunkins
--
Csardi Gabor <address@hidden> UNIL DGM
_______________________________________________
igraph-help mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/igraph-help