Re: [igraph] 'decompose.graph' versus 'clusters'

igraph-help

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [igraph] 'decompose.graph' versus 'clusters'

From:	David Hunkins
Subject:	Re: [igraph] 'decompose.graph' versus 'clusters'
Date:	Fri, 25 Jul 2008 08:13:39 -0700

Okay, so the 8-way partition did the trick (decompose.graph was ableto pull apart the 300k-node graphs but not the 600k-node graphs,which, when run, had generated the protection faults). I think I'mcalculating real betweenness values for each connected component,because the order of operations is this:1. remove largest cluster (the 2M-plus node cluster that breakseverything)2. remove smallest clusters (the 1- and 2-node clusters that I'm notinterested in)3. take remaining clusters (about 200,000 of them) and divide them upinto 8 groups4. for each of the 8 groups, run decompose.graph to return a list ofsubgraphs5. run betweenness on each of the graphs in the list of subgraphs (soI am only ever running betweenness on something that's a maximalconnected component in the original graph)If I still haven't understood something about betweenness please letme know.

This was surprisingly fast (just 8 hours of cpu time).

Next month I'll be trying such a strategy again on a much largerdataset; I'll be using the faster decompose.graph (presumably that'sin your latest 0.6 tarfile) and let you know how it goes.


Thanks again for improving betweenness and decompose.graph!

Dave

David Hunkins
address@hidden
im: davehunkins
415 336-8965


On Jul 23, 2008, at 1:58 AM, Gabor Csardi wrote:

David,

the 'protection stack' error message is coming from R, but
I think the reason is an igraph bug.

I've speeded up decompose.graph, it is fast now, but does not
handle attributes yet, that would take another day. But this
is not the main issue for you. Anyway, I'll upload a new package inthe
evening (if everything goes well), that should correct the 'protection
stack' error as well.

Partitioning the graph into subgraphs and calculating the betweenness
separately for them is not a good idea I think, because you cannot
know how far these betweenness values will be from the real ones.
Using betweenness.estimate is a better approach in my opinion.

Calculating the "real" betweenness for 1M (or more) vertices
is definitely out of reach with the current code. One would needparallel
code, a lot of processors and a couple of weeks for this.

Best,
Gabor

On Tue, Jul 22, 2008 at 03:56:04PM -0700, David Hunkins wrote:
Hi Gabor,
I tried what you suggested, and I can see the merit to theapproach. On my
first attempt, I trimmed the all clusters smaller than 20 members:

trim <- function(G) {
cls <- clusters(G)
smallcls <- which(cls$csize<20)-1
ids_to_remove <- which(cls$membership %in% smallcls) -1
delete.vertices(G,ids_to_remove)
}

I then removed the largest cluster using:

remove_largest <- function(G) {
cls <- clusters(G)
maxcsize <- max(cls$csize)
ids_in_largest <- which(cls$membership %in% (which(cls$csize==maxcsize)-1))-1other_ids <- which(cls$membership %in% (which(cls$csize<maxcsize)-1))-1
list(delete.vertices(G,other_ids), delete.vertices(G,ids_in_largest))
}
and took the second component returned and was able to decomposeand run
betweenness on it:

tween <- function(G,OF)
{
comps <- decompose.graph(G)
for (i in 1:(length(comps))){
write(rbind(V(comps[[i]])$id,betweenness(comps[[i]])),file=OF,nc=2,sep=",",
append=TRUE)
}
}
This gives me betweenness data for the large clusters (but not thesmall onesor the largest one), or about 200K vertices out of my set of 5Mvertices. Iwould really like to get betweenness measure for the entiredataset, and I
think it's within reach. I tried adding an additional step:

partition <- function(G) {
cls <- clusters(G)
g0ids <- which(cls$membership%%4==0)-1
g1ids <- which(cls$membership%%4==1)-1
g2ids <- which(cls$membership%%4==2)-1
g3ids <- which(cls$membership%%4==3)-1
list( delete.vertices(G,c(g1ids,g2ids,g3ids)),
delete.vertices(G,c(g0ids,g2ids,g3ids)),
delete.vertices(G,c(g0ids,g1ids,g3ids)),
delete.vertices(G,c(g0ids,g1ids,g2ids)))
}
That is, partitioning the set into four graphs, which I applied thesameprocess to, only using Amazon EC2 resource of 15GB physical memorywith 4 cpucores and 64-bit Fedora OS. (I ran each of the partitions in aparallel,separate instance of R.) Each partition contains about 600Kvertices, 600Kedges, and consists of about 60K clusters. However, each time I runthis all ofthem terminate independently about four hours later (at slightlydifferent
times) with the following error:

Error: protect(): protection stack overflow
Error: protect(): protection stack overflow
Execution halted
The error occurs while decompose.graph is running, cpu is at 100%,no swapping,and there is 10GB of free memory. Do you think this is coming fromR or fromigraph? Is there an R parameter or igraph parameter I can tune toget around
this? Any help would be appreciated.
My next steps will be to try subdividing into 8 partitions, then16, until Ican complete the run. But of course, each run on EC2 costs $10 orso! :-)
Thanks very much!

Dave

David Hunkins
address@hidden
im: davehunkins
415 336-8965


On Jul 19, 2008, at 2:18 AM, Gabor Csardi wrote:


   Hi David,
yes, you're right, decompose.graph is not O(V+E), it is in factO(c(V+E)),
   where 'c' is the number of components, I'll correct that.

   'clusters' gives back the membership of the vertices, it is
in the 'membership' component, so you could use this to createsubgraphs.But it does not make sense, since this is exactly whatdecompose.graph is
   doing, so it will be just as slow.
What you can try, is to eliminate the trivial components fromyour graph
   first, i.e. the one with one, two, vertices, maybe up to ten, and
   then (if there are much less components left) decompose the graph.
   Remember,
however, that you cannot run betweenness on a graph with hundredthousendvertices or more. Most networks have a giant component, so ifyou have5M vertices in the full graph, you might still end up with 1M inthe
   largest component. Check this first with 'clusters'.

   I've been working on speeding up betweenness.estimate, it is much
better now, but of course I'm still not sure that it is fastenough
   for your graph, it depends on the graph structure as well, not
   only on the size of the graph. You can give it another try, here
   is the new package:
   http://cneurocvs.rmki.kfki.hu/igraph/download/igraph_0.6.tar.gz

   I think a viable approach could be to
   1) eliminate the small clusters from the graph
   2) decompose the remainder into components
3) run betweenness.estimate on the components, with cutoff=2, or3.
   It is a question, however, whether such a small cutoff is enough.

   Speeding up decompose.graph has been on the TODO list for long,
   I gave more priority now.

   G.

   On Fri, Jul 18, 2008 at 01:31:35PM -0700, David Hunkins wrote:
Hi, I'm working on a large disconnected graph (5M vertices,10M edges,
       500k

       clusters). I'm trying to reduce the time it takes to compute
       betweenness for
each vertex by breaking the graph up into connectedcomponents.
       Decompose.graph
does this in a very convenient way, since it returns graphobjects that
       I can

       run betweenness on:



       comps <- decompose.graph(g10k)

       for (i in 1:length(comps)){
write(rbind(V(comps[[i]])$id,betweenness(comps[[i]])),file="outfile",
       nc=2, sep

       =",", append=TRUE)

       }
However decompose.graph is very slow compared with clusters,which
       appears to
do the same thing in almost no time. (I can computeno.clusters on my
       graph in
a few seconds, whereas decompose.graph, run on the samegraph, does not
       finish
in 24 hours.) The docs for the C functions indicate that'clusters'
       and
'decompose.graph' both have O(V + E) time complexity, but Ihave not
       found this

       to be true.
It appears that others have selected 'clusters' to partitionlarge
       graphs:



       http://lists.gnu.org/archive/html/igraph-help/2007-12/msg00046.html
Does anybody have some R 'glue code' that makes clustersreturn a list
       of

       graphs like decompose.graph does? (I'm an R newbie.) Or other
       suggestions /

       clarifications?



       Thanks,



       Dave



       David Hunkins

       address@hidden

       im: davehunkins








       _______________________________________________

       igraph-help mailing list

       address@hidden

       http://lists.nongnu.org/mailman/listinfo/igraph-help



   --
   Csardi Gabor <address@hidden>    UNIL DGM


   _______________________________________________
   igraph-help mailing list
   address@hidden
   http://lists.nongnu.org/mailman/listinfo/igraph-help
_______________________________________________
igraph-help mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/igraph-help
--
Csardi Gabor <address@hidden>    UNIL DGM


_______________________________________________
igraph-help mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/igraph-help

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [igraph] 'decompose.graph' versus 'clusters', (continued)
- Re: [igraph] 'decompose.graph' versus 'clusters', Gabor Csardi, 2008/07/19
  - Re: [igraph] 'decompose.graph' versus 'clusters', David Hunkins, 2008/07/19
    - Re: [igraph] 'decompose.graph' versus 'clusters', Gabor Csardi, 2008/07/21
    - [igraph] q values from community_fastgreedy() method, venura.2.mendis, 2008/07/21
    - Re: [igraph] q values from community_fastgreedy() method, Gabor Csardi, 2008/07/21
    - RE: [igraph] q values from community_fastgreedy() method, venura.2.mendis, 2008/07/21
    - Re: [igraph] q values from community_fastgreedy() method, Tamas Nepusz, 2008/07/22
    - RE: [igraph] q values from community_fastgreedy() method, venura.2.mendis, 2008/07/22
  - Re: [igraph] 'decompose.graph' versus 'clusters', David Hunkins, 2008/07/22
    - Re: [igraph] 'decompose.graph' versus 'clusters', Gabor Csardi, 2008/07/23
    - Re: [igraph] 'decompose.graph' versus 'clusters', David Hunkins <=
    - Re: [igraph] 'decompose.graph' versus 'clusters', Gabor Csardi, 2008/07/25
    - Re: [igraph] 'decompose.graph' versus 'clusters', David Hunkins, 2008/07/25
    - Re: [igraph] 'decompose.graph' versus 'clusters', Gabor Csardi, 2008/07/25

Prev by Date: Re: [igraph] Nodes at a given distance
Next by Date: Re: [igraph] 'decompose.graph' versus 'clusters'
Previous by thread: Re: [igraph] 'decompose.graph' versus 'clusters'
Next by thread: Re: [igraph] 'decompose.graph' versus 'clusters'
Index(es):
- Date
- Thread