
Re: [igraph] 'decompose.graph' versus 'clusters'


From: Gabor Csardi
Subject: Re: [igraph] 'decompose.graph' versus 'clusters'
Date: Wed, 23 Jul 2008 03:58:51 -0500
User-agent: Mutt/1.5.17+20080114 (2008-01-14)

David,

The 'protection stack' error message comes from R, but I think the
underlying cause is an igraph bug.

I've sped up decompose.graph; it is fast now, but it does not handle
attributes yet, and that would take another day. This is not the main issue
for you, though. Anyway, I'll upload a new package in the evening (if
everything goes well) that should correct the 'protection stack' error as
well.

Partitioning the graph into subgraphs and calculating the betweenness
separately for each of them is not a good idea, I think, because you cannot
know how far those betweenness values will be from the real ones.
Using betweenness.estimate is a better approach in my opinion.
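For example, something like this (an untested sketch; 'g' stands for your
graph object here):

# approximate betweenness: only shortest paths of length <= 3 are considered
btw <- betweenness.estimate(g, cutoff=3)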

Calculating the "real" betweenness for 1M (or more) vertices 
is definitely out of reach with the current code. One would need parallel 
code, a lot of processors and a couple of weeks for this. 

Best,
Gabor

On Tue, Jul 22, 2008 at 03:56:04PM -0700, David Hunkins wrote:
> Hi Gabor,
> 
> I tried what you suggested, and I can see the merit of the approach. On my
> first attempt, I trimmed all clusters smaller than 20 members:
> 
> # cluster and vertex ids are 0-based in this igraph version, hence the -1
> trim <- function(G) {
>   cls <- clusters(G)
>   smallcls <- which(cls$csize < 20) - 1
>   ids_to_remove <- which(cls$membership %in% smallcls) - 1
>   delete.vertices(G, ids_to_remove)
> }
>
> I then removed the largest cluster using:
> 
> remove_largest <- function(G) {
>   cls <- clusters(G)
>   maxcsize <- max(cls$csize)
>   ids_in_largest <- which(cls$membership %in% (which(cls$csize == maxcsize) - 1)) - 1
>   other_ids <- which(cls$membership %in% (which(cls$csize < maxcsize) - 1)) - 1
>   # first element: only the largest component; second element: everything else
>   list(delete.vertices(G, other_ids), delete.vertices(G, ids_in_largest))
> }
>
> and took the second element of the returned list, which I was able to
> decompose and run betweenness on:
> 
> tween <- function(G, OF) {
>   comps <- decompose.graph(G)
>   for (i in 1:length(comps)) {
>     write(rbind(V(comps[[i]])$id, betweenness(comps[[i]])),
>           file=OF, nc=2, sep=",", append=TRUE)
>   }
> }
> 
> This gives me betweenness data for the large clusters (but not the small ones
> or the largest one), or about 200K vertices out of my set of 5M vertices. I
> would really like to get betweenness measures for the entire dataset, and I
> think that's within reach. I tried adding an additional step:
> 
> partition <- function(G) {
>   cls <- clusters(G)
>   # split the vertices into four groups by cluster id modulo 4
>   g0ids <- which(cls$membership %% 4 == 0) - 1
>   g1ids <- which(cls$membership %% 4 == 1) - 1
>   g2ids <- which(cls$membership %% 4 == 2) - 1
>   g3ids <- which(cls$membership %% 4 == 3) - 1
>   # each list element keeps only one of the four groups
>   list(delete.vertices(G, c(g1ids, g2ids, g3ids)),
>        delete.vertices(G, c(g0ids, g2ids, g3ids)),
>        delete.vertices(G, c(g0ids, g1ids, g3ids)),
>        delete.vertices(G, c(g0ids, g1ids, g2ids)))
> }
> 
> That is, I partitioned the set into four graphs and applied the same process
> to each, this time using an Amazon EC2 instance with 15GB of physical memory,
> 4 CPU cores, and a 64-bit Fedora OS. (I ran each of the partitions in a
> parallel, separate instance of R.) Each partition contains about 600K
> vertices, 600K edges, and consists of about 60K clusters. However, each time
> I run this, all of them terminate independently about four hours later (at
> slightly different times) with the following error:
> 
> Error: protect(): protection stack overflow
> Error: protect(): protection stack overflow
> Execution halted
> 
> 
> The error occurs while decompose.graph is running; the CPU is at 100%, there
> is no swapping, and there is 10GB of free memory. Do you think this is coming
> from R or from igraph? Is there an R parameter or igraph parameter I can tune
> to get around this? Any help would be appreciated.
> 
> My next steps will be to try subdividing into 8 partitions, then 16, until I
> can complete the run. But of course, each run on EC2 costs $10 or so! :-)
> 
> Thanks very much!
> 
> Dave
> 
> David Hunkins
> address@hidden
> im: davehunkins
> 415 336-8965
> 
> 
> On Jul 19, 2008, at 2:18 AM, Gabor Csardi wrote:
> 
> 
>     Hi David,
> 
>     yes, you're right: decompose.graph is not O(V+E), it is in fact O(c(V+E)),
>     where 'c' is the number of components. I'll correct that.
> 
>     'clusters' gives back the membership of each vertex in its 'membership'
>     component, so you could use this to create the subgraphs yourself.
>     But that does not make much sense on its own, since this is exactly what
>     decompose.graph does, so it would be just as slow. A possible way to glue
>     it together is sketched below.
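>     A rough, untested sketch (the function name is just an illustration;
>     cluster and vertex ids are 0-based in the current R interface):
> 
>     # build a list of component subgraphs from the 'membership' vector
>     components.via.clusters <- function(g) {
>       cls <- clusters(g)
>       lapply(seq_along(cls$csize) - 1, function(cid)
>         subgraph(g, which(cls$membership == cid) - 1))
>     }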
> 
>     What you can try is to eliminate the trivial components from your graph
>     first, i.e. the ones with one or two vertices, maybe up to ten, and
>     then (if there are many fewer components left) decompose the graph.
>     Remember, however, that you cannot run betweenness on a graph with a
>     hundred thousand vertices or more. Most networks have a giant component,
>     so if you have 5M vertices in the full graph, you might still end up with
>     1M in the largest component. Check this first with 'clusters'.
> 
>     I've been working on speeding up betweenness.estimate; it is much
>     better now, but of course I'm still not sure that it is fast enough
>     for your graph, since that depends on the graph structure as well, not
>     only on the size of the graph. You can give it another try; here
>     is the new package:
>     http://cneurocvs.rmki.kfki.hu/igraph/download/igraph_0.6.tar.gz
> 
>     I think a viable approach could be to
>     1) eliminate the small clusters from the graph,
>     2) decompose the remainder into components, and
>     3) run betweenness.estimate on the components, with cutoff=2 or 3
>     (see the sketch below).
>     It is a question, however, whether such a small cutoff is enough.
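>     In code, the whole pipeline would look roughly like this (untested
>     sketch; 'g' stands for your full graph, ids are 0-based):
> 
>     cls  <- clusters(g)
>     tiny <- which(cls$csize <= 10) - 1                      # 1) small clusters
>     g2   <- delete.vertices(g, which(cls$membership %in% tiny) - 1)
>     comps <- decompose.graph(g2)                            # 2) split into components
>     est   <- lapply(comps, betweenness.estimate, cutoff=3)  # 3) approximate betweenness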
> 
>     Speeding up decompose.graph has been on the TODO list for a long time;
>     I have given it more priority now.
> 
>     G.
> 
>     On Fri, Jul 18, 2008 at 01:31:35PM -0700, David Hunkins wrote:
> 
>         Hi, I'm working on a large disconnected graph (5M vertices, 10M edges,
>         500k clusters). I'm trying to reduce the time it takes to compute
>         betweenness for each vertex by breaking the graph up into connected
>         components. Decompose.graph does this in a very convenient way, since
>         it returns graph objects that I can run betweenness on:
> 
>         comps <- decompose.graph(g10k)
>         for (i in 1:length(comps)) {
>           write(rbind(V(comps[[i]])$id, betweenness(comps[[i]])),
>                 file="outfile", nc=2, sep=",", append=TRUE)
>         }
> 
>         However decompose.graph is very slow compared with clusters, which
>         appears to do the same thing in almost no time. (I can compute
>         no.clusters on my graph in a few seconds, whereas decompose.graph, run
>         on the same graph, does not finish in 24 hours.) The docs for the C
>         functions indicate that 'clusters' and 'decompose.graph' both have
>         O(V + E) time complexity, but I have not found this to be true.
> 
> 
>         It appears that others have selected 'clusters' to partition large
>         graphs:
> 
>         http://lists.gnu.org/archive/html/igraph-help/2007-12/msg00046.html
> 
>         Does anybody have some R 'glue code' that makes clusters return a list
>         of graphs like decompose.graph does? (I'm an R newbie.) Or other
>         suggestions / clarifications?
> 
>         Thanks,
> 
>         Dave
> 
>         David Hunkins
>         address@hidden
>         im: davehunkins
> 
>     --
>     Csardi Gabor <address@hidden>    UNIL DGM
> 
> 



-- 
Csardi Gabor <address@hidden>    UNIL DGM



