[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [igraph] PageRank implementations give vastly different results
From: |
Tamás Nepusz |
Subject: |
Re: [igraph] PageRank implementations give vastly different results |
Date: |
Wed, 14 May 2014 00:22:55 +0200 |
Hi Tim,
Thanks for the graph; I'm not done with the investigations, but I already
checked some basic things and I have a few comments to make here.
First, don't use Graph.Read_Edgelist to read your file; use Graph.Read_NCOL
instead. The problem is that igraph assumes that the edgelist format uses
vertex IDs from zero to |V|-1, so you end up creating thousands of isolated
vertices. If you use Graph.Read_Ncol, igraph will consider the numbers in the
input file as symbolic vertex names and "make up" vertex IDs on its own,
assigning the original IDs from the file in the "name" attribute of each vertex.
The reason why I am mentioning this is that ARPACK is thrown off course very
easily if there are lots of isolated vertices in the graph (and also this
distorts the PageRank results anyway because random teleportations usually end
up in one of these isolated vertices). If you use Read_Ncol instead of
Read_Edgelist, the results are much more favourable; ARPACK and PRPACK agree in
about 2/3 of the cases such that the maximum difference between the PageRank
vectors calculated by ARPACK and PRPACK is in the order of 10^{-15}. (Ignore
the power method -- it is deprecated and unsupported, it is left there for sake
of compatibility with older versions only). In the remaining ~1/3 of the cases,
I suspect that ARPACK fails to converge because the sum of the PageRank scores
is not exactly 1.0 for ARPACK (sometimes it is 0.99, but I have even seen
1.02), and the maximum difference between ARPACK and PRPACK is *huge* -- in the
order of 10^{12}. Further investigation shows that it is probably ARPA
CK to blame because the PageRank vector in ARPACK's case contains negative
values -- which clearly should never happen.
I think I also know the reason why ARPACK fails to converge in some cases. We
have seen it before many times that ARPACK is unreliable if the underlying
graph is not strongly connected. PRPACK works around this by calculating
PageRank scores separately for each strongly connected component and then
combining them in a clever way to get the final PageRank vector. This
conjecture seems to be confirmed by the fact that ARPACK and PRPACK agree
completely in 1000 out of 1000 trials if I extract the largest strongly
connected component of your graph -- but that contains only 20 vertices so
that's not a big thing on its own.
The only way to actually validate whether PRPACK is correct is to see if the
vector generated by PRPACK satisfies the PageRank equation. I have high
confidence that this is indeed the case, but I will nevertheless try to confirm
it numerically tomorrow or the day after.
The bottom line is:
- ignore the power method completely
- if you must choose one method out of the three right now, definitely choose
PRPACK
- if you can wait a bit more, I'll let you know whether I find any issues with
PRPACK's results or not
All the best,
Tamas