Skip to main content
Journal cover image

SimRank*: effective and scalable pairwise similarity search based on graph topology

Publication ,  Journal Article
Yu, W; Lin, X; Zhang, W; Pei, J; McCann, JA
Published in: VLDB Journal
June 1, 2019

Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed to (have incoming edges) from similar nodes” can be regarded as an aggregation of similarities based on incoming paths. Despite its popularity in various applications (e.g., web search and social networks), SimRank has an undesirable trait, i.e., “zero-similarity”: it accommodates only the paths of equal length from a common “center” node, whereas a large portion of other paths are fully ignored. In this paper, we propose an effective and scalable similarity model, SimRank*, to remedy this problem. (1) We first provide a sufficient and necessary condition of the “zero-similarity” problem that exists in Jeh and Widom’s SimRank model, Li et al. ’s SimRank model, Random Walk with Restart (RWR), and ASCOS++. (2) We next present our treatment, SimRank*, which can resolve this issue while inheriting the merit of the simple SimRank philosophy. (3) We reduce the series form of SimRank* to a closed form, which looks simpler than SimRank but which enriches semantics without suffering from increased computational overhead. This leads to an iterative form of SimRank*, which requires O(Knm) time and O(n2) memory for computing all (n2) pairs of similarities on a graph of n nodes and m edges for K iterations. (4) To improve the computational time of SimRank* further, we leverage a novel clustering strategy via edge concentration. Due to its NP-hardness, we devise an efficient heuristic to speed up all-pairs SimRank* computation to O(Knm~) time, where m~ is generally much smaller than m. (5) To scale SimRank* on billion-edge graphs, we propose two memory-efficient single-source algorithms, i.e., ss-gSR* for geometric SimRank*, and ss-eSR* for exponential SimRank*, which can retrieve similarities between all n nodes and a given query on an as-needed basis. This significantly reduces the O(n2) memory of all-pairs search to either O(Kn+ m~) for geometric SimRank*, or O(n+ m~) for exponential SimRank*, without any loss of accuracy, where m~ ≪ n2. (6) We also compare SimRank* with another remedy of SimRank that adds self-loops on each node and demonstrate that SimRank* is more effective. (7) Using real and synthetic datasets, we empirically verify the richer semantics of SimRank*, and validate its high computational efficiency and scalability on large graphs with billions of edges.

Duke Scholars

Published In

VLDB Journal

DOI

EISSN

0949-877X

ISSN

1066-8888

Publication Date

June 1, 2019

Volume

28

Issue

3

Start / End Page

401 / 426

Related Subject Headings

  • Information Systems
  • 4605 Data management and data science
  • 0806 Information Systems
  • 0805 Distributed Computing
  • 0804 Data Format
 

Citation

APA
Chicago
ICMJE
MLA
NLM
Yu, W., Lin, X., Zhang, W., Pei, J., & McCann, J. A. (2019). SimRank*: effective and scalable pairwise similarity search based on graph topology. VLDB Journal, 28(3), 401–426. https://doi.org/10.1007/s00778-018-0536-3
Yu, W., X. Lin, W. Zhang, J. Pei, and J. A. McCann. “SimRank*: effective and scalable pairwise similarity search based on graph topology.” VLDB Journal 28, no. 3 (June 1, 2019): 401–26. https://doi.org/10.1007/s00778-018-0536-3.
Yu W, Lin X, Zhang W, Pei J, McCann JA. SimRank*: effective and scalable pairwise similarity search based on graph topology. VLDB Journal. 2019 Jun 1;28(3):401–26.
Yu, W., et al. “SimRank*: effective and scalable pairwise similarity search based on graph topology.” VLDB Journal, vol. 28, no. 3, June 2019, pp. 401–26. Scopus, doi:10.1007/s00778-018-0536-3.
Yu W, Lin X, Zhang W, Pei J, McCann JA. SimRank*: effective and scalable pairwise similarity search based on graph topology. VLDB Journal. 2019 Jun 1;28(3):401–426.
Journal cover image

Published In

VLDB Journal

DOI

EISSN

0949-877X

ISSN

1066-8888

Publication Date

June 1, 2019

Volume

28

Issue

3

Start / End Page

401 / 426

Related Subject Headings

  • Information Systems
  • 4605 Data management and data science
  • 0806 Information Systems
  • 0805 Distributed Computing
  • 0804 Data Format