Sketching landscapes of page farms
The Web is a very large social network. It is important and interesting to understand the "ecology" of the Web: the general relations of Web pages to their environment. Understanding such relations has several important applications, including Web community identification and analysis, and Web spam detection. In this paper, we propose the notion of a page farm, which is the set of pages contributing to (a major portion of) the PageRank score of a target page. We try to understand the "landscapes" of page farms in general: how are the farms of Web pages similar to or different from each other? To sketch the landscapes of page farms, we need to extract page farms extensively. We show that computing page farms is NP-hard, and develop a simple greedy algorithm. We then analyze the farms of a large number of pages (over 3 million) randomly sampled from the Web, and report some interesting findings. Most importantly, the landscapes of page farms also tend to follow a power-law distribution. Moreover, the landscapes of page farms strongly reflect the importance of the Web pages.
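To make the notion of a page farm concrete, the following is a minimal sketch of one possible greedy extraction. It is an illustration under stated assumptions, not the algorithm developed in the paper: the function names `pagerank_score` and `greedy_page_farm`, the threshold parameter `theta`, the damping factor `d = 0.85`, and the use of unnormalized PageRank restricted to a candidate set are all assumptions introduced here.

```python
def pagerank_score(graph, pages, target, d=0.85, iters=50):
    """Unnormalized PageRank of `target`, counting only links among `pages`.

    `graph` maps every page to the list of pages it links to. Out-degrees are
    taken from the full graph, so shrinking `pages` can only lower the score.
    (Assumption: this restricted score stands in for "contribution".)
    """
    pages = set(pages) | {target}
    score = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            incoming = sum(score[q] / len(graph[q])
                           for q in pages if p in graph[q])
            new[p] = (1 - d) + d * incoming
        score = new
    return score[target]


def greedy_page_farm(graph, target, theta=0.8, d=0.85):
    """Greedily add pages until they account for a fraction `theta`
    of the target's PageRank computed on the whole graph."""
    full = pagerank_score(graph, graph.keys(), target, d)
    farm = set()
    candidates = set(graph) - {target}
    current = pagerank_score(graph, farm, target, d)
    while candidates and current < theta * full:
        # Pick the page whose addition raises the target's score the most;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda q: pagerank_score(graph, farm | {q}, target, d))
        farm.add(best)
        candidates.remove(best)
        current = pagerank_score(graph, farm, target, d)
    return farm


# Hypothetical toy graph: c -> a -> t, b -> t, and an unrelated pair x <-> y.
toy = {"a": ["t"], "b": ["t"], "c": ["a"], "x": ["y"], "y": ["x"], "t": []}
farm = greedy_page_farm(toy, "t", theta=0.8)
```

On this toy graph the unrelated pages `x` and `y` never enter the farm of `t`, since adding them does not raise the score of `t`. Each greedy step recomputes PageRank for every candidate, so the sketch is suitable only for very small graphs.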