PageRank Citation Ranking􏰏 Bringing Order to the Web

PDF Publication Title:

PageRank Citation Ranking􏰏 Bringing Order to the Web ( pagerank-citation-ranking􏰏-bringing-order-web )

Previous Page View | Next Page View | Return to Search List

Text from PDF Page: 007

Much time has b een sp ent making the system resilient in the face of many deeply and intricately 􏰞awed web artifacts􏰗 There exist in􏰝nitely large sites􏰐 pages􏰐 and even URLs􏰗 A large fraction of web pages have incorrect HTML􏰐 making parser design di􏰖cult􏰗 Messy heuristics are used to help the crawling pro cess􏰗 For example􏰐 we do not crawl URLs with 􏰨cgi􏰧bin􏰨 in them􏰗 Of course it is imp ossible to get a correct sample of the 􏰡entire web􏰡 since it is always changing􏰗 Sites are sometimes down􏰐 and some p eople decide to not allow their sites to b e indexed􏰗 Despite all this􏰐 we b elieve we have a reasonable representation of the actual link structure of publicly accessible web􏰗 􏰘􏰗􏰑 PageRank Implementation We convert each URL into a unique integer􏰐 and store each hyp erlink in a database using the integer IDs to identify pages􏰗 Details of our implementation are in 􏰭PB􏰔􏰓 􏰯􏰗 In general􏰐 we have implemented PageRank in the following manner􏰗 First we sort the link structure by Parent ID􏰗 Then dangling links are removed from the link database for reasons discussed ab ove 􏰤a few iterations removes the vast ma jority of the dangling links􏰥􏰗 We need to make an initial assignment of the ranks􏰗 This assignment can b e made by one of several strategies􏰗 If it is going to iterate until convergence􏰐 in general the initial values will not a􏰕ect 􏰝nal values􏰐 just the rate of convergence􏰗 But we can sp eed up convergence by cho osing a go o d initial assignment􏰗 We b elieve that careful choice of the initial assignment and a small 􏰝nite numb er of iterations may result in excellent or improved p erformance􏰗 Memory is allo cated for the weights for every page􏰗 Since we use single precision 􏰞oating p oint values at 􏰙 bytes each􏰐 this amounts to 􏰘􏰩􏰩 megabytes for our 􏰜􏰚 million URLs􏰗 If insu􏰖cient RAM is available to hold all the weights􏰐 multiple passes can b e made 􏰤our implementation uses half as much memory and two passes􏰥􏰗 The weights from the current time step are kept in memory􏰐 and the previous weights are accessed linearly on disk􏰗 Also􏰐 all the access to the link database􏰐 A􏰐 is linear b ecause it is sorted􏰗 Therefore􏰐 A can b e kept on disk as well􏰗 Although these data structures are very large􏰐 linear disk access allows each iteration to b e completed in ab out 􏰛 minutes on a typical workstation􏰗 After the weights have converged􏰐 we add the dangling links back in and recompute the rankings􏰗 Note after adding the dangling links back in􏰐 we need to iterate as many times as was required to remove the dangling links􏰗 Otherwise􏰐 some of the dangling links will have a zero weight􏰗 This whole pro cess takes ab out 􏰝ve hours in the current implementation􏰗 With less strict convergence criteria􏰐 and more optimization􏰐 the calculation could b e much faster􏰗 Or􏰐 more e􏰖cient techniques for estimating eigenvectors could b e used to improve p erformance􏰗 However􏰐 it should b e noted that the cost required to compute the PageRank is insigni􏰝cant compared to the cost required to build a full text index􏰗 􏰙 Convergence Prop erties As can b e seen from the graph in Figure 􏰙 PageRank on a large 􏰘􏰒􏰒 million link database converges to a reasonable tolerance in roughly 􏰚􏰒 iterations􏰗 The convergence on half the data takes roughly 􏰙􏰚 iterations􏰗 This graph suggests that PageRank will scale very well even for extremely large collections as the scaling factor is roughly linear in log n􏰗 One of the interesting rami􏰝cations of the fact that the PageRank calculation converges rapidly is that the web is an expander􏰧like graph􏰗 To understand this b etter􏰐 we give a brief overview of the theory of random walks on graphs􏰪 refer to Motwani􏰧Raghavan 􏰭MR􏰔􏰚 􏰯 for further random walk on a graph is a sto chastic pro cess where at any given time step we are at a no de of the graph and cho ose an outedge uniformly at random to determine the no de to visit at the next time step􏰗 A graph is said to b e an expander if it is the case that every 􏰤not subset of no des S has a neighb orho o d 􏰤set of vertices accessible via outedges emanating from no des 􏰜 details􏰗 A particular to o large􏰥

PDF Image | PageRank Citation Ranking􏰏 Bringing Order to the Web

PDF Search Title:

PageRank Citation Ranking􏰏 Bringing Order to the Web

Original File Name Searched:

1999-66.pdf

DIY PDF Search: Google It | Yahoo | Bing

Cruise Ship Reviews | Luxury Resort | Jet | Yacht | and Travel Tech More Info

Cruising Review Topics and Articles More Info

Software based on Filemaker for the travel industry More Info

The Burgenstock Resort: Reviews on CruisingReview website... More Info

Resort Reviews: World Class resorts... More Info

The Riffelalp Resort: Reviews on CruisingReview website... More Info

CONTACT TEL: 608-238-6001 Email: greg@cruisingreview.com (Standard Web Page)