PDF Publication Title:
Text from PDF Page: 007
Much time has b een sp ent making the system resilient in the face of many deeply and intricately awed web artifacts There exist innitely large sites pages and even URLs A large fraction of web pages have incorrect HTML making parser design dicult Messy heuristics are used to help the crawling pro cess For example we do not crawl URLs with cgibin in them Of course it is imp ossible to get a correct sample of the entire web since it is always changing Sites are sometimes down and some p eople decide to not allow their sites to b e indexed Despite all this we b elieve we have a reasonable representation of the actual link structure of publicly accessible web PageRank Implementation We convert each URL into a unique integer and store each hyp erlink in a database using the integer IDs to identify pages Details of our implementation are in PB In general we have implemented PageRank in the following manner First we sort the link structure by Parent ID Then dangling links are removed from the link database for reasons discussed ab ove a few iterations removes the vast ma jority of the dangling links We need to make an initial assignment of the ranks This assignment can b e made by one of several strategies If it is going to iterate until convergence in general the initial values will not aect nal values just the rate of convergence But we can sp eed up convergence by cho osing a go o d initial assignment We b elieve that careful choice of the initial assignment and a small nite numb er of iterations may result in excellent or improved p erformance Memory is allo cated for the weights for every page Since we use single precision oating p oint values at bytes each this amounts to megabytes for our million URLs If insucient RAM is available to hold all the weights multiple passes can b e made our implementation uses half as much memory and two passes The weights from the current time step are kept in memory and the previous weights are accessed linearly on disk Also all the access to the link database A is linear b ecause it is sorted Therefore A can b e kept on disk as well Although these data structures are very large linear disk access allows each iteration to b e completed in ab out minutes on a typical workstation After the weights have converged we add the dangling links back in and recompute the rankings Note after adding the dangling links back in we need to iterate as many times as was required to remove the dangling links Otherwise some of the dangling links will have a zero weight This whole pro cess takes ab out ve hours in the current implementation With less strict convergence criteria and more optimization the calculation could b e much faster Or more ecient techniques for estimating eigenvectors could b e used to improve p erformance However it should b e noted that the cost required to compute the PageRank is insignicant compared to the cost required to build a full text index Convergence Prop erties As can b e seen from the graph in Figure PageRank on a large million link database converges to a reasonable tolerance in roughly iterations The convergence on half the data takes roughly iterations This graph suggests that PageRank will scale very well even for extremely large collections as the scaling factor is roughly linear in log n One of the interesting ramications of the fact that the PageRank calculation converges rapidly is that the web is an expanderlike graph To understand this b etter we give a brief overview of the theory of random walks on graphs refer to MotwaniRaghavan MR for further random walk on a graph is a sto chastic pro cess where at any given time step we are at a no de of the graph and cho ose an outedge uniformly at random to determine the no de to visit at the next time step A graph is said to b e an expander if it is the case that every not subset of no des S has a neighb orho o d set of vertices accessible via outedges emanating from no des details A particular to o largePDF Image | PageRank Citation Ranking Bringing Order to the Web
PDF Search Title:
PageRank Citation Ranking Bringing Order to the WebOriginal File Name Searched:
1999-66.pdfDIY PDF Search: Google It | Yahoo | Bing
Cruise Ship Reviews | Luxury Resort | Jet | Yacht | and Travel Tech More Info
Cruising Review Topics and Articles More Info
Software based on Filemaker for the travel industry More Info
The Burgenstock Resort: Reviews on CruisingReview website... More Info
Resort Reviews: World Class resorts... More Info
The Riffelalp Resort: Reviews on CruisingReview website... More Info
CONTACT TEL: 608-238-6001 Email: greg@cruisingreview.com (Standard Web Page)