MODELS AND ALGORITHMS FOR PAGERANK SENSITIVITY

Text from PDF Page: 112

Feature vectors for each host are included with the Becchetti et al. [2008] data. These features are numerical quantities that may bear on the “spaminess” of the pages on that web host, and they include TrustRank [Gyöngyi et al., 2004], PageRank, and Truncated PageRank [Becchetti et al., 2008], among others. Thus, pure PageRank ideas are already included. To support our statement that the standard deviation of RAPr is different, then, we must be able to improve upon the performance with all these features present.

Although measures like PageRank, TrustRank, and RAPr produce one or two scores for each page, the previous study found that computing a few statistics on these features aided the classification task. Thus, for RAPr on each host, we produce

• log of RAPr expectation
• log of (RAPr expectation / log of outdegree)
• log of (RAPr expectation / log of indegree)
• standard deviation of RAPr expectation on in-links
• log of (standard deviation of RAPr expectation on in-links / PageRank)
• log of RAPr standard deviation
• log of (RAPr standard deviation / log of outdegree)
• log of (RAPr standard deviation / log of indegree)
• standard deviation of RAPr standard deviation on in-links
• log of (standard deviation of RAPr standard deviation on in-links / PageRank)
• log of (standard deviation of RAPr / RAPr expectation)

where the RAPr scores are taken from two pages on each host: the host home page and the page with the largest PageRank on the host. In total, we produce 22 features (= 11 from the list × 2 for the different host pages) from the RAPr statistics.

Hosts, with all of their features, are then input to a machine learning framework that attempts to learn a decision rule about spam based on these features.[31] Just like the original work, we use a Bagged J48 tree classifier in Weka [Witten and Frank, 2005] with 10 bags. Bagging a classifier produces a new classifier whose label is the consensus of a bag of independent classifiers.
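The feature list above can be sketched in code. The following is a minimal illustration rather than the thesis implementation: the function and key names, the natural logarithm, the population standard deviation, and the small `EPS` floor that keeps logarithms and ratios finite are all assumptions made for this example.

```python
import math
from statistics import pstdev

EPS = 1e-12  # assumption: floor that keeps logs and ratios finite


def safe_log(x):
    """Natural log with the argument clamped to EPS (a guard, not from the source)."""
    return math.log(max(x, EPS))


def ratio(a, b):
    """a / b with the denominator clamped to EPS (a guard, not from the source)."""
    return a / max(b, EPS)


def spread(values):
    """Population standard deviation; 0.0 when a page has no in-links."""
    return pstdev(values) if values else 0.0


def rapr_features(expectation, std, outdegree, indegree, pagerank,
                  inlink_expectations, inlink_stds):
    """Return the 11 listed statistics for one page, in list order."""
    sd_exp_in = spread(inlink_expectations)  # spread of RAPr expectation over in-links
    sd_std_in = spread(inlink_stds)          # spread of RAPr std over in-links
    return {
        "log_exp": safe_log(expectation),
        "log_exp_over_log_outdeg": safe_log(ratio(expectation, safe_log(outdegree))),
        "log_exp_over_log_indeg": safe_log(ratio(expectation, safe_log(indegree))),
        "sd_exp_inlinks": sd_exp_in,
        "log_sd_exp_inlinks_over_pr": safe_log(ratio(sd_exp_in, pagerank)),
        "log_std": safe_log(std),
        "log_std_over_log_outdeg": safe_log(ratio(std, safe_log(outdegree))),
        "log_std_over_log_indeg": safe_log(ratio(std, safe_log(indegree))),
        "sd_std_inlinks": sd_std_in,
        "log_sd_std_inlinks_over_pr": safe_log(ratio(sd_std_in, pagerank)),
        "log_std_over_exp": safe_log(ratio(std, expectation)),
    }


def host_features(home_page, max_pagerank_page):
    """Combine the statistics of the two host pages into 22 features."""
    features = {}
    for prefix, page in (("home", home_page), ("maxpr", max_pagerank_page)):
        for name, value in rapr_features(**page).items():
            features[f"{prefix}_{name}"] = value
    return features


# Made-up values for a single page, used for both host pages for brevity.
page = dict(expectation=1e-4, std=2e-5, outdegree=10, indegree=20,
            pagerank=1e-4, inlink_expectations=[1e-4, 3e-4],
            inlink_stds=[1e-5, 3e-5])
features = host_features(page, page)
```

Each host page is passed as a dictionary of its scores and degrees, so the same function serves both the home page and the page with the largest PageRank.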
On the training data, we conducted 50 independent 10-fold cross-validation experiments to estimate the performance of the classifier, and table 4.6 displays the results. For each classifier, we show the precision, the fraction of pages labeled as spam that are correctly labeled; the recall, the fraction of total spam pages identified; the f-score, the harmonic mean of precision and recall; the false positive rate, the fraction of non-spam pages mislabeled as spam; and the false negative rate, the fraction of spam pages mislabeled as non-spam.

In the table we also add features based on the derivative. For the derivative features, we use the derivative instead of the standard deviation in the previous list. Both the derivative and RAPr features improve the performance of the classifier! It is a small improvement, only a few tenths of a percent in both cases. Using features from the Beta(−0.5, −0.5, [0.3, 0.99]) distribution, we obtain the best classification performance. In some sense, this distribution represents the least-likely surfer behavior. In contrast, the actual surfer behavior Beta(1.5, 0.5, [0, 0.99]) has the worst performance of all the experiments

[31] Covering a full machine learning background is well outside the scope of this thesis.
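The five evaluation measures described above can be computed from true and predicted labels. The sketch below is only an illustration: the function name `spam_metrics`, the use of 1 for spam and 0 for non-spam, and the choice to return 0.0 when a denominator is zero are assumptions, and the labels are made up rather than taken from table 4.6.

```python
def spam_metrics(y_true, y_pred):
    """Compute the five measures from binary labels (1 = spam, 0 = non-spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam labeled spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam labeled spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam labeled non-spam
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-spam labeled non-spam

    def frac(num, den):
        # Assumption: report 0.0 when a measure is undefined.
        return num / den if den else 0.0

    precision = frac(tp, tp + fp)
    recall = frac(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "fscore": frac(2 * precision * recall, precision + recall),
        "false_positive": frac(fp, fp + tn),
        "false_negative": frac(fn, fn + tp),
    }


# Made-up labels for eight hosts.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = spam_metrics(truth, predicted)
```

With these definitions the false negative rate equals one minus the recall, so the two columns carry the same information in complementary form.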

Original File Name Searched:

gleich-pagerank-thesis.pdf
