
4.8.4 Spam classification

Thus far, the evaluations of RAPr have been speculative. We’ve seen that the standard deviation vector differs from the standard PageRank vector. However, the proof is in the pudding and for RAPr, the pudding is spam.

Web spam occurs when a web site consists primarily of misleading content or links designed to draw visitors to generate ad revenue or inflate another site’s importance. Web spam is distinguished by this artificiality. Identifying these sites is a growing problem and one technique is pure link analysis. Hypothetically, spam sites have dramatically different linking patterns than natural (non-spam) sites. In Castillo et al. [2006] and Becchetti et al. [2008], the authors investigate identifying web spam purely from link analysis. They labeled around 7,500 hosts from the uk-2006 graph as follows.

Label       Train    Test
spam          674    1250
non-spam     4948     601
no label     5780    9551

The data have a training and test subset, although only the training subset is used in Becchetti et al. [2008] and in the following experiments. In the remainder of our own experiment, we continue following the methodology of Becchetti et al. [2008], and add the standard deviation vector from RAPr as an additional feature for a spam classification task. They released their data, which makes experimenting straightforward.

Figure 4.14 shows that the standard deviation information identifies some spam pages. In particular, a high standard deviation relative to PageRank (the right-hand side of the right figure) is a reasonably strong indicator. Ironically, a low standard deviation also appears to be an indicator.

Feature vectors for each host are included with the Becchetti et al. [2008] data. These features are numerical results that may have an impact on the “spaminess” of the pages on that web host and include TrustRank [Gyöngyi et al., 2004], PageRank, Truncated PageRank [Becchetti et al., 2008], amongst others.
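To make the idea of "adding the standard deviation as an additional feature" concrete, the following is a minimal sketch, not the code used in the experiments. It assumes each host's existing features are stored as a list of numbers and that the RAPr standard deviation is available per host. The function names, the host names, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def augment_with_std(
    features: Dict[str, List[float]],
    rapr_std: Dict[str, float],
) -> Dict[str, List[float]]:
    """Return new feature vectors with the RAPr standard deviation appended."""
    augmented = {}
    for host, vector in features.items():
        if host not in rapr_std:
            raise KeyError(f"no standard deviation for host {host!r}")
        # Copy the existing features and add the new one as the last column.
        augmented[host] = list(vector) + [rapr_std[host]]
    return augmented


def std_to_pagerank_ratio(std: float, pagerank: float) -> float:
    """Standard deviation relative to PageRank for a single host."""
    if pagerank <= 0.0:
        raise ValueError("PageRank scores must be positive")
    return std / pagerank


# Toy data (made up): each vector is [TrustRank, PageRank].
features = {
    "a.example.uk": [0.5, 0.25],
    "b.example.uk": [0.125, 0.0625],
}
rapr_std = {"a.example.uk": 0.125, "b.example.uk": 0.25}

augmented = augment_with_std(features, rapr_std)
ratios = {
    host: std_to_pagerank_ratio(rapr_std[host], vector[1])
    for host, vector in features.items()
}
```

In this toy data, the host b.example.uk has a standard deviation that is large relative to its PageRank, which is the kind of pattern described above as an indicator. Whether the extra column helps a classifier is an empirical question that this sketch does not answer.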
Thus, pure PageRank ideas are already included. To support our statement that the standard deviation of RAPr is different, we must therefore be able to improve upon the performance with all these features present. Although measures like PageRank, TrustRank, and RAPr produce one or two scores for each page, the previous study found that computing a few