- use `scrapy` to collect page relationship information and build the PageRank dataset
- use `hadoop` and the dataset collected by `scrapy` to implement the PageRank algorithm
We use `scrapy` to collect the PageRank dataset. The related code is located in the `scrapy/` directory.
- install `scrapy` first

  ```
  pip install scrapy
  ```
- run `scrapy` inside the `scrapy/` directory

  ```
  cd scrapy
  scrapy crawl pagerank
  ```
- change `start_urls` and `allowed_domains` (optional)

  The default is `start_urls = ['/~https://github.com/']`; we can change it with:

  ```
  scrapy crawl pagerank -a start='https://stackoverflow.com/'
  ```

  We can also restrict the crawl to a single domain:

  ```
  scrapy crawl pagerank -a start='https://stackoverflow.com/' -a domain='stackoverflow.com'
  ```

  Be careful: `domain` should match `start`; if you don't set `start`, it should match the default `start_urls`. A minimal spider sketch is shown after this list.
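For orientation, here is a minimal sketch of what such a spider might look like. It is an assumption, not the code from `scrapy/`: only the `pagerank` spider name, the default start URL, and the `-a start` / `-a domain` arguments come from the steps above; the class name and link handling are illustrative.

```python
# pagerank_spider_sketch.py -- illustrative only; the real spider in
# scrapy/ may be organized differently.
import scrapy


class PageRankSpider(scrapy.Spider):
    name = 'pagerank'                        # matches `scrapy crawl pagerank`
    start_urls = ['/~https://github.com/']      # the documented default

    def __init__(self, start=None, domain=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if start:                            # `-a start=...` overrides the default
            self.start_urls = [start]
        if domain:                           # `-a domain=...` restricts the crawl
            self.allowed_domains = [domain]

    def parse(self, response):
        # Record every outgoing link; this link graph is what the
        # `transition` file is built from.
        for href in response.css('a::attr(href)').getall():
            yield {'src': response.url, 'dst': response.urljoin(href)}
            yield response.follow(href, callback=self.parse)
```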
The data collected by the spider will be stored into `keyvalue`, `transition` and `pr0` respectively.

`keyvalue` records each URL and its key, separated by `'\t'`:

```
0\t/~https://github.com/\n
1\t/~https://github.com/features\n
2\t/~https://github.com/business\n
3\t/~https://github.com/explore\n
...
```
`transition` records the relationships between pages:

```
0\t0,1,2,3,4,5,6,7,8,9\n
...
```

Here the page with id 0 points to the pages with ids 0,1,2,3,4,5,6,7,8,9; the source id and the target list are separated by `'\t'`, and the target ids are separated by `','`.
`pr0` holds the initial PageRank value for each page:

```
0\t1\n
1\t1\n
2\t1\n
...
```
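To make these formats concrete, here is a small sketch that loads the three files into Python dictionaries. The file names come from above; the parsing code itself is illustrative, not part of the repo.

```python
# load_dataset.py -- sketch for reading the spider's output files.

def load_keyvalue(path='keyvalue'):
    """Map page id -> url, one tab-separated pair per line."""
    with open(path) as f:
        return {int(k): url for k, url
                in (line.rstrip('\n').split('\t') for line in f)}

def load_transition(path='transition'):
    """Map page id -> list of target ids, targets comma-separated."""
    graph = {}
    with open(path) as f:
        for line in f:
            src, targets = line.rstrip('\n').split('\t')
            graph[int(src)] = [int(t) for t in targets.split(',')]
    return graph

def load_pr(path='pr0'):
    """Map page id -> PageRank value, one tab-separated pair per line."""
    with open(path) as f:
        return {int(k): float(v) for k, v
                in (line.rstrip('\n').split('\t') for line in f)}

if __name__ == '__main__':
    urls, graph, pr = load_keyvalue(), load_transition(), load_pr()
    print(len(urls), 'pages,', sum(len(v) for v in graph.values()), 'links')
```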
Later, we need to put `transition` and `pr0` into `hdfs` and use `hadoop` to calculate PageRank.
We use `hadoop` to implement the PageRank algorithm.
- compile and package the `hadoop` project with Maven

  ```
  cd hadoop
  mvn package -DskipTests
  ```
- put `transition` and `pr0` produced by the spider into `hdfs`

  ```
  hdfs dfs -mkdir -p pagerank/transition/
  hdfs dfs -put ../scrapy/transition pagerank/transition/
  hdfs dfs -mkdir -p pagerank/pagerank/pr0
  hdfs dfs -put ../scrapy/pr0 pagerank/pagerank/pr0/
  hdfs dfs -mkdir -p pagerank/keyvalue/
  hdfs dfs -put ../scrapy/keyvalue pagerank/keyvalue/
  hdfs dfs -mkdir -p pagerank/cache
  ```
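  Before uploading, a quick local consistency check can save a failed job: every target id in `transition` should have an entry in `pr0`, and `pr0` and `keyvalue` should cover the same ids. A sketch (not part of the repo):

  ```python
  # check_dataset.py -- optional pre-flight check before `hdfs dfs -put`.

  def ids_in(path):
      """Collect the ids from the first tab-separated column of a file."""
      with open(path) as f:
          return {int(line.split('\t', 1)[0]) for line in f}

  keyvalue_ids = ids_in('keyvalue')
  pr0_ids = ids_in('pr0')

  targets = set()
  with open('transition') as f:
      for line in f:
          _, dsts = line.rstrip('\n').split('\t')
          targets.update(int(t) for t in dsts.split(','))

  assert pr0_ids == keyvalue_ids, 'pr0 and keyvalue disagree on page ids'
  assert targets <= pr0_ids, 'transition points to ids missing from pr0'
  print('dataset looks consistent:', len(pr0_ids), 'pages')
  ```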
- deal with MySQL
  - create the MySQL database

    ```
    mysql -uroot -p

    create database PageRank;
    use PageRank;
    create table output(page_id VARCHAR(250), page_url VARCHAR(250), page_rank DOUBLE);
    GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
    FLUSH PRIVILEGES;
    ```
  - download the connector and put it into HDFS

    ```
    hdfs dfs -mkdir /mysql
    hdfs dfs -put mysql-connector-java-*.jar /mysql/
    ```
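  Once the job below has written its results into `output`, you can inspect the top-ranked pages from Python, for example with `pymysql` (one client option among many; the connection parameters are placeholders matching the grant above):

  ```python
  # top_pages.py -- read back results (pip install pymysql).
  import pymysql

  conn = pymysql.connect(host='localhost', user='root',
                         password='password', database='PageRank')
  try:
      with conn.cursor() as cur:
          # `output` is the table created above.
          cur.execute('SELECT page_id, page_url, page_rank '
                      'FROM output ORDER BY page_rank DESC LIMIT 10')
          for page_id, url, rank in cur.fetchall():
              print(f'{rank:.4f}\t{page_id}\t{url}')
  finally:
      conn.close()
  ```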
- run the `hadoop` project

  ```
  hadoop jar page-rank.jar Driver pagerank/transition pagerank/pagerank/pr pagerank/keyvalue pagerank/cache/unit 10
  ```

  - `Driver` is the entry class
  - `pagerank/transition` is the dir of `transition` in `hdfs`
  - `pagerank/pagerank/pr` is the base dir of `pr0` in `hdfs`
  - `pagerank/keyvalue` is the dir of `keyvalue` in `hdfs`
  - `pagerank/cache/unit` is the dir of intermediate results
  - `10` is the number of iterations
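For reference, each of those 10 iterations performs one PageRank update over the `transition` and `pr` data. A plain-Python sketch of a single step (no Hadoop; whether the repo's job applies a damping factor is not stated, so `beta` defaults to the bare matrix multiplication, with 0.85 being the value commonly used in practice):

```python
# pagerank_step.py -- what one iteration computes, conceptually.

def pagerank_step(graph, pr, beta=1.0):
    """One update: each page splits its rank evenly among its targets."""
    new_pr = {page: 0.0 for page in pr}
    for src, targets in graph.items():
        share = pr[src] / len(targets)   # contribution per outgoing link
        for dst in targets:
            new_pr[dst] += share
    if beta != 1.0:                      # optional damping, e.g. beta=0.85
        n = len(pr)
        new_pr = {p: beta * v + (1 - beta) / n for p, v in new_pr.items()}
    return new_pr

# Toy example in the `transition`/`pr0` formats from above:
graph = {0: [1, 2], 1: [0], 2: [0, 1]}
pr = {0: 1.0, 1: 1.0, 2: 1.0}
for _ in range(10):                      # the `10` passed to Driver plays this role
    pr = pagerank_step(graph, pr)
print(pr)
```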