- use `scrapy` to collect page relationship information and build the PageRank dataset
- use `hadoop` and the dataset collected by `scrapy` to implement the PageRank algorithm
We use `scrapy` to collect the PageRank dataset. The related code is located in the `scrapy/` directory.
- install `scrapy` first

  ```
  pip install scrapy
  ```
- run `scrapy` inside the `scrapy/` directory

  ```
  cd scrapy
  scrapy crawl pagerank
  ```
- change `start_urls` and `allowed_domains` (optional)

  The default is `start_urls = ['/~https://github.com/']`; we can change it with:

  ```
  scrapy crawl pagerank -a start='https://stackoverflow.com/'
  ```

  We can also restrict the crawl to a single domain:

  ```
  scrapy crawl pagerank -a start='https://stackoverflow.com/' -a domain='stackoverflow.com'
  ```

  Be careful: `domain` should match `start`; if you don't set `start`, it should match the default `start_urls`. A minimal spider sketch is shown after this list.
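For orientation, here is a minimal sketch of what such a spider might look like. It is an assumption, not the code from `scrapy/`: only the `pagerank` spider name, the default start URL, and the `-a start` / `-a domain` arguments come from the steps above; the class name and link handling are illustrative.

```python
# pagerank_spider_sketch.py -- illustrative only; the real spider in
# scrapy/ may be organized differently.
import scrapy


class PageRankSpider(scrapy.Spider):
    name = 'pagerank'                        # matches `scrapy crawl pagerank`
    start_urls = ['/~https://github.com/']      # the documented default

    def __init__(self, start=None, domain=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if start:                            # `-a start=...` overrides the default
            self.start_urls = [start]
        if domain:                           # `-a domain=...` restricts the crawl
            self.allowed_domains = [domain]

    def parse(self, response):
        # Record every outgoing link; this link graph is what the
        # `transition` file is built from.
        for href in response.css('a::attr(href)').getall():
            yield {'src': response.url, 'dst': response.urljoin(href)}
            yield response.follow(href, callback=self.parse)
```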
The data collected by the spider will be stored into `keyvalue`, `transition` and `pr0` respectively.

`keyvalue` records each URL and its key, separated by `'\t'`:

```
0\t/~https://github.com/\n
1\t/~https://github.com/features\n
2\t/~https://github.com/business\n
3\t/~https://github.com/explore\n
...
```
`transition` records the relationships between pages:

```
0\t0,1,2,3,4,5,6,7,8,9\n
...
```

Here the page with id 0 points to the pages with ids 0,1,2,3,4,5,6,7,8,9; the source id and the target list are separated by `'\t'`, and the target ids are separated by `','`.
`pr0` holds the initial PageRank value for each page:

```
0\t1\n
1\t1\n
2\t1\n
...
```
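To make these formats concrete, here is a small sketch that loads the three files into Python dictionaries. The file names come from above; the parsing code itself is illustrative, not part of the repo.

```python
# load_dataset.py -- sketch for reading the spider's output files.

def load_keyvalue(path='keyvalue'):
    """Map page id -> url, one tab-separated pair per line."""
    with open(path) as f:
        return {int(k): url for k, url
                in (line.rstrip('\n').split('\t') for line in f)}

def load_transition(path='transition'):
    """Map page id -> list of target ids, targets comma-separated."""
    graph = {}
    with open(path) as f:
        for line in f:
            src, targets = line.rstrip('\n').split('\t')
            graph[int(src)] = [int(t) for t in targets.split(',')]
    return graph

def load_pr(path='pr0'):
    """Map page id -> PageRank value, one tab-separated pair per line."""
    with open(path) as f:
        return {int(k): float(v) for k, v
                in (line.rstrip('\n').split('\t') for line in f)}

if __name__ == '__main__':
    urls, graph, pr = load_keyvalue(), load_transition(), load_pr()
    print(len(urls), 'pages,', sum(len(v) for v in graph.values()), 'links')
```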
Later, we need to put `transition` and `pr0` into `hdfs` and use `hadoop` to calculate PageRank.
We use `hadoop` to implement the PageRank algorithm.
- compile and package the `hadoop` project with Maven

  ```
  cd hadoop
  mvn package -DskipTests
  ```
- put `transition` and `pr0` produced by the spider into `hdfs`

  ```
  hdfs dfs -mkdir -p pagerank/transition/
  hdfs dfs -put ../scrapy/transition pagerank/transition/
  hdfs dfs -mkdir -p pagerank/pagerank/pr0
  hdfs dfs -put ../scrapy/pr0 pagerank/pagerank/pr0/
  hdfs dfs -mkdir -p pagerank/keyvalue/
  hdfs dfs -put ../scrapy/keyvalue pagerank/keyvalue/
  hdfs dfs -mkdir -p pagerank/cache
  ```
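  Before uploading, a quick local consistency check can save a failed job: every target id in `transition` should have an entry in `pr0`, and `pr0` and `keyvalue` should cover the same ids. A sketch (not part of the repo):

  ```python
  # check_dataset.py -- optional pre-flight check before `hdfs dfs -put`.

  def ids_in(path):
      """Collect the ids from the first tab-separated column of a file."""
      with open(path) as f:
          return {int(line.split('\t', 1)[0]) for line in f}

  keyvalue_ids = ids_in('keyvalue')
  pr0_ids = ids_in('pr0')

  targets = set()
  with open('transition') as f:
      for line in f:
          _, dsts = line.rstrip('\n').split('\t')
          targets.update(int(t) for t in dsts.split(','))

  assert pr0_ids == keyvalue_ids, 'pr0 and keyvalue disagree on page ids'
  assert targets <= pr0_ids, 'transition points to ids missing from pr0'
  print('dataset looks consistent:', len(pr0_ids), 'pages')
  ```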
- deal with MySQL
  - create the MySQL database

    ```
    mysql -uroot -p

    create database PageRank;
    use PageRank;
    create table output(page_id VARCHAR(250), page_url VARCHAR(250), page_rank DOUBLE);
    GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
    FLUSH PRIVILEGES;
    ```
  - download the connector and put it into HDFS

    ```
    hdfs dfs -mkdir /mysql
    hdfs dfs -put mysql-connector-java-*.jar /mysql/
    ```
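  Once the job below has written its results into `output`, you can inspect the top-ranked pages from Python, for example with `pymysql` (one client option among many; the connection parameters are placeholders matching the grant above):

  ```python
  # top_pages.py -- read back results (pip install pymysql).
  import pymysql

  conn = pymysql.connect(host='localhost', user='root',
                         password='password', database='PageRank')
  try:
      with conn.cursor() as cur:
          # `output` is the table created above.
          cur.execute('SELECT page_id, page_url, page_rank '
                      'FROM output ORDER BY page_rank DESC LIMIT 10')
          for page_id, url, rank in cur.fetchall():
              print(f'{rank:.4f}\t{page_id}\t{url}')
  finally:
      conn.close()
  ```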
- run the `hadoop` project

  ```
  hadoop jar page-rank.jar Driver pagerank/transition pagerank/pagerank/pr pagerank/keyvalue pagerank/cache/unit 10
  ```

  - `Driver` is the entry class
  - `pagerank/transition` is the dir of `transition` in `hdfs`
  - `pagerank/pagerank/pr` is the base dir of `pr0` in `hdfs`
  - `pagerank/keyvalue` is the dir of `keyvalue` in `hdfs`
  - `pagerank/cache/unit` is the dir of intermediate results
  - `10` is the number of iterations
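For reference, each of those 10 iterations performs one PageRank update over the `transition` and `pr` data. A plain-Python sketch of a single step (no Hadoop; whether the repo's job applies a damping factor is not stated, so `beta` defaults to the bare matrix multiplication, with 0.85 being the value commonly used in practice):

```python
# pagerank_step.py -- what one iteration computes, conceptually.

def pagerank_step(graph, pr, beta=1.0):
    """One update: each page splits its rank evenly among its targets."""
    new_pr = {page: 0.0 for page in pr}
    for src, targets in graph.items():
        share = pr[src] / len(targets)   # contribution per outgoing link
        for dst in targets:
            new_pr[dst] += share
    if beta != 1.0:                      # optional damping, e.g. beta=0.85
        n = len(pr)
        new_pr = {p: beta * v + (1 - beta) / n for p, v in new_pr.items()}
    return new_pr

# Toy example in the `transition`/`pr0` formats from above:
graph = {0: [1, 2], 1: [0], 2: [0, 1]}
pr = {0: 1.0, 1: 1.0, 2: 1.0}
for _ in range(10):                      # the `10` passed to Driver plays this role
    pr = pagerank_step(graph, pr)
print(pr)
```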