At least three containers are running at any given time: a RabbitMQ message broker, a MongoDB data store, and a task worker. The crawler uses Celery, a distributed task queue library, whose benefits become more apparent when you scale out to multiple workers.
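As a rough illustration of how these pieces fit together, here is a minimal sketch of a Celery worker wired to a RabbitMQ broker and a MongoDB store. The module name, task name, connection URLs, and database/collection names are assumptions for illustration, not this project's actual code.

```python
# tasks.py -- minimal sketch only; broker/Mongo URLs and names are assumptions.
from celery import Celery
import pymongo
import requests

app = Celery("crawler", broker="amqp://guest:guest@rabbitmq:5672//")
mongo = pymongo.MongoClient("mongodb://mongodb:27017/")
collection = mongo["crawler"]["pages"]

@app.task
def crawl(url):
    """Fetch a single URL and store the response in MongoDB."""
    response = requests.get(url, timeout=10)
    collection.insert_one({
        "url": url,
        "status": response.status_code,
        "body": response.text,
    })
```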
To get started, run the following command:

```sh
make start_containers
```
Before starting the crawl, you may wish to increase the number of worker containers, which defaults to one:

```sh
docker-compose scale worker=[NUMBER_OF_WORKERS]
```
Once things are up and running, you can start a crawl of all the URLs listed in the crawlable_urls.txt file:

```sh
make start_crawl
```
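Kicking off the crawl is essentially a matter of reading crawlable_urls.txt and enqueuing one task per URL. The sketch below assumes the hypothetical `crawl` task from the earlier example; it is not the project's actual start_crawl target.

```python
# start_crawl.py -- illustrative sketch; assumes the hypothetical crawl task above.
from tasks import crawl

with open("crawlable_urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            # .delay() enqueues the task on RabbitMQ for any available worker.
            crawl.delay(url)
```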
The results of the crawl are saved into MongoDB. You can connect to it from your host at localhost:27018.
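For example, a quick way to inspect the stored results from the host is with pymongo. The database and collection names used here ("crawler", "pages") are assumptions; check the worker code for the actual ones.

```python
# inspect_results.py -- sketch for poking at crawl results from the host.
# Database and collection names are assumptions, not the project's actual schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27018/")
collection = client["crawler"]["pages"]

print(collection.count_documents({}))   # how many pages have been crawled so far
for doc in collection.find().limit(5):  # peek at a few stored documents
    print(doc.get("url"), doc.get("status"))
```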
- Tony Wang and the article 'How to build a scaleable crawler...', on which this code base is originally based.
- DomCop for the link to the Open PageRank data used in lieu of the Alexa top million.