At least three containers are running at any given time: a RabbitMQ message broker, a MongoDB data store, and a task worker. The crawler uses Celery, a distributed task queue library, whose benefits become more apparent when you scale out to multiple workers.
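As a rough illustration of how these pieces fit together, here is a minimal sketch of a Celery worker wired to a RabbitMQ broker and a MongoDB store. The module name, task name, connection URLs, and database/collection names are assumptions for illustration, not this project's actual code.

```python
# tasks.py -- minimal sketch only; broker/Mongo URLs and names are assumptions.
from celery import Celery
import pymongo
import requests

app = Celery("crawler", broker="amqp://guest:guest@rabbitmq:5672//")
mongo = pymongo.MongoClient("mongodb://mongodb:27017/")
collection = mongo["crawler"]["pages"]

@app.task
def crawl(url):
    """Fetch a single URL and store the response in MongoDB."""
    response = requests.get(url, timeout=10)
    collection.insert_one({
        "url": url,
        "status": response.status_code,
        "body": response.text,
    })
```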
To get started, run the following command:

```sh
make start_containers
```
Before starting the crawl, you may wish to increase the number of worker containers, which defaults to one:

```sh
docker-compose scale worker=[NUMBER_OF_WORKERS]
```
Once things are up and running, you can start a crawl of all the URLs listed in the crawlable_urls.txt file:

```sh
make start_crawl
```
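Kicking off the crawl is essentially a matter of reading crawlable_urls.txt and enqueuing one task per URL. The sketch below assumes the hypothetical `crawl` task from the earlier example; it is not the project's actual start_crawl target.

```python
# start_crawl.py -- illustrative sketch; assumes the hypothetical crawl task above.
from tasks import crawl

with open("crawlable_urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            # .delay() enqueues the task on RabbitMQ for any available worker.
            crawl.delay(url)
```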
The results of the crawl are saved into MongoDB. You can connect to it from your host at localhost:27018.
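For example, a quick way to inspect the stored results from the host is with pymongo. The database and collection names used here ("crawler", "pages") are assumptions; check the worker code for the actual ones.

```python
# inspect_results.py -- sketch for poking at crawl results from the host.
# Database and collection names are assumptions, not the project's actual schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27018/")
collection = client["crawler"]["pages"]

print(collection.count_documents({}))   # how many pages have been crawled so far
for doc in collection.find().limit(5):  # peek at a few stored documents
    print(doc.get("url"), doc.get("status"))
```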
- Tony Wang and the article 'How to build a scaleable crawler...', on which this code base is originally based.
- DomCop for the link to the Open PageRank data used in lieu of the Alexa top million.