A simple Python script to extract URL endpoints from a website. It is one of a number of tools in my arsenal for helping migrate websites between platforms, with little to no SEO penalty.
This script will extract all URLs from a website and log them to a CSV file, making it easier to plan the URL structure for your new website.
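Because the export is a simple list of URLs in CSV form, it feeds naturally into redirect planning. As a purely hypothetical illustration (the `./data/urls.csv` path and one-URL-per-row layout are assumptions, not something this repo guarantees), you could read the export back in and seed a 301-redirect map like so:

```python
import csv

# Assumed output location and layout: one URL per row in ./data/urls.csv.
with open("./data/urls.csv", newline="") as fh:
    old_urls = [row[0] for row in csv.reader(fh) if row]

# Pair each old URL with its future home; fill the values in by hand
# (or programmatically) while planning the new site's URL structure.
redirect_map = {old: "" for old in old_urls}
for old, new in redirect_map.items():
    print(f"{old} -> {new or 'TODO'}")
```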
- Clone the repo to your local machine: `git clone git@github.com:danmenzies/url-logger.git`
- Create a virtual environment: `python3 -m venv venv`
- Activate the virtual environment: `source venv/bin/activate`
- Install the requirements: `pip install -r requirements.txt`
Once either of these scripts has finished, see the `./data` directory for the CSV file containing the URLs.
Crawler:
- Open the `./scripts/` directory
- Run the script: `python manual-crawl.py`
- Enter the domain (or subdomain) of the website you want to extract URLs from
- Grab a coffee; this may take a while...
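For context, here is a minimal sketch of what a crawler like this typically does: a breadth-first walk of same-domain links using requests and BeautifulSoup, logging everything it finds to a CSV. The function name, page limit, and output path below are illustrative assumptions; the real `manual-crawl.py` may differ in detail:

```python
import csv
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(domain, limit=500):
    """Breadth-first crawl of one (sub)domain; returns the set of URLs found."""
    start = f"https://{domain}/"
    seen, pending = {start}, deque([start])
    while pending and len(seen) < limit:
        url = pending.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that time out or refuse the connection
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup.find_all("a", href=True):
            link = urljoin(url, tag["href"]).split("#")[0]  # resolve relative links, drop fragments
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)  # stay on the entered (sub)domain only
                pending.append(link)
    return seen

if __name__ == "__main__":
    domain = input("Domain to crawl: ").strip()
    with open("./data/urls.csv", "w", newline="") as fh:  # output path assumed
        csv.writer(fh).writerows([u] for u in sorted(crawl(domain)))
```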
Sitemap Grabber:
- Open the `./scripts/` directory
- Run the script: `python convert-sitemap.py`
- Enter the domain (or subdomain) of the website you want to extract URLs from
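The sitemap route is much faster when the site publishes one, since no crawling is involved. Here is a minimal sketch of the idea, assuming a standard `sitemap.xml` at the domain root; the real `convert-sitemap.py` may also handle sitemap indexes, gzipped sitemaps, or other locations:

```python
import csv
import xml.etree.ElementTree as ET

import requests

def sitemap_urls(domain):
    """Fetch https://<domain>/sitemap.xml and return every <loc> entry."""
    resp = requests.get(f"https://{domain}/sitemap.xml", timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Match <loc> regardless of the sitemap's XML namespace prefix.
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]

if __name__ == "__main__":
    domain = input("Domain to read the sitemap from: ").strip()
    with open("./data/sitemap-urls.csv", "w", newline="") as fh:  # output path assumed
        csv.writer(fh).writerows([u] for u in sitemap_urls(domain))
```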
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. You are also welcome to fork the repo and make your own changes if you prefer, and I welcome any requests to merge back into the main branch.
Feedback is also welcome; if you have any suggestions for improvements, please open an issue.
Please only scrape sites you have been authorised to scrape! I take no responsibility for any misuse of this script.