A UK whitepages alternative which leverages the BT Phonebook and ripgrep to quickly parse through PDFs to find data on people, phone numbers, businesses, and addresses.
This project downloads the BT Phonebook PDFs, indexes the data contained within them, and then allows you to quickly search for records (such as names, addresses, or phone numbers) using ripgrep for high performance.
- scraper.py
  Scrapes a predefined website for PDF links (sourced from the BT Phonebook) and downloads them into the pdfs/ directory.
- parser.py
  Checks whether the pdfs/ directory contains any PDFs. If not, it automatically runs scraper.py to download them. It then indexes the PDFs by extracting and parsing their text (using a phone number pattern as a delimiter) and provides an interactive search prompt powered by ripgrep.
- requirements.txt
  Lists the Python packages required for this project.
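The link-collection step described for scraper.py can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it uses the standard library's html.parser as a dependency-free stand-in for BeautifulSoup, and the names PdfLinkParser and extract_pdf_links, the sample HTML, and the example.com URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects the targets of <a> tags whose href points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return absolute URLs for every PDF link found in the HTML text."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content and URL, for illustration only.
sample_html = '<a href="/files/area1.pdf">Area 1</a> <a href="about.html">About</a>'
links = extract_pdf_links(sample_html, "https://example.com/phonebooks/")
print(links)
```

In a real run the HTML would come from an HTTP request to the page named by base_url, and each collected link would then be downloaded into pdfs/.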
- Python 3.6+
- pdftotext (optional but recommended for fast PDF text extraction)
  Refer to the pdftotext documentation or install via your package manager.
- ripgrep (rg) (for fast, memory-efficient searching)
  Visit ripgrep on GitHub for installation instructions and ensure it is in your system's PATH.
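One way to confirm that both external tools are reachable before running anything is to look them up on PATH from Python. The helper name find_tools is hypothetical and is not part of the project's scripts.

```python
import shutil


def find_tools(names=("pdftotext", "rg")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools()
for name, path in tools.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

A value of None for rg means searching will not work until ripgrep is installed; a value of None for pdftotext only means the faster extraction route is unavailable.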
- Clone the repository or download the scripts (scraper.py, parser.py, and requirements.txt) into the same directory.
- Install Python dependencies using pip:
  pip install -r requirements.txt
- Ensure External Tools are Installed:
  - Install pdftotext.
  - Install ripgrep and ensure it is available in your system's PATH.
- Run the Search Script:
  python parser.py
- Script Behavior:
  - The script checks if the pdfs directory exists and contains any PDF files.
  - If no PDFs are found, it automatically runs scraper.py to download PDFs from the BT Phonebook.
  - Once the PDFs are available, the script indexes them (creating records_index.txt) and then prompts you for a search query.
  - Enter a search term (e.g., a name, address fragment, or phone number) to see matching records, which are retrieved using ripgrep.
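The behavior described above (check for PDFs, then search the index with ripgrep) might look roughly like this. It is a sketch under assumptions, not the code in parser.py: the function names has_pdfs and search_index are hypothetical, the index is assumed to hold one record per line, and the sample records are made up. A plain Python scan is included as a fallback so the example still runs where rg is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def has_pdfs(pdf_dir="pdfs"):
    """True if the directory exists and contains at least one .pdf file."""
    d = Path(pdf_dir)
    return d.is_dir() and any(d.glob("*.pdf"))


def search_index(query, index_path="records_index.txt"):
    """Return index lines containing the query, matched case-insensitively."""
    rg = shutil.which("rg")
    if rg is None:
        # Fallback for this sketch only: a simple line-by-line scan.
        needle = query.lower()
        with open(index_path, encoding="utf-8") as fh:
            return [line.rstrip("\n") for line in fh if needle in line.lower()]
    # --fixed-strings treats the query as literal text rather than a regex;
    # "--" stops option parsing so a query starting with "-" is safe.
    result = subprocess.run(
        [rg, "--ignore-case", "--fixed-strings", "--no-line-number",
         "--", query, str(index_path)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.splitlines()


# Demo with a tiny, made-up index written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "records_index.txt"
    index.write_text(
        "Smith J, 1 High St, 01234 567890\n"
        "Jones A, 2 Low Rd, 01987 654321\n",
        encoding="utf-8",
    )
    matches = search_index("smith", index)
    pdfs_present = has_pdfs(Path(tmp) / "pdfs")
    print(matches, pdfs_present)
```

ripgrep exits with status 1 when nothing matches, so the sketch reads stdout rather than treating a non-zero exit code as an error.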
- Change the Scraping URL:
  To modify the URL from which PDFs are scraped, edit the base_url variable in scraper.py.
- Modify Record Parsing:
  Adjust the regular expression in the parse_records function in parser.py if your PDF data format changes.
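To show what adjusting the record-parsing expression involves, here is a hypothetical sketch of splitting extracted text on a phone number pattern. The regular expression, the assumption that each record ends with its phone number, and the sample text are all illustrative; the real parse_records in parser.py may differ.

```python
import re

# Assumed pattern: optional brackets around an area code starting with 0,
# then two digit groups. Adjust this to match your PDF data.
PHONE_RE = re.compile(r"\(?0\d{2,4}\)?\s?\d{3,4}\s?\d{3,4}")


def parse_records(text, phone_re=PHONE_RE):
    """Split text into records, treating each phone number as a record's end."""
    records = []
    start = 0
    for match in phone_re.finditer(text):
        # Collapse line breaks and repeated spaces inside the record.
        record = " ".join(text[start:match.end()].split())
        if record:
            records.append(record)
        start = match.end()
    return records


# Made-up sample text, for illustration only.
sample = "Smith J 1 High St\nAnytown 01234 567890 Jones A 2 Low Rd (020) 7946 0000"
records = parse_records(sample)
for record in records:
    print(record)
```

If the PDFs place the phone number at the start of each record instead, the slicing would need to run from one match's start to the next match's start.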
- PyPDF2 for PDF text extraction.
- requests and BeautifulSoup for web scraping.
- ripgrep for fast and efficient searching.
- pdftotext for efficient PDF text extraction.
- AI for filling in my knowledge gaps.