Determine the prevalence and applicability of Benford's Law by applying it to public datasets.
Benford's Law is a mathematical phenomenon where the first digit of numbers in many real-world datasets does not occur with equal frequency. The number
Due to its reliability, it is often used as a method of fraud detection.
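
For reference, the expected first-digit frequencies can be computed directly from the standard statement of Benford's Law, $P(d) = \log_{10}(1 + 1/d)$ for $d = 1, \dots, 9$. The sketch below is illustrative only; the function name `benford_expected` is hypothetical and not taken from this repository.

```python
import math


def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's Law."""
    # P(d) = log10(1 + 1/d); the nine probabilities sum to 1.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, p in benford_expected().items():
        print(f"{digit}: {p:.3f}")
```

Under this formula a leading 1 is expected about 30.1% of the time, while a leading 9 is expected only about 4.6% of the time.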
- Download Spark
- Download a compatible version of Java
- Set `SPARK_HOME` and `JAVA_HOME` environment variables to their respective paths
- Run `python3 -m venv env` and `source env/bin/activate` to start the virtual env
- `pip install -r requirements.txt`
- Fetch the data using the `fetch_data_scripts/<script.sh>` scripts.
  - You may need to make the script executable with `chmod +x fetch_data_scripts/<script.sh>`
- `python src/main.py`
I checked several different datasets to see whether they followed Benford's Law. I logged some useful metrics (in the `output.txt` file), such as the number of orders of magnitude present in the dataset, the sample size, and the results of a chi-squared test, including the p-value.
- Orders of magnitude: It's generally accepted that datasets spanning several orders of magnitude are more likely to follow Benford's Law.
- Sample size: A larger sample is more likely to cover a wide range of values, so larger datasets are more likely to follow Benford's Law.
- p-value: Here is a brief refresher on p-values:
  - The p-value is the probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true. A small p-value (conventionally below $0.05$) is evidence against the null hypothesis.
  - In this study, the null hypothesis is this: "The first digits follow Benford's distribution." This is the hypothesis that a chi-squared test against Benford's expected frequencies evaluates.
  - This makes the alternative hypothesis this: "The first digits do not follow Benford's distribution."
- I used a chi-squared test, which compares the observed distribution of first digits in a dataset to the expected distribution according to Benford's Law, providing a statistical measure of how well the data conforms to the law.
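
As a rough illustration of the chi-squared comparison described above, here is a standard-library sketch. It is not the code in `src/main.py`; the names `first_digit` and `benford_chi_squared` are hypothetical, and it assumes the inputs are integers. The expected counts come from Benford's probabilities, so a large p-value means the observed digits are consistent with them. The p-value uses the closed-form survival function that the chi-squared distribution has for an even number of degrees of freedom (here 8, i.e. nine digits minus one).

```python
import math
from collections import Counter


def first_digit(n):
    """Leading decimal digit of a nonzero integer."""
    return int(str(abs(int(n)))[0])


def benford_chi_squared(values):
    """Chi-squared goodness-of-fit of leading digits against Benford's Law.

    Returns (statistic, p_value) using 8 degrees of freedom.
    """
    digits = [first_digit(v) for v in values if int(v) != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one nonzero value")
    observed = Counter(digits)

    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford's expected count
        stat += (observed.get(d, 0) - expected) ** 2 / expected

    # Survival function for 8 degrees of freedom:
    # Q(x) = exp(-x/2) * sum_{i=0}^{3} (x/2)^i / i!
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))
    return stat, p_value


if __name__ == "__main__":
    # Powers of 2 are a classic sequence whose leading digits track Benford's Law.
    stat, p = benford_chi_squared([2 ** k for k in range(1, 200)])
    print(f"chi2={stat:.3f}, p={p:.3f}")
```

If SciPy is available, `scipy.stats.chisquare` with explicit expected frequencies should give the same statistic and p-value.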
I had the best results looking at populations. Consider these 2 charts:
Note how the "Top 1,000 US Cities" chart does not seem to follow Benford's Law, whereas the "Illinois Data" chart appears to follow the distribution. There are a couple of reasons for that:
- The 1,000 cities only span one order of magnitude, $[10^5,10^6]$, unlike the Illinois data, which spans 7, $[10^0,10^6]$.
- The p-value for the Illinois data was $0.735$, which means we do not have enough evidence to reject the null hypothesis. Failing to reject is not proof that the null hypothesis is true; the alternative hypothesis may still hold, but we would need more data to make that decision.
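
One way to count the orders of magnitude a dataset covers is to count the decades (powers of ten) between its smallest and largest values. The sketch below assumes positive integer values such as populations; the function name `orders_of_magnitude` is hypothetical, the counting convention is an assumption rather than the one confirmed to be in `src/main.py`, and the sample values are made up for illustration.

```python
def orders_of_magnitude(values):
    """Count the decades spanned by the positive integers in `values`.

    A value with k decimal digits lies in the decade [10^(k-1), 10^k).
    """
    lengths = [len(str(int(v))) for v in values if int(v) > 0]
    if not lengths:
        raise ValueError("need at least one positive value")
    return max(lengths) - min(lengths) + 1


if __name__ == "__main__":
    # Illustrative values only, not taken from the real datasets.
    print(orders_of_magnitude([100_000, 450_000, 999_999]))  # one decade
    print(orders_of_magnitude([7, 1_200, 85_000, 2_700_000]))  # seven decades
```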
I also ran several iterations with different sets of prime numbers to see if there was some value that would lead to a Benford distribution. Their first digits are more or less evenly distributed. The reason the values drop off at 8 and 9 is that the last number in the set is
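
The prime experiment can be reproduced in miniature with a sieve. This is a sketch under assumptions: `primes_below`, `first_digit_counts`, and the `limit` parameter are hypothetical names, and the cutoffs are arbitrary rather than the ones used in the runs described above. Changing `limit` shows how the cutoff of the set shapes the first-digit counts.

```python
import math
from collections import Counter


def primes_below(limit):
    """Sieve of Eratosthenes: all primes strictly below `limit`."""
    if limit < 2:
        return []
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(limit) + 1):
        if sieve[i]:
            # Mark every multiple of i, starting at i*i, as composite.
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def first_digit_counts(numbers):
    """Count the leading decimal digit of each positive integer."""
    return Counter(int(str(n)[0]) for n in numbers)


if __name__ == "__main__":
    for limit in (1_000, 10_000, 80_000):
        counts = first_digit_counts(primes_below(limit))
        print(limit, [counts[d] for d in range(1, 10)])
```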
I wanted to fact-check one of my favorite shows. On the off chance someone hasn't seen it, I hid the spoiler in a disclosure tag here:
Breaking Bad Spoiler!
In Breaking Bad, Saul Goodman suggests laundering Walter White's money with a botnet. By setting up a public donations page, Walter could funnel his money into a computer program that simulates anonymous donations coming in from all over the world. Here's the question: do public donations follow Benford's Law? If they do, then the IRS could use that to identify fraudulent activity. For Walter and Saul, the result is inconclusive: the p-value from the chi-squared test applied to this dataset suggests that there is insufficient evidence to reject the null hypothesis.

Here is Benford's distribution for the Online Payment data as seen on Kaggle. As with the Illinois population data, the visualization appears to support Benford's distribution, and the chi-squared test again does not provide enough evidence to reject the null hypothesis.