Benford's Law Analysis

Goal

Determine the prevalence and applicability of Benford's Law by applying it to public datasets.

What is Benford's Law

Benford's Law is a mathematical phenomenon where the first digit of numbers in many real-world datasets does not occur with equal frequency. The number $1$ appears most often, followed by $2$, and so on, with $9$ being the least frequent.

Because fabricated numbers rarely reproduce this pattern, it is often used as a screening method for fraud detection.
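
The first-digit frequencies described above come from the formula $P(d) = \log_{10}(1 + 1/d)$ for $d = 1, \dots, 9$. The following is a minimal sketch of how they can be computed and compared against a dataset. The function names are hypothetical and are not taken from this repository's src directory, and the sketch assumes finite numeric inputs.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the leading nonzero digit of a finite number, or None for zero."""
    # Decimal stores only significant digits, so leading zeros
    # (as in 0.003) and the sign are dropped automatically.
    digits = Decimal(str(value)).as_tuple().digits
    if not digits or digits[0] == 0:
        return None
    return digits[0]


def first_digit_frequencies(values):
    """Observed share of each first digit 1..9 in an iterable of numbers."""
    digits = [d for d in map(first_digit, values) if d is not None]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, prob in benford_expected().items():
        print(digit, round(prob, 3))
```

Comparing the output of first_digit_frequencies against benford_expected gives the same kind of side-by-side view as the charts in the results section.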

Quick Setup

Download Spark
Download a compatible version of Java
Set the SPARK_HOME and JAVA_HOME environment variables to their respective install paths
Create and activate a virtual environment: python3 -m venv env, then source env/bin/activate
Install the dependencies: pip install -r requirements.txt

How to run

Fetch the data using the fetch_data_scripts/<script.sh> scripts.
- You may need to make the script executable with chmod +x fetch_data_scripts/<script.sh>
Run the analysis with python src/main.py

Results and Analysis

I checked several different datasets to see if they followed Benford's Law. I logged some useful metrics (in the output.txt file) such as the number of orders of magnitude present in the dataset, sample size, and results from a chi-squared test, including p-value.

Orders of magnitude: It is generally accepted that datasets spanning several orders of magnitude are more likely to follow Benford's Law.
Sample size: Larger samples are more likely to cover a wide range of values, so larger datasets are more likely to follow Benford's Law.
p-value: Here is a brief refresher on p-values:
- The p-value is the probability of seeing a deviation at least as large as the one observed if the null hypothesis were true. A small p-value (conventionally below $0.05$) is evidence against the null hypothesis.
- I used a chi-squared test, which compares the observed distribution of first digits in a dataset to the distribution expected under Benford's Law.
- In this study, the null hypothesis is therefore: "The first digits follow the Benford distribution."
- This makes the alternative hypothesis: "The first digits do not follow the Benford distribution."
- A large p-value means the test found no significant deviation from Benford's Law. It does not prove that the data follow the law.
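
The chi-squared comparison against Benford's expected counts can be sketched as follows. This is an illustration under stated assumptions, not the code in src/main.py: the function names are hypothetical, the input is assumed to be a list of nine counts for digits 1 through 9, and the p-value uses the closed-form survival function that exists for 8 degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf_8df(statistic):
    """P(X >= statistic) for a chi-squared variable with 8 degrees of freedom.

    For an even number of degrees of freedom 2k, the survival function is
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!. Here k = 4.
    """
    half = statistic / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))


def benford_chi_squared(observed_counts):
    """Chi-squared statistic and p-value for nine first-digit counts."""
    if len(observed_counts) != 9:
        raise ValueError("expected nine counts, one per digit 1..9")
    n = sum(observed_counts)
    # Expected count for digit d under Benford's Law.
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))
    # Nine categories with no fitted parameters leave 8 degrees of freedom.
    return statistic, chi2_sf_8df(statistic)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the two extremes.
    benford_like = [301, 176, 125, 97, 79, 67, 58, 51, 46]
    uniform = [1000] * 9
    print(benford_chi_squared(benford_like))
    print(benford_chi_squared(uniform))
```

If SciPy is available, scipy.stats.chisquare(f_obs, f_exp) performs the same test for any number of categories.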

Populations

I had the best results looking at populations. Consider these 2 charts:

Note how the "Top 1,000 US Cities" data does not seem to follow Benford's Law, whereas the "Illinois Data" appears to follow the distribution. There are a couple of reasons for that:

The 1,000 cities only span one order of magnitude, $[10^5,10^6]$, unlike the Illinois data which spans 7, $[10^0,10^6]$.
The p-value for the Illinois data was $0.735$, so the chi-squared test, which compares the observed first digits against the Benford expectation, found no significant deviation from it. This does not prove the data follow Benford's Law, but it is consistent with what the chart shows.

Prime Numbers

I also ran the analysis on sets of prime numbers of several different sizes to see whether any cutoff would lead to a Benford distribution. The first digits are more or less evenly distributed. The reason the counts drop off at 8 and 9 is that the last number in the set is $7,919$, so the set contains no four-digit primes that start with 8 or 9. I tried values up to $50,000,000$, and the first digits were always evenly distributed up to the leading digit of the largest number in the dataset.
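
The drop-off at 8 and 9 can be reproduced with a short sketch. It assumes the set in question is the first 1,000 primes, which matches the last value of $7,919$. The helper names are hypothetical, and this is not the generate_n_primes.sh script from the repository.

```python
from collections import Counter
from itertools import takewhile


def first_n_primes(n):
    """Return the first n primes using trial division by earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # Only primes up to sqrt(candidate) need to be checked.
        small = takewhile(lambda p: p * p <= candidate, primes)
        if all(candidate % p for p in small):
            primes.append(candidate)
        candidate += 1
    return primes


def first_digit_counts(numbers):
    """Count how often each digit 1..9 leads the positive integers given."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    return {d: counts.get(d, 0) for d in range(1, 10)}


if __name__ == "__main__":
    primes = first_n_primes(1000)
    print(primes[-1])  # 7919, the 1,000th prime
    print(first_digit_counts(primes))
```

Because the set stops at 7919, digits 1 through 7 each get a share of the four-digit primes, while 8 and 9 only appear as leading digits of shorter primes.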

Kaggle Online Payment Data

I wanted to fact-check one of my favorite shows. On the off chance someone hasn't seen it, I hid the spoiler in a disclosure tag here:

Breaking Bad Spoiler!

In Breaking Bad, Saul Goodman suggests laundering Walter White's money with a botnet. By setting up a public donations page, Walter could funnel his money into a computer program that simulates anonymous donations coming in from all over the world. Here's the question: do public donations follow Benford's Law? If they do, then the IRS could use that to identify fraudulent activity. The chi-squared test applied to this dataset found insufficient evidence to reject the null hypothesis. Since the test measures deviation from the Benford distribution, these payments are consistent with Benford's Law, which is bad news for Walter and Saul unless their simulated donations reproduce the same pattern.

Here is Benford's Distribution for Online Payment data as seen on Kaggle. Similar to the Illinois population data, the visualization appears to support Benford's Distribution, and the chi-squared test agrees: it found no significant deviation from the expected first-digit frequencies.


crmueller100/benfords-law-using-spark
