crmueller100/benfords-law-using-spark

Benford's Law Analysis

Goal

Determine the prevalence and applicability of Benford's Law by applying it to public datasets.

What is Benford's Law

Benford's Law is a mathematical phenomenon where the first digit of numbers in many real-world datasets does not occur with equal frequency. The number $1$ appears most often, followed by $2$, and so on, with $9$ being the least frequent.

Because many naturally occurring datasets follow this pattern closely, it is often used as a screening method in fraud detection.
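
The expected frequencies come from the standard formula $P(d) = \log_{10}(1 + 1/d)$ for a leading digit $d$ from $1$ to $9$. Here is a minimal sketch in plain Python that tabulates them; the function name is illustrative and is not taken from this repo's source.

```python
import math

def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_expected()
for digit, p in probs.items():
    # Prints one line per digit, e.g. "1: 0.301" down to "9: 0.046"
    print(f"{digit}: {p:.3f}")
```

So a leading $1$ is expected about 30.1% of the time and a leading $9$ only about 4.6% of the time.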

Quick Setup

  • Download Spark
  • Download a compatible version of Java
  • Set SPARK_HOME and JAVA_HOME environment variables to their respective paths
  • Create and activate a virtual environment: python3 -m venv env, then source env/bin/activate
  • pip install -r requirements.txt
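
Before running anything, it can help to confirm that both environment variables are visible to Python. The sketch below is only a convenience check; the helper name `missing_env_vars` is hypothetical and is not part of this repo.

```python
import os

def missing_env_vars(names=("SPARK_HOME", "JAVA_HOME"), environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("SPARK_HOME and JAVA_HOME are both set")
```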

How to run

  • Fetch the data using the fetch_data_scripts/<script.sh> scripts.
    • You may need to make the script executable with chmod +x fetch_data_scripts/<script.sh>
  • python src/main.py

Results and Analysis

I checked several different datasets to see if they followed Benford's Law. I logged some useful metrics (in the output.txt file) such as the number of orders of magnitude present in the dataset, sample size, and results from a chi-squared test, including p-value.

  • Orders of magnitude: Datasets that span several orders of magnitude are generally more likely to follow Benford's Law
  • Sample size: A larger sample tends to cover a wider range of values, so larger datasets are more likely to follow Benford's Law
  • p-value: Here is a brief refresher on p-values:
    • The p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true. A small p-value (commonly below $0.05$) is evidence against the null hypothesis
    • I used a chi-squared test, which compares the observed distribution of first digits in a dataset to the expected distribution under Benford's Law and gives a statistical measure of how well the data conforms to the law.
    • Because the expected frequencies come from Benford's Law, the null hypothesis in this study is: "The first digits follow Benford's distribution"
    • This makes the alternative hypothesis: "The first digits do not follow Benford's distribution"
    • A large p-value therefore means the data is consistent with Benford's Law, while a small p-value means it deviates significantly from it.
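
The chi-squared comparison described above can be sketched in plain Python. This is not the repo's Spark code, and the function names are illustrative. It assumes the test uses all nine first digits, which gives 8 degrees of freedom, and it uses the closed-form survival function that exists for an even number of degrees of freedom in place of a statistics library.

```python
import math
from collections import Counter

def first_digit(value):
    """Leading nonzero digit of a number, or None if the number is zero."""
    digits = str(abs(value)).lstrip("0.")
    return int(digits[0]) if digits else None

def chi2_sf_df8(stat):
    """P(X >= stat) for a chi-squared variable with 8 degrees of freedom."""
    half = stat / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))

def chi_squared_vs_benford(values):
    """Return (statistic, p_value) for first digits against Benford's Law."""
    counts = Counter(d for d in map(first_digit, values) if d is not None)
    n = sum(counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford's expected count
        stat += (counts[d] - expected) ** 2 / expected
    return stat, chi2_sf_df8(stat)

# Powers of 2 are a classic Benford-like sequence; equal digit counts are not.
powers_of_two = [2 ** k for k in range(1, 101)]
uniform_digits = list(range(1, 10)) * 100
print("powers of two:", chi_squared_vs_benford(powers_of_two))
print("uniform digits:", chi_squared_vs_benford(uniform_digits))
```

With the null hypothesis being "the digits follow Benford's distribution", the powers of two should give a large p-value and the uniform digits a very small one.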

Populations

I had the best results looking at populations. Consider these 2 charts:

us_1000_benfords_law.png illinois_benfords_law.png

Note how the "Top 1,000 US Cities" data does not seem to follow Benford's Law, whereas the "Illinois Data" appears to follow the distribution. There are a couple of reasons for that:

  • The 1,000 cities span only one order of magnitude, $[10^5,10^6]$, unlike the Illinois data, which spans 6, $[10^0,10^6]$.
  • The p-value for the Illinois data was $0.735$. Since the chi-squared test measures how far the observed first digits deviate from Benford's expected distribution, a p-value this high means there is no significant evidence that the data departs from Benford's Law.
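
One simple way to quantify the span of a dataset is the difference between the base-10 logarithms of its largest and smallest nonzero magnitudes. The sketch below assumes that definition; how the repo computes the metric logged in output.txt is not stated here, and the sample values are made up for illustration.

```python
import math

def orders_of_magnitude(values):
    """Approximate span of the data as log10(max) - log10(min) over nonzero magnitudes."""
    magnitudes = [abs(v) for v in values if v != 0]
    return math.log10(max(magnitudes)) - math.log10(min(magnitudes))

# Illustrative values only, not taken from the datasets in this repo.
narrow = [120_000, 250_000, 480_000, 910_000]
wide = [3, 45, 1_200, 37_000, 980_000]
print(f"narrow: {orders_of_magnitude(narrow):.2f}")
print(f"wide:   {orders_of_magnitude(wide):.2f}")
```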

Prime Numbers

I also ran the analysis on sets of prime numbers of several different sizes to see if some size would lead to a Benford distribution. The first digits of these numbers are more or less evenly distributed. The reason the values drop off at 8 and 9 in the chart is that the last number in that set is $7,919$. I tried values up to $50,000,000$, and the digits were always evenly distributed up to the largest number in the dataset.

primes_benfords_law
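
The drop-off at 8 and 9 can be reproduced with a few lines of plain Python. This sketch assumes the set in the chart is the first 1,000 primes (which end at 7,919) and uses simple trial division; it is not the repo's Spark code and the function names are illustrative.

```python
from collections import Counter

def first_primes(n):
    """Return the first n primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        is_prime = True
        for p in primes:
            if p * p > candidate:
                break  # no divisor up to the square root, so candidate is prime
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return primes

def leading_digit_counts(numbers):
    """Count how often each digit 1-9 appears as the first digit."""
    return Counter(int(str(n)[0]) for n in numbers)

primes = first_primes(1000)
counts = leading_digit_counts(primes)
print("largest prime:", primes[-1])
for d in range(1, 10):
    print(d, counts[d])
```

Because the set stops partway through the 7,000s, no four-digit primes start with 8 or 9, which is why those two bars are so much shorter than the rest.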

Kaggle Online Payment Data

I wanted to fact-check one of my favorite shows. On the off chance someone hasn't seen it, I hid the spoiler in a disclosure tag here:

Breaking Bad Spoiler!

In Breaking Bad, Saul Goodman suggests that they launder Walter White's money by using a botnet. By setting up a public donations page, Walter could funnel his money into a computer program that would simulate anonymous donations coming in from all over the world. Here's the question: do public donations follow Benford's Law? If they do, then the IRS could use that to identify fraudulent activity. The p-value from the chi-squared test applied to this dataset suggests that there is insufficient evidence to reject the null hypothesis. Because the test measures deviation from Benford's Law, that means real online payments are consistent with it, which is bad news for Walter and Saul if their simulated donations ignored the pattern.

Here is Benford's Distribution for Online Payment data as seen on Kaggle. Similar to the Illinois population data, the visualization appears to support Benford's Distribution, and the chi-squared test agrees: there is not enough evidence to reject the hypothesis that the first digits follow it.

kaggle_online_payments
