kixi.stats

A Clojure/ClojureScript library of statistical sampling and transducing functions.

Available distributions:

  • Bernoulli
  • Beta
  • Beta-binomial
  • Binomial
  • Categorical
  • Cauchy
  • Chi-Squared
  • Dirichlet
  • Dirichlet-multinomial
  • Exponential
  • F
  • Gamma
  • Log-normal
  • Multinomial
  • Normal
  • Pareto
  • Poisson
  • Student's t
  • Uniform
  • Weibull

Statistical tests:

  • Simple Z-test (one-sample location test)
  • Two-sample Z-test
  • Welch's unequal variances t-test
  • Simple t-test
  • Chi-squared test of categorical independence

Available transducing functions:

  • Count
  • Min
  • Max
  • Proportion
  • (Arithmetic) mean
  • Geometric mean
  • Harmonic mean
  • Median
  • Variance
  • Interquartile range
  • Standard deviation
  • Standard error
  • Skewness
  • Kurtosis
  • Covariance
  • Covariance matrix
  • Correlation
  • R-squared coefficient of determination
  • Adjusted R-squared
  • MSE / RMSE
  • Correlation matrix
  • Simple linear regression
  • Standard error of the mean
  • Standard error of the estimate
  • Standard error of the prediction
  • Simple Z-test & two-sample Z-test
  • Simple t-test and two-sample t-test
  • Chi-squared test

Variance, covariance, standard deviation, skewness and kurtosis each have sample and population variants.

Documentation

View the documentation here.

Examples of kixi.stats usage can be seen between 10:20 and 16:00 of this video on Clojure for Machine Learning.

Installation

Install kixi.stats into your Clojure project using the appropriate form at its Clojars page.

Or grab the most recent code using a Git dependency:

Replace $GIT_SHA with the SHA you'd like to target:

{kixi/stats
 {:git/url "https://github.com/MastodonC/kixi.stats.git"
  :git/sha "$GIT_SHA"}}

Usage

Transducing functions

kixi.stats.core contains statistical reducing functions that can be used with transduce:

(require '[kixi.stats.core :refer [standard-deviation correlation correlation-matrix]])

(->> [{:x 2} {:x 4} {:x 4} {:x 4} {:x 5} {:x 5} {:x 5} {:x 7} {:x 9}]
     (transduce (map :x) standard-deviation))

;; => 2.0

(->> [{:x 1 :y 3} {:x 2 :y 2} {:x 3 :y 1}]
     (transduce identity (correlation :x :y)))

;; => -1.0

(->> [{:x 1 :y 3 :z 2} {:x 2 :y 2 :z 4} {:x 3 :y 1 :z 6}]
     (transduce identity (correlation-matrix {:x :x :y :y :z :z})))

;; => {[:x :y] -1.0, [:x :z] 1.0, [:y :z] -1.0,
;;     [:y :x] -1.0, [:z :x] 1.0, [:z :y] -1.0}
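
A reducing function suitable for transduce has three arities: no arguments to produce an initial accumulator, two arguments to fold in one element, and one argument to produce the final result. As a rough illustration only, and not the kixi.stats implementation, a mean could be sketched in plain Clojure as follows (mean-rf is a hypothetical name):

```clojure
;; Hypothetical sketch of a mean reducing function; not the kixi.stats source.
(defn mean-rf
  ;; init: the accumulator is a [sum count] pair
  ([] [0.0 0])
  ;; complete: turn the accumulator into the final statistic
  ([[sum n]]
   (when (pos? n)
     (/ sum n)))
  ;; step: fold one element into the accumulator
  ([[sum n] x]
   [(+ sum x) (inc n)]))

(transduce (map :x) mean-rf [{:x 2} {:x 4} {:x 9}])

;; => 5.0
```

Because the completion step runs only once, at the end, the accumulator can carry whatever intermediate state the statistic needs.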

One advantage of using transduce for statistics is that several statistics can be calculated in a single pass by composing reducing functions. The generic combinators available in redux or xforms can be used together with the reducing functions in kixi.stats. For example, redux's fuse returns a higher-order reducing function that executes an arbitrary number of reducing functions simultaneously:

(require '[kixi.stats.core :refer [mean standard-deviation]]
         '[redux.core :refer [fuse]])

;; Calculate mean and standard deviation at the same time:

(->> [2 4 4 4 5 5 5 7 9]
     (transduce identity (fuse {:mean mean :sd standard-deviation})))

;; => {:mean 5.0, :sd 2.0}
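
To give a feel for what such a combinator does, here is a simplified stand-in written in plain Clojure. fuse-sketch is a hypothetical name, and unlike the real fuse this sketch makes no attempt to handle early termination with reduced; it only shows the idea of keeping one accumulator per key:

```clojure
;; Hypothetical, simplified stand-in for redux's fuse. It ignores early
;; termination via `reduced`, which a full implementation would handle.
(defn fuse-sketch [rfs]
  (fn
    ;; init: one accumulator per key
    ([] (into {} (map (fn [[k rf]] [k (rf)])) rfs))
    ;; complete: finish each accumulator with its own reducing function
    ([accs] (into {} (map (fn [[k rf]] [k (rf (get accs k))])) rfs))
    ;; step: feed the same element to every reducing function
    ([accs x] (into {} (map (fn [[k rf]] [k (rf (get accs k) x)])) rfs))))

;; + and * already have the three arities a reducing function needs:
(transduce identity (fuse-sketch {:sum + :product *}) [1 2 3 4])

;; => {:sum 10, :product 24}
```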

Integration with transducers means that the wealth of core Clojure support can be applied to working with statistics. For example, filter can be used to constrain the elements over which statistics are calculated:

(require '[kixi.stats.core :refer [median]])

(def gt5? (filter #(> % 5)))

;; Calculate the median only of numbers greater than 5:

(transduce gt5? median (range 10))

;; => 7.5

So long as a transducer is stateless, we can also apply it directly to a reducing function. This creates a new reducing function locally without affecting the other reducing functions it is composed with:

(require '[kixi.stats.core :refer [count]]
         '[redux.core :refer [fuse]])

(def gt5? (filter #(> % 5)))

;; Count both all numbers and those greater than 5:

(transduce identity (fuse {:n count :gt5 (gt5? count)}) (range 10))

;; => {:n 10, :gt5 4}
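
The same mechanism can be seen with nothing but clojure.core. In this small sketch, + stands in for a statistical reducing function and the names evens, sum-all and sum-evens are illustrative only:

```clojure
(def evens (filter even?))

;; (evens +) is a new reducing function that only sums even inputs;
;; the plain + used alongside it still sees every input.
(def sum-all   (transduce identity + (range 10)))
(def sum-evens (transduce identity (evens +) (range 10)))

[sum-all sum-evens]

;; => [45 20]
```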

The kixi.stats API is focused primarily on statistical functions and doesn't need to be littered with exhaustive count-when-style specialisms. Combinators from libraries such as redux and Clojure itself can be used to combine those functions in sophisticated ways.

Empirical distribution histograms

The Clojure version of kixi.stats.core contains reducing functions for calculating the median, interquartile range and 5-number summary using the t-digest. They can be used like this:

(require '[kixi.stats.core :refer [median iqr summary]]
         '[redux.core :refer [fuse]])

;; Calculate the median, iqr and 5-number summary:

(->> (range 100)
     (transduce identity (fuse {:median median
                                :iqr iqr
                                :summary summary})))

;; => {:median 49.5, :iqr 50.0, :summary {:min 0.0, :q1 24.5, :median 49.5, :q3 74.5, :max 99.0, :iqr 50.0}}

Although this works fine, it should be noted that each function maintains its own digest. In cases where multiple quantiles must be calculated it's more efficient to calculate a single digest with the histogram function and subsequently query it with the equivalent functions from the kixi.stats.distribution namespace.

(require '[kixi.stats.core :refer [histogram]]
         '[kixi.stats.distribution :refer [quantile]])

;; Calculate the 0.025 and 0.975 quantiles of an empirical distribution

(def distribution
  (->> (range 100)
       (transduce identity histogram)))

{:lower (quantile distribution 0.025)
 :upper (quantile distribution 0.975)}

;; => {:lower 2.0, :upper 97.0}

The post-complete function defined in kixi.stats.core allows us to chain the histogram and quantile steps like so:

(require '[kixi.stats.core :refer [histogram post-complete]]
         '[kixi.stats.distribution :refer [quantile]])

;; Calculate the 0.025 and 0.975 quantiles of an empirical distribution

(->> (range 100)
     (transduce identity (post-complete histogram
                           (fn [hist]
                             {:lower (quantile hist 0.025)
                              :upper (quantile hist 0.975)}))))

;; => {:lower 2.0, :upper 97.0}
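
Conceptually, a combinator of this kind wraps only the completion arity of a reducing function, leaving the init and step arities untouched. A minimal sketch in plain Clojure, using the hypothetical name post-complete-sketch and + in place of histogram, might look like this (it is not the library source):

```clojure
;; Hypothetical sketch of the post-complete idea; not the library source.
(defn post-complete-sketch [rf f]
  (fn
    ([] (rf))               ; init is delegated unchanged
    ([acc] (f (rf acc)))    ; complete, then post-process the result with f
    ([acc x] (rf acc x))))  ; step is delegated unchanged

(transduce identity
           (post-complete-sketch + (fn [total] {:total total}))
           (range 5))

;; => {:total 10}
```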

The kixi.stats.distribution namespace contains many functions for operating on histograms which mirror the names from kixi.stats.core: cdf, iqr, minimum, maximum, quantile and summary. In each case, the kixi.stats.core function will return a reducing function for use with transduce whereas the kixi.stats.distribution function will accept a calculated digest and return a value directly.

Distribution sampling

kixi.stats.distribution contains functions for specifying and sampling from statistical distributions.

(require '[kixi.stats.distribution :refer [draw sample binomial]])

(draw (binomial {:n 100 :p 0.5}))

;;=> 54


(sample 10 (binomial {:n 100 :p 0.5}))

;;=> (49 53 53 44 55 47 45 51 49 51)

draw and sample are the primary means of extracting variates from a distribution. draw returns a single variate whereas sample returns n variates.

Each distribution implements clojure.lang.ISeq in Clojure (ISeqable in ClojureScript), so n variates can also be taken with (take n (binomial {:n 100 :p 0.5})). However, sample applies optimisations where possible to return exactly n variates, and should be preferred.

Discrete summarisation

The Bernoulli, binomial and categorical distributions are discrete, so samples can be summarised by counting the number of times each variate appears. Discrete distributions can be directly sampled in this way with sample-summary:

(require '[kixi.stats.distribution :refer [sample-summary bernoulli]])

(sample-summary 1000 (bernoulli {:p 0.3}))

;;=> {true 296, false 704}

This is equivalent to (frequencies (sample 1000 (bernoulli {:p 0.3}))), but sample-summary should be preferred: where possible it uses optimisations to avoid reifying and aggregating a large intermediate sample. When sample-summary doesn't return a count for a particular variate, that count should be assumed to be zero.
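
One way to handle absent variates is to look counts up with a default of zero. The sketch below uses frequencies over a hand-written vector as a stand-in for a real summary, so the data and the name counts are purely illustrative:

```clojure
;; Stand-in for a summary in which the variate :c was never drawn.
(def counts (frequencies [:a :b :a :a]))

(get counts :a 0)

;; => 3

;; :c is absent from the map, so fall back to a count of zero.
(get counts :c 0)

;; => 0
```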

Deterministic sampling

The sampling functions draw, sample and sample-summary are all designed to perform deterministically when provided with a seed value. If repeatable samples are desired, pass {:seed SEED_LONG} as the final argument:

(require '[kixi.stats.distribution :refer [draw uniform]])

(draw (uniform {:a 0 :b 1}) {:seed 42})

;;=> 0.7415648787718233

(draw (uniform {:a 0 :b 1}) {:seed 42})

;;=> 0.7415648787718233
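
The principle behind seeded sampling is that a pseudorandom generator started from the same seed yields the same sequence. The following JVM-only sketch illustrates this with java.util.Random, which is an assumption made for the sake of a dependency-free example and is not the generator kixi.stats uses; draw-uniform-sketch is a hypothetical helper:

```clojure
;; JVM-only illustration using java.util.Random as a stand-in generator.
(defn draw-uniform-sketch
  "Returns one pseudorandom double in [0, 1) from a generator seeded with seed."
  [seed]
  (.nextDouble (java.util.Random. (long seed))))

;; The same seed always produces the same variate:
(= (draw-uniform-sketch 42) (draw-uniform-sketch 42))

;; => true
```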

Statistical tests

The kixi.stats.test namespace contains functions for performing statistical tests.

For example, we can perform a z-test between a known population mean and standard deviation and a sampled mean with a given sample size in the following way:

(require '[kixi.stats.test :refer [simple-z-test]])

(simple-z-test {:mu 100 :sd 12} {:mean 96 :n 55} {:tails :lower})

;;=> {:p-value 0.0067167326028858}
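
For reference, the statistic behind a one-sample z-test is the difference between the sample mean and the population mean, divided by the standard error of the mean. The arithmetic for the example above can be worked through by hand in plain Clojure; this is only a sketch of the textbook formula, and the names standard-error and z are illustrative:

```clojure
;; Textbook one-sample z statistic for the figures used above.
(def standard-error (/ 12 (Math/sqrt 55)))  ; sd / sqrt(n)
(def z (/ (- 96 100) standard-error))       ; (mean - mu) / standard error

z

;; => approximately -2.47
```

With :tails :lower, the p-value is the probability that a standard normal variate falls below this z value.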

Just as kixi.stats.distribution contains many functions that mirror those in kixi.stats.core, simple-z-test is also available in kixi.stats.core, where it returns a reducing function for use with transduce.

(require '[kixi.stats.core :refer [simple-z-test]])

;; If the standard deviation is not provided, the sample standard deviation will be used instead (a 'plug-in test')
(transduce identity (simple-z-test {:mu 100}) (range 200))

;;=> {:p-value 0.9027648250246222}

References

The statistical reducing functions are strongly influenced by Tesser. Pseudorandom number generation is provided by test.check.

License

Copyright © 2024 Mastodon C Ltd

Distributed under the Eclipse Public License version 1.0.