{SLmetrics} Version 0.3-0 🚀 #33

Merged
merged 11 commits into from
Dec 30, 2024

serkor1
Owner

@serkor1 serkor1 commented Dec 30, 2024

Note

See NEWS or commit history for detailed changes.

📚 What?

🚀 New features

This update introduces four new features, described below.

Cross-Entropy Loss (PR #34): Weighted and unweighted cross-entropy loss. The function can be used as follows,

# 1) define classes and
# observed classes (actual)
classes <- c("Class A", "Class B")

actual   <- factor(
  c("Class A", "Class B", "Class A"), 
  levels = classes

)

# 2) define probabilities
# and construct response_matrix
response <- c(
  0.2, 0.8, 
  0.8, 0.2, 
  0.7, 0.3
)

response_matrix <- matrix(
  response,
  nrow = 3,
  ncol = 2,
  byrow = TRUE
)

colnames(response_matrix) <- classes

response_matrix
#>      Class A Class B
#> [1,]     0.2     0.8
#> [2,]     0.8     0.2
#> [3,]     0.7     0.3

# 3) calculate entropy
SLmetrics::entropy(
  actual,
  response_matrix
)
#> [1] 1.19185
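
As a sanity check on the reported value, the unweighted cross-entropy can be reproduced in base R. This is a minimal sketch that assumes the metric is the mean negative log-probability assigned to the observed class. That assumption is consistent with the output above, but it is not taken from the {SLmetrics} source. The objects are redefined so the snippet runs on its own.

```r
# observed classes and predicted probabilities (same values as above)
classes <- c("Class A", "Class B")
actual  <- factor(c("Class A", "Class B", "Class A"), levels = classes)

response_matrix <- matrix(
  c(0.2, 0.8,
    0.8, 0.2,
    0.7, 0.3),
  nrow = 3, ncol = 2, byrow = TRUE,
  dimnames = list(NULL, classes)
)

# for every row, pick the probability assigned to the observed class;
# as.integer(actual) gives the column index because the factor levels
# follow the column order of response_matrix
p_observed <- response_matrix[cbind(seq_along(actual), as.integer(actual))]

# assumption: cross-entropy is averaged over the observations
manual_entropy <- -mean(log(p_observed))
manual_entropy
```

Under that assumption the result should agree with the 1.19185 reported by `SLmetrics::entropy()` above.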

Relative Root Mean Squared Error (Commit 5521b5b):

The function normalizes the Root Mean Squared Error (RMSE) by a factor. There is no single agreed-upon way of normalizing it, so {SLmetrics} offers three options: mean, range and IQR normalization, selected with normalization = 0, 1 and 2 respectively. It can be used as follows,

# 1) define values
actual <- rnorm(1e3)
predicted <- actual + rnorm(1e3)

# 2) calculate Relative Root Mean Squared Error
cat(
  "Mean Relative Root Mean Squared Error", SLmetrics::rrmse(
    actual        = actual,
    predicted     = predicted,
    normalization = 0
  ),
  "Range Relative Root Mean Squared Error", SLmetrics::rrmse(
    actual        = actual,
    predicted     = predicted,
    normalization = 1
  ),
  "IQR Relative Root Mean Squared Error", SLmetrics::rrmse(
    actual        = actual,
    predicted     = predicted,
    normalization = 2
  ),
  sep = "\n"
)

#> Mean Relative Root Mean Squared Error
#> 2751.381
#> Range Relative Root Mean Squared Error
#> 0.1564043
#> IQR Relative Root Mean Squared Error
#> 0.7323898
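
The three normalizations can be sketched in base R as below. `rrmse_sketch()` is a hypothetical helper written for illustration, not the {SLmetrics} implementation. In particular, the package may compute the quartiles differently from `stats::IQR()`, so small differences in the IQR-normalized value are possible.

```r
# hypothetical helper, not part of {SLmetrics}:
# RMSE divided by the mean, range or IQR of the observed values
rrmse_sketch <- function(actual, predicted,
                         normalization = c("mean", "range", "iqr")) {
  normalization <- match.arg(normalization)

  rmse <- sqrt(mean((actual - predicted)^2))

  denominator <- switch(
    normalization,
    mean  = mean(actual),
    range = diff(range(actual)),
    iqr   = IQR(actual)
  )

  rmse / denominator
}

# small deterministic example; the RMSE is exactly 1
actual    <- c(1, 2, 3, 4, 5)
predicted <- actual + 1

results <- c(
  mean  = rrmse_sketch(actual, predicted, "mean"),
  range = rrmse_sketch(actual, predicted, "range"),
  iqr   = rrmse_sketch(actual, predicted, "iqr")
)
results
```

In the example above, `actual` is drawn from a standard normal distribution, so its mean is close to zero. Dividing by such a small number would explain the very large mean-normalized value, and suggests that range or IQR normalization is the safer choice for data centered around zero.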

Weighted Receiver Operator Characteristics and Precision-Recall Curves (PR #31):

These functions return the weighted versions of TPR and FPR in weighted.ROC(), and of precision and recall in weighted.prROC(). The weighted.ROC()-function [1] can be used as follows,

# 1) simulate classes, response scores and observation weights
actual    <- factor(sample(c("Class 1", "Class 2"), size = 1e6, replace = TRUE, prob = c(0.7, 0.3)))
response  <- ifelse(actual == "Class 1", rbeta(sum(actual == "Class 1"), 2, 5), rbeta(sum(actual == "Class 2"), 5, 2))
w         <- ifelse(actual == "Class 1", runif(sum(actual == "Class 1"), 0.5, 1.5), runif(sum(actual == "Class 2"), 1, 2))

# 2) plot the weighted ROC curve
plot(SLmetrics::weighted.ROC(actual, response, w))
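
To illustrate what "weighted" means for a single point on the curve, here is a minimal base R sketch. `weighted_rates()` is a hypothetical helper, and both the `>=` threshold convention and the choice of a single positive class are assumptions. The actual functions in {SLmetrics} may treat thresholds and class labels differently.

```r
# hypothetical helper, not part of {SLmetrics}:
# weighted true and false positive rates at one threshold
weighted_rates <- function(is_positive, score, w, threshold) {
  predicted_positive <- score >= threshold

  c(
    tpr = sum(w[predicted_positive & is_positive])  / sum(w[is_positive]),
    fpr = sum(w[predicted_positive & !is_positive]) / sum(w[!is_positive])
  )
}

is_positive <- c(TRUE, TRUE, FALSE, FALSE)
score       <- c(0.9, 0.4, 0.6, 0.1)
w           <- c(1, 3, 2, 2)

# weighted: the heavily weighted positive (score 0.4) is missed at 0.5,
# so it pulls the TPR down more than an ordinary observation would
weighted_result <- weighted_rates(is_positive, score, w, threshold = 0.5)

# unit weights recover the usual, unweighted rates
unweighted_result <- weighted_rates(is_positive, score, rep(1, 4), threshold = 0.5)

rbind(weighted = weighted_result, unweighted = unweighted_result)
```

Repeating this over a grid of thresholds traces out the curve. Each observation contributes its weight rather than a count of one.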

⚠️ Breaking Changes

  • Weighted Confusion Matrix: The w-argument in cmatrix() has been
    removed in favor of the more verbose weighted.cmatrix()-function.
    See below,

Prior to version 0.3-0 the weighted confusion matrix was a part of
the cmatrix()-function and was called as follows,

SLmetrics::cmatrix(
    actual    = actual,
    predicted = predicted,
    w         = weights
)

This solution, although simple, was inconsistent with the remaining
implementation of weighted metrics in {SLmetrics}. To regain consistency
and simplicity the weighted confusion matrix is now retrieved as
follows,

# 1) define factors
actual    <- factor(sample(letters[1:3], 100, replace = TRUE))
predicted <- factor(sample(letters[1:3], 100, replace = TRUE))
weights   <- runif(length(actual))

# 2) without weights
SLmetrics::cmatrix(
    actual    = actual,
    predicted = predicted
)
#>    a  b  c
#> a  7  8 18
#> b  6 13 15
#> c 15 14  4

# 3) with weights
SLmetrics::weighted.cmatrix(
    actual    = actual,
    predicted = predicted,
    w         = weights
)
#>          a        b        c
#> a 3.627355 4.443065 7.164199
#> b 3.506631 5.426818 8.358687
#> c 6.615661 6.390454 2.233511
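
Conceptually, each cell of the weighted confusion matrix holds the sum of the weights of the observations falling in it, rather than their count. The base R sketch below shows this with `table()` and `xtabs()`. It is an illustration under that assumption, not the {SLmetrics} implementation.

```r
# small deterministic example
actual    <- factor(c("a", "a", "b", "b"), levels = c("a", "b"))
predicted <- factor(c("a", "b", "b", "b"), levels = c("a", "b"))
weights   <- c(0.5, 1, 2, 0.25)

# unweighted: every observation counts as 1
unweighted <- table(actual, predicted)

# weighted: every observation counts as its weight
weighted <- xtabs(weights ~ actual + predicted)

unweighted
weighted
```

Here the cell (b, b) contains two observations, so the unweighted count is 2 and the weighted value is 2 + 0.25 = 2.25.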

🐛 Bug-fixes

  • Return named vectors: The classification metrics when
    micro == NULL were not returning named vectors. This has been fixed.

Footnotes

  1. The syntax is the same for weighted.prROC()

* [OPTIMIZE] A new template for classification 🔨

 * The new template uses a template handler to avoid too many overloaded templates. This effectively reduces the amount of code needed to achieve the same thing. It is still variadic by construction to maintain flexibility for future additions.

* [HOT-FIX] See commit-message 📚

* Root Relative Squared Error example script renamed so it aligns with documentation
* Removed ellipsis from S3_Accuracy documentation to avoid roxygen2 producing "fully documented"-warnings

* Removed Redundant Logic 🔨

* All overloaded functions in the classification class have been removed. The overall logic has been simplified, and the old functions are therefore no longer needed 🚀

* The old templates have been removed as they are no longer needed.

* [HOT-FIX] Likelihood Ratios 🔨

* All Likelihood Ratios have had the `micro`- and `na_rm`-arguments removed as they were not used.
* The functions have been refactored and are now named more verbosely according to the metric.

* [REFACTOR] See commit message 📚

* All classification functions are constructed more verbosely - all affiliated derived classes are named as FooClass in CamelCase.
* All function logic and arguments are namespace qualified, and are now of the form Rcpp::ObjectType
* All additional arguments other than `w` and `micro` are handled inside the derived class. This reduces the clutter in the classification class object.

All tests passed locally.
* `beta` is now correctly passed as a double parameter.
* Test-setup flexibility 🔨

* Added interactive tests to ease the development flow. This enables sourcing all files via an external script.

* New Matrix Class 🔨

* The matrix class reduces the amount of repeated code.
* The matrix templates have been incorporated in the class, and are using overloading instead of if-statements
* Updated NAMESPACE, and associated methods.

* Deleted classification_Utils.H 🔨

* The header file is no longer needed.
* Deleted calls to the header file.

* Consolidated classification_Helpers.h 🔨

* The classification_Utils.H contents are moved to the helper.

* [UPDATE] See commit-message 📚

* Updated NEWS (but not rendered),
* Updated unit-tests and references according to the new matrix method.

* [FEATURE] Relative Root Mean Squared Error 🚀

* A new feature has been (re)introduced: Relative Root Mean Squared Error. The function normalizes the RMSE relative to the mean, range or IQR.
* The quantile functions have been taken from `pinball()` - it is probably a good idea to create another header file for this.
* Created unit-tests, example and documentation.

NOTE: NEWS are not updated.

* [DOCUMENTATION] Updated NEWS 📚
* The returned vectors of classification metrics weren't named.
commit cfb8d5cbacebafc0997cd8f2c076940b18cd82ec
Author: serkor1 <77464572+serkor1@users.noreply.github.com>
Date:   Sat Dec 28 11:53:22 2024 +0100

    Re-written unit-tests for the new functions :hammer:

    * Functions written in R have had py_-prefix removed so the distinction between what is from scikit-learn and manual is clear. NOTE: not all functions have been rewritten. This is a work-in-progress.

commit aa60bbbaa319b5ef7e512e87291a6fbd74e28623
Author: serkor1 <77464572+serkor1@users.noreply.github.com>
Date:   Sat Dec 28 11:50:37 2024 +0100

    [OPTIMIZE] Refactored reference functions :rocket:

    * The prROC and ROC functions from scikit-learn now iterate through all available labels, to simplify the R side of the unit-tests
    * The reference functions have been refactored such that the amount of repeated code is reduced by introducing a generalized metric function.
    * The reference functions have been split between regression and classification functions
    * The setup script has been wrapped in is.interactive() so the tests can be run directly
* Removed old artefacts; all functions are now treated equally - so there is no need to specify function names.
* Removed python-setup: This piece of code was created when transitioning from RStudio to Positron. The idea was that the Python testing environment would be self-contained in the project. But for reasons related to either my skills, {reticulate} or Positron, it never worked as intended. Considering the state of {SLmetrics}, it is not worth spending time on making it work at this stage. This is for later, when all else is done.

* Added cleaning-commands to build and check before and after checks; it's always good to start from scratch.
* [FEATURE] Weighted `ROC()` and `prROC()` 🚀

* The functions now have a weighted version.
* The functions still support custom thresholds.
* Micro and na.rm have been removed for now; the micro/macro average still needs some polishing

* [UNIT-TEST] Unit-tests updated 🔨

* The unit-tests are a bit shaky for ROC as scikit-learn drops some thresholds. This is not implemented here.

* [DOCUMENTATION] Vignette Modified 📚

* It was using max(), which returns Inf; Inf can't be plotted...

* [UPDATE] NAMESPACE and .Rd updated 📚

* [DOCUMENTATION] Updated NEWS 📚

* [BUG-FIX] Added class to functions 🥷

* [BUG-FIX] .... Don't ask
* General layout changes and additions to the README

NOTE: The performance evaluation has been extended to include memory-checks via {bench}
@serkor1 serkor1 added documentation Improvements or additions to documentation enhancement New feature or request optimze Various optimizations to source code labels Dec 30, 2024
@serkor1 serkor1 self-assigned this Dec 30, 2024
##  📚  What?

* **New Feature:** Weighted and unweighted cross-entropy loss.
* **Bug-fix:** The {bench} package reference was misspelled, and
`prROC()` was incorrectly sharing documentation with `ROC()`. This has
been corrected.

##  👀  Showcase

The order of the <[factor]> doesn't matter, as long as the
`response_matrix` is correctly specified probability-wise. I.e. the
`classes` can be specified in any order, as long as the columns of the
corresponding `response_matrix` follow the order of the classes. See below.

### Example 1: "Class A" followed by "Class B"

``` r
# 1) define classes and
# observed classes (actual)
classes <- c("Class A", "Class B")

actual   <- factor(
  c("Class A", "Class B", "Class A"), 
  levels = classes

)

# 2) define probabilities
# and construct response_matrix
response <- c(
  0.2, 0.8, 
  0.8, 0.2, 
  0.7, 0.3
)

response_matrix <- matrix(
  response,
  nrow = 3,
  ncol = 2,
  byrow = TRUE
)

colnames(response_matrix) <- classes

response_matrix
#>      Class A Class B
#> [1,]     0.2     0.8
#> [2,]     0.8     0.2
#> [3,]     0.7     0.3

# 3) calculate entropy
SLmetrics::entropy(
  actual,
  response_matrix
)
#> [1] 1.19185
```

<sup>Created on 2024-12-30 with [reprex
v2.1.1](https://reprex.tidyverse.org)</sup>

### Example 2: "Class B" followed by "Class A"

``` r
# 1) define classes and
# observed classes (actual)
classes <- c("Class B", "Class A")

actual   <- factor(
  c("Class A", "Class B", "Class A"), 
  levels = classes

)

# 2) define probabilities
# and construct response_matrix
response <- c(
  0.2, 0.8, 
  0.8, 0.2, 
  0.7, 0.3
)

response_matrix <- 1 - matrix(
  response,
  nrow = 3,
  ncol = 2,
  byrow = TRUE
)

colnames(response_matrix) <- classes

response_matrix
#>      Class B Class A
#> [1,]     0.8     0.2
#> [2,]     0.2     0.8
#> [3,]     0.3     0.7

# 3) calculate entropy
SLmetrics::entropy(
  actual,
  response_matrix
)
#> [1] 1.19185
```

<sup>Created on 2024-12-30 with [reprex
v2.1.1](https://reprex.tidyverse.org)</sup>

As shown, the cross-entropy is identical (1.19185 in both cases). The
order of `classes` in the factor’s levels just needs to match the order
of columns in the `response_matrix`. With this new feature, you can also
add observation-level weights via `SLmetrics::weighted.entropy()`.
@serkor1 serkor1 marked this pull request as ready for review December 30, 2024 13:20
@serkor1 serkor1 merged commit 878972a into main Dec 30, 2024
10 checks passed