-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
100 lines (71 loc) · 2.81 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
title: "`featr` package"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(featr)
```
`featr` stands for "feature" (oh, really?) and pronounced the same way.
There is a bunch of functions to produce not-so-common tasks such as
renaming variables in the data frame or list of data frames,
one-hot-encoding with custom parameters etc.
Please see [Examples](#examples) section
## Installation
To install `featr` please use `devtools`:
```r
devtools::install_github("pavel-filatov/featr")
```
Please note: you may need to install development version of `rlang` package:
```r
devtools::install_github("r-lib/rlang")
```
## Examples
#### `cor_index()`
`cor_index()` fucntion makes a correlation index: it gets each pair of variables, calculate
correlation between them, and turn it all to data frame.
Then the function sort data frame by absolute value of correlation.
```{r}
featr::cor_index(mtcars)
```
#### `identity_index()`
For each pair of variables in input data frame `identity_index()` calculates three numbers:
- **`identical`** - number of observations where first and second variables are equal;
- **`non_identical`** - number of observations where first and second variables aren't equal;
- **`other`** all the other cases (e.g. when one of the entries is `NA`).
**`identitity_index`** is computed as ratio
$$identity\ index = \frac{identical}{identical + non\_identical}$$
That's how the fuction works:
```{r}
set.seed(42)
df <- tibble(letter1 = sample(c(letters[1:3], NA, NaN), 100, TRUE),
letter2 = sample(c(letters[1:3], NA), 100, TRUE),
letter3 = sample(c(letters[1:3], NA), 100, TRUE),
num1 = sample(1:4, 100, TRUE),
num2 = sample(1:4, 100, TRUE),
num3 = sample(c(1:3, NA), 100, TRUE))
identity_index(df)
```
You see that variables `letter1` and `letter3` have 22 equal,
40 different observations and 38 cases where at least on of them has `NA`.
`.na_include` flag lets you consider as identical cases where `var1 == NA` and `var2 == NA`.
The same way if `var1 == NA` and `var2 != NA` it will count this case as `non_identical`.
```{r}
identity_index(df, .na_include = TRUE)
```
Now `letter1` and `letter3` have 6 more identical cases.
#### `make_shift()`
`make_shift()` is great option when you need to make a lot of lags (move down)
and leads (move up) for one or more variables.
This function captures variables to apply shifting and vector of window sizes and
returns a list of quoted expressions:
```{r}
make_shift(mpg, qsec, .n = -2:2)
```
To make new variables you simply need to unqoute the function output
inside the `dplyr::mutate()` call **using `!!!`** (bang-bang-bang) opearator:
```{r}
select(mtcars, mpg, qsec) %>%
mutate(!!!make_shift(mpg, qsec, .n = -1:2))
```