---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/rqdatatable)](https://cran.r-project.org/package=rqdatatable)
[![status](https://tinyverse.netlify.com/badge/rqdatatable)](https://CRAN.R-project.org/package=rqdatatable)
![](https://github.com/WinVector/rqdatatable/raw/master/tools/rqdatatable.png)

[`rqdatatable`](https://github.com/WinVector/rqdatatable) is an implementation of
the [`rquery`](https://github.com/WinVector/rquery) piped Codd-style relational algebra
hosted on [`data.table`](https://rdatatable.gitlab.io/data.table/). `rquery` allows the expression
of complex transformations as a series of relational operators, and
`rqdatatable` implements the operators using `data.table`.

A `Python` version of `rquery`/`rqdatatable` is under initial development as [`data_algebra`](https://github.com/WinVector/data_algebra).

For example, scoring a logistic regression model (which requires grouping, ordering, and ranking)
is organized as follows. For more on this example please see
["Let’s Have Some Sympathy For The Part-time R User"](https://win-vector.com/2017/08/04/lets-have-some-sympathy-for-the-part-time-r-user/).

```{r}
library("rqdatatable")
```
```{r}
# data example
dL <- build_frame(
  "subjectID", "surveyCategory"     , "assessmentTotal" |
  1          , "withdrawal behavior", 5                 |
  1          , "positive re-framing", 2                 |
  2          , "withdrawal behavior", 3                 |
  2          , "positive re-framing", 4                 )
```
```{r}
scale <- 0.237
# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale)) %.>%
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>%
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID',
                      'diagnosis',
                      'probability')) %.>%
  orderby(., cols = 'subjectID')
```
We can show the expanded form of the query tree.
```{r, comment=""}
cat(format(rquery_pipeline))
```
And execute it using `data.table`.
```{r}
ex_data_table(rquery_pipeline)
```
One can also apply the pipeline to new tables.
```{r}
build_frame(
  "subjectID", "surveyCategory"     , "assessmentTotal" |
  7          , "withdrawal behavior", 5                 |
  7          , "positive re-framing", 20                ) %.>%
  rquery_pipeline
```
Initial benchmarking of `rqdatatable` is very favorable (notes [here](https://win-vector.com/2018/06/03/rqdatatable-rquery-powered-by-data-table/)).

To install `rqdatatable`, please use `install.packages("rqdatatable")`.

Some related work includes:
* [`data.table`](https://rdatatable.gitlab.io/data.table/)
* [`Polars`](https://www.pola.rs)
* [`data algebra`](https://github.com/WinVector/data_algebra)
* [`disk.frame`](https://github.com/DiskFrame/disk.frame)
* [`dbplyr`](https://dbplyr.tidyverse.org)
* [`dplyr`](https://dplyr.tidyverse.org)
* [`dtplyr`](https://github.com/tidyverse/dtplyr)
* [`maditr`](https://github.com/gdemin/maditr)
* [`nc`](https://github.com/tdhock/nc)
* [`poorman`](https://github.com/nathaneastwood/poorman)
* [`rquery`](https://github.com/WinVector/rquery)
* [`SparkR`](https://CRAN.R-project.org/package=SparkR)
* [`sparklyr`](https://spark.rstudio.com)
* [`sqldf`](https://github.com/ggrothendieck/sqldf)
* [`table.express`](https://github.com/asardaes/table.express)
* [`tidyfast`](https://github.com/TysonStanley/tidyfast)
* [`tidyfst`](https://github.com/hope-data-science/tidyfst)
* [`tidyquery`](https://github.com/ianmcook/tidyquery)
* [`tidyr`](https://tidyr.tidyverse.org)
* [`tidytable`](https://github.com/markfairbanks/tidytable) (formerly `gdt`/`tidydt`)

--
Note that `rqdatatable` has an "immediate mode" which allows direct application of pipeline stages without
pre-assembling the pipeline. "Immediate mode" is a convenience for ad-hoc analyses and has some negative
performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found
[here](https://github.com/WinVector/rqdatatable/blob/master/extras/ImmediateIssue.md).

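The difference between the two modes can be sketched as follows. This is a minimal illustration, assuming the `rquery` operators accept a `data.frame` directly once `rqdatatable` is attached; the data frame `d` and the names `res_immediate`, `ops`, and `res_pipeline` are made up for this example and are not part of the package.

```{r}
library("rqdatatable")

# hypothetical example data (illustrative only)
d <- data.frame(
  subjectID = c(2, 1, 2, 1),
  assessmentTotal = c(4, 5, 3, 2))

# immediate mode (assumed behavior): each stage is applied to the data
# as soon as it is reached, so every step is executed separately
res_immediate <- d %.>%
  select_columns(., c("subjectID", "assessmentTotal")) %.>%
  orderby(., cols = c("subjectID", "assessmentTotal"))

# pipeline mode: describe the table, assemble the operator tree once,
# then execute the whole tree in a single call
ops <- local_td(d) %.>%
  select_columns(., c("subjectID", "assessmentTotal")) %.>%
  orderby(., cols = c("subjectID", "assessmentTotal"))
res_pipeline <- ex_data_table(ops)
```

Both forms are expected to return the same rows. The pre-assembled `ops` object can also be reused on other tables that have the same columns, which is one reason to prefer it outside of quick exploratory work.
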
`rqdatatable` implements the `rquery` grammar in the style of a "Turing or Cook reduction" (implementing the result in terms of multiple oracle calls to the related system).

`rqdatatable` is intended for "simple column names". In particular, because `rqdatatable` often uses `eval()` to work over `data.table`, escape characters such as "`\`" and "`\\`" are not reliable in column names. Also, `rqdatatable` does not support tables with no columns.
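
One way to stay within these limits is to normalize column names before handing a table to a pipeline. The following is a minimal sketch using base R's `make.names()`; the data frame `d_raw` and its awkward column names are hypothetical, and this is a precaution applied by the user, not something `rqdatatable` is stated to do itself.

```{r}
# hypothetical table whose column names contain a backslash and a space
d_raw <- data.frame(
  "total\\cost" = c(1.5, 2.5),
  "unit price" = c(3, 4),
  check.names = FALSE)

# tables with no columns are not supported, so check up front
stopifnot(ncol(d_raw) > 0)

# replace characters that are not valid in syntactic R names with "."
d_simple <- d_raw
colnames(d_simple) <- make.names(colnames(d_raw), unique = TRUE)

print(colnames(d_simple))
```

Keeping a copy of the original names (here still available as `colnames(d_raw)`) makes it possible to restore them on the final result if they are needed for reporting.
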