---
title: 'Subsets and conditions'
description: 'Let''s begin to seek the best conditions for coping with uncertainty.'
---
## Welcome to part 2!
```yaml
type: NormalExercise
key: da73cb2b07
lang: r
xp: 100
skills: 1
```
Welcome to chapter 6: **Subsets and conditions**. In the following chapters (6-10) we will be using a dataset collected from students in the Faculty of Social Sciences at the University of Helsinki. The students filled out [this questionnaire](https://elomake.helsinki.fi/lomakkeet/54219/lomake.html), which produced the dataset described [here](http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt).
The answers were later combined into the following combination variables:
- *deep*: mean score of questions related to deep learning
- *stra*: mean score of questions related to strategic learning
- *surf*: mean score of questions related to surface learning
- *attitude*: mean score of questions related to attitude toward statistics
The *learning2014* dataset includes these variables along with the students' age and gender. The data also includes the students' points in a statistics exam. Zero points means that the student did not attend an exam. At the end of the course we will take a close look at how a positive attitude predicts students' exam performance.
`@instructions`
- Look at the structure of learning2014
- Look at the first six observations of learning2014
- Compute a summary of all the variables in learning2014 (with a single function call)
`@hint`
- Use `head()` to look at the first six observations
- Use `summary()` on learning2014 to compute the summaries of the variables
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# Access ggplot2 functions
library(ggplot2)
# Exclude students who did not attend any exams (points == 0)
learning2014 <- learning2014[learning2014$points != 0, ]
# Create objects attitude and points
attitude <- learning2014$attitude
points <- learning2014$points
# A scatter plot of attitude and points
qplot(attitude, points) + geom_smooth(method = "lm")
```
`@sample_code`
```{r}
# learning2014 is available
# Look at the structure of learning2014
str(learning2014)
# Look at the first six observations of learning2014
# Compute a summary of the variables in learning2014
```
`@solution`
```{r}
# learning2014 is available
# Look at the structure of learning2014
str(learning2014)
# Look at the first six observations of learning2014
head(learning2014)
# Compute a summary of the variables in learning2014
summary(learning2014)
```
`@sct`
```{r}
# test_object("object_name")
test_function("str", args=c("object"))
test_function("head", args=c("x"))
test_function("summary", args=c("object"))
test_error()
success_msg("Good work!")
```
---
## Explore your data
```yaml
type: MultipleChoiceExercise
key: d64c2a9918
lang: r
xp: 50
skills: 1
```
Let's get a bit more familiar with the `learning2014` dataset. The plots beside this text show the distributions of its variables. Browse through them and then select the correct claim from the list below.
`@possible_answers`
- There are many outliers in the distribution of students' scores on strategic learning.
- Over thirty students got three or fewer points.
- There are more men than women in the dataset.
- The distribution of deep learning is skewed to the right (positive skewness).
- The highest frequency in the histogram of surface learning scores is a little over 25.
`@hint`
- Browse through the plots and choose the correct claim.
- [Skewness](https://en.wikipedia.org/wiki/Skewness)
`@pre_exercise_code`
```{r}
# data
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# for resetting par
def.par <- par(no.readonly = TRUE)
# combination variables
two_plots<-function(x, color, plot_title, x_text){
layout(matrix(c(1,2), nrow=2, ncol=1), heights = c(2,1))
par(mar=c(0, 4,3,1), bty="n")
hist(x, main = plot_title, breaks = 25, col = color, xlim=c(0,5), xaxt="n", xlab = "")
par(mar=c(5,4,0,1))
boxplot(x, horizontal = T, ylim=c(0,5), xlab=x_text, col=color)
}
two_plots(learning2014$surf, 'salmon', "Distribution of students' scores on surface learning", 'Surface learning score')
two_plots(learning2014$deep, 'paleturquoise3', "Distribution of students' scores on deep learning", 'Deep learning score')
two_plots(learning2014$stra, 'mediumpurple2', "Distribution of students' scores on strategic learning", 'Strategic learning score')
two_plots(learning2014$attitude, 'seagreen3', "Distribution of students' scores on attitude towards statistics", 'Attitude score')
par(def.par)
# exam points
points <- table(cut(learning2014$points, (0:11)*3, include.lowest=T))
barplot(points, main = "Distribution of students' exam points", ylab='Frequency', xlab = "Points", col = "slategrey", border="slategrey")
# Age, Gender
layout(matrix(c(1,2), nrow=1, ncol=2), c(2,1))
boxplot(learning2014$age~learning2014$gender, main = "Box plot of students' age by gender", xlab = "age", horizontal = T, col=c("orange", "steelblue"))
barplot(table(learning2014$gender), main = "Distribution of students' gender", col=c("orange", "steelblue"), ylab="Frequency")
```
`@sct`
```{r}
msg1 = "Sorry, wrong. If you look at the boxplot of strategic learning, you can see that there are no outlier marks (like the one in deep learning) and the scores are close to each other."
msg2 = "Nope, in the distribution of students' exam points you can see that only about 17 people got three or fewer points."
msg3 = "No, quite the opposite: about 60 men and 120 women!"
msg4 = "This was a hard one. The distribution of deep learning is skewed to the left. Try again!"
msg5 = "So true! Proceed!"
test_mc(correct = 5, feedback_msgs = c(msg1,msg2,msg3,msg4, msg5))
success_msg("Going through the plots like a pro. Awesome!")
```
---
## Selecting a subset()
```yaml
type: NormalExercise
key: c12d1b7017
lang: r
xp: 100
skills: 1
```
You have already learned how to look at your data with visualizations and summaries. This is great, but often there is a need to zoom in on a **subset** of the data matching some *condition*.
There are multiple ways to select a subset in R. Perhaps the most convenient method is the `subset()` function. `subset()` is a generic function that works with several data formats. Its first argument is the data and its second argument is the *logical condition* used to select parts of the data.
When used on a data.frame, `subset()` returns the rows of the data.frame which match the logical conditions. You will soon learn more about logical conditions.
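As a minimal sketch of how this works, the example below applies `subset()` to a small made-up data frame. The `toy` data and its values are hypothetical and not part of learning2014. It also shows that bracket indexing with the same condition should select the same rows.

```{r}
# A small, made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  gender = c("M", "N", "N", "M"),
  stra   = c(4.5, 3.2, 4.1, 2.8)
)

# Rows where stra is greater than 4
high_stra <- subset(toy, stra > 4)
high_stra

# The same condition used with bracket indexing
same_rows <- toy[toy$stra > 4, ]

# Compare the two results
all.equal(high_stra, same_rows)
```

`subset()` lets you write the column name directly (`stra`), whereas bracket indexing needs the full reference `toy$stra`.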
`@instructions`
- Look at the first six observations of `learning2014`
- Create object `stra_learners` by executing the example code
- Look at the first 6 rows of the strategic learners subdata
- Create your own subset by selecting the students with attitude less than 3
`@hint`
- Use the functions `head()` and `subset()` as in the example code.
- The strategic learning variable is named `stra`. The attitude variable is named `attitude`.
- Use the operator `<` similarly to its counterpart `>` in the example.
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# Look at the first six observations
head(learning2014)
```
`@sample_code`
```{r}
# learning2014 is available
# Look at the first six observations
head(learning2014)
# Select students who scored above (>) 4 on strategic learning
stra_learners <- subset(learning2014, stra > 4)
# Look at the first 6 rows of stra_learners
# Select students who scored below (<) 3 on attitude
attitude_low <-
```
`@solution`
```{r}
# learning2014 is available
# Look at the first six observations
head(learning2014)
# Select students who scored above (>) 4 on strategic learning
stra_learners <- subset(learning2014, stra > 4)
# Look at the first 6 rows of the subdata
head(stra_learners)
# Select students who scored below (<) 3 on attitude
attitude_low <- subset(learning2014, attitude < 3)
```
`@sct`
```{r}
# submission correctness tests
# example tests:
test_output_contains("head(stra_learners)")
test_object("attitude_low")
# test if the student's code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Excellent work!")
```
---
## Logical comparison
```yaml
type: NormalExercise
key: 73f94cfacf
lang: r
xp: 100
skills: 1
```
What you just saw inside the `subset()` function was **logical comparison**. Logical comparison creates logical vectors, which can be used to select subsets of data. Logical conditions have other uses as well and are generally important in programming tasks.
The logical comparison operators in R are:
operator | description
----------|------------
`==` | exactly equal to
`!=` | not equal to
`<` | less than
`>` | greater than
`<=` | less than or equal to
`>=` | greater than or equal to
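
To illustrate how a comparison produces a logical vector that can then select elements, here is a short sketch. The vector `scores` is made up for this example.

```{r}
# A made-up numeric vector (hypothetical values)
scores <- c(2.5, 4.0, 3.1, 4.8)

# The comparison is done element by element and returns a logical vector
is_high <- scores >= 4
is_high            # FALSE  TRUE FALSE  TRUE

# The logical vector selects the elements where the condition is TRUE
scores[is_high]    # 4.0 4.8

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(is_high)       # 2
```
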
Follow the instructions below to complete the exercise. Take your time, this might not be a quick exercise.
`@instructions`
- Execute and study the example code
- Add to the row with `c("a","b","c")`: select a suitable logical comparison operator and a suitable comparison value to produce a result vector `FALSE, FALSE, TRUE`
- Add to the row with `c(1,3,2)`: select a suitable logical comparison operator and a suitable comparison value to produce a result vector `TRUE, FALSE, TRUE`
- Bonus: can you think of other ways of achieving the same results?
`@hint`
- The third example should point you in the right direction
- If you're uncertain, just choose a comparison value and try out a bunch of different operators to figure out what might work.
`@pre_exercise_code`
```{r}
# no pec
```
`@sample_code`
```{r}
# In the comments below: T = TRUE, F = FALSE
# (R also understands these abbreviations)
# Exactly equal
5 == 5 # TRUE
# Not equal
"cat" != "dog" # TRUE
# Greater than or equal to, compared element by element (3 comparisons)
c(0,1,2) >= 1 # F, T, T
# Use logical comparison to produce a result F, F, T
c("a","b","c")
# Use logical comparison to produce a result T, F, T
c(1, 3, 2)
```
`@solution`
```{r}
# In the comments below: T = TRUE, F = FALSE
# (R also understands these abbreviations)
# Exactly equal
5 == 5 # TRUE
# Not equal
"cat" != "dog" # -> TRUE
# Greater than or equal to, compared element by element (3 comparisons)
c(0,1,2) >= 1 # F,T,T
# Use logical comparison to produce a result F,F,T
c("a","b","c") == "c"
# Use logical comparison to produce a result T,F,T
c(1,3,2) < 3
```
`@sct`
```{r}
# submission correctness tests
# example tests:
test_output_contains("c('a','b','c')=='c'", incorrect_msg = "Please make the necessary adjustments to produce the output FALSE FALSE TRUE")
test_output_contains("c(1,3,2)<3", incorrect_msg = "Please make the necessary adjustments to produce the output TRUE FALSE TRUE")
# test if the student's code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Well done! Remember Spock's words: Logic is the beginning of wisdom, not the end.")
```
---
## Logical operators
```yaml
type: NormalExercise
key: d7b5a24d40
lang: r
xp: 100
skills: 1
```
Logical conditions can be combined with the following logical operators:
operator | description
----------------------- | -----------
`!`a | NOT a
a `&` b | a AND b
a <code>|</code> b | a OR b
Logical operators work like logical conditions: they compare the elements of vectors one at a time and can produce logical vectors.
So, for example `F & T` evaluates to `FALSE` and `F | T` evaluates to `TRUE`. Also, `c(F, T) & c(T, T)` produces a logical vector `FALSE TRUE`.
It is also possible to use parentheses to control the order in which the operators are evaluated. The example code shows how this works.
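The following sketch, using made-up logical vectors `a` and `b`, shows the three operators working element by element and how parentheses can change a result. In R, `&` is evaluated before `|` when no parentheses are given.

```{r}
# Made-up logical vectors (hypothetical values)
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

!a        # FALSE FALSE  TRUE
a & b     #  TRUE FALSE FALSE
a | b     #  TRUE  TRUE FALSE

# & is evaluated before |, so parentheses can change the result
no_parens   <- TRUE | FALSE & FALSE      # same as TRUE | (FALSE & FALSE)
with_parens <- (TRUE | FALSE) & FALSE
no_parens      # TRUE
with_parens    # FALSE
```
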
`@instructions`
- Create and print out four subsets of the learning2014 data matching the following conditions:
- (gender is male, "M") AND (deep learning is greater than 4.5)
- (age is 25 OR age is 26) AND (attitude is greater than 3.5)
- (gender is female, "N") AND (strategic learning is greater than 4.5)
- (deep learning is more than 4) AND (points is zero OR points is greater than 30)
`@hint`
- The instructions show the correct use of parentheses when using the logical operators and conditions
- Note that the logical equals is `==`, not `=`.
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
```
`@sample_code`
```{r}
# learning2014 is available
# male students who scored very high on deep learning
subset(learning2014, (gender == "M") & (deep > 4.5))
# 25- or 26-year-old students who scored high on attitude
subset(learning2014, (age == 25 | age == 26) & (attitude > 3.5))
# female students who scored very high on strategic learning
# students who scored high on deep learning for whom points is 0 or greater than 30
```
`@solution`
```{r}
# learning2014 is available
# male students who scored very high on deep learning
subset(learning2014, (gender == "M") & (deep > 4.5))
# 25- or 26-year-old students who scored high on attitude
subset(learning2014, (age == 25 | age == 26) & (attitude > 3.5))
# female students who scored very high on strategic learning
subset(learning2014, (gender == "N") & (stra > 4.5))
# students who scored high on deep learning for whom points is 0 or greater than 30
subset(learning2014, deep > 4 & (points == 0 | points > 30))
```
`@sct`
```{r}
# submission correctness tests
test_output_contains("subset(learning2014,(gender=='N')&(stra>4.5))", incorrect_msg="Please print out the subset containing the female students with high strategic learning scores")
test_output_contains("subset(learning2014,deep>4&(points==0|points>30))", incorrect_msg="Please print out the subset containing the students who scored high on deep learning and had either no points or very high points")
# test if the student's code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Excellent! You are the operator indeed.")
```
---
## Classical probability
```yaml
type: NormalExercise
key: 71e768aa4c
lang: r
xp: 100
skills: 1
```
Probability is the measure of the likelihood that an event will occur. It is quantified as a number between 0 and 1. Probability is an **essential** tool for statistics since the discipline is primarily interested in measuring uncertainty.
The classical definition of probability is the proportion of favourable cases over all possible outcomes. Thus, the probability of event $A$ with $f$ favourable cases out of $N$ possible outcomes is:
$$P(A) = \frac{f}{N}$$
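As a small illustration of the formula, the sketch below counts favourable cases in a made-up vector of attitude scores. The vector `att` and its values are hypothetical and are not taken from learning2014.

```{r}
# Made-up attitude scores of ten students (hypothetical data)
att <- c(2.5, 3.4, 3.9, 1.8, 4.2, 3.0, 3.6, 2.9, 4.5, 3.1)

# f: number of favourable cases, N: number of all possible outcomes
f <- sum(att > 3)    # 6
N <- length(att)     # 10

# P(attitude > 3) = f / N
p <- f / N
p                    # 0.6

# The mean of a logical vector gives the same proportion
mean(att > 3)        # 0.6
```
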
For the questions below assume that we randomly pick a single student from the learning2014 data.
`@instructions`
- Execute the sample code that creates a cross table of gender and attitude categories
- What is the probability that the student is female?
- What is the probability that the student has scored more than 3 on attitude?
- What is the probability that the student is male and has scored more than 4 on attitude?
- Save the probabilities to the objects p2, p3, p4 (instead of `NA`, not available)
- Combine all the probabilities to a vector with `c()` and print it out, rounded to 2 digits
`@hint`
- Use the table created and count the number of favourable cases (f) and the number of all possible cases (N)
- The asked probabilities are then f / N
- Use a comma to combine multiple values with `c()`.
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
print(addmargins(gndr_att))
```
`@sample_code`
```{r}
# learning2014 is available
# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
addmargins(gndr_att)
# P(gender = M)
p1 <- 61 / 183
# P(gender = N)
p2 <- NA
# P(attitude > 3)
p3 <- NA
# P(gender = M, attitude > 4)
p4 <- NA
# Print out the probabilities
round(c(p1), digits = 2)
```
`@solution`
```{r}
# learning2014 is available
# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
addmargins(gndr_att)
# P(gender = M)
p1 <- 61 / 183
# P(gender = N)
p2 <- 122 / 183
# P(attitude > 3)
p3 <- (85 + 19) / 183
# P(gender = M, attitude > 4)
p4 <- 12 / 183
# Print out the probabilities
round(c(p1, p2, p3, p4), digits = 2)
```
`@sct`
```{r}
# submission correctness tests
# example tests:
test_object("p2")
test_object("p3")
test_object("p4")
test_output_contains("round(c(p1, p2, p3, p4),2)")
# test if the student's code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Great! You are probably a genius!")
```
---
## Conditional probability
```yaml
type: NormalExercise
key: 562f9818e0
lang: r
xp: 100
skills: 1
```
As you saw before, conditioning is closely related to selecting a subset. The condition(s) defines the subset of the possible cases. This is also a good way to think about conditional probability: The condition defines the subset of possible outcomes.
Formally, the conditional probability of $A$ given $B$ is defined (when $P(B) > 0$) as
$$P(A | B) = \frac{P(A \text{ and } B)}{P(B)}$$
But we won't directly need to apply that definition here.
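Although the exercise does not require it, the following sketch shows on a made-up cross table that the formula and the "subset" way of thinking give the same number. The counts in `counts` are hypothetical and do not come from learning2014.

```{r}
# A made-up cross table of counts (hypothetical numbers)
counts <- matrix(c(10, 30,
                    5, 55),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("M", "N"),
                                 attitude = c("high", "low")))
total <- sum(counts)   # 100

# Via the formula: P(high | M) = P(high and M) / P(M)
p_high_and_M <- counts["M", "high"] / total   # 10 / 100
p_M          <- sum(counts["M", ]) / total    # 40 / 100
via_formula  <- p_high_and_M / p_M

# Via the subset: favourable cases among males only
via_subset <- counts["M", "high"] / sum(counts["M", ])   # 10 / 40

c(via_formula, via_subset)   # both 0.25
```
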
For the questions below assume that we randomly pick a single student from the learning2014 data.
`@instructions`
- Print a table of attitude and gender by calling `addmargins()`
- What is the probability that the student has scored over 4 on attitude, on the condition that the student is male? Save the probability to `p1`
- What is the probability that the student has scored over 4 on attitude, on the condition that the student is female? Save the probability to `p2`.
- What is the probability that the student is male, on the condition that the student has scored over 4 on attitude? Save the probability to `p3`.
- What is the probability that the student is female, on the condition that the student has scored over 4 on attitude? Save the probability to `p4`.
- Combine all the probabilities to a vector with `c()` and print it out, rounded to 2 digits
`@hint`
- Remember that classical probability is the frequency of favourable outcomes over all possible outcomes. Now, the condition determines the subset of possible outcomes.
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
```
`@sample_code`
```{r}
# Print me!
addmargins(gndr_att)
# P(attitude > 4 | gender = M)
p1 <- 12 / 61
# P(attitude > 4 | gender = N)
p2 <- NA
# P(gender = M | attitude > 4)
p3 <- NA
# P(gender = N | attitude > 4)
p4 <- NA
# Print out the probabilities
round(c(p1), digits = 2)
```
`@solution`
```{r}
# Print me!
addmargins(gndr_att)
# P(attitude > 4 | gender = M)
p1 <- 12 / 61
# P(attitude > 4 | gender = N)
p2 <- 7 / 122
# P(gender = M | attitude > 4)
p3 <- 12 / 19
# P(gender = N | attitude > 4)
p4 <- 7 / 19
# Print out the probabilities
round(c(p1, p2, p3, p4), digits = 2)
```
`@sct`
```{r}
# submission correctness tests
# example tests:
test_object("p1")
test_object("p2")
test_object("p3")
test_object("p4")
test_output_contains("round(c(p1, p2, p3, p4), 2)")
# test if the student's code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Wow!! You are awesome no matter the conditions!")
```