chapter8.Rmd

---
title: Estimation
description: 'Get your brackets ready for diving in the world of statistical inference!'
---

## Exploring estimation with R

```yaml
type: NormalExercise
key: 44d1f23e51
lang: r
xp: 100
skills: 1
```

Welcome to chapter 8: **Estimation**. In this chapter we will focus on point and interval estimation. Interval estimation is based on the important concept of the *sampling distribution*, which is a somewhat theoretical concept and therefore takes a while to get used to.

The sampling distribution is a kind of mind game: If we could collect our random sample again and again and again and calculate the mean for each of the samples, the resulting distribution of these sample means would be the sampling distribution (of the sample mean).

It turns out that for a lot of important point estimates, the sampling distribution can be approximated with the normal distribution with certain known parameters! This is due to a fascinating phenomenom called the *central limit theorem*.

To explore these concepts, we will start this chapter with some technical tools we will need for constructing *loops* and *simulations*.

`@instructions`
- Here in DataCamp, when a single expression continues over multiple rows, the whole chunk of code needs to be executed at once. This can be done by first painting (selecting) all the lines in the expression with a mouse and then pressing `Ctrl + Enter` as normal.
- Execute the 3 rows long for-loop expression
- Click submit answer to move on to the next exercise

`@hint`
- If you cannot paint (select) the expression for some reason, you can just click submit answer instead.

`@pre_exercise_code`
```{r}
# pre exercise code here
```

`@sample_code`
```{r}
# Execute me!
for(i in 1:6) {
  print("I am a loop")
}


```

`@solution`
```{r}
# Execute me!
for(i in 1:6) {
  print("I am a loop")
}


```

`@sct`
```{r}
test_error()
success_msg("Good work!")
```

---

## Indices and brackets

```yaml
type: NormalExercise
key: 23abc9f934
lang: r
xp: 100
skills: 1
```

Vectors in R store multiple values of the same data type. The values in vectors have indices: the first value in a vector has a index `1`, the second value `2` and so on. 

You can set or get a single value from a vector by using indices and brackets `[ ]`. Using an index number inside the brackets gives access to a single value from the vector. Using a vector of indices gives access to multiple values (another vector). 

It is also possible to rearrange the values in a vector by using indices. Look at the example code to see how indices work.

`@instructions`
- Create the `names` vector.
- See how the indices work by executing the example lines.
- Use brackets and indices on `names` to create a new vector `girls` with values "Liisa" and "Anna" (in that order).
- Use brackets and indices on `names` to create a new vector `boys` with values "Pekka" and "Heikki" (in that order).

`@hint`
- Note that space between the vector object and bracets produces an error.
- Index vectors `c(1,2)` and `c(2,1)` do not produce the same outcome. The order of the values is different.

`@pre_exercise_code`
```{r}

```

`@sample_code`
```{r}
# Let's create a vector
names <- c("Matti", "Pekka", "Liisa", "Anna")

# Acess the first value of the vector
names[1]

# Change the first value of the vector
names[1] <- "Heikki"

# Acess the 1. and 3. value of the vector
names[c(1, 3)]

# Use indices and brackets to separate the names vector into two vectors of the specified ordering
girls <-
boys <-


```

`@solution`
```{r}
# Let's create a vector
names <- c("Matti", "Pekka", "Liisa", "Anna")

# Acess the first value of the vector
names[1]

# Change the first value of the vector
names[1] <- "Heikki"

# Acess the 1. and 3. value of the vector
names[c(1, 3)]

# Use indices and brackets to separate the names vector into two vectors of the specified ordering
girls <- names[c(3, 4)]
boys <- names[c(2, 1)]


```

`@sct`
```{r}


```

---

## Easy vectors

```yaml
type: NormalExercise
key: 8e0ac16859
lang: r
xp: 100
skills: 1
```

Sometimes you need a very long index vector. In that case it would be a lot of work to create the vector by combining integer values with `c()`. Luckily, there are convenient ways to create long vectors in R.

For integer values, a specially useful one is the method `start:end`, which creates an integer vector with all the values from start to end. The following two lines hence produce identical results

1:5  
c(1,2,3,4,5)

`@instructions`
- Create an integer vector with values from 1 to 5.
- Create an integer vector with even values from 2 to 10.
- Create object `attitude` and give the values in it names matching the indices.
- Access index values 1-5 of `attitude`
- Use `:` to create an integer vector with the values 1, 2, ..., 10
- Access every second value of the `attitude` vector, starting from the 2. value until the 20th value. These values correspond to the even numbered indeces of the vector: 2, 4, .. , 20

`@hint`
- First you will need an index vector with values 2, 4, .. , 20. The example shows how to create such a vector
- You can then use the index vector together with brackets (`[ ]`) to complete the task

`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
```

`@sample_code`
```{r}
# learning2014 is available

# Create an integer vector with values 1,2,..,5
1:5

# Create an integer vector with even values 2, 4, .. , 10
(1:5)*2

# Create object attitude and give the data points names 1, 2, ..
attitude <- learning2014$attitude
names(attitude) <- 1:length(attitude)

# Access the values 5 - 10 of attitude
attitude[5:10]

# Create an integer vector with values 1,2,..,10


# Access every second value of attitude from 2. to the 20th index


```

`@solution`
```{r}
# learning2014 is available

# Create an integer vector with values 1,2,..,5
1:5

# Create an integer vector with even values
(1:5)*2

# Create object attitude and give the data points names 1, 2, ..
attitude <- learning2014$attitude
names(attitude) <- 1:length(attitude)

# Access the values 5 - 10 of attitude
attitude[5:10]

# Create an integer vector with values 1,2,..,10
1:10

# Access every second value of attitude from 2. to the 20th index
attitude[(1:10)*2]


```

`@sct`
```{r}
# submission correctness tests

test_student_typed("1:10", not_typed_msg = "Did you use `:` to create and print out the specified integer vector?")
test_output_contains("attitude[(1:10)*2]")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Great work!")

```

---

## Looping

```yaml
type: NormalExercise
key: a26cc4ab51
lang: r
xp: 100
skills: 1
```

In programming, often there is the need to repeat an action multiple times. In statistical programming you might for example have a theory you wish to explore by simulation.

The **for-loop** is perhaps the most common loop in programming. The structure of a for-loop is:

```
for (counter in vector) {
  commands
  more commands
}
```

The for-loop iterates trough the values of a vector by changing the value of the counter one at a time. The counter first takes on the first value of the vector, then the second and so on. 

Inside the curly braces is the *body* of the loop, which can contain regular commands such as function calls. The current value of the counter can be used there. All the commands inside the body are repeated until the counter has taken all the values of the vector.

`@instructions`
- **Please note!** Here in DataCamp, executing a command over multiple lines is done by selecting (painting) the lines with a mouse first and then hitting `Ctrl+Enter` normally. Alternatively you could also click submit to execute the whole script.R
- Loop though the character values using the examle code.  
- Loop through integer values using the example code. 
- In an honor of the legendary electronic duo *Daft Punk* and their 2001 album *Discovery*, construct a for-loop that prints out "One more time!" 27 times.
- Hints: (1) Remember that `1:n` creates an integer vector of length n. (2) Inside the loop body you have to use the `print()` function to print.

`@hint`
- It does not matter what name you give to your counter. You can for example use `i` as is done in the second example.
- In this exercise you don't need to use the values of your counter.
- Remember to use quotation marks to create and print characters.

`@pre_exercise_code`
```{r}
# no pec
```

`@sample_code`
```{r}
# Loop through and print the characters a, b, c, d
for(letter in c("a","b","c","d")) {
  print(letter)
}

# Loop through integers 1, 2, 3, 4, 5
for(i in 1:5) {
  print(i + 5)
}

# Write a for-loop that prints out "One more time!" 27 times


```

`@solution`
```{r}
# loop through characters a,b,c,d
for(letter in c("a","b","c","d")) {
  print(letter)
}

# loop through integers 1, 2, 3, 4, 5
for(i in 1:5) {
  print(i + 5)
}

# Write a for-loop that prints out "One more time!" 27 times
for(i in 1:27) {
  print("One more time!")
}


```

`@sct`
```{r}
# submission correctness tests

test_output_contains("'One more time!'", times = 27, incorrect_msg = "Did you write a for-loop that prints out 'One more time!' 27 times?")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Excellent work! Repetition is the key to success :)")

```

---

## Point estimation

```yaml
type: NormalExercise
key: 78e301b70a
lang: r
xp: 100
skills: 1
```

Let us now get back to statistical analysis and estimation. When given a sample of data from a population, the goal of statistical analysis is to use that sample to draw conclusions about the population.  

In a simple case we are interested in a single variable (such as students points in an exam) and want to estimate the *expected value* of that variable. The expected value is the average value in the target population - the population mean. 

We can't directly calculate the population average because we only have a sample, so we have to **estimate** it. Obtaining a point estimate of a population parameter is rather easy: just use the corresponding sample statistic.

population parameter                    | estimate
--------------------------------------- | --------
expected value $\mu$                    | sample mean $\bar{x}$
population standard deviation $\sigma$  | sample standard deviation $s$

<hr>

In statistics, estimates are often denoted with a hat. So, for example, we also use the notation $\hat{\mu} = \bar{x}$, and refer to "mu_hat", respectively.

`@instructions`
- Create object `points`
- Estimate the expected value of `points`. If you can't remember the name of the R function you need, use your favourite search engine or take a hint.
- Estimate the population standard deviation of `points`.
- Combine the estimates to the `estimates` vector (replace `NA`). Notice how `c()` can be used to give names to the values.
- Print out the `estimates` vector, with the values rounded to 2 digits.

`@hint`
- The table above shows the relationships between the sample statistics and the population parameters.
- Use the table to figure out which operation you could use to produce the estimate.
- Mean can be computed with `mean()`

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
```

`@sample_code`
```{r}
# learning2014 is available

# create object points using the learning2014 data.frame
points <- learning2014$points

# estimate the expected value of points
mu_hat <- 
  
# estimate the population standard deviation of points
sigma_hat <- sd(points)

# Combine the estimates to a named vector and print out the rounded values
estimates <- c("mu_hat" = NA, "sigma_hat" = NA)
round(estimates, digits = 2)


```

`@solution`
```{r}
# learning2014 is available

# create object points using the learning2014 data.frame
points <- learning2014$points

# estimate the expected value of points
mu_hat <- mean(points)

# estimate the population standard deviation of points
sigma_hat <- sd(points)

# Print out the estimated values
estimates <- c("mu_hat" = mu_hat, "sigma_hat" = sigma_hat)
round(estimates, digits = 2)


```

`@sct`
```{r}
# submission correctness tests

test_function("mean", args=c("x"))
test_object("mu_hat", incorrect_msg = "Please create the object mu_hat. Use the correct function on the points vector.")

test_output_contains("round(estimates, 2)", incorrect_msg = "Please insert your hat objects to the estimates vector and do not remove the row with `round()`")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Very nice! You get full points for point estimation.")

```

---

## Interval estimation

```yaml
type: NormalExercise
key: 1df061a41a
lang: r
xp: 100
skills: 1
```

Clearly there is uncertainty involved in a single sample mean (our point estimate). But how much uncertainty? This is where we need statistical theory. 

One way to approach this question is to try to come up with an interval around the point estimate to describe our uncertainty. It turns out that the following formula produces an interval which contains the true population mean $100 \cdot (1-\alpha)$% of the time:

$$\bar{x} \pm z_{\alpha / 2} \cdot \frac{s}{\sqrt{n}}$$
  
where $s$ is the sample standard deviation, $n$ is the sample size and $z_{\alpha / 2}$ is a critical (quantile) value from the $N(0,1)$ distribution. The part 

$$\frac{s}{\sqrt{n}}$$

is called the *standard error* of the sample mean $\bar{x}$.

I know what you're thinking. It's confusing: where did the normal distribution come from and what's with all that $z_{\alpha/2}$ stuff? The answers lie in the **sampling distribution** which we will soon get to.

`@instructions`
- Create objects `points`, `n`, `mu_hat` and `s`
- Adjust the code: Compute the standard error of `mu_hat` (the sample mean) and save t to object `error`. `sqrt()` computes the square root.
- Compute the critical value z using a 99% confidence level (alpha = 0.01) and create object `z`
- Compute the lower and upper limits of the confidence interval (CI) by applying the formula given above
- Combine the point estimate and the lower and upper CI values to the object `interval_estimate`.
- Print the values in `interval_estimate`, rounded to 1 digit (hint: `round()`)

`@hint`
- The formula for the standard error is s/sqrt(n)

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)

```

`@sample_code`
```{r}
# learning2014 is available.

# Create object points
points <- learning2014$points

# Number of observations
n <- length(points)

# Sample mean estimates the expected value (mu) 
mu_hat <- mean(points)

# Sample standard deviation
s <- sd(points)

# Compute the standard error of mu_hat
error <- NA

# Compute the critical value z
z <- qnorm(0.01/2, lower.tail = F) 

# Compute the confidence interval
lower_ci <- mu_hat - NA
upper_ci <- mu_hat + NA

# Combine the point and interval estimates to a named vector
interval_estimate <- c("estimate" = mu_hat, "lower99%" = lower_ci, "upper99%" = upper_ci)

# Round and print out the point and interval estimates


```

`@solution`
```{r}
# learning2014 is available.

# Create object points
points <- learning2014$points

# Number of observations
n <- length(points)

# Sample mean estimates the expected value (mu) 
mu_hat <- mean(points)

# Sample standard deviation
s <- sd(points)

# Compute the standard error of mu_hat
error <- s/sqrt(n)

# Compute the critical value z
z <- qnorm(0.01/2, lower.tail = F) 

# compute the confidence interval and print out the results
lower_ci <- mu_hat - z*error
upper_ci <- mu_hat + z*error

# Combine the point and interval estimates to a named vector
interval_estimate <- c("estimate" = mu_hat, "lower99%" = lower_ci, "upper99%" = upper_ci)

# Round and print out the point and interval estimates
round(interval_estimate, digits = 1)


```

`@sct`
```{r}
# submission correctness tests
test_object("error")
test_object("lower_ci")
test_object("upper_ci")
test_output_contains("round(interval_estimate, digits = 1)")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("You are awsome with 99% confidence!")


```

---

## The sampling distribution

```yaml
type: NormalExercise
key: 65d85c42aa
lang: r
xp: 100
skills: 1
```

Remember that you can think of the sampling distribution as a kind of mind game:  If we could collect our random sample again and again and calculate a sample mean each time, the result would be the sampling distribution of the mean.  

In reality though, we can't produce this distribution, because we only have the one sample. Or can we? Let's pretend for a moment that the data we have *is* the target population. Let's see what happens if we repeatedly 

1. take random samples from the population
2. calculate the sample mean each time and
3. store the results.

The resulting distribution of the multiple sample means satisfies the idea of the sampling distribution, meaning that we can actually simulate the distribution! This method (called [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))) is widely used in statistics.

`@instructions`
- Create `N` and `means` and print out the empty `means` vector
- Execute the for-loop. What does it do?
- Print out the `means` vector again
- Compute basic summary statistics of `means` using the appropriate function
- Draw a histogram displaying the sampling disribution of the means
- See the help page of the `hist()` function and make it a density histogram

`@hint`
- `summary()` computes basic summary statistics
- `hist()` draws a histogram and the argument `freq = F` can be used to draw a density histogram
- ?hist opens the help page of `hist()`

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# give a seed value to the random number generator of R
set.seed(888)

```

`@sample_code`
```{r}
# learning2014 is available

# Create an empty vector of length N
N <- 100
means <- numeric(N)

# Print out the empty vector


# Repeat N times: 
  # (1) draw a random sample of n = 50 exam points
  # (2) compute the mean of the sampled exam points and 
  # (3) store it in the means vector
for(i in 1:N) {
  points_sample <- sample(learning2014$points, size = 50, replace = F)
  means[i] <- mean(points_sample)
}

# Print out the means vector


# Compute basic summary statistics


# Visualize the distribution of the means with a histogram


```

`@solution`
```{r}
# learning2014 is available

# Create an empty vector of length N (size of our experiment)
N <- 100
means <- numeric(N)

# Print out the empty vector
means

# Repeat N times: 
  # (1) draw a random sample of n = 50 exam points
  # (2) compute the mean of the sampled exam points and 
  # (3) store it in the means vector
for(i in 1:N) {
  points_sample <- sample(learning2014$points, size = 50, replace = F)
  means[i] <- mean(points_sample)
}

# Print out the means vector
means

# Compute basic summary statistics
summary(means)

# Visualize the distribution of the means with a histogram
hist(means, freq = F)


```

`@sct`
```{r}
# submission correctness tests

test_output_contains("numeric(N)")
test_output_contains("means")
test_function("summary", args=c("object"))
test_function("hist", args=c("x","freq"))

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Very good work!")

```

---

## The central limit theorem

```yaml
type: NormalExercise
key: 38bf66cd9c
lang: r
xp: 100
skills: 1
```

Let's now get back to the confidence interval for the population mean:

$$\bar{x} \pm z_{\alpha / 2} \cdot \frac{s}{\sqrt{n}}$$

where $z$ is a quantile point from the $N(0,1)$ distribution. What is the relationship between the normal distribution and the (sampling) distribution of the sample mean?  

A mean is a scaled sum, so according to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) (CLT), its distribution is approximately normal.  

The confidence interval above is based on a normal approximation (for the sampling distribution of the sample mean) and it is a good approximation in theory, according to the CLT. In this exercise we'll explore if that approximation seems to work in practise.

`@instructions`
- Create object `n` and take a random sample of 50 students' exam points and create object `points_sample`
- Create objects `mu_hat`, `s` and `error`
- Print out `mu_hat` and it's standard error, rounded to 2 digits
- Inside `dnorm()`, set the argument `mean` = the estimate of the population mean of the exam points, based on the random sample of size 50  
- Inside `dnorm()` set the argument `sd` = the standard error (the estimate of the standard deviation) of the mean, based on the random sample of size 50  
- Execute the line with `hist()`  and `curve()`, which visualize (1) the previously simulated sampling distribution of the means, and (2) a normal approximation to that distribution based on the random sample of size 50. (Techically it's two lines of code: `;` is code for line change)
- Does the normal approximation seem useful?

`@hint`
- `mu_hat` is the sample mean (the estimate of the population mean)
- `error` is the standard error of the sample mean

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
set.seed(888)

# Create an empty vector of length N (size of the simulation)
N <- 100
means <- numeric(N)

# Repeat N times: 
  # (1) draw a random sample of n = 50 exam points
  # (2) compute the mean of the sampled exam points and 
  # (3) store it in the means vector
for(i in 1:N) {
  points_sample <- sample(learning2014$points, size = 50, replace = F)
  means[i] <- mean(points_sample)
}

# Visualize the distribution of the means with a histogram
hist(means, freq = F)

```

`@sample_code`
```{r}
# learning214 and means are available


# Draw a single sample of n = 50 exam points
n <- 50
points_sample <- sample(learning2014$points, size = n, replace = F)

# Compute the sample mean
mu_hat <- mean(points_sample)

# Compute the sample standard deviation
s <- NA

# Compute the standard error of the mean
error <- NA

# Round and print
c(mu_hat, error)

# (1) Histogram of the previously simulated sample means and
# (2) The normal approximation of that distribution, based on a single sample
hist(means, freq = F); curve(dnorm(x, mean = NA, sd = NA), add = T)


```

`@solution`
```{r}
# learning214 and means are available


# Draw a single sample of n = 50 exam points
n <- 50
points_sample <- sample(learning2014$points, size = n, replace = F)

# Compute the sample mean
mu_hat <- mean(points_sample)

# Compute the sample standard deviation
s <- sd(points_sample)

# Compute the standard error of the mean
error <- s/ sqrt(n)

# print
c(mu_hat, error)

# (1) Histogram of the previously simulated sample means and
# (2) The normal approximation of that distribution, based on a single sample
hist(means, freq = F); curve(dnorm(x, mean = mu_hat, sd = error), add = T)


```

`@sct`
```{r}
# submission correctness tests

test_object("s")
test_object("error")

test_function("dnorm", args = c("mean","sd"))

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Very well done!! You are making really good progress!")

```