# Introduction {#introduction}
## Aim of this guide and further readings
This is a short introductory guide that shows the basic procedures for weighting a survey. It is intended as a practical, step-by-step walkthrough for weighting a survey sample. It provides [R code](https://www.r-project.org/) for all steps of the process: from reading, wrangling and presenting data to modelling and calibration. This should allow readers to reproduce the procedures and results as well as to inspect objects at any point in the guide. The source code in 'R Markdown' format can be publicly accessed from the [author's Github page](/~https://github.com/JosepER/PPMI_how_to_weight_a_survey/01-_Basic_steps_on_how_to_weight_a_survey_sample.Rmd). The guide does not comment in detail on the general-purpose functions used in this walkthrough, so basic 'R' skills might be needed to follow all the steps. Readers with no knowledge of 'R' can still benefit from this note, as it explains the steps and principles behind weighting.
For questions, clarifications or suggestions, feel free to contact the author at *jespasareig at gmail.com*. This text intentionally avoids explaining complex or advanced methods. Instead, it aims to provide users with a standard way of weighting and a limited number of variations.
The next section gives a very broad glimpse of all [main survey weighting steps](#basic_steps). The second section deals with importing data into R, data manipulation and briefly [presenting the dataset used for this guide](#exploring and presenting the dataset). Readers who are interested in a specific step, are already familiar with the 7th round of the European Social Survey, or want to jump directly into the weighting procedures can skip these two parts of the guide. The following three sections are the main components of this guide and show how to compute [design weights](#design_weights), [non-response weights](#nonresponse_weights) and [calibration weights](#calibration). Two more sections will be added to this guide in the future, covering the analysis of weight variability and the computation of weighted estimates.
For more information you can check the following introductory texts:
* Valliant et al. (2013) *Practical Tools for Designing and Weighting Survey Samples*. New York: Springer Science+Business Media.
* Lohr, S.L. (2009) *Sampling: Design and Analysis*. 2nd Edition. Boston: Brooks/Cole.
* Blair, E. & Blair, J. (2015) *Applied Survey Sampling*. London: SAGE Publications Inc.
And the book accompanying the R 'survey' package:
* Lumley, T. (2010) *Complex Surveys: A Guide to Analysis Using R*. New Jersey: John Wiley & Sons Inc.
It might also be worth keeping an eye on the (still incipient) R package [*srvyr*](https://cran.r-project.org/web/packages/srvyr/index.html), developed and maintained by Greg Freedman Ellis. This package should make applying the techniques explained here simpler.
Note: This guide focuses on surveys based on 'probability samples'. These are surveys where all units in the statistical population have a chance of being selected and the probability of selection is known to the researcher. A brief note on how to weight non-probability samples is included at the end of the guide.
## Basic steps in weighting a survey {#basic_steps}
Weights are applied to reduce survey bias. In plain words, weighting consists of making our sample of survey respondents (more) representative of a statistical population. By statistical population we mean all the units to which we want to generalise the computed estimates.
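As a first intuition, the following minimal sketch (with hypothetical toy data) shows how weights shift an estimate when one group is over-represented among respondents:

```r
# Hypothetical toy data: group A is 50% of the population but 80% of the sample
set.seed(1)
respondents <- data.frame(
  group  = c(rep("A", 8), rep("B", 2)),
  income = c(rnorm(8, mean = 30000, sd = 1000),
             rnorm(2, mean = 50000, sd = 1000))
)

# Weights that restore the 50/50 population balance
respondents$weight <- ifelse(respondents$group == "A", 0.5 / 0.8, 0.5 / 0.2)

mean(respondents$income)                                   # unweighted (biased) estimate
weighted.mean(respondents$income, w = respondents$weight)  # weighted estimate
```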
There are four basic steps in weighting. These are:
1. __Base/design weights__
2. __Non-response weights__
3. __Use of auxiliary data/calibration__
4. __Analysis of weight variability/trimming__
The first step consists of computing weights that account for differences in the units' probabilities of being sampled.
'Being sampled' means being selected from the survey sampling frame (i.e. the list of all units from which the sample was obtained) to be approached for a survey response. A minimal sketch of this computation is shown after the examples listed below.
This step can be skipped if all units in the survey frame have the same probability of being sampled. This happens, for example:
* when all units in the survey frame are approached for the sample; or
* with certain sampling designs (such as 'simple random sampling without replacement', or 'stratified random sampling without replacement' with the allocation of sampled units across strata proportional to the number of units in each stratum). These are usually called 'self-weighted' surveys.
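As referenced above, here is a minimal sketch of how design weights relate to selection probabilities, assuming a hypothetical stratified design (the frame and sample counts below are made up):

```r
# Design weight = 1 / probability of selection
frame_counts  <- c(stratum_1 = 10000, stratum_2 = 5000)  # units in the frame (N_h)
sample_counts <- c(stratum_1 = 200,   stratum_2 = 200)   # units sampled (n_h)

selection_prob <- sample_counts / frame_counts  # n_h / N_h
design_weight  <- 1 / selection_prob

design_weight
#> stratum_1 stratum_2
#>        50        25
```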
In the second step we need to adjust for differences in the probability that sampled units reply to our survey. Our estimates would be biased if some profiles of sampled units had a higher propensity to reply than others and these profiles differed in our survey variables of interest. In this step, we need to estimate the probability of response using information available for both respondents and non-respondents. Non-response adjustment is not needed if all sampled units responded to the survey (i.e. in probability sampling surveys with 100% response rates).
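One common way of doing this (shown only as a minimal sketch, with hypothetical variable names such as `responded`, `region`, `age_group` and `design_weight`) is to model response propensities with a logistic regression on information known for all sampled units, and then divide the design weights of respondents by their predicted propensities:

```r
# Hypothetical data frame 'sampled' with one row per sampled unit, a 0/1 response
# indicator and variables known for both respondents and non-respondents
propensity_model <- glm(responded ~ region + age_group,
                        data = sampled, family = binomial())

sampled$response_prob <- predict(propensity_model, type = "response")

# Non-response adjusted weights, kept only for units that actually responded
respondents <- sampled[sampled$responded == 1, ]
respondents$nr_weight <- respondents$design_weight / respondents$response_prob
```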
The third step consists of adjusting our weights using available information about the total population. Note that this requires data different from that needed in the non-response adjustment (second step). Here we need auxiliary data that provides information (i.e. estimates such as proportions, means, sums or counts) about the statistical population. The same variables should be available for our respondents, but here we do not need information about non-respondents.
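As an illustration of this kind of adjustment (a minimal sketch only, using the 'survey' package with made-up population margins for hypothetical `sex` and `age_group` variables), raking adjusts the weights so that the weighted sample margins match the known population margins:

```r
library(survey)

# Hypothetical survey design built from the non-response adjusted weights
design <- svydesign(ids = ~1, weights = ~nr_weight, data = respondents)

# Made-up population margins (auxiliary data about the statistical population)
pop_sex <- data.frame(sex = c("Female", "Male"), Freq = c(510000, 490000))
pop_age <- data.frame(age_group = c("15-34", "35-64", "65+"),
                      Freq = c(300000, 450000, 250000))

calibrated <- rake(design,
                   sample.margins     = list(~sex, ~age_group),
                   population.margins = list(pop_sex, pop_age))

summary(weights(calibrated))  # the calibrated weights
```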
The last step is to check the variability of the computed weights. High variation in weights can lead to some observations having too much influence on our estimates. Even if weights reduce bias, they might substantially inflate the variance of the estimates. Therefore, some survey practitioners worry about dealing with highly unequal weights.
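Since the section on weight variability is not written yet, the following is only a rough sketch of two common checks: Kish's approximate design effect due to unequal weighting, and a simple (purely illustrative) cap on extreme weights:

```r
w <- weights(calibrated)  # or any vector of final weights

# Kish's approximate design effect from unequal weighting: 1 + CV(w)^2
deff_kish <- 1 + (sd(w) / mean(w))^2
deff_kish

# Illustrative trimming: cap the largest weights at three times the median
w_trimmed <- pmin(w, 3 * median(w))
```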