Commit 53d1674

Merge pull request jtleek#81 from bcaffo/master
Final merge request before the name change. Mostly this is typos and small corrections
2 parents 037e26b + 3487973 commit 53d1674

35 files changed

Lines changed: 6463 additions & 5632 deletions

Lines changed: 158 additions & 158 deletions
@@ -1,158 +1,158 @@
1-
---
2-
title : Introduction to statistical inference
3-
subtitle :
4-
author : Brian Caffo, Jeff Leek, Roger Peng
5-
job : Johns Hopkins Bloomberg School of Public Health
6-
logo : bloomberg_shield.png
7-
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
8-
highlighter : highlight.js # {highlight.js, prettify, highlight}
9-
hitheme : tomorrow #
10-
url:
11-
lib: ../../libraries
12-
assets: ../../assets
13-
widgets : [mathjax] # {mathjax, quiz, bootstrap}
14-
mode : selfcontained # {standalone, draft}
15-
---
16-
## Statistical inference defined
17-
18-
Statistical inference is the process of drawing formal conclusions from
19-
data.
20-
21-
In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy
22-
statistical data where uncertainty must be accounted for.
23-
24-
---
25-
26-
## Motivating example: who's going to win the election?
27-
28-
In every major election, pollsters would like to know, ahead of the
29-
actual election, who's going to win. Here, the target of
30-
estimation (the estimand) is clear, the percentage of people in
31-
a particular group (city, state, county, country or other electoral
32-
grouping) who will vote for each candidate.
33-
34-
We cannot poll everyone. Even if we could, some of those polled
35-
may change their vote by the time the election occurs.
36-
How do we collect a reasonable subset of data and quantify the
37-
uncertainty in the process to produce a good guess at who will win?
38-
39-
---
40-
41-
## Motivating example: is hormone replacement therapy effective?
42-
43-
A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a statistically based protocol, the study was stopped early due to an excess number of negative events.**
44-
45-
Here there are two inferential problems.
46-
47-
1. Is HRT effective?
48-
2. How long should we continue the trial in the presence of contrary
49-
evidence?
50-
51-
See the WHI writing group paper, JAMA 2002, Vol 288:321-333, for the paper and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts.
52-
53-
---
54-
55-
## Motivating example: ECMO
56-
57-
In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.**
58-
59-
For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88
60-
61-
---
62-
63-
## Summary
64-
65-
- These examples illustrate many of the difficulties of trying
66-
to use data to create general conclusions about a population.
67-
- Paramount among our concerns are:
68-
- Is the sample representative of the population that we'd like to draw inferences about?
69-
- Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
70-
- Is there systematic bias created by missing data or the design or conduct of the study?
71-
- What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization
72-
or random sampling, or implicit as the aggregation of many complex unknown processes.
73-
- Are we trying to estimate an underlying mechanistic model of phenomena under study?
74-
- Statistical inference requires navigating the set of assumptions and
75-
tools and subsequently thinking about how to draw conclusions from data.
76-
77-
---
78-
## Example goals of inference
79-
80-
1. Estimate and quantify the uncertainty of an estimate of
81-
a population quantity (the proportion of people who will
82-
vote for a candidate).
83-
2. Determine whether a population quantity
84-
is a benchmark value ("is the treatment effective?").
85-
3. Infer a mechanistic relationship when quantities are measured with
86-
noise ("What is the slope for Hooke's law?")
87-
4. Determine the impact of a policy ("If we reduce pollution levels,
88-
will asthma rates decline?")
89-
90-
91-
---
92-
## Example tools of the trade
93-
94-
1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
95-
2. Random sampling: concerned with obtaining data that is representative
96-
of the population of interest
97-
3. Sampling models: concerned with creating a model for the sampling
98-
process; the most common is the so-called "iid" model.
99-
4. Hypothesis testing: concerned with decision making in the presence of uncertainty
100-
5. Confidence intervals: concerned with quantifying uncertainty in
101-
estimation
102-
6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are
103-
approximated.
104-
7. Study design: the process of designing an experiment to minimize biases and variability.
105-
8. Nonparametric bootstrapping: the process of using the data to,
106-
with minimal probability model assumptions, create inferences.
107-
9. Permutation, randomization and exchangeability testing: the process
108-
of using data permutations to perform inferences.
109-
110-
---
111-
## Different thinking about probability leads to different styles of inference
112-
113-
We won't spend too much time talking about this, but there are several different
114-
styles of inference. Two broad categories that get discussed a lot are:
115-
116-
1. Frequency probability: is the long run proportion of
117-
times an event occurs in independent, identically distributed
118-
repetitions.
119-
2. Frequency inference: uses frequency interpretations of probabilities
120-
to control error rates. Answers questions like "What should I decide
121-
given my data controlling the long run proportion of mistakes I make at
122-
a tolerable level."
123-
3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.
124-
4. Bayesian inference: the use of Bayesian probability representation
125-
of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what
126-
should I believe now?"
127-
128-
Data scientists tend to fall within shades of gray of these and various other schools of inference.
129-
130-
---
131-
## In this class
132-
133-
* In this class, we will primarily focus on basic sampling models,
134-
basic probability models and frequency style analyses
135-
to create standard inferences.
136-
* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing
137-
and bootstrapping.
138-
* As probability modeling will be our starting point, we first build
139-
up basic probability.
140-
141-
---
142-
## Where to learn more on the topics not covered
143-
144-
1. Explicit use of random sampling in inferences: look in references
145-
on "finite population statistics". Used heavily in polling and
146-
sample surveys.
147-
2. Explicit use of randomization in inferences: look in references
148-
on "causal inference" especially in clinical trials.
149-
3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
150-
4. Missing data: well covered in biostatistics and econometric
151-
references; look for references to "multiple imputation", a popular tool for
152-
addressing missing data.
153-
5. Study design: consider looking in the subject matter area that
154-
you are interested in; some examples with rich histories in design:
155-
1. The epidemiological literature is very focused on using study design to investigate public health.
156-
2. The classical development of study design in agriculture broadly covers design and design principles.
157-
3. The industrial quality control literature covers design thoroughly.
158-
1+
---
2+
title : Introduction to statistical inference
3+
subtitle : Statistical inference
4+
author : Brian Caffo, Jeff Leek, Roger Peng
5+
job : Johns Hopkins Bloomberg School of Public Health
6+
logo : bloomberg_shield.png
7+
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
8+
highlighter : highlight.js # {highlight.js, prettify, highlight}
9+
hitheme : tomorrow #
10+
url:
11+
lib: ../../librariesNew
12+
assets: ../../assets
13+
widgets : [mathjax] # {mathjax, quiz, bootstrap}
14+
mode : selfcontained # {standalone, draft}
15+
---
16+
## Statistical inference defined
17+
18+
Statistical inference is the process of drawing formal conclusions from
19+
data.
20+
21+
In our class, we will define formal statistical inference as settings where one wants to infer facts about a population using noisy
22+
statistical data where uncertainty must be accounted for.
23+
24+
---
25+
26+
## Motivating example: who's going to win the election?
27+
28+
In every major election, pollsters would like to know, ahead of the
29+
actual election, who's going to win. Here, the target of
30+
estimation (the estimand) is clear, the percentage of people in
31+
a particular group (city, state, county, country or other electoral
32+
grouping) who will vote for each candidate.
33+
34+
We cannot poll everyone. Even if we could, some of those polled
35+
may change their vote by the time the election occurs.
36+
How do we collect a reasonable subset of data and quantify the
37+
uncertainty in the process to produce a good guess at who will win?
38+
39+
---
40+
41+
## Motivating example: is hormone replacement therapy effective?
42+
43+
A large clinical trial (the Women’s Health Initiative) published results in 2002 that contradicted prior evidence on the efficacy of hormone replacement therapy for postmenopausal women and suggested a negative impact of HRT for several key health outcomes. **Based on a statistically based protocol, the study was stopped early due to an excess number of negative events.**
44+
45+
Here there are two inferential problems.
46+
47+
1. Is HRT effective?
48+
2. How long should we continue the trial in the presence of contrary
49+
evidence?
50+
51+
See the WHI writing group paper, JAMA 2002, Vol 288:321-333, for the paper and Steinkellner et al., Menopause 2012, Vol 19:616-621, for a discussion of the long-term impacts.
52+
53+
---
54+
55+
## Motivating example: ECMO
56+
57+
In 1985 a group at a major neonatal intensive care center published the results of a trial comparing a standard treatment and a promising new extracorporeal membrane oxygenation treatment (ECMO) for newborn infants with severe respiratory failure. **Ethical considerations led to a statistical randomization scheme whereby one infant received the control therapy, thereby opening the study to sample-size based criticisms.**
58+
59+
For a review and statistical discussion, see Royall Statistical Science 1991, Vol 6, No. 1, 52-88
60+
61+
---
62+
63+
## Summary
64+
65+
- These examples illustrate many of the difficulties of trying
66+
to use data to create general conclusions about a population.
67+
- Paramount among our concerns are:
68+
- Is the sample representative of the population that we'd like to draw inferences about?
69+
- Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
70+
- Is there systematic bias created by missing data or the design or conduct of the study?
71+
- What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization
72+
or random sampling, or implicit as the aggregation of many complex unknown processes.
73+
- Are we trying to estimate an underlying mechanistic model of phenomena under study?
74+
- Statistical inference requires navigating the set of assumptions and
75+
tools and subsequently thinking about how to draw conclusions from data.
76+
77+
---
78+
## Example goals of inference
79+
80+
1. Estimate and quantify the uncertainty of an estimate of
81+
a population quantity (the proportion of people who will
82+
vote for a candidate).
83+
2. Determine whether a population quantity
84+
is a benchmark value ("is the treatment effective?").
85+
3. Infer a mechanistic relationship when quantities are measured with
86+
noise ("What is the slope for Hooke's law?")
87+
4. Determine the impact of a policy ("If we reduce pollution levels,
88+
will asthma rates decline?")
89+
90+
91+
---
92+
## Example tools of the trade
93+
94+
1. Randomization: concerned with balancing unobserved variables that may confound inferences of interest
95+
2. Random sampling: concerned with obtaining data that is representative
96+
of the population of interest
97+
3. Sampling models: concerned with creating a model for the sampling
98+
process; the most common is the so-called "iid" model.
99+
4. Hypothesis testing: concerned with decision making in the presence of uncertainty
100+
5. Confidence intervals: concerned with quantifying uncertainty in
101+
estimation
102+
6. Probability models: a formal connection between the data and a population of interest. Often probability models are assumed or are
103+
approximated.
104+
7. Study design: the process of designing an experiment to minimize biases and variability.
105+
8. Nonparametric bootstrapping: the process of using the data to,
106+
with minimal probability model assumptions, create inferences.
107+
9. Permutation, randomization and exchangeability testing: the process
108+
of using data permutations to perform inferences.
109+
110+
---
111+
## Different thinking about probability leads to different styles of inference
112+
113+
We won't spend too much time talking about this, but there are several different
114+
styles of inference. Two broad categories that get discussed a lot are:
115+
116+
1. Frequency probability: is the long run proportion of
117+
times an event occurs in independent, identically distributed
118+
repetitions.
119+
2. Frequency inference: uses frequency interpretations of probabilities
120+
to control error rates. Answers questions like "What should I decide
121+
given my data controlling the long run proportion of mistakes I make at
122+
a tolerable level."
123+
3. Bayesian probability: is the probability calculus of beliefs, given that beliefs follow certain rules.
124+
4. Bayesian inference: the use of Bayesian probability representation
125+
of beliefs to perform inference. Answers questions like "Given my subjective beliefs and the objective information from the data, what
126+
should I believe now?"
127+
128+
Data scientists tend to fall within shades of gray of these and various other schools of inference.
129+
130+
---
131+
## In this class
132+
133+
* In this class, we will primarily focus on basic sampling models,
134+
basic probability models and frequency style analyses
135+
to create standard inferences.
136+
* Being data scientists, we will also consider some inferential strategies that rely heavily on the observed data, such as permutation testing
137+
and bootstrapping.
138+
* As probability modeling will be our starting point, we first build
139+
up basic probability.
140+
141+
---
142+
## Where to learn more on the topics not covered
143+
144+
1. Explicit use of random sampling in inferences: look in references
145+
on "finite population statistics". Used heavily in polling and
146+
sample surveys.
147+
2. Explicit use of randomization in inferences: look in references
148+
on "causal inference" especially in clinical trials.
149+
3. Bayesian probability and Bayesian statistics: look for basic introductory books (there are many).
150+
4. Missing data: well covered in biostatistics and econometric
151+
references; look for references to "multiple imputation", a popular tool for
152+
addressing missing data.
153+
5. Study design: consider looking in the subject matter area that
154+
you are interested in; some examples with rich histories in design:
155+
1. The epidemiological literature is very focused on using study design to investigate public health.
156+
2. The classical development of study design in agriculture broadly covers design and design principles.
157+
3. The industrial quality control literature covers design thoroughly.
158+
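
The election example and two of the listed tools (confidence intervals and nonparametric bootstrapping) can be illustrated with a small sketch. It is not part of the commit or the slides. The poll counts are made up, the function names `wald_interval` and `bootstrap_interval` are hypothetical, and the Wald form is just one common approximate interval for a proportion under an iid sampling model.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald form)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat - z * se, p_hat + z * se


def bootstrap_interval(votes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a 0/1 sample."""
    rng = random.Random(seed)
    n = len(votes)
    # Resample the observed votes with replacement and record each mean.
    means = sorted(sum(rng.choices(votes, k=n)) / n for _ in range(n_boot))
    lo = means[round((alpha / 2) * n_boot)]
    hi = means[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical poll: 1000 respondents, 520 favor candidate A (made-up numbers).
votes = [1] * 520 + [0] * 480
p_hat = sum(votes) / len(votes)
wald_lo, wald_hi = wald_interval(sum(votes), len(votes))
boot_lo, boot_hi = bootstrap_interval(votes)

print(f"estimate: {p_hat:.3f}")
print(f"Wald 95% interval: ({wald_lo:.3f}, {wald_hi:.3f})")
print(f"bootstrap 95% interval: ({boot_lo:.3f}, {boot_hi:.3f})")
```

With these made-up counts the Wald interval extends below 0.5, so a poll of this size would not by itself settle who is ahead. The bootstrap version reaches a similar answer by resampling the observed data rather than relying on the normal approximation.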
