Chapter 5
University of Portland
Compare two numerical variables
Visualize the distribution of one numerical variable
A measure of center of a distribution.
Sample mean \(\overline{x}\)
\[ \overline{x} = \frac{ x_1 + x_2 + x_3 + \ldots + x_n}{n} \]
Population mean \(\mu\) (Greek letter “mu”). The sample mean is used to estimate it: \[ \overline{x} \approx \mu \]
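As a minimal sketch of the sample mean formula in R: summing the observations and dividing by \(n\) gives the same result as the built-in mean(). The ages vector is made up for illustration and is not data from the course.

```r
# Hypothetical ages (made-up values, for illustration only)
ages <- c(19, 20, 20, 21, 22, 24)

n <- length(ages)              # number of observations
xbar_manual <- sum(ages) / n   # (x_1 + x_2 + ... + x_n) / n
xbar_builtin <- mean(ages)     # built-in sample mean

xbar_manual
xbar_builtin
```

Both values should agree, since mean() carries out the same calculation.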
Question: what is the average age of the people in this room?
library(ggplot2)
library(palmerpenguins)  # provides the penguins data set
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5, color = "white")
This distribution is bimodal and right skewed
ggplot(data = penguins,
       mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250, color = "white")
This distribution is unimodal and right skewed
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.5, color = "white")
This distribution is bimodal and right skewed
What do you think it is?
A measure of variation: how spread out a distribution is. It is roughly the average squared distance from the mean (the sum of squared deviations is divided by \(n - 1\) rather than \(n\)).
Sample variance: \(s^2\),
where \(s\) is the sample standard deviation \[ s = \sqrt{ \frac{ \sum_{i=1}^n (x_i - \overline{x})^2 }{n-1}} \]
Population variance: \(\sigma^2\) (Greek letter “sigma”)
\(s\) is the sample standard deviation. It represents the typical deviation from the mean \[ s = \sqrt{ \frac{ \sum_{i=1}^n (x_i - \overline{x})^2 }{n-1}} \]
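The formula can be checked step by step in R. The sketch below computes the sample variance and standard deviation by hand and with the built-in var() and sd(). The values in x are made up for illustration.

```r
# Hypothetical measurements (made-up values, for illustration only)
x <- c(2, 4, 4, 5, 7, 8)

n <- length(x)
xbar <- mean(x)

# Sum of squared deviations from the mean, divided by n - 1
s2_manual <- sum((x - xbar)^2) / (n - 1)
s_manual <- sqrt(s2_manual)

# Built-in equivalents (both use the n - 1 denominator)
s2_builtin <- var(x)
s_builtin <- sd(x)

c(variance = s2_manual, sd = s_manual)
```

Here the mean is 5 and the squared deviations sum to 24, so the sample variance is 24 / 5 = 4.8 and the standard deviation is its square root.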
Typically, about 68% of the data (observations) lie within one s.d. of the mean.
About 95% of the data lie within two s.d. of the mean.
These percentages are not hard and fast rules!
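One way to see where these percentages come from is to check them on simulated bell-shaped data. The sketch below is only an illustration: the data are generated with rnorm(), and the mean, standard deviation, sample size, and seed are arbitrary choices. For skewed or bimodal data the proportions can differ noticeably.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated, roughly bell-shaped data (not real measurements)
x <- rnorm(10000, mean = 50, sd = 10)

xbar <- mean(x)
s <- sd(x)

# Proportion of observations within one and two s.d. of the mean
within_one <- mean(abs(x - xbar) <= s)
within_two <- mean(abs(x - xbar) <= 2 * s)

c(within_one = within_one, within_two = within_two)
```

The two proportions should come out close to, but not exactly, 0.68 and 0.95.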
Using the empirical rule, about 68% of observations lie in what range of temperatures?
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  96.30   97.80   98.30   98.25   98.70  100.80
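The summary reports the mean but not the standard deviation, so the range can only be written in terms of \(s\). As a sketch, taking \(\overline{x} = 98.25\) from the summary and leaving \(s\) as the unreported sample standard deviation, about 68% of observations would lie in \[ (\overline{x} - s,\ \overline{x} + s) = (98.25 - s,\ 98.25 + s) \]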