AE 05 – Modeling Finnish Fish

Application exercise

For this application exercise, we will work with a data set that contains measurements of fish caught in Lake Laengelmavesi in southwestern Finland.

library(tidyverse)
library(tidymodels)

fish <- read_csv("fish.csv")

The data dictionary is below:

variable       description
species        Species name of fish
weight         Weight, in grams
length_body    Length from nose to start of tail, in cm
length_notch   Length from nose to tail notch, in cm
length_tail    Length from nose to end of tail, in cm
height         Height, in cm
width          Width, in cm

Visualizing Weight and Height

The question we want to investigate is whether there is a relationship between the weights and heights of fish. We start by making a scatterplot to visualize the data.

ggplot(fish, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight (g)"
  )

Note that in the code chunk above, the line with geom_smooth draws the linear regression line for us.

Question 1

Using terminology from Chapter 7, how would you describe this association?

Add response here

Question 2

One reason to draw the linear regression line is that it helps us make predictions. Based on the graph alone, what would you predict for the weight of a fish that has a height of 8 cm? 15 cm? 20 cm?

Add response here

Model fitting

Now let's make our linear model more precise by finding the actual equation of the line. As we saw in class, we can calculate the coefficients that describe the line from the following summary statistics.

fish |>
  summarize(
    mean_x = mean(height),
    sd_x = sd(height),
    mean_y = mean(weight),
    sd_y = sd(weight),
    r = cor(height, weight)
  )
# A tibble: 1 × 5
  mean_x  sd_x mean_y  sd_y     r
   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   12.1  4.47   448.  285. 0.954

Question 3

What is the correlation between weight and height? What does this value tell you?

Add your response here

Recall from class that the equation for the linear model is \[ \hat{y} = b_0 + b_1 x \]

where \[ b_1 = r \left ( \frac{ s_y}{s_x} \right ) \] and

\[ b_0 = \bar{y} - b_1 \bar{x} \]
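
As a minimal sketch of how these two formulas translate into R, the snippet below plugs the rounded summary statistics from the table above into them. The helper name `line_from_summaries` is hypothetical (it is not part of tidyverse or tidymodels), and because the inputs are rounded, the results will only approximately match a fit on the full data.

```r
# Hypothetical helper: slope and intercept from five summary statistics.
line_from_summaries <- function(mean_x, sd_x, mean_y, sd_y, r) {
  b1 <- r * (sd_y / sd_x)      # slope: b_1 = r * (s_y / s_x)
  b0 <- mean_y - b1 * mean_x   # intercept: b_0 = y-bar - b_1 * x-bar
  c(b0 = b0, b1 = b1)
}

# Rounded values copied from the summary table above.
coefs <- line_from_summaries(
  mean_x = 12.1, sd_x = 4.47,
  mean_y = 448, sd_y = 285,
  r = 0.954
)
coefs
```

Wrapping the formulas in a function makes it easy to reuse them later with a different set of summary statistics.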

Question 4

Calculate \(b_1\) and \(b_0\) and then write down the equation of the linear model.

Add your response here

Question 5

Interpret the meaning of the slope \(b_1\) in one sentence.

Add your response here

Question 6

Use your model to make more precise predictions about the weights of fish that have heights of 8 cm and 15 cm.

Add your response here

A Shortcut!

It turns out that R can generate the linear model automatically. The following code chunk demonstrates how to do this. In case you’re curious, parsnip is the R package that provides linear_reg(); it is designed to simplify model fitting.

linear_reg() |>
  fit(weight ~ height, data = fish)
parsnip model object


Call:
stats::lm(formula = weight ~ height, data = data)

Coefficients:
(Intercept)       height  
    -288.42        60.92  

Question 7

The two numbers shown above should be the same as your \(b_0\) and \(b_1\). Are they? (If they’re close, but not exactly the same, that’s likely due to rounding errors.)

Add your response here
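
The Call line in the output above shows that parsnip hands the fit to stats::lm. As a rough check that lm() and the hand formulas agree, the sketch below fits a line in base R to a few made-up height/weight pairs, compares the two sets of coefficients, and then uses predict() for new heights. The `toy` data frame is invented purely for illustration and is not taken from fish.csv.

```r
# Made-up data for illustration only; these are NOT rows from fish.csv.
toy <- data.frame(
  height = c(5, 8, 10, 12, 15),
  weight = c(40, 200, 320, 450, 620)
)

# Coefficients from the summary-statistic formulas.
b1 <- cor(toy$height, toy$weight) * sd(toy$weight) / sd(toy$height)
b0 <- mean(toy$weight) - b1 * mean(toy$height)

# Coefficients from lm(), the function named in the Call line.
toy_fit <- lm(weight ~ height, data = toy)
coef(toy_fit)

# The two approaches agree up to floating-point error.
all.equal(unname(coef(toy_fit)), c(b0, b1))

# Predicted weights for fish with heights of 8 cm and 15 cm.
toy_pred <- predict(toy_fit, newdata = data.frame(height = c(8, 15)))
toy_pred
```

Because the fit uses unrounded values, small differences from a hand calculation based on rounded summaries are expected.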

Adding a third variable

As we know from our penguin friends, species can sometimes be a confounding variable. Does the relationship between the heights and weights of fish change if we take species into consideration?

Question 8

Make one change to the code chunk below to color the data points by species. Hint: this is exactly what you did with the penguins. What do you notice?

Add your response here

ggplot(fish, 
       aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Weights vs. heights of fish",
    x = "Height (cm)",
    y = "Weight (g)"
  )
`geom_smooth()` using formula = 'y ~ x'

Challenge!

Once you see that we really need two separate linear models for the two species, we can repeat the analysis above after first filtering by species. The two species in this data set are “Bream” and “Roach”.
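
A minimal sketch of the filter-then-summarize pattern is shown here on a small made-up tibble rather than on fish.csv. The `toy_fish` values are invented for illustration, and the example filters to the Roach so that the Bream version is left for the questions that follow.

```r
library(dplyr)

# Made-up data for illustration only; these are NOT rows from fish.csv.
toy_fish <- tibble(
  species = c("Bream", "Bream", "Bream", "Roach", "Roach", "Roach"),
  height  = c(11.5, 13.0, 15.2, 4.1, 5.6, 6.3),
  weight  = c(240, 430, 700, 40, 110, 160)
)

# Keep only the Roach rows, then compute the same five summaries as before.
roach_stats <- toy_fish |>
  filter(species == "Roach") |>
  summarize(
    mean_x = mean(height),
    sd_x = sd(height),
    mean_y = mean(weight),
    sd_y = sd(weight),
    r = cor(height, weight)
  )
roach_stats
```

Placing filter() before summarize() means every statistic is computed on the selected species only.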

Question 9

Use the filter() function to generate a table of summary statistics for only the Bream. Hint: we used filter() in last week’s AE.

Question 10

Use your new summary statistics to find a new linear model for the Bream, and use it to predict the weight of a Bream with a height of 15 cm.