May 26'23
Exercise
You are given a dataset with two variables, which is graphed below. You want to predict y using x.
Determine which statement regarding using a generalized linear model (GLM) or a random forest is true.
- (A) A random forest is appropriate because the dataset contains only quantitative variables.
- (B) A random forest is appropriate because the data does not follow a straight line.
- (C) A GLM is not appropriate because the variance of y given x is not constant.
- (D) A random forest is appropriate because there is a clear relationship between y and x.
- (E) A GLM is appropriate because it can accommodate polynomial relationships.
Key: E
(A) is false. Having only quantitative variables is not a reason to prefer a random forest; trees work better with qualitative data.
(B) is false. Trees do accommodate nonlinear relationships, but as noted in (E), a linear model can work very well here.
(C) is false. The variance is constant, so that is not an issue here.
(D) is false. There is a clear relationship, but as noted in (E), a linear model can capture it, so this is not by itself a reason to use a random forest.
(E) is true. The points appear to lie on a quadratic curve so a model such as
[[math]]
y = \beta_0 + \beta_1 x + \beta_2 x^2
[[/math]]
can work well here. Recall that linear models must be linear in the coefficients, not necessarily linear in the variables.
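As a rough illustration of this point, the sketch below fits the quadratic model by ordinary least squares, which gives the same coefficient estimates as a GLM with a normal response and identity link. The exercise's dataset is not reproduced here, so the data are synthetic and the values in `true_beta` are hypothetical, chosen only for the demonstration.

```python
import numpy as np

# Synthetic data: an assumption for illustration, not the exercise's dataset.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
true_beta = np.array([1.0, -2.0, 0.5])  # hypothetical beta_0, beta_1, beta_2
noise = rng.normal(0.0, 0.1, x.size)    # constant variance, as in (C)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2 + noise

# Design matrix with columns 1, x, x^2.
# The model is nonlinear in x but linear in the coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares estimate of (beta_0, beta_1, beta_2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# For comparison, a straight-line fit that drops the x^2 column.
X_line = X[:, :2]
beta_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)

# Residual sums of squares for both fits.
rss_quad = float(np.sum((y - X @ beta_hat) ** 2))
rss_line = float(np.sum((y - X_line @ beta_line) ** 2))

print("quadratic fit coefficients:", beta_hat)
print("RSS quadratic:", rss_quad, "RSS straight line:", rss_line)
```

The only change from a straight-line regression is the extra column for x^2 in the design matrix; the fitting procedure itself is unchanged, which is what "linear in the coefficients" means in practice.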