May 26'23

Exercise

You are given a dataset with two variables, which is graphed below. You want to predict y using x.

Determine which statement regarding using a generalized linear model (GLM) or a random forest is true.

  • (A) A random forest is appropriate because the dataset contains only quantitative variables.
  • (B) A random forest is appropriate because the data does not follow a straight line.
  • (C) A GLM is not appropriate because the variance of y given x is not constant.
  • (D) A random forest is appropriate because there is a clear relationship between y and x.
  • (E) A GLM is appropriate because it can accommodate polynomial relationships.

Copyright 2023. The Society of Actuaries, Schaumburg, Illinois. Reproduced with permission.

Key: E

(A) is false. Trees work better with qualitative data.

(B) is false. Trees do accommodate nonlinear relationships, but nonlinearity alone does not call for one: as shown in (E), a linear model can work very well here.

(C) is false. The variance is constant, so that is not an issue here.

(D) is false. There is a clear relationship, but as noted in (E) it is one that a GLM can capture, so it is not a reason to prefer a random forest.

(E) is true. The points appear to lie on a quadratic curve so a model such as

[[math]] y = \beta_0 + \beta_1 x + \beta_2 x^2 [[/math]]

can work well here. Recall that linear models must be linear in the coefficients, not necessarily linear in the variables.
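As a minimal sketch of this point, the quadratic model can be fit by ordinary least squares (equivalent to a Gaussian GLM with identity link) once x² is added as a column of the design matrix. The data below are synthetic and purely illustrative, since the exercise's dataset is not reproduced here; the coefficient values, noise level, and variable names are all assumptions.

```python
import numpy as np

# Synthetic data (assumed for illustration; not the exercise's dataset).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
beta_true = np.array([1.0, 2.0, 0.5])  # hypothetical beta_0, beta_1, beta_2
y = beta_true[0] + beta_true[1] * x + beta_true[2] * x**2
y = y + rng.normal(0.0, 0.1, size=x.size)  # constant-variance noise

# Design matrix with columns 1, x, x^2: the model is nonlinear in x
# but linear in the coefficients, so least squares applies directly.
X_quad = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

# For comparison, a straight-line fit that omits the x^2 column.
X_lin = X_quad[:, :2]
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# Residual sums of squares for both fits.
sse_quad = float(np.sum((y - X_quad @ beta_hat) ** 2))
sse_lin = float(np.sum((y - X_lin @ beta_lin) ** 2))

print("quadratic fit coefficients:", beta_hat)
print("SSE quadratic:", sse_quad, "SSE straight line:", sse_lin)
```

On data of this shape, the quadratic fit should recover coefficients close to the assumed values and leave a much smaller residual sum of squares than the straight-line fit.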
