Exercise
[math] \require{textmacros} \def \bbeta {\bf \beta} \def\fat#1{\mbox{\boldmath$#1$}} \def\reminder#1{\marginpar{\rule[0pt]{1mm}{11pt}}\textbf{#1}} \def\SSigma{\bf \Sigma} \def\ttheta{\bf \theta} \def\aalpha{\bf \alpha} \def\ddelta{\bf \delta} \def\eeta{\bf \eta} \def\llambda{\bf \lambda} \def\ggamma{\bf \gamma} \def\nnu{\bf \nu} \def\vvarepsilon{\bf \varepsilon} \def\mmu{\bf \mu} \def\nnu{\bf \nu} \def\ttau{\bf \tau} \def\SSigma{\bf \Sigma} \def\TTheta{\bf \Theta} \def\XXi{\bf \Xi} \def\PPi{\bf \Pi} \def\GGamma{\bf \Gamma} \def\DDelta{\bf \Delta} \def\ssigma{\bf \sigma} \def\UUpsilon{\bf \Upsilon} \def\PPsi{\bf \Psi} \def\PPhi{\bf \Phi} \def\LLambda{\bf \Lambda} \def\OOmega{\bf \Omega} [/math]
Consider the linear regression model [math]Y_i = \beta_1 X_{i,1} + \beta_2 X_{i,2} + \varepsilon_i[/math] for [math]i=1, \ldots, n[/math]. Suppose estimates of the regression parameters [math](\beta_1, \beta_2)[/math] of this model are obtained through the minimization of the sum-of-squares augmented with a ridge-type penalty:
[[math]] \begin{eqnarray*} \sum\nolimits_{i=1}^n (Y_i - \beta_1 X_{i,1} - \beta_2 X_{i,2})^2 + \lambda (\beta_1^2 + 2 \nu \beta_1 \beta_2 + \beta_2^2), \end{eqnarray*} [[/math]]
with penalty parameters [math]\lambda \in \mathbb{R}_{\gt 0}[/math] and [math]\nu \in (-1, 1)[/math].
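For concreteness, the penalized estimates admit a closed form. The sketch below (with hypothetical toy data) assumes the penalty is the quadratic form [math]\lambda (\beta_1^2 + 2 \nu \beta_1 \beta_2 + \beta_2^2)[/math], i.e. [math]\lambda \, \boldsymbol{\beta}^{\top} \mathbf{\Omega} \boldsymbol{\beta}[/math] with off-diagonal entries [math]\nu[/math] in [math]\mathbf{\Omega}[/math]; the function and data names are illustrative only.

```python
import numpy as np

def ridge_type_estimate(X, y, lam, nu):
    """Closed-form minimizer of ||y - X b||^2 + lam * b' Omega b,
    with Omega = [[1, nu], [nu, 1]] (assumed penalty form)."""
    Omega = np.array([[1.0, nu], [nu, 1.0]])
    return np.linalg.solve(X.T @ X + lam * Omega, X.T @ y)

# Hypothetical toy data with two covariates.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(50)

# lam controls the overall amount of shrinkage; nu tilts the
# elliptical penalty contour away from the nu = 0 sphere.
print(ridge_type_estimate(X, y, lam=1.0, nu=0.0))
print(ridge_type_estimate(X, y, lam=1.0, nu=0.9))
```

Setting `lam=0` recovers the ordinary least-squares estimate, which may help when checking the sketches requested below.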
- Recall the equivalence between constrained and penalized estimation (cf. Section Constrained estimation). Sketch, for both [math]\nu=0[/math] and [math]\nu=0.9[/math], the shape of the parameter constraint induced by the penalty above, and describe in words the qualitative difference between the two shapes.
- When [math]\nu = -1[/math] and [math]\lambda \rightarrow \infty[/math] the estimates of [math]\beta_1[/math] and [math]\beta_2[/math] (resulting from minimization of the penalized loss function above) converge towards each other:
[math]\lim_{\lambda \rightarrow \infty} \hat{\beta}_1(\lambda, -1) = \lim_{\lambda \rightarrow \infty} \hat{\beta}_2(\lambda, -1)[/math]. Motivated by this observation, a data scientist incorporates the equality constraint [math]\beta_1 = \beta = \beta_2[/math] explicitly into the model and estimates the ‘joint regression parameter’ [math]\beta[/math] through the minimization (with respect to [math]\beta[/math]) of:
[[math]] \begin{eqnarray*} \sum\nolimits_{i=1}^n (Y_i - \beta X_{i,1} - \beta X_{i,2})^2 + \delta \beta^2, \end{eqnarray*} [[/math]]with penalty parameter [math]\delta \in \mathbb{R}_{\gt 0}[/math]. The data scientist is surprised to find that the resulting estimate [math]\hat{\beta}(\delta)[/math] does not have the same limiting behavior (in the penalty parameter) as [math]\hat{\beta}_1(\lambda, -1)[/math], i.e., [math]\lim_{\delta \rightarrow \infty} \hat{\beta} (\delta) \not= \lim_{\lambda \rightarrow \infty} \hat{\beta}_1(\lambda, -1)[/math]. Explain the misconception of the data scientist.
- Assume that i) [math]n \gg 2[/math], ii) the unpenalized estimates [math](\hat{\beta}_1(0, 0), \hat{\beta}_2(0, 0))^{\top}[/math] equal [math](-2,2)^{\top}[/math], and iii) the two covariates [math]X_1[/math] and [math]X_2[/math] are zero-centered, have equal variance, and are strongly negatively correlated. Consider [math](\hat{\beta}_1(\lambda, \nu), \hat{\beta}_2(\lambda, \nu))^{\top}[/math] for both [math]\nu=-0.9[/math] and [math]\nu=0.9[/math]. For which value of [math]\nu[/math] do you expect the sum of the absolute values of the estimates to be largest? Hint: distinguish between small and large values of [math]\lambda[/math] and think geometrically!
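The differing limits mentioned in the second part can be verified numerically. The sketch below uses hypothetical data and assumes that at [math]\nu = -1[/math] the penalty reduces to [math]\lambda (\beta_1 - \beta_2)^2[/math]: this penalizes only the contrast between the coefficients, while the [math]\delta \beta^2[/math] penalty in the equality-constrained model shrinks the common level itself.

```python
import numpy as np

# Hypothetical data in which both covariates carry the same signal.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(50)

def fused_estimate(lam):
    """Minimizer of ||y - X b||^2 + lam * (b1 - b2)^2, i.e. nu = -1."""
    Omega = np.array([[1.0, -1.0], [-1.0, 1.0]])
    return np.linalg.solve(X.T @ X + lam * Omega, X.T @ y)

# Equality-constrained model: regress y on z = x1 + x2 with penalty
# delta * beta^2, giving the scalar estimate z'y / (z'z + delta).
z = X[:, 0] + X[:, 1]
def joint_estimate(delta):
    return (z @ y) / (z @ z + delta)

print(fused_estimate(1e8))   # approaches the constrained least-squares fit
print(joint_estimate(1e8))   # approaches zero
```

The key observation is that `fused_estimate(lam)` tends, as `lam` grows, to the equality-constrained least-squares solution `z @ y / (z @ z)` (generally nonzero), whereas `joint_estimate(delta)` vanishes as `delta` grows; why the data scientist expected otherwise is left as the exercise asks.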