Exercises

Jun 11'23

What conditions on a training set ensure that there is a unique optimal linear hypothesis map for linear regression?

Jun 11'23

Linear regression uses the squared error loss to measure the quality of a linear hypothesis map. We learn the weights [math]\weights[/math] of a linear map via ERM using a training set [math]\dataset[/math] that consists of [math]\samplesize=100[/math] data points. Each data point is characterized by [math]\featurelen=5[/math] features and a numeric label.

Is there a unique choice for the weights [math]\weights[/math] that results in a linear predictor with minimum average squared error loss on the training set [math]\dataset[/math]?
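
A numerical check can make the uniqueness question concrete. The sketch below relies on the standard fact that the average squared error has a unique minimizer exactly when the feature matrix has full column rank. It assumes the matrix is stored with one data point per row (a layout chosen here for illustration), and the function name `has_unique_minimizer` is hypothetical.

```python
import numpy as np


def has_unique_minimizer(X):
    """Return True if least squares on X (one data point per row) has a unique solution.

    Assumption: X has shape (m, n) with m data points and n features.
    The minimizer is unique exactly when rank(X) equals n.
    """
    X = np.asarray(X, dtype=float)
    return bool(np.linalg.matrix_rank(X) == X.shape[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))   # m = 100 data points, n = 5 features
    print(has_unique_minimizer(X))      # random Gaussian features: full rank in practice

    X_dup = X.copy()
    X_dup[:, 4] = X_dup[:, 0]           # two identical features: rank drops to 4
    print(has_unique_minimizer(X_dup))
```

Having many more data points than features is therefore not sufficient on its own: redundant (linearly dependent) features still make the optimal weights non-unique.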

AAdmin

Jun 11'23

Consider a training set of [math]\samplesize[/math] data points, each characterized by a single numeric feature [math]\feature[/math] and a numeric label [math]\truelabel[/math]. We learn a hypothesis map of the form [math]h(\feature) = \feature + b[/math] with some bias [math]b \in \mathbb{R}[/math].

Can you write down a formula for the optimal [math]b[/math] that minimizes the average squared error on the training data [math]\big(\feature^{(1)},\truelabel^{(1)} \big),\ldots,\big(\feature^{(\samplesize)},\truelabel^{(\samplesize)}\big)[/math]?
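
One way to check a candidate formula is to compare it with the objective numerically. The sketch below assumes the candidate obtained by setting the derivative of the average squared error with respect to [math]b[/math] to zero, which gives the average of the residuals [math]\truelabel^{(\sampleidx)} - \feature^{(\sampleidx)}[/math]. The function names and the toy data are hypothetical.

```python
from statistics import fmean


def average_squared_error(b, features, labels):
    """Average squared error of h(x) = x + b on the given data points."""
    return fmean([(y - (x + b)) ** 2 for x, y in zip(features, labels)])


def optimal_bias_squared_error(features, labels):
    """Candidate minimizer: the mean of the residuals y - x."""
    return fmean([y - x for x, y in zip(features, labels)])


if __name__ == "__main__":
    features = [0.0, 1.0, 2.0]
    labels = [1.0, 3.0, 2.0]
    b_opt = optimal_bias_squared_error(features, labels)
    print(b_opt)
    # Moving b away from the candidate in either direction should not reduce the error.
    for delta in (-0.5, 0.5):
        print(average_squared_error(b_opt + delta, features, labels)
              >= average_squared_error(b_opt, features, labels))
```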

Jun 11'23

Consider polynomial regression for data points with a single numeric feature [math]\feature \in \mathbb{R}[/math] and numeric label [math]\truelabel[/math]. Here, polynomial regression is equivalent to linear regression using the transformed feature vectors [math]\featurevec = \big(\feature^{0},\feature^{1},\ldots,\feature^{\featuredim-1}\big)^{T}[/math].

Given a dataset [math]\dataset= \big(\feature^{(1)},\truelabel^{(1)}\big),\ldots,\big(\feature^{(\samplesize)},\truelabel^{(\samplesize)}\big)[/math], we construct the feature matrix [math]\featuremtx =\big(\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)}\big) \in \mathbb{R}^{\samplesize \times \samplesize}[/math] with its [math]\sampleidx[/math]th column given by the feature vector [math]\featurevec^{(\sampleidx)}[/math]. Note that this matrix is square, which requires the polynomial feature vectors to have length [math]\featuredim = \samplesize[/math].

Verify that this feature matrix is a Vandermonde matrix [1].

How is the determinant of the feature matrix related to the features and labels of data points in the dataset [math]\dataset[/math]?
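
The determinant question can be explored numerically. The sketch below builds the square feature matrix from the scalar features only and compares its determinant with the classical Vandermonde formula, the product of all differences between pairs of features. It uses exact rational arithmetic to avoid rounding issues; all function names are hypothetical. Note that the labels never enter the construction.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def feature_matrix(features):
    """Square matrix whose i-th column is (x_i^0, x_i^1, ..., x_i^(m-1))."""
    m = len(features)
    return [[Fraction(x) ** power for x in features] for power in range(m)]


def determinant(matrix):
    """Exact determinant via Gaussian elimination with row swaps."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # no pivot available: singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def vandermonde_product(features):
    """Product of (x_j - x_i) over all pairs with i < j."""
    diffs = (Fraction(xj) - Fraction(xi) for xi, xj in combinations(features, 2))
    return prod(diffs, start=Fraction(1))


if __name__ == "__main__":
    features = [1, 2, 3]
    print(determinant(feature_matrix(features)), vandermonde_product(features))
```

In particular, the determinant vanishes as soon as two data points share the same feature value.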

Jun 12'23

Consider a training set that consists of data points [math]\big(\feature^{(\sampleidx)},\truelabel^{(\sampleidx)} \big)[/math], for [math]\sampleidx = 1,\ldots,\samplesize=100[/math], that are obtained as realizations of iid RVs. The common probability distribution of these RVs is defined by a random data point [math](\feature,\truelabel)[/math]. The feature [math]\feature[/math] of this random data point is a standard Gaussian RV with zero mean and unit variance. The label of a data point is modelled as [math]\truelabel = \feature + e[/math] with Gaussian noise [math]e \sim \mathcal{N}(0,1)[/math]. The feature [math]\feature[/math] and noise [math]e[/math] are statistically independent.

We evaluate the specific hypothesis [math]h(\feature)=0[/math] (which outputs [math]0[/math] no matter what the feature value [math]\feature[/math] is) by the training error [math]\trainerror = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( \truelabel^{(\sampleidx)} - h \big( \feature^{(\sampleidx)} \big) \big)^2[/math]. Note that [math]\trainerror[/math] is the average squared error loss incurred by the hypothesis [math]h[/math] on the data points [math]\big(\feature^{(\sampleidx)},\truelabel^{(\sampleidx)} \big)[/math], for [math]\sampleidx = 1,\ldots,\samplesize=100[/math].

What is the probability that the training error [math]\trainerror[/math] is at least [math]20\,\%[/math] larger than the expected (squared error) loss [math]\expect \big\{ \big( \truelabel - h(\feature) \big)^{2} \big \}[/math]?

What are the mean (expected value) and the variance of the training error [math]\trainerror[/math]?
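
Analytic answers to these questions can be sanity-checked with a small Monte Carlo simulation. The sketch below draws many independent training sets from the probabilistic model above, computes the training error of [math]h(\feature)=0[/math] on each, and estimates its mean, its variance, and the fraction of training sets whose error exceeds a multiple of the expected loss. Two assumptions are made here: the threshold is read as "20 % larger than the expected loss" (ratio 1.2), and the expected loss is taken to be 2, since [math]\truelabel = \feature + e[/math] is the sum of two independent unit-variance Gaussians. Function names, the number of trials and the seed are hypothetical choices.

```python
import random
from statistics import fmean, variance


def simulate_training_errors(num_trials=5000, m=100, seed=0):
    """Training errors of h(x) = 0 on num_trials independent training sets of size m."""
    rng = random.Random(seed)
    errors = []
    for _ in range(num_trials):
        total = 0.0
        for _ in range(m):
            x = rng.gauss(0.0, 1.0)     # feature: standard Gaussian
            e = rng.gauss(0.0, 1.0)     # noise: standard Gaussian, independent of x
            y = x + e                   # label model from the exercise
            total += (y - 0.0) ** 2     # squared error of the all-zero hypothesis
        errors.append(total / m)
    return errors


def exceedance_fraction(errors, expected_loss, ratio=1.2):
    """Fraction of simulated training errors that are at least ratio * expected_loss."""
    threshold = ratio * expected_loss
    return sum(err >= threshold for err in errors) / len(errors)


if __name__ == "__main__":
    errors = simulate_training_errors()
    print("estimated mean:", fmean(errors))
    print("estimated variance:", variance(errors))
    print("estimated exceedance probability:",
          exceedance_fraction(errors, expected_loss=2.0))
```

The estimates fluctuate with the seed and the number of trials, so they should only be used to confirm the order of magnitude of an analytic result.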

Jun 12'23

Let us consider a fictional (ideal) optimization method that can be represented as a filter [math]\mathcal{F}[/math]. This filter [math]\mathcal{F}[/math] reads in a real-valued objective function [math]f(\cdot)[/math], defined for all parameter vectors [math]\vw\in\mathbb{R}^{\featuredim}[/math]. The output of the filter [math]\mathcal{F}[/math] is another real-valued function [math]\hat{f}(\vw)[/math] that is defined point-wise as

[[math]] \begin{equation} \hat{f}(\vw) = \begin{cases} 1 & \mbox{if } \vw \mbox{ is a local minimum of } f(\cdot), \\ 0 & \mbox{otherwise.} \end{cases} \end{equation} [[/math]]

Verify that the filter [math]\mathcal{F}[/math] is shift or translation invariant, i.e., [math]\mathcal{F}[/math] commutes with a translation [math]f'(\weights) \defeq f(\weights + \weights^{(o)})[/math] with an arbitrary but fixed (reference) vector [math]\weights^{(o)} \in \mathbb{R}^{\featuredim}[/math].

Jun 12'23

Consider a linear regression method that uses ERM to learn the weights [math]\widehat{\weights}[/math] of a linear hypothesis map [math]h(\featurevec) =\weights^{T} \featurevec[/math]. The weights are learnt by minimizing the average squared error loss incurred by [math]h[/math] on a training set that consists of the data points [math]\big( \featurevec^{(\sampleidx)}, \truelabel^{(\sampleidx)} \big)[/math] for [math]\sampleidx=1,\ldots, 100[/math]. Sometimes it is useful to assign sample weights [math]\sampleweight{\sampleidx}[/math] to the data points when learning [math]\widehat{\weights}[/math]. These sample weights reflect varying levels of importance or relevance of different data points.

For simplicity, we use the sample weights [math]\sampleweight{\sampleidx} = 2 \alpha[/math] for [math]\sampleidx=1,\ldots,50[/math] and [math]\sampleweight{\sampleidx} = 2(1 - \alpha)[/math] for [math]\sampleidx=51,\ldots,100[/math], with some [math]\alpha \in [0,1][/math].

Can you find a closed-form expression (similar to the closed-form solution of plain linear regression) for the weights [math]\widehat{\weights}^{(\alpha)}[/math] that minimize the weighted average squared error

[[math]]f(\weights) \defeq (1/50)\sum_{\sampleidx=1}^{50} \alpha \big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)} \big)^{2} + (1/50)\sum_{\sampleidx=51}^{100} (1-\alpha) \big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)} \big)^{2}[[/math]]

for different [math]\alpha[/math]?
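
A candidate closed-form expression can be checked numerically. The sketch below assumes the candidate obtained by setting the gradient of [math]f(\weights)[/math] to zero, which leads to a weighted version of the normal equations. It further assumes that the features are stored with one data point per row, that the first half of the rows carries weight [math]\alpha[/math] and the second half weight [math]1-\alpha[/math], and that the weighted matrix is invertible. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def sample_weights(m, alpha):
    """Weight alpha for the first half of the data points, 1 - alpha for the second half."""
    half = m // 2
    return np.concatenate([np.full(half, alpha), np.full(m - half, 1.0 - alpha)])


def weighted_objective(w, X, y, alpha):
    """Weighted average squared error f(w) from the exercise (m = 100 gives the factor 1/50)."""
    m = X.shape[0]
    s = sample_weights(m, alpha)
    residuals = y - X @ w
    return (2.0 / m) * np.sum(s * residuals ** 2)


def weighted_least_squares(X, y, alpha):
    """Solve the weighted normal equations (X^T S X) w = X^T S y, assuming invertibility."""
    s = sample_weights(X.shape[0], alpha)
    A = X.T @ (s[:, None] * X)
    b = X.T @ (s * y)
    return np.linalg.solve(A, b)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 5))
    y = X @ np.arange(1.0, 6.0) + 0.1 * rng.standard_normal(100)
    for alpha in (0.1, 0.5, 0.9):
        w_hat = weighted_least_squares(X, y, alpha)
        print(alpha, w_hat, weighted_objective(w_hat, X, y, alpha))
```

For [math]\alpha = 1/2[/math] all data points are weighted equally, so the result should coincide with ordinary least squares on the full training set.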

Jun 12'23

Consider data points characterized by a single numeric feature [math]\feature[/math] and a numeric label [math]\truelabel[/math]. We learn a hypothesis map of the form [math]h(\feature) = \feature + b[/math] with some bias [math]b \in \mathbb{R}[/math].

Can you write down a formula for the optimal [math]b[/math] that minimizes the average absolute error on the training data [math]\big(\feature^{(1)},\truelabel^{(1)} \big),\ldots,\big(\feature^{(\samplesize)},\truelabel^{(\samplesize)}\big)[/math]?
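
As with the squared error, a candidate answer can be checked numerically. The sketch below assumes the candidate that a median of the residuals [math]\truelabel^{(\sampleidx)} - \feature^{(\sampleidx)}[/math] minimizes the average absolute error, which follows from the well-known property that a median minimizes the sum of absolute deviations. The function names and the toy data are hypothetical.

```python
from statistics import fmean, median


def average_absolute_error(b, features, labels):
    """Average absolute error of h(x) = x + b on the given data points."""
    return fmean([abs(y - (x + b)) for x, y in zip(features, labels)])


def optimal_bias_absolute_error(features, labels):
    """Candidate minimizer: a median of the residuals y - x."""
    return median([y - x for x, y in zip(features, labels)])


if __name__ == "__main__":
    features = [0.0, 1.0, 2.0, 3.0]
    labels = [1.0, 3.0, 2.0, 13.0]      # the last data point acts as an outlier
    b_med = optimal_bias_absolute_error(features, labels)
    print(b_med, average_absolute_error(b_med, features, labels))
    # Compare against a grid of alternative biases.
    grid = [k / 10 for k in range(-50, 151)]
    print(min(average_absolute_error(b, features, labels) for b in grid))
```

For an even number of data points the minimizer need not be unique: every value between the two middle residuals attains the same average absolute error.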

[1] W. Gautschi and G. Inglese. Lower bounds for the condition number of Vandermonde matrices. Numer. Math. 52:241-250, 1988.