Exercises

Jun 11'23

Section Logistic Regression discussed logistic regression as an ML method that learns a linear hypothesis map by minimizing the logistic loss. The logistic loss is computationally convenient because it is smooth and convex. However, in some applications we are ultimately interested in the accuracy or, equivalently, the average [math]0/1[/math] loss.

Can we upper bound the average [math]0/1[/math] loss using the average logistic loss incurred by a given hypothesis on a given training set?
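One way to explore this question numerically is to compare both losses as functions of the margin [math]\truelabel h(\featurevec)[/math]. The sketch below is not part of the original exercise. It assumes the logistic loss is defined with the natural logarithm, that the predicted label is the sign of the hypothesis value, and that a margin of exactly zero counts as an error. The names `logistic_loss`, `zero_one_loss` and the randomly drawn margins are illustrative only.

```python
import numpy as np

def logistic_loss(margin):
    # margin = y * h(x); logaddexp(0, -m) evaluates log(1 + exp(-m)) stably
    return np.logaddexp(0.0, -margin)

def zero_one_loss(margin):
    # assumption: a data point counts as misclassified when its margin is <= 0
    return (margin <= 0).astype(float)

rng = np.random.default_rng(0)
margins = 3.0 * rng.normal(size=1000)   # hypothetical margins on a training set

avg_01 = zero_one_loss(margins).mean()
avg_log = logistic_loss(margins).mean()

# candidate bound: rescale the logistic loss so that it equals 1 at margin 0
bound = avg_log / np.log(2.0)
print(avg_01, bound)
```

The rescaling by [math]\log 2[/math] is what makes the comparison work pointwise: at margin zero the logistic loss equals [math]\log 2[/math], and it only grows as the margin becomes more negative.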

AAdmin

Consider a predictor map [math]h(\feature)[/math] that is piecewise linear and consists of [math]1000[/math] pieces. Assume we want to represent this map by an artificial neural network (ANN) with one hidden layer of neurons that use the rectified linear unit (ReLU) activation function. The output layer consists of a single neuron with a linear activation function.

What is the minimum number of neurons that the ANN must contain?

Consider an ANN with [math]\featuredim=10[/math] input neurons followed by three hidden layers consisting of [math]4[/math], [math]9[/math] and [math]3[/math] nodes, respectively. The three hidden layers are followed by the output layer consisting of a single neuron. Assume that all neurons use a linear activation function and no bias term.

What is the effective dimension [math]\effdim{\hypospace}[/math] of the hypothesis space [math]\hypospace[/math] that consists of all hypothesis maps that can be obtained from this ANN?
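A useful first step is to observe what a stack of linear layers without bias computes. The following sketch is an illustration, not a solution taken from the source. It draws random weight matrices for the layer sizes stated above, multiplies them into a single matrix, and checks that the network output equals this single matrix applied to the input. The weight values and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# layer sizes: 10 inputs, hidden layers with 4, 9 and 3 nodes, one output
sizes = [10, 4, 9, 3, 1]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x):
    # linear activation and no bias: each layer is a plain matrix product
    a = x
    for W in weights:
        a = W @ a
    return a

# collapse all layers into one matrix by multiplying them in order
W_eff = weights[0]
for W in weights[1:]:
    W_eff = W @ W_eff

x = rng.normal(size=10)
print(W_eff.shape)
print(np.allclose(forward(x), W_eff @ x))
```

The shape of the collapsed matrix indicates how many numbers suffice to describe any hypothesis map from this ANN. Whether every such matrix can actually be reached is the part of the argument left to the reader.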

Consider data points characterized by feature vectors [math]\featurevec \in \mathbb{R}^{\featuredim}[/math] and binary labels [math]\truelabel \in\{-1,1\}[/math].

We are interested in finding a good linear classifier, i.e., a classifier for which the set of feature vectors resulting in [math]h(\featurevec) = 1[/math] is a half-space.

Which of the methods discussed in this chapter aim at learning a linear classifier?

Consider an ML application involving data points with features [math]\featurevec \in \mathbb{R}^{6}[/math] and a numeric label [math]\truelabel \in \mathbb{R}[/math]. We learn a hypothesis by minimizing the average loss incurred on a training set [math]\dataset = \big\{\big(\featurevec^{(1)},\truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)}\big)\big\}[/math].

Which of the following ML methods uses a hypothesis space that depends on the dataset [math]\dataset[/math]?

Consider the ANN in Figure fig_ANN using the ReLU activation function (see Figure fig_activate_neuron).

Show that there is a particular choice for the weights [math]\weights =(\weight_{1},\ldots,\weight_{9})^{T}[/math] such that the resulting hypothesis map [math]h^{(\weights)}(\feature)[/math] is a triangle as depicted in the figure below.

Can you also find a choice for the weights [math]\weights =(\weight_{1},\ldots,\weight_{9})^{T}[/math] that produces the same triangle shape if we replace the ReLU activation function with the linear function [math]\actfun(z) =10 \cdot z[/math]?

A hypothesis map [math]h: \mathbb{R} \rightarrow \mathbb{R}[/math] with the shape of a triangle.
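The two questions can be probed with a small numerical experiment. Since the figure with the ANN is not reproduced here, the sketch below does not use its nine weights. Instead it assumes a hypothetical network with a scalar input, one hidden layer of three neurons with bias terms, and a linear output neuron. The triangle it produces (peak of height 1 at 0, support on [-1, 1]) is likewise an assumed shape, not necessarily the one in the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def one_hidden_layer(x, w, b, v, act):
    # x: (n,) inputs; w, b, v: (3,) hidden weights, biases, output weights
    # returns sum_j v[j] * act(w[j] * x + b[j]) for every input
    return act(np.outer(x, w) + b) @ v

xs = np.linspace(-2.0, 2.0, 9)

# ReLU case: three shifted ramps combine into a triangle
w = np.array([1.0, 1.0, 1.0])
b = np.array([1.0, 0.0, -1.0])
v = np.array([1.0, -2.0, 1.0])
tri_vals = one_hidden_layer(xs, w, b, v, relu)

# linear activation g(z) = 10 * z with arbitrary (random) weights
rng = np.random.default_rng(2)
lin_vals = one_hidden_layer(xs, rng.normal(size=3), rng.normal(size=3),
                            rng.normal(size=3), lambda z: 10.0 * z)

print(tri_vals)
print(np.diff(lin_vals, n=2))  # second differences on an equally spaced grid
```

The second differences of the linear-activation network vanish up to rounding for any weights, which is the numerical signature of an affine function of the input. This suggests what to expect for the second question, although the experiment is no substitute for a proof.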

Try to approximate the hypothesis map depicted in the figure below by an element of [math]\hypospace_{\rm Gauss}[/math] (see equ_def_Gauss_hypospace) using [math]\sigma=1/10[/math], [math]\featuredim=10[/math] and [math]\mu_{\featureidx} = -1 + (2\featureidx/10)[/math].
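A possible starting point for this exercise is a least-squares fit over the Gaussian basis functions. The sketch below rests on two assumptions that are not confirmed by the text above: that [math]\hypospace_{\rm Gauss}[/math] consists of maps of the form [math]\sum_{\featureidx} \weight_{\featureidx} \exp\big(-(\feature-\mu_{\featureidx})^2/(2\sigma^2)\big)[/math], and that the target map can be evaluated on a grid. Because the figure is not available here, `target` is a hypothetical stand-in (a triangle) that should be replaced by the map from the figure.

```python
import numpy as np

sigma = 0.1
n = 10
# centers mu_j = -1 + 2j/10 for j = 1, ..., 10
mu = -1.0 + 2.0 * np.arange(1, n + 1) / 10.0

def gauss_features(x):
    # one column per basis function exp(-(x - mu_j)^2 / (2 sigma^2))
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * sigma ** 2))

def target(x):
    # hypothetical stand-in for the map shown in the figure
    return np.maximum(1.0 - np.abs(x), 0.0)

xs = np.linspace(-1.0, 1.0, 201)
Phi = gauss_features(xs)

# least-squares choice of the weights w_1, ..., w_10
w, *_ = np.linalg.lstsq(Phi, target(xs), rcond=None)
approx = Phi @ w
print(np.max(np.abs(approx - target(xs))))
```

Plotting `approx` against `target(xs)` shows where the approximation struggles. With [math]\sigma=1/10[/math] the basis functions are narrow compared to the spacing of the centers, so gaps between neighbouring centers are worth inspecting.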

Consider a [math]k[/math]-NN method for a binary classification problem. We use [math]k=1[/math] and a given training set whose data points represent humans. Each human is characterized by a feature vector and a label that indicates sensitive information (e.g., some sickness).

Assume that you have access to the feature vectors of the data points in the training set but not to their labels.

Can you infer the label of a data point in the training set from the prediction that the method delivers for a feature vector that you submit?
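The privacy issue can be made concrete with a toy experiment. The sketch below is an illustration under stated assumptions: the training set is synthetic, all feature vectors are distinct, Euclidean distance is used, and `one_nn_predict` is a hypothetical helper rather than a library function. It submits each training feature vector as a query and records the prediction.

```python
import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    # return the label of the training point closest to the query
    dist = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dist)]

rng = np.random.default_rng(3)
X_train = rng.normal(size=(20, 3))          # known feature vectors
y_train = rng.choice([-1, 1], size=20)      # sensitive labels (hidden from us)

# query the classifier with each known feature vector
recovered = np.array([one_nn_predict(X_train, y_train, x) for x in X_train])
print(np.array_equal(recovered, y_train))
```

Each query has distance zero to its own training point, so for [math]k=1[/math] the prediction is that point's label. It is worth thinking about how this changes for larger [math]k[/math] or when two data points share the same feature vector.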

Consider a binary classification problem involving data points that are characterized by feature vectors [math]\featurevec \in \mathbb{R}^{\featuredim}[/math] and binary labels [math]\truelabel \in \{-1,1\}[/math]. We have access to a labeled training set [math]\dataset[/math] of size [math]\samplesize[/math].

Show that the [math]k[/math]-NN hypothesis is obtained from the Bayes estimator by estimating the conditional probability distribution [math]\prob{\featurevec|\truelabel}[/math] via the density estimator [1, Sec. 2.5.2]

[[math]] \begin{equation} \hat{p} (\featurevec | \truelabel ) \defeq (k/\samplesize) \frac{1}{{\rm vol}(R_{k})}. \end{equation} [[/math]]

Here, [math]{\rm vol}(R)[/math] denotes the volume of a ball with radius [math]R[/math] and [math]R_{k}[/math] is the distance between [math]\featurevec[/math] and the [math]k[/math]th nearest feature vector of a data point in [math]\dataset[/math].
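Before attempting the proof, it can help to check the claim numerically. The sketch below follows the construction in [1, Sec. 2.5.2] rather than the exact notation above: for each label value it uses the number of the [math]k[/math] nearest neighbours carrying that label (`k_y`) and the number of training points carrying that label (`m_y`), and it estimates the prior by `m_y / m`. These per-label counts, the synthetic data, and the helper names are assumptions introduced for the example. An odd [math]k[/math] is used to avoid ties.

```python
import numpy as np
from math import gamma, pi

def ball_volume(radius, dim):
    # volume of a Euclidean ball with the given radius in dim dimensions
    return pi ** (dim / 2) / gamma(dim / 2 + 1) * radius ** dim

def knn_bayes(X, y, x_query, k):
    m, dim = X.shape
    dist = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dist)[:k]
    vol = ball_volume(dist[nearest[-1]], dim)   # ball reaching the k-th neighbour
    scores = {}
    for label in (-1, 1):
        m_y = np.sum(y == label)
        k_y = np.sum(y[nearest] == label)
        density = (k_y / m_y) / vol if m_y > 0 else 0.0  # estimate of p(x | y)
        scores[label] = density * (m_y / m)              # times estimated prior
    return max(scores, key=scores.get)

def knn_majority(X, y, x_query, k):
    dist = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dist)[:k]
    return 1 if np.sum(y[nearest]) > 0 else -1

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=60) > 0, 1, -1)
queries = rng.normal(size=(25, 2))

agree = all(knn_bayes(X, y, q, 5) == knn_majority(X, y, q, 5) for q in queries)
print(agree)
```

In the product `density * (m_y / m)` both the volume and `m_y` cancel, leaving a score proportional to `k_y`. That cancellation is the core of the argument the exercise asks for.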

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.