Exercises

Jun 11'23

Consider data points that are characterized by a single numeric feature [math]\feature\!\in\!\mathbb{R}[/math] and a numeric label [math]\truelabel\!\in\mathbb{R}[/math]. We use an ML method to learn a hypothesis map [math]h: \mathbb{R} \rightarrow \mathbb{R}[/math] based on a training set consisting of three data points

[[math]](\feature^{(1)}=1,\truelabel^{(1)} = 3), (\feature^{(2)}=4,\truelabel^{(2)}=-1), (\feature^{(3)}=1,\truelabel^{(3)}=5).[[/math]]

Is there any chance for the ML method to learn a hypothesis map that perfectly fits the data points, such that [math]h\big( \feature^{(\sampleidx)} \big) = \truelabel^{(\sampleidx)}[/math] for [math]\sampleidx=1,\ldots,3[/math]?

Hint: Try to visualize the data points in a scatterplot and various hypothesis maps (see Figure fig_three_maps_example).

AAdmin


Consider a dataset of daily air temperatures [math]\feature^{(1)},\ldots,\feature^{(\samplesize)}[/math] measured at the Finnish Meteorological Institute (FMI) observation station “Utsjoki Nuorgam” between 01.12.2019 and 29.02.2020. Thus, [math]\feature^{(1)}[/math] is the daily temperature measured on 01.12.2019, [math]\feature^{(2)}[/math] is the daily temperature measured on 02.12.2019, and [math]\feature^{(\samplesize)}[/math] is the daily temperature measured on 29.02.2020. This dataset is available for download. ML methods often determine a few parameters to characterize large collections of data points.

Compute, for the above temperature measurement dataset, the following quantities:

the minimum [math]A \defeq \min_{\sampleidx=1,\ldots,\samplesize} \feature^{(\sampleidx)}[/math]
the maximum [math]B \defeq \max_{\sampleidx=1,\ldots,\samplesize} \feature^{(\sampleidx)}[/math]
the average [math]C \defeq (1/\samplesize) \sum_{\sampleidx=1,\ldots,\samplesize} \feature^{(\sampleidx)}[/math]
the standard deviation [math]D \defeq \sqrt{(1/\samplesize)\sum_{\sampleidx=1,\ldots,\samplesize} \big( \feature^{(\sampleidx)}-C \big)^2}[/math]
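The four quantities above can be computed in a few lines. The following is a minimal sketch in plain Python; the function name `summarize` and the example values are assumptions made for illustration and are not actual FMI measurements. Note that the standard deviation divides by [math]\samplesize[/math], as in the definition of [math]D[/math], and not by [math]\samplesize-1[/math].

```python
import math


def summarize(temperatures):
    """Return (A, B, C, D): minimum, maximum, average, standard deviation."""
    m = len(temperatures)
    a = min(temperatures)                       # A: minimum
    b = max(temperatures)                       # B: maximum
    c = sum(temperatures) / m                   # C: average
    # D: standard deviation with normalization 1/m (not 1/(m-1))
    d = math.sqrt(sum((x - c) ** 2 for x in temperatures) / m)
    return a, b, c, d


# Hypothetical placeholder values, NOT real measurements from the station.
example = [-10.0, -12.0, -8.0, -14.0]
print(summarize(example))
```

To apply this to the actual dataset, replace `example` with the list of downloaded daily temperatures.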


Consider the tiny desktop computer “RaspberryPI” equipped with a total of [math]8[/math] Gigabytes of memory ^[1]. We want to implement an ML algorithm that learns a hypothesis map that is represented by a deep artificial neural network (ANN) involving [math]\featurelen=10^6[/math] numeric parameters. Each parameter is quantized using [math]8[/math] bits ([math]=1[/math] Byte).

How many different hypotheses can we store at most on a RaspberryPI computer? (You can assume that [math]1 {\rm Gigabyte} = 10^{9} {\rm Bytes}[/math].)
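One way to read this question is to ask how many quantized parameter vectors fit into memory at the same time. The sketch below follows that reading and assumes, for simplicity, that the entire memory is available for storing parameters (no operating system or program overhead). The constant names are chosen for illustration only.

```python
# Assumption: all 8 Gigabytes are available for storing ANN parameters.
MEMORY_BYTES = 8 * 10**9      # 8 Gigabytes, with 1 Gigabyte = 10^9 Bytes
NUM_PARAMS = 10**6            # parameters per hypothesis (ANN)
BYTES_PER_PARAM = 1           # 8 bits = 1 Byte per parameter

# Memory needed to store a single hypothesis (one parameter vector).
bytes_per_hypothesis = NUM_PARAMS * BYTES_PER_PARAM

# Number of hypotheses that fit into memory simultaneously.
max_hypotheses = MEMORY_BYTES // bytes_per_hypothesis
print(max_hypotheses)
```

Under a different reading, one could instead count how many distinct parameter vectors a single quantized ANN can represent, which is [math]2^{8 \cdot 10^{6}}[/math] since each of the [math]10^{6}[/math] parameters takes one of [math]2^{8}[/math] values.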

O. Dürr, Y. Pauchard, D. Browarnik, R. Axthelm, and M. Loeser. Deep learning on a raspberry pi for real time face recognition. 01 2015


For some applications it can be a good idea to not learn a single hypothesis but to learn a whole ensemble of hypothesis maps [math]h^{(1)},\ldots,h^{(\augparam)}[/math]. These hypotheses might even belong to different hypothesis spaces, [math]h^{(1)} \in \hypospace^{(1)},\ldots,h^{(\augparam)} \in \hypospace^{(\augparam)}[/math].

These hypothesis spaces can be arbitrary except that they are defined for the same feature space and label space. Given such an ensemble we can construct a new (“meta”) hypothesis [math]\tilde{h}[/math] by combining (or aggregating) the individual predictions obtained from each hypothesis,

[[math]] \begin{equation} \label{equ_def_ensemble} \tilde{h}(\featurevec) \defeq a\big( h^{(1)}(\featurevec), \ldots,h^{(\augparam)}(\featurevec) \big). \end{equation} [[/math]]

Here, [math]a(\cdot)[/math] denotes some given (fixed) combination or aggregation function. One example for such an aggregation function is the average [math]a\big( h^{(1)}(\featurevec), \ldots,h^{(\augparam)}(\featurevec) \big) \defeq (1/\augparam) \sum_{\augidx=1}^{\augparam} h^{(\augidx)}(\featurevec)[/math]. We obtain a new “meta” hypothesis space [math]\widetilde{\hypospace}[/math], that consists of all hypotheses of the form \eqref{equ_def_ensemble} with [math]h^{(1)} \in \hypospace^{(1)},\ldots,h^{(\augparam)} \in \hypospace^{(\augparam)}[/math].
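The construction \eqref{equ_def_ensemble} with the average as aggregation function can be sketched as follows. The helper name `make_average_ensemble` and the three example hypotheses are assumptions made for illustration; any collection of maps defined on the same feature space and label space could be used instead.

```python
def make_average_ensemble(hypotheses):
    """Return the meta hypothesis that averages the individual predictions."""
    def meta_hypothesis(x):
        # a(h1(x), ..., hB(x)) = (1/B) * sum_b hb(x)
        return sum(h(x) for h in hypotheses) / len(hypotheses)
    return meta_hypothesis


# Three hypothetical hypotheses on a scalar feature (illustration only).
def h1(x):
    return 2.0 * x        # linear map


def h2(x):
    return x + 1.0        # affine map


def h3(x):
    return 3.0            # constant map


h_tilde = make_average_ensemble([h1, h2, h3])
print(h_tilde(1.0))
```

Replacing the average by another fixed function, such as the median or a selection of one argument, yields a different meta hypothesis space.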

Which conditions on the aggregation function [math]a(\cdot)[/math] and the individual hypothesis spaces [math]\hypospace^{(1)},\ldots,\hypospace^{(\augparam)}[/math] ensure that [math]\widetilde{\hypospace}[/math] contains each individual hypothesis space, i.e., [math]\hypospace^{(1)},\ldots,\hypospace^{(\augparam)} \subseteq \widetilde{\hypospace}[/math]?


Consider the ML problem underlying a music information retrieval smartphone app ^[1]. Such an app aims at identifying a song title based on a short audio recording of a song interpretation. Here, the feature vector [math]\featurevec[/math] represents the sampled audio signal and the label [math]\truelabel[/math] is a particular song title out of a huge music database.

What is the length [math]\featuredim[/math] of the feature vector [math]\featurevec \in \mathbb{R}^{\featuredim}[/math] if its entries are the signal amplitudes of a [math]20[/math]-second long recording which is sampled at a rate of [math]44[/math] kHz?
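The length of the feature vector is the number of samples in the recording, i.e., the duration multiplied by the sampling rate. The short sketch below assumes a single-channel (mono) recording and reads [math]44[/math] kHz as [math]44\,000[/math] samples per second; the constant names are chosen for illustration.

```python
# Assumptions: mono recording, one amplitude value per sample.
DURATION_SECONDS = 20
SAMPLING_RATE_HZ = 44_000     # 44 kHz = 44,000 samples per second

# One feature vector entry per sample.
feature_length = DURATION_SECONDS * SAMPLING_RATE_HZ
print(feature_length)
```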

A. Wang. An industrial-strength audio search algorithm. In International Symposium on Music Information Retrieval Baltimore, MD, 2003


Consider data points that are characterized by a feature vector [math]\featurevec \in \mathbb{R}^{10}[/math] and a vector-valued label [math]\labelvec \in \mathbb{R}^{30}[/math]. Such vector-valued labels arise in multi-label classification problems. We want to predict the label vector using a linear predictor map

[[math]] \begin{equation} \label{equ_lin_predictor_multilabel} \vh(\featurevec) = \mathbf{W} \featurevec \mbox{ with some matrix } \mathbf{W} \in \mathbb{R}^{30 \times 10}. \end{equation} [[/math]]

How many different linear predictors \eqref{equ_lin_predictor_multilabel} are there: [math]10[/math], [math]30[/math], [math]40[/math], or infinitely many?


Consider the hypothesis space constituted by all linear maps [math]h(\featurevec) = \weights^{T} \featurevec[/math] with some weight vector [math]\weights \in \mathbb{R}^{\featuredim}[/math]. We try to find the best linear map by minimizing the average squared error loss (the empirical risk) incurred on labeled data points (training set) [math](\featurevec^{(1)},\truelabel^{(1)}),(\featurevec^{(2)},\truelabel^{(2)}),\ldots,(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)})[/math].

Is it possible to represent the resulting empirical risk as a convex quadratic function [math]f(\weights) = \weights^{T} \mathbf{C} \weights + \vb^{T} \weights + c[/math]?

If this is possible, how are the matrix [math]\mathbf{C}[/math], vector [math]\vb[/math] and constant [math]c[/math] related to the features and labels of data points in the training set?
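A candidate answer can be checked numerically. Expanding each squared error as [math]\big(\truelabel - \weights^{T}\featurevec\big)^{2} = \weights^{T}\featurevec\featurevec^{T}\weights - 2\truelabel\featurevec^{T}\weights + \truelabel^{2}[/math] and averaging over the training set suggests [math]\mathbf{C} = (1/\samplesize)\sum_{\sampleidx} \featurevec^{(\sampleidx)}\big(\featurevec^{(\sampleidx)}\big)^{T}[/math], [math]\vb = -(2/\samplesize)\sum_{\sampleidx} \truelabel^{(\sampleidx)}\featurevec^{(\sampleidx)}[/math] and [math]c = (1/\samplesize)\sum_{\sampleidx} \big(\truelabel^{(\sampleidx)}\big)^{2}[/math]. The sketch below compares both expressions in plain Python; the function names and the toy data are assumptions made for illustration.

```python
def empirical_risk(w, X, y):
    """Average squared error of the linear map h(x) = w^T x."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, xi))
        total += (yi - prediction) ** 2
    return total / m


def quadratic_coefficients(X, y):
    """Return (C, b, c) such that the risk equals w^T C w + b^T w + c."""
    m, d = len(X), len(X[0])
    C = [[sum(xi[j] * xi[k] for xi in X) / m for k in range(d)]
         for j in range(d)]
    b = [-2.0 * sum(yi * xi[j] for xi, yi in zip(X, y)) / m
         for j in range(d)]
    c = sum(yi ** 2 for yi in y) / m
    return C, b, c


def quadratic(w, C, b, c):
    """Evaluate w^T C w + b^T w + c."""
    d = len(w)
    wCw = sum(w[j] * C[j][k] * w[k] for j in range(d) for k in range(d))
    bw = sum(bj * wj for bj, wj in zip(b, w))
    return wCw + bw + c


# Hypothetical toy training set with two features (illustration only).
X = [[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]]
y = [1.0, 0.0, 2.0]
w = [0.5, -1.5]
C, b, c = quadratic_coefficients(X, y)
print(empirical_risk(w, X, y), quadratic(w, C, b, c))
```

Both printed values should agree up to floating-point rounding. Since [math]\mathbf{C}[/math] is an average of matrices of the form [math]\featurevec\featurevec^{T}[/math], it is positive semi-definite, which is what makes the quadratic function convex.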


Consider the linear hypothesis space consisting of linear maps [math]h^{(\weights)}(\featurevec) = \weights^{T} \featurevec[/math] that are parametrized by a weight vector [math]\weights[/math]. We learn an optimal weight vector by minimizing the average squared error loss [math]f(\weights) = \emperror \big( h^{(\weights)} | \dataset\big)[/math] incurred by [math]h^{(\weights)}(\featurevec)[/math] on the training set [math]\dataset = \big(\featurevec^{(1)},\truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)}\big)[/math].

Is it possible to reconstruct the dataset [math]\dataset[/math] just from knowing the function [math]f(\weights)[/math]?

Is the resulting labeled training data unique or are there different training sets that could have resulted in the same empirical risk function?

Hint: Write down the training error [math]f(\weights)[/math] in the form [math]f(\weights) = \weights^{T} \mathbf{Q} \weights + c + \vb^{T} \weights[/math] with some matrix [math]\mathbf{Q}[/math], vector [math]\vb[/math] and scalar [math]c[/math] that might depend on the features and labels of the training data points.


Show that any hypothesis map of the form [math]h(\feature) = \weight_{1} \feature +\weight_{0}[/math] can be obtained as the composition of a feature map [math]\featuremap: \feature \mapsto \rawfeaturevec[/math] with the linear map [math]\tilde{h}(\rawfeaturevec) \defeq \widetilde{\weights}^{T} \rawfeaturevec[/math] using the parameter vector [math]\widetilde{\weights} = \big( \weight_{1}, \weight_{0} \big)^{T} \in \mathbb{R}^{2}[/math].
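One natural candidate is the feature map [math]\featuremap(\feature) = \big(\feature, 1\big)^{T}[/math], which appends a constant entry to the original feature. The sketch below checks numerically that composing this map with the linear map reproduces [math]\weight_{1} \feature +\weight_{0}[/math]; the function names and the test values are assumptions made for illustration, and a numerical check does not replace the requested proof.

```python
def affine_feature_map(x):
    """Candidate feature map: x -> (x, 1)."""
    return (x, 1.0)


def linear_map(w_tilde, z):
    """Linear map z -> w_tilde^T z."""
    return sum(wi * zi for wi, zi in zip(w_tilde, z))


def affine_hypothesis(w1, w0, x):
    """Original hypothesis h(x) = w1 * x + w0."""
    return w1 * x + w0


# Hypothetical parameter values (illustration only).
w1, w0 = 2.0, -3.0
for x in (-2.0, 0.0, 3.5):
    composed = linear_map((w1, w0), affine_feature_map(x))
    print(x, composed, affine_hypothesis(w1, w0, x))
```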


Consider an ML application generating data points characterized by a scalar feature [math]\feature \in \mathbb{R}[/math] and a numeric label [math]\truelabel \in \mathbb{R}[/math]. We construct a non-linear map by first transforming the feature [math]\feature[/math] to a new feature vector [math]\rawfeaturevec=(\featuremap_{1}(\feature),\featuremap_{2}(\feature),\featuremap_{3}(\feature),\featuremap_{4}(\feature))^{T} \in \mathbb{R}^{4}[/math].

The components [math]\featuremap_{1}(\feature),\ldots,\featuremap_{4}(\feature)[/math] are the indicator functions of the intervals [math][-10,-5), [-5,0),[0,5),[5,10][/math]. In particular, [math]\featuremap_{1}(\feature) = 1[/math] for [math]\feature \in [-10,-5)[/math] and [math]\featuremap_{1}(\feature)=0[/math] otherwise.

We obtain a hypothesis space [math]\hypospace^{(1)}[/math] by collecting all maps from the feature [math]\feature[/math] to the predicted label [math]\hat{\truelabel}[/math] that can be written as a weighted linear combination [math]\weights^{T}\rawfeaturevec[/math] (with some parameter vector [math]\weights[/math]) of the transformed features. Which of the following hypothesis maps belong to [math]\hypospace^{(1)}[/math]?
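The feature map and the resulting hypotheses can be sketched as follows. The function names and the example weight vector are assumptions made for illustration. Every map in [math]\hypospace^{(1)}[/math] is constant on each of the four intervals and equals zero outside [math][-10,10][/math], which is a useful test for deciding whether a given map belongs to it.

```python
# Interval boundaries as stated in the exercise; the last interval is closed.
INTERVALS = [(-10, -5), (-5, 0), (0, 5), (5, 10)]


def indicator_feature_map(x):
    """Map a scalar feature x to the four interval indicators."""
    z = []
    for idx, (low, high) in enumerate(INTERVALS):
        is_last = idx == len(INTERVALS) - 1
        # Half-open interval [low, high); the last one also contains `high`.
        inside = (low <= x < high) or (is_last and x == high)
        z.append(1.0 if inside else 0.0)
    return z


def hypothesis(w, x):
    """Piecewise constant map x -> w^T z with z = indicator_feature_map(x)."""
    return sum(wi * zi for wi, zi in zip(w, indicator_feature_map(x)))


# Hypothetical weight vector (illustration only).
w = [1.0, 2.0, 3.0, 4.0]
for x in (-7.0, -5.0, 2.5, 10.0, 12.0):
    print(x, indicator_feature_map(x), hypothesis(w, x))
```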

[DLRaspberryPI-1] O. Dürr, Y. Pauchard, D. Browarnik, R. Axthelm, and M. Loeser. Deep learning on a raspberry pi for real time face recognition. 01 2015

[ShazamPaper-1] A. Wang. An industrial-strength audio search algorithm. In International Symposium on Music Information Retrieval Baltimore, MD, 2003
