EXERCISES

18. Learning from Examples

18.17 Prove that a decision list can represent the same function as a decision tree while using at most as many rules as there are leaves in the decision tree for that function. Give an example of a function represented by a decision list using strictly fewer rules than the number of leaves in a minimal-sized decision tree for that same function.

18.18 [DL-expressivity-exercise] This exercise concerns the expressiveness of decision lists (Section learning-theory-section).

  1. Show that decision lists can represent any Boolean function, if the size of the tests is not limited.

  2. Show that if the tests can contain at most $k$ literals each, then decision lists can represent any function that can be represented by a decision tree of depth $k$.

18.19 [knn-mean-mode] Suppose a $7$-nearest-neighbors regression search returns $ \{7, 6, 8, 4, 7, 11, 100\} $ as the 7 nearest $y$ values for a given $x$ value. What is the value of $\hat{y}$ that minimizes the $L_1$ loss function on this data? There is a common name in statistics for this value as a function of the $y$ values; what is it? Answer the same two questions for the $L_2$ loss function.

18.20 [knn-mean-mode] Suppose a $7$-nearest-neighbors regression search returns $ \{4, 2, 8, 4, 9, 11, 100\} $ as the 7 nearest $y$ values for a given $x$ value. What is the value of $\hat{y}$ that minimizes the $L_1$ loss function on this data? There is a common name in statistics for this value as a function of the $y$ values; what is it? Answer the same two questions for the $L_2$ loss function.
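
For either of the two exercises above, a proposed $\hat{y}$ can be checked numerically by scanning candidate values; below is a minimal sketch (the data list and the grid resolution are illustrative choices, not part of the exercises). Comparing each minimizer against simple summary statistics of the $y$ values is the point of the question.

```python
# Minimal numerical check for Exercises 18.19/18.20: scan candidate
# predictions and report which one minimizes each loss on the data.
import numpy as np

ys = np.array([7, 6, 8, 4, 7, 11, 100])               # data from Exercise 18.19
candidates = np.linspace(ys.min(), ys.max(), 100001)  # fine grid of y-hat values

l1 = np.abs(ys[None, :] - candidates[:, None]).sum(axis=1)   # L1 loss per candidate
l2 = ((ys[None, :] - candidates[:, None]) ** 2).sum(axis=1)  # L2 loss per candidate

print("L1 minimizer ~", candidates[l1.argmin()])  # compare to your closed-form answer
print("L2 minimizer ~", candidates[l2.argmin()])
```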

18.21 [svm-ellipse-exercise] Figure kernel-machine-figure showed how a circle at the origin can be linearly separated by mapping from the features $(x_1, x_2)$ to the two dimensions $(x_1^2, x_2^2)$. But what if the circle is not located at the origin? What if it is an ellipse, not a circle? The general equation for a circle (and hence the decision boundary) is $(x_1-a)^2 + (x_2-b)^2 - r^2 = 0$, and the general equation for an ellipse is $c(x_1-a)^2 + d(x_2-b)^2 - 1 = 0$. (A numerical check of part 1 is sketched after part 2.)

  1. Expand out the equation for the circle and show what the weights $w_i$ would be for the decision boundary in the four-dimensional feature space $(x_1, x_2, x_1^2, x_2^2)$. Explain why this means that any circle is linearly separable in this space.

  2. Do the same for ellipses in the five-dimensional feature space $(x_1, x_2, x_1^2, x_2^2, x_1 x_2)$.
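
As referenced above, part 1 can be checked numerically: expand the circle equation by hand to obtain candidate weights, then verify that the resulting linear function in $(x_1, x_2, x_1^2, x_2^2)$ agrees with the original circle test everywhere. The weights in this sketch follow from the expansion and should match what you derive; the particular circle and the sampled points are arbitrary.

```python
# Numerical check for Exercise 18.21 part 1: a circle centered at (a, b)
# with radius r is a hyperplane in the feature space (x1, x2, x1^2, x2^2).
import numpy as np

a, b, r = 2.0, -1.0, 1.5          # an arbitrary off-origin circle

def phi(x1, x2):
    """Map a 2-D point into the 4-D feature space."""
    return np.array([x1, x2, x1 ** 2, x2 ** 2])

# Weights from expanding (x1-a)^2 + (x2-b)^2 - r^2:
w = np.array([-2 * a, -2 * b, 1.0, 1.0])
w0 = a ** 2 + b ** 2 - r ** 2     # bias term

rng = np.random.default_rng(0)
for x1, x2 in rng.uniform(-4, 4, size=(10, 2)):
    linear = w @ phi(x1, x2) + w0                     # linear in feature space
    direct = (x1 - a) ** 2 + (x2 - b) ** 2 - r ** 2   # original circle test
    assert np.isclose(linear, direct)                 # identical by algebra
```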

18.22 [svm-exercise] Construct a support vector machine that computes the XOR function. Use values of +1 and $-1$ (instead of 1 and 0) for both inputs and outputs, so that an example looks like $([-1, 1], 1)$ or $([-1, -1], -1)$. Map the input $[x_1,x_2]$ into a space consisting of $x_1$ and $x_1\,x_2$. Draw the four input points in this space, and the maximal margin separator. What is the margin? Now draw the separating line back in the original Euclidean input space.
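
Before drawing anything, it can help to compute where the four examples land under the map $[x_1, x_2] \mapsto (x_1, x_1 x_2)$; a minimal sketch (the point ordering is arbitrary):

```python
# Where the four XOR examples land in the (x1, x1*x2) space of Exercise 18.22.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

for x1, x2 in points:
    label = -1 if x1 == x2 else 1   # XOR with +1/-1 encoding, as in the exercise
    print(f"input ({x1:+d}, {x2:+d}) -> mapped ({x1:+d}, {x1 * x2:+d}), label {label:+d}")
```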

18.23 [ensemble-error-exercise] Consider an ensemble learning algorithm that uses simple majority voting among $K$ learned hypotheses. Suppose that each hypothesis has error $\epsilon$ and that the errors made by each hypothesis are independent of the others'. Calculate a formula for the error of the ensemble algorithm in terms of $K$ and $\epsilon$, and evaluate it for the cases where $K = 5$, 10, and 20 and $\epsilon = 0.1$, 0.2, and 0.4. If the independence assumption is removed, is it possible for the ensemble error to be worse than $\epsilon$?
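
Once you have a formula, you can check it (and generate the requested table) with a short script. The sketch below assumes the ensemble errs when at least half of the $K$ hypotheses err; counting an even split as an error is one possible tie-breaking convention, not the only one.

```python
# Tabulate ensemble error under independent per-hypothesis errors
# (Exercise 18.23). Assumed convention: a tie (possible when K is even)
# counts as an ensemble error; other tie-breaking rules give other numbers.
from math import comb, ceil

def ensemble_error(K, eps):
    """Probability that at least ceil(K/2) of K independent hypotheses err."""
    return sum(comb(K, i) * eps**i * (1 - eps)**(K - i)
               for i in range(ceil(K / 2), K + 1))

for K in (5, 10, 20):
    for eps in (0.1, 0.2, 0.4):
        print(f"K={K:2d}, eps={eps}: ensemble error = {ensemble_error(K, eps):.4f}")
```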

18.24 Construct by hand a neural network that computes the XOR function of two inputs. Make sure to specify what sort of units you are using.

18.25 A simple perceptron cannot represent XOR (or, more generally, the parity function of its inputs). Describe what happens to the weights of a four-input, hard-threshold perceptron, beginning with all weights set to 0.1, as examples of the parity function arrive.
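
One way to "describe what happens" is to watch it happen. The sketch below simulates a hard-threshold perceptron with the standard perceptron update on the 4-input parity function; the learning rate, epoch count, inclusion of a bias weight, and fixed example order are all arbitrary choices.

```python
# Simulate Exercise 18.25: a 4-input hard-threshold perceptron trained on
# the parity function, all weights initialized to 0.1. A bias weight w[0]
# is included here as an assumption; drop it if your perceptron has none.
import itertools
import numpy as np

w = np.full(5, 0.1)               # [bias, w1..w4], all starting at 0.1
alpha = 0.1                       # learning rate (arbitrary choice)

examples = [(np.array([1, *x]), sum(x) % 2)   # bias feature plus parity target
            for x in itertools.product([0, 1], repeat=4)]

for epoch in range(5):
    errors = 0
    for x, y in examples:
        y_hat = 1 if w @ x > 0 else 0         # hard threshold
        w += alpha * (y - y_hat) * x          # perceptron learning rule
        errors += int(y_hat != y)
    print(f"epoch {epoch}: {errors} errors, weights = {np.round(w, 2)}")
```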

18.26 [linear-separability-exercise] Recall from Chapter concept-learning-chapter that there are $2^{2^n}$ distinct Boolean functions of $n$ inputs. How many of these are representable by a threshold perceptron?
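
For small $n$ you can get a feel for the answer by brute force. The sketch below enumerates the Boolean functions induced by a grid of weight vectors and thresholds for $n = 2$; this gives a lower bound on the count, since a finite grid is only guaranteed to be exhaustive for very small $n$, and the grid range here is an arbitrary choice.

```python
# Lower-bound count of threshold-representable Boolean functions of n = 2
# inputs (Exercise 18.26) by enumerating a grid of weights and thresholds.
import itertools

n = 2
inputs = list(itertools.product([0, 1], repeat=n))
grid = [x / 2 for x in range(-6, 7)]          # values in {-3, -2.5, ..., 3}

functions = set()
for w in itertools.product(grid, repeat=n):
    for t in grid:
        truth_table = tuple(int(sum(wi * xi for wi, xi in zip(w, x)) > t)
                            for x in inputs)
        functions.add(truth_table)

print(f"{len(functions)} of {2 ** 2 ** n} Boolean functions found (n = {n})")
```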

18.27 Consider the following set of examples, each with six inputs and one target output:

|                | $e_1$ | $e_2$ | $e_3$ | $e_4$ | $e_5$ | $e_6$ | $e_7$ | $e_8$ | $e_9$ | $e_{10}$ | $e_{11}$ | $e_{12}$ | $e_{13}$ | $e_{14}$ |
|----------------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\textbf{x}_1$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $\textbf{x}_2$ | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| $\textbf{x}_3$ | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| $\textbf{x}_4$ | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| $\textbf{x}_5$ | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| $\textbf{x}_6$ | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| $\textbf{T}$   | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

  1. Run the perceptron learning rule on these data and show the final weights (a starting-point sketch follows this list).

  2. Run the decision tree learning algorithm, and show the resulting decision tree.

  3. Comment on your results.
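
As mentioned in part 1, here is a starting-point sketch for running the perceptron learning rule on the table above; the learning rate, the epoch cap, and the inclusion of a bias weight are assumptions you may vary.

```python
# Starting-point sketch for Exercise 18.27 part 1: the perceptron learning
# rule on the table above. Learning rate and epoch cap are arbitrary.
import numpy as np

X = np.array([            # one row per input x1..x6, one column per example
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0],
]).T                      # transpose: one row per example
T = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

X = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias feature (assumption)
w = np.zeros(X.shape[1])
alpha = 1.0

for epoch in range(100):
    mistakes = 0
    for x, t in zip(X, T):
        y = int(x @ w > 0)                 # hard-threshold output
        w += alpha * (t - y) * x           # perceptron learning rule
        mistakes += int(y != t)
    if mistakes == 0:
        break
print(f"epochs used: {epoch + 1}, final weights: {w}")
print("remaining training errors:", sum(int(x @ w > 0) != t for x, t in zip(X, T)))
```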

18.28 [perceptron-ML-gradient-exercise] Section logistic-regression-section noted that the output of the logistic function could be interpreted as a probability $p$ assigned by the model to the proposition that $f(\textbf{x}) = 1$; the probability that $f(\textbf{x}) = 0$ is therefore $1-p$. Write down the probability $p$ as a function of $\textbf{x}$ and calculate the derivative of $\log p$ with respect to each weight $w_i$. Repeat the process for $\log(1-p)$. These calculations give a learning rule for minimizing the negative-log-likelihood loss function for a probabilistic hypothesis. Comment on any resemblance to other learning rules in the chapter.
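
After deriving the gradients by hand, you can sanity-check them with finite differences; a minimal sketch (the weight vector, input, and step size are arbitrary choices):

```python
# Finite-difference check for Exercise 18.28: numerically estimate
# d(log p)/dw_i, where p = sigmoid(w . x), to compare against your
# hand-derived formula.
import numpy as np

def log_p(w, x):
    return -np.log1p(np.exp(-w @ x))   # log sigmoid(w . x), numerically stable

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -0.5])         # include x_0 = 1 if you model a bias weight
h = 1e-6

for i in range(len(w)):
    dw = np.zeros_like(w); dw[i] = h
    grad_i = (log_p(w + dw, x) - log_p(w - dw, x)) / (2 * h)  # central difference
    print(f"d log p / dw_{i} ~ {grad_i:.6f}")
# Repeat with log(1 - p) = log sigmoid(-w . x) for the other case.
```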

18.29 [linear-nn-exercise] Suppose you had a neural network with linear activation functions. That is, for each unit the output is some constant $c$ times the weighted sum of the inputs.

  1. Assume that the network has one hidden layer. For a given assignment to the weights $\textbf{w}$, write down equations for the value of the units in the output layer as a function of $\textbf{w}$ and the input layer $\textbf{x}$, without any explicit mention of the output of the hidden layer. Show that there is a network with no hidden units that computes the same function (a numerical illustration follows part 3).

  2. Repeat the calculation in part 1, but this time do it for a network with any number of hidden layers.

  3. Suppose a network with one hidden layer and linear activation functions has $n$ input and output nodes and $h$ hidden nodes. What effect does the transformation in part 1, to a network with no hidden layers, have on the total number of weights? Discuss in particular the case $h \ll n$.
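
As referenced in part 1, the collapse can be illustrated numerically: compose the layer matrices into one and check that the two networks agree on sampled inputs. The layer sizes and the constant $c$ below are arbitrary.

```python
# Numerical illustration for Exercise 18.29: a linear-activation network
# with one hidden layer equals a single linear layer with W = c^2 * W2 @ W1.
import numpy as np

rng = np.random.default_rng(42)
n, h, c = 4, 3, 0.5                     # input/output size, hidden size, activation constant
W1 = rng.normal(size=(h, n))            # input -> hidden weights
W2 = rng.normal(size=(n, h))            # hidden -> output weights

def two_layer(x):
    hidden = c * (W1 @ x)               # linear activation: c times weighted sum
    return c * (W2 @ hidden)

W = (c ** 2) * (W2 @ W1)                # the equivalent no-hidden-layer weights

x = rng.normal(size=n)
assert np.allclose(two_layer(x), W @ x)
print("hidden-layer weights:", W1.size + W2.size, "| collapsed weights:", W.size)
```

The final line prints the weight counts ($2hn$ versus $n^2$), which bears directly on part 3.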

18.30 Implement a data structure for layered, feed-forward neural networks, remembering to provide the information needed for both forward evaluation and backward propagation. Using this data structure, write a function NEURAL-NETWORK-OUTPUT that takes an example and a network and computes the appropriate output values.
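
One possible shape for such a data structure, sketched in Python rather than the book's pseudocode conventions; the field names and the choice of sigmoid units are assumptions, and the cached values needed for backward propagation are stored but left for you to use.

```python
# A minimal layered feed-forward network data structure for Exercise 18.30,
# with the forward pass implemented and fields reserved for backprop.
import numpy as np

class Layer:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))  # weights
        self.b = np.zeros(n_out)                            # biases
        self.inputs = None        # cached for backward propagation
        self.outputs = None       # cached for backward propagation

    def forward(self, x):
        self.inputs = x
        self.outputs = 1 / (1 + np.exp(-(self.W @ x + self.b)))  # sigmoid units
        return self.outputs

def neural_network_output(network, example):
    """NEURAL-NETWORK-OUTPUT: feed an example forward through every layer."""
    activation = np.asarray(example, dtype=float)
    for layer in network:
        activation = layer.forward(activation)
    return activation

rng = np.random.default_rng(0)
network = [Layer(2, 4, rng), Layer(4, 1, rng)]   # sizes are an arbitrary example
print(neural_network_output(network, [1.0, 0.0]))
```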

18.31 Suppose that a training set contains only a single example, repeated 100 times. In 80 of the 100 cases, the single output value is 1; in the other 20, it is 0. What will a back-propagation network predict for this example, assuming that it has been trained and reaches a global optimum? (Hint: to find the global optimum, differentiate the error function and set it to zero.)
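
The hint can also be followed numerically: assuming the network is trained to minimize a sum-of-squared-errors loss, scan candidate output values and locate the minimizer to check your derivation.

```python
# Numerical version of Exercise 18.31's hint: find the output value that
# minimizes the training error for 80 copies of target 1 and 20 copies of
# target 0, assuming a sum-of-squared-errors loss.
import numpy as np

y = np.linspace(0, 1, 10001)                     # candidate network outputs
error = 80 * (1 - y) ** 2 + 20 * (0 - y) ** 2
print("minimizing output ~", y[error.argmin()])  # compare to your derivation
```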

18.32 The neural network whose learning performance is measured in Figure restaurant-back-prop-figure has four hidden nodes. This number was chosen somewhat arbitrarily. Use a cross-validation method to find the best number of hidden nodes.
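
A skeleton for the requested cross-validation loop is sketched below; `train_network` and `evaluate` are hypothetical stand-ins for your own training and error-measurement code, and the fold count and candidate hidden-layer sizes are arbitrary.

```python
# k-fold cross-validation skeleton for choosing the number of hidden nodes
# (Exercise 18.32). train_network and evaluate are hypothetical stand-ins.
import numpy as np

def cross_validate(data, hidden_sizes, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), k)
    scores = {}
    for h in hidden_sizes:
        errs = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            net = train_network(h, [data[t] for t in train_idx])    # hypothetical
            errs.append(evaluate(net, [data[v] for v in val_idx]))  # hypothetical
        scores[h] = np.mean(errs)                 # mean validation error for h
    return min(scores, key=scores.get), scores    # best h, plus the full table
```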

18.33 [embedding-separability-exercise] Consider the problem of separating $N$ data points into positive and negative examples using a linear separator. Clearly, this can always be done for $N = 2$ points on a line of dimension $d = 1$, regardless of how the points are labeled or where they are located (unless the points are in the same place).

  1. Show that it can always be done for $N = 3$ points on a plane of dimension $d = 2$, unless they are collinear.

  2. Show that it cannot always be done for $N = 4$ points on a plane of dimension $d = 2$ (an LP-based separability test is sketched after part 5).

  3. Show that it can always be done for $N = 4$ points in a space of dimension $d = 3$, unless they are coplanar.

  4. Show that it cannot always be done for $N = 5$ points in a space of dimension $d = 3$.

  5. The ambitious student may wish to prove that $N$ points in general position (but not $N+1$) are linearly separable in a space of dimension $N-1$.
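
As referenced in part 2, a computational companion for the "cannot always be done" parts: linear separability of a labeled point set can be posed as a linear-program feasibility question, so scipy's `linprog` can test particular configurations. The four points below, with one labeling in the XOR pattern, are an illustrative choice; a failed test for one configuration is evidence, not a proof, so the written argument is still needed.

```python
# Test linear separability of labeled points via LP feasibility
# (companion to Exercise 18.33): find w, b with y_i (w . x_i + b) >= 1.
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    X = np.asarray(points, float)
    y = np.asarray(labels, float)
    n, d = X.shape
    # Variables: [w_1..w_d, b]. Constraints: -y_i (x_i . w + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0          # status 0 means a separator was found

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(separable(square, [1, -1, -1, 1]))   # XOR labeling of N=4 points in d=2
print(separable(square, [1, 1, -1, -1]))   # a separable labeling, for contrast
```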