Regularization and Model Selection

Regularization

There is a tradeoff between bias (underfitting) and variance (overfitting), and achieving the optimal tradeoff requires choosing the right model complexity.

- Model complexity: can be a function of the parameter values (e.g. their $\ell_2$ norm), not just the number of parameters.
- Regularization: allows us to control model complexity and prevent overfitting.

Regularizer Function

A regularizer $R(\theta)$ is a function that measures model complexity. It is usually nonnegative. In classical methods, $R(\theta)$ depends only on parameters $\theta$...
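As a rough sketch of how a regularizer enters the objective, the snippet below adds $\lambda \lVert \theta \rVert_2^2$ to a least-squares loss; the function names, data shapes and learning rate are illustrative, not from the post.

```python
import numpy as np

def ridge_loss(theta, X, y, lam):
    """Least-squares loss plus an l2 regularizer R(theta) = ||theta||_2^2."""
    residual = X @ theta - y
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(theta ** 2)

def ridge_gradient_step(theta, X, y, lam, lr=0.01):
    """One gradient-descent step on the regularized loss; the extra
    2*lam*theta term shrinks the weights toward zero."""
    grad = X.T @ (X @ theta - y) + 2 * lam * theta
    return theta - lr * grad
```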

June 24, 2022 · 4 min · Andre Lim

Generalization

Generalisation

The ultimate goal of machine learning is to create a predictive model that performs well on unseen examples, i.e. one that generalises well. Generalisation is a model's performance on unseen test data, measured by the test error.

Test Error

The loss/error on test examples $(x,y)$ sampled from a test distribution $\mathcal{D}$:
$$ L(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}} [ (y-h_\theta(x))^2] $$
The expectation $\mathbb E$ can be approximated by averaging many samples...
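The averaging approximation in code, as a minimal numpy sketch (the hypothesis `h` and data names are illustrative):

```python
import numpy as np

def test_error(h, theta, X_test, y_test):
    """Approximate L(theta) = E[(y - h_theta(x))^2] by averaging the
    squared error over a finite sample from the test distribution."""
    preds = h(theta, X_test)
    return np.mean((y_test - preds) ** 2)
```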

June 23, 2022 · 2 min · Andre Lim

Reinforcement Learning

Reinforcement Learning

Many sequential decision-making and control problems are hard to provide explicit supervision for. Instead, we provide a reward function and let the learning algorithm figure out how to choose actions over time. RL has been successful in applications such as helicopter flight, legged locomotion, network routing, and marketing strategy selection.

Markov Decision Processes (MDP)

MDPs provide a formalism for many RL problems.

Terminology
- States: $s$
- Actions: $a$
- State transition probabilities: $P_{sa}$
- Discount factor: $\gamma \in [0,1)$
- Reward function: $R : S \times A \to \mathbb{R}$

Dynamics of MDPs
1. Start in some state $s_0$
2. Choose some action $a_0 \in A$
3. The MDP randomly transitions to a successor state $s_1 \sim P_{s_0 a_0}$
4. Pick another action and repeat

This can be represented as
$$ s_0 \overset{a_0}{\rightarrow} s_1 \overset{a_1}{\rightarrow} s_2 \overset{a_2}{\rightarrow} s_3 \overset{a_3}{\rightarrow} \dots $$

Total Payoff
$$ R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \dots $$

Goal: maximise the expected value of the total payoff
$$ \mathbb{E} \left[ R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \dots \right] $$...
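A quick sketch of the total payoff for a finite trajectory of collected rewards (the example values are illustrative):

```python
def total_payoff(rewards, gamma):
    """Discounted total payoff R(s0,a0) + gamma*R(s1,a1) + gamma^2*R(s2,a2) + ...
    for a finite trajectory of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. rewards collected along a trajectory, with gamma = 0.9
print(total_payoff([1.0, 0.0, 2.0], 0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```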

April 11, 2022 · 6 min · Andre Lim

Deep Learning

Deep Learning

Supervised Learning with Nonlinear Models

Supervised learning is the task of predicting $y$ from input $x$. Suppose the model/hypothesis is $h_\theta(x)$. Previous methods have considered
- Linear regression: $h_\theta(x) = \theta^Tx$
- Kernel method: $h_\theta(x) = \theta^T \phi(x)$

Both are linear in $\theta$. Now consider models that are nonlinear in both the parameters $\theta$ and the inputs $x$, the most common of which is the neural network.

Cost/Loss Function

Define the least-squares cost for one sample
$$ J^{(i)}(\theta) = \frac{1}{2} ( h_\theta(x^{(i)}) - y^{(i)})^2 $$
and the mean-square cost for the dataset
$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}(\theta) $$...
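A small numpy sketch of the two costs for an arbitrary hypothesis `h` (names illustrative, not from the post):

```python
import numpy as np

def mean_square_cost(h, theta, X, y):
    """J(theta) = (1/n) * sum_i J_i(theta), where J_i is the
    least-squares cost 0.5 * (h_theta(x_i) - y_i)^2."""
    per_sample = 0.5 * (h(theta, X) - y) ** 2
    return np.mean(per_sample)

# e.g. with the linear hypothesis h_theta(x) = theta^T x
linear_h = lambda theta, X: X @ theta
```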

April 10, 2022 · 3 min · Andre Lim

Support Vector Machine (SVM)

Support Vector Machines

The support vector machine (SVM) is a supervised learning method that finds the optimal (maximum) margin for either regression or classification. Consider the linear classifier
$$ h_{w,b}(x) = g(w^T x + b) $$
with $y \in \{-1, 1\}$.

Margins

Margins capture how confident and correct a prediction is.

Geometric Margin, $\gamma$

The geometric margin is the Euclidean distance from a sample to the decision boundary, defined as...
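Assuming the standard definition that the truncated excerpt leads into, $\gamma^{(i)} = y^{(i)} (w^T x^{(i)} + b) / \lVert w \rVert$, a minimal numpy sketch (names illustrative):

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """Signed Euclidean distance from a sample to the hyperplane w^T x + b = 0.
    Positive iff the prediction is correct; larger means more confident."""
    return y * (w @ x + b) / np.linalg.norm(w)
```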

April 9, 2022 · 4 min · Andre Lim

Kernel Methods

Kernel Methods

Kernel methods are an efficient way to perform nonlinear regression or classification, by calculating a dot product instead of the entire feature map.

Feature Maps

A feature map $\phi$ is a function that maps input attributes to some new (nonlinear) feature variables.

LMS with Features

First let's define:
- Input: $x \in \mathbb R ^ d$
- Features: $\phi(x) \in \mathbb R^{p}$
- Weights: $\theta \in \mathbb R^p$

Modify gradient descent for the ordinary least squares problem
$$ \theta := \theta + \alpha \sum_{i=1}^{n}{(y^{(i)} - \theta^T x^{(i)})\ x^{(i)}} $$
by replacing $x$ with a feature map $\phi : \mathbb R^d \rightarrow \mathbb R^p$
$$ \theta := \theta + \alpha \sum_{i=1}^{n}{(y^{(i)} - \theta^T \phi(x^{(i)}))\ \phi(x^{(i)})} $$
which has the SGD update rule
$$ \theta := \theta + \alpha (y^{(i)} - \theta^T \phi(x^{(i)}))\ \phi(x^{(i)}) $$...
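A minimal sketch of the SGD update with an explicit feature map; the degree-2 monomial map for scalar inputs is an illustrative choice, not the post's example:

```python
import numpy as np

def phi(x):
    """Example feature map: degree-2 monomials of a scalar input."""
    return np.array([1.0, x, x ** 2])

def sgd_step(theta, x_i, y_i, alpha=0.01):
    """One LMS/SGD update on the feature-mapped input:
    theta := theta + alpha * (y_i - theta^T phi(x_i)) * phi(x_i)."""
    f = phi(x_i)
    return theta + alpha * (y_i - theta @ f) * f
```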

April 8, 2022 · 3 min · Andre Lim

Generative Learning

Generative Learning

Generative learning is a different approach to learning, as opposed to discriminative learning: it models $p(x|y)$ and $p(y)$ instead of learning $p(y|x)$ directly.
- $p(x|y)$: distribution of the features given the class
- $p(y)$: class priors

It uses Bayes' rule to calculate the posterior distribution $p(y|x)$
$$ p(y|x) = \frac{p(x|y) p(y)}{p(x)} $$
which simplifies for prediction, because the denominator is constant in $y$:
$$ \arg \max_y p(y|x) = \arg \max_y p(x|y) p(y) $$...
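A minimal sketch of prediction via Bayes' rule, assuming placeholder `likelihood` and `prior` callables (not from the post):

```python
import numpy as np

def predict(x, classes, likelihood, prior):
    """arg max over y of p(x|y) * p(y); the denominator p(x) is the
    same for every class, so it can be dropped."""
    scores = [likelihood(x, y) * prior(y) for y in classes]
    return classes[int(np.argmax(scores))]
```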

April 7, 2022 · 5 min · Andre Lim

Generalised Linear Models (GLM)

Generalised Linear Models (GLM)

Generalised Linear Models (GLMs) are a family of models built on exponential-family distributions, which include many common distributions such as the Gaussian, Bernoulli and Multinomial.

The Exponential Family

The exponential family serves as a starting point for GLMs. It is defined as
$$ p(y; \eta) = b(y) \ \exp{\left( \eta^T\ T(y) - a(\eta) \right)} $$
- Natural (canonical) parameter: $\eta$
- Sufficient statistic: $T(y)$
- Log partition function: $a(\eta)$

$T(y)$ is often chosen to be $T(y) = y$...
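As a concrete check, the Bernoulli distribution can be written in this form with $b(y) = 1$, $T(y) = y$, $\eta = \log\frac{\phi}{1-\phi}$ and $a(\eta) = \log(1 + e^\eta)$; a small numpy sketch:

```python
import numpy as np

def bernoulli_exp_family(y, eta):
    """Bernoulli in exponential-family form:
    b(y) = 1, T(y) = y, a(eta) = log(1 + e^eta)."""
    b, T, a = 1.0, y, np.log(1.0 + np.exp(eta))
    return b * np.exp(eta * T - a)

phi = 0.3
eta = np.log(phi / (1 - phi))  # natural parameter
print(bernoulli_exp_family(1, eta))  # ~0.3, matches p(y=1) = phi
```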

April 6, 2022 · 4 min · Andre Lim

Supervised Learning - Classification

Classification

Classification is similar to regression, except the output $y$ only takes on a small number of discrete values, or classes.

Logistic Regression

Ignoring the fact that $y$ is discrete will often result in very poor performance; since $y \in \{ 0, 1 \}$, the hypothesis $h_\theta(x)$ should be constrained to $[0, 1]$. One approach is to modify the hypothesis function to use the logistic/sigmoid function
$$ h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}} $$
where $g(z) = \frac{1}{1 + e^{-z}}$ is the logistic/sigmoid function...
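A minimal numpy sketch of this hypothesis (names illustrative):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z), squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic-regression hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)
```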

April 6, 2022 · 2 min · Andre Lim

Supervised Learning - Regression

Supervised Learning

Supervised learning is the task of learning a function mapping from input to output
$$ y = f(x) $$
The relationship can be linear/nonlinear or convex/nonconvex. The approach learns from labeled data.

Terminology
- Input (features): $x^{(i)}$
- Output (target): $y^{(i)}$
- Training example: $(x^{(i)}, y^{(i)})$
- Hypothesis: $h(x)$
- Parameters/weights: $\theta$

Types
- Regression: continuous values
- Classification: discrete values

Linear Regression

Objective: learn parameters $\theta$ for a given hypothesis function $h$ to best predict output $y$ from input $x$...
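A minimal numpy sketch of the linear hypothesis and a batch gradient-descent fit, assuming a design matrix `X` with one row per example (illustrative, not from the post):

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x."""
    return theta @ x

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Fit theta by batch gradient descent on the least-squares objective:
    theta := theta + alpha * sum_i (y_i - theta^T x_i) * x_i."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta)
    return theta
```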

April 6, 2022 · 4 min · Andre Lim