
# -*- coding: utf-8 -*-
"""Copie de neural-ode.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1EbiiM19azgFEiIq4WNo-09wiDIiI8zSx

Original: https://colab.research.google.com/github/williamgilpin/cphy/blob/main/talks/overfitting_bias_variance_free_lunch.ipynb

# Overfitting, the Bias-Variance tradeoff, Regularization, and Double Descent

Preamble: Run the cells below to import the necessary Python packages

These notes are modified and extended from [notes](https://github.com/kuleshov/cornell-cs5785-2022-applied-ml/blob/main/slides/lecture5-regularization.ipynb) by **Volodymyr Kuleshov** at Cornell Tech, adapted here to segue into the physics course and to motivate double descent.
"""

# Commented out IPython magic to ensure Python compatibility.
## Preamble / required packages
import numpy as np
np.random.seed(0)

## Import local plotting functions and in-notebook display functions
import matplotlib.pyplot as plt
from IPython.display import Image, display
# %matplotlib inline

import warnings
## Comment this out to activate warnings
warnings.filterwarnings('ignore')

"""# Overfitting and Underfitting

We have seen a number of supervised learning algorithms.

Next, let's look at why they work (and why sometimes they don't).
"""



"""# Polynomial Regression

Recall that in linear regression, we fit a linear model to a dataset $X \in \mathbb{R}^{N_\text{data} \times N_\text{features}}$ and a vector of labels $y \in \mathbb{R}^{N_\text{data}}$

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\theta}
$$

where $\boldsymbol{\theta} \in \mathbb{R}^{N_\text{features} \times 1}$ is a vector of parameters that we want to learn, which we sometimes call the model weights.


What if we wanted to fit a polynomial instead? We could instead solve a polynomial regression problem of the form

$$
\mathbf{y} = \Phi(\mathbf{X})\,\boldsymbol{\theta}
$$

where $\Phi(\mathbf{X})$ is a matrix of features that are polynomial transformations of the original features,
$$
\Phi(\mathbf{X}) = \begin{bmatrix}
1 & x_{11} & x_{12} & \dots & x_{1d} & x_{11}^2 & x_{12}^2 & \dots & x_{1d}^2 & \dots \\
1 & x_{21} & x_{22} & \dots & x_{2d} & x_{21}^2 & x_{22}^2 & \dots & x_{2d}^2 & \dots \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \ddots \\
1 & x_{n1} & x_{n2} & \dots & x_{nd} & x_{n1}^2 & x_{n2}^2 & \dots & x_{nd}^2 & \dots \\
\end{bmatrix}
$$

For example, if we have a single feature $x$ ($N_\text{features} = 1$), we could use the lifted featurization $\phi(x) = [1, x, x^2, x^3, \ldots, x^p]$ to fit a polynomial of degree $p$. Sometimes these feature-wise transformations are called *basis functions* or *kernels*. This gives a model

$$ f_\theta(x) := \theta^\top \phi(x) = \sum_{j=0}^p \theta_j x^j $$

that is linear in $\theta$ but non-linear in $x$, because the features
$$\phi(x) = [1\; x\; \ldots\; x^p]$$
are non-linear. Using these features, we can fit any polynomial of degree $p$ by first transforming the features and then fitting a linear model.

### Create a synthetic dataset

+ As a demonstration, we'll make a simple synthetic dataset with a single feature $x$ and a single label $y$.
+ We don't know this function in practice. It's the unknown generator of our data.
"""

true_fn = lambda X: np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), 'k', label="True")
plt.plot(X, y, '.b', markersize=10, label="Sampled")
plt.legend(loc="best")

plt.xlabel("x")
plt.ylabel("y")

"""## The Vandermonde matrix

The Vandermonde matrix $\Phi \in \mathbb{R}^{n \times (p+1)}$ is a matrix whose rows are the features $\phi(x_i)$ for a set of points $x_1,\ldots,x_n$ and a polynomial of degree $p$. For example, if we have $n=3$ points $x_1=1$, $x_2=2$, $x_3=3$, and $p=2$, then the Vandermonde matrix is
$$
\begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 4 \\
1 & 3 & 9
\end{bmatrix}
$$
"""

xx = np.array([1, 2, 3])

print(np.vander(xx, 2, increasing=True), end='\n\n')

print(np.vander(xx, 3, increasing=True), end='\n\n')

print(np.vander(xx, 5, increasing=True), end='\n\n')

print(np.vander(xx, 10, increasing=True), end='\n\n')

"""

### Solving yet another least-squares problem

In principle, we can simply fit a polynomial of degree $p$ to a set of points $(x_1,y_1),\ldots,(x_n,y_n)$ by solving the linear system $\Phi\theta = y$.

$$
\begin{bmatrix}
1 & x_1 & x_1^2 & \ldots & x_1^p \\
1 & x_2 & x_2^2 & \ldots & x_2^p \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \ldots & x_n^p
\end{bmatrix}

\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_p
\end{bmatrix}

=

\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix}
$$

Notice how we "linearized" our problem by applying nonlinear functions to our features. This is the basis of many "kernel methods" in machine learning.

### Properties of the Vandermonde matrix

+ The Vandermonde matrix $\Phi$ has size $n\times p$, where $n$ is the number of datapoints and $p$ is one plus the degree of the polynomial that we want to fit. In this sense, the "features" of the model are just higher powers of existing features, so they carry redundant information.

+ If $p = n$, then the Vandermonde matrix is square and, provided the points $x_1,\ldots,x_n$ are all distinct, it has full rank and is invertible, so the linear system has a unique solution via $\Phi^{-1}$. The matrix becomes singular if any of the datapoints are repeated.

+ However, if $p > n$, then the Vandermonde matrix is wide and the linear system is underdetermined: a solution exists but is not unique. Conversely, if $p < n$, then the system is overdetermined and an exact solution is not guaranteed to exist. In that case, we can find the least-squares solution using the Moore-Penrose pseudoinverse $(\Phi^\top\Phi)^{-1}\Phi^\top$; a short numerical sketch of this direct fit appears below.

+ **Important**: as the number of columns (the polynomial degree) increases at a fixed number of data points, the Vandermonde matrix becomes ill-conditioned, which degrades the accuracy of the solution.
"""

xx = np.array([1, 2, 3]) # 3 data points

pvals = range(2, 20)
all_condition_numbers = []
for i in pvals:
    all_condition_numbers.append(np.linalg.cond(np.vander(xx, i, increasing=True)))

plt.plot(pvals, all_condition_numbers)
plt.xlabel('Polynomial degree at fixed data quantity')
plt.ylabel('Condition number')

"""## How can we interpret this effect?

Let's try making unit vectors out of the powers of our data vector
"""

xx = np.array([1.0, 2.0]) # 2 data points

phi = np.vander(xx, 10, increasing=True)

phi /= np.linalg.norm(phi, axis=0, keepdims=True)

def plot_vector(vec, **kwargs):
    plt.plot([0, vec[0]], [0, vec[1]], **kwargs)

plt.figure(figsize=(6, 6))
plot_vector(phi[:, 0], color='k', label='First column')
plot_vector(phi[:, 1], color='r', label='Second column')
plt.legend(loc="best")
plt.title("First two columns of the Vandermonde matrix")


plt.figure(figsize=(6, 6))
plot_vector(phi[:, -2], color='k', label='Second-to-last column')
plot_vector(phi[:, -1], color='r', label='Last column')
plt.legend(loc="best")
plt.title('Last two columns')

# plot with color gradient
plt.figure(figsize=(6, 6))
for i in range(10):
    plot_vector(phi[:, i], color=plt.cm.viridis(i / 10))
plt.title('All columns')

plt.figure()
plt.imshow(phi)
plt.xlabel('Power index')
plt.ylabel('Datapoint index')

"""The largest element of our data vector $x$ begins to dominate at higher powers of $x$, causing the column vectors to become more and more similar to each other, and resulting in the least-squares problem becoming ill-conditioned.

**Nothing comes for free!** If we don't have sufficient features or measurement channels, we can't always just make up new ones as functions of our existing features. Eventually, redundancy catches up with us. We can't convert an underdetermined problem to an overdetermined problem by adding redundant features.

"""





"""# Polynomials Regression

+ When we switch from linear models to polynomials, we can better fit the data and increase the accuracy of our models.

+ Instead of directly computing the Vandermonde matrix, we can use the `PolynomialFeatures` class from `sklearn.preprocessing` to compute the Vandermonde matrix for us. Because it's built into scikit-learn, it's optimized for speed and memory usage.

+ We need to do two operations: featurize the data using the Vandermonde matrix, and then train a linear regression. `scikit-learn` allows us to combine these operations into a single combined model with the `Pipeline` class.
"""

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degrees = [1, 2, 3]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))

"""+ Although fitting a linear model does not work well, quadratic or cubic polynomials seem to improve the fit.

+ We can assess whether a given model is underfitting by plotting the residuals, which are the differences between the true values and the predicted values. If the residuals show systematic structure rather than random scatter, the model is underfitting.

If the true values are $y$ and the predicted values are $\hat{y}$, then the residuals are $y - \hat{y}$. The reason we expect uniform scatter is that our linear regression model can be seen as defining a data-generating process of the form

$$
y = \theta^\top \phi(x) + \epsilon
$$

where $\epsilon$ is a random variable with zero mean and constant variance. The residuals are therefore given by

$$
y - \theta^\top \phi(x) = \epsilon
$$
"""

## Plot residuals for each model
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ax.plot(X, y - pipeline.predict(X[:, np.newaxis]), '.b', markersize=10, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Residuals for Degree {}".format(degrees[i]))

"""We can see one heuristic for determining whether we have chosen a sufficiently-expressive model: **Uniform scatter in the residuals**

Hypothetically, we want to view our data generating process as comprising a deterministic function $f$ and a random noise term $\epsilon$. We want to fit a model that is close to $f$ and that is able to capture everything except the variability in $\epsilon$.

$$
y_i = f(x_i) + \epsilon_i
$$

where $f(x_i) = \theta_0 + \theta_1 x_i + ...$ and $\epsilon_i$ is a random variable with mean 0 and variance $\sigma^2$. When we compute our residuals, we are computing the difference between the true values $y_i$ and our model's predictions $\hat{y}_i = f(x_i)$.
$$
r_i = y_i - f(x_i)
$$

Thus we want $r_i$ to look like a draw from a zero-centered random distribution, consistent with our implicit model of the generating process for this dataset.
"""









"""# Why not go even higher?

+ Increasing the degree of our polynomial improved our fit accuracy by producing a model that explained more of the variance in the data.
+ Our residuals appeared more uniform as well, suggesting that our model stopped underfitting.

What happens if we further increase the degree of the polynomial?

We can fit an $N$-point dataset exactly with a polynomial of degree $N - 1$, but the fit oscillates wildly between the data points: this is Runge's phenomenon.
"""

degrees = [30]
plt.figure(figsize=(10, 6))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))

"""#### Let's plot the error on the training points

+ Nice.
+ This is a more obvious version of the exact same phenomenon we saw with our other models.
"""

## Compute train MSE
polynomial_features = PolynomialFeatures(degree=30)
linear_regression = LinearRegression()
pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
pipeline.fit(X[:, np.newaxis], y)
print("Train MSE: {}".format(np.mean((pipeline.predict(X[:, np.newaxis]) - y)**2)))

"""### Let's plot the error on a test set

+ Uh-oh
"""

## Compute test MSE
X_test = np.linspace(0, 1, 100)
print("Test MSE: {}".format(np.mean((pipeline.predict(X_test[:, np.newaxis]) - true_fn(X_test))**2)))

"""### Model complexity

+ This is a nebulous concept in machine learning. In principle, it's the number of trainable parameters in a model.

+ However, we've seen before that regularizers and constraints can reduce the effective number of degrees of freedom, and that not all parameter combinations are "reachable" during training.

+ Nonetheless, we'll treat the number of parameters as an upper bound on the complexity of a model.


#### Plotting train/test error vs. model complexity

+ This is a key plot used in model selection and hyperparameter tuning.
"""

## Plot train and test error versus model size

degrees = range(1, 21)
train_errors = []
test_errors = []
for degree in degrees:
    polynomial_features = PolynomialFeatures(degree=degree,)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ## Compute errors on training and testing data
    train_errors.append(np.mean((pipeline.predict(X[:, np.newaxis]) - y)**2))
    test_errors.append(np.mean((pipeline.predict(X_test[:, np.newaxis]) - true_fn(X_test))**2))

plt.figure(figsize=(6, 6))
plt.semilogy(degrees, train_errors, label='Train error')
plt.semilogy(degrees, test_errors, label='Test error')
plt.legend(loc='best')
plt.xlabel('Model complexity (polynomial degree)')
plt.ylabel('Mean Squared Error')

"""### We'd say that the best model is the one that minimizes the test error."""



"""# The Bias-Variance Tradeoff

### Overfitting

+ A very expressive model (e.g., a high degree polynomial) fits the training dataset perfectly.

+ But the model makes highly incorrect predictions outside this dataset, and doesn't generalize.

+ We would say that the overfit model exhibits **high variance**: its predictions (and thus its score on held-out data) change substantially when it is fit to different hypothetical samplings of the training data

+ We could also say that the model has **low bias** because it will fit any training data very well.

### Underfitting

+ A small model (e.g. a straight line), will not fit the training data well.

+ Therefore, it will also not be accurate on new data.

+ The model will, however, exhibit **low variance**: its fit, and its error, change little across different random samplings of the training data.

+ We say that it has **high bias** due to its strong tendency to do a limited number of things.
"""



"""### Determining Overfitting vs. Underfitting

We can diagnose overfitting and underfitting by measuring performance on a held-out test dataset (not used for training).

* If training performance is high but holdout performance is low, we are overfitting.

* If both training performance and holdout performance are low, we are underfitting.

As we've seen previously, the gap between train and test scores is one measure of overfitting. The larger the gap, the more overfitting.

How do we choose the correct model complexity? For our polynomial example, $p$ is a hyperparameter. We need to tune it on a validation set, or run cross-validation on our training partition; a minimal cross-validation sketch follows below.
"""

degrees = [1, 20, 5]
titles = ['Underfitting', 'Overfitting', 'A Good Fit']
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")

    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("{} (Degree {})".format(titles[i], degrees[i]))

"""## How to Fix Underfitting

+ Feature engineering: find better features that will make the dataset easier to fit.

+ Pick a more expressive model class (random forests or neural networks instead of linear models).

+ Pick an optimization algorithm that can find better parameters (SGD instead of gradient descent).

## How to Fix Overfitting

+ Use a simpler model class (a linear model instead of a neural net)

+ Use fewer trainable parameters (a smaller neural net)

+ Keep the same model, but collect more training data

+ **Regularization**: modify the training process to penalize overly complex models.
"""







"""
# Regularization

+ We will try applying L2 or Ridge regularization to our polynomial regression model.

+ We will also evaluate each fit on the test set, using the coefficient of determination $R^2$ as our metric.

+ Why are we using L2 regularization instead of L1 (Lasso)?

### L2 Regularization for Polynomial Regression

Let's consider an application to the polynomial model we have seen so far. Given polynomial features $\phi(x)$, we optimize the following objective:

$$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top \phi(x^{(i)}) \right)^2 + \frac{\lambda}{2} \cdot ||\theta||_2^2. $$

"""

from sklearn.linear_model import Ridge

lambda_values = [0.0, 1e-8, 1e-6, 1e-3, 1e-1, 1e0, 1e1]

plt.figure(figsize=(10, 25))
for i, lambda_value in enumerate(lambda_values):
    ax = plt.subplot(len(lambda_values), 1, i + 1)
    polynomial_features = PolynomialFeatures(degree=20)
    linear_regression = Ridge(alpha=lambda_value)
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")

    ax.set_title(f"Regularization {lambda_value}, MSE: {pipeline.score(X_test[:, None], true_fn(X_test)):.4f}")





"""# Regularization reduces the variance of the model

We fit both ordinary least-squares and L2-regularized polynomial regression of degree 15 on three random training sets sampled from the same distribution.
"""

from sklearn.linear_model import Ridge

degrees = [15]*3
plt.figure(figsize=(14, 5))
for idx, i in enumerate(range(len(degrees))):
    # sample a dataset
    np.random.seed(idx)
    n_samples = 30
    X = np.sort(np.random.rand(n_samples))
    y = true_fn(X) + np.random.randn(n_samples) * 0.1



    # fit a least squares model
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # fit a Ridge model
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = Ridge(alpha=0.1) # sklearn uses alpha instead of lambda
    pipeline2 = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline2.fit(X[:, np.newaxis], y)

    # visualize results
    ax = plt.subplot(1, len(degrees), i + 1)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="OLS")
    ax.plot(X_test, pipeline2.predict(X_test[:, np.newaxis]), label="L2 Reg.")
    ax.plot(X, y, '.b', markersize=10, label="Sampled")


    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Dataset sample #{}".format(idx))

"""We can get a better idea of what's going on by printing the learned weights $\theta$ for each model."""

print("Weights for OLS model")
print(pipeline.named_steps['lr'].coef_[:8])

print("Weights for L2 regularized model")
print(pipeline2.named_steps['lr'].coef_[:8])

"""# The bias-variance tradeoff

+ This phrase captures the idea that I can either have a jack-of-all-trades model that is good at everything, but which requires a lot of data to train, or I can have a model that is good at a specific task, and which requires less data to train because it has high bias.

+ Underfitting: high bias, low variance. The model has a strong "bias" towards a certain shape, and is inflexible to capture all of the peculiarities of the training data.

+ Overfitting: low bias, high variance. The model is flexible, but overfits because it attempts to capture all of the peculiarities of the training data, some of which may result from noise or other factors that are not present in the test data.

+ In physics, we often have "inductive bias" towards certain models. For example, we might know that energy is conserved, or that certain symmetries exist. This is a form of bias that can help us learn from less data, by restricting model space.
"""





"""# Hyperparameter Search

+ The Ridge penalty parameter $\lambda$ is a **hyperparameter**: a high-level parameter that controls the properties of the model, rather than a parameter that is learned from the data.

+ We can set $\lambda$ by assessing performance on a held-out validation dataset, or we can use cross-validation to select $\lambda$ from subsets of the training data, as sketched below.
"""



"""# Normal Equations for Ridge regression Models

How do we fit regularized models? As in the linear case, we can do this easily by deriving generalized normal equations!

Let $L(\theta) = \frac{1}{2} (X \theta - y)^\top  (X \theta - y)$ be our least squares objective. We can write the L2-regularized objective as:
$$ J(\theta) = \frac{1}{2} (X \theta - y)^\top  (X \theta - y) + \frac{1}{2} \lambda ||\theta||_2^2 $$

This allows us to derive the gradient as follows:
\begin{align*}
\nabla_\theta J(\theta)
& = \nabla_\theta \left( \frac{1}{2} (X \theta - y)^\top  (X \theta - y) + \frac{1}{2} \lambda ||\theta||_2^2 \right) \\
& = \nabla_\theta \left( L(\theta) + \frac{1}{2} \lambda \theta^\top \theta \right) \\
& = \nabla_\theta L(\theta) + \lambda \theta \\
& = (X^\top X) \theta - X^\top y + \lambda \theta \\
& = (X^\top X + \lambda I) \theta - X^\top y
\end{align*}

We used the derivation of the normal equations for least squares to obtain $\nabla_\theta L(\theta)$ as well as the fact that: $\nabla_x x^\top x = 2 x$.

We can set the gradient to zero to obtain normal equations for the Ridge model:
$$ (X^\top X + \lambda I) \theta = X^\top y. $$

Hence, the value $\theta^*$ that minimizes this objective is given by:
$$ \theta^* = (X^\top X + \lambda I)^{-1} X^\top y.$$

Note that the matrix $(X^\top X + \lambda I)$ is always invertible for $\lambda > 0$, which addresses a problem with least squares that we saw earlier. A short numerical check of this closed-form solution is sketched below.
"""







"""# L1 vs. L2 Regularization

Recall that L2 regularization (Ridge or Tikhonov) is defined as

$$J(\theta) = \frac{1}{n} \sum_{i=1}^n L(y^{(i)}, \theta^\top x^{(i)}) + \frac{\lambda}{2} \cdot ||\theta||_2^2.$$

L1 regularization (Lasso) is defined as

$$J(\theta) = \frac{1}{n} \sum_{i=1}^n L(y^{(i)}, \theta^\top x^{(i)}) + \lambda \cdot ||\theta||_1.$$

Where $||\theta||_1 = \sum_{j=1}^d |\theta_j|$ is the L1 norm of $\theta$.

What is the difference between L1 and L2 regularization?
"""

from sklearn.linear_model import Lasso, Ridge

# create ridge coefficients
alphas = np.logspace(-10, 0, 50)
ridge_coefs = []
for a in alphas:
    polynomial_features = PolynomialFeatures(degree=15, include_bias=False)  # degree-15 features, matching the fits above
    linear_regression = Ridge(alpha=a) # sklearn uses alpha instead of lambda
    pipeline2 = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline2.fit(X[:, np.newaxis], y)

    ridge_coefs.append(pipeline2.named_steps['lr'].coef_)

# plot ridge coefficients
plt.figure(figsize=(10, 5))
plt.semilogx(alphas, ridge_coefs)
plt.xlabel('Regularization parameter (lambda)')
plt.ylabel('Magnitude of model parameters')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')

lasso_coefs = []
for a in alphas:
    polynomial_features = PolynomialFeatures(degree=15, include_bias=False)  # degree-15 features, matching the fits above
    linear_regression = Lasso(alpha=a) # sklearn uses alpha instead of lambda
    pipeline2 = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline2.fit(X[:, np.newaxis], y)

    lasso_coefs.append(pipeline2.named_steps['lr'].coef_)

# plot lasso coefficients
plt.figure(figsize=(10, 5))
plt.semilogx(alphas, lasso_coefs)
plt.xlabel('Regularization parameter (lambda)')
plt.ylabel('Magnitude of model parameters')
plt.title('Lasso coefficients as a function of the regularization')
plt.axis('tight')









"""# An emerging topic: double descent

+ Let's do this again, but this time with slightly more data

+ This example is based on [Daniela Witten's description of double descent](https://twitter.com/daniela_witten/status/1292293102103748609)
"""

## Make a dataset
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

true_fn = lambda X: np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 20
# X = np.linspace(0, 1, n_samples)
X = np.random.rand(n_samples)
y = true_fn(X) + np.random.randn(n_samples) * 0.1

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), 'k', label="True function")
plt.plot(X, y, '.b', markersize=5, label="Samples")
plt.xlabel("x")
plt.ylabel("y")

plt.legend(loc="best")

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso

degrees = [1, 6, 15, 50]
plt.figure(figsize=(8, 20))
for i in range(len(degrees)):


    polynomial_features = PolynomialFeatures(degree=degrees[i])
    ridge_regression = Ridge(alpha=0.01)
    pipeline = Pipeline([("pf", polynomial_features), ("lr", ridge_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ax = plt.subplot(len(degrees), 1, i + 1)
    ax.plot(X_test, true_fn(X_test), color='k', label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.plot(X, y, '.b', markersize=5, label="Sampled")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))
    plt.xlabel("x")
    plt.ylabel("y")
    plt.tight_layout()

    ## Plot residuals for each model
    # ax = plt.subplot(len(degrees), 1, i + 1)
    # ax.plot(X_test, true_fn(X_test) - pipeline.predict(X_test[:, np.newaxis]), '.b', markersize=5, label="Samples")
    # ax.set_xlim((0, 1))
    # ax.set_ylim((-1, 1))
    # plt.xlabel("x")
    # plt.ylabel("Test Residuals")
    # plt.tight_layout()

from sklearn.linear_model import Ridge

degrees = np.arange(3, 50)
all_train_scores, all_test_scores = list(), list()
all_coefs0, all_coefs = [], []
for i in range(len(degrees)):

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    # linear_regression = RidgeCV()
    linear_regression = Ridge(alpha=0.1)
    # linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    y_pred_train = pipeline.predict(X[:, np.newaxis])

    X_test = np.linspace(0, 1, 100)
    y_test = true_fn(X_test)
    y_pred_test = pipeline.predict(X_test[:, np.newaxis])

    all_train_scores.append(np.sqrt(np.sum((y_pred_train - y)**2)))
    all_test_scores.append(np.sqrt(np.sum((y_pred_test - y_test)**2)))
    all_coefs.append(pipeline.named_steps['lr'].coef_)

plt.figure(figsize=(8, 4))
plt.plot(degrees, all_test_scores)
plt.xlabel('Polynomial Degree')
plt.ylabel('Test Error')

print(degrees)
print(all_test_scores)



"""## What's going on?

+ [Double descent](https://www.pnas.org/doi/abs/10.1073/pnas.1903070116) was first described in 2018!

+ It seemingly violates classical statistics, where too many parameters lead to overfitting and thus rising test error.

+ When $p > n$ (more features than data), the model is overparameterized and the training error can be made arbitrarily small.

+ Unexpectedly, the test error starts decreasing again once $p$ passes the interpolation threshold $p = n$, where the number of features equals the number of data points.

[Figure: header of the double descent PNAS paper, from [Belkin et al 2019](https://www.pnas.org/doi/10.1073/pnas.2003206117)]

[Figure: double descent schematic, Fig. 1 from [Belkin et al 2019](https://www.pnas.org/doi/10.1073/pnas.2003206117)]

[Figure: double descent curves, Fig. 2 from [Belkin et al 2019](https://www.pnas.org/doi/10.1073/pnas.2003206117)]

## How does this happen? The overparameterized regime

+ The least-squares problem is underdetermined, and so there are multiple possible solutions once we pass the interpolation threshold.

+ Given many equivalent solutions, the L2 penalty in `scikit-learn`'s `Ridge` steers the fit toward the solution that minimizes the L2 norm of the coefficients

+ When $p$ gets arbitrarily large, there are more possible solutions to choose from. The one that minimizes the L2 norm of the coefficients ends up being closer to the correct solution, and the test error decreases again.

+ Double descent is a loose argument for why neural networks work so well, despite often having many more parameters than data points. Overparametrization gives them more flexibility to find a good solution.

## Implicit Regularization
"""

plt.plot(degrees, [np.linalg.norm(c) for c in all_coefs])  # L2 norm of the fitted coefficients at each degree
plt.xlabel('Polynomial Degree')
plt.ylabel('L2 norm of coefficients')

# plt.ylim([2.8, 2.9])
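
## A minimal sketch of the implicit-regularization argument above: when there
## are more polynomial features than data points, np.linalg.lstsq returns the
## interpolating solution with the smallest L2 norm.
n_small = 10
np.random.seed(1)
X_small = np.sort(np.random.rand(n_small))
y_small = true_fn(X_small) + np.random.randn(n_small) * 0.1

Phi_over = np.vander(X_small, 40, increasing=True)  # 40 columns >> 10 points
theta_min_norm, *_ = np.linalg.lstsq(Phi_over, y_small, rcond=None)

print("Training residual norm (should be ~0):", np.linalg.norm(Phi_over @ theta_min_norm - y_small))
print("L2 norm of the minimum-norm solution:", np.linalg.norm(theta_min_norm))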