
Commit 90f328a

Author: Ragav Venkatesan (committed)

Ready for lec 1

1 parent 07734aa commit 90f328a

File tree

3 files changed: +77 -17 lines changed


workshop/lec1/analytical_solution.py

Lines changed: 23 additions & 0 deletions
@@ -69,5 +69,28 @@ def plot(self, data = None):
         plt.plot(grid, predictions, 'r')
         plt.show()
 
+class ridge_regressor(regressor):
+    """
+    This is a sample class for lecture 1.
+
+    Args:
+        data: A tuple ``(x, y)``.
+            ``x`` is a one- or two-dimensional ndarray ordered such that axis 0 indexes
+            independent samples and the data is spread along axis 1. If the array has
+            only one dimension, the data is 1D.
+            ``y`` is a 1D ndarray of the same length as axis 0 of ``x``.
+        alpha: Coefficient of the L2 regularizer.
+
+    """
+    def __init__(self, data, alpha = 0.0001):
+        self.x, self.y = data
+        # Training happens here: append a bias column to the data, then solve the
+        # regularized normal equations analytically. Once trained, the parameters
+        # self.w and self.b are available.
+        x = np.concatenate((np.ones((self.x.shape[0], 1)), self.x), axis = 1)
+        w = np.dot(np.linalg.pinv(np.dot(x.T, x) + alpha * np.eye(x.shape[1])), np.dot(x.T, self.y))
+        self.w = w[1:]
+        self.b = w[0]
+
 if __name__ == '__main__':
     pass
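For reference, here is a minimal standalone sketch of the same closed-form ridge computation the new class performs, using only numpy. The toy arrays and the helper name fit_ridge are illustrative and not part of the committed code.

import numpy as np

def fit_ridge(x, y, alpha=0.0001):
    # Append a bias column so the intercept is learned jointly with the weights.
    x = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)
    # Closed-form ridge solution: w = (X^T X + alpha * I)^{-1} X^T y.
    w = np.dot(np.linalg.pinv(np.dot(x.T, x) + alpha * np.eye(x.shape[1])),
               np.dot(x.T, y))
    return w[1:], w[0]   # (weights, bias), mirroring self.w and self.b above

# Toy data: 1D feature (years of experience) against salary.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([40.0, 55.0, 65.0, 80.0])
w, b = fit_ridge(x, y)
print(w, b)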
[Binary file preview, 97.8 KB]

workshop/lec1/lec1.ipynb

Lines changed: 54 additions & 17 deletions
@@ -10,7 +10,7 @@
 "\n",
 "## Supervised Learning.\n",
 "\n",
-"Supervised learning is the task of arriving at a mathematical mapping function from the co-variate space to the variate space using a labeled training dataset. The training dataset is a set of co-variate - variate sample mappings. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). Colloquially, various names are used for the co-variates and variates, the most common ones being 'features' and 'labels'.\n",
+"**Supervised learning is the task of arriving at a mathematical mapping function from the co-variate space to the variate space using a labeled training dataset.** The training dataset is a set of co-variate - variate sample mappings. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). Colloquially, various names are used for the co-variates and variates, the most common ones being 'features' and 'labels'.\n",
 "\n",
 "Let us create a relatable and lower-dimensional dataset to study supervised learning. Assume that you are a human resource manager at Amazon and that you are planning to make strategic human resource expansions in your department. While interviewing candidates, you would like to know antecedently how much that candidate’s pay scale is likely to be. In today’s market where data scientists are in strong demand, most candidates have a free-market value they are predisposed to expect. As a data scientist yourself, and following Amazon's tradition of relentlessly relying on data, you could use machine learning to model a future candidate’s potential compensation. Using this knowledge, you can negotiate during the interview. \n",
 "\n",
@@ -22,15 +22,13 @@
 " \\bf{x_n} & y_n \\end{bmatrix},$$\n",
 "where $\\bf{x_i} \\in \\mathbb{R}^d$ is a d-dimensional (vector) sample, each sample represents an existing employee, each dimension of the sample corresponds to an attribute of the employee that is related to their compensation, and $y_i \\in \\mathbb{R}^1$ is the salary of the respective employee. \n",
 "\n",
-"In this dataset, to *learn* is to establish a mapping between the features and the labels. To model the compensation of the employees, consider for now that $x_i \\in \\mathbb{R}^1$ is a one-dimensional feature, perhaps the number of years of experience a candidate has in the field. The provided code has a data simulator that will generate some synthetic data to mimic this scenario. The data might look something like what is generated by the code block below."
+"In this dataset, **to *learn* is to establish a mapping between the features and the labels.** To model the compensation of the employees, consider for now that $x_i \\in \\mathbb{R}^1$ is a one-dimensional feature, perhaps the number of years of experience a candidate has in the field. The provided code has a data simulator that will generate some synthetic data to mimic this scenario. The data might look something like what is generated by the code block below."
 ]
 },
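Editorial aside, not part of the commit: the dataset_generator code itself is not shown in this diff, so here is a hedged sketch of how such synthetic experience-versus-salary data could be produced. The linear coefficients and noise level are arbitrary assumptions, not the simulator's actual values.

import numpy as np

rng = np.random.RandomState(0)

def make_salary_data(samples=40):
    # Hypothetical simulator: salary grows roughly linearly with years of experience.
    x = rng.uniform(0.0, 15.0, size=(samples, 1))               # years of experience
    y = 40.0 + 6.0 * x[:, 0] + rng.normal(0.0, 5.0, samples)    # salary in thousands, plus noise
    return x, y

x_train, y_train = make_salary_data(samples=40)
print(x_train.shape, y_train.shape)   # (40, 1) (40,)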
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from dataset import dataset_generator\n",
@@ -49,25 +47,25 @@
 "\n",
 "## Least Squares Linear Regression.\n",
 "\n",
-"Let us posit that the experience of the candidates and their compensation are linearly related. What this means is that we are deciding that the relationship between the candidates’ experience and their salaries is captured by a straight line. With this assumption, we are limiting the architecture of our problem to linear models and have converted our problem into a linear regression problem. Essentially, if our data is $x \\in \\mathbb{R}^1$, then our prediction is, \n",
+"Let us posit that the experience of the candidates and their compensation are **linearly related**. What this means is that we are deciding that the relationship between the candidates’ experience and their salaries is captured by a straight line. With this assumption, we are limiting the architecture of our problem to linear models and have converted our problem into a linear regression problem. Essentially, if our data is $x \\in \\mathbb{R}^1$, then our prediction is, \n",
 "$$ \\hat{y} = w_1x + b.$$\n",
 "If $\\bf{x} \\in \\mathbb{R}^d $, then \n",
 "$$ \\hat{y} = \\sum_{i=1}^d w_ix^i + b.$$\n",
 "\n",
 "To know how good our predictions are, we need some metric to measure our errors. Consider the root-mean-squared error or the RMSE,\n",
 "$$ e_i(\\bf{w}) = \\vert \\vert \\hat{y_i} - y_i \\vert \\vert_2, $$\n",
-"which will tell us how *far* away our prediction $\\hat{y_i}$ is from the actual value $y_i, \\forall i \\in [0,n]$ in the Euclidean sense. For our entire dataset, we can have a cumulative error defined as,\n",
+"which will tell us **how *far* away our prediction $\\hat{y_i}$ is from the actual value $y_i, \\forall i \\in [0,n]$ in the Euclidean sense**. For our entire dataset, we can have a cumulative error defined as,\n",
 "$$e(\\bf{w}) = \\sum_{i=1}^n \\vert \\vert y_i - \\hat{y_i} \\vert \\vert_2,$$\n",
 "or,\n",
 "$$ e(\\bf{w}) = \\sum_{i=1}^n \\vert \\vert y_i - (w^Tx_i + b) \\vert \\vert_2.$$\n",
 "\n",
-"This error is often referred to as the objective. This is what we want to minimize. We want those parameters $w$ that make $e(w)$ as low as possible. Formally, we want,\n",
+"This error is often referred to as the objective. This is what we want to **minimize**. We want those parameters $w$ that make $e(w)$ as low as possible. Formally, we want,\n",
 "$$ \\hat{w} = \\arg\\min_w e(w). $$\n",
 "We can derive a solution for this optimization problem analytically.\n",
 "$$ e(w) = \\frac{1}{2}(y-w^TX)^T(y-w^TX),$$\n",
 "$$\\frac{\\partial e}{\\partial w} = -X^Ty + X^TXw,$$\n",
 "equating this to zero to obtain the minima, we get,\n",
-"$$X^TXw = X^TX,$$\n",
+"$$X^TXw = X^Ty,$$\n",
 "$$\\hat{w} = (X^TX)^{-1}X^Ty.$$\n",
 "$\\hat{w}$ will give us the minimum possible error and this solution is called the analytical solution.\n",
 "\n",
@@ -93,7 +91,7 @@
 "from analytical_solution import regressor\n",
 "data_train = dataset.query_data(samples = 40) # Create a training dataset. \n",
 "r = regressor(data_train) # This call should return a regressor object that is fully trained.\n",
-"params = r.get_params() # This call should return parameters of the model that are \n",
+"reg_params = r.get_params() # This call should return parameters of the model that are \n",
 "                            # fully trained."
 ]
 },
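For readers following along outside the notebook, here is a small self-contained numpy sketch (not part of the commit) of the analytical least-squares solution the regressor class is expected to compute, together with an rmse helper consistent with the error definition above. The toy numbers are illustrative.

import numpy as np

def rmse(targets, predictions):
    # Root-mean-squared error between true values and predictions.
    return np.sqrt(np.mean((np.asarray(targets) - np.asarray(predictions)) ** 2))

# Toy 1D data: years of experience vs. salary (in thousands).
x = np.array([[1.0], [2.0], [3.0], [5.0], [8.0]])
y = np.array([45.0, 52.0, 58.0, 70.0, 90.0])

# Append a bias column so w_hat = (X^T X)^{-1} X^T y also recovers the intercept b.
X = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # same as the pseudo-inverse solution when X^T X is invertible

b, w = w_hat[0], w_hat[1:]
print("w =", w, "b =", b)
print("training RMSE =", rmse(y, X @ w_hat))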
@@ -107,9 +105,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from errors import rmse\n",
@@ -140,15 +136,56 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can clearly see here that our simple model works pretty well. Although an analytical solution exists for this simple linear model, we will find that for more complex problem structures we have to rely on the optimization procedures described in later lectures."
+"We can clearly see here that our simple model works pretty well. Although an analytical solution exists for this simple linear model, we will find that for more complex problem structures we have to rely on the optimization procedures described in later lectures.\n",
+"\n",
+"## Ridge Regression.\n",
+"\n",
+"We used ``numpy.linalg.pinv`` to solve this problem. We did this because **$X^TX$ is not always invertible**. What can we do in our analytical solution to make it invertible? One thing that can be done to make the solution more stable is to ensure that the diagonal elements of $X^TX$ behave nicely. Consider the following analytical solution for $\\hat{w}$,\n",
+"$$\\hat{w} = (X^TX + \\alpha_2I)^{-1}X^Ty.$$\n",
+"You can be quite sure that this will give a reasonably good solution. But what is this a solution for? \n",
+"Consider the error function,\n",
+"$$e(w)=(y-w^Tx)^T(y-w^Tx) + \\alpha_2w^Tw.$$\n",
+"Now,\n",
+"$$\\frac{\\partial e}{\\partial w} = \\frac{\\partial}{\\partial w} ( w^Tx^Txw - 2y^Txw + y^Ty + \\alpha_2w^Tw),$$\n",
+"$$ = 2x^Txw - 2x^Ty + 2\\alpha_2w,$$\n",
+"$$ = 2(x^Tx + \\alpha_2I)w - 2x^Ty,$$\n",
+"which, when equated to zero to obtain the minima, gives,\n",
+"$$(x^Tx + \\alpha_2I)w = x^Ty,$$\n",
+"$$\\hat{w} = (X^TX + \\alpha_2I)^{-1}X^Ty.$$"
+]
+},
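As an aside (not part of the commit), a quick numpy check illustrates why the alpha term helps: with a duplicated feature column, X^T X is singular, but X^T X + alpha * I is invertible and the ridge solution is well defined. The data here is made up for the demonstration.

import numpy as np

rng = np.random.RandomState(0)
x1 = rng.rand(10, 1)
X = np.concatenate((x1, x1), axis=1)        # two identical columns, so X^T X is singular
y = 3.0 * x1[:, 0] + rng.normal(0.0, 0.1, 10)

print(np.linalg.matrix_rank(X.T @ X))       # rank 1, not 2: a plain inverse would fail
alpha = 1e-4
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)
print(w_ridge)                              # finite weights, split across the duplicate columns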
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from analytical_solution import ridge_regressor\n",
+"data_train = dataset.query_data(samples = 40) # Create a training dataset. \n",
+"r = ridge_regressor(data_train, alpha = 0.0001) # This call should return a regressor object that is fully trained.\n",
+"ridge_params = r.get_params() # This call should return parameters of the model that are \n",
+"                              # fully trained.\n",
+"data_test = dataset.query_data(samples = 40) # Create a random testing dataset.\n",
+"predictions = r.get_predictions(data_test[0]) # This call should return predictions.\n",
+"print (\"Rmse error of predictions = \" + str(rmse(data_test[1], predictions)))"
+]
+},
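The get_predictions method is not part of this diff; assuming it lives on the parent regressor class, a plausible implementation consistent with the trained self.w and self.b would be the linear map sketched below. This is an assumption for illustration, not the repository's actual code.

import numpy as np

def get_predictions(x, w, b):
    # Hypothetical stand-in for regressor.get_predictions: y_hat = x w + b.
    return np.dot(x, w) + b

# Example with made-up trained parameters.
w = np.array([6.0])
b = 40.0
x_test = np.array([[2.0], [4.5], [10.0]])
print(get_predictions(x_test, w, b))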
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Geometry of the $L_2$ regularizer\n",
+"![Geometry of L2 Regularization](figures/regularization.png)\n",
+"\n",
+"The errors we use above are squared errors. With that in mind, if we drew out the error in parameter space, we would get an error surface shaped like a *bowl*. At $\\alpha_2 = 0$, the solution sits at the center of that bowl. There are several reasons why we might not prefer that. For instance, smaller weights imply that our weights are stable. Later, we will notice that smaller weights help us with noisy data, or can even be used to enforce sparsity. We will also see other types of regularizers in later lectures."
 ]
 }
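To connect with the figure (again, an editorial sketch rather than notebook content), one can watch the norm of the ridge weights shrink toward zero as the regularization coefficient grows, which is exactly the pull toward the origin that the geometry illustrates. The data is synthetic and the coefficients are arbitrary.

import numpy as np

rng = np.random.RandomState(1)
X = rng.rand(50, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, 50)

for alpha in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
    print(alpha, np.linalg.norm(w))   # the weight norm decreases as alpha increases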
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "conda_python3",
+"display_name": "Python 3",
 "language": "python",
-"name": "conda_python3"
+"name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
@@ -160,7 +197,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.2"
+"version": "3.6.3"
 }
 },
 "nbformat": 4,
