diff --git a/BatchNormalization.ipynb b/BatchNormalization.ipynb new file mode 100644 index 0000000..e32ddff --- /dev/null +++ b/BatchNormalization.ipynb @@ -0,0 +1,574 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Batch Normalization\n", + "One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization which was recently proposed by [3].\n", + "\n", + "The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.\n", + "\n", + "The authors of [3] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [3] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.\n", + "\n", + "It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.\n", + "\n", + "[3] Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n", + "Internal Covariate Shift\", ICML 2015." 
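In symbols, the transformation described above (following [3]) is, for a minibatch of $N$ values $x_1, \dots, x_N$ of a single feature:

$$\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta,$$

where $\epsilon$ is a small constant for numerical stability and $\gamma$, $\beta$ are the learnable scale and shift. During training the layer also maintains running estimates, e.g. $\mu_{\text{running}} \leftarrow \alpha\,\mu_{\text{running}} + (1 - \alpha)\,\mu_B$ for some momentum $\alpha$ (the exact momentum value is an implementation detail, not something fixed by this notebook), and at test time these running estimates are used in place of the batch statistics.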
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "from __future__ import print_function\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.items():\n", + " print('%s: ' % k, v.shape)" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Batch normalization: Forward\n", + "In the file `cs231n/layers.py`, implement the batch normalization forward pass in the function `batchnorm_forward`. Once you have done so, run the following to test your implementation." 
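For orientation, the training-time branch might look roughly like the sketch below; the function name, the momentum default, and the exact cache contents are assumptions of this sketch rather than requirements of the assignment.

```python
import numpy as np

def batchnorm_forward_train_sketch(x, gamma, beta, bn_param):
    """Training-time batch normalization for x of shape (N, D)."""
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    D = x.shape[1]
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    # Per-feature statistics over the minibatch dimension.
    mu = x.mean(axis=0)
    var = x.var(axis=0)

    # Normalize, then apply the learnable scale and shift.
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta

    # Exponentially decaying running averages, used at test time.
    bn_param['running_mean'] = momentum * running_mean + (1 - momentum) * mu
    bn_param['running_var'] = momentum * running_var + (1 - momentum) * var

    cache = (x, x_hat, mu, var, gamma, eps)
    return out, cache
```

The test-time branch simply reuses the running mean and variance in place of `mu`/`var` and skips the running-average update.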
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after batch normalization\n", + "\n", + "# Simulate the forward pass for a two-layer network\n", + "np.random.seed(231)\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "X = np.random.randn(N, D1)\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "\n", + "print('Before batch normalization:')\n", + "print(' means: ', a.mean(axis=0))\n", + "print(' stds: ', a.std(axis=0))\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "print('After batch normalization (gamma=1, beta=0)')\n", + "a_norm, _ = batchnorm_forward(a, np.ones(D3), np.zeros(D3), {'mode': 'train'})\n", + "print(' mean: ', a_norm.mean(axis=0))\n", + "print(' std: ', a_norm.std(axis=0))\n", + "\n", + "# Now means should be close to beta and stds close to gamma\n", + "gamma = np.asarray([1.0, 2.0, 3.0])\n", + "beta = np.asarray([11.0, 12.0, 13.0])\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n", + "print('After batch normalization (nontrivial gamma, beta)')\n", + "print(' means: ', a_norm.mean(axis=0))\n", + "print(' stds: ', a_norm.std(axis=0))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "np.random.seed(231)\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = np.ones(D3)\n", + "beta = np.zeros(D3)\n", + "for t in range(50):\n", + " X = np.random.randn(N, D1)\n", + " a = np.maximum(0, X.dot(W1)).dot(W2)\n", + " batchnorm_forward(a, gamma, beta, bn_param)\n", + "bn_param['mode'] = 'test'\n", + "X = np.random.randn(N, D1)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print('After batch normalization (test-time):')\n", + "print(' means: ', a_norm.mean(axis=0))\n", + "print(' stds: ', a_norm.std(axis=0))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Batch Normalization: backward\n", + "Now implement the backward pass for batch normalization in the function `batchnorm_backward`.\n", + "\n", + "To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.\n", + "\n", + "Once you have finished, run the following to numerically check your backward pass." 
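As one concrete way to organize that backprop (assuming a cache laid out like the forward sketch earlier; the intermediate variable names are arbitrary and your own staging may differ):

```python
import numpy as np

def batchnorm_backward_sketch(dout, cache):
    x, x_hat, mu, var, gamma, eps = cache
    N = x.shape[0]
    std = np.sqrt(var + eps)

    # Gradients of the learnable shift and scale.
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)

    # Backprop through out = gamma * x_hat + beta.
    dx_hat = dout * gamma

    # x_hat = (x - mu) / std sends gradient into (x - mu) along two
    # branches: one direct, one through the variance. Sum them.
    dxmu_direct = dx_hat / std
    dstd = -np.sum(dx_hat * (x - mu), axis=0) / (std ** 2)
    dvar = 0.5 * dstd / std
    dxmu_via_var = 2.0 * (x - mu) * dvar / N

    # mu = mean(x): its gradient is spread evenly back over the batch.
    dmu = -np.sum(dxmu_direct + dxmu_via_var, axis=0)
    dx = dxmu_direct + dxmu_via_var + dmu / N
    return dx, dgamma, dbeta
```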
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Gradient check batchnorm backward pass\n", + "np.random.seed(231)\n", + "N, D = 4, 5\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fg = lambda a: batchnorm_forward(x, a, beta, bn_param)[0]\n", + "fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n", + "\n", + "_, cache = batchnorm_forward(x, gamma, beta, bn_param)\n", + "dx, dgamma, dbeta = batchnorm_backward(dout, cache)\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dgamma error: ', rel_error(da_num, dgamma))\n", + "print('dbeta error: ', rel_error(db_num, dbeta))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Batch Normalization: alternative backward (OPTIONAL, +3 points extra credit)\n", + "In class we talked about two different implementations for the sigmoid backward pass. One strategy is to write out a computation graph composed of simple operations and backprop through all intermediate values. Another strategy is to work out the derivatives on paper. For the sigmoid function, it turns out that you can derive a very simple formula for the backward pass by simplifying gradients on paper.\n", + "\n", + "Surprisingly, it turns out that you can also derive a simple expression for the batch normalization backward pass if you work out derivatives on paper and simplify. After doing so, implement the simplified batch normalization backward pass in the function `batchnorm_backward_alt` and compare the two implementations by running the following. Your two implementations should compute nearly identical results, but the alternative implementation should be a bit faster.\n", + "\n", + "NOTE: This part of the assignment is entirely optional, but we will reward 3 points of extra credit if you can complete it." 
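For reference, working the gradients out on paper typically leads to expressions of roughly this shape, where $\hat{x}_i$ is the normalized input, $N$ the batch size, and $\sigma_B^2$ the batch variance:

$$\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}, \qquad \frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i}\,\hat{x}_i,$$

$$\frac{\partial L}{\partial x_i} = \frac{\gamma}{N\sqrt{\sigma_B^2 + \epsilon}}\left(N\,\frac{\partial L}{\partial y_i} - \sum_j \frac{\partial L}{\partial y_j} - \hat{x}_i \sum_j \frac{\partial L}{\partial y_j}\,\hat{x}_j\right).$$

Implementing this directly avoids materializing the intermediate arrays of the staged graph version, which is where the speedup comes from; deriving and verifying it yourself is the point of the exercise.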
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "N, D = 100, 500\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "out, cache = batchnorm_forward(x, gamma, beta, bn_param)\n", + "\n", + "t1 = time.time()\n", + "dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)\n", + "t2 = time.time()\n", + "dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)\n", + "t3 = time.time()\n", + "\n", + "print('dx difference: ', rel_error(dx1, dx2))\n", + "print('dgamma difference: ', rel_error(dgamma1, dgamma2))\n", + "print('dbeta difference: ', rel_error(dbeta1, dbeta2))\n", + "print('speedup: %.2fx' % ((t2 - t1) / (t3 - t2)))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Fully Connected Nets with Batch Normalization\n", + "Now that you have a working implementation for batch normalization, go back to your `FullyConnectedNet` in the file `cs231n/classifiers/fc_net.py`. Modify your implementation to add batch normalization.\n", + "\n", + "Concretely, when the flag `use_batchnorm` is `True` in the constructor, you should insert a batch normalization layer before each ReLU nonlinearity. The outputs from the last layer of the network should not be normalized. Once you are done, run the following to gradient-check your implementation.\n", + "\n", + "HINT: You might find it useful to define an additional helper layer similar to those in the file `cs231n/layer_utils.py`. If you decide to do so, do it in the file `cs231n/classifiers/fc_net.py`." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print('Running check with reg = ', reg)\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64,\n", + " use_batchnorm=True)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print('Initial loss: ', loss)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n", + " if reg == 0: print()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Batchnorm for deep networks\n", + "Run the following to train a six-layer network on a subset of 1000 training examples both with and without batch normalization."
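(Relating to the HINT in the previous section: the helper layer people usually write is an affine–batchnorm–ReLU "sandwich". A minimal sketch is below, assuming the `affine_*`, `batchnorm_*`, and `relu_*` functions from `cs231n/layers.py` with their usual out/cache signatures; the helper's name and cache layout are made up for illustration.)

```python
def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
    # Affine transform -> batch normalization -> ReLU.
    a, fc_cache = affine_forward(x, w, b)
    an, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    out, relu_cache = relu_forward(an)
    return out, (fc_cache, bn_cache, relu_cache)

def affine_bn_relu_backward(dout, cache):
    # Undo the three stages in reverse order.
    fc_cache, bn_cache, relu_cache = cache
    dan = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward(dan, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta
```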
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [100, 100, 100, 100, 100]\n", + "\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 2e-2\n", + "bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)\n", + "model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)\n", + "\n", + "bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=200)\n", + "bn_solver.train()\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=200)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "Run the following to visualize the results from the two networks trained above. You should find that using batch normalization helps the network to converge much faster." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.plot(solver.loss_history, 'o', label='baseline')\n", + "plt.plot(bn_solver.loss_history, 'o', label='batchnorm')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.plot(solver.train_acc_history, '-o', label='baseline')\n", + "plt.plot(bn_solver.train_acc_history, '-o', label='batchnorm')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.plot(solver.val_acc_history, '-o', label='baseline')\n", + "plt.plot(bn_solver.val_acc_history, '-o', label='batchnorm')\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Batch normalization and initialization\n", + "We will now run a small experiment to study the interaction of batch normalization and weight initialization.\n", + "\n", + "The first cell will train 8-layer networks both with and without batch normalization using different scales for weight initialization. The second cell will plot training accuracy, validation set accuracy, and training loss as a function of the weight initialization scale."
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [50, 50, 50, 50, 50, 50, 50]\n", + "\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "bn_solvers = {}\n", + "solvers = {}\n", + "weight_scales = np.logspace(-4, 0, num=20)\n", + "for i, weight_scale in enumerate(weight_scales):\n", + " print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))\n", + " bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)\n", + " model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)\n", + "\n", + " bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=False, print_every=200)\n", + " bn_solver.train()\n", + " bn_solvers[weight_scale] = bn_solver\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=False, print_every=200)\n", + " solver.train()\n", + " solvers[weight_scale] = solver" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Plot results of weight scale experiment\n", + "best_train_accs, bn_best_train_accs = [], []\n", + "best_val_accs, bn_best_val_accs = [], []\n", + "final_train_loss, bn_final_train_loss = [], []\n", + "\n", + "for ws in weight_scales:\n", + " best_train_accs.append(max(solvers[ws].train_acc_history))\n", + " bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))\n", + " \n", + " best_val_accs.append(max(solvers[ws].val_acc_history))\n", + " bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))\n", + " \n", + " final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))\n", + " bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))\n", + " \n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Best val accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best val accuracy')\n", + "plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Best train accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best training accuracy')\n", + "plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')\n", + "plt.legend()\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Final training loss vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Final training loss')\n", + "plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')\n", + "plt.legend()\n", + 
"plt.gca().set_ylim(1.0, 3.5)\n", + "\n", + "plt.gcf().set_size_inches(10, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Question:\n", + "Describe the results of this experiment, and try to give a reason why the experiment gave the results that it did." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Answer:\n" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.12+", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/ConvolutionalNetworks.ipynb b/ConvolutionalNetworks.ipynb new file mode 100644 index 0000000..7d88440 --- /dev/null +++ b/ConvolutionalNetworks.ipynb @@ -0,0 +1,918 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Convolutional Networks\n", + "So far we have worked with deep fully-connected networks, using them to explore different optimization strategies and network architectures. Fully-connected networks are a good testbed for experimentation because they are very computationally efficient, but in practice all state-of-the-art results use convolutional networks instead.\n", + "\n", + "First you will implement several layer types that are used in convolutional networks. You will then use these layers to train a convolutional network on the CIFAR-10 dataset." 
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "from __future__ import print_function\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.cnn import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient_array, eval_numerical_gradient\n", + "from cs231n.layers import *\n", + "from cs231n.fast_layers import *\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.items():\n", + " print('%s: ' % k, v.shape)" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Convolution: Naive forward pass\n", + "The core of a convolutional network is the convolution operation. In the file `cs231n/layers.py`, implement the forward pass for the convolution layer in the function `conv_forward_naive`. 
\n", + "\n", + "You don't have to worry too much about efficiency at this point; just write the code in whatever way you find most clear.\n", + "\n", + "You can test your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "w_shape = (3, 3, 4, 4)\n", + "x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)\n", + "w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)\n", + "b = np.linspace(-0.1, 0.2, num=3)\n", + "\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "out, _ = conv_forward_naive(x, w, b, conv_param)\n", + "correct_out = np.array([[[[-0.08759809, -0.10987781],\n", + " [-0.18387192, -0.2109216 ]],\n", + " [[ 0.21027089, 0.21661097],\n", + " [ 0.22847626, 0.23004637]],\n", + " [[ 0.50813986, 0.54309974],\n", + " [ 0.64082444, 0.67101435]]],\n", + " [[[-0.98053589, -1.03143541],\n", + " [-1.19128892, -1.24695841]],\n", + " [[ 0.69108355, 0.66880383],\n", + " [ 0.59480972, 0.56776003]],\n", + " [[ 2.36270298, 2.36904306],\n", + " [ 2.38090835, 2.38247847]]]])\n", + "\n", + "# Compare your output to ours; difference should be around 2e-8\n", + "print('Testing conv_forward_naive')\n", + "print('difference: ', rel_error(out, correct_out))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Aside: Image processing via convolutions\n", + "\n", + "As a fun way to both check your implementation and gain a better understanding of the type of operation that convolutional layers can perform, we will set up an input containing two images and manually set up filters that perform common image processing operations (grayscale conversion and edge detection). The convolution forward pass will apply these operations to each of the input images. We can then visualize the results as a sanity check." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from scipy.misc import imread, imresize\n", + "\n", + "kitten, puppy = imread('kitten.jpg'), imread('puppy.jpg')\n", + "# kitten is wide, and puppy is already square\n", + "d = kitten.shape[1] - kitten.shape[0]\n", + "kitten_cropped = kitten[:, d//2:-d//2, :]\n", + "\n", + "img_size = 200 # Make this smaller if it runs too slow\n", + "x = np.zeros((2, 3, img_size, img_size))\n", + "x[0, :, :, :] = imresize(puppy, (img_size, img_size)).transpose((2, 0, 1))\n", + "x[1, :, :, :] = imresize(kitten_cropped, (img_size, img_size)).transpose((2, 0, 1))\n", + "\n", + "# Set up convolutional weights holding 2 filters, each 3x3\n", + "w = np.zeros((2, 3, 3, 3))\n", + "\n", + "# The first filter converts the image to grayscale.\n", + "# Set up the red, green, and blue channels of the filter.\n", + "w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]\n", + "w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]\n", + "w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]\n", + "\n", + "# Second filter detects horizontal edges in the blue channel.\n", + "w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]\n", + "\n", + "# Vector of biases. 
We don't need any bias for the grayscale\n", + "# filter, but for the edge detection filter we want to add 128\n", + "# to each output so that nothing is negative.\n", + "b = np.array([0, 128])\n", + "\n", + "# Compute the result of convolving each input in x with each filter in w,\n", + "# offsetting by b, and storing the results in out.\n", + "out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})\n", + "\n", + "def imshow_noax(img, normalize=True):\n", + " \"\"\" Tiny helper to show images as uint8 and remove axis labels \"\"\"\n", + " if normalize:\n", + " img_max, img_min = np.max(img), np.min(img)\n", + " img = 255.0 * (img - img_min) / (img_max - img_min)\n", + " plt.imshow(img.astype('uint8'))\n", + " plt.gca().axis('off')\n", + "\n", + "# Show the original images and the results of the conv operation\n", + "plt.subplot(2, 3, 1)\n", + "imshow_noax(puppy, normalize=False)\n", + "plt.title('Original image')\n", + "plt.subplot(2, 3, 2)\n", + "imshow_noax(out[0, 0])\n", + "plt.title('Grayscale')\n", + "plt.subplot(2, 3, 3)\n", + "imshow_noax(out[0, 1])\n", + "plt.title('Edges')\n", + "plt.subplot(2, 3, 4)\n", + "imshow_noax(kitten_cropped, normalize=False)\n", + "plt.subplot(2, 3, 5)\n", + "imshow_noax(out[1, 0])\n", + "plt.subplot(2, 3, 6)\n", + "imshow_noax(out[1, 1])\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Convolution: Naive backward pass\n", + "Implement the backward pass for the convolution operation in the function `conv_backward_naive` in the file `cs231n/layers.py`. Again, you don't need to worry too much about computational efficiency.\n", + "\n", + "When you are done, run the following to check your backward pass with a numeric gradient check." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(4, 3, 5, 5)\n", + "w = np.random.randn(2, 3, 3, 3)\n", + "b = np.random.randn(2,)\n", + "dout = np.random.randn(4, 2, 5, 5)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "out, cache = conv_forward_naive(x, w, b, conv_param)\n", + "dx, dw, db = conv_backward_naive(dout, cache)\n", + "\n", + "# Your errors should be around 1e-8'\n", + "print('Testing conv_backward_naive function')\n", + "print('dx error: ', rel_error(dx, dx_num))\n", + "print('dw error: ', rel_error(dw, dw_num))\n", + "print('db error: ', rel_error(db, db_num))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Max pooling: Naive forward\n", + "Implement the forward pass for the max-pooling operation in the function `max_pool_forward_naive` in the file `cs231n/layers.py`. 
Again, don't worry too much about computational efficiency.\n", + "\n", + "Check your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)\n", + "pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}\n", + "\n", + "out, _ = max_pool_forward_naive(x, pool_param)\n", + "\n", + "correct_out = np.array([[[[-0.26315789, -0.24842105],\n", + " [-0.20421053, -0.18947368]],\n", + " [[-0.14526316, -0.13052632],\n", + " [-0.08631579, -0.07157895]],\n", + " [[-0.02736842, -0.01263158],\n", + " [ 0.03157895, 0.04631579]]],\n", + " [[[ 0.09052632, 0.10526316],\n", + " [ 0.14947368, 0.16421053]],\n", + " [[ 0.20842105, 0.22315789],\n", + " [ 0.26736842, 0.28210526]],\n", + " [[ 0.32631579, 0.34105263],\n", + " [ 0.38526316, 0.4 ]]]])\n", + "\n", + "# Compare your output with ours. Difference should be around 1e-8.\n", + "print('Testing max_pool_forward_naive function:')\n", + "print('difference: ', rel_error(out, correct_out))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Max pooling: Naive backward\n", + "Implement the backward pass for the max-pooling operation in the function `max_pool_backward_naive` in the file `cs231n/layers.py`. You don't need to worry about computational efficiency.\n", + "\n", + "Check your implementation with numeric gradient checking by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(3, 2, 8, 8)\n", + "dout = np.random.randn(3, 2, 4, 4)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)\n", + "\n", + "out, cache = max_pool_forward_naive(x, pool_param)\n", + "dx = max_pool_backward_naive(dout, cache)\n", + "\n", + "# Your error should be around 1e-12\n", + "print('Testing max_pool_backward_naive function:')\n", + "print('dx error: ', rel_error(dx, dx_num))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Fast layers\n", + "Making convolution and pooling layers fast can be challenging. To spare you the pain, we've provided fast implementations of the forward and backward passes for convolution and pooling layers in the file `cs231n/fast_layers.py`.\n", + "\n", + "The fast convolution implementation depends on a Cython extension; to compile it you need to run the following from the `cs231n` directory:\n", + "\n", + "```bash\n", + "python setup.py build_ext --inplace\n", + "```\n", + "\n", + "The API for the fast versions of the convolution and pooling layers is exactly the same as the naive versions that you implemented above: the forward pass receives data, weights, and parameters and produces outputs and a cache object; the backward pass receives upstream derivatives and the cache object and produces gradients with respect to the data and weights.\n", + "\n", + "**NOTE:** The fast implementation for pooling will only perform optimally if the pooling regions are non-overlapping and tile the input. 
If these conditions are not met then the fast pooling implementation will not be much faster than the naive implementation.\n", + "\n", + "You can compare the performance of the naive and fast versions of these layers by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.fast_layers import conv_forward_fast, conv_backward_fast\n", + "from time import time\n", + "np.random.seed(231)\n", + "x = np.random.randn(100, 3, 31, 31)\n", + "w = np.random.randn(25, 3, 3, 3)\n", + "b = np.random.randn(25,)\n", + "dout = np.random.randn(100, 25, 16, 16)\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = conv_forward_naive(x, w, b, conv_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = conv_forward_fast(x, w, b, conv_param)\n", + "t2 = time()\n", + "\n", + "print('Testing conv_forward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('Fast: %fs' % (t2 - t1))\n", + "print('Speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('Difference: ', rel_error(out_naive, out_fast))\n", + "\n", + "t0 = time()\n", + "dx_naive, dw_naive, db_naive = conv_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast, dw_fast, db_fast = conv_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print('\\nTesting conv_backward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('Fast: %fs' % (t2 - t1))\n", + "print('Speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('dx difference: ', rel_error(dx_naive, dx_fast))\n", + "print('dw difference: ', rel_error(dw_naive, dw_fast))\n", + "print('db difference: ', rel_error(db_naive, db_fast))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.fast_layers import max_pool_forward_fast, max_pool_backward_fast\n", + "np.random.seed(231)\n", + "x = np.random.randn(100, 3, 32, 32)\n", + "dout = np.random.randn(100, 3, 16, 16)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = max_pool_forward_naive(x, pool_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = max_pool_forward_fast(x, pool_param)\n", + "t2 = time()\n", + "\n", + "print('Testing pool_forward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('fast: %fs' % (t2 - t1))\n", + "print('speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('difference: ', rel_error(out_naive, out_fast))\n", + "\n", + "t0 = time()\n", + "dx_naive = max_pool_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast = max_pool_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print('\\nTesting pool_backward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('dx difference: ', rel_error(dx_naive, dx_fast))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Convolutional \"sandwich\" layers\n", + "Previously we introduced the concept of \"sandwich\" layers that combine multiple operations into commonly used patterns. In the file `cs231n/layer_utils.py` you will find sandwich layers that implement a few commonly used patterns for convolutional networks." 
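For intuition, a sandwich layer is typically just a thin wrapper that chains the individual forward passes and replays them in reverse on the way back. A rough sketch of the conv–ReLU–pool pattern, assuming the fast conv/pool functions benchmarked above and the `relu_forward`/`relu_backward` helpers from `cs231n/layers.py` (the provided `layer_utils.py` may organize its caches differently):

```python
def conv_relu_pool_forward_sketch(x, w, b, conv_param, pool_param):
    # Convolution -> ReLU -> max pool, caching what each stage needs.
    a, conv_cache = conv_forward_fast(x, w, b, conv_param)
    s, relu_cache = relu_forward(a)
    out, pool_cache = max_pool_forward_fast(s, pool_param)
    return out, (conv_cache, relu_cache, pool_cache)

def conv_relu_pool_backward_sketch(dout, cache):
    # Replay the three stages in reverse order.
    conv_cache, relu_cache, pool_cache = cache
    ds = max_pool_backward_fast(dout, pool_cache)
    da = relu_backward(ds, relu_cache)
    dx, dw, db = conv_backward_fast(da, conv_cache)
    return dx, dw, db
```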
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import conv_relu_pool_forward, conv_relu_pool_backward\n", + "np.random.seed(231)\n", + "x = np.random.randn(2, 3, 16, 16)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "out, cache = conv_relu_pool_forward(x, w, b, conv_param, pool_param)\n", + "dx, dw, db = conv_relu_pool_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], b, dout)\n", + "\n", + "print('Testing conv_relu_pool')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import conv_relu_forward, conv_relu_backward\n", + "np.random.seed(231)\n", + "x = np.random.randn(2, 3, 8, 8)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "out, cache = conv_relu_forward(x, w, b, conv_param)\n", + "dx, dw, db = conv_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_forward(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_forward(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_forward(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "print('Testing conv_relu:')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Three-layer ConvNet\n", + "Now that you have implemented all the necessary layers, we can put them together into a simple convolutional network.\n", + "\n", + "Open the file `cs231n/classifiers/cnn.py` and complete the implementation of the `ThreeLayerConvNet` class. Run the following cells to help you debug:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Sanity check loss\n", + "After you build a new network, one of the first things you should do is sanity check the loss. When we use the softmax loss, we expect the loss for random weights (and no regularization) to be about `log(C)` for `C` classes. When we add regularization this should go up." 
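As a quick worked example: CIFAR-10 has $C = 10$ classes, so with effectively random initial scores the softmax loss should start near

$$-\log\frac{1}{10} = \log 10 \approx 2.3,$$

and the value printed with regularization turned on should come out somewhat larger.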
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = ThreeLayerConvNet()\n", + "\n", + "N = 50\n", + "X = np.random.randn(N, 3, 32, 32)\n", + "y = np.random.randint(10, size=N)\n", + "\n", + "loss, grads = model.loss(X, y)\n", + "print('Initial loss (no regularization): ', loss)\n", + "\n", + "model.reg = 0.5\n", + "loss, grads = model.loss(X, y)\n", + "print('Initial loss (with regularization): ', loss)" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Gradient check\n", + "After the loss looks reasonable, use numeric gradient checking to make sure that your backward pass is correct. When you use numeric gradient checking you should use a small amount of artificial data and a small number of neurons at each layer. Note: correct implementations may still have relative errors up to 1e-2." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_inputs = 2\n", + "input_dim = (3, 16, 16)\n", + "reg = 0.0\n", + "num_classes = 10\n", + "np.random.seed(231)\n", + "X = np.random.randn(num_inputs, *input_dim)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "model = ThreeLayerConvNet(num_filters=3, filter_size=3,\n", + " input_dim=input_dim, hidden_dim=7,\n", + " dtype=np.float64)\n", + "loss, grads = model.loss(X, y)\n", + "for param_name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n", + " e = rel_error(param_grad_num, grads[param_name])\n", + " print('%s max relative error: %e' % (param_name, e))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Overfit small data\n", + "A nice trick is to train your model with just a few training samples. You should be able to overfit small datasets, which will result in very high training accuracy and comparatively low validation accuracy."
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "\n", + "num_train = 100\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "model = ThreeLayerConvNet(weight_scale=1e-2)\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=15, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=1)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "Plotting the loss, training accuracy, and validation accuracy should show clear overfitting:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "plt.subplot(2, 1, 1)\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('iteration')\n", + "plt.ylabel('loss')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.plot(solver.train_acc_history, '-o')\n", + "plt.plot(solver.val_acc_history, '-o')\n", + "plt.legend(['train', 'val'], loc='upper left')\n", + "plt.xlabel('epoch')\n", + "plt.ylabel('accuracy')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Train the net\n", + "By training the three-layer convolutional network for one epoch, you should achieve greater than 40% accuracy on the training set:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)\n", + "\n", + "solver = Solver(model, data,\n", + " num_epochs=1, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=20)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Visualize Filters\n", + "You can visualize the first-layer convolutional filters from the trained network by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.vis_utils import visualize_grid\n", + "\n", + "grid = visualize_grid(model.params['W1'].transpose(0, 2, 3, 1))\n", + "plt.imshow(grid.astype('uint8'))\n", + "plt.axis('off')\n", + "plt.gcf().set_size_inches(5, 5)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Spatial Batch Normalization\n", + "We already saw that batch normalization is a very useful technique for training deep fully-connected networks. Batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called \"spatial batch normalization.\"\n", + "\n", + "Normally batch-normalization accepts inputs of shape `(N, D)` and produces outputs of shape `(N, D)`, where we normalize across the minibatch dimension `N`. 
For data coming from convolutional layers, batch normalization needs to accept inputs of shape `(N, C, H, W)` and produce outputs of shape `(N, C, H, W)` where the `N` dimension gives the minibatch size and the `(H, W)` dimensions give the spatial size of the feature map.\n", + "\n", + "If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the `C` feature channels by computing statistics over both the minibatch dimension `N` and the spatial dimensions `H` and `W`." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Spatial batch normalization: forward\n", + "\n", + "In the file `cs231n/layers.py`, implement the forward pass for spatial batch normalization in the function `spatial_batchnorm_forward`. Check your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after spatial batch normalization\n", + "\n", + "N, C, H, W = 2, 3, 4, 5\n", + "x = 4 * np.random.randn(N, C, H, W) + 10\n", + "\n", + "print('Before spatial batch normalization:')\n", + "print(' Shape: ', x.shape)\n", + "print(' Means: ', x.mean(axis=(0, 2, 3)))\n", + "print(' Stds: ', x.std(axis=(0, 2, 3)))\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "gamma, beta = np.ones(C), np.zeros(C)\n", + "bn_param = {'mode': 'train'}\n", + "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "print('After spatial batch normalization:')\n", + "print(' Shape: ', out.shape)\n", + "print(' Means: ', out.mean(axis=(0, 2, 3)))\n", + "print(' Stds: ', out.std(axis=(0, 2, 3)))\n", + "\n", + "# Means should be close to beta and stds close to gamma\n", + "gamma, beta = np.asarray([3, 4, 5]), np.asarray([6, 7, 8])\n", + "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "print('After spatial batch normalization (nontrivial gamma, beta):')\n", + "print(' Shape: ', out.shape)\n", + "print(' Means: ', out.mean(axis=(0, 2, 3)))\n", + "print(' Stds: ', out.std(axis=(0, 2, 3)))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "N, C, H, W = 10, 4, 11, 12\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = np.ones(C)\n", + "beta = np.zeros(C)\n", + "for t in range(50):\n", + " x = 2.3 * np.random.randn(N, C, H, W) + 13\n", + " spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "bn_param['mode'] = 'test'\n", + "x = 2.3 * np.random.randn(N, C, H, W) + 13\n", + "a_norm, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print('After spatial batch normalization 
(test-time):')\n", + "print(' means: ', a_norm.mean(axis=(0, 2, 3)))\n", + "print(' stds: ', a_norm.std(axis=(0, 2, 3)))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Spatial batch normalization: backward\n", + "In the file `cs231n/layers.py`, implement the backward pass for spatial batch normalization in the function `spatial_batchnorm_backward`. Run the following to check your implementation using a numeric gradient check:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "N, C, H, W = 2, 3, 4, 5\n", + "x = 5 * np.random.randn(N, C, H, W) + 12\n", + "gamma = np.random.randn(C)\n", + "beta = np.random.randn(C)\n", + "dout = np.random.randn(N, C, H, W)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "fx = lambda x: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fg = lambda a: spatial_batchnorm_forward(x, a, beta, bn_param)[0]\n", + "fb = lambda b: spatial_batchnorm_forward(x, gamma, b, bn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta, dout)\n", + "\n", + "_, cache = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "dx, dgamma, dbeta = spatial_batchnorm_backward(dout, cache)\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dgamma error: ', rel_error(da_num, dgamma))\n", + "print('dbeta error: ', rel_error(db_num, dbeta))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Extra Credit Description\n", + "If you implement any additional features for extra credit, clearly describe them here with pointers to any code in this or other files if applicable." + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "3.5.0", + "pygments_lexer": "ipython3", + "codemirror_mode": { + "version": 3, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/Dropout.ipynb b/Dropout.ipynb new file mode 100644 index 0000000..7fb4bca --- /dev/null +++ b/Dropout.ipynb @@ -0,0 +1,313 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Dropout\n", + "Dropout [1] is a technique for regularizing neural networks by randomly setting some features to zero during the forward pass. In this exercise you will implement a dropout layer and modify your fully-connected network to optionally use dropout.\n", + "\n", + "[1] Geoffrey E. 
Hinton et al, \"Improving neural networks by preventing co-adaptation of feature detectors\", arXiv 2012" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "from __future__ import print_function\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.items():\n", + " print('%s: ' % k, v.shape)" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Dropout forward pass\n", + "In the file `cs231n/layers.py`, implement the forward pass for dropout. Since dropout behaves differently during training and testing, make sure to implement the operation for both modes.\n", + "\n", + "Once you have done so, run the cell below to test your implementation." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(500, 500) + 10\n", + "\n", + "for p in [0.3, 0.6, 0.75]:\n", + " out, _ = dropout_forward(x, {'mode': 'train', 'p': p})\n", + " out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})\n", + "\n", + " print('Running tests with p = ', p)\n", + " print('Mean of input: ', x.mean())\n", + " print('Mean of train-time output: ', out.mean())\n", + " print('Mean of test-time output: ', out_test.mean())\n", + " print('Fraction of train-time output set to zero: ', (out == 0).mean())\n", + " print('Fraction of test-time output set to zero: ', (out_test == 0).mean())\n", + " print()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Dropout backward pass\n", + "In the file `cs231n/layers.py`, implement the backward pass for dropout. After doing so, run the following cell to numerically gradient-check your implementation." 
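For reference, the usual "inverted dropout" pattern looks roughly like the sketch below. It assumes `p` is the probability of *keeping* a unit and that the mask is stored in the cache; check the docstring in `cs231n/layers.py` for the convention actually used in this assignment.

```python
import numpy as np

def dropout_forward_sketch(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    if mode == 'train':
        # Keep each unit with probability p and rescale by 1/p now,
        # so expected activations match and test time needs no scaling.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    else:
        # Inverted dropout is a no-op at test time.
        mask = None
        out = x

    return out, (dropout_param, mask)

def dropout_backward_sketch(dout, cache):
    dropout_param, mask = cache
    if dropout_param['mode'] == 'train':
        # Gradient flows only through the kept (and rescaled) units.
        return dout * mask
    return dout
```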
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(10, 10) + 10\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dropout_param = {'mode': 'train', 'p': 0.8, 'seed': 123}\n", + "out, cache = dropout_forward(x, dropout_param)\n", + "dx = dropout_backward(dout, cache)\n", + "dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)\n", + "\n", + "print('dx relative error: ', rel_error(dx, dx_num))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Fully-connected nets with Dropout\n", + "In the file `cs231n/classifiers/fc_net.py`, modify your implementation to use dropout. Specifically, if the constructor of the net receives a nonzero value for the `dropout` parameter, then the net should add dropout immediately after every ReLU nonlinearity. After doing so, run the following to numerically gradient-check your implementation." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for dropout in [0, 0.25, 0.5]:\n", + " print('Running check with dropout = ', dropout)\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " weight_scale=5e-2, dtype=np.float64,\n", + " dropout=dropout, seed=123)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print('Initial loss: ', loss)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n", + " print()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Regularization experiment\n", + "As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a dropout probability of 0.75. We will then visualize the training and validation accuracies of the two networks over time."
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Train two identical nets, one with dropout and one without\n", + "np.random.seed(231)\n", + "num_train = 500\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "dropout_choices = [0, 0.75]\n", + "for dropout in dropout_choices:\n", + " model = FullyConnectedNet([500], dropout=dropout)\n", + " print(dropout)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=25, batch_size=100,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 5e-4,\n", + " },\n", + " verbose=True, print_every=100)\n", + " solver.train()\n", + " solvers[dropout] = solver" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Plot train and validation accuracies of the two models\n", + "\n", + "train_accs = []\n", + "val_accs = []\n", + "for dropout in dropout_choices:\n", + " solver = solvers[dropout]\n", + " train_accs.append(solver.train_acc_history[-1])\n", + " val_accs.append(solver.val_acc_history[-1])\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].train_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Train accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + " \n", + "plt.subplot(3, 1, 2)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].val_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Val accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Question\n", + "Explain what you see in this experiment. What does it suggest about dropout?" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Answer\n" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.12+", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/FullyConnectedNets.ipynb b/FullyConnectedNets.ipynb new file mode 100644 index 0000000..d9b160b --- /dev/null +++ b/FullyConnectedNets.ipynb @@ -0,0 +1,1053 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Fully-Connected Neural Nets\n", + "In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. 
This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures.\n", + "\n", + "In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a `forward` and a `backward` function. The `forward` function will receive inputs, weights, and other parameters and will return both an output and a `cache` object storing data needed for the backward pass, like this:\n", + "\n", + "```python\n", + "def layer_forward(x, w):\n", + " \"\"\" Receive inputs x and weights w \"\"\"\n", + " # Do some computations ...\n", + " z = # ... some intermediate value\n", + " # Do some more computations ...\n", + " out = # the output\n", + " \n", + " cache = (x, w, z, out) # Values we need to compute gradients\n", + " \n", + " return out, cache\n", + "```\n", + "\n", + "The backward pass will receive upstream derivatives and the `cache` object, and will return gradients with respect to the inputs and weights, like this:\n", + "\n", + "```python\n", + "def layer_backward(dout, cache):\n", + " \"\"\"\n", + " Receive derivative of loss with respect to outputs and cache,\n", + " and compute derivative with respect to inputs.\n", + " \"\"\"\n", + " # Unpack cache values\n", + " x, w, z, out = cache\n", + " \n", + " # Use values in cache to compute derivatives\n", + " dx = # Derivative of loss with respect to x\n", + " dw = # Derivative of loss with respect to w\n", + " \n", + " return dx, dw\n", + "```\n", + "\n", + "After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.\n", + "\n", + "In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce Dropout as a regularizer and Batch Normalization as a tool to more efficiently optimize deep networks.\n", + " " + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "from __future__ import print_function\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in list(data.items()):\n", + 
" print(('%s: ' % k, v.shape))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Affine layer: foward\n", + "Open the file `cs231n/layers.py` and implement the `affine_forward` function.\n", + "\n", + "Once you are done you can test your implementaion by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test the affine_forward function\n", + "\n", + "num_inputs = 2\n", + "input_shape = (4, 5, 6)\n", + "output_dim = 3\n", + "\n", + "input_size = num_inputs * np.prod(input_shape)\n", + "weight_size = output_dim * np.prod(input_shape)\n", + "\n", + "x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)\n", + "w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)\n", + "b = np.linspace(-0.3, 0.1, num=output_dim)\n", + "\n", + "out, _ = affine_forward(x, w, b)\n", + "correct_out = np.array([[ 1.49834967, 1.70660132, 1.91485297],\n", + " [ 3.25553199, 3.5141327, 3.77273342]])\n", + "\n", + "# Compare your output with ours. The error should be around 1e-9.\n", + "print('Testing affine_forward function:')\n", + "print('difference: ', rel_error(out, correct_out))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Affine layer: backward\n", + "Now implement the `affine_backward` function and test your implementation using numeric gradient checking." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test the affine_backward function\n", + "np.random.seed(231)\n", + "x = np.random.randn(10, 2, 3)\n", + "w = np.random.randn(6, 5)\n", + "b = np.random.randn(5)\n", + "dout = np.random.randn(10, 5)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)\n", + "\n", + "_, cache = affine_forward(x, w, b)\n", + "dx, dw, db = affine_backward(dout, cache)\n", + "\n", + "# The error should be around 1e-10\n", + "print('Testing affine_backward function:')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# ReLU layer: forward\n", + "Implement the forward pass for the ReLU activation function in the `relu_forward` function and test your implementation using the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test the relu_forward function\n", + "\n", + "x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)\n", + "\n", + "out, _ = relu_forward(x)\n", + "correct_out = np.array([[ 0., 0., 0., 0., ],\n", + " [ 0., 0., 0.04545455, 0.13636364,],\n", + " [ 0.22727273, 0.31818182, 0.40909091, 0.5, ]])\n", + "\n", + "# Compare your output with ours. 
The error should be around 5e-8\n", + "print('Testing relu_forward function:')\n", + "print('difference: ', rel_error(out, correct_out))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# ReLU layer: backward\n", + "Now implement the backward pass for the ReLU activation function in the `relu_backward` function and test your implementation using numeric gradient checking:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(10, 10)\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)\n", + "\n", + "_, cache = relu_forward(x)\n", + "dx = relu_backward(dout, cache)\n", + "\n", + "# The error should be around 3e-12\n", + "print('Testing relu_backward function:')\n", + "print('dx error: ', rel_error(dx_num, dx))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# \"Sandwich\" layers\n", + "There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file `cs231n/layer_utils.py`.\n", + "\n", + "For now take a look at the `affine_relu_forward` and `affine_relu_backward` functions, and run the following to numerically gradient check the backward pass:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import affine_relu_forward, affine_relu_backward\n", + "np.random.seed(231)\n", + "x = np.random.randn(2, 3, 4)\n", + "w = np.random.randn(12, 10)\n", + "b = np.random.randn(10)\n", + "dout = np.random.randn(2, 10)\n", + "\n", + "out, cache = affine_relu_forward(x, w, b)\n", + "dx, dw, db = affine_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_relu_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_relu_forward(x, w, b)[0], b, dout)\n", + "\n", + "print('Testing affine_relu_forward:')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Loss layers: Softmax and SVM\n", + "You implemented these loss functions in the last assignment, so we'll give them to you for free here. 
You should still make sure you understand how they work by looking at the implementations in `cs231n/layers.py`.\n", + "\n", + "You can make sure that the implementations are correct by running the following:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "num_classes, num_inputs = 10, 50\n", + "x = 0.001 * np.random.randn(num_inputs, num_classes)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: svm_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = svm_loss(x, y)\n", + "\n", + "# Test svm_loss function. Loss should be around 9 and dx error should be 1e-9\n", + "print('Testing svm_loss:')\n", + "print('loss: ', loss)\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: softmax_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = softmax_loss(x, y)\n", + "\n", + "# Test softmax_loss function. Loss should be 2.3 and dx error should be 1e-8\n", + "print('\\nTesting softmax_loss:')\n", + "print('loss: ', loss)\n", + "print('dx error: ', rel_error(dx_num, dx))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Two-layer network\n", + "In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two layer network using these modular implementations.\n", + "\n", + "Open the file `cs231n/classifiers/fc_net.py` and complete the implementation of the `TwoLayerNet` class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "N, D, H, C = 3, 5, 50, 7\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=N)\n", + "\n", + "std = 1e-3\n", + "model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=std)\n", + "\n", + "print('Testing initialization ... ')\n", + "W1_std = abs(model.params['W1'].std() - std)\n", + "b1 = model.params['b1']\n", + "W2_std = abs(model.params['W2'].std() - std)\n", + "b2 = model.params['b2']\n", + "assert W1_std < std / 10, 'First layer weights do not seem right'\n", + "assert np.all(b1 == 0), 'First layer biases do not seem right'\n", + "assert W2_std < std / 10, 'Second layer weights do not seem right'\n", + "assert np.all(b2 == 0), 'Second layer biases do not seem right'\n", + "\n", + "print('Testing test-time forward pass ... 
')\n", + "model.params['W1'] = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)\n", + "model.params['b1'] = np.linspace(-0.1, 0.9, num=H)\n", + "model.params['W2'] = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)\n", + "model.params['b2'] = np.linspace(-0.9, 0.1, num=C)\n", + "X = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T\n", + "scores = model.loss(X)\n", + "correct_scores = np.asarray(\n", + " [[11.53165108, 12.2917344, 13.05181771, 13.81190102, 14.57198434, 15.33206765, 16.09215096],\n", + " [12.05769098, 12.74614105, 13.43459113, 14.1230412, 14.81149128, 15.49994135, 16.18839143],\n", + " [12.58373087, 13.20054771, 13.81736455, 14.43418138, 15.05099822, 15.66781506, 16.2846319 ]])\n", + "scores_diff = np.abs(scores - correct_scores).sum()\n", + "assert scores_diff < 1e-6, 'Problem with test-time forward pass'\n", + "\n", + "print('Testing training loss (no regularization)')\n", + "y = np.asarray([0, 5, 1])\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 3.4702243556\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with training-time loss'\n", + "\n", + "model.reg = 1.0\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 26.5948426952\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with regularization loss'\n", + "\n", + "for reg in [0.0, 0.7]:\n", + " print('Running numeric gradient check with reg = ', reg)\n", + " model.reg = reg\n", + " loss, grads = model.loss(X, y)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Solver\n", + "In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.\n", + "\n", + "Open the file `cs231n/solver.py` and read through it to familiarize yourself with the API. After doing so, use a `Solver` instance to train a `TwoLayerNet` that achieves at least `50%` accuracy on the validation set." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = TwoLayerNet()\n", + "solver = None\n", + "\n", + "##############################################################################\n", + "# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #\n", + "# 50% accuracy on the validation set. 
#\n", + "##############################################################################\n", + "pass\n", + "##############################################################################\n", + "# END OF YOUR CODE #\n", + "##############################################################################" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run this cell to visualize training loss and train / val accuracy\n", + "\n", + "plt.subplot(2, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.title('Accuracy')\n", + "plt.plot(solver.train_acc_history, '-o', label='train')\n", + "plt.plot(solver.val_acc_history, '-o', label='val')\n", + "plt.plot([0.5] * len(solver.val_acc_history), 'k--')\n", + "plt.xlabel('Epoch')\n", + "plt.legend(loc='lower right')\n", + "plt.gcf().set_size_inches(15, 12)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Multilayer network\n", + "Next you will implement a fully-connected network with an arbitrary number of hidden layers.\n", + "\n", + "Read through the `FullyConnectedNet` class in the file `cs231n/classifiers/fc_net.py`.\n", + "\n", + "Implement the initialization, the forward pass, and the backward pass. For the moment don't worry about implementing dropout or batch normalization; we will add those features soon." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "## Initial loss and gradient check" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?\n", + "\n", + "For gradient checking, you should expect to see errors around 1e-6 or less." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "np.random.seed(231)\n", + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print('Running check with reg = ', reg)\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print('Initial loss: ', loss)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "As another sanity check, make sure you can overfit a small dataset of 50 images. First we will try a three-layer network with 100 units in each hidden layer. You will need to tweak the learning rate and initialization scale, but you should be able to overfit and achieve 100% training accuracy within 20 epochs." 
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# TODO: Use a three-layer Net to overfit 50 training examples.\n", + "\n", + "num_train = 50\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 1e-2\n", + "learning_rate = 1e-4\n", + "model = FullyConnectedNet([100, 100],\n", + " weight_scale=weight_scale, dtype=np.float64)\n", + "solver = Solver(model, small_data,\n", + " print_every=10, num_epochs=20, batch_size=25,\n", + " update_rule='sgd',\n", + " optim_config={\n", + " 'learning_rate': learning_rate,\n", + " }\n", + " )\n", + "solver.train()\n", + "\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.title('Training loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Training loss')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again you will have to adjust the learning rate and weight initialization, but you should be able to achieve 100% training accuracy within 20 epochs." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# TODO: Use a five-layer Net to overfit 50 training examples.\n", + "\n", + "num_train = 50\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "learning_rate = 1e-3\n", + "weight_scale = 1e-5\n", + "model = FullyConnectedNet([100, 100, 100, 100],\n", + " weight_scale=weight_scale, dtype=np.float64)\n", + "solver = Solver(model, small_data,\n", + " print_every=10, num_epochs=20, batch_size=25,\n", + " update_rule='sgd',\n", + " optim_config={\n", + " 'learning_rate': learning_rate,\n", + " }\n", + " )\n", + "solver.train()\n", + "\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.title('Training loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Training loss')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Inline question: \n", + "Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net?\n", + "\n", + "# Answer:\n", + "[FILL THIS IN]\n" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Update rules\n", + "So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD." 
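For reference, vanilla SGD in the `(w, dw, config) -> (next_w, config)` style used by `cs231n/optim.py` looks roughly like the sketch below (the file already provides `sgd`; the default learning rate here is illustrative):

```python
def sgd_sketch(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    next_w = w - config['learning_rate'] * dw  # step against the gradient
    return next_w, config
```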
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# SGD+Momentum\n", + "Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochstic gradient descent.\n", + "\n", + "Open the file `cs231n/optim.py` and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function `sgd_momentum` and run the following to check your implementation. You should see errors less than 1e-8." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.optim import sgd_momentum\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-3, 'velocity': v}\n", + "next_w, _ = sgd_momentum(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [ 0.1406, 0.20738947, 0.27417895, 0.34096842, 0.40775789],\n", + " [ 0.47454737, 0.54133684, 0.60812632, 0.67491579, 0.74170526],\n", + " [ 0.80849474, 0.87528421, 0.94207368, 1.00886316, 1.07565263],\n", + " [ 1.14244211, 1.20923158, 1.27602105, 1.34281053, 1.4096 ]])\n", + "expected_velocity = np.asarray([\n", + " [ 0.5406, 0.55475789, 0.56891579, 0.58307368, 0.59723158],\n", + " [ 0.61138947, 0.62554737, 0.63970526, 0.65386316, 0.66802105],\n", + " [ 0.68217895, 0.69633684, 0.71049474, 0.72465263, 0.73881053],\n", + " [ 0.75296842, 0.76712632, 0.78128421, 0.79544211, 0.8096 ]])\n", + "\n", + "print('next_w error: ', rel_error(next_w, expected_next_w))\n", + "print('velocity error: ', rel_error(expected_velocity, config['velocity']))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster." 
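For reference, the classic momentum update has the form sketched below; it is consistent with the expected values in the check above (momentum 0.9), though the defaults shown are illustrative and the graded code lives in `cs231n/optim.py`.

```python
import numpy as np  # already imported by the setup cell

def sgd_momentum_sketch(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw  # integrate velocity
    next_w = w + v                                             # integrate position
    config['velocity'] = v
    return next_w, config
```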
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_train = 4000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "\n", + "for update_rule in ['sgd', 'sgd_momentum']:\n", + " print('running with ', update_rule)\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': 1e-2,\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print()\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in list(solvers.items()):\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=update_rule)\n", + " \n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=update_rule)\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# RMSProp and Adam\n", + "RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.\n", + "\n", + "In the file `cs231n/optim.py`, implement the RMSProp update rule in the `rmsprop` function and implement the Adam update rule in the `adam` function, and check your implementations using the tests below.\n", + "\n", + "[1] Tijmen Tieleman and Geoffrey Hinton. \"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.\" COURSERA: Neural Networks for Machine Learning 4 (2012).\n", + "\n", + "[2] Diederik Kingma and Jimmy Ba, \"Adam: A Method for Stochastic Optimization\", ICLR 2015." 
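For reference, here are hedged sketches of the standard RMSProp [1] and Adam [2] updates in the same `(w, dw, config)` API; they agree with the expected values in the checks below, but the defaults shown are illustrative and the graded implementations belong in `cs231n/optim.py`.

```python
import numpy as np  # already imported by the setup cell

def rmsprop_sketch(w, dw, config=None):
    config = {} if config is None else config
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))
    # Running average of squared gradients scales each parameter's step.
    cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw ** 2
    next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])
    config['cache'] = cache
    return next_w, config

def adam_sketch(w, dw, config=None):
    config = {} if config is None else config
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)
    config['t'] += 1
    # First and second moment estimates with bias correction.
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw ** 2
    m_hat = config['m'] / (1 - config['beta1'] ** config['t'])
    v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
    next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
    return next_w, config
```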
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test RMSProp implementation; you should see errors less than 1e-7\n", + "from cs231n.optim import rmsprop\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'cache': cache}\n", + "next_w, _ = rmsprop(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],\n", + " [-0.132737, -0.08078555, -0.02881884, 0.02316247, 0.07515774],\n", + " [ 0.12716641, 0.17918792, 0.23122175, 0.28326742, 0.33532447],\n", + " [ 0.38739248, 0.43947102, 0.49155973, 0.54365823, 0.59576619]])\n", + "expected_cache = np.asarray([\n", + " [ 0.5976, 0.6126277, 0.6277108, 0.64284931, 0.65804321],\n", + " [ 0.67329252, 0.68859723, 0.70395734, 0.71937285, 0.73484377],\n", + " [ 0.75037008, 0.7659518, 0.78158892, 0.79728144, 0.81302936],\n", + " [ 0.82883269, 0.84469141, 0.86060554, 0.87657507, 0.8926 ]])\n", + "\n", + "print('next_w error: ', rel_error(expected_next_w, next_w))\n", + "print('cache error: ', rel_error(expected_cache, config['cache']))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test Adam implementation; you should see errors around 1e-7 or less\n", + "from cs231n.optim import adam\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}\n", + "next_w, _ = adam(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],\n", + " [-0.1380274, -0.08544591, -0.03286534, 0.01971428, 0.0722929],\n", + " [ 0.1248705, 0.17744702, 0.23002243, 0.28259667, 0.33516969],\n", + " [ 0.38774145, 0.44031188, 0.49288093, 0.54544852, 0.59801459]])\n", + "expected_v = np.asarray([\n", + " [ 0.69966, 0.68908382, 0.67851319, 0.66794809, 0.65738853,],\n", + " [ 0.64683452, 0.63628604, 0.6257431, 0.61520571, 0.60467385,],\n", + " [ 0.59414753, 0.58362676, 0.57311152, 0.56260183, 0.55209767,],\n", + " [ 0.54159906, 0.53110598, 0.52061845, 0.51013645, 0.49966, ]])\n", + "expected_m = np.asarray([\n", + " [ 0.48, 0.49947368, 0.51894737, 0.53842105, 0.55789474],\n", + " [ 0.57736842, 0.59684211, 0.61631579, 0.63578947, 0.65526316],\n", + " [ 0.67473684, 0.69421053, 0.71368421, 0.73315789, 0.75263158],\n", + " [ 0.77210526, 0.79157895, 0.81105263, 0.83052632, 0.85 ]])\n", + "\n", + "print('next_w error: ', rel_error(expected_next_w, next_w))\n", + "print('v error: ', rel_error(expected_v, config['v']))\n", + "print('m error: ', rel_error(expected_m, config['m']))" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:" + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + 
"deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}\n", + "for update_rule in ['adam', 'rmsprop']:\n", + " print('running with ', update_rule)\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': learning_rates[update_rule]\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print()\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in list(solvers.items()):\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=update_rule)\n", + " \n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=update_rule)\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Train a good model!\n", + "Train the best fully-connected model that you can on CIFAR-10, storing your best model in the `best_model` variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.\n", + "\n", + "If you are careful it should be possible to get accuracies above 55%, but we don't require it for this part and won't assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.\n", + "\n", + "You might find it useful to complete the `BatchNormalization.ipynb` and `Dropout.ipynb` notebooks before completing this part, since those techniques can help you train powerful models." + ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "best_model = None\n", + "################################################################################\n", + "# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might #\n", + "# batch normalization and dropout useful. Store your best model in the #\n", + "# best_model variable. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false, + "editable": true, + "deletable": true + } + }, + { + "source": [ + "# Test you model\n", + "Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set." 
+ ], + "cell_type": "markdown", + "metadata": { + "editable": true, + "deletable": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)\n", + "y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)\n", + "print('Validation set accuracy: ', (y_val_pred == data['y_val']).mean())\n", + "print('Test set accuracy: ', (y_test_pred == data['y_test']).mean())" + ], + "outputs": [], + "metadata": { + "collapsed": false, + "editable": true, + "deletable": true + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "3.5.0", + "pygments_lexer": "ipython3", + "codemirror_mode": { + "version": 3, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/README.md b/README.md index 4d9ba74..33367a3 100644 --- a/README.md +++ b/README.md @@ -1 +1 @@ -# convolutional_nn \ No newline at end of file +Details about this assignment can be found [here](http://cs231n.github.io/assignments2017/assignment2/). diff --git a/TensorFlow.ipynb b/TensorFlow.ipynb new file mode 100644 index 0000000..bcd7694 --- /dev/null +++ b/TensorFlow.ipynb @@ -0,0 +1,712 @@ +{ + "nbformat_minor": 1, + "nbformat": 4, + "cells": [ + { + "source": [ + "## What's this TensorFlow business?\n", + "\n", + "You've written a lot of code in this assignment to provide a whole host of neural network functionality. Dropout, Batch Norm, and 2D convolutions are some of the workhorses of deep learning in computer vision. You've also worked hard to make your code efficient and vectorized.\n", + "\n", + "For the last part of this assignment, though, we're going to leave behind your beautiful codebase and instead migrate to one of two popular deep learning frameworks: in this instance, TensorFlow (or PyTorch, if you switch over to that notebook)\n", + "\n", + "#### What is it?\n", + "TensorFlow is a system for executing computational graphs over Tensor objects, with native support for performing backpropogation for its Variables. In it, we work with Tensors which are n-dimensional arrays analogous to the numpy ndarray.\n", + "\n", + "#### Why?\n", + "\n", + "* Our code will now run on GPUs! Much faster training. Writing your own modules to run on GPUs is beyond the scope of this class, unfortunately.\n", + "* We want you to be ready to use one of these frameworks for your project so you can experiment more efficiently than if you were writing every feature you want to use by hand. \n", + "* We want you to stand on the shoulders of giants! TensorFlow and PyTorch are both excellent frameworks that will make your lives a lot easier, and now that you understand their guts, you are free to use them :) \n", + "* We want you to be exposed to the sort of deep learning code you might run into in academia or industry. " + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## How will I learn TensorFlow?\n", + "\n", + "TensorFlow has many excellent tutorials available, including those from [Google themselves](https://www.tensorflow.org/get_started/get_started).\n", + "\n", + "Otherwise, this notebook will walk you through much of what you need to do to train models in TensorFlow. 
See the end of the notebook for some links to helpful tutorials if you want to learn more or need further clarification on topics that aren't fully explained here." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Load Datasets\n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "import tensorflow as tf\n", + "import numpy as np\n", + "import math\n", + "import timeit\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.data_utils import load_CIFAR10\n", + "\n", + "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=10000):\n", + " \"\"\"\n", + " Load the CIFAR-10 dataset from disk and perform preprocessing to prepare\n", + " it for the two-layer neural net classifier. These are the same steps as\n", + " we used for the SVM, but condensed to a single function. \n", + " \"\"\"\n", + " # Load the raw CIFAR-10 data\n", + " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + "\n", + " # Subsample the data\n", + " mask = range(num_training, num_training + num_validation)\n", + " X_val = X_train[mask]\n", + " y_val = y_train[mask]\n", + " mask = range(num_training)\n", + " X_train = X_train[mask]\n", + " y_train = y_train[mask]\n", + " mask = range(num_test)\n", + " X_test = X_test[mask]\n", + " y_test = y_test[mask]\n", + "\n", + " # Normalize the data: subtract the mean image\n", + " mean_image = np.mean(X_train, axis=0)\n", + " X_train -= mean_image\n", + " X_val -= mean_image\n", + " X_test -= mean_image\n", + "\n", + " return X_train, y_train, X_val, y_val, X_test, y_test\n", + "\n", + "\n", + "# Invoke the above function to get our data.\n", + "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()\n", + "print('Train data shape: ', X_train.shape)\n", + "print('Train labels shape: ', y_train.shape)\n", + "print('Validation data shape: ', X_val.shape)\n", + "print('Validation labels shape: ', y_val.shape)\n", + "print('Test data shape: ', X_test.shape)\n", + "print('Test labels shape: ', y_test.shape)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "## Example Model\n", + "\n", + "### Some useful utilities\n", + "\n", + ". Remember that our image data is initially N x H x W x C, where:\n", + "* N is the number of datapoints\n", + "* H is the height of each image in pixels\n", + "* W is the height of each image in pixels\n", + "* C is the number of channels (usually 3: R, G, B)\n", + "\n", + "This is the right way to represent the data when we are doing something like a 2D convolution, which needs spatial understanding of where the pixels are relative to each other. When we input image data into fully connected affine layers, however, we want each data example to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "### The example model itself\n", + "\n", + "The first step to training your own model is defining its architecture.\n", + "\n", + "Here's an example of a convolutional neural network defined in TensorFlow -- try to understand what each line is doing, remembering that each layer is composed upon the previous layer. 
We haven't trained anything yet - that'll come next - for now, we want you to understand how everything gets set up. \n", + "\n", + "In that example, you see 2D convolutional layers (Conv2d), ReLU activations, and fully-connected layers (Linear). You also see the Hinge loss function, and the Adam optimizer being used. \n", + "\n", + "Make sure you understand why the parameters of the Linear layer are 5408 and 10.\n", + "\n", + "### TensorFlow Details\n", + "In TensorFlow, much like in our previous notebooks, we'll first specifically initialize our variables, and then our network model." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# clear old variables\n", + "tf.reset_default_graph()\n", + "\n", + "# setup input (e.g. the data that changes every batch)\n", + "# The first dim is None, and gets sets automatically based on batch size fed in\n", + "X = tf.placeholder(tf.float32, [None, 32, 32, 3])\n", + "y = tf.placeholder(tf.int64, [None])\n", + "is_training = tf.placeholder(tf.bool)\n", + "\n", + "def simple_model(X,y):\n", + " # define our weights (e.g. init_two_layer_convnet)\n", + " \n", + " # setup variables\n", + " Wconv1 = tf.get_variable(\"Wconv1\", shape=[7, 7, 3, 32])\n", + " bconv1 = tf.get_variable(\"bconv1\", shape=[32])\n", + " W1 = tf.get_variable(\"W1\", shape=[5408, 10])\n", + " b1 = tf.get_variable(\"b1\", shape=[10])\n", + "\n", + " # define our graph (e.g. two_layer_convnet)\n", + " a1 = tf.nn.conv2d(X, Wconv1, strides=[1,2,2,1], padding='VALID') + bconv1\n", + " h1 = tf.nn.relu(a1)\n", + " h1_flat = tf.reshape(h1,[-1,5408])\n", + " y_out = tf.matmul(h1_flat,W1) + b1\n", + " return y_out\n", + "\n", + "y_out = simple_model(X,y)\n", + "\n", + "# define our loss\n", + "total_loss = tf.losses.hinge_loss(tf.one_hot(y,10),logits=y_out)\n", + "mean_loss = tf.reduce_mean(total_loss)\n", + "\n", + "# define our optimizer\n", + "optimizer = tf.train.AdamOptimizer(5e-4) # select optimizer and set learning rate\n", + "train_step = optimizer.minimize(mean_loss)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "TensorFlow supports many other layer types, loss functions, and optimizers - you will experiment with these next. Here's the official API documentation for these (if any of the parameters used above were unclear, this resource will also be helpful). \n", + "\n", + "* Layers, Activations, Loss functions : https://www.tensorflow.org/api_guides/python/nn\n", + "* Optimizers: https://www.tensorflow.org/api_guides/python/train#Optimizers\n", + "* BatchNorm: https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "### Training the model on one epoch\n", + "While we have defined a graph of operations above, in order to execute TensorFlow Graphs, by feeding them input data and computing the results, we first need to create a `tf.Session` object. A session encapsulates the control and state of the TensorFlow runtime. For more information, see the TensorFlow [Getting started](https://www.tensorflow.org/get_started/get_started) guide.\n", + "\n", + "Optionally we can also specify a device context such as `/cpu:0` or `/gpu:0`. 
For documentation on this behavior see [this TensorFlow guide](https://www.tensorflow.org/tutorials/using_gpu)\n", + "\n", + "You should see a validation loss of around 0.4 to 0.6 and an accuracy of 0.30 to 0.35 below" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "def run_model(session, predict, loss_val, Xd, yd,\n", + " epochs=1, batch_size=64, print_every=100,\n", + " training=None, plot_losses=False):\n", + " # have tensorflow compute accuracy\n", + " correct_prediction = tf.equal(tf.argmax(predict,1), y)\n", + " accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))\n", + " \n", + " # shuffle indicies\n", + " train_indicies = np.arange(Xd.shape[0])\n", + " np.random.shuffle(train_indicies)\n", + "\n", + " training_now = training is not None\n", + " \n", + " # setting up variables we want to compute (and optimizing)\n", + " # if we have a training function, add that to things we compute\n", + " variables = [mean_loss,correct_prediction,accuracy]\n", + " if training_now:\n", + " variables[-1] = training\n", + " \n", + " # counter \n", + " iter_cnt = 0\n", + " for e in range(epochs):\n", + " # keep track of losses and accuracy\n", + " correct = 0\n", + " losses = []\n", + " # make sure we iterate over the dataset once\n", + " for i in range(int(math.ceil(Xd.shape[0]/batch_size))):\n", + " # generate indicies for the batch\n", + " start_idx = (i*batch_size)%Xd.shape[0]\n", + " idx = train_indicies[start_idx:start_idx+batch_size]\n", + " \n", + " # create a feed dictionary for this batch\n", + " feed_dict = {X: Xd[idx,:],\n", + " y: yd[idx],\n", + " is_training: training_now }\n", + " # get batch size\n", + " actual_batch_size = yd[idx].shape[0]\n", + " \n", + " # have tensorflow compute loss and correct predictions\n", + " # and (if given) perform a training step\n", + " loss, corr, _ = session.run(variables,feed_dict=feed_dict)\n", + " \n", + " # aggregate performance stats\n", + " losses.append(loss*actual_batch_size)\n", + " correct += np.sum(corr)\n", + " \n", + " # print every now and then\n", + " if training_now and (iter_cnt % print_every) == 0:\n", + " print(\"Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}\"\\\n", + " .format(iter_cnt,loss,np.sum(corr)/actual_batch_size))\n", + " iter_cnt += 1\n", + " total_correct = correct/Xd.shape[0]\n", + " total_loss = np.sum(losses)/Xd.shape[0]\n", + " print(\"Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}\"\\\n", + " .format(total_loss,total_correct,e+1))\n", + " if plot_losses:\n", + " plt.plot(losses)\n", + " plt.grid(True)\n", + " plt.title('Epoch {} Loss'.format(e+1))\n", + " plt.xlabel('minibatch number')\n", + " plt.ylabel('minibatch loss')\n", + " plt.show()\n", + " return total_loss,total_correct\n", + "\n", + "with tf.Session() as sess:\n", + " with tf.device(\"/cpu:0\"): #\"/cpu:0\" or \"/gpu:0\" \n", + " sess.run(tf.global_variables_initializer())\n", + " print('Training')\n", + " run_model(sess,y_out,mean_loss,X_train,y_train,1,64,100,train_step,True)\n", + " print('Validation')\n", + " run_model(sess,y_out,mean_loss,X_val,y_val,1,64)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "## Training a specific model\n", + "\n", + "In this section, we're going to specify a model for you to construct. 
The goal here isn't to get good performance (that'll be next), but instead to get comfortable with understanding the TensorFlow documentation and configuring your own model. \n", + "\n", + "Using the code provided above as guidance, and using the following TensorFlow documentation, specify a model with the following architecture:\n", + "\n", + "* 7x7 Convolutional Layer with 32 filters and stride of 1\n", + "* ReLU Activation Layer\n", + "* Spatial Batch Normalization Layer (trainable parameters, with scale and centering)\n", + "* 2x2 Max Pooling layer with a stride of 2\n", + "* Affine layer with 1024 output units\n", + "* ReLU Activation Layer\n", + "* Affine layer from 1024 input units to 10 outputs\n", + "\n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# clear old variables\n", + "tf.reset_default_graph()\n", + "\n", + "# define our input (e.g. the data that changes every batch)\n", + "# The first dim is None, and gets sets automatically based on batch size fed in\n", + "X = tf.placeholder(tf.float32, [None, 32, 32, 3])\n", + "y = tf.placeholder(tf.int64, [None])\n", + "is_training = tf.placeholder(tf.bool)\n", + "\n", + "# define model\n", + "def complex_model(X,y,is_training):\n", + " pass\n", + "\n", + "y_out = complex_model(X,y,is_training)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "To make sure you're doing the right thing, use the following tool to check the dimensionality of your output (it should be 64 x 10, since our batches have size 64 and the output of the final affine layer should be 10, corresponding to our 10 classes):" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Now we're going to feed a random batch into the model \n", + "# and make sure the output is the right size\n", + "x = np.random.randn(64, 32, 32,3)\n", + "with tf.Session() as sess:\n", + " with tf.device(\"/cpu:0\"): #\"/cpu:0\" or \"/gpu:0\"\n", + " tf.global_variables_initializer().run()\n", + "\n", + " ans = sess.run(y_out,feed_dict={X:x,is_training:True})\n", + " %timeit sess.run(y_out,feed_dict={X:x,is_training:True})\n", + " print(ans.shape)\n", + " print(np.array_equal(ans.shape, np.array([64, 10])))" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "You should see the following from the run above \n", + "\n", + "`(64, 10)`\n", + "\n", + "`True`" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "### GPU!\n", + "\n", + "Now, we're going to try and start the model under the GPU device, the rest of the code stays unchanged and all our variables and operations will be computed using accelerated code paths. However, if there is no GPU, we get a Python exception and have to rebuild our graph. On a dual-core CPU, you might see around 50-80ms/batch running the above, while the Google Cloud GPUs (run below) should be around 2-5ms/batch." 
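If you get stuck on `complex_model` above, here is one possible (untested) sketch using `tf.layers`. The `VALID` padding choice, and therefore the 13x13x32 = 5408-unit flatten, are assumptions rather than the required answer, and the batch-norm update dependencies are handled in a later cell of this notebook.

```python
def complex_model_sketch(X, y, is_training):
    # 7x7 conv, 32 filters, stride 1 (VALID padding: 32x32 -> 26x26x32)
    h = tf.layers.conv2d(X, filters=32, kernel_size=7, strides=1, padding='valid')
    h = tf.nn.relu(h)
    # Spatial batch normalization with learnable scale and offset.
    h = tf.layers.batch_normalization(h, training=is_training)
    # 2x2 max pool with stride 2: 26x26x32 -> 13x13x32
    h = tf.layers.max_pooling2d(h, pool_size=2, strides=2)
    # Flatten and apply two affine layers: 13*13*32 = 5408 -> 1024 -> 10
    h = tf.reshape(h, [-1, 13 * 13 * 32])
    h = tf.nn.relu(tf.layers.dense(h, 1024))
    y_out = tf.layers.dense(h, 10)
    return y_out
```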
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "try:\n", + " with tf.Session() as sess:\n", + " with tf.device(\"/gpu:0\") as dev: #\"/cpu:0\" or \"/gpu:0\"\n", + " tf.global_variables_initializer().run()\n", + "\n", + " ans = sess.run(y_out,feed_dict={X:x,is_training:True})\n", + " %timeit sess.run(y_out,feed_dict={X:x,is_training:True})\n", + "except tf.errors.InvalidArgumentError:\n", + " print(\"no gpu found, please use Google Cloud if you want GPU acceleration\") \n", + " # rebuild the graph\n", + " # trying to start a GPU throws an exception \n", + " # and also trashes the original graph\n", + " tf.reset_default_graph()\n", + " X = tf.placeholder(tf.float32, [None, 32, 32, 3])\n", + " y = tf.placeholder(tf.int64, [None])\n", + " is_training = tf.placeholder(tf.bool)\n", + " y_out = complex_model(X,y,is_training)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "You should observe that even a simple forward pass like this is significantly faster on the GPU. So for the rest of the assignment (and when you go train your models in assignment 3 and your project!), you should use GPU devices. However, with TensorFlow, the default device is a GPU if one is available, and a CPU otherwise, so we can skip the device specification from now on." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "### Train the model.\n", + "\n", + "Now that you've seen how to define a model and do a single forward pass of some data through it, let's walk through how you'd actually train one whole epoch over your training data (using the complex_model you created provided above).\n", + "\n", + "Make sure you understand how each TensorFlow function used below corresponds to what you implemented in your custom neural network implementation.\n", + "\n", + "First, set up an **RMSprop optimizer** (using a 1e-3 learning rate) and a **cross-entropy loss** function. See the TensorFlow documentation for more information\n", + "* Layers, Activations, Loss functions : https://www.tensorflow.org/api_guides/python/nn\n", + "* Optimizers: https://www.tensorflow.org/api_guides/python/train#Optimizers" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Inputs\n", + "# y_out: is what your model computes\n", + "# y: is your TensorFlow variable with label information\n", + "# Outputs\n", + "# mean_loss: a TensorFlow variable (scalar) with numerical loss\n", + "# optimizer: a TensorFlow optimizer\n", + "# This should be ~3 lines of code!\n", + "mean_loss = None\n", + "optimizer = None\n", + "pass\n" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# batch normalization in tensorflow requires this extra dependency\n", + "extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)\n", + "with tf.control_dependencies(extra_update_ops):\n", + " train_step = optimizer.minimize(mean_loss)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "### Train the model\n", + "Below we'll create a session and train the model over one epoch. You should see a loss of 1.4 to 2.0 and an accuracy of 0.4 to 0.5. 
There will be some variation due to random seeds and differences in initialization" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "sess = tf.Session()\n", + "\n", + "sess.run(tf.global_variables_initializer())\n", + "print('Training')\n", + "run_model(sess,y_out,mean_loss,X_train,y_train,1,64,100,train_step)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "### Check the accuracy of the model.\n", + "\n", + "Let's see the train and test code in action -- feel free to use these methods when evaluating the models you develop below. You should see a loss of 1.3 to 2.0 with an accuracy of 0.45 to 0.55." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "print('Validation')\n", + "run_model(sess,y_out,mean_loss,X_val,y_val,1,64)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "## Train a _great_ model on CIFAR-10!\n", + "\n", + "Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves ** >= 70% accuracy on the validation set** of CIFAR-10. You can use the `run_model` function from above." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "### Things you should try:\n", + "- **Filter size**: Above we used 7x7; this makes pretty pictures but smaller filters may be more efficient\n", + "- **Number of filters**: Above we used 32 filters. Do more or fewer do better?\n", + "- **Pooling vs Strided Convolution**: Do you use max pooling or just stride convolutions?\n", + "- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?\n", + "- **Network architecture**: The network above has two layers of trainable parameters. Can you do better with a deep network? Good architectures to try include:\n", + " - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]\n", + " - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]\n", + " - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]\n", + "- **Use TensorFlow Scope**: Use TensorFlow scope and/or [tf.layers](https://www.tensorflow.org/api_docs/python/tf/layers) to make it easier to write deeper networks. See [this tutorial](https://www.tensorflow.org/tutorials/layers) for how to use `tf.layers`. \n", + "- **Use Learning Rate Decay**: [As the notes point out](http://cs231n.github.io/neural-networks-3/#anneal), decaying the learning rate might help the model converge. Feel free to decay every epoch, when loss doesn't change over an entire epoch, or any other heuristic you find appropriate. See the [Tensorflow documentation](https://www.tensorflow.org/versions/master/api_guides/python/train#Decaying_the_learning_rate) for learning rate decay.\n", + "- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then perform an average pooling operation to get to a 1x1 image picture (1, 1 , Filter#), which is then reshaped into a (Filter#) vector. 
This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (See Table 1 for their architecture).\n", + "- **Regularization**: Add l2 weight regularization, or perhaps use [Dropout as in the TensorFlow MNIST tutorial](https://www.tensorflow.org/get_started/mnist/pros)\n", + "\n", + "### Tips for training\n", + "For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:\n", + "\n", + "- If the parameters are working well, you should see improvement within a few hundred iterations\n", + "- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.\n", + "- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.\n", + "- You should use the validation set for hyperparameter search, and we'll save the test set for evaluating your architecture on the best parameters as selected by the validation set.\n", + "\n", + "### Going above and beyond\n", + "If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these; however they would be good things to try for extra credit.\n", + "\n", + "- Alternative update steps: For the assignment we implemented SGD+momentum, RMSprop, and Adam; you could try alternatives like AdaGrad or AdaDelta.\n", + "- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.\n", + "- Model ensembles\n", + "- Data augmentation\n", + "- New Architectures\n", + " - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.\n", + " - [DenseNets](https://arxiv.org/abs/1608.06993) where inputs into previous layers are concatenated together.\n", + " - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)\n", + "\n", + "If you do decide to implement something extra, clearly describe it in the \"Extra Credit Description\" cell below.\n", + "\n", + "### What we expect\n", + "At the very least, you should be able to train a ConvNet that gets at **>= 70% accuracy on the validation set**. This is just a lower bound - if you are careful it should be possible to get accuracies much higher than that! Extra credit points will be awarded for particularly high-scoring models or unique approaches.\n", + "\n", + "You should use the space below to experiment and train your network. The final cell in this notebook should contain the training and validation set accuracies for your final trained network.\n", + "\n", + "Have fun and happy training!" 
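To make a few of the suggestions above concrete, here is one illustrative starting point for `my_model` in the cells below, again assuming the TF 1.x `tf.layers` API: a [conv-batchnorm-relu-pool]x2 stack with 3x3 filters plus learning rate decay. It is a sketch to experiment from, not a recipe guaranteed to reach 70%:

def my_model(X, y, is_training):
    h = X
    for num_filters in (32, 64):
        h = tf.layers.conv2d(h, filters=num_filters, kernel_size=3, padding='same')
        h = tf.layers.batch_normalization(h, training=is_training)
        h = tf.nn.relu(h)
        h = tf.layers.max_pooling2d(h, pool_size=2, strides=2)
    # two 2x2 pools shrink 32x32 inputs to 8x8, so flatten 8*8*64 features
    h = tf.reshape(h, [-1, 8 * 8 * 64])
    h = tf.nn.relu(tf.layers.dense(h, 512))
    return tf.layers.dense(h, 10)

# One way to add learning rate decay; remember to pass global_step=global_step
# to optimizer.minimize(...) so the schedule actually advances.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(1e-3, global_step,
                                           decay_steps=750,   # ~one epoch of size-64 minibatches; an arbitrary choice
                                           decay_rate=0.9, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)

`mean_loss` still needs to be defined in the same cell, for example with the cross-entropy sketch shown earlier.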
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Feel free to play with this cell\n", + "\n", + "def my_model(X,y,is_training):\n", + " pass\n", + "\n", + "tf.reset_default_graph()\n", + "\n", + "X = tf.placeholder(tf.float32, [None, 32, 32, 3])\n", + "y = tf.placeholder(tf.int64, [None])\n", + "is_training = tf.placeholder(tf.bool)\n", + "\n", + "y_out = my_model(X,y,is_training)\n", + "mean_loss = None\n", + "optimizer = None\n", + "\n", + "\n", + "pass\n", + "\n", + "# batch normalization in tensorflow requires this extra dependency\n", + "extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)\n", + "with tf.control_dependencies(extra_update_ops):\n", + " train_step = optimizer.minimize(mean_loss)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Feel free to play with this cell\n", + "# This default code creates a session\n", + "# and trains your model for 10 epochs\n", + "# then prints the validation set accuracy\n", + "sess = tf.Session()\n", + "\n", + "sess.run(tf.global_variables_initializer())\n", + "print('Training')\n", + "run_model(sess,y_out,mean_loss,X_train,y_train,10,64,100,train_step,True)\n", + "print('Validation')\n", + "run_model(sess,y_out,mean_loss,X_val,y_val,1,64)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test your model here, and make sure \n", + "# the output of this cell is the accuracy\n", + "# of your best model on the training and val sets\n", + "# We're looking for >= 70% accuracy on Validation\n", + "print('Training')\n", + "run_model(sess,y_out,mean_loss,X_train,y_train,1,64)\n", + "print('Validation')\n", + "run_model(sess,y_out,mean_loss,X_val,y_val,1,64)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "### Describe what you did here\n", + "In this cell you should also write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "_Tell us here_" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "### Test Set - Do this only once\n", + "Now that we've gotten a result that we're happy with, we test our final model on the test set. This would be the score we would achieve on a competition. Think about how this compares to your validation set accuracy." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "print('Test')\n", + "run_model(sess,y_out,mean_loss,X_test,y_test,1,64)" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "## Going further with TensorFlow\n", + "\n", + "The next assignment will make heavy use of TensorFlow. You might also find it useful for your projects. \n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Extra Credit Description\n", + "If you implement any additional features for extra credit, clearly describe them here with pointers to any code in this or other files if applicable." 
+ ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "3.5.2", + "pygments_lexer": "ipython3", + "codemirror_mode": { + "version": 3, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/cs231n/.gitignore b/cs231n/.gitignore new file mode 100644 index 0000000..fbb42c2 --- /dev/null +++ b/cs231n/.gitignore @@ -0,0 +1,3 @@ +build/* +im2col_cython.c +im2col_cython.so diff --git a/cs231n/__init__.py b/cs231n/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/cs231n/classifiers/__init__.py b/cs231n/classifiers/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/cs231n/classifiers/cnn.py b/cs231n/classifiers/cnn.py new file mode 100644 index 0000000..8859a8a --- /dev/null +++ b/cs231n/classifiers/cnn.py @@ -0,0 +1,103 @@ +from builtins import object +import numpy as np + +from cs231n.layers import * +from cs231n.fast_layers import * +from cs231n.layer_utils import * + + +class ThreeLayerConvNet(object): + """ + A three-layer convolutional network with the following architecture: + + conv - relu - 2x2 max pool - affine - relu - affine - softmax + + The network operates on minibatches of data that have shape (N, C, H, W) + consisting of N images, each with height H and width W and with C input + channels. + """ + + def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7, + hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0, + dtype=np.float32): + """ + Initialize a new network. + + Inputs: + - input_dim: Tuple (C, H, W) giving size of input data + - num_filters: Number of filters to use in the convolutional layer + - filter_size: Size of filters to use in the convolutional layer + - hidden_dim: Number of units to use in the fully-connected hidden layer + - num_classes: Number of scores to produce from the final affine layer. + - weight_scale: Scalar giving standard deviation for random initialization + of weights. + - reg: Scalar giving L2 regularization strength + - dtype: numpy datatype to use for computation. + """ + self.params = {} + self.reg = reg + self.dtype = dtype + + ############################################################################ + # TODO: Initialize weights and biases for the three-layer convolutional # + # network. Weights should be initialized from a Gaussian with standard # + # deviation equal to weight_scale; biases should be initialized to zero. # + # All weights and biases should be stored in the dictionary self.params. # + # Store weights and biases for the convolutional layer using the keys 'W1' # + # and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the # + # hidden affine layer, and keys 'W3' and 'b3' for the weights and biases # + # of the output affine layer. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + for k, v in self.params.items(): + self.params[k] = v.astype(dtype) + + + def loss(self, X, y=None): + """ + Evaluate loss and gradient for the three-layer convolutional network. + + Input / output: Same API as TwoLayerNet in fc_net.py. 
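    For concreteness, an illustrative usage (the arrays X and y here stand for a
    CIFAR-10 minibatch and its labels):

        model = ThreeLayerConvNet(weight_scale=1e-2, reg=1e-3)
        scores = model.loss(X)           # test time: (N, num_classes) class scores
        loss, grads = model.loss(X, y)   # training: scalar loss and gradient dict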
+ """ + W1, b1 = self.params['W1'], self.params['b1'] + W2, b2 = self.params['W2'], self.params['b2'] + W3, b3 = self.params['W3'], self.params['b3'] + + # pass conv_param to the forward pass for the convolutional layer + filter_size = W1.shape[2] + conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2} + + # pass pool_param to the forward pass for the max-pooling layer + pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2} + + scores = None + ############################################################################ + # TODO: Implement the forward pass for the three-layer convolutional net, # + # computing the class scores for X and storing them in the scores # + # variable. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + if y is None: + return scores + + loss, grads = 0, {} + ############################################################################ + # TODO: Implement the backward pass for the three-layer convolutional net, # + # storing the loss and gradients in the loss and grads variables. Compute # + # data loss using softmax, and make sure that grads[k] holds the gradients # + # for self.params[k]. Don't forget to add L2 regularization! # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads diff --git a/cs231n/classifiers/fc_net.py b/cs231n/classifiers/fc_net.py new file mode 100644 index 0000000..6d186c8 --- /dev/null +++ b/cs231n/classifiers/fc_net.py @@ -0,0 +1,252 @@ +from builtins import range +from builtins import object +import numpy as np + +from cs231n.layers import * +from cs231n.layer_utils import * + + +class TwoLayerNet(object): + """ + A two-layer fully-connected neural network with ReLU nonlinearity and + softmax loss that uses a modular layer design. We assume an input dimension + of D, a hidden dimension of H, and perform classification over C classes. + + The architecure should be affine - relu - affine - softmax. + + Note that this class does not implement gradient descent; instead, it + will interact with a separate Solver object that is responsible for running + optimization. + + The learnable parameters of the model are stored in the dictionary + self.params that maps parameter names to numpy arrays. + """ + + def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10, + weight_scale=1e-3, reg=0.0): + """ + Initialize a new network. + + Inputs: + - input_dim: An integer giving the size of the input + - hidden_dim: An integer giving the size of the hidden layer + - num_classes: An integer giving the number of classes to classify + - dropout: Scalar between 0 and 1 giving dropout strength. + - weight_scale: Scalar giving the standard deviation for random + initialization of the weights. + - reg: Scalar giving L2 regularization strength. + """ + self.params = {} + self.reg = reg + + ############################################################################ + # TODO: Initialize the weights and biases of the two-layer net. Weights # + # should be initialized from a Gaussian with standard deviation equal to # + # weight_scale, and biases should be initialized to zero. 
All weights and # + # biases should be stored in the dictionary self.params, with first layer # + # weights and biases using the keys 'W1' and 'b1' and second layer weights # + # and biases using the keys 'W2' and 'b2'. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + + def loss(self, X, y=None): + """ + Compute loss and gradient for a minibatch of data. + + Inputs: + - X: Array of input data of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,). y[i] gives the label for X[i]. + + Returns: + If y is None, then run a test-time forward pass of the model and return: + - scores: Array of shape (N, C) giving classification scores, where + scores[i, c] is the classification score for X[i] and class c. + + If y is not None, then run a training-time forward and backward pass and + return a tuple of: + - loss: Scalar value giving the loss + - grads: Dictionary with the same keys as self.params, mapping parameter + names to gradients of the loss with respect to those parameters. + """ + scores = None + ############################################################################ + # TODO: Implement the forward pass for the two-layer net, computing the # + # class scores for X and storing them in the scores variable. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # If y is None then we are in test mode so just return scores + if y is None: + return scores + + loss, grads = 0, {} + ############################################################################ + # TODO: Implement the backward pass for the two-layer net. Store the loss # + # in the loss variable and gradients in the grads dictionary. Compute data # + # loss using softmax, and make sure that grads[k] holds the gradients for # + # self.params[k]. Don't forget to add L2 regularization! # + # # + # NOTE: To ensure that your implementation matches ours and you pass the # + # automated tests, make sure that your L2 regularization includes a factor # + # of 0.5 to simplify the expression for the gradient. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads + + +class FullyConnectedNet(object): + """ + A fully-connected neural network with an arbitrary number of hidden layers, + ReLU nonlinearities, and a softmax loss function. This will also implement + dropout and batch normalization as options. For a network with L layers, + the architecture will be + + {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax + + where batch normalization and dropout are optional, and the {...} block is + repeated L - 1 times. + + Similar to the TwoLayerNet above, learnable parameters are stored in the + self.params dictionary and will be learned using the Solver class. 
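    For example (illustrative), a net with two hidden layers of 100 units each,
    batch normalization, a dropout strength of 0.5, and float64 parameters for
    gradient checking could be constructed as:

        model = FullyConnectedNet([100, 100], input_dim=3*32*32, num_classes=10,
                                  dropout=0.5, use_batchnorm=True, reg=1e-2,
                                  weight_scale=5e-2, dtype=np.float64)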
+ """ + + def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10, + dropout=0, use_batchnorm=False, reg=0.0, + weight_scale=1e-2, dtype=np.float32, seed=None): + """ + Initialize a new FullyConnectedNet. + + Inputs: + - hidden_dims: A list of integers giving the size of each hidden layer. + - input_dim: An integer giving the size of the input. + - num_classes: An integer giving the number of classes to classify. + - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then + the network should not use dropout at all. + - use_batchnorm: Whether or not the network should use batch normalization. + - reg: Scalar giving L2 regularization strength. + - weight_scale: Scalar giving the standard deviation for random + initialization of the weights. + - dtype: A numpy datatype object; all computations will be performed using + this datatype. float32 is faster but less accurate, so you should use + float64 for numeric gradient checking. + - seed: If not None, then pass this random seed to the dropout layers. This + will make the dropout layers deteriminstic so we can gradient check the + model. + """ + self.use_batchnorm = use_batchnorm + self.use_dropout = dropout > 0 + self.reg = reg + self.num_layers = 1 + len(hidden_dims) + self.dtype = dtype + self.params = {} + + ############################################################################ + # TODO: Initialize the parameters of the network, storing all values in # + # the self.params dictionary. Store weights and biases for the first layer # + # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be # + # initialized from a normal distribution with standard deviation equal to # + # weight_scale and biases should be initialized to zero. # + # # + # When using batch normalization, store scale and shift parameters for the # + # first layer in gamma1 and beta1; for the second layer use gamma2 and # + # beta2, etc. Scale parameters should be initialized to one and shift # + # parameters should be initialized to zero. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # When using dropout we need to pass a dropout_param dictionary to each + # dropout layer so that the layer knows the dropout probability and the mode + # (train / test). You can pass the same dropout_param to each dropout layer. + self.dropout_param = {} + if self.use_dropout: + self.dropout_param = {'mode': 'train', 'p': dropout} + if seed is not None: + self.dropout_param['seed'] = seed + + # With batch normalization we need to keep track of running means and + # variances, so we need to pass a special bn_param object to each batch + # normalization layer. You should pass self.bn_params[0] to the forward pass + # of the first batch normalization layer, self.bn_params[1] to the forward + # pass of the second batch normalization layer, etc. + self.bn_params = [] + if self.use_batchnorm: + self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)] + + # Cast all parameters to the correct datatype + for k, v in self.params.items(): + self.params[k] = v.astype(dtype) + + + def loss(self, X, y=None): + """ + Compute loss and gradient for the fully-connected net. + + Input / output: Same as TwoLayerNet above. 
+ """ + X = X.astype(self.dtype) + mode = 'test' if y is None else 'train' + + # Set train/test mode for batchnorm params and dropout param since they + # behave differently during training and testing. + if self.use_dropout: + self.dropout_param['mode'] = mode + if self.use_batchnorm: + for bn_param in self.bn_params: + bn_param['mode'] = mode + + scores = None + ############################################################################ + # TODO: Implement the forward pass for the fully-connected net, computing # + # the class scores for X and storing them in the scores variable. # + # # + # When using dropout, you'll need to pass self.dropout_param to each # + # dropout forward pass. # + # # + # When using batch normalization, you'll need to pass self.bn_params[0] to # + # the forward pass for the first batch normalization layer, pass # + # self.bn_params[1] to the forward pass for the second batch normalization # + # layer, etc. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # If test mode return early + if mode == 'test': + return scores + + loss, grads = 0.0, {} + ############################################################################ + # TODO: Implement the backward pass for the fully-connected net. Store the # + # loss in the loss variable and gradients in the grads dictionary. Compute # + # data loss using softmax, and make sure that grads[k] holds the gradients # + # for self.params[k]. Don't forget to add L2 regularization! # + # # + # When using batch normalization, you don't need to regularize the scale # + # and shift parameters. # + # # + # NOTE: To ensure that your implementation matches ours and you pass the # + # automated tests, make sure that your L2 regularization includes a factor # + # of 0.5 to simplify the expression for the gradient. 
# + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads diff --git a/cs231n/data_utils.py b/cs231n/data_utils.py new file mode 100644 index 0000000..7688518 --- /dev/null +++ b/cs231n/data_utils.py @@ -0,0 +1,262 @@ +from __future__ import print_function + +from builtins import range +from six.moves import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread +import platform + +def load_pickle(f): + version = platform.python_version_tuple() + if version[0] == '2': + return pickle.load(f) + elif version[0] == '3': + return pickle.load(f, encoding='latin1') + raise ValueError("invalid python version: {}".format(version)) + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = load_pickle(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + + +def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, + subtract_mean=True): + """ + Load the CIFAR-10 dataset from disk and perform preprocessing to prepare + it for classifiers. These are the same steps as we used for the SVM, but + condensed to a single function. + """ + # Load the raw CIFAR-10 data + cifar10_dir = 'cs231n/datasets/cifar-10-batches-py' + X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir) + + # Subsample the data + mask = list(range(num_training, num_training + num_validation)) + X_val = X_train[mask] + y_val = y_train[mask] + mask = list(range(num_training)) + X_train = X_train[mask] + y_train = y_train[mask] + mask = list(range(num_test)) + X_test = X_test[mask] + y_test = y_test[mask] + + # Normalize the data: subtract the mean image + if subtract_mean: + mean_image = np.mean(X_train, axis=0) + X_train -= mean_image + X_val -= mean_image + X_test -= mean_image + + # Transpose so that channels come first + X_train = X_train.transpose(0, 3, 1, 2).copy() + X_val = X_val.transpose(0, 3, 1, 2).copy() + X_test = X_test.transpose(0, 3, 1, 2).copy() + + # Package data into a dictionary + return { + 'X_train': X_train, 'y_train': y_train, + 'X_val': X_val, 'y_val': y_val, + 'X_test': X_test, 'y_test': y_test, + } + + +def load_tiny_imagenet(path, dtype=np.float32, subtract_mean=True): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 have the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + - subtract_mean: Whether to subtract the mean training image. + + Returns: A dictionary with the following entries: + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. 
+ - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. + - mean_image: (3, 64, 64) array giving mean training image + """ + # First load wnids + with open(os.path.join(path, 'wnids.txt'), 'r') as f: + wnids = [x.strip() for x in f] + + # Map wnids to integer labels + wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)} + + # Use words.txt to get names for each class + with open(os.path.join(path, 'words.txt'), 'r') as f: + wnid_to_words = dict(line.split('\t') for line in f) + for wnid, words in wnid_to_words.items(): + wnid_to_words[wnid] = [w.strip() for w in words.split(',')] + class_names = [wnid_to_words[wnid] for wnid in wnids] + + # Next load training data. + X_train = [] + y_train = [] + for i, wnid in enumerate(wnids): + if (i + 1) % 20 == 0: + print('loading training data for synset %d / %d' + % (i + 1, len(wnids))) + # To figure out the filenames we need to open the boxes file + boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid) + with open(boxes_file, 'r') as f: + filenames = [x.split('\t')[0] for x in f] + num_images = len(filenames) + + X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype) + y_train_block = wnid_to_label[wnid] * \ + np.ones(num_images, dtype=np.int64) + for j, img_file in enumerate(filenames): + img_file = os.path.join(path, 'train', wnid, 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + ## grayscale file + img.shape = (64, 64, 1) + X_train_block[j] = img.transpose(2, 0, 1) + X_train.append(X_train_block) + y_train.append(y_train_block) + + # We need to concatenate all training data + X_train = np.concatenate(X_train, axis=0) + y_train = np.concatenate(y_train, axis=0) + + # Next load validation data + with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f: + img_files = [] + val_wnids = [] + for line in f: + img_file, wnid = line.split('\t')[:2] + img_files.append(img_file) + val_wnids.append(wnid) + num_val = len(img_files) + y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids]) + X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'val', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_val[i] = img.transpose(2, 0, 1) + + # Next load test images + # Students won't have test labels, so we need to iterate over files in the + # images directory. 
+ img_files = os.listdir(os.path.join(path, 'test', 'images')) + X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'test', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_test[i] = img.transpose(2, 0, 1) + + y_test = None + y_test_file = os.path.join(path, 'test', 'test_annotations.txt') + if os.path.isfile(y_test_file): + with open(y_test_file, 'r') as f: + img_file_to_wnid = {} + for line in f: + line = line.split('\t') + img_file_to_wnid[line[0]] = line[1] + y_test = [wnid_to_label[img_file_to_wnid[img_file]] + for img_file in img_files] + y_test = np.array(y_test) + + mean_image = X_train.mean(axis=0) + if subtract_mean: + X_train -= mean_image[None] + X_val -= mean_image[None] + X_test -= mean_image[None] + + return { + 'class_names': class_names, + 'X_train': X_train, + 'y_train': y_train, + 'X_val': X_val, + 'y_val': y_val, + 'X_test': X_test, + 'y_test': y_test, + 'class_names': class_names, + 'mean_image': mean_image, + } + + +def load_models(models_dir): + """ + Load saved models from disk. This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) + will be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. + """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = load_pickle(f)['model'] + except pickle.UnpicklingError: + continue + return models + + +def load_imagenet_val(num=None): + """Load a handful of validation images from ImageNet. 
+ + Inputs: + - num: Number of images to load (max of 25) + + Returns: + - X: numpy array with shape [num, 224, 224, 3] + - y: numpy array of integer image labels, shape [num] + - class_names: dict mapping integer label to class name + """ + imagenet_fn = 'cs231n/datasets/imagenet_val_25.npz' + if not os.path.isfile(imagenet_fn): + print('file %s not found' % imagenet_fn) + print('Run the following:') + print('cd cs231n/datasets') + print('bash get_imagenet_val.sh') + assert False, 'Need to download imagenet_val_25.npz' + f = np.load(imagenet_fn) + X = f['X'] + y = f['y'] + class_names = f['label_map'].item() + if num is not None: + X = X[:num] + y = y[:num] + return X, y, class_names diff --git a/cs231n/datasets/.gitignore b/cs231n/datasets/.gitignore new file mode 100644 index 0000000..0232c3a --- /dev/null +++ b/cs231n/datasets/.gitignore @@ -0,0 +1,4 @@ +cifar-10-batches-py/* +tiny-imagenet-100-A* +tiny-imagenet-100-B* +tiny-100-A-pretrained/* diff --git a/cs231n/datasets/get_datasets.sh b/cs231n/datasets/get_datasets.sh new file mode 100755 index 0000000..0dd9362 --- /dev/null +++ b/cs231n/datasets/get_datasets.sh @@ -0,0 +1,4 @@ +# Get CIFAR10 +wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz +tar -xzvf cifar-10-python.tar.gz +rm cifar-10-python.tar.gz diff --git a/cs231n/fast_layers.py b/cs231n/fast_layers.py new file mode 100644 index 0000000..95875c0 --- /dev/null +++ b/cs231n/fast_layers.py @@ -0,0 +1,271 @@ +from __future__ import print_function +import numpy as np +try: + from cs231n.im2col_cython import col2im_cython, im2col_cython + from cs231n.im2col_cython import col2im_6d_cython +except ImportError: + print('run the following from the cs231n directory and try again:') + print('python setup.py build_ext --inplace') + print('You may also need to restart your iPython kernel') + +from cs231n.im2col import * + + +def conv_forward_im2col(x, w, b, conv_param): + """ + A fast implementation of the forward pass for a convolutional layer + based on im2col and col2im. 
+ """ + N, C, H, W = x.shape + num_filters, _, filter_height, filter_width = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - filter_width) % stride == 0, 'width does not work' + assert (H + 2 * pad - filter_height) % stride == 0, 'height does not work' + + # Create output + out_height = (H + 2 * pad - filter_height) // stride + 1 + out_width = (W + 2 * pad - filter_width) // stride + 1 + out = np.zeros((N, num_filters, out_height, out_width), dtype=x.dtype) + + # x_cols = im2col_indices(x, w.shape[2], w.shape[3], pad, stride) + x_cols = im2col_cython(x, w.shape[2], w.shape[3], pad, stride) + res = w.reshape((w.shape[0], -1)).dot(x_cols) + b.reshape(-1, 1) + + out = res.reshape(w.shape[0], out.shape[2], out.shape[3], x.shape[0]) + out = out.transpose(3, 0, 1, 2) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_forward_strides(x, w, b, conv_param): + N, C, H, W = x.shape + F, _, HH, WW = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + #assert (W + 2 * pad - WW) % stride == 0, 'width does not work' + #assert (H + 2 * pad - HH) % stride == 0, 'height does not work' + + # Pad the input + p = pad + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + # Figure out output dimensions + H += 2 * pad + W += 2 * pad + out_h = (H - HH) // stride + 1 + out_w = (W - WW) // stride + 1 + + # Perform an im2col operation by picking clever strides + shape = (C, HH, WW, N, out_h, out_w) + strides = (H * W, W, 1, C * H * W, stride * W, stride) + strides = x.itemsize * np.array(strides) + x_stride = np.lib.stride_tricks.as_strided(x_padded, + shape=shape, strides=strides) + x_cols = np.ascontiguousarray(x_stride) + x_cols.shape = (C * HH * WW, N * out_h * out_w) + + # Now all our convolutions are a big matrix multiply + res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1) + + # Reshape the output + res.shape = (F, N, out_h, out_w) + out = res.transpose(1, 0, 2, 3) + + # Be nice and return a contiguous array + # The old version of conv_forward_fast doesn't do this, so for a fair + # comparison we won't either + out = np.ascontiguousarray(out) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_backward_strides(dout, cache): + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + N, C, H, W = x.shape + F, _, HH, WW = w.shape + _, _, out_h, out_w = dout.shape + + db = np.sum(dout, axis=(0, 2, 3)) + + dout_reshaped = dout.transpose(1, 0, 2, 3).reshape(F, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(F, -1).T.dot(dout_reshaped) + dx_cols.shape = (C, HH, WW, N, out_h, out_w) + dx = col2im_6d_cython(dx_cols, N, C, H, W, HH, WW, pad, stride) + + return dx, dw, db + + +def conv_backward_im2col(dout, cache): + """ + A fast implementation of the backward pass for a convolutional layer + based on im2col and col2im. 
+ """ + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + db = np.sum(dout, axis=(0, 2, 3)) + + num_filters, _, filter_height, filter_width = w.shape + dout_reshaped = dout.transpose(1, 2, 3, 0).reshape(num_filters, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(num_filters, -1).T.dot(dout_reshaped) + # dx = col2im_indices(dx_cols, x.shape, filter_height, filter_width, pad, stride) + dx = col2im_cython(dx_cols, x.shape[0], x.shape[1], x.shape[2], x.shape[3], + filter_height, filter_width, pad, stride) + + return dx, dw, db + + +conv_forward_fast = conv_forward_strides +conv_backward_fast = conv_backward_strides + + +def max_pool_forward_fast(x, pool_param): + """ + A fast implementation of the forward pass for a max pooling layer. + + This chooses between the reshape method and the im2col method. If the pooling + regions are square and tile the input image, then we can use the reshape + method which is very fast. Otherwise we fall back on the im2col method, which + is not much faster than the naive method. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + same_size = pool_height == pool_width == stride + tiles = H % pool_height == 0 and W % pool_width == 0 + if same_size and tiles: + out, reshape_cache = max_pool_forward_reshape(x, pool_param) + cache = ('reshape', reshape_cache) + else: + out, im2col_cache = max_pool_forward_im2col(x, pool_param) + cache = ('im2col', im2col_cache) + return out, cache + + +def max_pool_backward_fast(dout, cache): + """ + A fast implementation of the backward pass for a max pooling layer. + + This switches between the reshape method an the im2col method depending on + which method was used to generate the cache. + """ + method, real_cache = cache + if method == 'reshape': + return max_pool_backward_reshape(dout, real_cache) + elif method == 'im2col': + return max_pool_backward_im2col(dout, real_cache) + else: + raise ValueError('Unrecognized method "%s"' % method) + + +def max_pool_forward_reshape(x, pool_param): + """ + A fast implementation of the forward pass for the max pooling layer that uses + some clever reshaping. + + This can only be used for square pooling regions that tile the input. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + assert pool_height == pool_width == stride, 'Invalid pool params' + assert H % pool_height == 0 + assert W % pool_height == 0 + x_reshaped = x.reshape(N, C, H // pool_height, pool_height, + W // pool_width, pool_width) + out = x_reshaped.max(axis=3).max(axis=4) + + cache = (x, x_reshaped, out) + return out, cache + + +def max_pool_backward_reshape(dout, cache): + """ + A fast implementation of the backward pass for the max pooling layer that + uses some clever broadcasting and reshaping. + + This can only be used if the forward pass was computed using + max_pool_forward_reshape. + + NOTE: If there are multiple argmaxes, this method will assign gradient to + ALL argmax elements of the input rather than picking one. In this case the + gradient will actually be incorrect. However this is unlikely to occur in + practice, so it shouldn't matter much. One possible solution is to split the + upstream gradient equally among all argmax elements; this should result in a + valid subgradient. 
You can make this happen by uncommenting the line below; + however this results in a significant performance penalty (about 40% slower) + and is unlikely to matter in practice so we don't do it. + """ + x, x_reshaped, out = cache + + dx_reshaped = np.zeros_like(x_reshaped) + out_newaxis = out[:, :, :, np.newaxis, :, np.newaxis] + mask = (x_reshaped == out_newaxis) + dout_newaxis = dout[:, :, :, np.newaxis, :, np.newaxis] + dout_broadcast, _ = np.broadcast_arrays(dout_newaxis, dx_reshaped) + dx_reshaped[mask] = dout_broadcast[mask] + dx_reshaped /= np.sum(mask, axis=(3, 5), keepdims=True) + dx = dx_reshaped.reshape(x.shape) + + return dx + + +def max_pool_forward_im2col(x, pool_param): + """ + An implementation of the forward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + assert (H - pool_height) % stride == 0, 'Invalid height' + assert (W - pool_width) % stride == 0, 'Invalid width' + + out_height = (H - pool_height) // stride + 1 + out_width = (W - pool_width) // stride + 1 + + x_split = x.reshape(N * C, 1, H, W) + x_cols = im2col(x_split, pool_height, pool_width, padding=0, stride=stride) + x_cols_argmax = np.argmax(x_cols, axis=0) + x_cols_max = x_cols[x_cols_argmax, np.arange(x_cols.shape[1])] + out = x_cols_max.reshape(out_height, out_width, N, C).transpose(2, 3, 0, 1) + + cache = (x, x_cols, x_cols_argmax, pool_param) + return out, cache + + +def max_pool_backward_im2col(dout, cache): + """ + An implementation of the backward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. 
+ """ + x, x_cols, x_cols_argmax, pool_param = cache + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + dout_reshaped = dout.transpose(2, 3, 0, 1).flatten() + dx_cols = np.zeros_like(x_cols) + dx_cols[x_cols_argmax, np.arange(dx_cols.shape[1])] = dout_reshaped + dx = col2im_indices(dx_cols, (N * C, 1, H, W), pool_height, pool_width, + padding=0, stride=stride) + dx = dx.reshape(x.shape) + + return dx diff --git a/cs231n/gradient_check.py b/cs231n/gradient_check.py new file mode 100644 index 0000000..e1189fc --- /dev/null +++ b/cs231n/gradient_check.py @@ -0,0 +1,129 @@ +from __future__ import print_function +from builtins import range +from past.builtins import xrange + +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evalute f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print(ix, grad[ix]) + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a + blob where outputs will be written. For example, f might be called like: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. 
+ + Inputs: + - f: function + - inputs: tuple of input blobs + - output: output blob + - h: step size + """ + numeric_diffs = [] + for input_blob in inputs: + diff = np.zeros_like(input_blob.diffs) + it = np.nditer(input_blob.vals, flags=['multi_index'], + op_flags=['readwrite']) + while not it.finished: + idx = it.multi_index + orig = input_blob.vals[idx] + + input_blob.vals[idx] = orig + h + f(*(inputs + (output,))) + pos = np.copy(output.vals) + input_blob.vals[idx] = orig - h + f(*(inputs + (output,))) + neg = np.copy(output.vals) + input_blob.vals[idx] = orig + + diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h) + + it.iternext() + numeric_diffs.append(diff) + return numeric_diffs + + +def eval_numerical_gradient_net(net, inputs, output, h=1e-5): + return eval_numerical_gradient_blobs(lambda *args: net.forward(), + inputs, output, h=h) + + +def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5): + """ + sample a few random elements and only return numerical + in this dimensions. + """ + + for i in range(num_checks): + ix = tuple([randrange(m) for m in x.shape]) + + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h # increment by h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # reset + + grad_numerical = (fxph - fxmh) / (2 * h) + grad_analytic = analytic_grad[ix] + rel_error = (abs(grad_numerical - grad_analytic) / + (abs(grad_numerical) + abs(grad_analytic))) + print('numerical: %f analytic: %f, relative error: %e' + %(grad_numerical, grad_analytic, rel_error)) diff --git a/cs231n/im2col.py b/cs231n/im2col.py new file mode 100644 index 0000000..3d1b3f2 --- /dev/null +++ b/cs231n/im2col.py @@ -0,0 +1,56 @@ +from builtins import range +import numpy as np + + +def get_im2col_indices(x_shape, field_height, field_width, padding=1, stride=1): + # First figure out what the size of the output should be + N, C, H, W = x_shape + assert (H + 2 * padding - field_height) % stride == 0 + assert (W + 2 * padding - field_height) % stride == 0 + out_height = (H + 2 * padding - field_height) / stride + 1 + out_width = (W + 2 * padding - field_width) / stride + 1 + + i0 = np.repeat(np.arange(field_height), field_width) + i0 = np.tile(i0, C) + i1 = stride * np.repeat(np.arange(out_height), out_width) + j0 = np.tile(np.arange(field_width), field_height * C) + j1 = stride * np.tile(np.arange(out_width), out_height) + i = i0.reshape(-1, 1) + i1.reshape(1, -1) + j = j0.reshape(-1, 1) + j1.reshape(1, -1) + + k = np.repeat(np.arange(C), field_height * field_width).reshape(-1, 1) + + return (k, i, j) + + +def im2col_indices(x, field_height, field_width, padding=1, stride=1): + """ An implementation of im2col based on some fancy indexing """ + # Zero-pad the input + p = padding + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding, + stride) + + cols = x_padded[:, k, i, j] + C = x.shape[1] + cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1) + return cols + + +def col2im_indices(cols, x_shape, field_height=3, field_width=3, padding=1, + stride=1): + """ An implementation of col2im based on fancy indexing and np.add.at """ + N, C, H, W = x_shape + H_padded, W_padded = H + 2 * padding, W + 2 * padding + x_padded = np.zeros((N, C, H_padded, W_padded), dtype=cols.dtype) + k, i, j = get_im2col_indices(x_shape, field_height, field_width, padding, + stride) + cols_reshaped = cols.reshape(C * field_height * 
field_width, -1, N) + cols_reshaped = cols_reshaped.transpose(2, 0, 1) + np.add.at(x_padded, (slice(None), k, i, j), cols_reshaped) + if padding == 0: + return x_padded + return x_padded[:, :, padding:-padding, padding:-padding] + +pass diff --git a/cs231n/im2col_cython.pyx b/cs231n/im2col_cython.pyx new file mode 100644 index 0000000..d6e33c6 --- /dev/null +++ b/cs231n/im2col_cython.pyx @@ -0,0 +1,121 @@ +import numpy as np +cimport numpy as np +cimport cython + +# DTYPE = np.float64 +# ctypedef np.float64_t DTYPE_t + +ctypedef fused DTYPE_t: + np.float32_t + np.float64_t + +def im2col_cython(np.ndarray[DTYPE_t, ndim=4] x, int field_height, + int field_width, int padding, int stride): + cdef int N = x.shape[0] + cdef int C = x.shape[1] + cdef int H = x.shape[2] + cdef int W = x.shape[3] + + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + + cdef int p = padding + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.pad(x, + ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + cdef np.ndarray[DTYPE_t, ndim=2] cols = np.zeros( + (C * field_height * field_width, N * HH * WW), + dtype=x.dtype) + + # Moving the inner loop to a C function with no bounds checking works, but does + # not seem to help performance in any measurable way. + + im2col_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + return cols + + +@cython.boundscheck(False) +cdef int im2col_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? -1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for yy in range(HH): + for xx in range(WW): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for i in range(N): + col = yy * WW * N + xx * N + i + cols[row, col] = x_padded[i, c, stride * yy + ii, stride * xx + jj] + + + +def col2im_cython(np.ndarray[DTYPE_t, ndim=2] cols, int N, int C, int H, int W, + int field_height, int field_width, int padding, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * padding, W + 2 * padding), + dtype=cols.dtype) + + # Moving the inner loop to a C-function with no bounds checking improves + # performance quite a bit for col2im. + col2im_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + if padding > 0: + return x_padded[:, :, padding:-padding, padding:-padding] + return x_padded + + +@cython.boundscheck(False) +cdef int col2im_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? 
-1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for yy in range(HH): + for xx in range(WW): + for i in range(N): + col = yy * WW * N + xx * N + i + x_padded[i, c, stride * yy + ii, stride * xx + jj] += cols[row, col] + + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef col2im_6d_cython_inner(np.ndarray[DTYPE_t, ndim=6] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int out_h, int out_w, int pad, int stride): + + cdef int c, hh, ww, n, h, w + for n in range(N): + for c in range(C): + for hh in range(HH): + for ww in range(WW): + for h in range(out_h): + for w in range(out_w): + x_padded[n, c, stride * h + hh, stride * w + ww] += cols[c, hh, ww, n, h, w] + + +def col2im_6d_cython(np.ndarray[DTYPE_t, ndim=6] cols, int N, int C, int H, int W, + int HH, int WW, int pad, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int out_h = (H + 2 * pad - HH) / stride + 1 + cdef int out_w = (W + 2 * pad - WW) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * pad, W + 2 * pad), + dtype=cols.dtype) + + col2im_6d_cython_inner(cols, x_padded, N, C, H, W, HH, WW, out_h, out_w, pad, stride) + + if pad > 0: + return x_padded[:, :, pad:-pad, pad:-pad] + return x_padded diff --git a/cs231n/layer_utils.py b/cs231n/layer_utils.py new file mode 100644 index 0000000..de0b3db --- /dev/null +++ b/cs231n/layer_utils.py @@ -0,0 +1,106 @@ +pass +from cs231n.layers import * +from cs231n.fast_layers import * + + +def affine_relu_forward(x, w, b): + """ + Convenience layer that perorms an affine transform followed by a ReLU + + Inputs: + - x: Input to the affine layer + - w, b: Weights for the affine layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, fc_cache = affine_forward(x, w, b) + out, relu_cache = relu_forward(a) + cache = (fc_cache, relu_cache) + return out, cache + + +def affine_relu_backward(dout, cache): + """ + Backward pass for the affine-relu convenience layer + """ + fc_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = affine_backward(da, fc_cache) + return dx, dw, db + + +def conv_relu_forward(x, w, b, conv_param): + """ + A convenience layer that performs a convolution followed by a ReLU. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + out, relu_cache = relu_forward(a) + cache = (conv_cache, relu_cache) + return out, cache + + +def conv_relu_backward(dout, cache): + """ + Backward pass for the conv-relu convenience layer. 
+ """ + conv_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + + +def conv_bn_relu_forward(x, w, b, gamma, beta, conv_param, bn_param): + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + an, bn_cache = spatial_batchnorm_forward(a, gamma, beta, bn_param) + out, relu_cache = relu_forward(an) + cache = (conv_cache, bn_cache, relu_cache) + return out, cache + + +def conv_bn_relu_backward(dout, cache): + conv_cache, bn_cache, relu_cache = cache + dan = relu_backward(dout, relu_cache) + da, dgamma, dbeta = spatial_batchnorm_backward(dan, bn_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db, dgamma, dbeta + + +def conv_relu_pool_forward(x, w, b, conv_param, pool_param): + """ + Convenience layer that performs a convolution, a ReLU, and a pool. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + - pool_param: Parameters for the pooling layer + + Returns a tuple of: + - out: Output from the pooling layer + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + s, relu_cache = relu_forward(a) + out, pool_cache = max_pool_forward_fast(s, pool_param) + cache = (conv_cache, relu_cache, pool_cache) + return out, cache + + +def conv_relu_pool_backward(dout, cache): + """ + Backward pass for the conv-relu-pool convenience layer + """ + conv_cache, relu_cache, pool_cache = cache + ds = max_pool_backward_fast(dout, pool_cache) + da = relu_backward(ds, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db diff --git a/cs231n/layers.py b/cs231n/layers.py new file mode 100644 index 0000000..c6551d7 --- /dev/null +++ b/cs231n/layers.py @@ -0,0 +1,559 @@ +from builtins import range +import numpy as np + + +def affine_forward(x, w, b): + """ + Computes the forward pass for an affine (fully-connected) layer. + + The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N + examples, where each example x[i] has shape (d_1, ..., d_k). We will + reshape each input into a vector of dimension D = d_1 * ... * d_k, and + then transform it to an output vector of dimension M. + + Inputs: + - x: A numpy array containing input data, of shape (N, d_1, ..., d_k) + - w: A numpy array of weights, of shape (D, M) + - b: A numpy array of biases, of shape (M,) + + Returns a tuple of: + - out: output, of shape (N, M) + - cache: (x, w, b) + """ + out = None + ########################################################################### + # TODO: Implement the affine forward pass. Store the result in out. You # + # will need to reshape the input into rows. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + cache = (x, w, b) + return out, cache + + +def affine_backward(dout, cache): + """ + Computes the backward pass for an affine layer. + + Inputs: + - dout: Upstream derivative, of shape (N, M) + - cache: Tuple of: + - x: Input data, of shape (N, d_1, ... 
d_k) + - w: Weights, of shape (D, M) + + Returns a tuple of: + - dx: Gradient with respect to x, of shape (N, d1, ..., d_k) + - dw: Gradient with respect to w, of shape (D, M) + - db: Gradient with respect to b, of shape (M,) + """ + x, w, b = cache + dx, dw, db = None, None, None + ########################################################################### + # TODO: Implement the affine backward pass. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + return dx, dw, db + + +def relu_forward(x): + """ + Computes the forward pass for a layer of rectified linear units (ReLUs). + + Input: + - x: Inputs, of any shape + + Returns a tuple of: + - out: Output, of the same shape as x + - cache: x + """ + out = None + ########################################################################### + # TODO: Implement the ReLU forward pass. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + cache = x + return out, cache + + +def relu_backward(dout, cache): + """ + Computes the backward pass for a layer of rectified linear units (ReLUs). + + Input: + - dout: Upstream derivatives, of any shape + - cache: Input x, of same shape as dout + + Returns: + - dx: Gradient with respect to x + """ + dx, x = None, cache + ########################################################################### + # TODO: Implement the ReLU backward pass. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + return dx + + +def batchnorm_forward(x, gamma, beta, bn_param): + """ + Forward pass for batch normalization. + + During training the sample mean and (uncorrected) sample variance are + computed from minibatch statistics and used to normalize the incoming data. + During training we also keep an exponentially decaying running mean of the + mean and variance of each feature, and these averages are used to normalize + data at test-time. + + At each timestep we update the running averages for mean and variance using + an exponential decay based on the momentum parameter: + + running_mean = momentum * running_mean + (1 - momentum) * sample_mean + running_var = momentum * running_var + (1 - momentum) * sample_var + + Note that the batch normalization paper suggests a different test-time + behavior: they compute sample mean and variance for each feature using a + large number of training images rather than using a running average. For + this implementation we have chosen to use running averages instead since + they do not require an additional estimation step; the torch7 + implementation of batch normalization also uses running averages. + + Input: + - x: Data of shape (N, D) + - gamma: Scale parameter of shape (D,) + - beta: Shift paremeter of shape (D,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. 
+ - running_mean: Array of shape (D,) giving running mean of features + - running_var Array of shape (D,) giving running variance of features + + Returns a tuple of: + - out: of shape (N, D) + - cache: A tuple of values needed in the backward pass + """ + mode = bn_param['mode'] + eps = bn_param.get('eps', 1e-5) + momentum = bn_param.get('momentum', 0.9) + + N, D = x.shape + running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype)) + running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype)) + + out, cache = None, None + if mode == 'train': + ####################################################################### + # TODO: Implement the training-time forward pass for batch norm. # + # Use minibatch statistics to compute the mean and variance, use # + # these statistics to normalize the incoming data, and scale and # + # shift the normalized data using gamma and beta. # + # # + # You should store the output in the variable out. Any intermediates # + # that you need for the backward pass should be stored in the cache # + # variable. # + # # + # You should also use your computed sample mean and variance together # + # with the momentum variable to update the running mean and running # + # variance, storing your result in the running_mean and running_var # + # variables. # + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + elif mode == 'test': + ####################################################################### + # TODO: Implement the test-time forward pass for batch normalization. # + # Use the running mean and variance to normalize the incoming data, # + # then scale and shift the normalized data using gamma and beta. # + # Store the result in the out variable. # + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + else: + raise ValueError('Invalid forward batchnorm mode "%s"' % mode) + + # Store the updated running means back into bn_param + bn_param['running_mean'] = running_mean + bn_param['running_var'] = running_var + + return out, cache + + +def batchnorm_backward(dout, cache): + """ + Backward pass for batch normalization. + + For this implementation, you should write out a computation graph for + batch normalization on paper and propagate gradients backward through + intermediate nodes. + + Inputs: + - dout: Upstream derivatives, of shape (N, D) + - cache: Variable of intermediates from batchnorm_forward. + + Returns a tuple of: + - dx: Gradient with respect to inputs x, of shape (N, D) + - dgamma: Gradient with respect to scale parameter gamma, of shape (D,) + - dbeta: Gradient with respect to shift parameter beta, of shape (D,) + """ + dx, dgamma, dbeta = None, None, None + ########################################################################### + # TODO: Implement the backward pass for batch normalization. Store the # + # results in the dx, dgamma, and dbeta variables. 
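One way to fill in the training-time branch described above is the NumPy sketch below. It simply transcribes the formulas from the docstring (per-feature minibatch mean and uncorrected variance, normalize, scale and shift, update the running averages); the _sketch name is ours, and this is not necessarily the assignment's reference solution.

import numpy as np

def batchnorm_forward_train_sketch(x, gamma, beta, running_mean, running_var,
                                   eps=1e-5, momentum=0.9):
    # x has shape (N, D); statistics are taken per feature, over the minibatch.
    sample_mean = x.mean(axis=0)
    sample_var = x.var(axis=0)                      # uncorrected (biased) variance
    x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
    out = gamma * x_hat + beta
    # Exponentially decaying running averages, exactly as in the docstring.
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var
    # At test time: out = gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta
    cache = (x, x_hat, sample_mean, sample_var, gamma, eps)
    return out, cache, running_mean, running_var

x = np.random.randn(4, 3)
out, cache, rm, rv = batchnorm_forward_train_sketch(
    x, np.ones(3), np.zeros(3), np.zeros(3), np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))            # ~0 and ~1 per feature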
# + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return dx, dgamma, dbeta + + +def batchnorm_backward_alt(dout, cache): + """ + Alternative backward pass for batch normalization. + + For this implementation you should work out the derivatives for the batch + normalizaton backward pass on paper and simplify as much as possible. You + should be able to derive a simple expression for the backward pass. + + Note: This implementation should expect to receive the same cache variable + as batchnorm_backward, but might not use all of the values in the cache. + + Inputs / outputs: Same as batchnorm_backward + """ + dx, dgamma, dbeta = None, None, None + ########################################################################### + # TODO: Implement the backward pass for batch normalization. Store the # + # results in the dx, dgamma, and dbeta variables. # + # # + # After computing the gradient with respect to the centered inputs, you # + # should be able to compute gradients with respect to the inputs in a # + # single statement; our implementation fits on a single 80-character line.# + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return dx, dgamma, dbeta + + +def dropout_forward(x, dropout_param): + """ + Performs the forward pass for (inverted) dropout. + + Inputs: + - x: Input data, of any shape + - dropout_param: A dictionary with the following keys: + - p: Dropout parameter. We drop each neuron output with probability p. + - mode: 'test' or 'train'. If the mode is train, then perform dropout; + if the mode is test, then just return the input. + - seed: Seed for the random number generator. Passing seed makes this + function deterministic, which is needed for gradient checking but not + in real networks. + + Outputs: + - out: Array of the same shape as x. + - cache: tuple (dropout_param, mask). In training mode, mask is the dropout + mask that was used to multiply the input; in test mode, mask is None. + """ + p, mode = dropout_param['p'], dropout_param['mode'] + if 'seed' in dropout_param: + np.random.seed(dropout_param['seed']) + + mask = None + out = None + + if mode == 'train': + ####################################################################### + # TODO: Implement training phase forward pass for inverted dropout. # + # Store the dropout mask in the mask variable. # + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + elif mode == 'test': + ####################################################################### + # TODO: Implement the test phase forward pass for inverted dropout. 
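For the two backward passes above, differentiating x_hat = (x - mu) / sqrt(var + eps) and simplifying yields a compact closed form for dx. The sketch below assumes the cache layout produced by the forward sketch shown earlier, (x, x_hat, mean, var, gamma, eps), and is an illustration rather than the reference solution.

import numpy as np

def batchnorm_backward_sketch(dout, cache):
    x, x_hat, mean, var, gamma, eps = cache          # x and mean are unused in the simplified form
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    # Fully simplified gradient with respect to the inputs.
    dx_hat = dout * gamma
    dx = (dx_hat - dx_hat.mean(axis=0)
          - x_hat * (dx_hat * x_hat).mean(axis=0)) / np.sqrt(var + eps)
    return dx, dgamma, dbeta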
# + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + + cache = (dropout_param, mask) + out = out.astype(x.dtype, copy=False) + + return out, cache + + +def dropout_backward(dout, cache): + """ + Perform the backward pass for (inverted) dropout. + + Inputs: + - dout: Upstream derivatives, of any shape + - cache: (dropout_param, mask) from dropout_forward. + """ + dropout_param, mask = cache + mode = dropout_param['mode'] + + dx = None + if mode == 'train': + ####################################################################### + # TODO: Implement training phase backward pass for inverted dropout # + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + elif mode == 'test': + dx = dout + return dx + + +def conv_forward_naive(x, w, b, conv_param): + """ + A naive implementation of the forward pass for a convolutional layer. + + The input consists of N data points, each with C channels, height H and + width W. We convolve each input with F different filters, where each filter + spans all C channels and has height HH and width HH. + + Input: + - x: Input data of shape (N, C, H, W) + - w: Filter weights of shape (F, C, HH, WW) + - b: Biases, of shape (F,) + - conv_param: A dictionary with the following keys: + - 'stride': The number of pixels between adjacent receptive fields in the + horizontal and vertical directions. + - 'pad': The number of pixels that will be used to zero-pad the input. + + Returns a tuple of: + - out: Output data, of shape (N, F, H', W') where H' and W' are given by + H' = 1 + (H + 2 * pad - HH) / stride + W' = 1 + (W + 2 * pad - WW) / stride + - cache: (x, w, b, conv_param) + """ + out = None + ########################################################################### + # TODO: Implement the convolutional forward pass. # + # Hint: you can use the function np.pad for padding. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + cache = (x, w, b, conv_param) + return out, cache + + +def conv_backward_naive(dout, cache): + """ + A naive implementation of the backward pass for a convolutional layer. + + Inputs: + - dout: Upstream derivatives. + - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive + + Returns a tuple of: + - dx: Gradient with respect to x + - dw: Gradient with respect to w + - db: Gradient with respect to b + """ + dx, dw, db = None, None, None + ########################################################################### + # TODO: Implement the convolutional backward pass. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + return dx, dw, db + + +def max_pool_forward_naive(x, pool_param): + """ + A naive implementation of the forward pass for a max pooling layer. 
+ + Inputs: + - x: Input data, of shape (N, C, H, W) + - pool_param: dictionary with the following keys: + - 'pool_height': The height of each pooling region + - 'pool_width': The width of each pooling region + - 'stride': The distance between adjacent pooling regions + + Returns a tuple of: + - out: Output data + - cache: (x, pool_param) + """ + out = None + ########################################################################### + # TODO: Implement the max pooling forward pass # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + cache = (x, pool_param) + return out, cache + + +def max_pool_backward_naive(dout, cache): + """ + A naive implementation of the backward pass for a max pooling layer. + + Inputs: + - dout: Upstream derivatives + - cache: A tuple of (x, pool_param) as in the forward pass. + + Returns: + - dx: Gradient with respect to x + """ + dx = None + ########################################################################### + # TODO: Implement the max pooling backward pass # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + return dx + + +def spatial_batchnorm_forward(x, gamma, beta, bn_param): + """ + Computes the forward pass for spatial batch normalization. + + Inputs: + - x: Input data of shape (N, C, H, W) + - gamma: Scale parameter, of shape (C,) + - beta: Shift parameter, of shape (C,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. momentum=0 means that + old information is discarded completely at every time step, while + momentum=1 means that new information is never incorporated. The + default of momentum=0.9 should work well in most situations. + - running_mean: Array of shape (C,) giving running mean of features + - running_var: Array of shape (C,) giving running variance of features + + Returns a tuple of: + - out: Output data, of shape (N, C, H, W) + - cache: Values needed for the backward pass + """ + out, cache = None, None + + ########################################################################### + # TODO: Implement the forward pass for spatial batch normalization. # + # # + # HINT: You can implement spatial batch normalization using the vanilla # + # version of batch normalization defined above. Your implementation should# + # be very short; ours is less than five lines. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return out, cache + + +def spatial_batchnorm_backward(dout, cache): + """ + Computes the backward pass for spatial batch normalization.
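Following the hint above, spatial batch normalization can reuse the vanilla (N, D) implementation by treating each of the N * H * W spatial positions as one sample with C features. The sketch below assumes batchnorm_forward and batchnorm_backward in this file have already been implemented; the _sketch names are ours.

import numpy as np
from cs231n.layers import batchnorm_forward, batchnorm_backward

def spatial_batchnorm_forward_sketch(x, gamma, beta, bn_param):
    N, C, H, W = x.shape
    # Move channels last and flatten so every spatial position becomes a row of C features.
    x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
    out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
    out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return out, cache

def spatial_batchnorm_backward_sketch(dout, cache):
    N, C, H, W = dout.shape
    dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
    dx_flat, dgamma, dbeta = batchnorm_backward(dout_flat, cache)
    dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return dx, dgamma, dbeta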
+ + Inputs: + - dout: Upstream derivatives, of shape (N, C, H, W) + - cache: Values from the forward pass + + Returns a tuple of: + - dx: Gradient with respect to inputs, of shape (N, C, H, W) + - dgamma: Gradient with respect to scale parameter, of shape (C,) + - dbeta: Gradient with respect to shift parameter, of shape (C,) + """ + dx, dgamma, dbeta = None, None, None + + ########################################################################### + # TODO: Implement the backward pass for spatial batch normalization. # + # # + # HINT: You can implement spatial batch normalization using the vanilla # + # version of batch normalization defined above. Your implementation should# + # be very short; ours is less than five lines. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return dx, dgamma, dbeta + + +def svm_loss(x, y): + """ + Computes the loss and gradient using for multiclass SVM classification. + + Inputs: + - x: Input data, of shape (N, C) where x[i, j] is the score for the jth + class for the ith input. + - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and + 0 <= y[i] < C + + Returns a tuple of: + - loss: Scalar giving the loss + - dx: Gradient of the loss with respect to x + """ + N = x.shape[0] + correct_class_scores = x[np.arange(N), y] + margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0) + margins[np.arange(N), y] = 0 + loss = np.sum(margins) / N + num_pos = np.sum(margins > 0, axis=1) + dx = np.zeros_like(x) + dx[margins > 0] = 1 + dx[np.arange(N), y] -= num_pos + dx /= N + return loss, dx + + +def softmax_loss(x, y): + """ + Computes the loss and gradient for softmax classification. + + Inputs: + - x: Input data, of shape (N, C) where x[i, j] is the score for the jth + class for the ith input. + - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and + 0 <= y[i] < C + + Returns a tuple of: + - loss: Scalar giving the loss + - dx: Gradient of the loss with respect to x + """ + shifted_logits = x - np.max(x, axis=1, keepdims=True) + Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True) + log_probs = shifted_logits - np.log(Z) + probs = np.exp(log_probs) + N = x.shape[0] + loss = -np.sum(log_probs[np.arange(N), y]) / N + dx = probs.copy() + dx[np.arange(N), y] -= 1 + dx /= N + return loss, dx diff --git a/cs231n/optim.py b/cs231n/optim.py new file mode 100644 index 0000000..84e13c2 --- /dev/null +++ b/cs231n/optim.py @@ -0,0 +1,144 @@ +import numpy as np + +""" +This file implements various first-order update rules that are commonly used +for training neural networks. Each update rule accepts current weights and the +gradient of the loss with respect to those weights and produces the next set of +weights. Each update rule has the same interface: + +def update(w, dw, config=None): + +Inputs: + - w: A numpy array giving the current weights. + - dw: A numpy array of the same shape as w giving the gradient of the + loss with respect to w. + - config: A dictionary containing hyperparameter values such as learning + rate, momentum, etc. If the update rule requires caching values over many + iterations, then config will also hold these cached values. + +Returns: + - next_w: The next point after the update. + - config: The config dictionary to be passed to the next iteration of the + update rule. 
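A quick sanity check for the two loss functions implemented at the end of cs231n/layers.py above: with near-zero random scores, the softmax loss should be close to log(C) and the SVM loss close to C - 1.

import numpy as np
from cs231n.layers import softmax_loss, svm_loss

np.random.seed(0)
N, C = 100, 10
x = 0.001 * np.random.randn(N, C)          # small random scores
y = np.random.randint(C, size=N)

loss_sm, _ = softmax_loss(x, y)
loss_svm, _ = svm_loss(x, y)
print('softmax loss: %.3f (expect ~%.3f)' % (loss_sm, np.log(C)))
print('svm loss: %.3f (expect ~%.1f)' % (loss_svm, float(C - 1)))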
+ +NOTE: For most update rules, the default learning rate will probably not +perform well; however the default values of the other hyperparameters should +work well for a variety of different problems. + +For efficiency, update rules may perform in-place updates, mutating w and +setting next_w equal to w. +""" + + +def sgd(w, dw, config=None): + """ + Performs vanilla stochastic gradient descent. + + config format: + - learning_rate: Scalar learning rate. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + + w -= config['learning_rate'] * dw + return w, config + + +def sgd_momentum(w, dw, config=None): + """ + Performs stochastic gradient descent with momentum. + + config format: + - learning_rate: Scalar learning rate. + - momentum: Scalar between 0 and 1 giving the momentum value. + Setting momentum = 0 reduces to sgd. + - velocity: A numpy array of the same shape as w and dw used to store a + moving average of the gradients. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + config.setdefault('momentum', 0.9) + v = config.get('velocity', np.zeros_like(w)) + + next_w = None + ########################################################################### + # TODO: Implement the momentum update formula. Store the updated value in # + # the next_w variable. You should also use and update the velocity v. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + config['velocity'] = v + + return next_w, config + + + +def rmsprop(x, dx, config=None): + """ + Uses the RMSProp update rule, which uses a moving average of squared + gradient values to set adaptive per-parameter learning rates. + + config format: + - learning_rate: Scalar learning rate. + - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared + gradient cache. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - cache: Moving average of second moments of gradients. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + config.setdefault('decay_rate', 0.99) + config.setdefault('epsilon', 1e-8) + config.setdefault('cache', np.zeros_like(x)) + + next_x = None + ########################################################################### + # TODO: Implement the RMSprop update formula, storing the next value of x # + # in the next_x variable. Don't forget to update cache value stored in # + # config['cache']. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return next_x, config + + +def adam(x, dx, config=None): + """ + Uses the Adam update rule, which incorporates moving averages of both the + gradient and its square and a bias correction term. + + config format: + - learning_rate: Scalar learning rate. + - beta1: Decay rate for moving average of first moment of gradient. + - beta2: Decay rate for moving average of second moment of gradient. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - m: Moving average of gradient. + - v: Moving average of squared gradient. + - t: Iteration number. 
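The momentum, RMSProp, and Adam TODOs in this file follow standard update formulas; the sketches below show the core updates using the same hyperparameter names as the docstrings. They are illustrations rather than the reference solutions (in particular, exactly where t is incremented is a detail the reference implementation may handle differently).

import numpy as np

def sgd_momentum_sketch(w, dw, v, learning_rate=1e-2, momentum=0.9):
    v = momentum * v - learning_rate * dw            # velocity update
    return w + v, v

def rmsprop_sketch(x, dx, cache, learning_rate=1e-2, decay_rate=0.99, epsilon=1e-8):
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    x = x - learning_rate * dx / (np.sqrt(cache) + epsilon)
    return x, cache

def adam_sketch(x, dx, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    t += 1
    m = beta1 * m + (1 - beta1) * dx                 # first moment
    v = beta2 * v + (1 - beta2) * dx ** 2            # second moment
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x, m, v, t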
+ """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-3) + config.setdefault('beta1', 0.9) + config.setdefault('beta2', 0.999) + config.setdefault('epsilon', 1e-8) + config.setdefault('m', np.zeros_like(x)) + config.setdefault('v', np.zeros_like(x)) + config.setdefault('t', 1) + + next_x = None + ########################################################################### + # TODO: Implement the Adam update formula, storing the next value of x in # + # the next_x variable. Don't forget to update the m, v, and t variables # + # stored in config. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return next_x, config diff --git a/cs231n/setup.py b/cs231n/setup.py new file mode 100644 index 0000000..9a2e6ca --- /dev/null +++ b/cs231n/setup.py @@ -0,0 +1,14 @@ +from distutils.core import setup +from distutils.extension import Extension +from Cython.Build import cythonize +import numpy + +extensions = [ + Extension('im2col_cython', ['im2col_cython.pyx'], + include_dirs = [numpy.get_include()] + ), +] + +setup( + ext_modules = cythonize(extensions), +) diff --git a/cs231n/solver.py b/cs231n/solver.py new file mode 100644 index 0000000..6abe92c --- /dev/null +++ b/cs231n/solver.py @@ -0,0 +1,306 @@ +from __future__ import print_function, division +from future import standard_library +standard_library.install_aliases() +from builtins import range +from builtins import object +import os +import pickle as pickle + +import numpy as np + +from cs231n import optim + + +class Solver(object): + """ + A Solver encapsulates all the logic necessary for training classification + models. The Solver performs stochastic gradient descent using different + update rules defined in optim.py. + + The solver accepts both training and validataion data and labels so it can + periodically check classification accuracy on both training and validation + data to watch out for overfitting. + + To train a model, you will first construct a Solver instance, passing the + model, dataset, and various optoins (learning rate, batch size, etc) to the + constructor. You will then call the train() method to run the optimization + procedure and train the model. + + After the train() method returns, model.params will contain the parameters + that performed best on the validation set over the course of training. + In addition, the instance variable solver.loss_history will contain a list + of all losses encountered during training and the instance variables + solver.train_acc_history and solver.val_acc_history will be lists of the + accuracies of the model on the training and validation set at each epoch. + + Example usage might look something like this: + + data = { + 'X_train': # training data + 'y_train': # training labels + 'X_val': # validation data + 'y_val': # validation labels + } + model = MyAwesomeModel(hidden_size=100, reg=10) + solver = Solver(model, data, + update_rule='sgd', + optim_config={ + 'learning_rate': 1e-3, + }, + lr_decay=0.95, + num_epochs=10, batch_size=100, + print_every=100) + solver.train() + + + A Solver works on a model object that must conform to the following API: + + - model.params must be a dictionary mapping string parameter names to numpy + arrays containing parameter values. 
+ + - model.loss(X, y) must be a function that computes training-time loss and + gradients, and test-time classification scores, with the following inputs + and outputs: + + Inputs: + - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,) giving labels for X where y[i] is the + label for X[i]. + + Returns: + If y is None, run a test-time forward pass and return: + - scores: Array of shape (N, C) giving classification scores for X where + scores[i, c] gives the score of class c for X[i]. + + If y is not None, run a training time forward and backward pass and + return a tuple of: + - loss: Scalar giving the loss + - grads: Dictionary with the same keys as self.params mapping parameter + names to gradients of the loss with respect to those parameters. + """ + + def __init__(self, model, data, **kwargs): + """ + Construct a new Solver instance. + + Required arguments: + - model: A model object conforming to the API described above + - data: A dictionary of training and validation data containing: + 'X_train': Array, shape (N_train, d_1, ..., d_k) of training images + 'X_val': Array, shape (N_val, d_1, ..., d_k) of validation images + 'y_train': Array, shape (N_train,) of labels for training images + 'y_val': Array, shape (N_val,) of labels for validation images + + Optional arguments: + - update_rule: A string giving the name of an update rule in optim.py. + Default is 'sgd'. + - optim_config: A dictionary containing hyperparameters that will be + passed to the chosen update rule. Each update rule requires different + hyperparameters (see optim.py) but all update rules require a + 'learning_rate' parameter so that should always be present. + - lr_decay: A scalar for learning rate decay; after each epoch the + learning rate is multiplied by this value. + - batch_size: Size of minibatches used to compute loss and gradient + during training. + - num_epochs: The number of epochs to run for during training. + - print_every: Integer; training losses will be printed every + print_every iterations. + - verbose: Boolean; if set to false then no output will be printed + during training. + - num_train_samples: Number of training samples used to check training + accuracy; default is 1000; set to None to use entire training set. + - num_val_samples: Number of validation samples to use to check val + accuracy; default is None, which uses the entire validation set. + - checkpoint_name: If not None, then save model checkpoints here every + epoch. 
+ """ + self.model = model + self.X_train = data['X_train'] + self.y_train = data['y_train'] + self.X_val = data['X_val'] + self.y_val = data['y_val'] + + # Unpack keyword arguments + self.update_rule = kwargs.pop('update_rule', 'sgd') + self.optim_config = kwargs.pop('optim_config', {}) + self.lr_decay = kwargs.pop('lr_decay', 1.0) + self.batch_size = kwargs.pop('batch_size', 100) + self.num_epochs = kwargs.pop('num_epochs', 10) + self.num_train_samples = kwargs.pop('num_train_samples', 1000) + self.num_val_samples = kwargs.pop('num_val_samples', None) + + self.checkpoint_name = kwargs.pop('checkpoint_name', None) + self.print_every = kwargs.pop('print_every', 10) + self.verbose = kwargs.pop('verbose', True) + + # Throw an error if there are extra keyword arguments + if len(kwargs) > 0: + extra = ', '.join('"%s"' % k for k in list(kwargs.keys())) + raise ValueError('Unrecognized arguments %s' % extra) + + # Make sure the update rule exists, then replace the string + # name with the actual function + if not hasattr(optim, self.update_rule): + raise ValueError('Invalid update_rule "%s"' % self.update_rule) + self.update_rule = getattr(optim, self.update_rule) + + self._reset() + + + def _reset(self): + """ + Set up some book-keeping variables for optimization. Don't call this + manually. + """ + # Set up some variables for book-keeping + self.epoch = 0 + self.best_val_acc = 0 + self.best_params = {} + self.loss_history = [] + self.train_acc_history = [] + self.val_acc_history = [] + + # Make a deep copy of the optim_config for each parameter + self.optim_configs = {} + for p in self.model.params: + d = {k: v for k, v in self.optim_config.items()} + self.optim_configs[p] = d + + + def _step(self): + """ + Make a single gradient update. This is called by train() and should not + be called manually. + """ + # Make a minibatch of training data + num_train = self.X_train.shape[0] + batch_mask = np.random.choice(num_train, self.batch_size) + X_batch = self.X_train[batch_mask] + y_batch = self.y_train[batch_mask] + + # Compute loss and gradient + loss, grads = self.model.loss(X_batch, y_batch) + self.loss_history.append(loss) + + # Perform a parameter update + for p, w in self.model.params.items(): + dw = grads[p] + config = self.optim_configs[p] + next_w, next_config = self.update_rule(w, dw, config) + self.model.params[p] = next_w + self.optim_configs[p] = next_config + + + def _save_checkpoint(self): + if self.checkpoint_name is None: return + checkpoint = { + 'model': self.model, + 'update_rule': self.update_rule, + 'lr_decay': self.lr_decay, + 'optim_config': self.optim_config, + 'batch_size': self.batch_size, + 'num_train_samples': self.num_train_samples, + 'num_val_samples': self.num_val_samples, + 'epoch': self.epoch, + 'loss_history': self.loss_history, + 'train_acc_history': self.train_acc_history, + 'val_acc_history': self.val_acc_history, + } + filename = '%s_epoch_%d.pkl' % (self.checkpoint_name, self.epoch) + if self.verbose: + print('Saving checkpoint to "%s"' % filename) + with open(filename, 'wb') as f: + pickle.dump(checkpoint, f) + + + def check_accuracy(self, X, y, num_samples=None, batch_size=100): + """ + Check accuracy of the model on the provided data. + + Inputs: + - X: Array of data, of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,) + - num_samples: If not None, subsample the data and only test the model + on num_samples datapoints. + - batch_size: Split X and y into batches of this size to avoid using + too much memory. 
+ + Returns: + - acc: Scalar giving the fraction of instances that were correctly + classified by the model. + """ + + # Maybe subsample the data + N = X.shape[0] + if num_samples is not None and N > num_samples: + mask = np.random.choice(N, num_samples) + N = num_samples + X = X[mask] + y = y[mask] + + # Compute predictions in batches + num_batches = N // batch_size + if N % batch_size != 0: + num_batches += 1 + y_pred = [] + for i in range(num_batches): + start = i * batch_size + end = (i + 1) * batch_size + scores = self.model.loss(X[start:end]) + y_pred.append(np.argmax(scores, axis=1)) + y_pred = np.hstack(y_pred) + acc = np.mean(y_pred == y) + + return acc + + + def train(self): + """ + Run optimization to train the model. + """ + num_train = self.X_train.shape[0] + iterations_per_epoch = max(num_train // self.batch_size, 1) + num_iterations = self.num_epochs * iterations_per_epoch + + for t in range(num_iterations): + self._step() + + # Maybe print training loss + if self.verbose and t % self.print_every == 0: + print('(Iteration %d / %d) loss: %f' % ( + t + 1, num_iterations, self.loss_history[-1])) + + # At the end of every epoch, increment the epoch counter and decay + # the learning rate. + epoch_end = (t + 1) % iterations_per_epoch == 0 + if epoch_end: + self.epoch += 1 + for k in self.optim_configs: + self.optim_configs[k]['learning_rate'] *= self.lr_decay + + # Check train and val accuracy on the first iteration, the last + # iteration, and at the end of each epoch. + first_it = (t == 0) + last_it = (t == num_iterations - 1) + if first_it or last_it or epoch_end: + train_acc = self.check_accuracy(self.X_train, self.y_train, + num_samples=self.num_train_samples) + val_acc = self.check_accuracy(self.X_val, self.y_val, + num_samples=self.num_val_samples) + self.train_acc_history.append(train_acc) + self.val_acc_history.append(val_acc) + self._save_checkpoint() + + if self.verbose: + print('(Epoch %d / %d) train acc: %f; val_acc: %f' % ( + self.epoch, self.num_epochs, train_acc, val_acc)) + + # Keep track of the best model + if val_acc > self.best_val_acc: + self.best_val_acc = val_acc + self.best_params = {} + for k, v in self.model.params.items(): + self.best_params[k] = v.copy() + + # At the end of training swap the best params into the model + self.model.params = self.best_params diff --git a/cs231n/vis_utils.py b/cs231n/vis_utils.py new file mode 100644 index 0000000..0aa42c0 --- /dev/null +++ b/cs231n/vis_utils.py @@ -0,0 +1,73 @@ +from builtins import range +from past.builtins import xrange + +from math import sqrt, ceil +import numpy as np + +def visualize_grid(Xs, ubound=255.0, padding=1): + """ + Reshape a 4D tensor of image data to a grid for easy visualization. 
+ + Inputs: + - Xs: Data of shape (N, H, W, C) + - ubound: Output grid will have values scaled to the range [0, ubound] + - padding: The number of blank pixels between elements of the grid + """ + (N, H, W, C) = Xs.shape + grid_size = int(ceil(sqrt(N))) + grid_height = H * grid_size + padding * (grid_size - 1) + grid_width = W * grid_size + padding * (grid_size - 1) + grid = np.zeros((grid_height, grid_width, C)) + next_idx = 0 + y0, y1 = 0, H + for y in range(grid_size): + x0, x1 = 0, W + for x in range(grid_size): + if next_idx < N: + img = Xs[next_idx] + low, high = np.min(img), np.max(img) + grid[y0:y1, x0:x1] = ubound * (img - low) / (high - low) + # grid[y0:y1, x0:x1] = Xs[next_idx] + next_idx += 1 + x0 += W + padding + x1 += W + padding + y0 += H + padding + y1 += H + padding + # grid_max = np.max(grid) + # grid_min = np.min(grid) + # grid = ubound * (grid - grid_min) / (grid_max - grid_min) + return grid + +def vis_grid(Xs): + """ visualize a grid of images """ + (N, H, W, C) = Xs.shape + A = int(ceil(sqrt(N))) + G = np.ones((A*H+A, A*W+A, C), Xs.dtype) + G *= np.min(Xs) + n = 0 + for y in range(A): + for x in range(A): + if n < N: + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = Xs[n,:,:,:] + n += 1 + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + +def vis_nn(rows): + """ visualize array of arrays of images """ + N = len(rows) + D = len(rows[0]) + H,W,C = rows[0][0].shape + Xs = rows[0][0] + G = np.ones((N*H+N, D*W+D, C), Xs.dtype) + for y in range(N): + for x in range(D): + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = rows[y][x] + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G diff --git a/kitten.jpg b/kitten.jpg new file mode 100644 index 0000000..e421ec1 Binary files /dev/null and b/kitten.jpg differ diff --git a/puppy.jpg b/puppy.jpg new file mode 100644 index 0000000..3cc1234 Binary files /dev/null and b/puppy.jpg differ
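A typical use of visualize_grid is to look at a network's first-layer convolutional filters. In the sketch below the weight array is random and merely stands in for something like a trained model's params['W1'] (hypothetical name) with shape (F, C, HH, WW).

import numpy as np
import matplotlib.pyplot as plt
from cs231n.vis_utils import visualize_grid

W1 = np.random.randn(32, 3, 7, 7)                            # stand-in first-layer weights
grid = visualize_grid(W1.transpose(0, 2, 3, 1), padding=3)   # visualize_grid expects (N, H, W, C)
plt.imshow(grid.astype('uint8'))
plt.axis('off')
plt.show()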