{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T07:42:04.260313Z",
     "start_time": "2020-05-13T07:42:04.256681Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from matplotlib.offsetbox import OffsetImage, AnnotationBbox\n",
    "from matplotlib.lines import Line2D"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First download some image files from zalando's github!  \n",
    "\n",
    "https://github.com/zalandoresearch/fashion-mnist  \n",
    "\n",
    "I suggest you try and do it with python! There is a small snippet below to get you started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T07:42:04.822947Z",
     "start_time": "2020-05-13T07:42:04.788527Z"
    }
   },
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "url = 'http://i3.ytimg.com/vi/J---aiyznGQ/mqdefault.jpg'\n",
    "urllib.request.urlretrieve(url, './cat.jpg'); # You should see the file appear in the same folder as this script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T07:42:05.291509Z",
     "start_time": "2020-05-13T07:42:05.288830Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here\n",
    "\n",
    "# url_images = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz'\n",
    "# url_labels = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you have the zipped (gz) files of the images and labels, let unpack and load them. Use the loader below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T07:42:09.484645Z",
     "start_time": "2020-05-13T07:42:09.151652Z"
    }
   },
   "outputs": [],
   "source": [
    "# Fashion MNIST reader\n",
    "\n",
    "def load_mnist(labels_path, images_path):\n",
    "    '''Loads the gz mnist pictures and labels from their path.  \n",
    "    ./  is the same folder you are running your script from\n",
    "    \n",
    "    Example: load_mnist('./images.gz', './labels.gz')\n",
    "    '''\n",
    "    import os\n",
    "    import gzip\n",
    "    import numpy as np\n",
    "\n",
    "    \"\"\"Load MNIST data from `paths`\"\"\"\n",
    "\n",
    "    with gzip.open(labels_path, 'rb') as lbpath:\n",
    "        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,\n",
    "                               offset=8)\n",
    "\n",
    "    with gzip.open(images_path, 'rb') as imgpath:\n",
    "        images = np.frombuffer(imgpath.read(), dtype=np.uint8,\n",
    "                               offset=16).reshape(len(labels), 784)\n",
    "\n",
    "    return images, labels\n",
    "\n",
    "\n",
    "img, label = load_mnist('./label.gz', './image.gz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 60000 clothing images are beautiful, if somewhat low resolution (28x28 pixels)  \n",
    "Each image is already row (a flat array) in the img matrix, which is usefull for the PCA and T-SNE.   \n",
    "  \n",
    "See if you can use numpy (np.reshape()) to convert one image back into 28x28 pixels and plot it with matplotlib (plt.imshow(), remember cmap='gray_r')."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T07:43:20.655362Z",
     "start_time": "2020-05-13T07:43:20.652485Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each of the 60000 rows (images) in the img matrix has a corresponding label in the label array. These labels indicate which kind of clothing it is.  \n",
    "Can you figure out what kind of clothing item is which label?  \n",
    "For example label 9 = shoe?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Subsample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "60000 datapoints are a lot. Lets subsample. Take out only 1000 random images with their corresponding labels to play with for now. It will speed things up.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should only use these 1000 in the code below. You can always scale up later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PCA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First lets try and do a PCA of the images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T07:46:04.222887Z",
     "start_time": "2020-05-13T07:46:03.702885Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PCA done! Time elapsed: 4.267692565917969e-05 seconds\n"
     ]
    }
   ],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "import time\n",
    "time_start = time.time()\n",
    "\n",
    "# Your code here!\n",
    "\n",
    "print('PCA done! Time elapsed: {} seconds'.format(time.time()-time_start))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the first two principle componants. Preferably with colors for each label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T08:23:24.657752Z",
     "start_time": "2020-05-13T08:23:24.654732Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# T-SNE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "use can always use https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html  \n",
    "  \n",
    "  \n",
    "or if you can install it for parallel https://github.com/DmitryUlyanov/Multicore-TSNE  \n",
    "or even better GPU accelerated with nvidia GPU https://github.com/CannyLab/tsne-cuda  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "t-SNE done! Time elapsed: 5.698204040527344e-05 seconds\n"
     ]
    }
   ],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "import time\n",
    "\n",
    "time_start = time.time()\n",
    "\n",
    "# You T-SNE code here\n",
    "\n",
    "print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the 2 T-SNE dimensions. Preferably with colors for each label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T08:24:11.631384Z",
     "start_time": "2020-05-13T08:24:11.628679Z"
    }
   },
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Try to use the PCA result to perform the T-SNE on instead of the raw images"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TSNE().fit_transform(pca_result) instead of TSNE().fit_transform(images)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why might you want to do this?  \n",
    "What is the difference in pca_result compared to the raw images?  \n",
    "What is the difference in the resulting T-SNE?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lets try and actually plot the images instead of points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def imscatter(x, y, image, ax=None, zoom=1):\n",
    "    '''Function that plots an image at a given coordinate. The image must be a matrix not a flat array.'''\n",
    "    if ax is None:\n",
    "        ax = plt.gca()\n",
    "    im = OffsetImage(image, zoom=zoom)\n",
    "    x, y = np.atleast_1d(x, y)\n",
    "    artists = []\n",
    "    for x0, y0 in zip(x, y):\n",
    "        ab = AnnotationBbox(im, (x0, y0), xycoords='data', frameon=False)\n",
    "        artists.append(ax.add_artist(ab))\n",
    "    ax.update_datalim(np.column_stack([x, y]))\n",
    "    ax.autoscale()\n",
    "    return artists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-05-13T08:25:09.842158Z",
     "start_time": "2020-05-13T08:25:09.839787Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}