{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A larger example\n", "\n", "Here we will use a dataset on Italian wine. The dataset is taken from\n", "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/\n", "\n", "The actual data is in the file wine.data, and a description of the data can be found in wine.names.\n", "\n", "If we look at the beginning of the data file we see:\n", "`1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065\n", "1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050\n", "1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185\n", "1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480\n", "1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735\n", "1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450\n", "1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290\n", "1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295\n", "1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045\n", "1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045\n", "1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510\n", "1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280\n", "1,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320\n", "1,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,1150\n", "1,14.38,1.87,2.38,12,102,3.3,3.64,.29,2.96,7.5,1.2,3,1547\n", "1,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,1310\n", "1,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,1280`\n", "\n", "First comes the class of the wine, followed by the measurements, which are (taken from wine.names):\n", "1) Alcohol\n", "2) Malic acid\n", "3) Ash\n", "4) Alcalinity of ash \n", "5) Magnesium\n", "6) Total phenols\n", "7) Flavanoids\n", "8) Nonflavanoid phenols\n", "9) Proanthocyanins\n", "10) Color intensity\n", "11) Hue\n", "12) OD280/OD315 of diluted wines\n", "13) Proline \n", " \n", " \n", "The data is clearly a simple CSV file, 
thus we start by reading the data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import csv\n", "import numpy\n", "\n", "# Read the CSV rows and convert every field to a floating-point number\n", "with open('wine.data') as input_file:\n", " raw_data = numpy.array([row for row in csv.reader(input_file)]).astype(float)\n", "\n", "labels = raw_data[:, 0]\n", "data = raw_data[:, 1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the data are not in the simple [0, 1] range, so we normalize each column by dividing it by its maximum value." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "_, num_c = data.shape\n", "for i in range(num_c):\n", " data[:, i] = data[:, i] / numpy.max(data[:, i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know that there are 13 columns in data, but at this point we may as well make our distance measure independent of the number of dimensions." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def all_distances(point, db):\n", " result = []\n", " for entry in db:\n", " distance = 0.0\n", " for a, b in zip(point, entry):\n", " distance += (a - b)**2\n", " result.append(numpy.sqrt(distance))\n", " return numpy.array(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can reuse the simple voting mechanism: the k nearest neighbors each cast a vote for their own class, and the most common class wins." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import collections\n", "def classify(point, k=5):\n", " distances = all_distances(point, data)\n", " votes = []\n", " for _ in range(k):\n", " winner = numpy.argmin(distances)\n", " votes.append(labels[winner])\n", " distances[winner] = numpy.inf # exclude this neighbor from later rounds\n", " return collections.Counter(votes).most_common(1)[0][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can test the result against the database itself." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matched 176 of 178\n" ] } ], "source": [ "score = 0\n", "for point in raw_data:\n", " if point[0] == classify(point[1:], 6):\n", " score += 1\n", "print('Matched', score, 'of', len(raw_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is quite satisfactory. However, since we are matching against the database itself, each tested point is also present in the database, which gives it an unfair advantage compared to a real-world scenario. Eliminating this bias is left as an exercise; it is quite simple, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should also experiment with different values of k, to find the number of neighbors that classifies best." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }