{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A larger example\n", "\n", "Here we will use a dataset on Italian wine. The dataset is taken from\n", "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/\n", "\n", "The actual data is in the file wine.data, and a description of the data can be found in wine.names.\n", "\n", "If we look at the beginning of the data file we see:\n", "`1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065\n", "1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050\n", "1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185\n", "1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480\n", "1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735\n", "1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450\n", "1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290\n", "1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295\n", "1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045\n", "1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045\n", "1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510\n", "1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280\n", "1,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320\n", "1,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,1150\n", "1,14.38,1.87,2.38,12,102,3.3,3.64,.29,2.96,7.5,1.2,3,1547\n", "1,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,1310\n", "1,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,1280`\n", "\n", "First comes the class of the wine, followed by the measurements, which are (taken from wine.names):\n", "1) Alcohol\n", "2) Malic acid\n", "3) Ash\n", "4) Alcalinity of ash \n", "5) Magnesium\n", "6) Total phenols\n", "7) Flavanoids\n", "8) Nonflavanoid phenols\n", "9) Proanthocyanins\n", "10) Color intensity\n", "11) Hue\n", "12) OD280/OD315 of diluted wines\n", "13) Proline \n", " \n", " \n", "The data is clearly a simple CSV file, 
thus we start by reading the data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import csv\n", "import numpy\n", "\n", "# Read the CSV rows and convert every field to a floating-point number\n", "with open('wine.data') as input_file:\n", " raw_data = numpy.array([row for row in csv.reader(input_file)]).astype(float)\n", "\n", "labels = raw_data[:, 0]\n", "data = raw_data[:, 1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the data are not in the simple [0, 1] range, so we normalize each column by dividing it by its maximum value." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "_, num_c = data.shape\n", "for i in range(num_c):\n", " data[:, i] = data[:, i] / numpy.max(data[:, i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know that there are 13 columns in data, but at this point we may as well make our distance measure independent of the number of dimensions." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def all_distances(point, db):\n", " result = []\n", " for entry in db:\n", " distance = 0.0\n", " for a, b in zip(point, entry):\n", " distance += (a - b)**2\n", " result.append(numpy.sqrt(distance))\n", " return numpy.array(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can reuse the simple voting mechanism: the k nearest neighbors each cast a vote for their own class, and the most common class wins." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import collections\n", "def classify(point, k=5):\n", " distances = all_distances(point, data)\n", " votes = []\n", " for _ in range(k):\n", " winner = numpy.argmin(distances)\n", " votes.append(labels[winner])\n", " distances[winner] = numpy.inf # exclude this neighbor from later rounds\n", " return collections.Counter(votes).most_common(1)[0][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can test the result against the database itself." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matched 176 of 178\n" ] } ], "source": [ "score = 0\n", "for point in raw_data:\n", " if point[0] == classify(point[1:], 6):\n", " score += 1\n", "print('Matched', score, 'of', len(raw_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is quite satisfactory. However, since we are matching against the database itself, each tested point is also present in the database, which gives it an unfair advantage compared to a real-world scenario. Eliminating this bias is left as an exercise; it is quite simple, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should also experiment with different values of k, to find the number of neighbors that classifies best." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }