{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classifying cancer from 32 parameters\n", "\n", "Data is taken from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29\n", "\n", "We simply read all the data, drop the patient ID, and set the labels aside in a string of their own." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import csv\n", "import numpy\n", "\n", "with open('wdbc.data') as input_file:\n", "    text_data = [row for row in csv.reader(input_file, delimiter=',')]\n", "for line in text_data:\n", "    _ = line.pop(0)  # Remove the patient ID; we do not need it\n", "\n", "# Pop the diagnosis label (M or B) off each row and keep the labels as one comma-separated string\n", "known_labels = ','.join([line.pop(0) for line in text_data])\n", "raw_data = numpy.array(text_data).astype(float)  # numpy.float was removed in NumPy 1.24\n", "# Scale each feature by its column maximum\n", "data = raw_data / numpy.max(raw_data, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can write a generic clustering mechanism, similar to the previous small example." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def all_dist(observation, data):\n", "    # Euclidean distance from one observation to every row of data.\n", "    # NOTE: as written, only the first two feature columns enter the distance.\n", "    return numpy.sqrt((data[:, 0] - observation[0])**2 + (data[:, 1] - observation[1])**2)\n", "\n", "def cluster(data, k):\n", "    samples, _ = data.shape\n", "    # Start from k randomly chosen observations\n", "    centroids = numpy.array([data[numpy.random.randint(samples), :] for _ in range(k)])\n", "    done = False\n", "    while not done:\n", "        distances = numpy.empty((k, samples))\n", "        for d in range(k):\n", "            distances[d, :] = all_dist(centroids[d], data)\n", "        # Assign every observation to its nearest centroid\n", "        winners = numpy.argmin(distances, axis=0)\n", "        clusters = [data[winners == i, :] for i in range(k)]\n", "        prev_centroids = centroids\n", "        centroids = numpy.array([numpy.average(c, axis=0) for c in clusters])\n", "        # Stop once no centroid moved; summing signed differences could cancel out to zero\n", "        if numpy.array_equal(prev_centroids, centroids):\n", "            done = True\n", "    return winners" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can find the clusters; since we have only two categories, it's rather fast. We cannot know whether cluster 0 is malignant or benign, so we assume that the smaller cluster is malignant and relabel accordingly. Then we can compare the classification of each patient against the known label and check how well we did." 
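, "\n", "\n", "As a rough sketch of that comparison (the symbols $N$, $y_i$ and $\\hat{y}_i$ are our own notation, not names from the code): with $N$ patients, known labels $y_i$ and relabelled cluster assignments $\\hat{y}_i$, the fraction classified correctly is\n", "\n", "$$\\text{accuracy} = \\frac{1}{N} \\sum_{i=1}^{N} [\\hat{y}_i = y_i]$$\n", "\n", "where the bracket is 1 when the two labels agree and 0 otherwise. For the run recorded below (66 wrong, 503 right) this gives $503 / 569 \\approx 0.88$."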
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(array([ 66, 503]), '(Wrong, Right)')\n" ] } ], "source": [ "clusters = cluster(data, 2)\n", "a, b = numpy.bincount(clusters)\n", "labels = known_labels+''\n", "if a0]\n", " k = len(clusters)\n", " centroids = numpy.array([numpy.average(c, axis = 0) for c in clusters])\n", " if len(prev_centroids) == len(centroids):\n", " if numpy.sum(prev_centroids-centroids) == 0:\n", " done=True\n", " return winners, centroids\n", "\n", "target_k = 2\n", "n_centroids = 25\n", "centroids = []\n", "while n_centroids > target_k:\n", " clusters, centroids = cluster(data, n_centroids, centroids)\n", " if ( n_centroids > target_k ) and ( len(centroids) == n_centroids ):\n", " centroid_dist = numpy.sum(numpy.sqrt((centroids[:, numpy.newaxis, :]-centroids)**2), axis =2)\n", " centroid_dist[centroid_dist==0] = 1000.0\n", " centroids = list(centroids)\n", " minpos = numpy.argmin(centroid_dist)\n", " point0, point1 = centroids.pop(minpos//n_centroids), centroids.pop((minpos%n_centroids)-1) #-1 because we pop\n", " centroids.append((point0 + point1)/2)\n", " n_centroids -= 1\n", " else:\n", " n_centroids = len(centroids)\n", "clusters, centroids = cluster(data, n_centroids, centroids) #We have the number of required centroids now\n", "a, b = numpy.bincount(clusters)\n", "labels = known_labels+''\n", "if a