{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifying cancer from 32 parameters\n",
    "\n",
    "Data is taken from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29\n",
    "\n",
    "We simply read all the data, drop the patient ID and place the label into an array of it's own. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import numpy\n",
    "\n",
    "with open('wdbc.data') as input_file:\n",
    "    text_data = [row for row in csv.reader(input_file, delimiter=',')]\n",
    "for line in text_data:\n",
    "    _ = line.pop(0) #We remove the ID - no need for it\n",
    "\n",
    "known_labels = ','.join([line.pop(0) for line in text_data])\n",
    "raw_data = numpy.array(text_data).astype(numpy.float)\n",
    "data   = raw_data / numpy.max(raw_data, axis = 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can write a generic clustering mechanism, similar to the small previous example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def all_dist(observation, data):\n",
    "    return numpy.sqrt((data[:, 0] - observation[0])**2 + (data[:, 1] - observation[1])**2)\n",
    "\n",
    "def cluster(data, k):\n",
    "    samples, _= data.shape\n",
    "    centroids = numpy.array([data[numpy.random.randint(samples), :,] for _ in range(k)])\n",
    "    done = False\n",
    "    while not done:\n",
    "        distances = numpy.empty((k,samples))\n",
    "        for d in range(k):\n",
    "            distances[d, :] = all_dist(centroids[d], data)\n",
    "        winners = numpy.argmin(distances, axis = 0)\n",
    "        clusters = [data[winners == i, :] for i in range(k)]\n",
    "        prev_centroids = centroids\n",
    "        centroids = numpy.array([numpy.average(c, axis = 0) for c in clusters])\n",
    "        if numpy.sum(prev_centroids-centroids) == 0:\n",
    "            done=True\n",
    "    return winners"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can find the clusters, since we have only two categories its rather fast. We cannot know if category 0 is malign or benign, but have to assume that the smaller category is malign. We thus change the labels to that assumption. Then we can easily compare the classifications of each patient and check who well we did."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(array([ 66, 503]), '(Wrong, Right)')\n"
     ]
    }
   ],
   "source": [
    "clusters = cluster(data, 2)\n",
    "a, b = numpy.bincount(clusters)\n",
    "labels = known_labels+''\n",
    "if a<b:\n",
    "    labels = labels.replace('M','0')\n",
    "    labels = labels.replace('B','1')\n",
    "else:\n",
    "    labels = labels.replace('M','1')\n",
    "    labels = labels.replace('B','0')\n",
    "compare = (numpy.equal(clusters, numpy.array(labels.split(',')).astype(numpy.int)))\n",
    "print(numpy.bincount(compare),'(Wrong, Right)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run it a few times and realize that success differ extremely. Several approaches can be tried to remedy this.\n",
    "\n",
    "Try and simply remove one or more dimensions to see if they are merely in the way (really: do a PCA but QaD tests are ok as well).\n",
    "\n",
    "Try and change the distance metric for individual dimensions, so rather than simply include or not at in the first appraoch, we can tune the importance of a parameter.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(array([ 66, 503]), '(Wrong, Right)')\n"
     ]
    }
   ],
   "source": [
    "def cluster(data, k, centroids = []):\n",
    "    samples, _= data.shape\n",
    "    if centroids == []:\n",
    "        centroids = numpy.array([data[numpy.random.randint(samples), :,] for _ in range(k)])\n",
    "    done = False\n",
    "    while not done:\n",
    "        distances = numpy.empty((k,samples))\n",
    "        for d in range(k):\n",
    "            distances[d, :] = all_dist(centroids[d], data)\n",
    "        winners = numpy.argmin(distances, axis = 0)\n",
    "        clusters = [data[winners == i, :] for i in range(k)]\n",
    "        prev_centroids = centroids\n",
    "        clusters = [c for c in clusters if len(c)>0]\n",
    "        k = len(clusters)\n",
    "        centroids = numpy.array([numpy.average(c, axis = 0) for c in clusters])\n",
    "        if len(prev_centroids) == len(centroids):\n",
    "            if numpy.sum(prev_centroids-centroids) == 0:\n",
    "                done=True\n",
    "    return winners, centroids\n",
    "\n",
    "target_k = 2\n",
    "n_centroids = 25\n",
    "centroids = []\n",
    "while n_centroids > target_k:\n",
    "    clusters, centroids = cluster(data, n_centroids, centroids)\n",
    "    if ( n_centroids > target_k ) and ( len(centroids) == n_centroids ):\n",
    "        centroid_dist = numpy.sum(numpy.sqrt((centroids[:, numpy.newaxis, :]-centroids)**2), axis =2)\n",
    "        centroid_dist[centroid_dist==0] = 1000.0\n",
    "        centroids = list(centroids)\n",
    "        minpos = numpy.argmin(centroid_dist)\n",
    "        point0, point1 = centroids.pop(minpos//n_centroids), centroids.pop((minpos%n_centroids)-1) #-1 because we pop\n",
    "        centroids.append((point0 + point1)/2)\n",
    "        n_centroids -= 1\n",
    "    else:\n",
    "        n_centroids = len(centroids)\n",
    "clusters, centroids = cluster(data, n_centroids, centroids) #We have the number of required centroids now\n",
    "a, b = numpy.bincount(clusters)\n",
    "labels = known_labels+''\n",
    "if a<b:\n",
    "    labels = labels.replace('M','0')\n",
    "    labels = labels.replace('B','1')\n",
    "else:\n",
    "    labels = labels.replace('M','1')\n",
    "    labels = labels.replace('B','0')\n",
    "compare = (numpy.equal(clusters, numpy.array(labels.split(',')).astype(numpy.int)))\n",
    "print(numpy.bincount(compare),'(Wrong, Right)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "Note to self - try with many more clusters, and after convergence, \n",
    "fuse the two clusters that are closest to one and repeat training. \n",
    "Repeat until the desired number of clusters are found.\n",
    "\n",
    "Fusing: simple mean, weighted mean or most discriminating (one furthest away from other centroids)\n",
    "***"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}