// ----------------------------------------------------------------------------------- //
/*
   ROOT macro for analysing the famous Anderson-Fisher Iris data set from the Gaspe Peninsula.

   Edgar Anderson took 50 samples of each of three species (Setosa, Versicolor, and Virginica)
   of Iris at the Gaspe Peninsula, and measured four distinguishing features on each flower:
     Sepal Length
     Sepal Width
     Petal Length
     Petal Width

   Using these measurements, Ronald Fisher was able to make a classification scheme, for
   which he invented the Fisher Linear Discriminant.

   References:
     Glen Cowan, Statistical Data Analysis, pages 51-57
     http://en.wikipedia.org/wiki/Iris_flower_data_set
     http://en.wikipedia.org/wiki/Linear_discriminant_analysis

   Author: Troels C. Petersen (NBI)
   Email:  petersen@nbi.dk
   Date:   7th of October 2012
*/
// ----------------------------------------------------------------------------------- //

// ----------------------------------------------------------------------------------- //
void FisherDiscriminant() {
// ----------------------------------------------------------------------------------- //
  gROOT->Reset();

  // Set the showing of statistics and fitting results (0 means off, 1111 means all on):
  gStyle->SetOptStat(1111);
  // gStyle->SetOptStat(0);
  gStyle->SetOptFit(1111);
  // gStyle->SetOptFit(0);

  // Position of the statistics and fitting results box:
  // gStyle->SetStatX(0.52);         // Top left corner.
  // gStyle->SetStatY(0.86);
  // gStyle->SetStatX(0.89);         // Bottom right corner.
  // gStyle->SetStatY(0.33);

  // Set the graphics:
  gStyle->SetStatBorderSize(1);
  gStyle->SetStatFontSize(0.055);
  gStyle->SetCanvasColor(4);
  gStyle->SetPalette(1);

  // ------------------------------------------------------------------ //
  // Read data from file:
  // ------------------------------------------------------------------ //

  // Open data file (and stop if it cannot be opened):
  FILE *data = fopen("DataSet_AndersonFisherIris.txt","r");
  if (data == NULL) {
    printf(" Error: Could not open file DataSet_AndersonFisherIris.txt \n");
    return;
  }

  // The information to be read is:
  // Sepal Length, Sepal Width, Petal Length, Petal Width, and Species.
  const int Nvar = 4;
  const int Nmax = 200;
  int n = 0;
  double x[Nvar][Nmax];
  int Species[Nmax];

  // Loop over and read data as long as a complete line (five values) can be read
  // and there is still room in the arrays:
  while (n < Nmax &&
         fscanf(data, "%lf %lf %lf %lf %d \n",
                &x[0][n], &x[1][n], &x[2][n], &x[3][n], &Species[n]) == 5) {
    // Print the first five entries as a check of what is read:
    if (n < 5) printf("  Read data:  %5.2f  %5.2f  %5.2f  %5.2f  %3d \n",
                      x[0][n], x[1][n], x[2][n], x[3][n], Species[n]);
    n++;
  }
  printf("  Found %d entries. \n", n);
  fclose(data);

  // ------------------------------------------------------------------ //
  // Analyse data:
  // ------------------------------------------------------------------ //

}

//----------------------------------------------------------------------------------
/*

Start by taking a CLOSE look at the data and what the computer reads in!

Questions:
----------
 1) First consider the assumption that there are only two species. How would you
    start analysing the data, and what observation would make you think that this
    was in fact the case?

 2) Now try to make a selection which separates the two species, i.e. a line like:
      if (x[0][i] > 0.25 && x[1][i] < 0.7) "select".
    The selection does not necessarily have to use all four variables.

 3) How good is your selection? That is, what efficiency and power does it have on
    this small sample? Yes, just count how many it gets right and wrong (errors of
    type I and type II).

 4) Now try to separate the two samples which appear less distinct (Versicolor and
    Virginica). First use simple cuts, and see how well you can do. Then apply
    Fisher's linear discriminant to separate the samples. What separation do you
    get in the two cases?
    NOTE: For calculating your Fisher discriminant (i.e. linear combination of
          variables) you need to know the covariance matrices of both signal and
          background, and to invert their sum. ROOT has matrix inversion built
          into TMatrixD (function: Invert).

 5) Would you suspect overtraining?

Advanced questions:
-------------------
 1) Try to generate a "toy" sample with Gaussian distributions and the same
    correlations (not easy!!!), and apply the Fisher discriminant to it. What is
    the variation in separation? Could this be considered the "uncertainty" in
    the separation?

*/
//----------------------------------------------------------------------------------
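
//----------------------------------------------------------------------------------
// A minimal sketch for question 4, meant as a possible starting point and not as
// the intended solution. It computes Fisher weights w = (V_A + V_B)^-1 (mu_A - mu_B)
// for two samples A and B, and the resulting separation |<F_A> - <F_B>| /
// sqrt(var(F_A) + var(F_B)). Assumptions: only TWO of the four variables are used,
// so that the 2x2 matrix can be inverted by hand and the code runs without ROOT
// (for all four variables, TMatrixD and its Invert function would be the natural
// choice). The names Point, Mean2D, Cov2D, FisherWeights2D and Separation are
// hypothetical helpers and are not part of the macro above.
//----------------------------------------------------------------------------------

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type: one flower described by two chosen variables.
typedef std::array<double, 2> Point;

// Mean of a sample of points.
Point Mean2D(const std::vector<Point>& s) {
  Point m = {0.0, 0.0};
  for (const Point& p : s) { m[0] += p[0]; m[1] += p[1]; }
  m[0] /= s.size();
  m[1] /= s.size();
  return m;
}

// Unbiased sample covariance matrix, stored as {V00, V01, V11} since it is symmetric.
// Requires at least two points in the sample.
std::array<double, 3> Cov2D(const std::vector<Point>& s) {
  const Point m = Mean2D(s);
  std::array<double, 3> v = {0.0, 0.0, 0.0};
  for (const Point& p : s) {
    const double dx = p[0] - m[0];
    const double dy = p[1] - m[1];
    v[0] += dx * dx;
    v[1] += dx * dy;
    v[2] += dy * dy;
  }
  for (double& e : v) e /= (s.size() - 1);
  return v;
}

// Fisher weights: w = (V_A + V_B)^-1 (mu_A - mu_B), with an explicit 2x2 inverse.
// Assumes the summed covariance matrix is not singular (det != 0).
Point FisherWeights2D(const std::vector<Point>& a, const std::vector<Point>& b) {
  const Point ma = Mean2D(a), mb = Mean2D(b);
  const std::array<double, 3> va = Cov2D(a), vb = Cov2D(b);
  const double w00 = va[0] + vb[0];
  const double w01 = va[1] + vb[1];
  const double w11 = va[2] + vb[2];
  const double det = w00 * w11 - w01 * w01;
  const double d0 = ma[0] - mb[0];
  const double d1 = ma[1] - mb[1];
  Point w = {( w11 * d0 - w01 * d1) / det,
             (-w01 * d0 + w00 * d1) / det};
  return w;
}

// Separation of the two samples in the Fisher variable F = w0*x0 + w1*x1:
// distance between the means of F in units of the combined spread.
double Separation(const Point& w,
                  const std::vector<Point>& a, const std::vector<Point>& b) {
  const std::vector<Point>* sample[2] = {&a, &b};
  double mean[2], var[2];
  for (int k = 0; k < 2; k++) {
    double sum = 0.0, sum2 = 0.0;
    const double N = sample[k]->size();
    for (const Point& p : *sample[k]) {
      const double F = w[0] * p[0] + w[1] * p[1];
      sum  += F;
      sum2 += F * F;
    }
    mean[k] = sum / N;
    var[k]  = (sum2 - N * mean[k] * mean[k]) / (N - 1.0);
  }
  return std::fabs(mean[0] - mean[1]) / std::sqrt(var[0] + var[1]);
}
```

// Possible use inside FisherDiscriminant(): fill one std::vector<Point> per species
// from two of the rows of x[][] (selecting entries by whatever integer codes the
// data file uses in Species[]), call FisherWeights2D on the two vectors, and compare
// the value returned by Separation with what simple cuts achieve.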