// ----------------------------------------------------------------------------------- //
/*
   ROOT macro for analysing the famous Anderson-Fisher Iris data set from the Gaspe Peninsula.

   Edgar Anderson took 50 samples of each of three species (Setosa, Versicolor, and Virginica)
   of Iris at the Gaspe Peninsula, and measured four distinguishing features on each flower:
     Sepal Length
     Sepal Width
     Petal Length
     Petal Width

   Using these measurements, Ronald Fisher was able to make a classification scheme, for which
   he invented the Fisher Linear Discriminant.

   References:
     Glen Cowan, Statistical Data Analysis, pages 51-57
     http://en.wikipedia.org/wiki/Iris_flower_data_set

   Author: Troels C. Petersen (NBI)
   Email:  petersen@nbi.dk
   Date:   2nd of October 2011
*/
// ----------------------------------------------------------------------------------- //

double sqr(double a) {
  return a*a;
}

// ----------------------------------------------------------------------------------- //
void FisherDiscriminant() {
// ----------------------------------------------------------------------------------- //
  gROOT->Reset();

  // Set the showing of statistics and fitting results (0 means off, 1111 means all on):
  gStyle->SetOptStat(1111);     // gStyle->SetOptStat(0);
  gStyle->SetOptFit(1111);      // gStyle->SetOptFit(0);

  // Statistics and fitting results placed in:
  // gStyle->SetStatX(0.52);    // Top left corner.
  // gStyle->SetStatY(0.86);
  // gStyle->SetStatX(0.89);    // Bottom right corner.
  // gStyle->SetStatY(0.33);

  // Set the graphics:
  gStyle->SetStatBorderSize(1);
  gStyle->SetStatFontSize(0.055);
  gStyle->SetCanvasColor(4);
  gStyle->SetPalette(1);

  // ------------------------------------------------------------------ //
  // Read data from file:
  // ------------------------------------------------------------------ //

  // Open data file (and stop if it cannot be opened):
  FILE *data = fopen("DataSet_AndersonFisherIris.txt","r");
  if (data == NULL) {
    printf(" Error: Could not open DataSet_AndersonFisherIris.txt \n");
    return;
  }

  // The information to be read is:
  // Sepal Length, Sepal Width, Petal Length, Petal Width, and Species.
  const int Nmax = 200;
  int n = 0;
  double SL[Nmax], SW[Nmax], PL[Nmax], PW[Nmax];
  char Species[Nmax][32];   // Fixed-size buffers, so that fscanf has memory to write each name into.

  // Loop over and read data as long as there is room in the arrays and a line
  // yields all five fields (fscanf returns the number of fields it converted).
  while (n < Nmax &&
         fscanf(data, "%lf %lf %lf %lf %31s", &SL[n], &SW[n], &PL[n], &PW[n], Species[n]) == 5) {
    if (n < 5) printf(" Read data: %5.2f %5.2f %5.2f %5.2f %s \n", SL[n], SW[n], PL[n], PW[n], Species[n]);
    n++;
  }
  printf(" Found %d entries. \n", n);
  fclose(data);

  // ------------------------------------------------------------------ //
  // Analyse data:
  // ------------------------------------------------------------------ //

}

//----------------------------------------------------------------------------------
/*

Start by taking a CLOSE look at the data and what the computer reads in!

Questions:
----------
 1) First consider the assumption that there are two species. How would you
    start analysing the data, and what observation would make you think that
    this was in fact the case?

 2) Now try to make a selection, which separates the two species. The selection
    should be based on one variable at a time, and all four variables do not
    necessarily have to be used.

 3) How good is your selection? That is, what efficiency and power does it have
    (on this small sample)?

 4) Now try to separate the two samples, which appear less distinct. First use
    simple cuts, and then apply Fisher's linear discriminant to separate the
    samples. What separation do you get in the two cases?


Advanced questions:
-------------------
 1) Try to generate a "toy" sample with Gaussian distributions and the same
    correlations as in the data (not easy!!!), and apply the Fisher discriminant
    here.

*/
//----------------------------------------------------------------------------------
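// ----------------------------------------------------------------------------------- //
// Question 4 asks for Fisher's linear discriminant. The block below is a minimal sketch
// of one way to compute it, not the intended solution. It is plain C++ (no ROOT) and
// assumes only TWO input variables per flower, so that the 2x2 matrix can be inverted
// by hand. The names Fisher2D, mean, covariance, fisherWeights and fisherValue are
// hypothetical and do not appear in the macro above. The weights follow
//   w = (Sigma_A + Sigma_B)^-1 (mu_A - mu_B),
// and the discriminant for a flower with values (x, y) is F = w0*x + w1*y.
// ----------------------------------------------------------------------------------- //

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fisher weights for two input variables (hypothetical helper type).
struct Fisher2D {
  double w0;
  double w1;
};

// Mean of a sample.
double mean(const std::vector<double>& x) {
  assert(!x.empty());
  double sum = 0.0;
  for (double v : x) sum += v;
  return sum / x.size();
}

// Covariance of two equally long samples (normalised by N, not N-1).
double covariance(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size() && !x.empty());
  const double mx = mean(x);
  const double my = mean(y);
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); i++) sum += (x[i] - mx) * (y[i] - my);
  return sum / x.size();
}

// Fisher weights w = (Sigma_A + Sigma_B)^-1 (mu_A - mu_B) for species A and B,
// each described by the two variables x and y.
Fisher2D fisherWeights(const std::vector<double>& xA, const std::vector<double>& yA,
                       const std::vector<double>& xB, const std::vector<double>& yB) {
  // Summed covariance matrix S = Sigma_A + Sigma_B (symmetric 2x2):
  const double sxx = covariance(xA, xA) + covariance(xB, xB);
  const double syy = covariance(yA, yA) + covariance(yB, yB);
  const double sxy = covariance(xA, yA) + covariance(xB, yB);

  // S must be invertible:
  const double det = sxx * syy - sxy * sxy;
  assert(std::fabs(det) > 1e-12);

  // Difference of the means of the two species:
  const double dx = mean(xA) - mean(xB);
  const double dy = mean(yA) - mean(yB);

  // Inverse of S is (1/det) * [[syy, -sxy], [-sxy, sxx]]:
  Fisher2D f;
  f.w0 = ( syy * dx - sxy * dy) / det;
  f.w1 = (-sxy * dx + sxx * dy) / det;
  return f;
}

// Value of the Fisher discriminant for one flower.
double fisherValue(const Fisher2D& f, double x, double y) {
  return f.w0 * x + f.w1 * y;
}
```

// In the macro, one could for example fill the four vectors with PL and PW for the two
// less distinct species, and then histogram fisherValue for every flower of each species.
// With this sign convention, species A tends towards larger values of F than species B.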