// ----------------------------------------------------------------------------------- //
/*
  ROOT macro for analysing the famous Anderson-Fisher Iris data set from the Gaspe Peninsula.

  Edgar Anderson took 50 samples of each of three species (Setosa, Versicolor, and
  Virginica) of Iris on the Gaspe Peninsula, and measured four distinguishing features
  on each flower:
    Sepal Length
    Sepal Width
    Petal Length
    Petal Width
  Using these measurements, Ronald Fisher was able to make a classification scheme, for
  which he invented the Fisher Linear Discriminant.

  References:
    Glen Cowan, Statistical Data Analysis, pages 51-57
    http://en.wikipedia.org/wiki/Iris_flower_data_set

  Author: Troels C. Petersen (NBI)
  Email:  petersen@nbi.dk
  Date:   2nd of October 2011
*/
// ----------------------------------------------------------------------------------- //


double sqr(double a) {
  return a*a;
}


// ----------------------------------------------------------------------------------- //
void FisherDiscriminant() {
// ----------------------------------------------------------------------------------- //
  gROOT->Reset();

  // Set the showing of statistics and fitting results (0 means off, 1111 means all on):
  gStyle->SetOptStat(1111);
  // gStyle->SetOptStat(0);
  gStyle->SetOptFit(1111);
  // gStyle->SetOptFit(0);

  // Placement of the statistics and fitting results box:
  // gStyle->SetStatX(0.52);    // Top left corner.
  // gStyle->SetStatY(0.86);
  // gStyle->SetStatX(0.89);       // Bottom right corner.
  // gStyle->SetStatY(0.33);

  // Set the graphics:
  gStyle->SetStatBorderSize(1);
  gStyle->SetStatFontSize(0.055);
  gStyle->SetCanvasColor(4);
  gStyle->SetPalette(1);


  // ------------------------------------------------------------------ //
  // Read data from file:
  // ------------------------------------------------------------------ //

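  // Open data file (and stop if it cannot be opened):
  FILE *data = fopen("DataSet_AndersonFisherIris.txt","r");
  if (data == NULL) {
    printf("  Error: Could not open file DataSet_AndersonFisherIris.txt \n");
    return;
  }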

  // The information to be read is:
  // Sepal Length, Sepal Width, Petal Length, Petal Width, and Species.
  const int Nmax = 200;
  int n = 0;
  double SL[Nmax], SW[Nmax], PL[Nmax], PW[Nmax];
  char Species[Nmax][32];     // Species name (at most 31 characters) for each entry.

  // Loop over and read data as long as a complete entry (five fields) is read,
  // and there is room left in the arrays.
  while (n < Nmax &&
         fscanf(data, "%lf %lf %lf %lf %31s",
                &SL[n], &SW[n], &PL[n], &PW[n], Species[n]) == 5) {
    if (n < 5) printf("  Read data: %5.2f %5.2f %5.2f %5.2f   %s \n",
                      SL[n], SW[n], PL[n], PW[n], Species[n]);
    n++;
  }
  printf(" Found %d entries. \n", n);

  fclose(data);


  // ------------------------------------------------------------------ //
  // Analyse data:
  // ------------------------------------------------------------------ //


}

//---------------------------------------------------------------------------------- 
/*

Start by taking a CLOSE look at the data and what the computer reads in!

Questions:
----------
 1) First consider the assumption that there are two species. How would you start
    analysing the data, and what observation would make you think that this was in
    fact the case?

 2) Now try to make a selection, which separates the two species. The selection
    should be based on one variable at a time, and all four variables do not
    necessarily have to be used.

 3) How good is your selection? That is, what efficiency and power does it
    (on this small sample) have?

 4) Now try to separate the two samples that appear less distinct. First use
    simple cuts, and then apply Fisher's linear discriminant to separate the
    samples. What separation do you get in the two cases?


Advanced questions:
-------------------
 1) Try to generate a "toy" sample with Gaussian distributions and the same
    correlations (not easy!!!), and apply the Fisher discriminant here.
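
Hint (a minimal sketch under stated assumptions): one common way of generating
correlated Gaussian numbers is to decompose the covariance matrix V as
V = L L^T (Cholesky decomposition) and to set x = mu + L z, where z is a vector
of independent unit Gaussian numbers. The helpers choleskyLower and toyPoint
are hypothetical names, and the C++ standard library generator is assumed
here; inside ROOT, TRandom3::Gaus could supply the unit Gaussians instead.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Lower-triangular Cholesky factor L of a symmetric covariance matrix V,
// such that V = L L^T. Returns an empty matrix if V is not positive definite.
std::vector<std::vector<double>> choleskyLower(const std::vector<std::vector<double>>& V) {
  const std::size_t n = V.size();
  std::vector<std::vector<double>> L(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; i++) {
    for (std::size_t j = 0; j <= i; j++) {
      double sum = V[i][j];
      for (std::size_t k = 0; k < j; k++) sum -= L[i][k] * L[j][k];
      if (i == j) {
        if (sum <= 0.0) return {};      // Not positive definite.
        L[i][i] = std::sqrt(sum);
      } else {
        L[i][j] = sum / L[j][j];
      }
    }
  }
  return L;
}

// One toy point x = mu + L z, with z independent unit Gaussian numbers.
std::vector<double> toyPoint(const std::vector<double>& mu,
                             const std::vector<std::vector<double>>& L,
                             std::mt19937& rng) {
  std::normal_distribution<double> unitGauss(0.0, 1.0);
  const std::size_t n = mu.size();
  std::vector<double> z(n);
  for (std::size_t i = 0; i < n; i++) z[i] = unitGauss(rng);

  std::vector<double> x(mu);
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j <= i; j++) x[i] += L[i][j] * z[j];
  return x;
}
```

The means and the covariance matrix would be estimated from the data for each
species separately, so that each toy sample mimics one species.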


*/
//----------------------------------------------------------------------------------