// ----------------------------------------------------------------------------------- //
/*
  ROOT macro for analysing the famous Anderson-Fisher Iris data set from the Gaspe
  Peninsula.

  Edgar Anderson took 50 samples of each of three species of Iris (Setosa, Versicolor,
  and Virginica) on the Gaspe Peninsula, and measured four distinguishing features on
  each flower:
    Sepal Length
    Sepal Width
    Petal Length
    Petal Width
  Using these measurements, Ronald Fisher was able to make a classification scheme, for
  which he invented the Fisher Linear Discriminant.

  References:
    Glen Cowan, Statistical Data Analysis, pages 51-57
    http://en.wikipedia.org/wiki/Iris_flower_data_set
    http://en.wikipedia.org/wiki/Linear_discriminant_analysis

  Author: Troels C. Petersen (NBI)
  Email:  petersen@nbi.dk
  Date:   7th of October 2012
*/
// ----------------------------------------------------------------------------------- //


// ----------------------------------------------------------------------------------- //
void FisherDiscriminant() {
// ----------------------------------------------------------------------------------- //

  // Choose which statistics and fit results to show (0 means off, 1111 means all on):
  gStyle->SetOptStat(1111);
  // gStyle->SetOptStat(0);
  gStyle->SetOptFit(1111);
  // gStyle->SetOptFit(0);

  // Placement of the statistics and fit results box (uncomment one pair):
  // gStyle->SetStatX(0.52);    // Top left corner.
  // gStyle->SetStatY(0.86);
  // gStyle->SetStatX(0.89);    // Bottom right corner.
  // gStyle->SetStatY(0.33);

  // Set the graphics:
  gStyle->SetStatBorderSize(1);
  gStyle->SetStatFontSize(0.055);
  gStyle->SetCanvasColor(0);    // White canvas background (colour 4 is blue).
  gStyle->SetPalette(1);


  // ------------------------------------------------------------------ //
  // Read data from file:
  // ------------------------------------------------------------------ //

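  // Open data file (and stop if it cannot be opened):
  FILE *data = fopen("DataSet_AndersonFisherIris.txt","r");
  if (data == NULL) {
    printf(" Error: Could not open DataSet_AndersonFisherIris.txt \n");
    return;
  }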

  // The information to be read is:
  // Sepal Length, Sepal Width, Petal Length, Petal Width, and Species.
  const int Nvar =   4;
  const int Nmax = 200;
  int n = 0;
  double x[Nvar][Nmax];
  int Species[Nmax];

  // Loop over and read data as long as a complete line (five values) can be read
  // and there is room left in the arrays.
  while (n < Nmax &&
         fscanf(data, "%lf %lf %lf %lf %d \n",
                &x[0][n], &x[1][n], &x[2][n], &x[3][n], &Species[n]) == 5) {
    if (n < 5) printf("  Read data: %5.2f %5.2f %5.2f %5.2f   %3d \n",
		      x[0][n], x[1][n], x[2][n], x[3][n], Species[n]);
    n++;
  }
  printf(" Found %d entries. \n", n);

  fclose(data);


  // ------------------------------------------------------------------ //
  // Analyse data:
  // ------------------------------------------------------------------ //


}

//----------------------------------------------------------------------------------
/*

Start by taking a CLOSE look at the data and what the computer reads in!

Questions:
----------
 1) First consider the assumption that there are only two species. How would you
    start analysing the data, and what observation would make you think that this
    was in fact the case?

 2) Now try to make a selection that separates the two species, e.g. a line like:
      if (x[0][i] > 0.25 && x[1][i] < 0.7) "select".
    The selection does not necessarily have to use all four variables.

 3) How good is your selection? That is, what efficiency and power does it have
    on this small sample? Simply count how many flowers it gets right and wrong
    (errors of type I and type II).

 4) Now try to separate the two samples that appear less distinct (Versicolor
    and Virginica). First use simple cuts, and see how well you can do. Then
    apply Fisher's linear discriminant to separate the samples. What separation
    do you get in the two cases?
    NOTE: To calculate your Fisher discriminant (i.e. a linear combination of
    variables) you need the covariance matrices of both signal and background,
    and you need to invert their sum. ROOT has matrix inversion built into
    TMatrixD (function: Invert).

 5) Would you suspect overtraining?

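The calculation described in the NOTE of question 4 can be illustrated as below.
This is a minimal, ROOT-free C++ sketch under stated assumptions, not the solution
of this macro: the names Vec, Mat, Mean, Covariance, Solve, FisherWeights and Fisher
are hypothetical, samples are assumed to be non-empty and stored as
sample[event][variable], and the sum of the covariance matrices is assumed to be
non-singular. Instead of inverting the matrix explicitly (as TMatrixD would), the
sketch solves the equivalent linear system by Gaussian elimination.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> Mat;   // also used for samples: sample[event][variable]

// Mean of each variable over all events in the sample.
Vec Mean(const Mat& sample) {
  Vec mu(sample[0].size(), 0.0);
  for (std::size_t i = 0; i < sample.size(); i++)
    for (std::size_t k = 0; k < mu.size(); k++) mu[k] += sample[i][k] / sample.size();
  return mu;
}

// Covariance matrix V[k][l] = <(x_k - mu_k)(x_l - mu_l)>, normalised by N.
Mat Covariance(const Mat& sample, const Vec& mu) {
  const std::size_t d = mu.size();
  Mat V(d, Vec(d, 0.0));
  for (std::size_t i = 0; i < sample.size(); i++)
    for (std::size_t k = 0; k < d; k++)
      for (std::size_t l = 0; l < d; l++)
        V[k][l] += (sample[i][k] - mu[k]) * (sample[i][l] - mu[l]) / sample.size();
  return V;
}

// Solve W w = b by Gaussian elimination with partial pivoting
// (same result as w = W^-1 b; W is assumed to be non-singular).
Vec Solve(Mat W, Vec b) {
  const std::size_t d = b.size();
  for (std::size_t c = 0; c < d; c++) {
    std::size_t p = c;                       // Row with the largest pivot.
    for (std::size_t r = c + 1; r < d; r++)
      if (std::fabs(W[r][c]) > std::fabs(W[p][c])) p = r;
    std::swap(W[c], W[p]);
    std::swap(b[c], b[p]);
    for (std::size_t r = c + 1; r < d; r++) {  // Eliminate below the pivot.
      const double f = W[r][c] / W[c][c];
      for (std::size_t k = c; k < d; k++) W[r][k] -= f * W[c][k];
      b[r] -= f * b[c];
    }
  }
  Vec w(d, 0.0);
  for (std::size_t c = d; c-- > 0;) {          // Back substitution.
    double s = b[c];
    for (std::size_t k = c + 1; k < d; k++) s -= W[c][k] * w[k];
    w[c] = s / W[c][c];
  }
  return w;
}

// Fisher weights: w = (V_A + V_B)^-1 (mu_A - mu_B).
Vec FisherWeights(const Mat& A, const Mat& B) {
  const Vec muA = Mean(A), muB = Mean(B);
  Mat W = Covariance(A, muA);
  const Mat VB = Covariance(B, muB);
  Vec diff(muA.size());
  for (std::size_t k = 0; k < diff.size(); k++) {
    diff[k] = muA[k] - muB[k];
    for (std::size_t l = 0; l < diff.size(); l++) W[k][l] += VB[k][l];
  }
  return Solve(W, diff);
}

// Fisher discriminant value of one event: F = w . x
double Fisher(const Vec& w, const Vec& x) {
  double F = 0.0;
  for (std::size_t k = 0; k < w.size(); k++) F += w[k] * x[k];
  return F;
}
```

The separation can then be judged by histogramming Fisher(w, x) for the two
species and comparing the two distributions with those obtained from simple cuts.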

Advanced questions:
-------------------
 1) Try to generate a "toy" sample with Gaussian distributions and the same
    correlations as in the data (not easy!!!), and apply the Fisher discriminant
    to it. What is the variation in separation? Could this be considered the
    "uncertainty" in the separation?
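
One common way to generate correlated Gaussian numbers is via a Cholesky
decomposition of the covariance matrix. The following is a minimal, ROOT-free C++
sketch of that approach, not necessarily the method intended here: the names Vec,
Mat, Cholesky and ToyEvent are hypothetical, and the covariance matrix V is assumed
to be symmetric and positive definite.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> Mat;

// Cholesky decomposition: lower-triangular L with V = L L^T.
// V is assumed to be symmetric and positive definite.
Mat Cholesky(const Mat& V) {
  const std::size_t d = V.size();
  Mat L(d, Vec(d, 0.0));
  for (std::size_t i = 0; i < d; i++) {
    for (std::size_t j = 0; j <= i; j++) {
      double s = V[i][j];
      for (std::size_t k = 0; k < j; k++) s -= L[i][k] * L[j][k];
      L[i][j] = (i == j) ? std::sqrt(s) : s / L[j][j];
    }
  }
  return L;
}

// One toy event: x = mu + L z, where z are independent unit Gaussians.
// The resulting x has mean mu and covariance L L^T = V.
Vec ToyEvent(const Vec& mu, const Mat& L, std::mt19937& rng) {
  std::normal_distribution<double> gauss(0.0, 1.0);
  Vec z(mu.size()), x(mu);
  for (std::size_t k = 0; k < z.size(); k++) z[k] = gauss(rng);
  for (std::size_t i = 0; i < x.size(); i++)
    for (std::size_t j = 0; j <= i; j++) x[i] += L[i][j] * z[j];
  return x;
}
```

Here mu and V would be the means and covariance matrix measured for one species.
Repeating the generation and the Fisher analysis many times gives a distribution of
separations, whose spread can be compared to the value found in data.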

*/
//----------------------------------------------------------------------------------