#!/usr/bin/env python

# ----------------------------------------------------------------------------------- #
# Python/ROOT script for analysis of ATLAS test beam data.
#
# The program reads an ASCII (i.e. text) data file, containting a large number of events,
# where a charged particle (electron or pion) passed through a slice of the ATLAS detector.
# Each passage is recorded by different detectors, boiling down to eleven numbers (some more
# relevant than others). The exercise is to separate electron and pion events, and in turn
# us this information to measure the interaction of pions and electrons seperately.
#
# NOTE: Though the data is from particle physics, it could in principle have been from ANY
# other source, and the eleven numbers could for example have been indicators of cancer,
# key numbers for investors, or index numbers for identifying potential costumors.
# 
# For more information on ATLAS test beam: 
#   http://www.nbi.dk/~petersen/Teaching/Stat2016/TestBeam/TestBeamDataAnalysis.html
# 
# Author: Troels C. Petersen (NBI)
# Email:  petersen@nbi.dk
# Date:   10th of December 2016
# 
# ----------------------------------------------------------------------------------- #

from ROOT import *
from math import *


# ----------------------------------------------------------------------------------- #
# Read the data file:
# ----------------------------------------------------------------------------------- #

# Write extensive output
verbose = True
Nverbose = 10
Nread = 0

Hist_Cher    = TH1F("Hist_Cher", ";Cherenkov;Frequency", 200, 400.0, 1400.0)
Hist_EM1     = TH1F("Hist_EM1",  ";EM1;Frequency",       200,  -0.5,    2.0)
Hist_CherEM1 = TH2F("Hist_CherEM1", ";Cherenkov;EM1;Frequency", 100, 400.0, 1400.0, 100, -0.5, 2.0)


# Open file
f = open('DataSet_AtlasPid_ElectronPion_2GeV.txt', 'r')

# Loop over lines and extract variables of interest
header = f.readline()

for line in f:
    line = line.strip()
    columns = line.split()

    Cher = float(columns[0])
    nLT  = int(columns[1])
    nHT  = int(columns[2])
    EM0  = float(columns[3])
    EM1  = float(columns[4])
    EM2  = float(columns[5])
    EM3  = float(columns[6])
    Had0 = float(columns[7])
    Had1 = float(columns[8])
    Had2 = float(columns[9])
    Muon = float(columns[10])

    Hist_Cher.Fill(Cher)
    Hist_EM1.Fill(EM1)
    Hist_CherEM1.Fill(Cher,EM1)

    if (verbose and Nread < Nverbose) :
        print "Cher: %6.1f    nLT,nHT: %2d %2d    EM: %5.2f %5.2f %5.2f %5.2f   Had: %5.2f %5.2f %5.2f    Muon: %5.1f"%(Cher, nLT, nHT, EM0, EM1, EM2, EM3, Had0, Had1, Had2, Muon)
    Nread += 1

f.close()

print "  Total number of events read: %d"%Nread


# ----------------------------------------------------------------------------------- #
# Your analysis:
# ----------------------------------------------------------------------------------- #

canvas1 = TCanvas("canvas1", "Distributions", 50, 50, 900, 600)
canvas1.Divide(2,2)

canvas1.cd(1)
Hist_CherEM1.Draw("colz")

canvas1.cd(2)
Hist_EM1.Draw()

canvas1.cd(3)
Hist_Cher.Draw()

canvas1.Update()


raw_input( ' ... Press Enter to exit ... ' )

# ----------------------------------------------------------------------------------- #
# Questions to be answered:
# ----------------------------------------------------------------------------------- #
# 
# Generally, this analysis is about separating electrons and pions (and determining how
# well this can be done), followed by a few questions characterizing the detector
# response to each type of particle.
# Below are questions guiding you, some/most of which your analysis should cover, but
# you do not have to follow them blindly (I've put "Optional" on those that are not
# essential).
# Start by considering the data, and get a feel for the typical range of each variable.
# Plot the variables, both in 1D and perhaps also 2D! From considering these plots,
# guess/estimate an approximate knowledge of how electrons and pions distribute
# themselves in the variables above, and how to make a selection of these.
#
# As described on the webpage introducing the data, the three detectors, Cherenkov,
# TRT and Calorimeters are each capable of separating electrons and pions. As they are
# INDEPENDENT (three separate detectors), they may be used to cross check each other,
# and this is what you should use!
#
# 
# Questions:
# ----------
# 1) Find for each of these three detector systems one variable (i.e. a combination),
# which seem to separate electrons and pions best. For example, start with the
# Cherenkov, which is only a single number, and assume/guess that the large peak
# at low values is mainly from pions, while the upper broad peak is from electrons.
# Now plot the TRT and Calorimeter distributions when the Cherenkov selects a pion
# and afterwards an electron. This should give you a hint of how to separate pions
# and electrons using the TRT and Calorimeters.
#
# Hint: Sometimes variables from a single detector are more powerful, when they are
# combined, e.g. taken ratios of. For the TRT this may be somewhat obvious, but for
# the EMcalo, it is not as simple. While one may simply use the second layer,
# involving this layer in a ratio may enhance the separation power.
#
# 2) Next you should try to see, if you can make a selection, which gives you a
# fairly large and clean electron and pion sample, respectively. The question is, how
# can you know how clean your sample is and how efficient your selection is?
# This can actually be measured in the data itself, using the fact that there are
# three independent detectors. For example, start by making an electron and a pion
# selection using two of the three variables, and plot the third variable for each of
# these selections. Now you can directly see, how electrons and pions will distribute
# themselves in this third variable. Are you worried, that there are pions in your
# electron sample, and vice versa? Well, there will probably be, but so few, that it
# won't matter too much, at least not to begin with.
# Why? Well, let us assume that for each detector, 80% of electrons pass your
# requirement, but also 10% of pions do. Assuming an even number of electrons and
# pions (which is not really the case), then with two detector cuts, you should get
# a sample, which is: 0.8*0.8 / (0.8*0.8 + 0.1*0.1) = 98.5% pure.
#
# Now with this sample based on cuts on the two other detectors, ask what fraction
# of electrons and pions passes your electron selection. The fraction of electrons,
# that are not selected as electrons will be your TYPE I errors, denoted alpha,
# while the fraction of pions, that do pass the cut will be your TYPE II errors,
# denoted beta.
# Measure these for each of the two cuts in the three detector types, and ask yourself
# if they are "reasonable", i.e. something like in the example above. If not, then
# you should perhaps reconsider adjusting your cuts.
#
# By now, you should for each detector have 6 numbers:
# - The electron cut value above which you accept an electron.
# - The efficiency (i.e. 1-alpha) for electrons of this cut.
# - The fake rate (i.e. beta) for pions of this cut.
# - The pion cut value below which you accept a pion (may be same value as above!).
# - The efficiency (i.e. 1-alpha) for pions of this cut.
# - The fake rate (i.e. beta) for electrons of this cut.
#
# 3) Given the efficiencies and fake rates of each cut, try to combine these (again
# assuming that they are independent) into knowledge of your sample purities and
# also the total number of electrons and pions in the whole sample. Do the sum of
# estimated electrons and pions added actually match the number of particles in total?
# This is a good cross check!
#
# 4) If the number of pions was suddenly 1.000 that of elections, would you still be
# able to get a sample of fairly pure (say 90% pure) electrons? And if so, what would
# the efficiency for these electrons be?
#
# 5) One of the purposes of the testbeam what to measure the response of the TRT
# detector to exactly electrons and pions. Consider for example only events that has
# 33 TRT hits (i.e. nLT = 33). As the High-Threshold probability (i.e. probability of
# passing the High-Threshold, given that the Low-Threshold was passed), is assumed to
# be constant in the TRT detector (but quiet different for electrons and pions), what
# distribution should the number of High-Threshold hits (nHT) follow? And is that
# really the case, both for electrons and pions?
# 
# 6) Expanding on problem 2), try now to calculate ROC curves for each of the three
# detectors. These are obtained by making a clean selection using the two other
# detectors for electrons and pions seperately, and then integrating over these two
# distributions, using the running (normalised) integral of each as x and y coordinate
# in a (ROC) graph.
# If you do not manage on your own, perhaps consider the ROC calculator example,
# which is posted along with this exercise.
# 
# (Optional)
# 7) Still considering nLT = 33, and given that there are both electrons and pions in
# the sample (each with a different HT probability), nHT should in the unselected data
# be a combination of two distributions. Try to fit the number of HT hits with two
# distributions combined.
# Do they fit the data? And can this fit be used to estimate the fraction of pions and
# electrons in the sample? And does that estimate match you previous estimate?
# Perhaps retry with other values for the number of TRT hits.
#
# (Optional)
# 8) Try to select pions using three different (mutually exclusive) techniques:
#    a) Passing only a hadronic calorimeter requirement (e.g. that the sum of the three
#       HCal values is above some minimum energy).
#    b) Passing only Cherenkov AND EMcalo requirements.
#    c) Passing both a) and b).
# Try to measure the HT probability (i.e. fraction of High-Threshold hits) for each of
# these three pion samples. Do they agree with each other?
# 
# ----------------------------------------------------------------------------------- #