ML and Big Data 2019 - Small Project

"Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type - quietly but meaningfully improving core operations." [Jeff Bezos, CEO of Amazon]

Project description:
The project concerns classification and/or regression on a structured data set. The data contained in the two files "train.h5" and "test.h5" is simulated data from the ATLAS experiment at CERN; more specifically, it contains a long list of measurements (i.e. input variables) for electron candidates. One of the data sets ("train.h5") also contains two truth variables from the simulation input, namely whether the candidate is an electron ("Truth", 0 for background, 1 for signal) and its energy ("p_truth_E").
The two data files can be found here: train.h5 (86 MB) and test.h5 (84 MB).
To ensure that you can read the file properly, we have prepared a small script for reading the data: ReadingData.py (8 kB).

You should submit the project at Absalon by 22:00 on Monday the 20th of May 2019.

Problem statement:
Using Machine Learning algorithm(s), try to solve at least one of the following problems:

Identify (i.e. classify) electrons compared to non-electrons based on the target variable "Truth": 0 for non-electrons, 1 for electrons.

Estimate (i.e. make a regression for) the energy of electrons based on the target variable "p_truth_E": energy (in GeV) of the electrons.

You should use "train.h5" to develop your algorithm, and when you feel satisfied, you should apply it to "test.h5", where you don't know the true values. When training (on "train.h5"), remember to divide the sample into a part that you train on and a part that you validate on, so that you don't overtrain (discussed in class). You may or may not want to use k-fold cross validation.
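As an illustration of the train/validation division, here is a minimal sketch of k-fold index splitting in plain Python. The function name, the number of folds, and the seed are assumptions made for this example, not part of the course material; libraries such as scikit-learn offer ready-made equivalents.

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled indices as evenly as possible over k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

# Toy usage: 10 events, 5 folds. Each event is validated on exactly once.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), "training events,", len(valid_idx), "validation events")
```

Each fold trains on the events outside the validation part, so the validation score is always computed on events the model has not seen.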
You "only" have to submit ONE solution for ONE of these problems. However, it is typically not hard to rewrite the code to solve the other problem as well. You are welcome to submit up to three solutions for each problem using different algorithms. The solution(s) should NOT USE MORE THAN 30 VARIABLES, but you're welcome to use different variables for each solution. Thus part of the problem is determining which of your features are important, and selecting these. Most ML algorithms have an internal feature ranking, but these are seldom very accurate; an alternative could be to consider SHAP values.
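Once you have some importance score per variable (from a model's internal ranking, mean absolute SHAP values, or something else), keeping the 30-variable limit can be as simple as the sketch below. The function name, the score values, and the variable "some_other_variable" are hypothetical and only serve the illustration.

```python
MAX_VARIABLES = 30  # limit stated in the project description

def select_top_variables(importance, max_variables=MAX_VARIABLES):
    """Return the names of the highest-scoring variables, most important first."""
    ranked = sorted(importance, key=lambda name: abs(importance[name]), reverse=True)
    return ranked[:max_variables]

# Hypothetical importance scores, NOT real results from the data set.
example_importance = {
    "p_eta": 0.12,
    "p_pt_track": 0.31,
    "some_other_variable": 0.02,
}

selected = select_top_variables(example_importance, max_variables=2)
print(selected)
```

The ranking is only as good as the scores that go into it, so it may be worth comparing the selection obtained from two different importance measures.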
Note that for the classification problem, you should not use "p_truth_E" as an input feature variable, as it does not exist in "test.h5". Also, for the regression problem, you should require "Truth" to be 1 (i.e. electron), as one only wants the energy for real electrons (and we will evaluate this way).
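The two rules above could be sketched as follows, assuming the events are held as a list of dictionaries (one per event). This representation, the helper names, and the toy values are assumptions made for the example; with a pandas DataFrame the same logic would be a column drop and a boolean mask.

```python
# Truth variables exist only in "train.h5" and must never be used as inputs.
TRUTH_COLUMNS = ("Truth", "p_truth_E")

def input_variables(all_columns):
    """Return the columns that may be used as input features."""
    return [c for c in all_columns if c not in TRUTH_COLUMNS]

def electrons_only(rows):
    """Keep only true electrons (Truth == 1), as required for the regression."""
    return [row for row in rows if row["Truth"] == 1]

# Toy events with made-up values, NOT taken from the real data.
rows = [
    {"p_eta": 0.5, "Truth": 1, "p_truth_E": 45.0},
    {"p_eta": -1.2, "Truth": 0, "p_truth_E": 12.0},
]

features = input_variables(rows[0].keys())
regression_rows = electrons_only(rows)
print(features, len(regression_rows))
```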

Solutions and submission:
You (i.e. each student) should hand in (each of) your own solution(s) as TWO separate files:

A list of index/event numbers (1, 2, 3, etc.) followed by your estimate for each event, i.e. (a classification solution is shown here; for a regression solution, the last number should be the energy estimate in GeV):
     0 0.998232
     1 0.410455
     2 0.037859
     3 ...

A list of the variables you've used for each problem, i.e.:
     p_eta
     p_pt_track
     ...
You should name your files as follows: TypeOfProblemSolved_FirstnameLastname_SolutionName(_VariableList).txt
For myself, three example solutions could be:
Solution 1: Classification_TroelsPetersen_SKLearnAlgo1.txt, Classification_TroelsPetersen_SKLearnAlgo1_VariableList.txt
Solution 2: Classification_TroelsPetersen_XGBoost1.txt, Classification_TroelsPetersen_XGBoost1_VariableList.txt
Solution 3: Regression_TroelsPetersen_kNN-DidNotReallyWorkWell.txt, Regression_TroelsPetersen_kNN-DidNotReallyWorkWell_VariableList.txt
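A minimal sketch of writing the two files with the naming scheme above is given here. The helper names are hypothetical, and the choices of starting the index at 0 and printing six decimals simply follow the example listing; they are assumptions rather than stated requirements.

```python
import os
import tempfile

MAX_VARIABLES = 30  # limit stated in the project description

def solution_filenames(problem, first_name, last_name, solution_name):
    """Return (estimates_file, variable_list_file) following the naming scheme."""
    base = f"{problem}_{first_name}{last_name}_{solution_name}"
    return f"{base}.txt", f"{base}_VariableList.txt"

def write_solution(directory, problem, first_name, last_name, solution_name,
                   estimates, variables):
    """Write the estimates file and the variable list file; return both paths."""
    if len(variables) > MAX_VARIABLES:
        raise ValueError(f"{len(variables)} variables used, at most {MAX_VARIABLES} allowed")
    est_name, var_name = solution_filenames(problem, first_name, last_name, solution_name)
    est_path = os.path.join(directory, est_name)
    var_path = os.path.join(directory, var_name)
    with open(est_path, "w") as f:
        # One line per event: index, then the estimate.
        for index, estimate in enumerate(estimates):
            f.write(f"{index} {estimate:.6f}\n")
    with open(var_path, "w") as f:
        # One variable name per line.
        for name in variables:
            f.write(f"{name}\n")
    return est_path, var_path

# Usage with the first three estimates from the example listing.
output_dir = tempfile.mkdtemp()
est_path, var_path = write_solution(
    output_dir, "Classification", "Troels", "Petersen", "XGBoost1",
    estimates=[0.998232, 0.410455, 0.037859],
    variables=["p_eta", "p_pt_track"],
)
print(os.path.basename(est_path), os.path.basename(var_path))
```

Checking the variable count at write time catches a solution that breaks the 30-variable limit before it is submitted.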

You should submit the project via Absalon by 22:00 on Monday the 20th of May 2019.

Evaluation:
We will evaluate this project mostly on whether you handed it in (or not!), with less attention paid to your method(s), solution variation, variable choice, and performance. This small project counts for 20% of your final points/grade.
Enjoy, have fun, and throw yourself boldly at the data. The main targets are:

Classification: Predict if the particle is an electron or not (target variable "Truth" = 1 or 0).

Regression: Predict the energy of electrons (target variable "p_truth_E"), thus requiring "Truth" = 1.

Last updated: 13th of May 2019.