Applied Machine Learning 2021 - Small ML Project

"Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type - quietly but meaningfully improving core operations." [Jeff Bezos, CEO of Amazon]
Project description:
The project concerns classification, regression, and unsupervised learning (i.e. clustering) on structured data sets. Your task is to apply ML algorithms to the below (training) data set, trying to get the best performance for each of the three required problems, while also making sure that you experiment and optimize cleverly. Once you're satisfied with the result, you apply it to the test data (for which no labels are given). You should submit your solutions on Absalon by 22:00 on Monday the 24th of May 2021. Detailed information can be found below.


The Data:
The data consist of XXX (simulated) particle collisions and interactions in the ATLAS detector, situated at CERN's LHC accelerator outside Geneva. When colliding protons, one is interested in collisions which produce e.g. electrons (since new particles decaying to electrons must then have been produced). Electrons interact in a particular way in the detector, leaving a distinct signature different from that of other particles. Each candidate in the file is described by 166 variables.
As the data is simulated, the "ground truth" is known, and we thus have perfect labels (for both particle type and energy), which we can use for the supervised training.
The data sample is split 50/50 into two subsamples: You should use "train.h5" to develop your algorithm, and when you feel satisfied, you should apply it to "test.h5", where you don't know the true values. When training (on "train.h5"), remember to divide the sample into a part that you train on and one that you validate on, so that you don't overtrain (discussed in class); a small reading/splitting sketch is shown below the file list. You may or may not want to use k-fold cross validation (recommended!).
  • Training sample: train.h5 (86 MB)
  • Testing sample: test.h5 (84 MB)
    To ensure that you can read the file properly, we have prepared small scripts for reading the data:
  • ReadingData.ipynb.
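    As a starting point, here is a minimal sketch of loading the training data and holding out a validation set. It assumes train.h5 can be read directly into a pandas DataFrame and that the label columns are "Truth" and "p_truth_E"; ReadingData.ipynb remains the authoritative way to read the files, so adapt the loading step accordingly.

        import pandas as pd
        from sklearn.model_selection import train_test_split

        # Load the labelled training sample (you may need a key=... argument,
        # depending on how the HDF5 file is stored - see ReadingData.ipynb).
        df = pd.read_hdf("train.h5")

        # Keep only your chosen input variables; "Truth" and "p_truth_E" are targets.
        X = df.drop(columns=["Truth", "p_truth_E"])
        y = df["Truth"]  # 0 = non-electron, 1 = electron

        # Hold out 20% for validation, to monitor overtraining.
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )

    For k-fold cross validation, sklearn.model_selection.KFold (or cross_val_score) can replace this single split.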


    Problem Statement:
    One would like to identify the electrons in the best possible way (i.e. separate them from other particles) and also determine their energy as accurately as possible. Finally, one would generally like to distinguish different particle signatures as well as possible, in an unsupervised way. Using Machine Learning algorithm(s), give 1-3 solution(s) to each of the following problems, along with one single common file giving a (very) short description of each solution (e.g. name of algorithm, central hyperparameters, and training time):
  • Identify (i.e. classify) electrons vs. non-electrons. This should be based on a maximum of 25 variables from the Electron Variable List. The target variable for this task is "Truth": 0 for non-electrons, 1 for electrons, but your identification should be continuous in [0,1]. We evaluate algorithm performance using Binary Cross Entropy loss (LogLoss).
  • Estimate (i.e. make a regression for) the energy of electrons. This should be based on a maximum of 15 variables from the Electron Variable List. The target variable for this task is "p_truth_E": the energy (in GeV) of the electrons, and you should only train on real (i.e. truth-identified, "Truth==1") electrons. Note: It is an advantage to ONLY train the regression on true electrons (Truth = 1), but when submitting the solution, the regression estimate should be applied to ALL candidates, as you don't (perfectly) know which are electrons and which are not. We evaluate algorithm performance by considering the Mean Absolute Error (MAE) on the relative estimate accuracy: (E_pred-E_true)/E_true.
  • Cluster particle signatures into 3+ categories. This should also be based on a maximum of 10 variables from the Electron Variable List. In this case, there is no target variable, as this is unsupervised learning. Your solution should simply be the number of the category that you assign the event to, i.e. an integer [0, 1, ..., n] for n+1 categories. We evaluate algorithm performance by the ability of your categories to match the electron/non-electron classification (the three evaluation quantities are sketched in code after this list).
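    To make the three criteria concrete, here is a small sketch, as helper functions, of how each could be computed on a labelled validation split. The function and argument names are illustrative (numpy arrays of truth labels, scores, energies, and cluster numbers), and the clustering check is just one possible proxy for the stated criterion.

        import numpy as np
        from sklearn.metrics import log_loss

        def classification_score(truth, scores):
            # LogLoss on continuous electron scores in [0, 1]
            return log_loss(truth, scores)

        def regression_score(truth, e_true, e_pred):
            # MAE of the relative accuracy (E_pred - E_true) / E_true,
            # evaluated on true electrons only
            mask = (truth == 1)
            rel_err = (e_pred[mask] - e_true[mask]) / e_true[mask]
            return np.mean(np.abs(rel_err))

        def clustering_check(truth, clusters):
            # One possible proxy: the electron fraction within each category
            for c in np.unique(clusters):
                frac_e = truth[clusters == c].mean()
                print(f"cluster {c}: electron fraction = {frac_e:.3f}")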

    As stated, you should submit 1-3 solution(s) for each of the three problems (thus 3-9 in total). You are required to submit at least one tree-based and one neural net (NN) based solution. It is typically not hard to rewrite the code to solve the other problems as well, and you are welcome to reuse the same algorithm for several solutions. However, we encourage (and reward, see below) using a variety of algorithms. You are welcome to use different variables for each solution. Part of the problem is determining which of your features are important, and selecting these. Most ML algorithms have an internal feature ranking, typically based on permutation importance. An alternative could be to consider SHAP values.
    Note that the limit on variables is "absolute" in the sense that you are not allowed to do a PCA on all variables, reducing them to the desired number of input variables, as this approach still requires the full variable list to work.
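    One possible way to rank candidate variables is scikit-learn's permutation importance on a fitted model (SHAP, via the shap package, would be an alternative). A minimal sketch, assuming X_train/X_val are DataFrames and y_train/y_val the "Truth" labels from the split shown earlier:

        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.inspection import permutation_importance

        # Fit any reasonable baseline model on your training split.
        model = GradientBoostingClassifier().fit(X_train, y_train)

        # Permute each input variable on the validation set and measure the drop in score.
        result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=42)

        # Rank variables by mean importance and inspect the top candidates.
        ranking = sorted(zip(X_val.columns, result.importances_mean),
                         key=lambda t: t[1], reverse=True)
        for name, score in ranking[:25]:
            print(f"{name:30s} {score:.4f}")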


    Solutions and submission:
    You (i.e. each student) should hand in (each of) your own solution(s) as TWO separate files:
  • A list of index/event numbers (0, 1, 2, etc.), each followed by your estimate for that event (shown here for a classification solution; for a regression solution, the last number should be the energy estimate in GeV, and for the clustering problem, it should be the category number, which will automatically reveal how many categories you chose to divide into):
         0, 0.998232
         1, 0.410455
         2, 0.037859
         3, ...
  • A list of the variables you've used for each problem (a small sketch of writing both files follows below), i.e.:
         p_eta
         p_pt_track
         ...
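    A minimal sketch of writing the two files in this format; the file names, index convention (starting at 0, as in the example above), and values are purely illustrative:

        import numpy as np

        # Your per-event estimates for test.h5, in order (dummy values here).
        predictions = np.array([0.998232, 0.410455, 0.037859])
        variables = ["p_eta", "p_pt_track"]  # the variables actually used

        # Solution file: "index, estimate" on each line.
        with open("Classification_FirstnameLastname_MyAlgo.txt", "w") as f:
            for i, pred in enumerate(predictions):
                f.write(f"{i}, {pred:.6f}\n")

        # Variable list file: one variable name per line.
        with open("Classification_FirstnameLastname_MyAlgo_VariableList.txt", "w") as f:
            f.write("\n".join(variables) + "\n")

    SolutionReader.ipynb (see "Submission format" below) can be used to check that the result parses correctly.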

    Submission naming:
    For each solution, you should name your two solution files as follows: TypeOfProblemSolved_FirstnameLastname_SolutionName(_VariableList).txt.
    Four solution examples for myself (thus 9 files in total, including the overall description) could e.g. be named:
    Solution 1: Classification_TroelsPetersen_SKLearnAlgo1.txt, Classification_TroelsPetersen_SKLearnAlgo1_VariableList.txt
    Solution 2: Regression_TroelsPetersen_XGBoost1.txt, Regression_TroelsPetersen_XGBoost1_VariableList.txt
    Solution 3: Regression_TroelsPetersen_PyTorchNN1.txt, Regression_TroelsPetersen_PyTorchNN1_VariableList.txt
    Solution 4: Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell.txt, Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell_VariableList.txt
    Overall description: Example could be Description_TroelsPetersen.txt

    Submission format:
    Your solution files should be in Comma-Separated Values (CSV) format, i.e. human-readable text files. In order to test whether your format is correct, we have produced a submission file reader:
  • SolutionReader.ipynb

    Evaluation:
    We will evaluate this project mostly on the fact that you handed in (or not!). Following this, we will pay attention to your method(s), solution variation, variable choice, and performance. This individual project counts for 40% of your final points/grade.
    Enjoy, have fun, and throw yourself boldly at the data, the main target of which - in summary - is to do:
  • Classification: Predict if the particle is an electron or not (target variable "Truth" = 1 or 0).
  • Regression: Predict the energy of electrons (target variable "p_truth_E", requiring "Truth = 1").
  • Clustering: Divide the sample into 3+ clusters with unsupervised learning.
    based on the variable lists and limitations provided and including at least one TREE and one NN based solution.

    In working out your solution, you should primarily focus on getting solutions that work well! Following this, you can pay attention to the points on which we evaluate your solutions:
  • Method(s) applied: Which methods for training, hyperparameter optimisation, etc. did you use? [Thus, cross validation and exploration of hyperparameters are good; a small sketch is shown at the end of this section]
  • Performance: How well do your algorithms perform? [Thus, getting the best performance is good]
  • Variable choice: How good was your choice of variables? [Thus, getting the best variables is good]
  • Solution variation: How many (fundamentally) different solutions did you try? [Thus, nine different and well-working solutions is good]
    As we don't know the distribution of your performance, number of algorithms, variable choice, etc., we can't give you the exact evaluation calculation until we have your solutions.
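    As an illustration of the first evaluation point above, here is a minimal sketch of hyperparameter exploration with k-fold cross validation; the model and grid are just examples, chosen to match the LogLoss criterion of the classification task, and X_train/y_train are assumed to come from the split shown earlier:

        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import GridSearchCV

        # A small, illustrative grid of hyperparameters to explore.
        param_grid = {
            "n_estimators": [100, 300],
            "max_depth": [2, 3, 4],
            "learning_rate": [0.05, 0.1],
        }

        search = GridSearchCV(
            GradientBoostingClassifier(),
            param_grid,
            scoring="neg_log_loss",  # matches the LogLoss evaluation of the classification task
            cv=5,                    # 5-fold cross validation
        )
        search.fit(X_train, y_train)
        print(search.best_params_, search.best_score_)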


    Last updated: 19th of May 2021 by Troels Petersen