Applied Machine Learning 2024 - Initial ML Project
"Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type - quietly but meaningfully improving core operations."
[Jeff Bezos, CEO of Amazon]
Project description:
The project concerns classification, regression, and unsupervised learning (i.e. clustering)
on structured data sets. Your task is to apply ML algorithms to the training data set below,
trying to get the best performance for each of the three required problems, while also making
sure that you experiment and optimize cleverly. Once you're satisfied with the results, you apply
your models to the test data (for which no labels are given).
You should submit your solutions on Absalon by 22:00 on Monday the 20th of May 2024.
Detailed information can be found below.
The Data:
The data consist of 180000/60000/40000/20000 (train/test classification/test regression/test clustering) simulated
particle collisions and interactions in the ATLAS detector, situated at CERN's LHC accelerator outside Geneva.
When colliding protons, one is interested in collisions that produce e.g. electrons (since there are no electrons
inside protons, any electrons observed must come from newly produced particles decaying to electrons).
Electrons interact in a particular way in the detector, leaving a distinct signature which is different from that of
other particles. Each candidate in the file is described by 140 variables (not all to be used!).
As the data is simulated, the "ground truth" is known, and we thus have perfect labels
(for both particle type and energy), which we can use for the supervised training.
You should use "AppML_InitialProject_train.csv/h5/parquet.gz" to develop and/or train your algorithm.
When training, remember to divide the sample into a part that you train on, and one that you validate on,
such that you don't overtrain (discussed in class). You may or may not want to use k-fold cross validation.
When you feel satisfied with your models, you should apply these to the three test sets, where you don't
know the true values:
Training sample (180000 cases):
AppML_InitialProject_train.csv, also in hdf5 and parquet format (216/96/106 MB).
Testing sample for classification (60000 cases):
AppML_InitialProject_test_classification.csv, also in hdf5 and parquet format (70/31/33 MB).
Testing sample for regression (40000 cases):
AppML_InitialProject_test_regression.csv, also in hdf5 and parquet format (46/21/22 MB).
Testing sample for clustering (20000 cases):
AppML_InitialProject_test_clustering.csv, also in hdf5 and parquet format (24/11/12 MB).
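As a minimal sketch of loading the training sample and setting aside a validation part (as discussed above), assuming pandas and scikit-learn are available: the 20% split fraction and the use of train_test_split are just one choice, and k-fold cross validation is an equally valid alternative.

     import pandas as pd
     from sklearn.model_selection import train_test_split

     # Load the training sample (CSV version; the hdf5/parquet files hold the same events).
     df = pd.read_csv("AppML_InitialProject_train.csv")

     # Hold out 20% of the events for validation, so that performance is judged on
     # data the model never saw during training (guards against overtraining).
     df_train, df_valid = train_test_split(df, test_size=0.20, random_state=42)

     print(len(df_train), "training events,", len(df_valid), "validation events")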
Problem Statement:
One would like to identify the electrons in the best possible way (i.e. separate them from
other particles) and also determine their energy as accurately as possible. Finally, one
would generally like to distinguish different (mainly non-electron) particle signatures as well
as possible, using unsupervised learning. More specifically:
Identify (i.e. classify) electrons vs. non-electrons. This should be based on
at most 20 variables from the Variable List.
The target variable for this task is "p_Truth_isElectron": 0 for non-electrons, 1 for electrons,
but your identification should be continuous in ]0,1[.
We evaluate algorithm performance using Binary Cross Entropy loss (LogLoss); small code sketches
of the evaluation metrics and a simple clustering baseline are given after this list.
Estimate (i.e. make regression for) the energy of electrons. This should be
based on at most 25 variables from the Variable
List. The target variable for this task is "p_truth_Energy": Energy (in GeV) of the
electrons, and you should only train on true (i.e. truth identified, "p_Truth_isElectron==1") electrons.
We evaluate algorithm performance on true electrons and consider the Mean Absolute Error (MAE)
of the relative estimate accuracy: (E_pred - E_true)/E_true.
Cluster particle signatures into 3-25 categories. This should be based on
at most 10 variables from the Variable List.
In this case, there is no target variable, as this is unsupervised learning. Your solution should
simply be the number of the category that you assign the event to, i.e. an
integer in [0, 1, ..., n-1] for n categories (n in [3,25]).
We evaluate algorithm performance by how well your categories match the (mostly non-electron)
particle types (no truth variable is given to you for these - but I have them!).
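To make the two supervised metrics concrete, here is a minimal sketch assuming NumPy and scikit-learn; the label, prediction, and energy arrays below are tiny placeholders, not real data from the samples.

     import numpy as np
     from sklearn.metrics import log_loss

     # Placeholder labels/predictions - in practice these come from your validation set.
     y_valid = np.array([0, 1, 1, 0])              # true labels (p_Truth_isElectron)
     p_pred = np.array([0.10, 0.90, 0.60, 0.30])   # predicted electron probabilities in ]0,1[

     # Classification metric: Binary Cross Entropy loss (LogLoss).
     bce = log_loss(y_valid, p_pred)

     # Placeholder true and predicted electron energies in GeV (true electrons only).
     E_true = np.array([45.0, 120.0, 60.0])
     E_pred = np.array([44.0, 118.0, 65.0])

     # Regression metric: MAE of the relative accuracy (E_pred - E_true) / E_true.
     rel_mae = np.mean(np.abs((E_pred - E_true) / E_true))

     print(f"LogLoss = {bce:.3f}, relative MAE = {rel_mae:.3f}")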
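For the clustering problem, one possible (and by no means recommended as optimal) starting point is k-means on a few standardised variables. The two variable names below are only placeholders taken from the examples further down, and the choice of 5 clusters is an arbitrary value within the allowed range.

     import pandas as pd
     from sklearn.preprocessing import StandardScaler
     from sklearn.cluster import KMeans

     variables = ["p_eta", "p_pt_track"]   # placeholder choice - at most 10 variables allowed

     df = pd.read_csv("AppML_InitialProject_test_clustering.csv")
     X = StandardScaler().fit_transform(df[variables])

     # Assign each event to one of 5 categories (any number in [3, 25] is allowed).
     categories = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)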
Using Machine Learning algorithm(s), give 1-3 solution(s) to each of the above problems,
along with
one single common file giving a (very) short description of each solution
(e.g. name of algorithm, central hyperparameters, and training time).
As stated, you should submit 1-3 solution(s) for each of the three problems (thus 3-9 in total).
You are required to submit at least one tree-based and one neural-network-based solution.
It is typically not hard to rewrite the code to solve the other problems as well, and you are
welcome to reuse the same algorithm for several solutions. However, we encourage (and reward,
see below) using a variety of algorithms. Remember, the real goal is for you to learn various
techniques.
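As one hedged example of a tree-based starting point for the classification problem, a minimal sketch using scikit-learn's histogram-based gradient boosting could look as follows; the two variables are placeholders (you should select up to 20 yourself), and the hyperparameters are only a starting point to be tuned.

     import pandas as pd
     from sklearn.ensemble import HistGradientBoostingClassifier
     from sklearn.model_selection import train_test_split
     from sklearn.metrics import log_loss

     variables = ["p_eta", "p_pt_track"]   # placeholder choice - use at most 20 variables
     df = pd.read_csv("AppML_InitialProject_train.csv")

     X_train, X_valid, y_train, y_valid = train_test_split(
         df[variables], df["p_Truth_isElectron"], test_size=0.20, random_state=42)

     # Tree-based (boosted decision tree) classifier; hyperparameters are only a starting point.
     clf = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1)
     clf.fit(X_train, y_train)

     # Continuous electron probability in ]0,1[ and the LogLoss on the validation part.
     p_valid = clf.predict_proba(X_valid)[:, 1]
     print("Validation LogLoss:", log_loss(y_valid, p_valid))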
You are welcome to use different variables for each solution. Part of the problem is determining
which of the features are important, and selecting these.
Most ML algorithms have an internal feature ranking. Alternatively, you could consider permutation
importance and/or SHAP values.
Note that the limit on variables is "absolute" in the sense that you are not allowed to do e.g. a PCA
on all variables, reducing them to the desired number of input variables, as this approach still
requires the full variable list to work.
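A minimal sketch of how permutation importance could be used to rank candidate variables, assuming scikit-learn; the candidate list and the simple classifier below are placeholders for your own choices.

     import pandas as pd
     from sklearn.ensemble import HistGradientBoostingClassifier
     from sklearn.inspection import permutation_importance
     from sklearn.model_selection import train_test_split

     candidates = ["p_eta", "p_pt_track"]   # placeholder candidate variables
     df = pd.read_csv("AppML_InitialProject_train.csv")
     X_train, X_valid, y_train, y_valid = train_test_split(
         df[candidates], df["p_Truth_isElectron"], test_size=0.20, random_state=42)

     clf = HistGradientBoostingClassifier().fit(X_train, y_train)

     # Shuffle each variable in turn on the validation set and measure how much the
     # score degrades - a large drop means the variable carries real information.
     result = permutation_importance(clf, X_valid, y_valid, scoring="neg_log_loss",
                                     n_repeats=5, random_state=42)
     for name, imp in sorted(zip(candidates, result.importances_mean), key=lambda t: -t[1]):
         print(f"{name}: {imp:.4f}")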
Solutions and submission:
You (i.e. each student) should hand in (each of) your own solution(s) as TWO separate files:
A list of index/event numbers (0, 1, 2, 3, etc.) followed by your estimate for each event
(shown here is a classification solution; for a regression solution, the last number should
be the energy estimate in GeV, and for the clustering problem it should be the category
number, which will automatically reveal how many categories you chose to divide into):
     0, 0.998232
     1, 0.410455
     2, 0.037859
     3, ...
A list of the variables you've used for each problem, i.e.:
     p_eta,
     p_pt_track,
     ...
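A minimal sketch of writing the two files in this format: the predictions and the variable list below are placeholders, the file name follows the naming scheme in the next section, and you should in any case run the result through the Solution Checker mentioned below to verify the exact format.

     import numpy as np
     import pandas as pd

     # Placeholder predictions for the classification test set (one value per event).
     predictions = np.array([0.998232, 0.410455, 0.037859])
     variables = ["p_eta", "p_pt_track"]   # the variables actually used for this solution

     # File 1: index/event number followed by the estimate for each event.
     solution = pd.DataFrame({"index": np.arange(len(predictions)), "estimate": predictions})
     solution.to_csv("Classification_FirstnameLastname_Algo1.csv", index=False, header=False)

     # File 2: the list of variables used for this problem, one per line.
     with open("Classification_FirstnameLastname_Algo1_VariableList.csv", "w") as f:
         f.write("\n".join(variables) + "\n")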
Submission naming:
For each solution, you should name your two solution files as follows:
TypeOfProblemSolved_FirstnameLastname_SolutionName(_VariableList).csv.
Four solution examples for myself (thus 9 files total, including the overall description) should e.g. be named:
Solution 1: Classification_TroelsPetersen_SKLearnAlgo1.csv, Classification_TroelsPetersen_SKLearnAlgo1_VariableList.csv
Solution 2: Regression_TroelsPetersen_XGBoost1.csv, Regression_TroelsPetersen_XGBoost1_VariableList.csv
Solution 3: Regression_TroelsPetersen_PyTorchNN1.csv, Regression_TroelsPetersen_PyTorchNN1_VariableList.csv
Solution 4: Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell.csv, Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell_VariableList.csv
Overall description (.txt format, not to be submitted to the Solution Checker): an example could be
Description_TroelsPetersen.txt
Submission format:
Your solution files should be in Comma-Separated Values (CSV) format, i.e. human-readable text files.
In order to test if your format is correct, we have produced a file submission reader/checker:
SolutionChecker.ipynb
It is mandatory to run your solutions through the Solution Checker (link above),
and surely also a wise thing to do to avoid mistakes.
Evaluation:
We will evaluate this project mostly on the fact that you handed in (or not!). Thus, in working out your solution,
you should primarily focus on getting solutions that work reasonably well! Following this, you can pay attention
to the points on which we evaluate your solutions:
Method(s) applied: Which methods for training, hyperparameter optimisation, etc. did you use?
[Thus, cross validation and exploration of hyperparameters are good]
Performance: How well do your algorithms perform?
[Thus, getting the best performance is good]
Variable choice: How good was your choice of variables?
[Thus, getting the best variables is good]
Solution variation: How many (fundamentally) different solutions did you try?
[Thus, nine different and well-working solutions are good]
As we don't know the distribution of your performance, number of algorithms, variable choice, etc.,
we can't give you the exact evaluation calculation until we have your solutions.
Enjoy, have fun, and throw yourself boldly at the data....
Last updated: 12th of April 2024 by Troels Petersen