Applied Machine Learning 2025 - Initial ML Project
"Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type - quietly but meaningfully improving core operations."
[Jeff Bezos, CEO of Amazon]
Project description:
The project concerns classification, regression, and unsupervised learning (i.e. clustering)
on structured data sets. Your task is to apply ML algorithms to the below (training) data set,
trying to get the best performance out of each of the three required problems, but also making
sure that you experiment and optimize cleverly. Once you're satisfied with the result, you apply
it to the test data (for which there are no labels given).
You should submit your solutions on Absalon by 22:00 on Sunday the 18th of May 2025.
Detailed information can be found below.
The Data:
The data to be used for the project consist of two parts:
Dataset 1 - supervised (for classification and regression):
180000/60000/40000 (training / testing classification / testing regression) simulated particle
collisions and interactions in the ATLAS detector, situated at CERN's LHC accelerator outside Geneva.
When colliding protons, one is interested in collisions that produce e.g. electrons (since there are no
electrons inside protons, any observed electrons must come from newly produced particles decaying to electrons).
Electrons interact in a particular way in the detector, leaving a distinct signature which differs from that
of other particles. Each candidate in the file is described by 140 variables (not all to be used!).
As the data is simulated, the "ground truth" is known, and we thus have perfect labels
(for both particle type and energy), which you should use for the supervised training.
You should use the file (provided in three formats) "AppML_InitialProject_train.csv/h5/parquet.gz" to develop/train
your algorithms. When you feel satisfied with your models, you should apply these to the two test sets
(one for classification and one for regression), where you don't know the true values (but we do!):
Training sample (180000 cases):
AppML_InitialProject_train.csv, also in
hdf5, and
parquet format (216/96/106 MB).
Testing sample for classification (60000 cases):
AppML_InitialProject_test_classification.csv, also in
hdf5, and
parquet format (70/31/33 MB).
Testing sample for regression (40000 cases):
AppML_InitialProject_test_regression.csv, also in
hdf5, and
parquet format (46/21/22 MB).
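For illustration, a minimal sketch of reading the training file into a pandas DataFrame in any of the three
formats (the exact file extensions and any HDF5 key should be checked against the files you actually download):

    import pandas as pd

    # Any of the three formats should give the same DataFrame.
    df = pd.read_csv("AppML_InitialProject_train.csv")
    # df = pd.read_hdf("AppML_InitialProject_train.h5")              # needs pytables; pass key=... if required
    # df = pd.read_parquet("AppML_InitialProject_train.parquet.gz")  # needs pyarrow or fastparquet

    print(df.shape)                # roughly 180000 rows and ~140 variables plus targets
    print(list(df.columns[:10]))   # inspect the first few variable names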
Dataset 2 - unsupervised (for clustering):
5950 (training / testing clustering) stars as observed by the Sloan Digital Sky Survey and the Gaia satellite.
The data contains 20 values representing 3 photometric quantities (how the stars shine), 14 abundances of different
elements (what the stars consist of), and 3 kinematic quantities (how the stars move).
For the clustering (which is unsupervised) you only have one dataset on which you should try to optimize the performance.
In the process you might check for outliers, subdivide the data, and test different variable combinations.
Note that unsupervised learning is hard, and for that reason we don't put much focus on the resulting performance.
Training/Testing sample (5950 cases):
SDSS-Gaia_5950stars.csv (3.3 MB).
Problem Statement:
For the supervised part, one should identify the electrons in the best possible way (i.e. separate them from
other particles) and also determine their energy as accurately as possible. For the unsupervised part, one
would generally like to distinguish the different abundance signatures as well as possible, without using labels.
More specifically the three tasks are:
Identify (i.e. classify) electrons vs. non-electrons in dataset 1. This should be based on
maximum 25 variables from the Variable List.
The target variable for this task is "p_Truth_isElectron": 0 for non-electrons, 1 for electrons,
but your identification should be continuous in the range ]0,1[.
We evaluate algorithm performance using Binary Cross Entropy loss (LogLoss); see the sketch after this task list.
Estimate (i.e. make regression for) the energy of electrons in dataset 1. This should be
based on maximum 12 variables from the Variable
List. The target variable for this task is "p_truth_Energy": Energy (in GeV) of the
electrons, and you should only train on true (i.e. "p_Truth_isElectron==1") electrons.
We evaluate algorithm performance on true electrons only and consider Mean Absolute Error (MAE)
of the relative estimate accuracy: (E_pred-E_true)/E_true (see the sketch after this task list).
Cluster particle signatures into 5-50 categories in dataset 2. This should be based on
maximum 7 variables from the Variable List 2 (TBP).
In this case, there is no target variable, as this is unsupervised learning. Your solution should
simply be the number of the category that you assign the event to, i.e. an
integer in [0, 1, ..., n] for n+1 categories (n+1 in [5,50]).
We evaluate algorithm performance by the ability of your categories to fit into one of the categories known (to us!).
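For the classification task, one possible (purely illustrative) starting point is a simple tree-based classifier,
trained on part of the labelled data and evaluated with LogLoss on the rest; the two variables below are only
placeholders for your own choice of up to 25:

    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import log_loss

    df = pd.read_csv("AppML_InitialProject_train.csv")

    feature_cols = ["p_eta", "p_pt_track"]          # placeholder - use your own (max 25) variables
    X, y = df[feature_cols], df["p_Truth_isElectron"]

    # Hold out part of the labelled data to estimate the LogLoss yourself.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    clf = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.1)
    clf.fit(X_tr, y_tr)

    p_val = clf.predict_proba(X_val)[:, 1]          # continuous score in ]0,1[
    print("Validation LogLoss:", log_loss(y_val, p_val))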
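For the regression task, a corresponding sketch (again only an illustration), training on true electrons only
and computing the MAE of the relative accuracy on a held-out set; the variables are again placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("AppML_InitialProject_train.csv")
    electrons = df[df["p_Truth_isElectron"] == 1]   # train (and evaluate) on true electrons only

    feature_cols = ["p_eta", "p_pt_track"]          # placeholder - use your own (max 12) variables
    X, y = electrons[feature_cols], electrons["p_truth_Energy"]

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    reg = HistGradientBoostingRegressor(max_iter=300)
    reg.fit(X_tr, y_tr)

    E_pred = reg.predict(X_val)
    rel_mae = np.mean(np.abs((E_pred - y_val) / y_val))   # MAE of (E_pred - E_true)/E_true
    print("Relative MAE:", rel_mae)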
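For the clustering task, a minimal sketch using k-means (just one of many possible choices). Since Variable
List 2 is still TBP, the column selection here is a pure placeholder:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import silhouette_score

    stars = pd.read_csv("SDSS-Gaia_5950stars.csv")

    cluster_cols = stars.select_dtypes("number").columns[:7]   # placeholder - use your own (max 7) variables
    X = StandardScaler().fit_transform(stars[cluster_cols])    # scaling usually matters for k-means

    # Try a few cluster counts in the allowed range and compare e.g. silhouette scores.
    for k in (5, 10, 20):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))

    # The submitted value for each star is simply its (integer) cluster label.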
Using Machine Learning algorithm(s), give 1-3 solution(s) to each of the above three problems,
along with
one single common file giving a (very) short description of each solution
(e.g. name of algorithm, number of parameters, central hyperparameters, training time, etc.).
As stated, you should submit 1-3 solution(s) for each of the three problems (thus 3-9 in total).
You are required to submit at least one tree-based and one neural-network-based solution.
It is typically not hard to rewrite the code to solve other problems as well, and you are
welcome to reuse the same algorithm for several solution types (e.g. classification and regression).
However, we encourage (and reward, see below) using a variety of algorithms.
Remember, the real
"target" is for you to learn various techniques and algorithms.
You are welcome to use different variables for each solution. Part of the problem is determining
which of the features are important, and selecting these.
Most ML algorithms have an internal feature ranking. Alternatively, you could consider permutation
importance and/or
SHAP values. For unsupervised
learning, it is harder to determine which variables are best, but those giving the lowest
clustering loss are typically the better ones.
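As a sketch of permutation importance (assuming a fitted classifier clf and validation data X_val, y_val as in
the classification sketch above):

    from sklearn.inspection import permutation_importance

    result = permutation_importance(clf, X_val, y_val, scoring="neg_log_loss",
                                    n_repeats=5, random_state=42)

    # Rank the candidate variables by how much shuffling each of them hurts the LogLoss.
    for name, imp in sorted(zip(X_val.columns, result.importances_mean),
                            key=lambda t: -t[1]):
        print(f"{name:30s} {imp:.4f}")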
Note that the limit on variables is "absolute" in the sense that you are not allowed to do e.g. a PCA
on
all variables, reducing them to the desired number of input variables, as this approach still
requires the full variable list to work.
Solutions and submission:
You (i.e. each student) should hand in (each of) your own solution(s) as TWO separate files:
A list of index/event numbers (0, 1, 2, 3, etc.) followed by your estimate for each event, i.e.
(here a classification solution is shown; for a regression solution, the last number should
be the energy estimate in GeV, and for the clustering problem it should be the category
number, which will automatically reveal how many categories you chose to divide into):
     0, 0.998232
     1, 0.410455
     2, 0.037859
     3, ...
A list of the variables you've used for each problem, i.e.:
     p_eta,
     p_pt_track,
     ...
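A minimal sketch of writing the two files from Python (assuming a fitted classifier clf and the chosen
feature_cols from above; the exact formatting should always be verified with the Solution Checker below):

    import pandas as pd

    # Index/estimate pairs for the test set (here: classification scores).
    test = pd.read_csv("AppML_InitialProject_test_classification.csv")
    pred = clf.predict_proba(test[feature_cols])[:, 1]
    solution = pd.DataFrame({"index": range(len(pred)), "estimate": pred})
    solution.to_csv("Classification_FirstnameLastname_Algo1.csv", header=False, index=False)

    # One variable name per line for the matching variable list.
    with open("Classification_FirstnameLastname_Algo1_VariableList.csv", "w") as f:
        f.write("\n".join(feature_cols) + "\n")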
Submission naming:
For each solution, you should name your two solutions files as follows:
TypeOfProblemSolved_FirstnameLastname_SolutionName(_VariableList).csv.
Four solution examples for myself (thus 9 files total, including the overall description) should e.g. be named:
Solution 1: Classification_TroelsPetersen_SKLearnAlgo1.csv, Classification_TroelsPetersen_SKLearnAlgo1_VariableList.csv
Solution 2: Regression_TroelsPetersen_XGBoost1.csv, Regression_TroelsPetersen_XGBoost1_VariableList.csv
Solution 3: Regression_TroelsPetersen_PyTorchNN1.csv, Regression_TroelsPetersen_PyTorchNN1_VariableList.csv
Solution 4: Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell.csv, Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell_VariableList.csv
Overall description (.txt format, not to be submitted to the Solution Checker): an example could be
Description_TroelsPetersen.txt
Your description should, as best possible, enable others to reproduce your model (rather than justify your choices).
Submission format:
Your solution file should be in Comma-Separated Values (CSV) format, thus human readable text files.
In order to test whether your format is correct, we have produced a
solution checker for your file submission format.
It is mandatory to run your solutions through the Solution Checker
(SolutionChecker.ipynb),
and surely also a wise thing to do to avoid mistakes.
Evaluation:
We will evaluate this project mostly on the fact that you handed in (or not!). Thus, in working out your solution,
you should primarily focus on getting solutions that work reasonably well! Following this, you can pay attention
to the points on which we evaluate your solutions:
Method(s) applied: Which methods for training, hyperparameter optimisation, etc. did you use?
[Thus, cross-validation and exploration of hyperparameters is good; see the sketch after this list]
Performance: How well do your algorithms perform?
[Thus, getting the best performance is good]
Variable choice: How good was your choice of variables?
[Thus, getting the best variables is good]
Solution variation: How many (fundamentally) different solutions did you try?
[Thus, nine different and well working solutions is good]
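A sketch of one way to explore hyperparameters with cross-validation (a small, illustrative grid only, reusing
X_tr, y_tr from the classification sketch above):

    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {"learning_rate": [0.05, 0.1, 0.2], "max_iter": [200, 400]}
    search = GridSearchCV(HistGradientBoostingClassifier(), param_grid,
                          scoring="neg_log_loss", cv=5)
    search.fit(X_tr, y_tr)

    print(search.best_params_, search.best_score_)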
As we don't know the distribution of your performance, number of algorithms, variable choice, etc.,
we can't give you the exact evaluation calculation until we have your solutions.
Enjoy, have fun, and throw yourself boldly at the data....
Last updated: 13th of May 2025 by Troels Petersen