Applied Machine Learning 2026 - Initial ML Project
"Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type - quietly but meaningfully improving core operations."
[Jeff Bezos, CEO of Amazon]
Project description:
The project concerns classification, regression, and unsupervised learning (i.e. clustering)
on structured data sets.
Your task is to apply ML algorithms to the training data sets below,
trying to get the best performance out of each of the three required problems, while also making
sure that you experiment and optimize cleverly. Once you are satisfied with the result, you should
apply your models to the test data (for which no labels are given).
You should submit your solutions on Absalon by 22:00 on Sunday the 17th of May 2026.
Detailed information can be found below.
The Data:
The data to be used for the project consist of two parts:
Dataset 1 - Supervised (for classification and regression):
180000/60000/40000 (training / classification testing / regression testing) simulated particle
collisions and interactions in the ATLAS detector, situated at CERN's LHC accelerator outside Geneva.
When colliding protons, one is interested in collisions that produce e.g. electrons (since there are
no electrons inside protons, any electrons observed must come from newly produced particles decaying to electrons).
Electrons interact in a particular way in the detector, leaving a distinct signature which is different
than for other particles. Each candidate in the file is described by 140 variables (not all to be used!).
As the data is simulated, the "ground truth" is known, and we thus have perfect labels
(for both particle type and energy), which you should use for the supervised training.
You should use the training sample file (provided in three formats) "AppML_InitialProject_train.csv/h5/parquet.gz"
to develop/train your algorithms. When you feel satisfied with your models, you should apply these to the two test
samples (one for classification and one for regression), where you don't know the true values (but we do!):
Training sample (180000 cases):
AppML_InitialProject_train.csv, also in
hdf5, and
parquet format (216/96/106 MB).
Testing sample for classification (60000 cases):
AppML_InitialProject_test_classification.csv, also in
hdf5, and
parquet format (70/31/33 MB).
Testing sample for regression (40000 cases):
AppML_InitialProject_test_regression.csv, also in
hdf5, and
parquet format (46/21/22 MB).
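All three file formats load directly with pandas. A minimal sketch of the round trip, using a tiny stand-in DataFrame rather than the real files (the real training file has 180000 rows and 140 columns):

```python
import pandas as pd

# Hypothetical stand-in for AppML_InitialProject_train.csv
df = pd.DataFrame({"p_eta": [0.5, -1.2], "p_Truth_isElectron": [1, 0]})
df.to_csv("tiny_train.csv", index=False)

# CSV is the most portable; the .h5 and .parquet.gz versions of the same
# data load with pd.read_hdf(...) and pd.read_parquet(...) respectively
# (requiring the pytables and pyarrow packages).
train = pd.read_csv("tiny_train.csv")
print(train.shape)
```

The parquet files are the smallest on disk and typically the fastest to load.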
Dataset 2 - Unsupervised (for clustering):
5950 stars (used for both training and testing the clustering) as observed by the Sloan Digital Sky Survey and the Gaia satellite.
The data contains 20 values representing 3 photometric quantities (how the stars shine), 14 abundances of different
elements (what the stars consist of), and 3 kinematic quantities (how the stars move).
For the clustering (which is unsupervised) you only have one dataset on which you should try to optimize the performance.
In the process you might check for outliers, subdivide the data, and test different variable combinations.
Note that unsupervised learning is hard, and for the same reason we don't put much focus on the resulting performance.
Training/Testing sample (5950 cases):
SDSS-Gaia_5950stars.csv (3.3 MB).
Problem Statement:
For the supervised part, one should identify the electrons in the best possible way (i.e. separate them from
other particles) and also determine their energy as accurately as possible. For the unsupervised part, one
would generally like to distinguish the different abundance signatures as well as possible, without supervision.
More specifically the three tasks are:
Identify (i.e. classify) electrons vs. non-electrons in dataset 1. This should be based on
maximum 15 variables from the Variable List.
The target variable for this task is "p_Truth_isElectron": 0 for non-electrons, 1 for electrons,
and your identification should be continuous in the range ]0,1[.
We evaluate algorithm performance using Binary Cross Entropy loss (LogLoss).
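A minimal sketch of the classification setup and the LogLoss metric, on synthetic stand-in data (the real task uses up to 15 of the 140 variables; the logistic regression here is just a placeholder baseline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 1000 candidates, 5 of the (at most 15) allowed variables.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# Scores must be continuous probabilities in ]0,1[, as required.
p = clf.predict_proba(X_va)[:, 1]
print(round(log_loss(y_va, p), 3))
```

A held-out validation split, as above, is how you estimate the LogLoss you should expect on our test sample.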
Estimate (i.e. make regression for) the energy of electrons in dataset 1. This should be
based on maximum 20 variables from the Variable
List. The target variable for this task is "p_truth_Energy": Energy (in GeV) of the
electrons, and you should only train on true (i.e. "p_Truth_isElectron==1") electrons.
We evaluate algorithm performance on true electrons only and consider Mean Absolute Error (MAE)
of the relative energy estimate accuracy: (E_pred-E_true)/E_true.
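The regression metric above can be sketched as a few lines of NumPy (the function name is our own; the formula is the one stated):

```python
import numpy as np

def relative_mae(E_true, E_pred):
    """MAE of the relative energy error (E_pred - E_true) / E_true."""
    E_true = np.asarray(E_true, dtype=float)
    E_pred = np.asarray(E_pred, dtype=float)
    return np.mean(np.abs((E_pred - E_true) / E_true))

# Toy example: a +10% error and a -20% error average to 0.15
print(relative_mae([100.0, 50.0], [110.0, 40.0]))  # 0.15
```

Note that minimising this relative MAE is not the same as minimising the absolute MAE in GeV: low-energy electrons count just as much as high-energy ones.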
Cluster the stellar abundance signatures into 4-40 categories in dataset 2. This should be based on
maximum 6 variables from Variable List 2.
In this case, there is no target variable, as this is unsupervised learning. Your solution should
simply be the number of the category that you assign the event to, i.e. an
integer in [0, 1, ..., n-1] for n categories (n in [4, 40]).
We evaluate algorithm performance by how well your categories match one of the categories known (to us!).
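One possible clustering baseline (K-means is just an example; any unsupervised method is allowed), sketched on synthetic stand-in data with well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for up to 6 of the 20 star variables:
# four groups of 50 stars each, centred at different values.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 4)) for c in (0, 2, 4, 6)])

# Standardise first: abundances and kinematic quantities have very different scales.
Xs = StandardScaler().fit_transform(X)
n_clusters = 4  # must lie in [4, 40]
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Xs)
print(sorted(set(labels)))
```

The integer labels produced by fit_predict are exactly the category numbers the solution file should contain.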
Using Machine Learning algorithm(s), give 1-3 solution(s) to each of the three problems (thus 3-9 in total),
along with
one single common file giving a (very) short description of each solution
(e.g. name of algorithm, number of parameters, central hyperparameters, training time, etc.).
You are required to submit at least one tree based and one neural network based solution.
It is typically not hard to rewrite the code to solve other problems as well, and you are
welcome to reuse the same algorithm for several solution types (e.g. classification and regression).
However, we encourage (and reward, see below) using a variety of algorithms.
Remember, the real
"target" is for you to learn various techniques and algorithms.
You are welcome (and encouraged) to use different variables for each solution. Part of the problem is
determining which features are important, and selecting these.
Most ML algorithms have an internal feature ranking. Alternatively, you could consider permutation
importance and/or
SHAP values. For unsupervised
learning, it is harder to determine which variables are best, but those giving the lowest
clustering loss are typically the better choice. Note that the limit on variables is "absolute" in the sense
that you are not allowed to e.g. run a PCA on
all variables and reduce them to the desired number
of input variables, as this approach still requires the full variable list to work.
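A minimal sketch of permutation importance for variable selection, on synthetic stand-in data where only one feature is informative (model and data are placeholders, not the project data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the validation set and measure the score drop.
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[0])  # the informative feature should rank first
```

Computing the importance on held-out data, as above, avoids ranking features that the model has merely memorised.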
Solutions and submission:
You (i.e. each student) should hand in your solution(s) as TWO separate files following a specific naming scheme:
A list of index/event numbers (0, 1, 2, 3, etc.) followed by your estimate for each event, i.e.
(shown here for a classification solution; for a regression solution, the last number should
be the energy estimate, in the same units as given in the training file, and for the clustering problem
it should be the category number, which will automatically reveal how many categories you chose to divide into):
     0, 0.998232
     1, 0.410455
     2, 0.037859
     3, ...
A list of the variables you've used for each problem, i.e.:
     p_eta,
     p_pt_track,
     ...
Example solution file formats can be seen in the associated GitHub repository.
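Both files can be written with the standard library; a sketch, using hypothetical predictions and filenames (the GitHub examples and the SubmissionChecker are the authoritative reference for the exact format):

```python
import csv

# Hypothetical predictions and variable choices for one classification solution.
predictions = [0.998232, 0.410455, 0.037859]
variables = ["p_eta", "p_pt_track"]

# Predictions file: "index, value" per line, no header.
with open("Classification_FirstnameLastname_Algo1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i, p in enumerate(predictions):
        writer.writerow([i, p])

# Matching variable-list file: one variable name per line.
with open("Classification_FirstnameLastname_Algo1_VariableList.csv", "w", newline="") as f:
    f.write("\n".join(variables) + "\n")
```

Run the SubmissionChecker on the output before submitting; it will catch wrong lengths, headers, and naming mistakes.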
Submission naming:
For each solution, you should name your two solution files as follows:
TypeOfProblemSolved_FirstnameLastname_SolutionName(_VariableList).csv.
Four solution examples for myself (thus 9 files total, including the overall description) should e.g. be named:
Solution 1: Classification_TroelsPetersen_SKLearnAlgo1.csv, Classification_TroelsPetersen_SKLearnAlgo1_VariableList.csv
Solution 2: Regression_TroelsPetersen_XGBoost1.csv, Regression_TroelsPetersen_XGBoost1_VariableList.csv
Solution 3: Regression_TroelsPetersen_PyTorchNN1.csv, Regression_TroelsPetersen_PyTorchNN1_VariableList.csv
Solution 4: Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell.csv, Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell_VariableList.csv
Overall description (.txt format, not to be submitted to the SubmissionChecker): an example could be
Description_TroelsPetersen.txt (though a bit short).
Your description should justify your choices and try, as best possible, to enable others to reproduce your models.
It is mandatory to run your solutions through the Solution Checker
(SubmissionChecker.py)
and verify that they fully adhere to the requirements/format (surely also a wise thing to do to avoid mistakes).
Submission executive summary:
You are required to submit two files per solution (predictions and variable list) and a single overall TXT description file. Example filename patterns and formats can be found in GitHub, and lengths should be: Classification: 60000 floats, Regression: 40000 floats, and Clustering: 5950 integers (all in the format of "index, value" per line and no header). The input variable limit is 15 / 20 / 6, and ahead of submitting on Absalon it is MANDATORY to run the provided solution checker locally and submit only if it passes (and a good idea to try it out early in the process!).
Evaluation:
We will evaluate this project mostly on the fact that you handed in working solutions (or not!). Thus, in working out your
solution, you should primarily focus on getting solutions that work reasonably well! Beyond this, you can pay attention
to the points on which we evaluate your solutions:
Method(s) applied: Which methods for training, hyperparameter optimisation, etc. did you use?
[Thus, algorithm architecture, cross validation, and hyperparameter exploration are good]
Performance: How well do your algorithms perform?
[Thus, getting the best performance is good]
Variable choice: How good was your choice of variables?
[Thus, getting the best variables is good]
Solution variation: How many (fundamentally) different solutions did you try?
[Thus, nine different and well working solutions is good]
As we don't know the distribution of your performance, number of algorithms, variable choice, etc.,
we can't give you the exact evaluation calculation until we have your solutions.
Enjoy, have fun, and throw yourself boldly at the data....
Last updated: 13th of April 2026 by Troels Petersen