Applied Machine Learning 2024 - Initial ML Project

"Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type - quietly but meaningfully improving core operations." [Jeff Bezos, CEO of Amazon]

Project description:
The project concerns classification, regression, and unsupervised learning (i.e. clustering) on structured data sets. Your task is to apply ML algorithms to the (training) data set below, trying to get the best performance for each of the three required problems, while also making sure that you experiment and optimize cleverly. Once you are satisfied with the results, you apply your models to the test data (for which no labels are given). You should submit your solutions on Absalon by 22:00 on Monday the 20th of May 2024. Detailed information can be found below.


The Data:
The data consist of 180000/60000/40000/20000 (train/test classification/test regression/test clustering) simulated particle collisions and interactions in the ATLAS detector, situated at CERN's LHC accelerator outside Geneva. When colliding protons, one is interested in collisions that produce e.g. electrons (since there are no electrons inside protons, new particles decaying to electrons must have been produced). Electrons interact in a particular way in the detector, leaving a distinct signature that differs from that of other particles. Each candidate in the file is described by 140 variables (not all to be used!).
As the data is simulated, the "ground truth" is known, and we thus have perfect labels (for both particle type and energy), which we can use for the supervised training.
You should use "AppML_InitialProject_train.csv/h5/parquet.gz" to develop and/or train your algorithm. When training, remember to divide the sample into a part that you train on, and one that you validate on, such that you don't overtrain (discussed in class). You may or may not want to use k-fold cross validation. When you feel satisfied with your models, you should apply these to the three test sets, where you don't know the true values:
  • Training sample (180000 cases): AppML_InitialProject_train.csv, also in hdf5, and parquet format (216/96/106 MB).
  • Testing sample for classification (60000 cases): AppML_InitialProject_test_classification.csv, also in hdf5, and parquet format (70/31/33 MB).
  • Testing sample for regression (40000 cases): AppML_InitialProject_test_regression.csv, also in hdf5, and parquet format (46/21/22 MB).
  • Testing sample for clustering (20000 cases): AppML_InitialProject_test_clustering.csv, also in hdf5, and parquet format (24/11/12 MB).
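As mentioned above, a minimal sketch (in Python, assuming pandas and scikit-learn are available; the exact file name and the chosen variables are only examples) of loading the training sample and splitting off a validation set could look as follows:

        import pandas as pd
        from sklearn.model_selection import train_test_split

        # Load the training sample (if the parquet file cannot be read directly
        # by your pandas/pyarrow setup, use the csv or hdf5 version instead):
        df = pd.read_parquet("AppML_InitialProject_train.parquet.gz")

        # Hypothetical choice of input variables - pick your own from the Variable List:
        features = ["p_eta", "p_pt_track"]
        X = df[features]
        y = df["p_Truth_isElectron"]

        # Hold out 20% for validation, so that overtraining can be spotted:
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y)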


    Problem Statement:
    One would like to identify the electrons in the best possible way (i.e. separate them from other particles) and also determine their energy as accurately as possible. Finally, one would generally like to distinguish the different (mainly non-electron) particle signatures as well as possible, in an unsupervised manner. More specifically:
  • Identify (i.e. classify) electrons vs. non-electrons. This should be based on at most 20 variables from the Variable List. The target variable for this task is "p_Truth_isElectron": 0 for non-electrons, 1 for electrons, but your identification should be a continuous score in ]0,1[. We evaluate algorithm performance using the Binary Cross Entropy loss (LogLoss).
  • Estimate (i.e. make a regression for) the energy of electrons. This should be based on at most 25 variables from the Variable List. The target variable for this task is "p_truth_Energy": the energy (in GeV) of the electrons, and you should only train on true (i.e. truth identified, "p_Truth_isElectron==1") electrons. We evaluate algorithm performance on true electrons using the Mean Absolute Error (MAE) of the relative estimate accuracy: (E_pred - E_true)/E_true. A sketch of computing the two supervised metrics follows this list.
  • Cluster particle signatures into 3-25 categories. This should be based on at most 10 variables from the Variable List. In this case there is no target variable, as this is unsupervised learning. Your solution should simply be the number of the category that you assign the event to, i.e. an integer in [0, 1, ..., n-1] for n categories (n in [3,25]). We evaluate algorithm performance by how well your categories map onto the (mostly non-electron) particle types (no truth variable is given to you for these; I have them!).
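    For reference, the two supervised metrics can be computed on a held-out validation set along these lines (a minimal sketch with dummy numbers; this is not the official evaluation code):

        import numpy as np
        from sklearn.metrics import log_loss

        # Classification metric: Binary Cross Entropy (LogLoss) on continuous scores in ]0,1[.
        y_true_class = np.array([0, 1, 1, 0])               # dummy labels, for illustration only
        y_score      = np.array([0.04, 0.93, 0.61, 0.12])   # dummy model scores
        bce = log_loss(y_true_class, y_score)

        # Regression metric: MAE of the relative accuracy (E_pred - E_true) / E_true,
        # evaluated on true electrons only.
        E_true = np.array([45.2, 80.1, 32.7])                # dummy energies in GeV
        E_pred = np.array([44.0, 83.5, 31.9])
        mae_rel = np.mean(np.abs((E_pred - E_true) / E_true))

        print(f"LogLoss = {bce:.3f},  relative MAE = {mae_rel:.3f}")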

    Using Machine Learning algorithm(s), give 1-3 solution(s) to each of the above problems, along with one single common file giving a (very) short description of each solution (e.g. name of algorithm, central hyperparameters, and training time). As stated, you should submit 1-3 solution(s) for each of the three problems (thus 3-9 in total). You are required to submit at least one tree-based and one neural-network-based solution.
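    To illustrate, a tree-based classification baseline could look like the sketch below (one possible choice only, continuing from the loading/splitting sketch above; scikit-learn's gradient boosted trees are used here, but XGBoost or LightGBM would work analogously, and a PyTorch or Keras neural network would follow the same fit/predict pattern):

        from sklearn.ensemble import HistGradientBoostingClassifier
        from sklearn.metrics import log_loss

        # Illustrative hyperparameters - these should of course be optimised:
        clf = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.1,
                                             random_state=42)
        clf.fit(X_train, y_train)

        # Continuous electron scores in ]0,1[ on the validation set:
        val_scores = clf.predict_proba(X_val)[:, 1]
        print("Validation LogLoss:", log_loss(y_val, val_scores))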
    It is typically not hard to rewrite the code to solve the other problems as well, and you are welcome to reuse the same algorithm for several solutions. However, we encourage (and reward, see below) using a variety of algorithms. Remember, the real target is for you to learn various techniques. You are welcome to use different variables for each solution. Part of the problem is determining which of the features are important and selecting these. Most ML algorithms have an internal feature ranking. Alternatively, you could consider permutation importance and/or SHAP values (a small sketch follows below).
    Note that the limit on the number of variables is "absolute" in the sense that you are not allowed to do e.g. a PCA on all variables to reduce them to the desired number of input variables, as this approach still requires the full variable list to work.
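    One possible way of ranking candidate variables within the limit is permutation importance (a sketch, not a prescription, assuming the fitted clf and the X_val/y_val split from the sketches above):

        import numpy as np
        from sklearn.inspection import permutation_importance

        result = permutation_importance(clf, X_val, y_val, n_repeats=5,
                                        random_state=42, scoring="neg_log_loss")
        ranking = np.argsort(result.importances_mean)[::-1]   # most important first
        selected = [X_val.columns[i] for i in ranking[:20]]   # respect the 20-variable limit
        print(selected)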


    Solutions and submission:
    You (i.e. each student) should hand in (each of) your own solution(s) as TWO separate files:
  • A list of index/event numbers (0, 1, 2, 3, etc.), each followed by your estimate for that event, as shown below for a classification solution. For a regression solution, the second number should be the energy estimate in GeV, and for the clustering problem it should be the category number (which will automatically reveal how many categories you chose to divide into). A small writing sketch is given after this list:
         0, 0.998232
         1, 0.410455
         2, 0.037859
         3, ...
  • A list of the variables you've used for each problem, i.e.:
         p_eta,
         p_pt_track,
         ...
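    As mentioned above, a small sketch of writing the two files in this format (the file names follow the naming scheme below and are only examples, as are the dummy scores and variables; the SolutionChecker remains the authority on the exact format):

        import numpy as np

        test_scores = np.array([0.998232, 0.410455, 0.037859])   # dummy scores, one per test event
        features = ["p_eta", "p_pt_track"]                        # the variables you actually used

        # Solution file: "index, estimate" on each line.
        with open("Classification_FirstnameLastname_SKLearnAlgo1.csv", "w") as f:
            for i, score in enumerate(test_scores):
                f.write(f"{i}, {score:.6f}\n")

        # Variable-list file: one variable per line, as in the example above.
        with open("Classification_FirstnameLastname_SKLearnAlgo1_VariableList.csv", "w") as f:
            f.write("\n".join(v + "," for v in features) + "\n")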

    Submission naming:
    For each solution, you should name your two solution files as follows: TypeOfProblemSolved_FirstnameLastname_SolutionName(_VariableList).csv.
    Four solution examples for myself (thus 9 files total, including the overall description) should e.g. be named:
    Solution 1: Classification_TroelsPetersen_SKLearnAlgo1.csv, Classification_TroelsPetersen_SKLearnAlgo1_VariableList.csv
    Solution 2: Regression_TroelsPetersen_XGBoost1.csv, Regression_TroelsPetersen_XGBoost1_VariableList.csv
    Solution 3: Regression_TroelsPetersen_PyTorchNN1.csv, Regression_TroelsPetersen_PyTorchNN1_VariableList.csv
    Solution 4: Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell.csv, Clustering_TroelsPetersen_kNN-DidNotReallyWorkWell_VariableList.csv
    Overall description: Example could be Description_TroelsPetersen.txt

    Submission format:
    Your solution files should be in Comma-Separated Values (CSV) format, i.e. human-readable text files. In order to test whether your format is correct, we have produced a file submission reader/checker:
  • SolutionChecker.ipynb

    It is mandatory to run your solutions through the Solution Checker (link above), and surely also a wise thing to do to avoid mistakes.
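    Before running the checker, a few rough sanity tests of your own can catch simple mistakes (a sketch only, using the example file name from above; the notebook remains the authority on the format):

        import pandas as pd

        sol = pd.read_csv("Classification_FirstnameLastname_SKLearnAlgo1.csv",
                          header=None, names=["index", "estimate"])
        assert len(sol) == 60000, "classification test set has 60000 events"
        assert sol["index"].tolist() == list(range(len(sol))), "indices must be 0, 1, 2, ..."
        assert sol["estimate"].between(0, 1).all(), "classification scores should lie between 0 and 1"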



    Evaluation:
    We will evaluate this project mostly on the fact that you handed in (or not!). Thus, in working out your solution, you should primarily focus on getting solutions that work reasonably well! Following this, you can pay attention to the points on which we evaluate your solutions:
  • Method(s) applied: Which methods for training, hyperparameter optimisation, etc. did you use? [Thus, cross validation and exploration of hyperparameters is good; a small sketch is given at the end of this section]
  • Performance: How well do your algorithms perform? [Thus, getting the best performance is good]
  • Variable choice: How good was your choice of variables? [Thus, getting the best variables is good]
  • Solution variation: How many (fundamentally) different solutions did you try? [Thus, nine different and well-working solutions is good]
    As we don't know the distribution of your performance, number of algorithms, variable choice, etc., we can't give you the exact evaluation calculation until we have your solutions.
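    For the first point, a sketch of hyperparameter exploration with k-fold cross validation (assuming scikit-learn and the X/y variables from the loading sketch; the grid below is purely illustrative):

        from sklearn.ensemble import HistGradientBoostingClassifier
        from sklearn.model_selection import GridSearchCV

        param_grid = {"learning_rate": [0.05, 0.1, 0.2],
                      "max_iter": [200, 400]}
        search = GridSearchCV(HistGradientBoostingClassifier(random_state=42),
                              param_grid, cv=5, scoring="neg_log_loss", n_jobs=-1)
        search.fit(X, y)
        print(search.best_params_, -search.best_score_)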



    Enjoy, have fun, and throw yourself boldly at the data....




    Last updated: 12th of April 2024 by Troels Petersen