Applied Machine Learning - Week 1

Monday the 24th - Friday the 28th of April 2023

Groups: For the final project you should find a group (administered by Azzurra), but we highly recommend that you also work/collaborate/discuss in a group for the exercises.

Reference Data I (Aleph b-quark tagging):
In order to learn about ML, we need a nice, simple, scaled, mutually exclusively sampled, numerically sound, unflawed, possibly large, and perfectly labelled (i.e. simulated) dataset to train and test on, with competitive predictions to compare against. It sounds like an impossibility, but I happen to have the "Aleph b-quark tagging" dataset (link to README), which includes a Neural Net prediction (first paper from 1992!); see bottom of page.


Monday 24th of April (afternoon):
Lectures: Intro to course, outline, groups, and discussion of data and goals (TP, CS, AD, AM, TS).
     Introduction to AppML Course and Introduction to Machine Learning (TP). (For reference: Recording of the 2021 Lecture (467 MB)).

Exercise: Setup of infrastructure (Github, ERDA, Zoom, Slack). Test your Python setup with ML_MethodsDemos.ipynb.
     Getting a feel for the Curse of Dimensionality, making life in high dimensions a lonely one!
     Inspecting data and making a "human" decision tree for classification: Code for initial analysis: BjetSelection.ipynb (classifying with if-sentences!)
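
To get a feel for the Curse of Dimensionality mentioned in the exercise above, the following is a minimal sketch (not part of the course code) that measures how far away the nearest neighbour is when points are scattered uniformly in a unit hypercube. The function name, the number of points (200) and the chosen dimensions are illustrative assumptions, and only the Python standard library is used.

```python
import math
import random


def mean_nearest_neighbour_distance(n_points, n_dim, rng):
    """Average distance from each point to its nearest neighbour,
    for points drawn uniformly in the unit hypercube [0, 1]^n_dim."""
    points = [[rng.random() for _ in range(n_dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over all other points (fine for a few hundred points).
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


rng = random.Random(42)  # fixed seed, so the numbers are reproducible
distances = {d: mean_nearest_neighbour_distance(200, d, rng) for d in (1, 2, 10, 50)}
for d, dist in distances.items():
    print(f"dim = {d:3d}   mean nearest-neighbour distance = {dist:.3f}")
```

With the number of points held fixed, the nearest neighbour moves steadily further away as the dimension grows, which is why life in high dimensions is a lonely one.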


Wednesday 26th of April (morning - starting exceptionally at 8:15!):
Lectures: Introduction to Loss Functions, Stochastic Gradient Descent, Training/Validation, and Introduction to Tree-based algorithms (TP).
     (For reference: Recording of the 2021 Lecture (314 MB) on Tree-based algorithms).
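
As a rough illustration of how a loss function, Stochastic Gradient Descent, and a training/validation split fit together, here is a minimal sketch in plain Python. It is not taken from the lecture: the toy one-dimensional data, the logistic model, the learning rate, and the 800/200 split are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Mean binary cross-entropy of the model p = sigmoid(w*x + b)."""
    eps = 1e-12  # protects against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


rng = random.Random(1)
# Toy data (an assumption, NOT the Aleph data): label 1 is centred at +1, label 0 at -1.
ys = [rng.randint(0, 1) for _ in range(1000)]
xs = [rng.gauss(1.0 if y == 1 else -1.0, 1.0) for y in ys]
x_train, y_train = xs[:800], ys[:800]   # used to update the parameters
x_valid, y_valid = xs[800:], ys[800:]   # only used to monitor the loss

w, b = 0.0, 0.0
learning_rate = 0.05
valid_loss_before = log_loss(w, b, x_valid, y_valid)

for epoch in range(5):
    order = list(range(len(x_train)))
    rng.shuffle(order)                   # "stochastic": visit the events in random order
    for i in order:
        grad = sigmoid(w * x_train[i] + b) - y_train[i]   # d(loss)/dz for a single event
        w -= learning_rate * grad * x_train[i]
        b -= learning_rate * grad
    print(f"epoch {epoch}: train loss = {log_loss(w, b, x_train, y_train):.3f}, "
          f"validation loss = {log_loss(w, b, x_valid, y_valid):.3f}")

valid_loss_after = log_loss(w, b, x_valid, y_valid)
```

The parameters are updated one event at a time from the training part only, while the validation part is kept aside to check that the loss also decreases on data the model has not seen.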

Exercise: Classification of b-quark jets in Aleph data with tree-based methods.
     Compare performance to your own Decision Tree and the Aleph NN.
     Additional (reference) data, on classifying stars, galaxies, and quasars: Data_SDSS.txt (6.3 MB).
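
Before turning to a full library such as LightGBM, it may help to see the basic building block of tree-based methods, namely a single split chosen by minimising an impurity measure. The sketch below is only an illustration under stated assumptions: it uses the Gini impurity, one feature, toy Gaussian data rather than the Aleph variables, and hypothetical function names (`gini`, `best_split`).

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Scan all thresholds on one feature and return
    (threshold, weighted impurity of the two children)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_thr, best_imp = None, gini(ys)
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_thr = 0.5 * (pairs[k - 1][0] + pairs[k][0])
            best_imp = imp
    return best_thr, best_imp


# Tiny hand-checkable case: the classes separate perfectly between 2 and 3.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # prints (2.5, 0.0)

# Toy data (an assumption): label 1 is centred at +1, label 0 at -1.
rng = random.Random(3)
ys = [i % 2 for i in range(500)]
xs = [rng.gauss(1.0 if y == 1 else -1.0, 1.0) for y in ys]
threshold, impurity = best_split(xs, ys)
print(f"best threshold = {threshold:.2f}, impurity {gini(ys):.3f} -> {impurity:.3f}")
```

A decision tree repeats this search over all features and then recursively within each child; boosted trees additionally combine many such trees.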


Wednesday 26th of April (afternoon):
Lectures: Introduction to NeuralNet-based algorithms (TP). (For reference: Recording of the 2021 Lecture (331 MB)).
     Additional slides: ML2023_AppliedML_Top10.pdf

Exercise: Classification of b-quark jets in Aleph data with Neural Net based methods.
     Compare performance to your tree-based method(s) and the Aleph NN.
     Challenge: Given a "large" dataset on b-jets, see how performance improves with data size.
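
The challenge above amounts to producing a learning curve: train on nested subsets of increasing size and evaluate each model on the same test set. The following is a minimal sketch of that procedure, with loud assumptions: the data are toy Gaussians in 10 dimensions rather than b-jets, and a simple nearest-centroid classifier stands in for the neural network or tree-based model you would actually use. All names and sizes are illustrative.

```python
import random


def make_sample(n, rng, n_dim=10, shift=0.3):
    """Toy two-class data: class 1 is shifted by +shift and class 0 by -shift
    in every dimension. Labels alternate, so the classes are balanced."""
    data = []
    for i in range(n):
        y = i % 2
        mu = shift if y == 1 else -shift
        data.append(([rng.gauss(mu, 1.0) for _ in range(n_dim)], y))
    return data


def fit_centroids(train):
    """'Training' a nearest-centroid classifier: the mean feature vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def accuracy(centroids, test):
    """Fraction of test events whose nearest centroid carries the correct label."""
    correct = 0
    for x, y in test:
        d0 = sum((a - c) ** 2 for a, c in zip(x, centroids[0]))
        d1 = sum((a - c) ** 2 for a, c in zip(x, centroids[1]))
        correct += int((1 if d1 < d0 else 0) == y)
    return correct / len(test)


rng = random.Random(7)
test = make_sample(2000, rng)    # one fixed test set for all training sizes
pool = make_sample(5000, rng)    # training pool; smaller sets are nested subsets
results = {}
for n_train in (20, 200, 5000):
    results[n_train] = accuracy(fit_centroids(pool[:n_train]), test)
    print(f"N_train = {n_train:5d}   test accuracy = {results[n_train]:.3f}")
```

Keeping the test set fixed while only the training size changes is what makes the numbers comparable; the same loop structure applies when the training sets are files of different sizes.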



Example solutions from week 1:
The following are example solutions and related code, which come with absolutely no warranty. However, you may let yourself be inspired by these solutions:
  • Solution example of how to separate jet types WITHOUT using Machine Learning.
  • Solution example using LightGBM (tree-based) and MLPClassifier, TensorFlow, and PyTorch (NN-based).


    Data (in CSV format):
    AlephBtag_MC_train_Nev5000.csv (0.4 MB), and
    AlephBtag_MC_train_Nev50000.csv (4.2 MB), and
    AlephBtag_MC_train_Nev500000.csv (42 MB), and
    AlephBtag_MC_train_Nev5000000.csv (401 MB), and
    AlephBtag_MC_test_Nev246390.csv (20 MB).

    Data (in HDF5 format):
    AlephBtag_MC_train_Nev5000.h5 (1.5 MB), and
    AlephBtag_MC_train_Nev50000.h5 (5.7 MB), and
    AlephBtag_MC_train_Nev500000.h5 (48 MB), and
    AlephBtag_MC_train_Nev5000000.h5 (450 MB), and
    AlephBtag_MC_test_Nev246390.h5 (24 MB).

    Data (in PARQUET format):
    AlephBtag_MC_train_Nev5000.parquet.gz (0.15 MB), and
    AlephBtag_MC_train_Nev50000.parquet.gz (1.4 MB), and
    AlephBtag_MC_train_Nev500000.parquet.gz (14 MB), and
    AlephBtag_MC_train_Nev5000000.parquet.gz (129 MB), and
    AlephBtag_MC_test_Nev246390.parquet.gz (6.4 MB).


    Last updated: 20th of April 2023 by Troels Petersen.