Applied Machine Learning - Week 1
Monday the 21st - Friday the 25th of April 2025
Groups: We highly recommend that you work on and discuss the exercises in a group. For the final project you should find a group (administered by Norman).
Reference Data I (Aleph b-quark identification):
In order to learn about ML, we need a nice, simple, scaled, numerically sound, unflawed, possibly large, and perfectly labelled (i.e. simulated) dataset to train and test on, drawn as mutually exclusive samples and with competitive predictions to compare performance against.
It sounds like an impossibility, but I happen to have the "Aleph b-quark tagging" dataset, which includes a Neural Net prediction (first paper from 1992!); see bottom of page.
Wednesday 23rd of April (morning - exceptionally starting at 8:15!):
Lectures: Intro to course, outline, groups, and discussion of data and goals (TP, DM, JN, NP, AA).
     Introduction to AppML Course and Introduction to Machine Learning (TP).
     Introduction to Loss Functions, Stochastic Gradient Descent, Training/Validation, and Introduction to Tree-based algorithms (TP).
Exercise: Classification of b-quark jets in Aleph data with tree-based methods.
     Compare performance to your own Decision Tree and the Aleph NN.
     Additional (reference) data, on classifying stars, galaxies, and quasars: Data_SDSS.txt (6.3 MB).
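Comparing your own tree-based classifier with the Aleph NN requires a common figure of merit. The page does not prescribe one; the following is a minimal sketch in plain Python, assuming the area under the ROC curve (AUC) is an acceptable choice. The function name roc_auc and the toy label and score lists are illustrative assumptions, not part of the course material or the Aleph data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    labels: sequence of 0/1 truth values.
    scores: classifier outputs, where a higher score means "more signal-like".
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block [i, j) of entries sharing the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        n_pos_in_block = sum(label for _, label in pairs[i:j])
        rank_sum_pos += avg_rank * n_pos_in_block
        i = j

    # Mann-Whitney U for the positive class, normalised to [0, 1].
    u_pos = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_pos / (n_pos * n_neg)


# Toy example (made-up numbers): the same truth labels scored by two classifiers.
truth = [0, 0, 1, 1]
my_tree_scores = [0.1, 0.4, 0.35, 0.8]    # hypothetical output of your own tree
reference_scores = [0.1, 0.2, 0.8, 0.9]   # hypothetical reference prediction
print("own tree AUC :", roc_auc(truth, my_tree_scores))
print("reference AUC:", roc_auc(truth, reference_scores))
```

Evaluating both sets of scores on the same test events keeps the comparison fair; other metrics (e.g. accuracy at a fixed cut) can be swapped in the same way.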
Wednesday 23rd of April (afternoon):
Lectures: Introduction to Neural Net-based algorithms (TP).
     Additional slides: ML2025_AppliedML_Top10.pdf
Exercise: Classification of b-quark jets in Aleph data with Neural Net-based methods.
     Compare performance to your tree-based method(s) and the Aleph NN.
     Challenge: Given a "large" dataset on b-jets, see how performance improves with data size.
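The data-size challenge above can be sketched as a learning curve: train on growing subsets and evaluate every model on one fixed test set. The example below is only an illustration under stated assumptions. It uses synthetic Gaussian data rather than the Aleph files, and a nearest-centroid classifier stands in for the neural network so that it runs with the standard library alone. All function names are hypothetical.

```python
import random


def make_sample(n, rng):
    """Synthetic, balanced two-class data with 2 features (NOT Aleph data)."""
    data = []
    for i in range(n):
        label = i % 2                       # alternate classes: balanced sample
        centre = 1.0 if label else -1.0
        features = [rng.gauss(centre, 1.5), rng.gauss(centre, 1.5)]
        data.append((features, label))
    return data


def fit_centroids(train):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for x, y in train:
        counts[y] += 1
        for k in range(2):
            sums[y][k] += x[k]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def accuracy(centroids, test):
    """Fraction of test events assigned to the class of the closest centroid."""
    correct = 0
    for x, y in test:
        dist = {c: sum((a - b) ** 2 for a, b in zip(x, mu))
                for c, mu in centroids.items()}
        correct += (min(dist, key=dist.get) == y)
    return correct / len(test)


rng = random.Random(42)                  # fixed seed for reproducibility
test_set = make_sample(2000, rng)        # one fixed test set for all models
full_train = make_sample(5000, rng)

learning_curve = []
for n_train in (50, 500, 5000):          # growing training subsets
    model = fit_centroids(full_train[:n_train])
    learning_curve.append((n_train, accuracy(model, test_set)))

for n_train, acc in learning_curve:
    print(f"N_train = {n_train:5d}   test accuracy = {acc:.3f}")
```

With the real files, the same loop would run over the training sets of increasing size while the test file stays fixed, so that differences in the score reflect the training size only.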
Aleph Data (in CSV format):
AlephBtag_MC_train_Nev5000.csv (0.4 MB), and
AlephBtag_MC_train_Nev50000.csv (4.2 MB), and
AlephBtag_MC_train_Nev500000.csv (42 MB), and
AlephBtag_MC_train_Nev5000000.csv (401 MB), and
AlephBtag_MC_test_Nev246390.csv (20 MB).
Aleph Data (in HDF5 format):
AlephBtag_MC_train_Nev5000.h5 (1.5 MB), and
AlephBtag_MC_train_Nev50000.h5 (5.7 MB), and
AlephBtag_MC_train_Nev500000.h5 (48 MB), and
AlephBtag_MC_train_Nev5000000.h5 (450 MB), and
AlephBtag_MC_test_Nev246390.h5 (24 MB).
Aleph Data (in PARQUET format):
AlephBtag_MC_train_Nev5000.parquet.gz (0.15 MB), and
AlephBtag_MC_train_Nev50000.parquet.gz (1.4 MB), and
AlephBtag_MC_train_Nev500000.parquet.gz (14 MB), and
AlephBtag_MC_train_Nev5000000.parquet.gz (129 MB), and
AlephBtag_MC_test_Nev246390.parquet.gz (6.4 MB).
Flawed Aleph Data (in CSV format):
AlephBtag_MC_train_Nev5000_flawed.csv (0.15 MB) and
AlephBtag_MC_train_Nev50000_flawed.csv (1.4 MB).
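The page does not say what the flaws in these files are. As a hedged starting point, the sketch below scans a CSV file for entries that are empty, non-numeric, or non-finite (NaN/inf), using only the standard library. It assumes a comma-separated file with a header row; the function name scan_csv, the column names, and the demo rows are made up for illustration and do not come from the Aleph files.

```python
import csv
import io
import math


def scan_csv(handle):
    """Count suspicious entries per column: empty, non-numeric, or non-finite.

    handle: an open text file (or file-like object) with a header row.
    Returns (number of data rows, {column name: number of suspicious entries}).
    """
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    problems = {name: 0 for name in fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in fieldnames:
            text = (row[name] or "").strip()   # short rows yield None
            try:
                value = float(text)
            except ValueError:                 # empty or non-numeric entry
                problems[name] += 1
                continue
            if not math.isfinite(value):       # NaN or +/- infinity
                problems[name] += 1
    return n_rows, problems


# Demo on an in-memory file with made-up columns and values.
demo = io.StringIO(
    "feature_a,feature_b,label\n"
    "0.5,1.2,1\n"
    "nan,0.7,0\n"
    "0.1,,1\n"
    "0.3,-999,0\n"
)
n_rows, problems = scan_csv(demo)
print(n_rows, problems)
```

A scan like this only catches formal problems. Sentinel values such as the -999 in the demo, shifted distributions, or wrong labels are finite numbers and pass unnoticed, so histograms of each variable remain a useful complementary check.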
Alternative (medical) Data (in CSV format):
Medical_Npatients5000.csv (0.15 MB)
Medical_Npatients50000.csv (1.4 MB)
Last updated: 19th of April 2025 by Troels Petersen.