Applied Machine Learning - Week 1

Monday the 22nd - Friday the 26th of April 2024

Groups: We highly recommend that you work on and discuss the exercises in a group. For the final project you should find a group (groups are administered by Arnau).

Reference Data I (Aleph b-quark identification):
In order to learn about ML, we need a dataset to train and test on that is nice, simple, scaled, split into mutually exclusive samples, numerically sound, unflawed, possibly large, and perfectly labelled (i.e. simulated), and that comes with a competitive prediction to compare performance against. It sounds like an impossibility, but I happen to have the "Aleph b-quark tagging" dataset (link to README), which includes a Neural Net prediction (first paper from 1992!); see bottom of page.

Monday 22nd of April (afternoon):
Lectures: Intro to course, outline, groups, and discussion of data and goals (TP, DM, AM, TS).
     Introduction to AppML Course and Introduction to Machine Learning (TP). (For reference: Recording of the 2021 Lecture (467 MB)).

Exercise: Setup of infrastructure (Python, GitHub, Slack). Test your Python setup with ML_MethodsDemos.ipynb.
     Getting a feel for the Curse of Dimensionality, making life in high dimensions a lonely one!
     Inspecting data and making a "human" decision tree for classification: Code for initial analysis: BjetSelection.ipynb (classifying with if-statements!)
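
To get a feel for the Curse of Dimensionality mentioned above, here is a minimal sketch using only the Python standard library. It is not taken from the course notebooks: the function name distance_contrast, the number of points, and the choice of a uniform unit hypercube are illustrative assumptions. It compares the distance from a random query point to its nearest and to its farthest neighbour as the number of dimensions grows.

```python
import math
import random


def distance_contrast(n_dim, n_points=500, seed=1):
    """Return (nearest, farthest) Euclidean distance from one random query
    point to n_points random points, all uniform in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dim)]
        distances.append(math.dist(query, point))
    return min(distances), max(distances)


if __name__ == "__main__":
    for n_dim in (1, 2, 10, 100, 1000):
        nearest, farthest = distance_contrast(n_dim)
        # A ratio close to 1 means the nearest neighbour is hardly any
        # closer than the farthest point: "near" loses its meaning.
        print(f"dim={n_dim:5d}  nearest/farthest = {nearest / farthest:.3f}")
```

In low dimensions the nearest point is much closer than the farthest one, while in high dimensions the ratio should move towards 1, which is one way of seeing why every point ends up "lonely".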


Wednesday 24th of April (morning - exceptionally starting at 8:15!):
Lectures: Introduction to Loss Functions, Stochastic Gradient Descent, Training/Validation, and Introduction to Tree-based algorithms (TP).
     (For reference: Recording of the 2021 Lecture (314 MB) on Tree-based algorithms).
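
As a small illustration of the lecture topics (loss function, Stochastic Gradient Descent, and a training/validation split), the following is a minimal sketch and not the lecture's own code. It assumes a one-feature logistic model with a binary cross-entropy loss on synthetic data; the helper names (sigmoid, log_loss, sgd, make_data), the learning rate, and the number of epochs are illustrative choices.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(w, b, data):
    """Mean binary cross-entropy of the model sigmoid(w*x + b) on data."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def sgd(train, lr=0.1, epochs=20, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    train = list(train)  # copy, so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(train)
        for x, y in train:
            grad = sigmoid(w * x + b) - y  # d(loss)/d(w*x + b) for cross-entropy
            w -= lr * grad * x
            b -= lr * grad
    return w, b


def make_data(n, seed):
    """Synthetic data: label y in {0, 1}, feature drawn from N(2y - 1, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y - 1.0, 1.0), y))
    return data


if __name__ == "__main__":
    train, valid = make_data(800, seed=1), make_data(200, seed=2)
    w, b = sgd(train)
    print(f"train loss      {log_loss(w, b, train):.3f}")
    print(f"validation loss {log_loss(w, b, valid):.3f}")
```

The validation sample is never used for the updates, so comparing the two printed losses gives a first check on overtraining.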

Exercise: Classification of b-quark jets in Aleph data with tree-based methods.
     Compare performance to your own Decision Tree and the Aleph NN.
     Additional (reference) data, on classifying stars, galaxies, and quasars: Data_SDSS.txt (6.3 MB).
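
The comparison between a "human" decision tree and a tree-based method can be sketched as below. This is a toy example under stated assumptions, not the exercise solution: the data are synthetic stand-ins with two made-up features (not the Aleph variables), the thresholds in human_tree are arbitrary, and the fitted model is only a depth-1 tree (a "stump") that picks its cut by minimising the Gini impurity. All function names are illustrative.

```python
import random


def make_toy_data(n, seed=0):
    """Synthetic stand-in for jets: two features, label 1 = 'signal'."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Signal is shifted upwards in both (hypothetical) features.
        x0 = rng.gauss(1.5 * label, 1.0)
        x1 = rng.gauss(1.0 * label, 1.0)
        rows.append(((x0, x1), label))
    return rows


def human_tree(features):
    """Hand-written 'decision tree': thresholds chosen by eye."""
    x0, x1 = features
    if x0 > 1.0:
        return 1
    if x1 > 1.5:
        return 1
    return 0


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, feature=0):
    """Depth-1 tree on one feature: choose the cut with the lowest weighted
    Gini impurity and label each leaf with its majority class."""
    best, best_score = None, float("inf")
    values = sorted({f[feature] for f, _ in rows})
    for lo, hi in zip(values, values[1:]):
        cut = 0.5 * (lo + hi)
        left = [y for f, y in rows if f[feature] <= cut]
        right = [y for f, y in rows if f[feature] > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_score = score
            best = (cut, int(2 * sum(left) > len(left)), int(2 * sum(right) > len(right)))
    return best  # (cut, label below cut, label above cut)


def stump_predict(stump, features, feature=0):
    cut, left_label, right_label = stump
    return right_label if features[feature] > cut else left_label


def accuracy(predict, rows):
    """Fraction of rows where predict(features) equals the true label."""
    return sum(predict(f) == y for f, y in rows) / len(rows)


if __name__ == "__main__":
    train, test = make_toy_data(500, seed=1), make_toy_data(2000, seed=2)
    stump = fit_stump(train, feature=0)
    print(f"human tree accuracy:   {accuracy(human_tree, test):.3f}")
    print(f"fitted stump accuracy: {accuracy(lambda f: stump_predict(stump, f), test):.3f}")
```

Both classifiers are scored on the same independent test sample, which is the point of the comparison; for the real exercise a full tree-based library and a measure such as the ROC curve would be the natural next step.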


Wednesday 24th of April (afternoon):
Lectures: Introduction to NeuralNet-based algorithms (TP). (For reference: Recording of the 2021 Lecture (331 MB)).
     Additional slides: ML2024_AppliedML_Top10.pdf

Exercise: Classification of b-quark jets in Aleph data with Neural Net based methods.
     Compare performance to your tree-based method(s) and the Aleph NN.
     Challenge: Given a "large" dataset on b-jets, see how performance improves with data size.
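
For the challenge above, one way to organise the study is a "learning curve": train on nested subsets of increasing size and always evaluate on the same test sample. The sketch below is only an illustration under simplifying assumptions: it uses synthetic two-class Gaussian data and a nearest-centroid classifier instead of the Aleph files and a real tree or Neural Net, and all names and sizes are made up. Training sizes must be at least 2 so that both classes are present.

```python
import random


def make_data(n, rng):
    """n points from two overlapping 2D Gaussian classes (alternating labels)."""
    data = []
    for i in range(n):
        y = i % 2
        data.append(((rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0)), y))
    return data


def fit_centroids(train):
    """Nearest-centroid 'training': the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def learning_curve(sizes, n_test=2000, seed=0):
    """Return [(n_train, test accuracy), ...] for each training size."""
    rng = random.Random(seed)
    test = make_data(n_test, rng)
    full = make_data(max(sizes), rng)
    curve = []
    for n in sizes:
        centroids = fit_centroids(full[:n])  # nested subsets of one training sample
        acc = sum(predict(centroids, x) == y for x, y in test) / n_test
        curve.append((n, acc))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve((4, 40, 400, 4000)):
        print(f"N_train = {n:5d}   test accuracy = {acc:.3f}")
```

A model this simple has very few parameters, so its curve is expected to flatten early; with the Aleph files of increasing size and a more flexible model, the interesting question is where (and whether) the improvement levels off.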


Aleph Data (in CSV format):
AlephBtag_MC_train_Nev5000.csv (0.4 MB), and
AlephBtag_MC_train_Nev50000.csv (4.2 MB), and
AlephBtag_MC_train_Nev500000.csv (42 MB), and
AlephBtag_MC_train_Nev5000000.csv (401 MB), and
AlephBtag_MC_test_Nev246390.csv (20 MB).

Aleph Data (in HDF5 format):
AlephBtag_MC_train_Nev5000.h5 (1.5 MB), and
AlephBtag_MC_train_Nev50000.h5 (5.7 MB), and
AlephBtag_MC_train_Nev500000.h5 (48 MB), and
AlephBtag_MC_train_Nev5000000.h5 (450 MB), and
AlephBtag_MC_test_Nev246390.h5 (24 MB).

Aleph Data (in PARQUET format):
AlephBtag_MC_train_Nev5000.parquet.gz (0.15 MB), and
AlephBtag_MC_train_Nev50000.parquet.gz (1.4 MB), and
AlephBtag_MC_train_Nev500000.parquet.gz (14 MB), and
AlephBtag_MC_train_Nev5000000.parquet.gz (129 MB), and
AlephBtag_MC_test_Nev246390.parquet.gz (6.4 MB).

Flawed Aleph Data (in CSV format):
AlephBtag_MC_train_Nev5000_flawed.csv (0.15 MB) and
AlephBtag_MC_train_Nev50000_flawed.csv (1.4 MB).

Alternative (medical) Data (in CSV format):
Medical_Npatients5000.csv (0.15 MB)
Medical_Npatients50000.csv (1.4 MB)

Last updated: 16th of April 2024 by Troels Petersen.