Applied Machine Learning - Week 2

Monday the 29th of April - Friday the 3rd of May 2024

Monday 29th of April (afternoon):
Lectures: Initial project kick-off with introduction to and discussion of elements in this "Kaggle-like" project.
     We will then look into Hyperparameters, Overtraining, and Early stopping (a small early-stopping code sketch is given below). (For reference: Recording of the 2021 lecture).
     Both the slides and the code used to produce them can be found on GitHub.
     If time allows, we'll briefly look at the Week 1 exercise solutions, and possibly discuss loss values in training and validation.
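     Relating to the early-stopping topic, here is a minimal sketch of how it can be done with scikit-learn's GradientBoostingClassifier; the synthetic data and the specific parameter values are placeholders for illustration only, so swap in your own training:

     # Early stopping with scikit-learn's GradientBoostingClassifier:
     # training stops once the internal validation score has not improved
     # for n_iter_no_change consecutive boosting iterations.
     from sklearn.datasets import make_classification
     from sklearn.ensemble import GradientBoostingClassifier
     from sklearn.model_selection import train_test_split

     X, y = make_classification(n_samples=5000, n_features=10, random_state=42)
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

     clf = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.2,
                                      n_iter_no_change=10, random_state=42)
     clf.fit(X_train, y_train)
     print("Boosting iterations actually used:", clf.n_estimators_)  # typically well below 500
     print("Test accuracy:", clf.score(X_test, y_test))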

Exercises: The main exercise is to run a hyperparameter (HP) optimisation on one of your algorithm trainings. Compare Grad, Grid, and Random search in terms of time and performance (a starting-point sketch is given after the exercises).
     As an additional exercise - if you have time, and haven't tried already - make a regression model, e.g. predicting the absolute value of the jet angle |cTheta| from the other variables in the Aleph b-jet dataset.
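     For the HP exercise, a comparison of grid and random search could look like the sketch below; the model, parameter ranges, and synthetic data are placeholders only, and a gradient-based search is not part of scikit-learn (it would need a separate library such as Optuna or scikit-optimize):

     # Compare grid and random hyperparameter search in time and CV performance.
     import time
     from scipy.stats import randint
     from sklearn.datasets import make_classification
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

     X, y = make_classification(n_samples=2000, n_features=8, random_state=42)

     grid = GridSearchCV(RandomForestClassifier(random_state=42),
                         param_grid={"n_estimators": [50, 100, 200],
                                     "max_depth": [3, 5, None]},
                         cv=3)
     rand = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                               param_distributions={"n_estimators": randint(50, 300),
                                                    "max_depth": randint(2, 10)},
                               n_iter=9, cv=3, random_state=42)

     for name, search in [("Grid", grid), ("Random", rand)]:
         t0 = time.time()
         search.fit(X, y)
         print(f"{name}: best CV score = {search.best_score_:.3f}, "
               f"time = {time.time() - t0:.1f} s, best params = {search.best_params_}")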


Wednesday 1st of May (morning):
Lectures: Input feature ranking, Shapley (SHAP) values, and an overview of ML method performance.

Exercise: Apply the different feature ranking methods to e.g. the Aleph b-jet data, and determine which variables are the important ones (consider the 6 variables used); a starting-point sketch is given below.
     For a discussion of feature ranking, you may also want to see this Towards-Data-Science discussion.
     When you feel that you understand feature ranking and SHAP values, feel free to start working on the Initial project.
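     As a starting point for the feature ranking, the sketch below compares two rankings on a scikit-learn toy dataset (a placeholder for the Aleph b-jet variables); the SHAP part is left as comments, since it assumes the separate shap package is installed:

     # Two ways of ranking input features, plus a pointer to SHAP values.
     from sklearn.datasets import load_breast_cancer
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.inspection import permutation_importance
     from sklearn.model_selection import train_test_split

     X, y = load_breast_cancer(return_X_y=True, as_frame=True)
     X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
     model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

     # 1) Built-in (impurity-based) importance: fast, but can be biased.
     impurity = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])

     # 2) Permutation importance on held-out data: slower, but model agnostic.
     perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
     permuted = sorted(zip(X.columns, perm.importances_mean), key=lambda t: -t[1])

     print("Impurity ranking (top 5):   ", impurity[:5])
     print("Permutation ranking (top 5):", permuted[:5])

     # 3) SHAP values (requires the separate `shap` package):
     #    import shap
     #    explainer = shap.TreeExplainer(model)
     #    shap.summary_plot(explainer.shap_values(X_val), X_val)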


Wednesday 1st of May (afternoon):
Lectures: Final project inspiration, teaser, and dataset discussion.
     Introduction to Unsupervised Learning: Clustering and Nearest Neighbor algorithms.
     Also, I'll briefly introduce/remind about Principal Component Analysis (For reference: Recording of the 2021 lecture (17 MB)).

Exercise: The exercise is to apply clustering algorithms (and potentially dimensionality reduction) to datasets of increasing size and complexity (a starting-point sketch is given after the dataset list):
     Simple small examples:
     1. Fisher's famous irises (3 classes, 3x50=150 cases, and 4 features. Can be obtained through SKlearn's toy datasets). Use any methods.
     2. Wine data example (3 classes, 14 features, 178 cases): Wine.ipynb, along with Wine data and Wine data description.
     3. K-means clustering toy example and data: Clustering.ipynb.
     Examples on larger (known) data:
     Dataset 1: Aleph b-jets (5000+ cases with 6-9 features, which you know already).
     Dataset 2: The interesting Cosmos2015outlier data (10000 cases with 13 features, from astrophysics).
     Dataset 3: The real Cosmos2015 data (20355 cases with 13 features, from astrophysics).
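     As a starting point for the smallest example, the sketch below runs K-means on scaled (and optionally PCA-reduced) iris data; the same recipe carries over to the wine, b-jet, and Cosmos2015 data, and the parameter choices are placeholders only:

     # K-means clustering with scaling and optional PCA, on Fisher's irises.
     from sklearn.cluster import KMeans
     from sklearn.datasets import load_iris
     from sklearn.decomposition import PCA
     from sklearn.metrics import adjusted_rand_score
     from sklearn.preprocessing import StandardScaler

     X, y = load_iris(return_X_y=True)
     X_scaled = StandardScaler().fit_transform(X)          # clustering is scale sensitive
     X_pca = PCA(n_components=2).fit_transform(X_scaled)   # optional dimensionality reduction

     kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_pca)

     # Since the iris labels are known, we can check how well the clusters match the 3 classes.
     print("Adjusted Rand index:", adjusted_rand_score(y, kmeans.labels_))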



Last updated: 26th of April 2024 by Troels Petersen.