Applied Machine Learning - Week 2
Monday the 28th of April - Friday the 2nd of May 2025
Monday 28th of April (afternoon):
Lectures: We will look into
Hyperparameters, Overtraining, and Early stopping.
Both the slides and the code used to produce them can be found on
GitHub.
     We will also have a look at the Week 1 exercise solutions,
and possibly discuss
Loss values in training and validation.
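As a minimal illustration of early stopping (a sketch on synthetic stand-in data, not the course's own example), scikit-learn's gradient boosting can halt training once the validation score stops improving:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (hypothetical, for illustration only).
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# Allow up to 500 trees, but stop once the score on a held-out 20%
# validation split has not improved for 10 consecutive rounds.
clf = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=42,
)
clf.fit(X, y)

# n_estimators_ is the number of trees actually fitted before stopping.
print(clf.n_estimators_)
```

Typically the fit stops well short of the 500-tree budget, which is exactly the overtraining protection early stopping provides.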
Exercises: The main exercise is to run a
Hyperparameter (HP) optimisation on one of your algorithm trainings. Compare "Grad", "Grid", and "Random" search in terms of time and performance.
     As an additional exercise, make a
regression model of e.g. energy or |cTheta| from the other variables in the Aleph b-jet dataset.
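The Grid-vs-Random part of the main exercise can be sketched as below. This uses scikit-learn on synthetic data; the model and search ranges are placeholders for your own algorithm's HPs, and gradient-based ("Grad") search is not covered here since scikit-learn has no built-in for it:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (hypothetical, for illustration only).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Placeholder search space; substitute your own algorithm's HPs.
param_grid = {"n_estimators": [20, 50, 100], "max_depth": [3, 5, None]}

results = {}
for name, search in [
    # Grid search tries all 3 x 3 = 9 combinations exhaustively.
    ("Grid", GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=3)),
    # Random search samples only 4 of the 9 combinations.
    ("Random", RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                  param_grid, n_iter=4, cv=3,
                                  random_state=0)),
]:
    t0 = time.time()
    search.fit(X, y)
    results[name] = (time.time() - t0, search.best_score_)

for name, (dt, score) in results.items():
    print(f"{name}: {dt:.1f} s, best CV accuracy {score:.3f}")
```

The point of the comparison: random search usually gets close to the grid-search optimum at a fraction of the cost, since it evaluates fewer candidates.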
Wednesday 30th of April (morning):
Lectures:
Input feature ranking and Shapley values, and
an overview of ML method performance.
    
Initial project kick-off with
introduction to this "Kaggle-like" project.
Exercise: Apply the different
feature ranking methods to
e.g. the Aleph b-jet data, and determine which variables are the important ones (consider the 6 variables used).
     As a more real world data case, try the
Housing Price dataset for predicting sales prices (i.e. regression).
     Notice that the data is not as "curated" as the Aleph data, i.e. there are missing values and outliers.
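One of the ranking methods, permutation importance, can be sketched as follows; synthetic data with 6 features (of which only 3 are informative) stands in for the Aleph b-jet variables:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, of which only 3 carry signal.
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in score;
# a large drop means the model relied on that feature.
imp = permutation_importance(model, X_test, y_test,
                             n_repeats=10, random_state=1)
ranking = np.argsort(imp.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {imp.importances_mean[i]:.3f}")
```

On real data, run the same procedure on the trained classifier to see which of the 6 Aleph variables matter most.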
Wednesday 30th of April (afternoon):
Lectures:
Final project inspiration, teaser, and dataset discussion.
     Introduction to Unsupervised Learning:
Clustering and Nearest Neighbor algorithms.
Also, I'll briefly introduce/remind you of
Principal Component Analysis.
Exercise: The exercise is to apply
clustering algorithms (and potentially dimensionality reduction) to datasets of increasing size and complexity:
    
Simple small examples:
     1. K-means clustering toy example applied to simulated data:
Clustering.ipynb.
     2. Wine data example (3 classes, 14 features, 178 cases):
Wine.ipynb, along with
Wine data and
Wine data description.
    
Example on larger (known) data:
     Dataset 1: Aleph b-jets (5000+ cases with 6-9 features, which you know already).
     Dataset 2: The interesting
Cosmos2015outlier data (10000 cases with 13 features, from astrophysics).
     Dataset 3: The real
Cosmos2015 data (20355 cases with 13 features, from astrophysics).
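For the Wine example, the same UCI data ships with scikit-learn, so the clustering (with optional PCA reduction) can be sketched as below. Standardising before K-means matters because the algorithm uses Euclidean distance, so features on large scales would otherwise dominate:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The UCI Wine data: 178 cases, 13 numeric features, 3 classes.
X, y = load_wine(return_X_y=True)

# Standardise so all features contribute on a comparable scale.
X_std = StandardScaler().fit_transform(X)

# Optional dimensionality reduction before clustering.
X_pca = PCA(n_components=2).fit_transform(X_std)

# K-means with k = 3, matching the known number of wine cultivars.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_pca)
print(km.labels_[:10])
```

Comparing `km.labels_` against the true classes `y` (e.g. with a confusion matrix) shows how well the unsupervised clusters recover the cultivars; the same recipe scales up to the Aleph and Cosmos2015 datasets.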
Last updated: 25th of April 2025 by Troels Petersen.