Applied Machine Learning 2026 - Datasets

Data is of course central to both statistics and Machine Learning. While GitHub is great for distributing code, its file size limits keep large data files out of the repository.
For this reason - and to give a simple overview of the data available in the Applied ML course - this page summarises and links to the various data files. They are either public or (mostly) stored on the ERDA server, and we have tried to provide the minimal description and references necessary.
Note that the categories below are not mutually exclusive. We have simply listed each dataset where we found it fit best.

Tabular datasets:

The Aleph Data:

The Aleph (and Medical) datasets consist of 5801390 (55000) simulated cases of Z0-boson decays to a quark and an anti-quark (patients potentially with lifestyle diseases), each with 10 input variables, one "competitor" variable (from other methods), and a truth label variable (b-quark or not for Aleph, healthy or not for Medical). The data is thus suited for classification. The datasets have been subdivided in several ways to provide data files of different sizes and formats. There are also "flawed" versions of the 5000 and 50000 entry Aleph datasets, for practising data cleaning.
Aleph Data (in CSV format - also exists in HDF5 and PARQUET formats, see Week1):
AlephBtag_MC_train_Nev5000.csv (0.4 MB).
AlephBtag_MC_train_Nev50000.csv (4.2 MB).
AlephBtag_MC_train_Nev500000.csv (42 MB).
AlephBtag_MC_train_Nev5000000.csv (401 MB).
AlephBtag_MC_test_Nev246390.csv (20 MB).
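
Since the largest of these files is about 400 MB, it can be useful to take a first look at a file by streaming it row by row rather than loading it all at once. The following is a minimal sketch using only the Python standard library. The function name `label_fraction` and the column names in the demo are hypothetical, and the column names of the real files should be read from their header lines.

```python
import csv
import io


def label_fraction(lines, label_column):
    """Stream CSV rows and return (number of rows, fraction with label 1)."""
    reader = csv.DictReader(lines)
    n_rows = 0
    n_positive = 0
    for row in reader:
        n_rows += 1
        if float(row[label_column]) == 1.0:
            n_positive += 1
    fraction = n_positive / n_rows if n_rows else 0.0
    return n_rows, fraction


# Tiny in-memory stand-in for a real file; these column names are made up.
demo = io.StringIO("x1,x2,label\n0.1,2.0,1\n0.4,1.5,0\n0.9,3.1,0\n0.3,2.2,1\n")
n_rows, frac = label_fraction(demo, "label")
print(n_rows, frac)
```

For a real file, the same function can be given an open file handle, e.g. `open("AlephBtag_MC_train_Nev5000.csv", newline="")`, together with the name of the truth label column found in that file's header.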

Flawed Aleph Data (in CSV format):
AlephBtag_MC_train_Nev5000_flawed.csv (0.15 MB).
AlephBtag_MC_train_Nev50000_flawed.csv (1.4 MB).
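
As a starting point for the data cleaning exercise, one might first separate rows that parse cleanly from rows that do not. The sketch below is only an illustration. This page does not specify which flaws the files contain, so the checks used here (missing fields, non-numeric entries, NaN or infinite values) are assumptions, and the function name `find_flawed_rows` and the demo columns are hypothetical.

```python
import csv
import io
import math


def find_flawed_rows(lines):
    """Return (clean_rows, flawed_line_numbers) for a numeric CSV.

    A row counts as flawed if any field is missing, non-numeric, or not
    finite. These criteria are assumptions, not the course's definition.
    """
    reader = csv.DictReader(lines)
    clean, flawed = [], []
    for row in reader:
        line_no = reader.line_num  # line in the file, counting the header as 1
        try:
            values = {key: float(value) for key, value in row.items()}
        except (TypeError, ValueError):
            # Missing or extra fields raise TypeError, bad text raises ValueError.
            flawed.append(line_no)
            continue
        if all(math.isfinite(v) for v in values.values()):
            clean.append(values)
        else:
            flawed.append(line_no)
    return clean, flawed


# In-memory demo with made-up columns and three deliberately broken rows.
demo = io.StringIO(
    "x1,x2,label\n"
    "0.1,2.0,1\n"
    ",1.5,0\n"
    "0.9,nan,0\n"
    "0.3,abc,1\n"
    "0.2,1.0,0\n"
)
clean, flawed = find_flawed_rows(demo)
print(len(clean), flawed)
```

Keeping the line numbers of the rejected rows makes it easy to inspect them by hand before deciding whether to drop or repair them.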

"Medical" Data (in CSV format):
Medical_Npatients5000.csv (0.15 MB).
Medical_Npatients50000.csv (1.4 MB).

The Housing Price Data:
The Housing Price data is best suited for regression, and also gives practice in dealing with incomplete/flawed data.
HousingPrices.csv (2.3 MB).

Swift Gamma Ray Burst Data and derived properties:
This dataset is great for dimensionality reduction, and was the subject of a publication by NBI researchers (a former teacher and students of the course).
swift_gamma_ray_bursts.zip (55 MB).
SwiftProperties.csv (97 kB).

Image data:
The MNIST dataset is a classic for images, and has become the standard go-to for testing new algorithms on images.
An "extension" is a similar dataset with letters, emnist-letters-test.csv, which allows for anomaly detection.

The Ancient Ice Insoluble Image dataset consists of images of "insolubles" (i.e. things that do not melt or dissolve in water) from the ice cores obtained by drilling into the Greenlandic ice sheet. A small part of each ice core is melted (also for chemical analysis) and put through a fine filter, and photos are taken of the things that do not pass the filter. These are mostly images of dust and of contamination from obtaining the ice cores, but volcanic ash (tephra) in particular is of great scientific interest, and pollen may also be found in these ice cores. There are four files in total:
supervised_train.csv (4.2 MB).
supervised_test.csv (0.9 MB).
train.zip (160 MB).
test.zip (29 MB).

Text data:
IMDB movie review data, where the goal is to determine whether a review is positive or negative: encok5nw3y.zip (153 MB).

Unlabelled datasets:
The Sloan Digital Sky Survey (SDSS) dataset consists of 30000 observations of galaxies, quasars, and stars, each with 23 input variables (some of which have 9999 values), and an (approximate) truth type value. While this dataset could be considered for supervised learning, the labels are obtained from the image, and are hence only approximate.
Data_SDSS.txt (6.3 MB).
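
Before clustering or otherwise using the SDSS variables, the 9999 entries probably need special handling. The sketch below assumes that 9999 acts as a placeholder for a missing measurement, which this page does not confirm. It replaces such entries with NaN and counts them per column. The function name `mask_sentinel` and the demo numbers are made up, and reading Data_SDSS.txt itself is left out because its layout is not described here.

```python
import math

SENTINEL = 9999.0  # assumption: 9999 marks a missing measurement


def mask_sentinel(rows, sentinel=SENTINEL):
    """Replace sentinel entries with NaN and count them per column."""
    masked = [[math.nan if value == sentinel else value for value in row]
              for row in rows]
    n_columns = len(rows[0]) if rows else 0
    counts = [sum(1 for row in rows if row[j] == sentinel)
              for j in range(n_columns)]
    return masked, counts


# Made-up numbers standing in for three observations with three variables.
rows = [
    [18.2, 9999.0, 0.3],
    [17.9, 16.5, 9999.0],
    [9999.0, 9999.0, 0.1],
]
masked, counts = mask_sentinel(rows)
print(counts)
```

The per-column counts give a quick indication of which variables are mostly placeholders and might be better left out altogether.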

Last updated: 29th of April 2026 by Troels Petersen.