Big Data Analysis (Applied Machine Learning), Block 4, 2020

"Big Data is like teenage sex... everyone talks about it, nobody really knows how to do it, everyone else is doing it, so everyone claims they are doing it!"
[Dan Ariely, Professor at Duke University]
Troels C. Petersen: Lecturer - Associate Professor, NBI - High Energy Physics, 35 52 54 42 / 26 28 37 39, petersennbi.dk
Adriano Agnello: Lecturer - Associate Professor, NBI - Cosmology, 35 33 76 41, adriano.agnellonbi.dk
Brian Vinter: Lecturer - Professor, NBI - Computing, 35 32 14 21, vinternbi.dk
Zoe Ansari: Teaching assistant - Ph.D., NBI - Cosmology, 81 92 22 88, zakieh.ansarinbi.dk
Carl-Johannes Johnsen: Teaching assistant - Ph.D., NBI - Computing, 31 44 42 56, cjjohnsennbi.dk


What, when, where, prerequisites, books, curriculum and evaluation:
Content: Graduate course on Machine Learning and Big Data usage in science.
Level: Intended for students at graduate level (4th-5th year) and new Ph.D. students.
Prerequisites: Math (calculus and linear algebra) and programming experience (preferably Python).
When: Mondays 13-17 and Wednesdays 9-17 (Week Schedule Group C) in Block 4 (20/04-17/06 2020).
Where: Mondays: Lectures (13-14) + exercises (14-17) in BioCenter 4-0-32.
Wednesdays: Lectures (9-10) + exercises (10-12) in NBB 01.0.G.064/070 and Lectures (13-14) + exercises (14-17) in BioCenter 4-0-32.
Format: Shorter lectures followed by computer exercises and discussion with emphasis on experience and projects.
Text book: References to Elements of Statistical Learning II.
Additional literature: We (and you) will make extensive use of online ML resources, collected throughout the course.
Language: English (occasional Danish utterances!). All exercises, problem sets, notes, etc. are in English.
Programming: Primarily Python 3.6+ with a few packages on top, though this is an individual choice.
Communication: Lectures and exercises will be given live via Zoom, supported by a course Slack channel: nbiappliedml2020.slack.com.
Discord is also a widely used channel, and this course has a channel under General HCO studying.
Exam: Final project (possibly virtual) presentations on Wednesday the 10th of June all day (9:00-17:00+).
Evaluation: Small project (40%) and final project (60%), evaluated by the lecturers following the Danish 7-step scale.
Credits: 7.5 ECTS (1/8 of an academic year's work, that is 187.5-225 hours of work, thus about 23-28 hours weekly).
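
The workload figures in the Credits line can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 60 ECTS per academic year and the 1500-1800 hours per year are assumptions inferred from the numbers above (they are not stated on this page), and the 8 weeks are taken from the course outline.

```python
# Assumptions (not stated on this page): a full academic year is 60 ECTS
# and corresponds to roughly 1500-1800 hours of work.
ECTS_PER_YEAR = 60.0
HOURS_PER_YEAR = (1500.0, 1800.0)

# From the course description and outline: 7.5 ECTS spread over 8 weeks.
COURSE_ECTS = 7.5
WEEKS_IN_BLOCK = 8


def workload(course_ects=COURSE_ECTS, weeks=WEEKS_IN_BLOCK):
    """Return ((total_low, total_high), (weekly_low, weekly_high)) in hours."""
    fraction = course_ects / ECTS_PER_YEAR          # 7.5 / 60 = 1/8 of a year
    total = tuple(fraction * hours for hours in HOURS_PER_YEAR)
    weekly = tuple(hours / weeks for hours in total)
    return total, weekly


total_hours, weekly_hours = workload()
print("Total:  %.1f-%.1f hours" % total_hours)
print("Weekly: %.1f-%.1f hours" % weekly_hours)
```

Under these assumptions the weekly range comes out slightly above 23 and 28 hours, consistent with the rounded figures quoted in the Credits line.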

Further course information can be found here: ML2020_CourseInformation.pdf
A (highly recommended) course questionnaire will be used to facilitate student collaboration and group work.

Course exam:
The following are the presentation schedule and guidelines.
Here you find the evaluation form for all the projects (1-10 scale).
Zoom links for the morning session and afternoon session (NOT recorded!).
An introduction can be found here: exam introduction.

Below you can find the presentations of the final projects (10th of June 2020):
Morning session:
  • FinalProject1_RasmusPeter.pdf
  • FinalProject2_MariaAndyEmilMads_WalmartKaggle.pdf
  • FinalProject3_HelenaKatjaSimonViktoria.pdf
  • FinalProject4_AnnSofieEmyMartaYanet_RetrievalOfSeaSurfaceTemperatures.pdf
  • FinalProject5_ChristopherJoakimNikolaj_PredictingTheCriticalTempOfSuperconductors.pdf
  • FinalProject6_MikkelMikkelAskeAnnaMoust_GNNonIceCubeData.pdf
Afternoon session:
  • FinalProject7_HaiderRasmusMS_PredictingMusicPublicationYear.pdf
  • FinalProject8_AlbaMirenEdwinFynn_PredictingBloodCellType.pdf
  • FinalProject9_RuniSimoneMarcusJonathan_CalibrationForNewAstroDataForExoPlanetResearch.pdf
  • FinalProject10_SofusKristofferDavidElias_TrickingFaceTracking.pdf
  • FinalProject11_DinaAlineAlbertMichael_WheatDetection.pdf
  • FinalProject12_LaurentOrestisGiorgosCarlos_TweetSentimentExtraction.pdf
  • FinalProject13_SvendJulius_PredictingMusicGenre.pdf
  • FinalProject14_EmilMartiny_NoisyDataOnCells.pdf
  • FinalProject15_NicolasPedersen_IdentificationOfObjectsIn2DImages.pdf





    Course outline:
    Below is the preliminary course outline, subject to changes throughout the course.

    Week 1 (Introduction to Machine Learning concepts and methods):
    Apr 20: 13:15-17:00: Intro to course, outline, groups, and discussion of data and goals (TP, AA, BV, ZA, CJ). Overview of Machine Learning techniques (TP).
         Exercise: Setup of infrastructure (Github, ERDA, Zoom, Slack). Inspecting data and making "human" decision tree.
    Apr 22: 9:15-12:00: Introduction to Tree-based algorithms (TP).
         Exercise: Classification of b-quark jets in Aleph data with Tree based methods.
    Apr 22: 13:15-17:00: Introduction to NeuralNet-based algorithms (TP).
         Exercise: Classification of b-quark jets in Aleph data with Neural Net based methods.

    Week 2 (Data collection, training, and optimisation):
    Apr 27: 13:15-17:00: Data collection, preprocessing, and dimensionality reduction (AA).
         Exercise: Run a (k)PCA on (a) the b-quark data table, and/or (b) the SDSS data table.
    Apr 29: 9:15-12:00: Training, Validation, Test, Cross Validation, and introduction to basic machinery (AA).
         Exercise: Try to apply cross validation in your training.
    Apr 29: 13:15-17:00: Hyperparameters, Overtraining, and Early stopping (CM+AA).
         Exercise: Hyperparameter optimisation of simple tree and NN algorithms.

    Week 3 (Clustering and Long Short Term Memory networks):
    May 4: 13:15-17:00: Introduction to Clustering and Nearest Neighbor algorithms (BV). Small projects start (TP).
         Exercise: Try to apply the k-NN (and other) algorithms to e.g. breast cancer and/or the Aleph b-jet data.
    May 6: 9:15-12:00: Long Short Term Memory (LSTM) and Recurrent Neural Networks (RNN) (TP, James Avery).
         Exercise: Try to make an LSTM predict the next entries in a sine (periodic) and a Mackey-Glass (non-periodic) sequence.
    May 6: 13:15-17:00: Population Mixture Models (AA).
         Exercise: Apply the Expectation-Maximization algorithm to cluster data of your choice.

    Week 4 (Computers and networks, Convolutional Neural Networks, and the t-SNE algorithm):
    May 11: 13:15-17:00: Infrastructure: Computers, storage, and networks (BV). Final projects introduction/kickoff.
         Exercise: Work on Small project, and coordination of final project.
    May 13: 9:15-12:00: Convolutional Neural Networks (CNN) and images (Alexander Topic).
         Exercise: Recognize images (in this case handwritten numbers) with Convolutional Neural Networks.
    May 13: 13:15-17:00: t-distributed Stochastic Neighbor Embedding (t-SNE) and feature ranking (Alexander Nielsen, TP).
         Exercise: Categorise handwritten numbers with unsupervised methods (PCA and t-SNE).

    Week 5 (GPUs, Generative Adversarial Networks, and project work):
    May 18: 13:15-17:00: GPU accelerated data analysis - Rapids (Mads Kristensen, Nvidia - formerly NBI). Small project should be submitted by 22:00!
         Exercise: Work on Small project, and coordination of final project.
    May 20: 9:15-12:00: Generative Adversarial Networks (GANs) and work on final project.
         Exercise: Work on final project.
    May 20: 13:15-17:00: Work on final project.
         Exercise: Work on final project.

    Week 6 (CNNs at work, Ethics in ML, and overview of ML methods):
    May 25: 13:15-17:00: Using CNNs in beer quality check (Carl-Johannes Johnsen).
         Exercise: Work on final project.
    May 27: 9:15-12:00: Ethics and Machine Learning (TP).
         Exercise: Work on final project.
    May 27: 13:15-17:00: Overview of Machine Learning methods (TP).
         Exercise: Work on final project.

    Week 7 (ML for Supernova dust detection and results on Small Project):
    Jun 1: 13:15-17:00: No teaching (Whit Monday).
    Jun 3: 9:15-12:00: Supernova dust detection with Machine Learning (Zoe Ansari).
         Exercise: Work on final project.
    Jun 3: 13:15-17:00: Results and Feedback on small project.
         Exercise: Work on final project.

    Week 8 (EXAM: Presentations of final project):
    Jun 8: 13:15-17:00: Final project work.
    Jun 10: 9:15-12:00: Presentations of final projects (TP, AA, BV, ZA, CJ).
    Jun 10: 13:15-17:00: Presentations of final projects (cont.) and possibly course evaluation.
         Here you can see the schedule for the day.
         And here you find the evaluation form for all the projects (1-10 scale).




    Below you can find the presentations of the final projects given in last year's course (on the 10th of June 2019):
  • Project1_BomberMan.pdf
  • Project2_BoneAge.pdf
  • Project3_SpectralAnalysis.pdf
  • Project4_StockMarketAnalysis.pdf
  • Project5_FindingWallyIn2DImages.pdf
  • Project6_StellarClassificationCNN.pdf
  • Project7_PredictingAgeGenderEthnicity.pdf
  • Project8_UFOSightingDataMining.pdf
  • Project9_ClassificationOfCatsVsDogs.pdf
  • Project10_MulticlassClassificationOfHearBeats.pdf
  • Project11_PredictingAbsorptionEnergies.pdf
  • Project12_PredictingSolarBatteryProperties.pdf
  • Project13_SkinLesionClassification.pdf

    "Some people worry that artificial intelligence will make us feel inferior, but then, anybody in his right mind should have an inferiority complex every time he looks at a flower." [Alan Kay, American computer scientist]


    Last updated 7th of June 2020.