Big Data Analysis 2020 - Final Project

"As a data scientist, I can predict what is likely to happen, but I cannot explain why it is going to happen. I can predict when someone is likely to attrite, or respond to a promotion, or commit fraud, or pick the pink button over the blue button, but I cannot tell you why that's going to happen. And I believe that the inability to explain why something is going to happen is why I struggle to call 'data science' a science."
[Bill Schmarzo, Author of "Big Data: Understanding How Data Powers Big Business"]
Course exam: The following are the presentation schedule and guidelines.
And here you find the evaluation form for all the projects (1-10 scale).


Project description:
The final project is a machine learning project on whatever you want! Find out whether the stock market can be predicted, test if you can beat your friends in games with AI, try to estimate the selling price of houses, challenge the world on Kaggle, etc.
We highly encourage you to find your own data, and there are few limits to what we will accept. It does not have to be physics related! However, it should have some level of complexity, as this is one of the points of evaluation. It may be numbers, text, images, ???, or a combination of any of these. The only really strict requirements are that you use Machine Learning algorithms on some sizable data, and that you work in a group.
Groups should ideally be 3-4 persons, but if extraordinary circumstances dictate otherwise, please write to us. The aim/target of the ML algorithm developed is secondary and also entirely up to you. However, you may want to consult with us/others regarding the feasibility of your project, though it is not a requirement that you have fully succeeded by the end.

Update (6th of May): Given the extraordinary circumstances, it might be hard for everyone to form a group of 3-4 persons. For this reason, we allow smaller groups (1-2 persons), which will have the same deadline, but which will (in addition to being present and grading on the 10th of June) present their projects on Thursday the 11th of June from 9:15-17:00 (for as long as needed).

You should submit the project via Absalon by 22:00 on Tuesday the 9th of June 2020. Presentations are the following day (and Thursday the 11th for smaller groups and individuals).

Groups and projects:
  1. Rasmus FO, Peter C [Reconstructing neutrino events in the IceCube experiment with Graph NN, Troels]
  2. Maria, Mads, Andy, and Emil [Kaggle Walmart data, Zoe]
  3. Katja, Helena, Viktoria, Simon [Skin lesion dataset from Harvard, Carl]
  4. Marta, Ann-Sofie, Emy, and Yanet [Retrieval of sea surface temperatures (SSTs) from passive microwave measurements, Adriano]
  5. Christopher, Nikolaj, Joakim [Predicting critical temperatures of superconductors, Zoe]
  6. Aske, Mikkel RS, Anna, Mikkel LL, Moust [Reconstructing neutrino events in the IceCube experiment with Graph NN, Troels]
  7. Rasmus MS, Haider [Estimating publication year of music from extracted features (http://millionsongdataset.com/), Adriano]
  8. Edwin, Miren, Alba, Fynn [Blood cell type - Kaggle data, Carl]
  9. Runi Nima Soerensen, Simone Vejlgaard, Marcus Bredtved, and Jonathan Jegstrup [Calibration for new astro-data for exo-planet research, Troels]
  10. Sofus Stray, David Dedenbach, Elias Najarro, and Kristoffer Kvist [Learn how to generate (visible) patterns that can be used to trick face-tracking systems under real conditions, Carl]
  11. Dina, Aline, Albert and Michael [Wheat dataset from Kaggle, Carl]
  12. Laurent Lindpointner, Orestis Marantos, Griogos Garidis, Carlos Rodriguez [Extraction of sentiment from Tweets, Zoe]
  13. Svend, Julius (external) [Estimating genre of music, Adriano]
  14. Emil Schou Martiny [Own (very noisy) data, Troels]
  15. Nicolas Remy Hoeegh Pedersen [Identification of objects in 2D images, Carl]
Backup data sets:
In case you do not have, or do not manage to find, any suitable data, we have provided a short list of possible "backup" data sets that we have in store:
  • ATLAS V0 particle identification (classification, easy, TP)
  • ATLAS electron energy reconstruction with Convolutional Neural Networks (regression, medium-hard, TP)
  • Reconstruction of neutrino direction from timing information (regression, hard, TP)
  • Identification of objects in 2D images (classification, medium, BV)
  • Finding faults in sugar beads from hyperspectral images (classification, medium, BV)
  • Classification of X-ray sinograms from potatoes (classification, medium-hard, BV)
  • Table of astronomical objects (classification and regression, AA)
  • Spectral analysis of measurements in SQL database (spectral analysis, AA)
  • Clustering of data from Coma Cluster (no pun intended) treasure dataset (clustering, AA)
The list may change over time and further information may be added (others have data on Human motion action (regression, medium?) and Analysis of Tweets/text (text analysis, hard)). Until then, contact the "data set responsible" regarding details.
In discussion with last year's students, the following is a compilation of experiences from last year's final projects (a list of which can be found at the main course webpage).

Solutions and submission:
Your final project solutions should be submitted in the form of slides, which are fit for a 15 minute presentation. A rule of thumb is that one slide per minute is fitting. Going (much) beyond this ruins the clarity of your presentation. However, you are encouraged (required?) to put a lot of details in the appendix, giving all the technical details of your work. In the slides (e.g. on the front page or on a page in the appendix) you should put a statement about each group member's contribution to the project. It may be as simple as "All group members have contributed evenly to the project", but we simply want to see in writing what you have agreed on. And finally, we will ask you to also submit your project code, just for reference. To simplify matters, you should name your file as follows: FirstNamesOfGroupMembers_ProjectName.pdf, for example: TroelsAdrianoZoeCarl_StudentGradingRegression.pdf
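As a convenience, the naming convention above could be generated with a small helper. This is only a sketch: the function name `submission_filename` and the choice to capitalise each word and drop spaces and punctuation are our assumptions, inferred from the single example given, not an official tool.

```python
import re


def _camel(text):
    """Join the words of `text`, capitalising the first letter of each.

    Assumption: anything that is not a letter or digit separates words.
    """
    words = re.findall(r"[^\W_]+", text)
    return "".join(word[0].upper() + word[1:] for word in words)


def submission_filename(first_names, project_name):
    """Return 'FirstNamesOfGroupMembers_ProjectName.pdf' (hypothetical helper)."""
    names = "".join(_camel(name) for name in first_names)
    project = _camel(project_name)
    if not names or not project:
        raise ValueError("need at least one first name and a project name")
    return f"{names}_{project}.pdf"


if __name__ == "__main__":
    print(submission_filename(["Troels", "Adriano", "Zoe", "Carl"],
                              "Student grading regression"))
```

With the names and project title from the example above, the helper reproduces the example file name.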

Evaluation:
The final projects will be evaluated based on the following criteria:
  • Complexity of problem and depth of solution (incl. appendix)
  • Choice of methods and the arguments behind it
  • ML performance and own evaluation of it
  • Clarity of presentation
  • Implementation, technical details, optimisation, etc. (your appendix)
  • Ability to evaluate ML usage (your evaluations of the other presentations)
You will all be presenting your projects on Wednesday the 10th of June 2020 in an all-day presentation frenzy starting at 9:00. We will do our best to ensure your comfort, and at the same time ask you to evaluate the other projects (don't worry - we will not grade based on your evaluations). We require (and will assume) that everybody will be there. If for some reason you cannot be there the full day, write to us with your conflicts and reasons, and we'll do our best to adjust the program or reschedule the exam.


Last updated: 8th of June 2020.