Applied Statistics - Project 2

The purpose of Project 2 is for you to use your newly acquired statistical skills on data of your own choosing. Only by applying the methods yourself do you realize their powers and weaknesses, and gain the experience they require. This is at the same time a chance for you to affiliate yourself with some of the groups at NBI, and to take part in the research that goes on here.

Requirements:
You are free to choose any dataset you like; however, it should fulfill the following very loose requirements:
  • There must be 1000+ data points/measurements.
  • These must not be from (simple) simulations.
  • You should apply a hypothesis test on them.
  • Prepare an 8-10 minute presentation for Monday the 16th of January.
    Thus, you don't need to write a paper, but simply prepare slides that can be presented in 8-10 minutes. If you have a lot of material, you are welcome to include it in backup/bonus slides.

Ideas and proposals for projects:
Project 2 is meant as an opportunity to throw yourself at data in your favorite field of research, or one you would like to explore. Talk to research groups around NBI or elsewhere, as surely they all have data, or aspects of it, that they never got around to analyzing.
If you don't have any data to analyze, I have some interesting and illustrative yet relatively simple data sets, some of which are listed below.
  • Murders in Denmark 1980-2014 (about 1500 observations).
         MurderGenderAge.zip and NotesOnMurdersInDenmark.txt.
  • Rutherford's Experiment (seeing atoms!).
         This data is obtained "live" in collaboration with Ian Bearden.
  • UFO sightings 1950-2014 (about 75000 observations).
         UFOdata.txt and readUFOdata.py.
  • Tidal forces from two gravimeters (10000+ observations).
         TidalData_GDK.dat TidalData_GEK.dat and NotesOnTidalData.txt.
  • LHC 2015 "V0" data (10,000,000+ observations).
         This data set is too large to be posted (24 GB), but can be obtained along with programs to read it from Troels Petersen.

Finally, the following two links contain many datasets of various sizes: Quora large datasets and Kaggle data competitions.

Suggestions, comments and advice:
While you are free to do your project as you see fit, here are a few pieces of advice and suggestions:
  • Start the presentation by presenting your motive/aim and the data you will use to test your hypothesis. State how much data you have and its format (no, not "It is a text file of 2 MB").
  • Describe what you do to the data in enough detail that others will be able to redo what you did.
  • Set up hypothesis tests (or whatever you're looking at) for each subject stated in the opening, and state the result quantitatively.
  • Your abstract should be very short (5-10 lines), and it should also summarize your results.
  • Most importantly, you should think about what figures you want to include, and how to make them the best possible. They should contain as much information as possible (be your main result!) while not getting cluttered and hard to read. Remember, they will be 75% of what people see, refer to, include in slides and posters, and take away.
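As a starting point for such a figure, here is a minimal matplotlib sketch of a labeled, uncluttered histogram. The data are randomly generated stand-ins for your own measurements, and the file name and axis labels are placeholders:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Invented stand-in for real measurements:
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=1.5, size=1000)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(data, bins=40, histtype="step", label="Data (N = 1000)")
ax.set_xlabel("Measured value [units]")  # always label axes, with units
ax.set_ylabel("Entries / bin")
ax.legend(frameon=False)
fig.savefig("figure1.png", dpi=150)
```

The design choice here is deliberate: one message per figure, axes labeled with units, and the sample size stated in the legend, so the plot stands on its own in a slide.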

Great project2 examples:
Two great examples of analysing some fun/strange/weird data can be found here:
Was 2016 especially dangerous for celebrities?
How many famous people will die in 2017?
Given the many deaths of prominent artists (David Bowie, Prince, George Michael and Leonard Cohen), Jason Crease and the MIT Media Lab found data and put together analyses to test this hypothesis. Let these be a source of inspiration for your project2.

Comments:
Enjoy, have fun, and throw yourself boldly at the data.




Last updated: 7th of December 2017.