Applied Statistics - From data to results (Winter 2020-21)

"The art of drawing conclusions from experiments and observations consists in evaluating probabilities and in estimating whether they are sufficiently great or numerous enough to constitute proofs.
This kind of calculation is more complicated and more difficult than it is commonly thought to be."
[Antoine Lavoisier, French chemist 1743-1794]
Troels C. Petersen: Lecturer - Associate Professor, NBI - High Energy Physics, Mac user, 35 52 54 42 / 26 28 37 39, petersen@nbi.dk
Giulia Sinnl: Teaching assistant - Ph.D. stud., NBI - Ice and Climate, Windows expert, 50 30 93 20, giulia.sinnl@nbi.ku.dk
Zuzana Moravcova: Teaching assistant - Ph.D. stud., NBI - High Energy Physics, Mac expert, 50 12 00 33, moravcova@nbi.ku.dk
Anna Suliga: Teaching assistant - Ph.D. stud., NBI - Astrophysics, Linux expert, 91 87 06 28, anna.suliga@nbi.ku.dk
John Weaver: Teaching assistant - Ph.D. stud., NBI - Astronomy, Mac & Linux expert, 50 21 38 31, john.weaver@nbi.ku.dk
Nikki Arendse: Teaching assistant - Ph.D. stud., NBI - Cosmology, Mac expert, +31 637 448 001, nikki.arendse@nbi.ku.dk


"Without data, you're just another person with an opinion." [William Edwards Deming, US statistician 1900-1993]

What, when, where, prerequisites, books, curriculum and evaluation:
Content: Graduate statistics course giving an advanced introduction to statistics and data analysis.
Level: Intended for students at 3rd-5th year of studies and new Ph.D. students.
Prerequisites: Math (calculus and linear algebra) and programming experience (preferably Python, but see note below).
Note on prerequisites: Programming is an essential tool and necessary for the course!!!
When: Monday 8:15-12:00, Tuesday 13:15-17:00, and Friday 8:15-12:00 (Week Schedule Group B).
Note on morning lectures: After the first three weeks, we will start 9:15 on Mondays and Fridays.
Where: Lectures: By Zoom (only).
Exercises: In Frederiksberg on Thorvaldsensvej 40 (mostly), see KU Room Schedule plan.
Period: Blok 2 (16th of November 2020 - 22nd of January 2021 including exam), 9 weeks total.
Format: Shorter lectures followed by computer exercises, discussion, and occasionally experiments.
Textbook: Roger Barlow: Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences.
Additional literature: Philip R. Bevington: Data Reduction and Error Analysis, which is a great down-to-earth introduction to statistics.
Glen Cowan: Statistical Data Analysis, which is a shorter, modern introduction to statistics and data analysis.
Programs used: Simple Python (v3.8+) and a few packages on top in Jupyter Notebook (see Nature article).
This has pros but also cons, both of which are important to know about, e.g. Why I don't like notebooks!
For an introduction to ERDA and related issues, see the ERDA user guide.
Exercise/code repository: All code used for the exercises of the course will be at the AppliedStatisticsNBI GitHub.
Slack channel: The course Slack channel is: NbiAppliedStatistics2020.slack.com.
Pensum/Curriculum: The course curriculum covers chapters 1-8 + 10 with many exceptions, detailed in the link.
Key words: PDFs, Uncertainties, Correlation, Chi-Square, Likelihood, Fitting, Monte Carlo and Data Analysis.
Expected learning: What I expect you to learn is discussed here: Learning objectives
Language: English (occasional Danish utterances!). All exercises, problem sets, exams, notes, etc. are in English.
Evaluation: Problem set (20%), Project (20%), and take-home exam (60%).
Exam: Take-home exam (36 hours!), starting on Thursday the 21st of January 2021 at 8:00 and ending on Friday the 22nd of January at 20:00.
Assessment (Censur): Internal examiner evaluation (following the Danish 7-point grading scale).
Credits: 7.5 ECTS (1/8 of an academic year's work, that is 187.5-225 hours of work, thus about 23-28 hours weekly).


Before course start:
Further course information can be found here: Applied Statistics course information.
The "course introduction" questionnaire is to be filled out at: Applied Statistics 2020 Questionnaire.
List of things to be done by first day of course (Monday the 16th of November): Applied Statistics check list.
For an overview of the course curriculum, please see the overview video (560 MB, 18 min.) (audio) and overview PDF.
Please use this list of students (having filled out the questionnaire) for finding fellow students and collaborators.


Python specific precourse considerations:
The source of all code for this course is the NBI Applied Statistics github repository.
For a quick introduction to the basic git commands, please see the git cheat sheet.
Check that you have access to ERDA (requires KU ID), as an alternative to running on your own computer. Use "Statistics Notebook with Python".
Also, install Python (version 3.8+) as described in README.md, and put a few packages on top.
User Guides for the Minuit minimisation package: iminuit (2018). Perhaps, also see: Minuit Tutorial (2004).

"Essentially, all models are wrong, but some are useful." [George E. P. Box, British statistician 1919-2013]


Course outline:
Below is the preliminary course outline (subject to changes and updates throughout the course).

Problem set: (due Sunday 3rd of January 2021 at 22:00)
     The problem set and the associated data files can be found here:
     Here is the associated data_DNAfraction.txt data file for problem 2.3
     Here is the associated data_Cells.txt data file for problem 4.1
     Here is the associated data_BetaCalibration.txt data file for problem 5.2

Week 1 (Introduction, General Concepts, ChiSquare Method):
Nov 16: 8:15-10:00: Introduction to course and overview of curriculum.
     Mean and Standard Deviation. Correlations. Significant digits. Central limit theorem.
Nov 17: Error propagation (which is a science!). Estimate g measurement uncertainties.
Nov 20: ChiSquare method, evaluation, and test. Short Python Q+A (12-13). Formation of Project groups.

Week 2 (PDFs, Likelihood, Systematic Errors):
Nov 23: Probability Density Functions (PDF) especially Binomial, Poisson and Gaussian.
Nov 24: Principle of maximum likelihood and fitting (which is an art!).
Nov 27: 8:15 - Group A: Project (for Sunday the 13th of December) doing experiments in First Lab.
              9:15 - Group B: Systematic Uncertainties and analysis of "Table Measurement data". Discussion of real data analysis (usual rooms).

Week 3 (Using Simulation and More Fitting):
Nov 30: 8:15 - Group B: Project (for Sunday the 13th of December) doing experiments in First Lab.
              9:15 - Group A: Systematic Uncertainties and analysis of "Table Measurement data". Discussion of real data analysis (usual rooms).
Dec 1: Producing random numbers and their use in simulations.
Dec 4: Writing "Weighted mean" function. Fitting strategies.

Week 4 (Hypothesis Testing and limits):
Dec 7: Hypothesis testing. Simple, Chi-Square, Kolmogorov, and runs tests.
Dec 8: Table Measurement solution discussion. Testing your random (?) numbers.
Dec 11: Limits and confidence intervals and Simpson's paradox.

Week 5 (Multivariate Analysis and Calibration):
Dec 14: Bayes theorem. Multi-Variate Analysis (MVA). The linear Fisher discriminant. Project should have been submitted!
Dec 15: Calibration and use of control channels.
Dec 18: Evaluation of project results. Summary of curriculum so far. Session on Problem Set.

Week 5-and-a-half (As you like...):
Dec 21: No teaching!
Dec 22: No teaching!

     For exam training, here is Exam2016.pdf, to be discussed briefly on Monday the 18th of January.
     Here is the associated problem 4.1 data file.
     Here is the associated problem 5.1 data file.
     Here is the associated problem 5.2 data file.
     Here is the solution manual for Exam2016.

Week 6 (Machine Learning and real data classification/analysis):
Jan 4: Machine Learning (ML). Neural Networks, Decision Trees and other MLs. Problem set should have been submitted.
Jan 5: Analysis of real and complex data on separating/classifying events. Analysis of testbeam data (part I).
Jan 8: Shorter lecture on binning and Machine Learning. Analysis of testbeam data (part II).

Week 7 (Advanced fitting, data pipeline, and Problem Set deliberation):
Jan 11: Advanced fitting with both functions and models.
Jan 12: Data pipelines (by Gabriel Brammer).
Jan 15: Problem Set deliberation. Using Monte Carlo for determining pi!

Week 8 (Fitting and exam):
Jan 18: Deliberation on previous (2016) exam. Exercise on fitting.
Jan 19: Summary of course curriculum. Exam questions. Catch up on exercises.
Jan 21: Exam given (posted on course webpage 8:00 in the morning).
Jan 22: 20:00 Exam to be handed in (on www.eksamen.ku.dk).

Week 10 (Returning exam):
Feb 5: 15:15-16:30ish+: Exam solution, grades and course feedback (Link to final session).
     Designing experiments (inspired by "A lady tasting tea") with beer tasting? Or just beer...


"The best thing about being a statistician is that you get to play in everyone else's backyard." [John Tukey, Princeton University]




Notes and links:
In addition to the text book and other literature, some notes may be useful during the course:
  • PDG notes on Probability.
  • PDG notes on Statistics.
  • PDG notes on Monte Carlo Techniques.
  • Note on analytical fit of straight line.
  • Note on Frequentist vs. Bayesian statistics and discoveries.
  • Note on rejecting data using Chauvenet's criterion.
  • Nature Physics article on discoveries.
  • Fisher's Exact Test on tea drinking lady.
  • Statistics resources.
  • Online course introducing Machine Learning.
  • Power Comparisons between tests of normality (spoiler alert: Shapiro-Wilk wins!)

    Course comments/praise (very biased selection!):
    "This course overqualified me for a course on scientific computing at Harvard the following Summer."
    [Dennis Christensen (2009 course), Venture Cup winner and now researcher at DTU Energy]

    "I recommended this course to everyone I know."
    [Pernille Yde (2009 course), now Head of Section of Data Science Lab at Statistics Denmark]

    "I don't think that you can rightly call yourself a physicist, if you have not had a course of this type."
    [Bo Frederiksen (2010 course)]

    "My second project in the course led to an article now in review for Nature Communications magazine!" (it was accepted)
    [Ninna Rossen (2011 course), now Assistant Professor, Biotech Research & Innovation Centre (BRIC)]

    "If you really want to understand your data, you need a course like this."
    [Julius Bier Kirkegaard (2012 course), now senior PostDoc in BioComplexity group]

    "I realized that I was very well prepared by this course, when I started working at CERN as a Summer Student."
    [Mathias Heltberg (2013 course), now senior PostDoc in BioComplexity group]

    "It is now many years ago, that I followed your course, but there is hardly a day, where I don't think about it"
    [Frederik Beyer (2011 course, in October 2014)]

    "This is without a doubt the single most useful, and possibly most influential, course I have taken during my university education. Thank you."
    [Samuel Walsh (2013 course, in December 2014)]

    "Thanks for a great course. When I think back on my 2.5 years of physics studies, Applied Statistics stands out as some of the most fun and most exciting."
    [Martin Hayhurst Appel (2014 course)]

    "Every single sleepless night spent on this course has enriched my way of thinking."
    [Arianna Marchionne (2015 course)]

    "The best lecturer I had in my 3 years of studies in UCPH."
    [Anonymous (2016 course)]

    "I miss the course very much."
    [Niccolo Maffezzoli (instructor in 2015+2016 course, in 2017 as a PostDoc)]

    "I am able to confirm your course is very demanding but indeed worth working for, for I could spend another 7 weeks on this interesting curriculum!"
    [Jan de Boer 2017, upon having been told, that the course is demanding]

    "This course has been one of the most important aspects of my education so far. I have heard this from earlier students again and again - I am happy to say that I understand why now!"
    [Anonymous, Last line in the evaluation of 2017 course]

    "I learned a lot when I took the course, and still a good deal of things the year after, when I was a TA in the course."
    [Christian Michelsen, student in 2016 and TA in 2017+18 course]

    "I would like to express my gratitude for having had the opportunity to take part in a course as well structured and thoroughly executed as yours. You ought to be an inspiration to all professors at the university."
    [From a student in the 2018 course, despite the person choosing the re-exam!]

    "I wanted to tell you, that this is the best course I ever had. And I've studied at four universities!"
    [Vlad-Andrei Neacsu (2018 course)]

    "Thank you for the amazing course. My view on measurements errors and statistics will never be the same."
    [Valdemaras Petrosius (2019 course)]

    "Thank you for the passing of knowledge. It was a mad statistical adventure, much more 'Applied' and worthwhile than any course I ever took!"
    [Che Fall (2019 course, writing from his native Canada after a non-optimal exam)]

    "Dear Troels. Thank you for an amazing course. Taking statistics to a level, where more than 100 students hang on your every word deserves more than just a pair of socks, but nonetheless, we hope that they will bring you as much joy as you have brought us."
    [Johann, Jonas, Jakob, and Christian (2019 course, with a pair of socks!)]

    "The lectures are simply a joy to witness. If only all lecturers were like these, KU would likely be number 1 in terms of having a good time learning (and the p-value of that is 0.99)."
    [Anonymous, 2020 evaluations for Ph.D. students]

    "This course should be part of every physics Bachelor curriculum as it provides essential tools for scientific work."
    [Anonymous, Last line in the evaluation of 2020 course]


    Last updated 21st of January 2021.