Applied Statistics - From data to results (Winter 2018-19)

"Youth is imaginative, and if the imagination can be strengthened by discipline, this energy of imagination can in great measure be preserved through life. The tragedy of the world is that those who are imaginative have but slight experience, and those who are experienced have feeble imagination. Fools act on imagination without knowledge, pedants act on knowledge without imagination. The task of a university is to weld together imagination and experience." [Alfred North Whitehead, English mathematician and philosopher, 1861-1947]
Troels C. Petersen: Lecturer - Associate Professor; NBI - High Energy Physics; Mac user; 35 52 54 42 / 26 28 37 39; petersennbi.dk
Christian Michelsen: Teaching assistant - Ph.D. stud.; NBI - Physics and Genomics; Linux & Mac expert; 50 48 30 95; christianmichelsengmail.com
Vojtech Pacik: Teaching assistant - Ph.D. stud.; NBI - High Energy Physics; Mac expert; 28 25 56 96; vojtech.paciknbi.ku.dk
Etienne Bourbeau: Teaching assistant - Ph.D. stud.; NBI - High Energy Physics; Windows & Linux expert; 91 84 83 28; etienne.bourbeauicecube.wisc.edu
Sebastien G. Manigand: Teaching assistant - Ph.D. stud.; NBI - Astro- and Planetary Physics; Windows & Linux expert; 28 19 46 43; sebastiennbi.ku.dk
Giulia Sinnl: Teaching assistant - Ph.D. stud.; NBI - Ice and Climate; Windows user?; +39 340 989 1562; giulia.sinnlnbi.ku.dk


"Without data, you're just another person with an opinion." [William Edwards Deming, US statistician 1900-1993]
An exam check list and some advice can be found here: Applied Statistics exam check list
Course statistics (of limited warranty) on grading and grades can be found here: Course Grading Statistics


What, when, where, prerequisites, books, curriculum and evaluation:
Content: Graduate statistics course giving an advanced introduction to statistics and data analysis.
Level: Intended for students at 3rd-5th year of studies and new Ph.D. students.
Prerequisites: Math (calculus and linear algebra) and programming experience (any language, but see below).
Note on prerequisites: Programming is an essential tool and necessary for the course!!!
When: Monday 9-12, Tuesday 13-17, and Friday 9-12 (Week Schedule Group B).
Where: Monday+Friday: Lectures in Auditorium 5 at HCO, Exercises: BioCenter
Tuesday: Metropolskolen (Sigurdsgade 26, see course info)
Period: Blok 2 (19th of November 2018 - 18th of January 2019), 7.3 weeks total (missing a Monday and a Friday).
Format: Shorter lectures followed by computer exercises, discussion, and occasionally experiments.
Text book: Roger Barlow: Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences.
Additional literature: Philip R. Bevington: Data Reduction and Error Analysis, Glen Cowan: Statistical Data Analysis.
Programs used: Simple Python (v3.6) and a few packages on top in Jupyter Notebook (see Nature article).
This has pros and cons, both of which are important to know about: Why I don't like notebooks!
Regarding installation, please read our guide. For ERDA related issues, see the ERDA user guide.
Pensum/Curriculum: The course curriculum can be found here, which also contains a more detailed discussion.
Key words: PDFs, Uncertainties, Correlation, Chi-Square, Likelihood, Fitting, Monte Carlo and Data Analysis.
Expected learning: What I expect you to learn is discussed here: Learning objectives
Language: English (occasional Danish utterances!). All exercises, problem sets, exams, notes, etc. are in English.
Evaluation: Problem set (20%), Project (20%), and take-home exam (60%).
Exam: Take-home (28 hour) exam given Thursday the 17th of January 2019 at 8:15.
Grading: Internal examiner evaluation (following the Danish 7-point grading scale)
Credits: 7.5 ECTS (1/8 of an academic year's work, that is 187.5-225 hours of work, thus about 23-28 hours weekly).
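
The workload figures in the Credits entry can be reproduced with a few lines of Python. This is only a sketch: the 60 ECTS and 1500-1800 hours per academic year follow the usual ECTS convention, and the 8-week block length is an assumption chosen to reproduce the weekly range quoted above. None of these three numbers is stated on this page.

```python
# Assumed conventions (not stated on this page):
ECTS_PER_YEAR = 60.0                 # credits in one full academic year
HOURS_PER_YEAR = (1500.0, 1800.0)    # low and high estimate of a year's work
WEEKS_IN_BLOCK = 8.0                 # assumed number of working weeks in the block


def workload(ects, weeks=WEEKS_IN_BLOCK):
    """Return (fraction of a year, total hours range, weekly hours range)."""
    fraction = ects / ECTS_PER_YEAR
    total_hours = tuple(fraction * hours for hours in HOURS_PER_YEAR)
    weekly_hours = tuple(total / weeks for total in total_hours)
    return fraction, total_hours, weekly_hours


fraction, total_hours, weekly_hours = workload(7.5)
print(f"Fraction of academic year: {fraction}")
print(f"Total hours:  {total_hours[0]:.1f} - {total_hours[1]:.1f}")
print(f"Weekly hours: {weekly_hours[0]:.1f} - {weekly_hours[1]:.1f}")
```

With a shorter block (for example `workload(7.5, weeks=7.3)`) the weekly load comes out correspondingly higher.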

Before course start:
Further course information can be found here: Applied Statistics course information
The "course introduction" questionnaire is to be filled out at: Applied Statistics 2018 Questionnaire
List of things to be done by first day of course (Monday the 19th of November): Applied Statistics check list

Python specific precourse considerations:
Check your access to ERDA (requires KU ID), possibly following this guide.
NOTE: Many will already have tried ERDA, but for the right setup/packages, use "Statistics Notebook with Python".
Also, install Python version 3.6 (if you already have it, just add a few packages on top).
User Guides for the minimisation package: Minuit (2004), iminuit (2018). See also Minuit Tutorial (2004).

Check that it all runs with two notebooks called PythonIntro.ipynb and IntroToPlottingAndFitting.ipynb [ERDA: A0TUw59Rcf].
The above programs also exist in "ordinary/real" Python here: PythonIntro.py and IntroToPlottingAndFitting.py.
Note that the plotting and fitting program uses external functions [ERDA: A2c6fHxh52].
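
Before opening the notebooks, a quick check of a local installation can save time. The snippet below is a minimal sketch, not part of the course material: the package list is an assumption (only iminuit is named on this page; numpy, matplotlib and scipy are guessed as typical "packages on top"), and the helper name `check_environment` is hypothetical.

```python
import importlib.util
import sys

# Assumed package list; only iminuit is explicitly mentioned on this page.
REQUIRED_PACKAGES = ("numpy", "matplotlib", "scipy", "iminuit")


def check_environment(min_version=(3, 6), packages=REQUIRED_PACKAGES):
    """Return (version_ok, missing): whether the Python version is recent
    enough, and a list of packages that cannot be found for import."""
    version_ok = sys.version_info[:2] >= tuple(min_version)
    missing = [name for name in packages
               if importlib.util.find_spec(name) is None]
    return version_ok, missing


version_ok, missing = check_environment()
print("Python version:", ".".join(str(v) for v in sys.version_info[:3]),
      "(OK)" if version_ok else "(too old, 3.6 or newer expected)")
print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only tests whether each package can be located, not whether its version is suitable, so running the two introductory notebooks remains the real test.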


"Essentially, all models are wrong, but some are useful". [George E. P. Box, British Statistician, 1919-2013]


Course outline:
Below is the preliminary course outline, subject to changes throughout the course.

Week 0: (Pre-course-start-session) [ERDA: EBY2RuHUgL]
Nov 13: (10:15-12:00): Setting up Python, introduction, basic tips and tricks for Python programming (Aud. A).
  • Introduction to programming, ERDA, notebooks, etc.
  • Introduction to Python: PythonIntro.ipynb and the differences to "regular" Python in PythonIntro.py.
Nov 15: (10:15-12:00): More introduction, help, tips and tricks for Python programming (Aud. A).
  • Fitting and plotting: IntroToPlottingAndFitting.ipynb, which produces a histogram and a graph.
  • Prime numbers/distribution: Calculation: CalcPrimeNumbers.ipynb and plotting: CalcAndPlotPrimeNumbers.ipynb,
         which produces (this figure).

    Week 1 (Introduction, General Concepts, ChiSquare Method): [ERDA: h5UbTa7Luo]
    Nov 19: 8:15-10:00: Intro to course, photos, questionnaire, and table measurements (Aud. A).
         Central limit theorem. Mean, RMS and estimators. Correlation. Significant digits.
    Nov 20: Error propagation (which is a science!). Estimate g measurement uncertainties.
    Nov 23: ChiSquare method. Short Python Q+A (12-13). Formation of Project groups.

    Week 2 (PDFs, Likelihood, Systematic Errors): [ERDA: FxhETpGzhA]
    Nov 26: Probability Density Functions (PDFs), especially Binomial, Poisson and Gaussian. Writing a "Weighted mean" function.
    Nov 27: Principle of maximum likelihood and fitting (which is an art!).
    Nov 30: 8:15 - Group A: Project [ERDA: bEpsim3Efu] (due Sunday the 16th of December), doing experiments in First Lab.
                  9:15 - Group B: Analysis of "Table Measurement data" and discussion of real data analysis (usual rooms).
                  For inspiration/reference, here is a table measurement example solution (Python using ROOT!).

    Week 3 (Using Simulation, Likelihood, Fitting): [ERDA: G5MICBv1j0]
    Dec 3: 8:15 - Group B: Project [ERDA: bEpsim3Efu] (due Sunday the 16th of December), doing experiments in First Lab.
                9:15 - Group A: Analysis of "Table Measurement data" and discussion of real data analysis (usual rooms).
    Dec 4: Producing random numbers and their use in simulations.
    Dec 7: Likelihood fits and Simpson's paradox. Problem set given (due Sunday the 6th of January 2019 at 22:00).
         Introduction to the problem set and its data.
         Here is the associated problem 4.1 data file and a Python script for reading it.
         Here is the associated problem 5.1 data file and a Python script for reading it.
         Shared link to code and data on ERDA: BGPcwKpx10

    Week 4 (Hypothesis Testing and limits): [ERDA: B9Q922ZbiI]
    Dec 10: Hypothesis testing. Simple, Chi-Square, Kolmogorov, and runs tests.
    Dec 11: Limits and confidence intervals. Testing random numbers.
    Dec 14: Table Measurement solution discussion. Estimating pi and N-dim ball volume from simulation.

    Week 5 (Calibration and Advanced Fitting): [ERDA: FqrCU1yFsA]
    Dec 17: Calibration and use of control channels. Project should have been submitted! (along with residuals).
    Dec 18: Advanced fitting and discussion of fitting strategies.
    Dec 21: Evaluation of project results. Summary of curriculum so far. Session on Problem Set.

    Week 6 (Bayes Theorem and Multivariate Analysis): [ERDA: gO5mHZ7lhV]
    Dec 31: Happy New Year.
    Jan 1: Advanced theoretical statistics and heavy proofs (just kidding - there is of course no teaching!).
    Jan 4: Bayes theorem. Multi-Variate Analysis (MVA). The linear Fisher discriminant.
         For exam training, here is Exam2016.pdf, to be discussed on Monday the 14th of January.
         Here is the associated problem 4.1 data file.
         Here is the associated problem 5.1 data file.
         Here is the associated problem 5.2 data file.
         Here is the solution manual for Exam2016.

    Week 7 (Machine Learning and classification): [ERDA: cgccp2cJPx]
    Jan 7: Machine Learning (ML). Neural Networks, Decision Trees and other MLs. Problem set should have been submitted.
    Jan 8: Analysis of real and complex data on separating/classifying events. Analysis of testbeam data (part I).
    Jan 11: Problem set returned and discussed. Analysis of testbeam data (part II).

    Week 8 (Advanced fitting of real data and exam): [ERDA: bX4dvSEQET]
    Jan 14: Advanced fitting. Short deliberation on 2016 exam.
    Jan 15: Summary of course curriculum. Exam questions. Catch up on exercises.
    Jan 17: Exam given (posted on course webpage 8:15 in the morning).
    Jan 18: 12:00 Exam to be handed in (on www.eksamen.ku.dk).

    Week 9 (Returning exam):
    Jan 25: 15:15-16:30ish+ (Aud. 3 at HCO): Exam solution, grades and course feedback.
         Designing experiments (inspired by "A lady tasting tea") with beer tasting? Or just beer...


    "The best thing about being a statistician is that you get to play in everyone else's backyard." [John Tukey, Princeton University]




    Notes and links:
    In addition to the text book and other literature, some notes may be useful during the course:
  • PDG notes on Probability.
  • PDG notes on Statistics.
  • PDG notes on Monte Carlo Techniques.
  • Note on analytical fit of straight line.
  • Note on Frequentist vs. Bayesian statistics and discoveries.
  • Note on rejecting data using Chauvenet's criterion.
  • Nature Physics article on discoveries.
  • Fisher's Exact Test on tea drinking lady.
  • Statistics resources.
  • Online course introducing Machine Learning.
  • Power Comparisons between tests of normality (spoiler alert: Shapiro-Wilk wins!)

    Course comments/praise (very biased selection!):
    "This course overqualified me for a course on scientific computing at Harvard the following Summer."
    [Dennis Christensen (2009 course)]

    "I recommended this course to everyone I know." [Pernille Yde (2009 course)]

    "I don't think that you can rightly call yourself a physicist, if you have not had a course of this type."
    [Bo Frederiksen (2010 course)]

    "My second project in the course led to an article now in review for Nature magazine!" (it was accepted)
    [Ninna Rossen (2011 course)]

    "If you really want to understand your data, you need a course like this."
    [Julius Bier Kirkegaard (2012 course)]

    "I realized that I was very well prepared by this course, when I started working at CERN as a Summer Student."
    [Mathias Heltberg (2013 course)]

    "It is now many years ago, that I followed your course, but there is hardly a day, where I don't think about it"
    [Frederik Beyer (2011 course, in October 2014)]

    "This is without a doubt the single most useful, and possibly most influential, course I have taken during my university education. Thank you."
    [Samuel Walsh (2013 course, in December 2014)]

    "Thanks for a great course. When I think back on my 2.5 years of physics studies, Applied Statistics stands out as some of the most fun and most exciting."
    [Martin Hayhurst Appel (2014 course)]

    "Every single sleepless night spent on this course has enriched my way of thinking."
    [Arianna Marchionne (2015 course)]

    "The best lecturer I had in my 3 years of studies in UCPH."
    [Anonymous (2016 course)]

    "I miss the course very much."
    [Niccolo Maffezzoli (instructor in 2015+2016 course, in 2017 as a PostDoc)]

    "I am able to confirm your course is very demanding but indeed worth working for, for I could spend another 7 weeks on this interesting curriculum!"
    [Jan de Boer 2017, upon having been told, that the course is demanding]

    "This course has been one of the most important aspects of my education so far. I have heard this from earlier students again and again - I am happy to say that I understand why now!"
    [Anonymous, Last line in the evaluation of 2017 course]

    "I learned a lot when I took the course, and still a good deal of things the year after, when I was a TA in the course."
    [Christian Michelsen, student in 2016 and TA in 2017+18 course]

    "I would like to express my gratitude for having had the opportunity to take part in a course as well structured and thorough as yours. You should be an inspiration to all professors at the university."
    [From a student in the 2018 course, despite the person choosing the re-exam!]

    "I wanted to tell you, that this is the best course I ever had. And I've studied at four universities!"
    [Vlad-Andrei Neacsu (2018 course)]

    "What on earth are we going to do with ourselves now that we no longer have Applied Statistics to amuse ourselves with? I miss it already!"
    [Lisa Lolk Hauge - last day of 2018 course (evaluation)]


    Last updated 12th of January 2019.