Applied Statistics - From data to results (Winter 2022-23)

"Essentially, all models are wrong, but some are useful". [George E. P. Box, British Statistician, 1919-2013]

Troels C. Petersen: Lecturer, Associate Professor, High Energy Physics, Mac user, Course responsible, petersen@nbi.dk
Mathias Heltberg: Assistant lecturer, Senior PostDoc, Bio Complexity, Mac expert, Continuity responsible, heltberg@nbi.ku.dk
Kate M. L. Gould: Teaching assistant, Ph.D. student, Astrophysics, Mac & Linux expert, Zoom responsible, katriona.gould@nbi.ku.dk
Rajeeb Sharma: Teaching assistant, Ph.D. student, Astrophysics, Mac & Windows expert, Slack responsible, rajeeb.sharma@nbi.ku.dk
Emma Ynill Lenander: Teaching assistant, Ph.D. student, Condensed Matter, Mac expert, Lab coord. responsible, emma.lenander@nbi.ku.dk
Ting-Yi Lu: Teaching assistant, Ph.D. student, Astrophysics, Mac expert, Zoom responsible, tingyi-lu@nbi.ku.dk
Malthe Skytte Nordentoft: Teaching assistant, Ph.D. student, Bio Complexity, Mac & Linux expert, GitHub responsible, malthe.nielsen@nbi.ku.dk
Phone numbers: 26 28 37 39, 26 19 18 89, 71 40 24 15, 31 56 82 10, 52 68 03 92, 28 90 19 02

"Without data, you're just another person with an opinion." [William Edwards Deming, US statistician 1900-1993]


Course information:
What, when, where, prerequisites, books, curriculum and evaluation:
Content: Graduate statistics course giving an advanced introduction to statistics and data analysis.
Level: Intended for physics (and science) students in their 3rd-5th year of studies and new Ph.D. students.
Prerequisites: Math (calculus and linear algebra) and programming experience (preferably Python, but there is no language requirement).
Note on prerequisites: Programming is an essential tool and necessary for the course!!!
When: Monday 8:15-12:00, Tuesday 13:15-17:00, and Friday 8:15-12:00 (Week Schedule Group B).
Note on morning lectures: After the first three weeks, we will start at 9:15 on Mondays and Fridays.
Where: Lectures: Auditorium 4 at HCO.
Exercises: DIKU bib 4-0-17 and NBB 01.0.G.064/070 (Mondays+Tuesdays), changing places (Fridays), see KU Room Schedule plan.
Period: Block 2 (21st of November 2022 - 20th of January 2023 including exam), 8 weeks total.
Format: Shorter lectures followed by computer exercises, discussion, and occasionally experiments.
Text book: Roger Barlow: Statistics: A guide to the use of statistics.
Additional literature: Philip R. Bevington: Data Reduction and Error Analysis, which is a great down-to-earth introduction to statistics.
Glen Cowan: Statistical Data Analysis, which is a shorter, modern introduction to statistics and data analysis.
Programs used: Python (v3.8+) and a few packages on top in Jupyter Notebook (see Nature article).
Jupyter Notebooks have pros and cons, both of which are important to know about; see e.g. "Why I don't like notebooks!".
Exercise/code repository: All code used for the exercises of the course will be at the AppliedStatisticsNBI GitHub.
Slack channel: The course Slack channel is: NbiAppliedStatistics2022.slack.com.
Pensum/Curriculum: The course curriculum covers chapters 1-8 + 10 in Barlow with many exceptions, detailed in the link.
Key words: PDFs, Uncertainties, Correlation, Chi-Square, Likelihood, Fitting, Monte Carlo methods, and Data Analysis.
Expected learning: What I expect you to learn is discussed here: Learning objectives
Language: English (occasional Danish utterances!). All exercises, problem sets, exams, notes, etc. are in English.
Evaluation: Problem set (20%), Project (20%), and take-home exam (60%).
Exam: Take-home (36 hours!) exam given Thursday the 19th of January 2023 at 8:00.
The exam will start on Thursday the 19th at 8:00 and end on Friday the 20th of January at 20:00 (36 hours in total).
Credits/Censur: 7.5 ECTS with internal censor evaluation (following the Danish 7-step scale)

"Statistical thinking will one day be as necessary a qualification for efficient citizenship as the ability to read and write." [H.G. Wells]



Before course start:
For an overview of the course curriculum, please see the overview video (1.1 GB, 12 min.) and overview PDF.
Further course information can be found here: Applied Statistics course information.


Course outline:
Below is the preliminary course outline (subject to changes and updates throughout the course).

Week 1 (Introduction, General Concepts, ChiSquare Method):
Nov 21: 8:15-10:00: Introduction to course and overview of curriculum.
     Mean and Standard Deviation. Correlations. Significant digits. Central limit theorem. (12-13 Measuring in Aud. A!)
Nov 22: Error propagation (which is a science!). Estimate g measurement uncertainties.
Nov 25: ChiSquare method, evaluation, and test. Formation of Project groups.

Week 2 (PDFs, Likelihood, Systematic Errors):
Nov 28: Probability Density Functions (PDFs), especially Binomial, Poisson and Gaussian.
Nov 29: Principle of maximum likelihood and fitting (which is an art!).
Dec 2: 8:15 - Group A: Project (for Wednesday the 14th of December) doing experiments in First Lab.
            9:15 - Group B: Systematic Uncertainties and analysis of "Table Measurement data". Discussion of real data analysis in Aud. A (+B+C+Ma).

Week 3 (Using Simulation and More Fitting):
Dec 5: 8:15 - Group B: Project (for Wednesday the 14th of December) doing experiments in First Lab.
            9:15 - Group A: Systematic Uncertainties and analysis of "Table Measurement data". Discussion of real data analysis in Aud. A (+B+C+Ma).
Dec 6: Producing random numbers and their use in simulations.
Dec 9: 8:15: Summary of curriculum so far. 9:15 Fitting strategies and Simpson's Paradox (Note: last day starting 8:15).

Week 4 (Hypothesis Testing and limits):
Dec 12: Table Measurement solution discussion. Exercises: Work on analysis of project data.
Dec 13: Hypothesis testing. Simple, Chi-Square, Kolmogorov, and runs tests.
     Project should be submitted by Wednesday the 14th of December at 22:00!
Dec 16: More hypothesis testing, limits, and confidence intervals. Testing your random (?) numbers.

Week 5 (Bayesian statistics and Calibration):
Dec 19: Bayes' theorem and Bayesian statistics (Mathias).
Dec 20: Calibration and use of control channels.

     For exam training, here is Exam2016.pdf, to be discussed briefly on Tuesday the 17th of January.
     Here is the associated problem 4.1 data file.
     Here is the associated problem 5.1 data file.
     Here is the associated problem 5.2 data file.
     Here is an approximate solution manual.

     Due to public demand, here is Exam2018.pdf.
     Here is the associated problem 4.1 data file.
     Here is the associated problem 5.1 data file.
     Here is an approximate solution manual.

Week 6 (Introduction to Machine Learning, multivariate analysis, and real data classification/analysis):
Jan 2: Machine Learning (ML). Neural Networks, Decision Trees and other MLs. Exercise: ML and/or Problem Set!
Jan 3: Multi-Variate Analysis (MVA). The linear (Fisher) discriminant (compared to PCA).
     Problem set should be submitted by Tuesday the 3rd of January at 22:00!
Jan 6: Analysis of real and complex data on separating/classifying events. Analysis of testbeam data.

Week 7 (Problem Set deliberation, Advanced fitting, and time series):
Jan 9: Discussion of Problem Set solution (grades given Friday the 13th!). Discussion of Analysis of testbeam data.
Jan 10: Advanced fitting with functions, with models, and in 2D.
Jan 13: Time series analysis (Mathias).

Week 8 (Fitting and exam):
Jan 16: Deliberation on previous (2016) exam. Discussion of fitting philosophy.
Jan 17: Discussion of selected parts of course curriculum. Exam questions. Catch up on exercises.
Jan 19: Exam given (posted on course webpage 8:00 in the morning).
Jan 20: 20:00 Exam to be handed in (on www.eksamen.ku.dk).


"The art of drawing conclusions from experiments and observations consists in evaluating probabilities and in estimating whether they are sufficiently great or numerous enough to constitute proofs. This kind of calculation is more complicated and more difficult than it is commonly thought to be."
[Antoine Lavoisier, French chemist 1743-1794]




Notes and links:
In addition to the text book and other literature, some notes may be useful during the course:
  • PDG notes on Probability.
  • PDG notes on Statistics.
  • PDG notes on Monte Carlo Techniques.
  • Note on analytical fit of straight line.
  • Note on Frequentist vs. Bayesian statistics and discoveries.
  • Note on rejecting data using Chauvenet's criterion.
  • Nature Physics article on discoveries.
  • Fisher's Exact Test on the tea-drinking lady.
  • Statistics resources.
  • Online course introducing Machine Learning.
  • Power Comparisons between tests of normality (spoiler alert: Shapiro-Wilk wins!)

    Course comments/praise (very biased selection!):
    "Dear Troels. I am writing to you because I took your course, 'Applied Statistics', back in 2012. I was (am) very enthusiastic about that course and have used the knowledge countless times since."
    [Rune Gjermundbo, Director of Business Operations, 2022]

    "If you feel a lack of sun in the Danish Winter you can see one of his lectures where he shines bright."
    [Anonymous, Last line in the 2021 evaluations]

    "The lectures are simply a joy to witness. If only all lecturers were like these, KU would likely be number 1 in terms of having a good time learning (and the p-value of that is 0.99)."
    [Anonymous, 2020 evaluations for Ph.D. students]

    "This course should be part of every physics Bachelor curriculum as it provides essential tools for scientific work."
    [Anonymous, Last line in the evaluation of 2020 course]

    Comments from previous years

    Last updated 14th of January 2023.