Advanced Methods in Applied Statistics 2023
Lecturer: D. Jason Koskinen
Email: koskinen (at) nbi.ku.dk
TA: "Juno" Chun Lung Chan
Email: chun.lung.chan (at) nbi.ku.dk
Basic Information
- Block 3 - Timetable A of the 2023 academic calendar
- Tues 08:00 - 12:00 and Thurs 08:00 - 12:00 & 13:00 - 17:00
- Actual
- 08:45 - 09:00 Q&A or discussion with Jason
- 09:00 lecture on new material (not 09:05 or 09:15)
- On Thursday there will often be new material starting at 13:00
- On Thursday it is very unlikely that any new material, lectures, or review will happen after 16:00.
- Class Location: øv - bib 4-0-17, Universitetsparken 1-3, DIKU
- Official schedule (link)
- Classes will be composed of ~20-30% lecture and demonstrations, followed by exercises
- While assignments, projects, and exercises can be done in the programming language of the student's choice, the examples and demonstrations will be mainly in Python and/or scientific packages thereof, e.g. SciPy, NumPy, etc.
- Required text or textbooks: None
- 2016 Advanced Methods in Applied Statistics webpage
- 2017 Advanced Methods in Applied Statistics webpage
- 2018 Advanced Methods in Applied Statistics webpage
- 2019 Advanced Methods in Applied Statistics webpage
- 2020 Advanced Methods in Applied Statistics webpage
- 2021 Advanced Methods in Applied Statistics webpage
- 2022 Advanced Methods in Applied Statistics webpage
- It is recommended, but not required, to have taken an introductory course on statistics, e.g. "Applied Statistics - From data to results" which can be found here
Evaluation
The presentation, the problem sets, and the project will all be assigned and submitted via Absalon, so check Absalon for instructions and due dates. The final exam is handled by the eksamen webpage.
- Presentation and 1-2 page summary (10%)
- Graded problem sets (20%)
- Project (30%)
- You may start working on this right now!!
- Final exam (40%)
- 28-hour take-home exam starting on the morning of March 30 and ending on the afternoon of March 31
- The exam will be similar to problem sets 2 and 3
- A handful of more intensive questions as opposed to numerous short questions
- The exam will contain problems from any portion of the course material, excluding guest lectures unless otherwise noted.
- Here are two extra practice problems and the exams for 2016 and 2017
Course Syllabus
The course is very likely to change once we begin, and future lectures listed below serve as an outline. Even so, we are very likely to cover the following topics which may require additional software support:
- Multivariate analysis (MVA) techniques including Boosted Decision Trees (BDTs)
- The MultiNest Bayesian inference tool
- Basis splines
- Markov Chain Monte Carlo
- Likelihood minimization techniques
- Spherical surface pixelization and isotropy (HEALPix)
Class notes will be posted here on this webpage as they become available.
Optional Software Help Session (Jan. 27)
- 13:00-16:00 in room 4-0-17 at DIKU
- Optional session with the Teaching Assistant ("Juno" Chun Lung Chan) for any students who may need assistance with their computer software setup.
- Get a preview of the course and some software tools to install.
Class 1 - Start (Feb. 7)
- Course Information
- Chi-square
- Code chi-square
- Data for exercise 1 (FranksNumbers.txt)
- Review of 'basic' statistics
- Lecture 1
- Start reading paper about how well Gaussian statistics compares to a wide selection of scientific measurements
- "Not Normal: the uncertainties of scientific measurements" link at arXiv or DOI
- We will be discussing the paper in the next class, i.e. on Thursday
Class 2 - Monte Carlo Simulation & Least Squares (Feb. 9)
- Lecture 2
- Monte Carlo (reminder that lecture starts at 09:00)
- Code for area of the circle. Note that the code is provided for illustrative purposes, and not as a piece of code that students are expected to be able to execute without modification.
- Example code from Jean-Loup (2019 TA) in a Jupyter notebook
- Example code from Tania (2021 TA) in a Jupyter notebook
- From the "Not Normal: the uncertainties of scientific measurements" paper:
- For the ambitious, create a 'toy Monte Carlo' of the sample and pair distributions for the nuclear physics data in Sec. 2.A. For simplicity, assume that all the 'quantities' are Gaussian distributed.
- Write functions where you can produce multiple Gaussian distributions to sample from and generate a sample of "12380 measurements, 1437 quantities, 66677 pairs".
- Produce the z-distribution (using Eq. 4) plot for just your toy Monte Carlo and see if it matches a Gaussian, exponential, Student-t distribution, etc.
- Discussion of "Not Normal: the uncertainties of scientific measurements" (arXiv or DOI)
- Included here are some prompt questions to accompany discussion and understanding of the paper
- Least Squares (optional)
- Some useful links
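The Monte Carlo estimate of the area of a circle mentioned in this class can be sketched as below. This is a minimal hit-or-miss illustration using only the Python standard library, and it is not the course's linked code. The function name `estimate_circle_area` and its parameters are made up for this example.

```python
import random


def estimate_circle_area(n_points, radius=1.0, seed=None):
    """Hit-or-miss Monte Carlo estimate of the area of a circle.

    Returns the estimated area and its binomial uncertainty.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_points):
        # Draw a point uniformly in the bounding square [-r, r] x [-r, r].
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        # Count the point if it falls inside the circle.
        if x * x + y * y <= radius * radius:
            hits += 1
    square_area = (2.0 * radius) ** 2
    fraction = hits / n_points
    area = square_area * fraction
    # Binomial uncertainty on the hit fraction, propagated to the area.
    sigma = square_area * (fraction * (1.0 - fraction) / n_points) ** 0.5
    return area, sigma


if __name__ == "__main__":
    area, sigma = estimate_circle_area(100_000, seed=1)
    print(f"estimated area = {area:.4f} +/- {sigma:.4f}")
```

For a unit circle the estimate should approach pi, with an uncertainty that shrinks as one over the square root of the number of points.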
Class 3 - Introduction to Likelihoods and Numerical Minimizers (Feb. 14)
Class 4 - Intro. to Bayesian Statistics & Splines (Feb. 16)
- Lecture 4 on Simple Bayesian statistics
- Using priors, posteriors, and likelihoods
- Example code for exercises from Jason, and example code from Tania
- Lecture 4.5
- Splines
- Data files for one of the exercises.
- Interesting article about use of splines and penalty terms
Class 5 - Parameter Estimation and Confidence Intervals (Feb. 21)
- Reminder: oral presentation and 1-2 page article reports will be due soon
Class 6 - Markov Chain(s) (Feb. 23)
- Lecture 6 Markov Chain Monte Carlo (MCMC)
- Look for an external package for Markov Chain Monte Carlo (MCMC), e.g. emcee
- Just like minimizers, syntax and options matter
- Be familiar with your chosen MCMC package
- Some example Python code for the exercises (caveat emptor)
- Using emcee, the solution is graciously provided by Niccolo Maffezzoli (2017 TA)
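To see what an MCMC package does internally, a bare-bones random-walk Metropolis sampler can be sketched in a few lines. This is not emcee's algorithm (emcee uses an affine-invariant ensemble sampler) and not the course solution. The target (the mean of Gaussian data with known width and a flat prior), the function names, and the step size are all assumptions chosen for illustration.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0):
    """Log-posterior (up to a constant) for a Gaussian mean with a flat prior."""
    return -0.5 * sum((x - mu) ** 2 for x in data) / sigma ** 2


def metropolis(log_prob, start, n_steps, step_size, rng):
    """Random-walk Metropolis sampler for a single parameter.

    Returns the chain (including the start value) and the acceptance fraction.
    """
    chain = [start]
    current_lp = log_prob(start)
    n_accept = 0
    for _ in range(n_steps):
        # Symmetric Gaussian proposal centred on the current position.
        proposal = chain[-1] + rng.gauss(0.0, step_size)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            chain.append(proposal)
            current_lp = proposal_lp
            n_accept += 1
        else:
            chain.append(chain[-1])
    return chain, n_accept / n_steps


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 50 draws from a Gaussian with mean 2 and width 1.
    data = [rng.gauss(2.0, 1.0) for _ in range(50)]
    chain, acceptance = metropolis(
        lambda mu: log_posterior(mu, data),
        start=0.0, n_steps=5000, step_size=0.5, rng=rng,
    )
    samples = chain[1000:]  # discard burn-in
    print(f"posterior mean ~ {sum(samples) / len(samples):.3f}")
    print(f"acceptance fraction = {acceptance:.2f}")
```

The step size plays the same role as the options of a packaged sampler: too small and the chain explores slowly, too large and most proposals are rejected, so the acceptance fraction is worth checking.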
Class 7 - Hypothesis Testing (Feb. 28)
- Lecture 7
- Likelihood ratio
- Data files for one of the exercises. Just use the first column in each file. The second column is unimportant.
Class 8 - Independent work (March 2)
- No new lecture material.
- Time to work on presentation and/or write-up.
- Jason and Juno will be around (in some combination), from 8:30-15:30 in the classroom.
Class 9 - TBD (March 7)
- Maybe something new, but if so the topic would not be part of an assignment or on the final exam.
- Will likely be one of:
- Independent work session
- Topic about sub-threshold anomaly detection in binned data
- Pre-recorded video (available on Absalon) with more content about p-values.
Class 10 - Presentations and Multivariate Analysis techniques (March 9)
- In the morning we are likely to have the presentations from the articles chosen.
- The class will be split in half, with one session being chaired by Jason and the other session chaired by Chun
- Links to some of the previous presentations (2016, 2017, 2018, 2019, 2022)
- This year's presentations can be found at 2023
The Boosted Decision Trees
- Lecture 10
- Data
- Exercise 2 (16 variable file)
- The first column is the index, hence there are 17 'variables', but the index variable is only for bookkeeping and has no impact on whether an event is signal or background.
- Every even row is the 'signal' and every odd row is the 'background'. Thus, there are two rows for each index in the first column: the first is the signal and the second is the background. [Format is odd, but I got it from a colleague].
- Here are the solution data sets separated into two files (benign and malignant) for the last exercise of the lecture. Here is also the (Python) code that I used to establish the efficiency for all the submissions from all the students.
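The interleaved layout of the Exercise 2 file (index in the first column, signal and background on alternating rows) can be separated with a short helper like the one below. This is a sketch under assumptions: the columns are taken to be whitespace-separated, and "even" rows are counted from zero so that the first row of each pair is signal, as the description states. The function name and the tiny three-variable example are made up for illustration and are not the real data.

```python
def split_signal_background(lines):
    """Split interleaved rows into signal and background feature lists.

    Assumes whitespace-separated numeric columns with the bookkeeping
    index in column 0, and that 0-based even rows are signal while
    odd rows are background.
    """
    signal, background = [], []
    # Skip blank lines so they do not shift the even/odd counting.
    rows = [line.split() for line in lines if line.strip()]
    for row_number, fields in enumerate(rows):
        # Drop the index column; it carries no signal/background information.
        features = [float(value) for value in fields[1:]]
        if row_number % 2 == 0:
            signal.append(features)
        else:
            background.append(features)
    return signal, background


if __name__ == "__main__":
    # Made-up example with 3 variables per row instead of 16.
    example = """\
0 0.1 0.2 0.3
0 1.1 1.2 1.3
1 0.4 0.5 0.6
1 1.4 1.5 1.6
"""
    sig, bkg = split_signal_background(example.splitlines())
    print(len(sig), "signal rows,", len(bkg), "background rows")
```

If the real file uses a different delimiter or has a header row, the `split()` call and the row counting would need to be adjusted accordingly.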
Kernel Density Estimator
- KDE Lecture Slides
- On Absalon there is a video in the "Media Gallery" tab for a lecture on using Kernel Density Estimators. The slides will be slightly different than what is linked here, but the lecture content remains very similar and relevant.
Class 11 - Work on Project (March 14)
- No new material.
- Unfortunately neither Jason nor Juno will be available in person, but they may be available via Slack or email.
Class 12 - Statistical Hypothesis Tests and Auto-Correlation (March 16)
- Lecture slides
- Files and some example code
- It is recommended (but not necessary) to have HEALPix software installed on your computer, or some other spherical surface pixelization software. There are options for C, C++, Java, Python, and I see some for MATLAB too. You will be expected to draw plots/graphs using spherical projections, e.g. Mollweide maps.
- No afternoon session
Class 13 - Nested Sampling, Bayesian Inference, and MultiNest (March 21)
- Lecture 13
- External packages for conducting nested sampling, e.g. MultiNest, are necessary and some python options are:
- Very good articles that are easy to read
Class 14 - Work on Project (no lecture or new material - March 23)
Class 15 - Course Review, and Non-Parametric Tests Lecture snippet (March 28)
- Lecture 15 (EXTRA)
- Kolmogorov-Smirnov, Anderson-Darling, and Mann-Whitney U tests
- Won't be covered in class
- Topics include things that may be useful for research
Extra Projects of a more difficult nature, for those who want something more challenging.