Advanced Methods in Applied Statistics 2018
Lecturer: D. Jason Koskinen
Email: koskinen (at) nbi.ku.dk
Basic Information
- Block 3 - Timetable A of the 2018 academic
calendar
- Tues 08:00 - 12:00 and Thurs 08:00 - 12:00 & 13:00 - 17:00
- Actual schedule
- 08:00 - 08:30 student study time for both Tues. and Thurs.
- 08:30 - 09:00 Q&A or discussion with Jason in Aud. M
- 09:00 lecture on new material (not 09:05 or 09:15)
- On Thursday there will very often be new material starting at
13:00
- On Thursday it is very unlikely that any new material, lectures,
or review will happen after 16:00.
- Auditorium M at the Blegdamsvej campus
- Odd-numbered classes are 4 hours long, while even-numbered classes consist of 2
blocks of 4 hours each
- Classes will be composed of ~20-30% lecture and demonstrations
followed by exercise
- While assignments, projects, and exercises can be done in the
programming language of the student's choice, the examples and
demonstrations will be mainly in Python and/or scientific packages
thereof, i.e. SciPy, PyROOT, etc.
- Required text or textbooks: None
- 2016 Advanced Methods in Applied Statistics webpage
- 2017 Advanced Methods in Applied Statistics webpage
- It is recommended, but not
required, to have at least reviewed the little sibling to this course,
i.e. "Applied Statistics - From data to results" which can be found here
Evaluation
- Oral presentation and 1-2 page summary (10%)
- ~10 minute summary presentation. Plan on ~7 slides if you are doing
a PowerPoint-type presentation.
- Can work alone or in groups of up to 3.
- A 1-2 page summary including the names of all group members
- Presentation does NOT have to be given by all group members
- Talk or email with Jason if you have questions about the
appropriateness of your article
- Be sure to put down which article you are using here
to avoid duplication
- Example presentation on Finite Monte Carlo article
- Other example articles (and no, you cannot use any of these articles
for your report/presentation):
- Frequency Difference Gating: A Multivariate Method for Identifying
Subsets That Differ Between Samples (article)
- Probability binning comparison: a metric for quantitating
multivariate distribution differences (article)
- FIREFLY MONTE CARLO: EXACT MCMC WITH SUBSETS OF DATA (article)
- This is just a small sample. Ideally, find something related to
your area of research.
- Include the names of the people and the article here by Feb. 26
- The 1-2 page summary as a .pdf file is due
via email. Submission date is March 7 by 16:00 CET
- Presentations will be selected at random and begin during class time
on March 8. At the discretion of Jason and if needed, some
presentations will be postponed for a later date.
- If you have any questions or concerns email Jason.
- Graded problem sets (20%)
- Problem set 1 (5% of total grade)
- Due: Feb. 14, 2018 by 08:30 CET
- Problem set 2 (15% of total grade)
- Project (30%)
- Similar to the oral presentation, this project focuses on using a
method or statistical treatment that is nominally related to your
field of research that you or your group select. Unlike the oral
presentation, the project includes not just understanding and
explaining the method, but also using it on an appropriate data
set of your own choosing.
- Can be done alone or in groups of up to 3 people
- The only hand-in is a 4-6 page written report. You can submit the
code as well if you would like.
- You may start working on this right now!!
- Final exam (40%)
- Must work on your own!
- Take home exam
- 28 hours between start and submission
- Begins at 10:00 CET on Thursday April 5, 2018
- The exam must be submitted by 14:00 CET on Friday April 6,
2018
- The exam will be similar to problem set 2
- A handful of more intensive questions as opposed to numerous short
questions
- While the exam will contain problems from any portion of the
course material, the focus will be more on topics in the latter
portion of the course
- Here are two extra practice problems
similar to what has been on the previous exams
- Here is a link to the 2016
exam
- Here is a link to the 2017
exam
- (+2% to final course
grade average on a 1-100% scale)
- 2018 NCAA Men's Basketball Bracket submission due by tip-off of
initial 1st round game on March 15
- This is NOT a requirement, nor is it an obligation for the course
- Information can be found here
- Due: Thursday March 15 by 17:00 CET
Course Syllabus
The course is 100% likely to change once we begin, and future lectures
listed below serve as an outline. Even so, we are
very likely to cover the following topics which may require
additional software support:
- Multivariate analysis (MVA) techniques including Boosted Decision
Trees (BDTs)
- The MultiNest Bayesian inference tool
- Basis splines
- Markov Chain Monte Carlo
- Likelihood minimization techniques
Class notes will be posted here:
Class 0 - Pre-Course
- Take a look before the class starts (optional)
- Lecture 0
Class 1 - Start (Feb. 6)
- Course Information
- (description
in Kurser)
- Chi-square
- Code chi-square
- Data for exercise 1 (FranksNumbers.txt)
- Review of 'basic' statistics
- Lecture 1
- Be knowledgeable about the Central Limit Theorem
- Start reading paper about how well Gaussian statistics compares to a
wide selection of scientific measurements
- "Not Normal: the uncertainties of scientific measurements" link at arXiv
or DOI
- We will be discussing the paper in the next class, i.e. on Thursday
- First problem set is assigned
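As a rough illustration of the chi-square item above, here is a minimal sketch in plain Python. The measurements, uncertainties, and the straight-line model are invented for illustration; they are not taken from FranksNumbers.txt or from the course code.

```python
def chi_square(observed, expected, sigma):
    """Sum of squared residuals, each weighted by its uncertainty."""
    return sum(((o - e) / s) ** 2 for o, e, s in zip(observed, expected, sigma))

# Hypothetical measurements y at points x, each with uncertainty 0.5
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
sigma = [0.5] * len(x)

# Hypothetical fixed model y = 2x (no parameters are fitted)
model = [2.0 * xi for xi in x]

chi2 = chi_square(y, model, sigma)
ndof = len(x)  # nothing was fitted, so ndof equals the number of points
print(f"chi2 = {chi2:.2f}, chi2/ndof = {chi2 / ndof:.2f}")
```

If parameters were fitted to the data, the number of degrees of freedom would be reduced by the number of fitted parameters.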
Class 2 - Monte Carlo Simulation & Least Squares (Feb. 8)
- Lecture 2
- Monte Carlo (reminder that lecture starts at 09:00)
- Code for area of the
circle
- From the "Not Normal: the uncertainties of scientific measurements" paper:
- For the ambitious, create a 'toy Monte Carlo' of the sample and pair
distributions for the nuclear physics data in Sec. 2.A. For simplicity,
assume that all the 'quantities' are Gaussian distributed
- Write functions where you can produce multiple Gaussian
distributions to sample from and generate a sample of "12380
measurements, 1437 quantities, 66677 pairs".
- Produce the z-distribution (using eq. 4) plot for just your toy
Monte Carlo and see if it matches a Gaussian, exponential, Student's t
distribution, etc.
- Least Squares lecture (starting at 13:00)
- Some useful links
- Discussion of "Not Normal: the uncertainties of scientific
measurements" (arXiv or DOI)
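The area-of-the-circle Monte Carlo exercise listed above could be approached along the following lines. This is a minimal hit-or-miss sketch, not the course's own code; the function name, seed, and number of points are arbitrary choices.

```python
import math
import random

def circle_area_mc(radius, n_points, rng):
    """Estimate a circle's area by uniform sampling in its bounding square."""
    hits = 0
    for _ in range(n_points):
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            hits += 1
    square_area = (2.0 * radius) ** 2
    # Fraction of points inside the circle times the area of the square
    return square_area * hits / n_points

rng = random.Random(42)  # fixed seed so that the run is reproducible
estimate = circle_area_mc(radius=1.0, n_points=100_000, rng=rng)
print(f"estimate = {estimate:.4f}, exact = {math.pi:.4f}")
```

The statistical uncertainty of the estimate shrinks roughly as one over the square root of the number of sampled points.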
Class 3 - Introduction to Likelihoods and Numerical Minimizers (Feb. 13)
- Lecture 3
- Maximum likelihood method
- Gradient descent and minimizers
- Example code from Niccolo
(TA in 2017) and some from Jason
(course lecturer)
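To connect the two Lecture 3 topics, the sketch below estimates the mean of Gaussian toy data by running plain gradient descent on the negative log-likelihood. It is an independent illustration, not the example code from Niccolo or Jason. The width is assumed known (sigma = 1), and the learning rate and step count are arbitrary choices.

```python
import random

def nll_gradient(mu, data):
    """Per-point derivative of the Gaussian negative log-likelihood
    (sigma fixed to 1) with respect to mu."""
    return -sum(x - mu for x in data) / len(data)

def gradient_descent(data, mu_start=0.0, learning_rate=0.1, n_steps=500):
    """Step repeatedly against the gradient to minimize the NLL."""
    mu = mu_start
    for _ in range(n_steps):
        mu -= learning_rate * nll_gradient(mu, data)
    return mu

rng = random.Random(1)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]  # toy data, true mu = 3

mu_hat = gradient_descent(data)
sample_mean = sum(data) / len(data)  # analytic maximum likelihood estimate
print(f"gradient descent: {mu_hat:.4f}, sample mean: {sample_mean:.4f}")
```

For this simple likelihood the analytic answer is the sample mean, which gives a direct check that the minimizer has converged.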
Class 4 - Intro. to Bayesian Statistics & Splines (Feb. 15)
- Lecture 4 on Simple Bayesian
statistics
- Using priors, posteriors, and likelihoods
- Example code for
exercises from Jason
- Lecture 4.5
- Splines
- Data files for one of the exercises.
- Interesting article about use of splines and penalty terms
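The interplay of priors, likelihoods, and posteriors mentioned under Lecture 4 can be illustrated with a grid calculation. This is a minimal sketch assuming a binomial likelihood with a hypothetical outcome of 7 successes in 10 trials and a flat prior; it is not the exercise code from the lecture.

```python
def grid_posterior(k, n, prior, grid):
    """Posterior on a grid: prior times binomial likelihood, normalized to 1.
    The binomial coefficient is a constant and cancels in the normalization."""
    unnormalized = [prior(p) * p**k * (1.0 - p) ** (n - k) for p in grid]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def flat_prior(p):
    return 1.0

grid = [i / 1000 for i in range(1001)]  # success probability from 0 to 1
posterior = grid_posterior(k=7, n=10, prior=flat_prior, grid=grid)

post_mean = sum(p * w for p, w in zip(grid, posterior))
p_map = grid[posterior.index(max(posterior))]  # location of the posterior maximum
print(f"posterior mean = {post_mean:.3f}, posterior maximum at p = {p_map:.3f}")
```

Swapping `flat_prior` for a different function of `p` shows how the choice of prior shifts the posterior.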
Class 5 - Background Subtraction and sPlots (Feb. 20)
Class 6 - Markov Chain(s) (Feb. 22)
- Be sure to have an external package for Markov Chain Monte Carlo
(MCMC), e.g. emcee, PyMC
- Just like minimizers, syntax and options matter
- Be familiar with your chosen MCMC package
- Lecture 6 Markov Chain Monte
Carlo (MCMC)
- Some example python code for the exercises (caveat emptor)
- Using
PyMC, which wasn't the greatest package (at least last year),
but it got the job done
- Using
emcee, the solution is graciously provided by Niccolo Maffezzoli
(2017 TA)
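For orientation before choosing a package, the core of MCMC can be written in a few lines. The sketch below is a plain random-walk Metropolis sampler in standard-library Python. It uses neither emcee nor PyMC and is not one of the solutions linked above; the target (a standard normal), step size, and burn-in length are illustrative assumptions.

```python
import math
import random

def log_target(x):
    """Log of an unnormalized standard normal density (toy target)."""
    return -0.5 * x * x

def metropolis(log_prob, x0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler with a Gaussian proposal."""
    x, logp = x0, log_prob(x0)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_prob(proposal)
        # Accept with probability min(1, p_new / p_old)
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        chain.append(x)  # a rejected proposal repeats the current point
    return chain

rng = random.Random(7)
chain = metropolis(log_target, x0=5.0, n_steps=50_000, step_size=1.0, rng=rng)
samples = chain[5_000:]  # discard the burn-in away from the starting point

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

Dedicated packages add features such as ensemble moves and convergence diagnostics, but the accept/reject step follows the same idea.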
Class 7 - Parameter Estimation and Confidence Intervals (Feb. 27)
- Reminder: oral presentation and 1-2 page article reports will be
due/covered March 7 and 8 (look here)
Class 8 - Hypothesis Testing (March 1)
- Lecture 8
- Likelihood ratio
- Data files for one of the exercises. Just use the first column in each
file. The second column is unimportant.
Class 9 - Statistical Hypothesis Tests (March 6)
- Guest lecture by Markus Ahlers
- Lecture slides
- Files and some example code
- Be sure to have HEALPix software installed on your computer. There are
options for C, C++, Java, Python, and I see some MATLAB too.
Class 10 - Presentations and Multivariate Analysis techniques (March 8)
- In the morning we will have the oral presentations from the articles
chosen
- Links to some of the presentations (2016, 2017)
The following lecture will be covered on March 15 in the afternoon. It
had to be postponed due to the in-class student presentations and
follow-up discussions.
- Boosted Decision Trees
- Lecture 10
- Data
- Exercise 2 (16 variable file)
- The first column is the index, hence there are 17 'variables', but
the index variable is only for bookkeeping and has no impact on
whether an event is signal or background.
- Every even row is the 'signal' and every odd row is the
'background'. Thus, there are two rows for each index in the first
column: the first is the signal and the second is the background.
[Format is odd, but I got it from a colleague].
- Here are the solution data sets separated into two files (benign
and malignant) for the last exercise of the lecture. Here is also
the (Python) code that I used to establish the efficiency for all the
submissions from all the students
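The alternating-row layout of the 16-variable file described above can be separated into signal and background as follows. This is a minimal sketch that assumes whitespace-separated columns and counts rows from zero, so that the first row of each index pair is signal. The function name and the inline sample rows are invented for illustration and do not come from the actual file.

```python
def split_signal_background(lines):
    """Split alternating rows into signal (even rows) and background (odd rows),
    dropping the bookkeeping index stored in the first column."""
    signal, background = [], []
    rows = (line for line in lines if line.strip())  # skip blank lines
    for row_number, line in enumerate(rows):
        values = [float(v) for v in line.split()]
        features = values[1:]  # column 0 is only the index
        if row_number % 2 == 0:
            signal.append(features)
        else:
            background.append(features)
    return signal, background

# Invented sample with 2 variables per row instead of 16
sample = [
    "0 1.0 2.0",
    "0 0.5 0.1",
    "1 1.2 2.2",
    "1 0.4 0.3",
]
signal, background = split_signal_background(sample)
print(len(signal), "signal rows,", len(background), "background rows")
```

Passing an open file object instead of the `sample` list works the same way, since the function only iterates over lines.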
Class 11 - Data Driven Density Estimation (non-parametric) (March 13)
Class 12 - Confidence Intervals, Failures, and Feldman-Cousins (March 15)
Note that because of the change in schedule, this will be a long day
with activities and course material probably using all the time from
09:00-12:00 and 13:00-17:00. Bring snacks!
- Guest lecture by Dr. Morten Medici
- Under/over coverage in hypothesis tests
- Flip-flopping confidence intervals and corrections via ranking and use
of Feldman-Cousins unified approach
- Paper about
unified approach by G. Feldman and R. Cousins
- We will cover the Multivariate Analysis techniques, specifically Boosted
Decision Trees originally intended for Class 10 in the
afternoon.
Class 13 - Nested Sampling, Bayesian Inference, and MultiNest (March 20)
- Lecture 13
- External packages for conducting nested sampling, e.g. MultiNest, are
necessary and some python options are:
- Very good articles that are easy to read
Class 14 - Work on Project (no lecture or new material)
Class 15 - Course Review, Discussion on Frequentist and Bayesian concepts,
and Non-Parametric Tests Lecture snippet (April 3)
- Review and recap of a few topics
covered in the course
- Discussion about some Frequentist and Bayesian concepts
- The written project accounting for 30% of the
total course grade is due on April 3
- Lecture 15 (EXTRA)
- Kolmogorov-Smirnov, Anderson-Darling, and Mann-Whitney U tests
- Won't be covered in class
- Topics include things that may be useful for research
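Since the non-parametric tests are not covered in class, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in standard-library Python. It computes only the test statistic, not a p-value, and is not taken from the Lecture 15 material; the toy samples are invented.

```python
import bisect
import random

def ks_two_sample_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d_max = 0.0
    # The empirical CDFs only change at data points, so checking those suffices
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max

rng = random.Random(3)
same_1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
same_2 = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]

d_same = ks_two_sample_statistic(same_1, same_2)
d_shifted = ks_two_sample_statistic(same_1, shifted)
print(f"same distribution: D = {d_same:.3f}, shifted: D = {d_shifted:.3f}")
```

A larger statistic indicates a larger discrepancy between the two samples. Turning it into a p-value requires the Kolmogorov distribution, which statistics packages provide.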
Extra Projects of a more difficult nature, for those who want something
more challenging.