Lecturer: D. Jason Koskinen

Email: koskinen (at) nbi.ku.dk

- Block 3 - Timetable A of the 2019 academic calendar
- Tues 08:00 - 12:00 and Thurs 08:00 - 12:00 & 13:00 - 17:00
- Actual
- 08:30 - 09:00 Q&A or discussion with Jason in the classroom
- **09:00** lecture on new material (not 09:05 or 09:15)
- On Thursday there will very often be new material starting at 13:00
- On Thursday it is very unlikely that any new material, lectures, or review will happen after 16:00.
- There are multiple locations depending on the day (timetable):
- Tuesday is at øv - bib 4-0-17, Universitetsparken 1-3, DIKU
- Thursday morning is at Aud 10, Universitetsparken 5, HCØ
- Thursday afternoon is at Aud 06, Universitetsparken 5, HCØ
- Odd-numbered classes are 4-hours while even-numbered consist of 2 blocks of 4-hours
- Classes will be composed of ~20-30% lecture and demonstrations followed by exercises
- While assignments, projects, and exercises can be done in the programming language of the student's choice, the examples and demonstrations will be mainly in Python and/or scientific packages thereof, e.g. SciPy, PyROOT, etc.
- Required text or textbooks: None
- 2016 Advanced Methods in Applied Statistics webpage
- 2017 Advanced Methods in Applied Statistics webpage
- 2018 Advanced Methods in Applied Statistics webpage
- It is recommended, but not required, to have at least reviewed the little sibling to this course, i.e. "Applied Statistics - From data to results" which can be found here

- ~10 minute summary presentation. Plan on ~7 slides if you are doing a PowerPoint-type presentation.
- Can work alone or in groups of up to 3
- A 1-2 page summary including the names of all group members
- I strongly encourage people to use LaTeX for the typesetting of the written summary. For those who do not already have a style, I would recommend trying the formatting style for submission to journals published by the American Physical Society, downloadable here
- Presentation does NOT have to be given by all group members
- Talk or email with Jason if you have questions about the appropriateness of your article
- Be sure to put down which article you are using here to avoid duplication
- Example presentation on Finite Monte Carlo article
- Other example articles (and no, you cannot use any of the listed articles below for your report/presentation):
- Frequency Difference Gating: A Multivariate Method for Identifying Subsets That Differ Between Samples (article)
- Probability binning comparison: a metric for quantitating multivariate distribution differences (article)
- FIREFLY MONTE CARLO: EXACT MCMC WITH SUBSETS OF DATA (article)
- This is just a small sample. Ideally, find something related to your area of research
- Include the names of the group members and the article

- The 1-2 page summary as a .pdf file is due via email. Submission date is **March 6 by 16:00 CET**

- If you have any questions or concerns email Jason

- Problem set 1 (5% of total grade)
- Due: Feb. 13, 2019 by 08:30 CET
- Problem set 2 (15% of total grade)
- Due: March 25, 2019 by 16:00 CET
- Partial Solutions for Problem Set 2

- Similar to the oral presentation, this project focuses on using a method or statistical treatment that is nominally related to your field of research that you or your group select. Unlike the oral presentation, the project includes not just understanding and explaining the method, but also using it on an appropriate data set of your own choosing.
- Can be done alone or in groups of up to 3 people
- The only hand-in is a 4-6 page written report. You can submit the code as well if you would like.
- The project should be formatted and written as if it were a conference proceeding
- Abstract, introduction, formatted figures w/ captions, citations, etc...
- I strongly encourage people to use LaTeX for the typesetting. For those who do not already have a style, I would recommend trying the formatting style for submission to journals published by the American Physical Society, downloadable here
- Here is an example of a fantastic write-up from 2017.
- You may start working on **this right now!!**
- Due: April 2 by 22:00 CET

- Must work on your own!
- Take home exam
- 28 hours between start and submission
- Begins at 10:00 CET on April 4, 2019
- The exam must be submitted by 14:00 CET on April 5, 2019
- The exam will be similar to problem set 2
- A handful of more intensive questions as opposed to numerous short questions
- While the exam will contain problems from any portion of the course material, the focus will be more on topics in the latter portion of the course
- Here are two extra practice problems similar to what has been on the previous exams
- Here is a link to the 2016 exam
- Here is a link to the 2017 exam

- 2019 NCAA Men's Basketball Bracket submission due by tip-off of initial 1st round game on March 21
- This is NOT a requirement, nor is it an obligation for the course
- Information can be found here
- Due: March 21, 2019

The course is 100% likely to change once we begin, and future lectures listed below serve as an outline. Even so, we are very likely to cover the following topics which may require additional software support:

- Multivariate analysis (MVA) techniques including Boosted Decision Trees (BDTs)
- The MultiNest Bayesian inference tool
- Basis splines
- Markov Chain Monte Carlo
- Likelihood minimization techniques

**Class 0 - Pre-Course**

- Take a look before the class starts (optional)
- Get a preview with the course Teaching Assistant (Jean-Loup Tastet) of some software tools to install
- Lecture 0

- Course Information
- (description in Kurser)
- Chi-square
- Code chi-square
- Data for exercise 1 (FranksNumbers.txt)
- Review of 'basic' statistics
- Lecture 1
- Jason's python code for exercise 1
- Jean-Loup's python 3 code as a Jupyter notebook for exercise 1
- Be knowledgeable about the Central Limit Theorem
- Start reading paper about how well Gaussian statistics compares to a wide selection of scientific measurements

- Lecture 2
- Monte Carlo (reminder that lecture starts at 09:00)
- Code for area of the circle
- Example code from Jean-Loup in a Jupyter notebook
- From the "Not Normal: the uncertainties of scientific measurements" paper:
- For the ambitious, create a 'toy Monte Carlo' of the sample and pair distributions for the nuclear physics data in Sec. 2.A. For simplicity, assume that all the 'quantities' are Gaussian distributed.
- Write functions where you can produce multiple Gaussian distributions to sample from and generate a sample of "12380 measurements, 1437 quantities, 66677 pairs".
- Produce the z-distribution plot (using eq. 4) for just your toy Monte Carlo and see if it matches a Gaussian, exponential, Student's t-distribution, etc.
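The "area of the circle" exercise is commonly approached with hit-or-miss Monte Carlo sampling. The following is a minimal sketch of that idea and not the course's linked code; the function name `circle_area_mc`, the seed, and the number of points are illustrative assumptions.

```python
import numpy as np


def circle_area_mc(n_points, radius=1.0, seed=0):
    """Estimate the area of a circle by hit-or-miss Monte Carlo.

    Returns the estimated area and its binomial uncertainty.
    """
    rng = np.random.default_rng(seed)
    # Throw points uniformly in the bounding square [-r, r] x [-r, r].
    x = rng.uniform(-radius, radius, n_points)
    y = rng.uniform(-radius, radius, n_points)
    # A point is a "hit" if it lands inside the circle.
    inside = x**2 + y**2 <= radius**2
    frac = inside.mean()
    square_area = (2.0 * radius) ** 2
    area = square_area * frac
    # Binomial uncertainty on the hit fraction, scaled to an area.
    sigma = square_area * np.sqrt(frac * (1.0 - frac) / n_points)
    return area, sigma


area, sigma = circle_area_mc(100_000)
print(f"area = {area:.4f} +/- {sigma:.4f} (exact: {np.pi:.4f})")
```

The uncertainty shrinks as one over the square root of the number of points, so a factor of 100 more points buys roughly one extra digit of precision.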

- Least Squares lecture (starting at 13:00)
- Some useful links

- Lecture 3
- Maximum likelihood method
- Gradient descent and minimizers
- Example code for exercise 1 and exercises 2 & 3 from Jean-Loup (TA in 2018 & 2019), Niccolo (TA in 2017), some from Jason (course lecturer)

- Remember that the first assignment is due on **Wednesday**

**Class 4 - Intro. to Bayesian Statistics & Splines (Feb. 14)**

- Lecture 4 on Simple Bayesian statistics
- Using priors, posteriors, and likelihoods
- Example code for exercises from Jason

- Lecture 4.5
- Splines
- Data files for one of the exercises.
- Interesting article about use of splines and penalty terms

**Class 5 - Parameter Estimation and Confidence Intervals (Feb. 18)**

- Lecture 5 Confidence intervals
- Numerical minimizers for best-fit values
- Data file for one of the exercises (extra data file)

- Reminder: oral presentation and 1-2 page article reports will be due/covered soon
- Article about Supernova first detection time. Look at the caption for the Supplementary Fig. 8

**Class 6 - Markov Chain(s) (Feb. 21)**

- Look for an external package for Markov Chain Monte Carlo (MCMC), e.g. emcee, PyMC
- Just like minimizers, syntax and options matter
- Be familiar with your chosen MCMC package
- Lecture 6 Markov Chain Monte Carlo (MCMC)
- Some example python code for the exercises (caveat emptor)
- Using PyMC, which wasn't the greatest package (at least in 2017 and 2018), but it got the job done
- Using emcee, the solution is graciously provided by Niccolo Maffezzoli (2017 TA)
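Before committing to a package, it can help to see the core algorithm that such packages build on. The sketch below is a plain random-walk Metropolis sampler written with NumPy only; it does not use the emcee or PyMC APIs, and the target distribution (a Gaussian with mean 2 and width 0.5), the step size, and the burn-in length are all illustrative assumptions rather than anything taken from the lecture.

```python
import numpy as np


def metropolis(log_prob, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1D target.

    log_prob returns the log of the (unnormalized) target density.
    Returns the chain and the fraction of accepted proposals.
    """
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    x = x0
    lp = log_prob(x)
    n_accept = 0
    for i in range(n_steps):
        # Symmetric Gaussian proposal centred on the current point.
        x_new = x + step_size * rng.normal()
        lp_new = log_prob(x_new)
        # Accept with probability min(1, p(x_new) / p(x)).
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = x_new, lp_new
            n_accept += 1
        chain[i] = x  # a rejected step repeats the current point
    return chain, n_accept / n_steps


def log_gauss(x, mu=2.0, sigma=0.5):
    """Log of an unnormalized Gaussian (hypothetical toy target)."""
    return -0.5 * ((x - mu) / sigma) ** 2


chain, acc = metropolis(log_gauss, x0=0.0, n_steps=20_000, step_size=1.0)
samples = chain[2_000:]  # discard burn-in
print(f"mean = {samples.mean():.3f}, std = {samples.std():.3f}, acceptance = {acc:.2f}")
```

The step size plays the same role as the tuning options in a real package: too small and the chain crawls, too large and almost every proposal is rejected.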

**Class 7 - Hypothesis Testing (Feb. 26)**

- Lecture 7
- Likelihood ratio
- Data files for one of the exercises. Just use the first column in each file. The second column is unimportant.

**Class 8 - Data Driven Density Estimation (non-parametric) (Feb. 28)**

- Kernel Density estimation
- Lecture 8

**Class 9 - Confidence Intervals, Failures, and Feldman-Cousins (March 5)**

- Guest lecture by Dr. Morten Medici
- Under/over coverage in hypothesis tests
- Flip-flopping confidence intervals and corrections via ranking and use of Feldman-Cousins unified approach
- Paper about unified approach by G. Feldman and R. Cousins

**Class 10 - Presentations and Multivariate Analysis techniques (March 7)**

- In the morning we will have the oral presentations from the articles chosen
- Link to some of the 2019 presentations

The Boosted Decision Tree lecture will be covered on March 14 in the afternoon due to the length of the excellent in-class student presentations and follow-up discussions.

**Class 11 - Divergence Between Distributions and Template Matching (March 12)**

- Guest Lecture by Prof. Andrew "Andy" Jackson
- Lecture notes on Kullback-Leibler (part 1) and Template Matching (part 2) (PDF, PowerPoint)
- Kullback-Leibler divergence as a way to compare the sameness (or tension) of two distributions, also known as a 'measure of surprise' or 'relative entropy'.
- Kullback-Leibler exercises
- Relevant Publications:
- Original publication by Kullback & Leibler
- Application of Kullback-Leibler divergence for properties of the Cosmic Microwave Background
- Strict, or semi-strict, template matching compares data to predetermined 'templates'. Shortcomings of this approach will be covered.
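As a concrete illustration of the Kullback-Leibler divergence as 'relative entropy', here is a minimal sketch for two binned (discrete) distributions, D(P||Q) = sum_i p_i ln(p_i / q_i). It is not taken from the guest lecture notes; the function name `kl_divergence` and the two example distributions are assumptions chosen only to show the behaviour.

```python
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in nats.

    p and q are non-negative weights on the same bins; both are
    normalized to unit sum before the divergence is computed.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # bins with p = 0 contribute nothing to the sum
    if np.any(q[mask] == 0):
        # Q assigns zero probability where P does not: infinite surprise.
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Hypothetical two-bin example distributions.
p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(f"D(P||Q) = {d_pq:.4f}, D(Q||P) = {d_qp:.4f}")
```

The two directions give different values, which is why the divergence is not a true distance between distributions.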

- Guest lecture by Markus Ahlers
- Lecture slides
- Files and some example code
- Data files in .FITS format: eventmap1.fits and truemap1.fits
- Some example code (all in python): C1_produce.py C1_show.py KS_produce.py KS_show.py maxLH_produce.py maxLH_show.py powerspectrum.py twopoint.py Ylm.py
- **Be sure** to have HEALPix software installed on your computer. There are options for C, C++, Java, Python, and I see some MATLAB too.

**Class 12 - Statistical Hypothesis Tests and Auto-Correlation (March 14)**

- Boosted Decision Trees
- Lecture 10
- Data
- Exercise 1 (training signal, training background, testing signal, testing background)
- Exercise 2 (16 variable file)
- The first column is the index, hence there are 17 'variables', but the index variable is only for bookkeeping and has no impact on whether an event is signal or background.
- Every even row is the 'signal' and every odd row is the 'background'. Thus, there are two rows for each index in the first column: the first is the signal and the second is the background. [Format is odd, but I got it from a colleague].
- Here are the solution data sets separated into two files (benign and malignant) for the last exercise of the lecture. Here is also the (python) code that I used to establish the efficiency for all the submissions from all the students
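The interleaved layout of the 16 variable file can be untangled with array slicing. The sketch below assumes the file has already been loaded into a 2D array (for example with `np.loadtxt`, whose delimiter settings depend on the actual file), that counting rows from zero makes the first row of each pair the signal, and that column 0 is the index. The helper name `split_signal_background` and the toy table are made up for illustration and use 2 variables instead of 16.

```python
import numpy as np


def split_signal_background(table):
    """Split an interleaved table into (signal, background) features.

    Assumes rows alternate signal, background, signal, ... and that
    column 0 is a bookkeeping index, which is dropped.
    """
    table = np.asarray(table, dtype=float)
    if table.shape[0] % 2 != 0:
        raise ValueError("expected an even number of rows (signal/background pairs)")
    signal = table[0::2]      # rows 0, 2, 4, ...
    background = table[1::2]  # rows 1, 3, 5, ...
    # Each index should appear once as signal and once as background.
    if not np.array_equal(signal[:, 0], background[:, 0]):
        raise ValueError("index column does not pair up rows as assumed")
    return signal[:, 1:], background[:, 1:]


# Made-up toy table, only to show the layout.
toy = np.array([
    [0, 1.0, 2.0],  # index 0, signal
    [0, 5.0, 6.0],  # index 0, background
    [1, 1.5, 2.5],  # index 1, signal
    [1, 5.5, 6.5],  # index 1, background
])
sig, bkg = split_signal_background(toy)

# Feature matrix and labels (1 = signal, 0 = background) for a classifier.
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])
print(X.shape, y)
```

Dropping the index column before training matters because a classifier could otherwise pick up on it even though it carries no physical information.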

**Class 13 - Nested Sampling, Bayesian Inference, and MultiNest (March 19)**

- Lecture 13
- External packages for conducting nested sampling, e.g. MultiNest, are necessary and some python options are:
- pymultinest (https://johannesbuchner.github.io/PyMultiNest/)
- nestle (http://kbarbary.github.io/nestle/)
- SuperBayeS (http://www.ft.uam.es/personal/rruiz/superbayes/?page=main.html)
- Very good articles that are easy to read
- Excellent and readable paper by developer John Skilling on nested sampling (http://www.inference.phy.cam.ac.uk/bayesys/nest.pdf)
- Read up until the section "The Density of States". There will be a **discussion at the end of class**.
- MultiNest academic papers

**Class 14 - Work on Project (no lecture or new material - March 21)**

**Class 15 - Course Review, Discussion on Frequentist and Bayesian concepts, and Non-Parametric Tests Lecture snippet (March 26)**

- Review and recap of a few topics covered in the course

- Lecture 15 (EXTRA)
- Kolmogorov-Smirnov, Anderson-Darling, and Mann-Whitney U tests
- *Won't be covered in class*
- Topics include things that may be useful for research

Extra Projects of a more difficult nature, for those who want something more challenging.

- Parameter Goodness-of-fit (PG) in Global physics fits