{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 1 — Variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "D. Jason Koskinen\n",
    "Dec. 22, 2015\n",
    "\n",
    "*(Modernized by J-L Tastet 2019-01-30)*\n",
    "\n",
    "The exercise herein is to take a .txt file\n",
    "and read in multiple data sets of X and Y\n",
    "input and calculate basic statitics quantities.\n",
    "\n",
    "Do everything in normal python arrays and then\n",
    "use the numpy converter to put the data into\n",
    "numpy arrays. Why? Because numpy arrays are hard.\n",
    "\n",
    "The data set is somehwat of a classic in statistics\n",
    "and is known as Anscombe's quartet.\n",
    "\n",
    "Relevant links:\n",
    "* [Lecture notes](http://www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2019/Lecture1_Basics_ChiSquare.pdf)\n",
    "* [Dataset](http://www.nbi.dk/%7Ekoskinen/Teaching/AdvancedMethodsInAppliedStatistics2018/data/FranksNumbers.txt)\n",
    "* NumPy user guide: https://docs.scipy.org/doc/numpy/user/\n",
    "* NumPy documentation: https://docs.scipy.org/doc/numpy/reference/\n",
    "* SciPy library documentation: https://docs.scipy.org/doc/scipy/reference/\n",
    "* String formatting in Python 3:\n",
    "  * [Good StackOverflow question](https://stackoverflow.com/questions/13945749/string-formatting-in-python-3)\n",
    "  * https://pyformat.info/\n",
    "  * ['f-strings' in Python 3.6+](https://realpython.com/python-f-strings/#python-f-strings-the-pesky-details)\n",
    "* Python 2 countdown: https://pythonclock.org/\n",
    "* Moving to Python 3: https://python3statement.org/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import numpy as np\n",
    "import scipy as sp\n",
    "from scipy import stats as stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "infile = io.open(\"./FranksNumbers.txt\") # You may need to modify this path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Making an empty array to fill with arrays (of arrays!). Normally arrays of arrays is a bad sign, but it will work out fine this time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "metaArray = []\n",
    "\n",
    "for line in infile.readlines():\n",
    "    splLine = line.split()\n",
    "    if len(splLine) == 3:# This is when the data sets change\n",
    "        metaArray.append([])\n",
    "    # end if len()\n",
    "    if len(splLine) == 0 or (not splLine[0].isdigit()):\n",
    "        continue\n",
    "    # end not if\n",
    "    \n",
    "    # read in from text is generally a string so make sure\n",
    "    # to explicitly cast the variable as a float\n",
    "    x = float(splLine[0])\n",
    "    y = float(splLine[1])\n",
    "    metaArray[-1].append([x,y])\n",
    "# end for line"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert the array of arrays of arrays into a list of 2D numpy arrays, so that nice calculations can be made with ease.\n",
    "\n",
    "We have to use a list instead of a 3D array because the 2D arrays it contains have different dimensions.\n",
    "\n",
    "To build the list, we use a list comprehension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = [np.asarray(arr2d) for arr2d in metaArray]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The analysis is the same for each dataset, so let's write a function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_chi_squared(a, i):\n",
    "    \"\"\"\n",
    "    a: List of arrays containing all datasets\n",
    "    i: Index of the dataset for which we want to compute the χ²\n",
    "    \"\"\"\n",
    "    slope = 0.48\n",
    "    intercept = 3.02\n",
    "    # The following code 'flattens' the tuple, which then includes\n",
    "    # the x-values (1st column in the file) as part of the set over\n",
    "    # which to compute the variance.\n",
    "    # but actually we just want to compute the variance of the y-values.\n",
    "    wrong_var = np.var(a[i])\n",
    "    print(f\"Variance for dataset {i}: {wrong_var:.3f} (WRONG VALUE)\")\n",
    "\n",
    "    # The following code tells numpy (via the axis=0) to calculate\n",
    "    # the variance over the\n",
    "    # separate data columns (x and y), where we're mostly interested in the\n",
    "    # variance in y. Also, there are two ways to think of the\n",
    "    # exercise as written in the lecture notes:\n",
    "    # A) you are given the line and therefore the degrees of freedom\n",
    "    # are equal to the number of data points, or\n",
    "    # B) the variance should be calculated using the 'unbiased'\n",
    "    # estimator (shown on slide 3) which corrects the\n",
    "    # degrees of freedom to be N-1. By default numpy uses\n",
    "    # that the change to the degrees of freedom (ddof) is zero.\n",
    "    # Ergo, for an unbiased estimator we maybe, possibly, kinda, sort of,\n",
    "    # should use N-1 stead of N. Also, Troels said that he stressed this\n",
    "    # in his class, so all of the students from his course should\n",
    "    # know this.\n",
    "    biased_var = np.var(a[i], axis=0, ddof=0)[1]\n",
    "    print(f\"Variance for dataset {i}: {biased_var:.3f} (CORRECT VALUE FOR BIASED VARIANCE)\")\n",
    "    unbiased_var = np.var(a[i], axis=0, ddof=1)[1]\n",
    "    print(f\"Variance for dataset {i}: {unbiased_var:.3f} (CORRECT VALUE FOR UNBIASED VARIANCE)\")\n",
    "    linreg = stats.linregress(a[i])\n",
    "    print(f\"linear regression:  y={linreg[0]:0.2f}x + {linreg[1]:0.2f}\")\n",
    "    \n",
    "    # just get the y-values, i.e. the observed data.\n",
    "    # Note that this is more easily done if the data sets\n",
    "    # have the exact numbers of entries, unlike here. The\n",
    "    # difference is where you put the [:,1] and whether it\n",
    "    # is necessary to 'recreate' a new numpy array.\n",
    "    \n",
    "    observed = sp.array(a[i])[:,1]\n",
    "    expected = []\n",
    "    chisq_value     = 0\n",
    "    chisq_valuewith = 0\n",
    "\n",
    "    # loop over all the data points in the data set\n",
    "    # to calculate the expected values of y at each\n",
    "    # value of x.\n",
    "    for j in range(0, len(a[i])):\n",
    "        x = a[i][j][0]\n",
    "        y = x*slope + intercept\n",
    "        expected.append(y)\n",
    "        chisq_value += (y - observed[j])**2/y\n",
    "        chisq_valuewith += (y - observed[j])**2/(1.22**2)\n",
    "    # end for x,y\n",
    "    \n",
    "    print(\"chi-squared By hand:    {:.3f}\".format(chisq_value))\n",
    "    print(\"chi-squared From SciPy: {:.3f}\".format(stats.chisquare(observed,expected)[0]))\n",
    "    print(\"chi-squared (w/ ±1.22 uncertainty):    {:.3f}\".format(chisq_valuewith))\n",
    "    print(\"Reduced chi-squared:    {:.3f}\".format((chisq_value)/(len(a[i]))))\n",
    "    print(\"Reduced chi-squared (w/ ±1.22 uncertainty):    {:.3f}\".format((chisq_valuewith)/(len(a[i]))))\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute $\\chi^2$ for each dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Variance for dataset 0: 7.438 (WRONG VALUE)\n",
      "Variance for dataset 0: 3.752 (CORRECT VALUE FOR BIASED VARIANCE)\n",
      "Variance for dataset 0: 4.127 (CORRECT VALUE FOR UNBIASED VARIANCE)\n",
      "linear regression:  y=0.50x + 3.00\n",
      "chi-squared By hand:    1.887\n",
      "chi-squared From SciPy: 1.887\n",
      "chi-squared (w/ ±1.22 uncertainty):    9.468\n",
      "Reduced chi-squared:    0.210\n",
      "Reduced chi-squared (w/ ±1.22 uncertainty):    1.052\n",
      "\n",
      "\n",
      "Variance for dataset 1: 7.438 (WRONG VALUE)\n",
      "Variance for dataset 1: 3.752 (CORRECT VALUE FOR BIASED VARIANCE)\n",
      "Variance for dataset 1: 4.128 (CORRECT VALUE FOR UNBIASED VARIANCE)\n",
      "linear regression:  y=0.50x + 3.00\n",
      "chi-squared By hand:    2.072\n",
      "chi-squared From SciPy: 2.072\n",
      "chi-squared (w/ ±1.22 uncertainty):    9.477\n",
      "Reduced chi-squared:    0.230\n",
      "Reduced chi-squared (w/ ±1.22 uncertainty):    1.053\n",
      "\n",
      "\n",
      "Variance for dataset 2: 7.436 (WRONG VALUE)\n",
      "Variance for dataset 2: 3.748 (CORRECT VALUE FOR BIASED VARIANCE)\n",
      "Variance for dataset 2: 4.123 (CORRECT VALUE FOR UNBIASED VARIANCE)\n",
      "linear regression:  y=0.50x + 3.00\n",
      "chi-squared By hand:    1.555\n",
      "chi-squared From SciPy: 1.555\n",
      "chi-squared (w/ ±1.22 uncertainty):    9.460\n",
      "Reduced chi-squared:    0.173\n",
      "Reduced chi-squared (w/ ±1.22 uncertainty):    1.051\n",
      "\n",
      "\n",
      "Variance for dataset 3: 7.436 (WRONG VALUE)\n",
      "Variance for dataset 3: 3.748 (CORRECT VALUE FOR BIASED VARIANCE)\n",
      "Variance for dataset 3: 4.123 (CORRECT VALUE FOR UNBIASED VARIANCE)\n",
      "linear regression:  y=0.50x + 3.00\n",
      "chi-squared By hand:    2.043\n",
      "chi-squared From SciPy: 2.043\n",
      "chi-squared (w/ ±1.22 uncertainty):    9.454\n",
      "Reduced chi-squared:    0.227\n",
      "Reduced chi-squared (w/ ±1.22 uncertainty):    1.050\n",
      "\n",
      "\n",
      "Variance for dataset 4: 7.437 (WRONG VALUE)\n",
      "Variance for dataset 4: 3.750 (CORRECT VALUE FOR BIASED VARIANCE)\n",
      "Variance for dataset 4: 3.837 (CORRECT VALUE FOR UNBIASED VARIANCE)\n",
      "linear regression:  y=0.50x + 3.00\n",
      "chi-squared By hand:    7.556\n",
      "chi-squared From SciPy: 7.556\n",
      "chi-squared (w/ ±1.22 uncertainty):    37.858\n",
      "Reduced chi-squared:    0.180\n",
      "Reduced chi-squared (w/ ±1.22 uncertainty):    0.901\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for i in range(0,len(a)):\n",
    "    compute_chi_squared(a, i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a larger questions here related to calculation of the chi-squared value; we can do it, but if we do not know actually what the data is (money, number of cows, speed of a toddler, etc.) can the chi-squared or the reduced chi-squared tell use anything meaningful?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}