// ----------------------------------------------------------------------------------- // /* ROOT macro for fitting the number of urban areas with populations above a certain number. The urban areas of the world have almost always grown in size, and while the largest areas (> 3.000.000 inhabitants) are easily identified, it is harder to keep track of the medium size ones (2.000.000 < inhabitants < 3.000.000) and especially of the smaller (1.000.000 < inhabitants < 2.000.000). A way to estimate the number of medium size areas is by extrapolating from the those more populous, which is the goal of this exercise. Given a dataset, which lists the largest urban areas, these are to be fitted in order to give an estimate of the number of medium size urban areas. References: Cowan: Chapter 7 Bevington: Chapter 9 Author: Troels C. Petersen (NBI/CERN) Email: Troels.Petersen@cern.ch Date: 17th of September 2012 */ // ----------------------------------------------------------------------------------- // // Include from C++: #include #include #include #include #include

// Include from ROOT:
#include #include #include #include // ----------------------------------------------------------------------------------- //
double func_pop(double *x, double *par) {
// ----------------------------------------------------------------------------------- //
  if (x[0] - par[2] > 0.0) return par[0] + par[1] / pow((x[0] - par[2]), par[3]);
  else return 9999999999.0;
}



// ----------------------------------------------------------------------------------- //
void CityPop2() {
// ----------------------------------------------------------------------------------- //

  // Set the showing of statistics and fitting results (0 means off, 1111 means all on):
  gROOT->Reset();
  gStyle->SetOptStat(1111);
  gStyle->SetOptFit(1111);

  // Set the graphics:
  gStyle->SetStatBorderSize(1);
  gStyle->SetCanvasColor(0);
  gStyle->SetPalette(1);


  // ------------------------------------------------------------------ //
  // Read data and assign uncertainties:
  // ------------------------------------------------------------------ //

  // Open data file:
  FILE *data = fopen("DataSet1.txt","r");

  const int Nmax = 250;
  int dummy, n = 0;
  int population[Nmax], area[Nmax], density[Nmax];
  float growth[Nmax];
  char* info1[Nmax], info2[Nmax];

  // x (urban area population ranking) and y (urban area population) with errors:
  float x_plc[Nmax], ex_plc[Nmax];
  float y_pop[Nmax], ey_pop[Nmax];

  // Loop over and read data as long as there is data (i.e. not End-Of-File (EOF)).
  while (fscanf(data, "%d %d %d %d %lf %s %s \n", &dummy,
                &population[n], &area[n], &density[n], &growth[n],
				&info1[n], &info2[n]) != EOF) {
    x_plc[n]  = float(n);
    ex_plc[n] = 0.0;
    y_pop[n]  = float(population[n]);
    ey_pop[n] = 20000.0 + y_pop[n] / 100.0;    // A guess for the uncertainty!
    
    printf(" %3d:  %9.0f +- %6.0f     %7d   %5d \n",
	   n+1, y_pop[n], ey_pop[n], area[n], density[n]);
    n++;
  }
  printf(" Found %d entries. \n", n);

  fclose(data);



  // ------------------------------------------------------------------ //
  // Fit data:
  // ------------------------------------------------------------------ //

  TGraphErrors *graph_pop = new TGraphErrors(n, x_plc, y_pop, ex_plc, ey_pop);

  // Define fitting function (see above) and its range and number of parameters:
  TF1 *fit_pop = new TF1("fit_pop", func_pop, 10.0, 240.0, 4);
  // Give the fit a good starting point:
  fit_pop->SetParameters(2500000.0, 250000000.0, -12.0, 1.3);

  // Make canvas and histogram for defining ranges:
  canvas = new TCanvas("canvas","", 650, 20, 600, 450);
  canvas->SetFillColor(0);
  canvas->SetLogy();
  TH1F *histo = canvas->DrawFrame(0.0, 500000.0, 250.0, 50000000.0);
  histo->GetXaxis()->SetTitle("City ranking by population");
  histo->GetYaxis()->SetTitle("Number of inhabitants");

  // Fit and plot graph:
  fit_pop->SetLineColor(2);
  fit_pop->SetLineWidth(2);
  graph_pop->Fit("fit_pop", "R");
  graph_pop->Draw("P same");

  TLine *line1 = new TLine(0.0, 1000000.0, 240.0, 1000000.0);
  line1->Draw("same");
  TLine *line2 = new TLine(0.0, 2000000.0, 240.0, 2000000.0);
  line2->Draw("same");
  TLine *line3 = new TLine(0.0, 3000000.0, 150.0, 3000000.0);
  line3->Draw("same");

  canvas->Update();
  // canvas->SaveAs("Population_fit.eps");
}


//---------------------------------------------------------------------------------- 
/*

Start by taking a close look at the data, both by inspecting the numbers, and then by
considering the histograms produced by running the macro.


Questions:
----------
 1) Think about what uncertainties to assign the population number! This is not
    easy, as it is NOT an uncertainty on the actual number of citizens, but rather
    how much one can expect this to jump around from one to the next. Consider the
    data, and then decide.
    Once you have put your estimate into the program, try to run the program and
    see if your estimate is reasonable. If not, then change it. Eventually, the fit
    ChiSquare will tell you, if you did OK!

 2) Define a (better) fitting function, which actually matches the data, and then
    make an estimate at how many urban areas in the world with a population above
    2.000.000 inhabitants. Remember to assign an uncertainty to your estimate,
    also including basic systematic uncertainties if possible.

 3) Consider again the data, this time visually. Are there any irregularities, or
    other features, which affects the fit quality? Try to change the fitting range,
    such that this is avoided. Does this result in a better fit? Would you like to
    change your answer from above?

 4) Once you have an answer, ask your teacher for the true answer (which we have!),
    and then consider if you did OK or not.

 5) Repeat the exercise with DataSet2.txt, which contains further data, and try to
    give an estimate of the number of cities above 1.000.000 inhabitants. This time,
    we do NOT know the answer! But try to compare with others, who have an estimate.


Advanced Questions:
-------------------
 1) The dataset (population and area) is obtained by using many different techniques,
    coded by the letters A through H, defined as follows:
      A: National census authority data.
      B: Demographia land area estimate based upon map or satellite photograph analysis.
      C: Demographia population "build up" from third, fourth or fifth order jurisdictions.
      D: Population estimate based upon United Nations agglomeration estimate.
      E: Demographia population estimate from national census authority agglomeration data.
      F: Other Demographia population estimate.
      G: Estimate based upon projected growth rate from last census.
      H: Combination of adjacent national census authority agglomerations.
    The most reliable population estimates are "A" and "H." Population estimates "C" are
    generally reliable. Population estimates "D" and "E" are less reliable. Population
    estimates coded "F" and "G" are the least reliable.

    Investigate how much of an effect (and thus systematic error) a variation of the
    uncertainties depending on the classification would have on the result both for
    determination of the medium size and the smaller urban areas.


*/
//----------------------------------------------------------------------------------