Introduction: ------------- In the Aleph experiment at CERN's LEP e+e- collider, Z0-boson (mass = 91 GeV) were produced, which subsequently decayed to both leptons (pairs of electrons, muons, taus or neutrinos) or pairs of quarks (u, d, s, c, and b) creating jets. This (simulated) data concerns only the quark decays, and the purpose is to identify jets from b-quarks (b-jets). These have a small but significant lifetime, allowing them to travel a bit before they decay, and they also produce a wider jet possibly with leptons of higher (relative) pt in them. The input variables (X) are: energy: Measured energy of the jet in GeV. Should be 45 GeV, but fluctuates. cTheta: cos(theta), i.e. the polar angle of the jet with respect to the beam axis. The detector works best in the central region (|cTheta| small) and less well in the forward regions. phi: The azimuth angle of the jet. As the detector is uniform in phi, this should not matter (much). prob_b: Probability of being a b-jet from the pointing of the tracks to the vertex. spheri: Sphericity of the event, i.e. how spherical it is. pt2rel: The transverse momentum squared of the tracks relative to the jet axis, i.e. width of the jet. multip: Multiplicity of the jet (in a relative measure). bqvjet: b-quark vertex of the jet, i.e. the probability of a detached vertex. ptlrel: Transverse momentum (in GeV) of possible lepton with respect to jet axis (about 0 if no leptons). nnbjet: Value of original Aleph b-jet tagging algorithm (for comparison). The target variable (Y) is: isb: 1 if it is from a b-quark and 0, if it is not. This file is relatively small, and (much) more simulated data AND real data exists. As it turns out, it is in fact possible to accurately determine the b-jet identification efficiency in data, where no labels exists. But that is another (and slightly challeging) problem. Goal: ----- The overall goal is to identify (categorize) b-quark jets in the best possible manner! However, we will start in a simple manner, and then progress to more and more advanced methods. So, to start with, you should do this in the "traditional / old fashion" way, namely using if-statements! So based on the variables available, try to define a set of selections, which determins if this is from a b-quark or not. At the end, you should evaluate your algorithm, by considering the 2x2 confusion matrix, namely how many of those who were actually b-jets (and not b-jets) you identified as being b-jets (and not b-jets). Those four numbers define how well your algorithm works, and if you were to boil it down to one number, you could aim at getting the lowest fraction of wrong assignments (i.e. smallest sum of the two off-diagonal entries).