10:37:18 From Aske R. : you can't submit without filling out the second question?
10:39:44 From Aske R. : yes
10:41:12 From svend korsgaard : Troels, is there any way we can get some data from CERN regarding the research in dark matter?
10:43:15 From Aske R. : the lightgbm vs xgboost comparison has some example code
10:44:08 From Sofus Kjærsgaard Stray : What are we training/testing on?
10:44:24 From Marta : when are you going to put the recorded lecture from today up on the course webpage?
10:44:45 From Marta : wonderful thanks
10:44:48 From Jonathan : I have a quick question about the last exercise regarding the cut criterion - can I ask it here?
10:47:16 From Jonathan Stubkjær Jegstrup : Should we split the data into training and testing? And if so, what should the ratio be? 30% train and 70% test, 50/50, or anything else?
10:48:02 From Sofus Kjærsgaard Stray : Using AdaBoost I'm getting 0% error
10:48:22 From zoeansari : The test dataset usually gets the smaller part of the whole
10:48:57 From zoeansari : Even on the test part, Sofus? If so, that's fascinating
10:49:07 From Sofus Kjærsgaard Stray : Nah, that was just training
10:49:15 From Sofus Kjærsgaard Stray : so I assume overtraining is happening
10:49:53 From Troels Christian Petersen : One can split 50/50, but typically 80% training and 20% testing is the norm
10:50:10 From Jonathan Stubkjær Jegstrup : Ok, thanks!
10:50:36 From Sofus Kjærsgaard Stray : Even on the testing data I'm getting 0% error
10:50:51 From Troels Christian Petersen : 0% error sounds “too good”… this can happen on the TRAINING dataset, but should not be the case for the TESTING dataset...
10:51:23 From Aske R. : anyone know how to get lightgbm and xgboost installed on Spyder?
10:51:35 From Yane García : is the matrix you drew the confusion matrix? you pointed to a linear region different from the diagonal.
10:51:39 From Troels Christian Petersen : Hmmm…. then there is something wrong, but that is of course hard to debug from here. Look at it, explain it to a peer, send the code to us, etc.
10:51:41 From Yane García : I got confused
10:52:26 From Troels Christian Petersen : @Yane: No, it was the 2D distribution of signal and background (b-quark jets or not)…
10:52:29 From Sofus Kjærsgaard Stray : I'm using sklearn AdaBoostClassifier, so the code is just: abc = AdaBoostClassifier(n_estimators=100); model = abc.fit(X_train, y_train)
10:52:57 From Yane García : Thanks
10:53:55 From zoeansari : Did you split the data before training, Sofus?
10:54:13 From Sofus Kjærsgaard Stray : I did
10:54:22 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : Hi Sofus… feel free to send the code to me. Then I (or Zoe or Carl) can take a quick look at it…
10:54:40 From zoeansari : That's interesting. It would be nice if I could look at your screen
10:54:41 From Sofus Kjærsgaard Stray To Troels Christian Petersen(privately) : How should I send it?
10:54:51 From Sofus Kjærsgaard Stray : How do I screen share?
10:54:52 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : Email…
10:55:04 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : petersen@nbi.dk
10:55:29 From zoeansari : Maybe you can send a screenshot somewhere to me? Maybe Slack, if you are also there?
10:56:58 From Andy Anker : I guess it has something to do with replicates. So if you do not remove those, you will have most/all of your validation data replicated into your training set.
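A minimal sketch of the 80/20 split and AdaBoost fit discussed above; X and y are assumed to be feature and label arrays already loaded from the dataset, and the names are illustrative:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

# 80% training / 20% testing, as recommended above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

abc = AdaBoostClassifier(n_estimators=100)
model = abc.fit(X_train, y_train)

# Error fraction on the held-out test set; 0% here would hint at a problem such as data leakage
error_fraction = np.mean(model.predict(X_test) != y_test)
print(f"Test error fraction: {error_fraction:.3f}")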
10:57:17 From Sofus Kjærsgaard Stray To Troels Christian Petersen(privately) : It won't let me send .py files from my KU mail, so I'm using a private one
10:57:33 From Sofus Kjærsgaard Stray : I'll use Slack
10:58:30 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : Yes, continue on Slack with e.g. Zoe...
10:58:41 From Sofus Kjærsgaard Stray To Troels Christian Petersen(privately) : kk
10:58:58 From Sofus Kjærsgaard Stray : I've posted the code on Slack
10:59:04 From zoeansari : Probably you are right, Andy, because I can see on Slack that Sofus is using dataset version 1
10:59:09 From zoeansari : Thanks Sofus
10:59:22 From Sofus Kjærsgaard Stray : That might be the case
11:00:21 From Andy Anker : It is the replicates. I replied on Slack with a fix
11:01:15 From joachim : What is "state-of-the-art" in predicting on the particle dataset? I can achieve 90% accuracy with no hyperparameter tuning
11:01:26 From Troels Christian Petersen : Hi Sofus et al. I suddenly realise what might be the problem, as others have too, I see. If you do not check your data - AND THIS IS VERY GENERAL - then you run into problems. Version 0 (and 1) of the data is simply repeated, so you should of course use Version2. Otherwise, the same events will be in both the training and the testing set, and only 100 (or 1000) different events will be there, which 100 trees will easily learn completely!!! Good luck...
11:02:32 From Troels Christian Petersen : @joachim: 90% is good, but it can also be improved by a couple of %. I don't have a "record number" for this dataset, but please post what you get, and we'll get a feel for what everybody gets.
11:03:44 From Sofus Kjærsgaard Stray : Yeah, using version 2 at least gives a much more sensible result: now getting an error of 9.5%
11:04:15 From Emil Schou Martiny : argh, I spent the last 10 minutes figuring out why my sloppy code was perfect
11:04:49 From Andy Anker : Weren't the best errors around 6-7% last time? I find it odd that we cannot beat it.
11:05:36 From svend korsgaard To Troels Christian Petersen(privately) : Where could I find the link to the 2 blog posts again?
11:07:38 From Rasmus Salmon : Andy, isn't that due to the fact that we evaluated on the training data on Monday :)
11:08:09 From Andy Anker : You are right, thx
11:10:34 From svend korsgaard : Does anyone need a member to work on the final project with?
11:10:39 From kristoffer : 9.6 for me as well, using the RandomForestClassifier from scikit-learn.
11:10:48 From kristoffer : *9.6% error
11:15:20 From Troels Christian Petersen : Regarding the accuracy, I was actually a bit suspicious about the 6-7% from Monday, as I also think around 9% is doable, but not much lower. That is why I wanted you to send me your code, but instead of debugging this, it leads to an interesting point, which I'll make in a moment…
11:20:37 From Sofus Kjærsgaard Stray : Regarding the second question, the only one I've tried so far is scikit-learn
11:22:55 From Jonathan : I get a score of about 0.85 using the sklearn decision tree classifier
11:23:13 From Viktoria Lavro : yes
11:24:25 From Haider Fadil Ali Hussein Al-Saadi : I got a binary logloss of 0.238 with LightGBM, no idea how to read that though
11:25:31 From Simon Ulrik Hilbard : what does "easy to train" mean?
11:26:40 From joachim : random forest!
11:26:45 From ximtecs : Woody
11:26:59 From ximtecs : Developed at DIKU :)
11:27:45 From Emil Schou Martiny : after lunch we start again at 13.15 right?
11:29:50 From Sofus Kjærsgaard Stray : Has anyone tried LightGBM? What does "validation data" refer to in the docs?
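A minimal sketch of the kind of data check Troels recommends above; the actual fix in the chat was to use Version2 of the dataset, and the DataFrame name df and the filename here are illustrative:

import pandas as pd

df = pd.read_csv("dataset.csv")                    # illustrative filename
n_total, n_unique = len(df), len(df.drop_duplicates())
print(f"{n_total} rows, {n_unique} unique rows")   # far fewer unique rows signals repeated events

# Dropping duplicates before splitting keeps identical events from ending up
# in both the training and the testing set
df_clean = df.drop_duplicates().reset_index(drop=True)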
What does "validation data" refer to in the docs? 11:30:52 From Haider Fadil Ali Hussein Al-Saadi : validation data refers to the test set i believe 11:30:57 From Haider Fadil Ali Hussein Al-Saadi : https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py 11:31:02 From Haider Fadil Ali Hussein Al-Saadi : that's the example i used 11:31:03 From Sofus Kjærsgaard Stray : alright 11:32:14 From Marta : @svend korsgaard, as it stands now, our group has 3-4 members, we will do a geophysics project and we could use another member 11:34:55 From Jonathan : i get a fraction wrong of about 0.15 using sklearn 11:35:18 From Troels Christian Petersen : Yes, after lunch we start 13:15... 11:35:34 From Jonathan : how would I go about improving my score? 11:35:57 From Jonathan : just using sclera of the shelf 11:36:00 From Jonathan : sklearn* 11:36:17 From Sofus Kjærsgaard Stray : So using LightGBM results in predictions that aren't 0 and 1. How do I "fix" that? 11:36:40 From Troels Christian Petersen : @Jonathan: You would change around the size of the training sample (say 20% for testing), and also the number of estimators/trees. 11:36:53 From Jonathan : ok thx 11:37:26 From Carl-Johannes Johnsen : @Sofus, one trick is list comprehension 11:37:40 From Carl-Johannes Johnsen : [1 if pred > 0.5 else 0 for pred in predictions] 11:37:57 From Sofus Kjærsgaard Stray : Alright, cool. Thanks 11:38:18 From Troels Christian Petersen : @Sofus: Well, there is generally “score” [0,1] and “predict” (0 or 1). In fact, a continuous score is typically best, as it allows one (or a user) to subsequently decide, where to set the cut (and not necessarily at 0.5, though that is a valid option). 11:39:44 From svend korsgaard To Troels Christian Petersen(privately) : Are we meant to just use scikitlearn.tree() function? or also make loops in python to make BDT models already? 11:40:13 From Alba : I am getting this error when I run my code : Input contains NaN, infinity or a value too large for dtype('float32') 11:40:39 From Alba : Does someone else got the same? 11:41:11 From Sofus Kjærsgaard Stray : What are you trying to run? 11:41:46 From Troels Christian Petersen : You should generally go look at the SciKit-Learn documentation and see, which methods there are. “Tree” is the simplest one (I believe), while e.g. AdaBoostClassifier is a boosted version. 11:43:03 From Rasmus Salmon : I have not tried all the algorithms but here are some of my results presented af wrong fractions. I let the algorithms do all the hyperparameter optimization. Does anybody else get comparable results: A single tree: 15.6 A bagged tree: 10.6 AdaBoost: 9.9 Random forrest: 9.9 xgBoost: 9.7 11:43:22 From Troels Christian Petersen : Alba (and others): When you get this type of problem (and we all do), then first debug yourself for 5-10 minutes (check Google!), then discuss it with a collaborator, and finally write out here or directly to us (probably Zoe/Carl). 11:43:33 From Sofus Kjærsgaard Stray : Roughly yeah. LightGBM is getting me around 10 - 10.5% 11:43:46 From Sofus Kjærsgaard Stray : but that's with default parameters 11:45:06 From Troels Christian Petersen To Rasmus Salmon(privately) : Hej Rasmus - gode resultater. Du lader til at have fået lavet et godt setup. Ville du have noget imod at dele din kode med de andre? Hvis ikke, så ville jeg sætte den på, som et eksempel (muligvis sammen med andre) på Week1 websiden. BH. 
11:47:30 From Haider Fadil Ali Hussein Al-Saadi : LightGBM gets me 8.74%
11:47:50 From Sofus Kjærsgaard Stray : I'm down to 9.5% after messing with learning_rate and num_leaves
11:48:22 From Rasmus Salmon To Troels Christian Petersen(privately) : Of course, should I just send it to you by email?
11:49:42 From Troels Christian Petersen To Rasmus Salmon(privately) : Yes, please do (petersen@nbi.dk), and it doesn't have to be right now - and I will check it through before posting… thanks.
11:50:45 From Andy Anker : Does anyone have a good algorithm for hyperparameter optimization? Currently I am doing a normal grid search, which might not be the most effective.
11:51:13 From Troels Christian Petersen To Haider Fadil Ali Hussein Al-Saadi(privately) : Hi Haider - good result. You seem to have built a good model. Would you mind sharing your code with the others? If not, I would put it up as an example (possibly along with others) on the Week1 webpage. Best regards, Troels
11:51:38 From Sofus Kjærsgaard Stray : I feel like I'm just changing the hyperparameters at random atm... like I don't know what a bunch of these things mean
11:52:06 From Haider Fadil Ali Hussein Al-Saadi To Troels Christian Petersen(privately) : how would you like me to send it?
11:52:06 From Troels Christian Petersen To Haider Fadil Ali Hussein Al-Saadi(privately) : We will get to hyperparameter optimisation on Wednesday next week - specifically this point, and how it is done. It is hard and CPU-intensive work, and there are no guarantees of getting it right.
11:52:11 From Emil Schou Martiny : I am looping over a bunch of different parameters to get an idea of what they do
11:52:21 From Troels Christian Petersen : We will get to hyperparameter optimisation on Wednesday next week - specifically this point, and how it is done. It is hard and CPU-intensive work, and there are no guarantees of getting it right.
11:52:45 From Troels Christian Petersen : Looping over a grid is an option, but certainly not the best one… :-)
11:53:01 From Sofus Kjærsgaard Stray : Can't you just use machine learning to optimize the hyperparameters, get another MLA to optimize the parameters of that, and so on until you have an infinite loop
11:53:38 From Troels Christian Petersen : However, for now don't worry too much about performance details, but rather see that you can run many different algorithms and get reasonable results, which you understand and can use. Make plots of these, so that you see the distributions of the scoring.
11:53:41 From Haider Fadil Ali Hussein Al-Saadi : I'm just changing them manually and seeing what's what, tbh. For example, too many leaf nodes leads to a poorer result, which I assume is due to overtraining. This makes sense, since if we had a number of nodes equal to the number of samples in the training set, the algorithm would just optimize to subdivide each sample into its own little node?
11:54:37 From Sofus Kjærsgaard Stray : Yeah, I'm getting the same. 20 nodes seems good for me. Learning rate is also another big change. It defaults to 0.1 in LightGBM and changing it to 0.25 helps
11:54:47 From kristoffer : LightGBM takes me to 9.6% as well, but it is a hell of a lot faster than sklearn :)
11:55:02 From Haider Fadil Ali Hussein Al-Saadi : "We will get to hyperparameter optimisation on Wednesday next week - specifically this point, and how it is done. It is hard and CPU-intensive work, and there are no guarantees of getting it right." Troels wrote the above to me privately by mistake
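A minimal sketch of the grid-style hyperparameter loop discussed above, using LightGBM's sklearn wrapper with GridSearchCV; the parameter values tried and the X_train/y_train names are illustrative:

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "num_leaves": [20, 31, 63],
    "learning_rate": [0.1, 0.25],
    "n_estimators": [100, 500],
}

# 3-fold cross-validated search over the small illustrative grid above
grid = GridSearchCV(LGBMClassifier(objective="binary"), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, 1.0 - grid.best_score_)   # best settings and their cross-validated error fraction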
11:55:09 From Jonathan : I get about 10% with AdaBoost
11:55:23 From Haider Fadil Ali Hussein Al-Saadi : oh nvm he fixed it
11:56:04 From Troels Christian Petersen : @Sofus: In principle you can, but it is hard to "categorise" a dataset. Perhaps you can estimate the best HPs by looking at the size of the data, the correlations in it, the missing values, how many variables are categorical and missing, and how many variables in total, and then have an ML-algorithm determine some HPs. But I doubt that it will work perfectly, and you would also need 1000s of very different datasets to train this "HyperPar-ML" :-)
11:57:10 From kristoffer : 9.5 with LightGBM, params = { 'boosting_type': 'gbdt', 'objective': 'binary', 'num_leaves': 63 }
11:57:27 From kristoffer : 9.5 %
11:57:35 From Sofus Kjærsgaard Stray : 9.3 with param = {'boosting': 'dart', 'num_leaves': 20, 'objective': 'regression', 'learning_rate': 0.25}
11:57:44 From Haider Fadil Ali Hussein Al-Saadi : params = { 'boosting_type': 'gbdt', 'objective': 'binary', 'num_leaves': 20, 'verbose': 0 }; gbm = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=lgb_eval, early_stopping_rounds=50)
11:57:55 From Haider Fadil Ali Hussein Al-Saadi : 8.74%
11:58:04 From Haider Fadil Ali Hussein Al-Saadi : test set size of 5000
11:58:18 From Sofus Kjærsgaard Stray : damn, I'm on 25 rounds only
11:58:35 From Haider Fadil Ali Hussein Al-Saadi : it stops after 150--
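A minimal, self-contained version of the LightGBM setup posted above, assuming X_train, X_test, y_train, y_test come from the earlier train/test split; note that recent LightGBM versions pass early stopping as a callback rather than the early_stopping_rounds keyword used in the chat:

import numpy as np
import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {'boosting_type': 'gbdt', 'objective': 'binary', 'num_leaves': 20, 'verbose': 0}

# Up to 1000 boosting rounds, stopping early if the validation loss has not improved for 50 rounds
gbm = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(50)])

preds = gbm.predict(X_test)                               # continuous scores in [0, 1]
wrong_fraction = np.mean((preds > 0.5).astype(int) != y_test)
print(f"Wrong fraction: {wrong_fraction:.3f}")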