10:37:18 From Aske R. : you can't submit without filling out the second question?
10:39:44 From Aske R. : yes
10:41:12 From svend korsgaard : Troels, is there any way we can get some data from CERN regarding the research in dark matter?
10:43:15 From Aske R. : the lightgbm vs xgboost comparison has some example code
10:44:08 From Sofus Kjærsgaard Stray : What are we training/testing on?
10:44:24 From Marta : when are you going to put the recorded lecture from today up on the course webpage?
10:44:45 From Marta : wonderful thanks
10:44:48 From Jonathan : I have a quick question about the last exercise regarding the cut criterion - can I ask it here?
10:47:16 From Jonathan Stubkjær Jegstrup : Should we split the data into training and testing? And if so, what should the ratio be? 30% train and 70% test, 50/50, or anything else?
10:48:02 From Sofus Kjærsgaard Stray : Using AdaBoost I'm getting 0% error
10:48:22 From zoeansari : The test dataset usually gets the smaller part of the whole
10:48:57 From zoeansari : Even on the test part, Sofus? If so, that's fascinating
10:49:07 From Sofus Kjærsgaard Stray : Nah, that was just training
10:49:15 From Sofus Kjærsgaard Stray : so I assume overtraining is happening
10:49:53 From Troels Christian Petersen : One can split 50/50, but typically 80% training and 20% testing is the norm
10:50:10 From Jonathan Stubkjær Jegstrup : Ok, thanks!
10:50:36 From Sofus Kjærsgaard Stray : Even on the testing data I'm getting 0% error
10:50:51 From Troels Christian Petersen : 0% error sounds “too good”… this can happen on the TRAINING dataset, but should not be the case for the TESTING dataset...
10:51:23 From Aske R. : anyone know how to get lightgbm and xgboost installed on Spyder?
10:51:35 From Yane García : is the matrix you drew the confusion matrix? you pointed to a linear region different from the diagonal.
10:51:39 From Troels Christian Petersen : Hmmm…. then there is something wrong, but that is of course hard to debug from here. Look at it, explain it to a peer, send the code to us, etc.
10:51:41 From Yane García : I got confused
10:52:26 From Troels Christian Petersen : @Yane: No, it was the 2D distribution of signal and background (b-quark jets or not)…
10:52:29 From Sofus Kjærsgaard Stray : I'm using sklearn AdaBoostClassifier, so the code is just: abc = AdaBoostClassifier(n_estimators=100); model = abc.fit(X_train, y_train)
10:52:57 From Yane García : Thanks
10:53:55 From zoeansari : Did you split the data before training, Sofus?
10:54:13 From Sofus Kjærsgaard Stray : I did
10:54:22 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : Hi Sofus… feel free to send the code to me. Then I (or Zoe or Carl) can take a quick look at it…
10:54:40 From zoeansari : That's interesting. It would be nice if I could look at your screen
10:54:41 From Sofus Kjærsgaard Stray To Troels Christian Petersen(privately) : How should I send it?
10:54:51 From Sofus Kjærsgaard Stray : How do I screen share?
10:54:52 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : Email…
10:55:04 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : petersen@nbi.dk
10:55:29 From zoeansari : Maybe you can send a screenshot somewhere to me? Maybe Slack, if you are also there?
10:56:58 From Andy Anker : I guess it has something to do with replicates. So if you do not remove those, you will have most/all of your validation data replicated into your training set.
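A minimal sketch of the 80/20 split and AdaBoost fit discussed above; X and y are assumed to be feature and label arrays already loaded from the dataset, and the names are illustrative:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

# 80% training / 20% testing, as recommended above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

abc = AdaBoostClassifier(n_estimators=100)
model = abc.fit(X_train, y_train)

# Error fraction on the held-out test set; 0% here would hint at a problem such as data leakage
error_fraction = np.mean(model.predict(X_test) != y_test)
print(f"Test error fraction: {error_fraction:.3f}")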
10:57:17 From Sofus Kjærsgaard Stray To Troels Christian Petersen(privately) : It won't let me send .py files from my KU mail, so I'm using a private one
10:57:33 From Sofus Kjærsgaard Stray : I'll use Slack
10:58:30 From Troels Christian Petersen To Sofus Kjærsgaard Stray(privately) : Yes, continue on Slack with e.g. Zoe...
10:58:41 From Sofus Kjærsgaard Stray To Troels Christian Petersen(privately) : kk
10:58:58 From Sofus Kjærsgaard Stray : I've posted the code on Slack
10:59:04 From zoeansari : Probably you are right, Andy, because I can see on Slack that Sofus is using dataset version 1
10:59:09 From zoeansari : Thanks Sofus
10:59:22 From Sofus Kjærsgaard Stray : That might be the case
11:00:21 From Andy Anker : It is the replicates. I replied on Slack with a fix
11:01:15 From joachim : What is "state-of-the-art" in predicting on the particle dataset? I can achieve 90% accuracy with no hyperparameter tuning
11:01:26 From Troels Christian Petersen : Hi Sofus et al. I suddenly realise what might be the problem, as others have too, I see. If you do not check your data - AND THIS IS VERY GENERAL - then you run into problems. Version 0 (and 1) of the data is simply repeated, so you should of course use Version2. Otherwise, the same events will be in both the training and the testing set, and only 100 (or 1000) different events will be there, which 100 trees will easily learn completely!!! Good luck...
11:02:32 From Troels Christian Petersen : @joachim: 90% is good, but it can also be improved by a couple of %. I don't have a "record number" for this dataset, but please post what you get, and we'll get a feel for what everybody gets.
11:03:44 From Sofus Kjærsgaard Stray : Yeah, using version 2 at least gives a much more sensible result: now getting an error of 9.5%
11:04:15 From Emil Schou Martiny : argh, I spent the last 10 minutes figuring out why my sloppy code was perfect
11:04:49 From Andy Anker : Weren't the best errors around 6-7% last time? I find it odd that we cannot beat it.
11:05:36 From svend korsgaard To Troels Christian Petersen(privately) : Where could I find the link to the 2 blog posts again?
11:07:38 From Rasmus Salmon : Andy, isn't that due to the fact that we evaluated on the training data on Monday :)
11:08:09 From Andy Anker : You are right, thx
11:10:34 From svend korsgaard : Does anyone need a member to work on the final project with?
11:10:39 From kristoffer : 9.6 for me as well, using the RandomForestClassifier from scikit-learn.
11:10:48 From kristoffer : *9.6% error
11:15:20 From Troels Christian Petersen : Regarding the accuracy, I was actually a bit suspicious about the 6-7% from Monday, as I also think around 9% is doable, but not much lower. That is why I wanted you to send me your code, but instead of debugging this, it leads to an interesting point, which I'll make in a moment…
11:20:37 From Sofus Kjærsgaard Stray : Regarding the second question, the only one I've tried so far is scikit-learn
11:22:55 From Jonathan : I get a score of about 0.85 using the sklearn decision tree classifier
11:23:13 From Viktoria Lavro : yes
11:24:25 From Haider Fadil Ali Hussein Al-Saadi : I got a binary logloss of 0.238 with LightGBM, no idea how to read that though
11:25:31 From Simon Ulrik Hilbard : what does "easy to train" mean?
11:26:40 From joachim : random forest!
11:26:45 From ximtecs : Woody
11:26:59 From ximtecs : Developed at DIKU :)
11:27:45 From Emil Schou Martiny : after lunch we start again at 13.15 right?
11:29:50 From Sofus Kjærsgaard Stray : Has anyone tried LightGBM? What does "validation data" refer to in the docs?
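A minimal sketch of the kind of data check Troels recommends above; the actual fix in the chat was to use Version2 of the dataset, and the DataFrame name df and the filename here are illustrative:

import pandas as pd

df = pd.read_csv("dataset.csv")                    # illustrative filename
n_total, n_unique = len(df), len(df.drop_duplicates())
print(f"{n_total} rows, {n_unique} unique rows")   # far fewer unique rows signals repeated events

# Dropping duplicates before splitting keeps identical events from ending up
# in both the training and the testing set
df_clean = df.drop_duplicates().reset_index(drop=True)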
What does "validation data" refer to in the docs? 11:30:52 From Haider Fadil Ali Hussein Al-Saadi : validation data refers to the test set i believe 11:30:57 From Haider Fadil Ali Hussein Al-Saadi : https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py 11:31:02 From Haider Fadil Ali Hussein Al-Saadi : that's the example i used 11:31:03 From Sofus Kjærsgaard Stray : alright 11:32:14 From Marta : @svend korsgaard, as it stands now, our group has 3-4 members, we will do a geophysics project and we could use another member 11:34:55 From Jonathan : i get a fraction wrong of about 0.15 using sklearn 11:35:18 From Troels Christian Petersen : Yes, after lunch we start 13:15... 11:35:34 From Jonathan : how would I go about improving my score? 11:35:57 From Jonathan : just using sclera of the shelf 11:36:00 From Jonathan : sklearn* 11:36:17 From Sofus Kjærsgaard Stray : So using LightGBM results in predictions that aren't 0 and 1. How do I "fix" that? 11:36:40 From Troels Christian Petersen : @Jonathan: You would change around the size of the training sample (say 20% for testing), and also the number of estimators/trees. 11:36:53 From Jonathan : ok thx 11:37:26 From Carl-Johannes Johnsen : @Sofus, one trick is list comprehension 11:37:40 From Carl-Johannes Johnsen : [1 if pred > 0.5 else 0 for pred in predictions] 11:37:57 From Sofus Kjærsgaard Stray : Alright, cool. Thanks 11:38:18 From Troels Christian Petersen : @Sofus: Well, there is generally “score” [0,1] and “predict” (0 or 1). In fact, a continuous score is typically best, as it allows one (or a user) to subsequently decide, where to set the cut (and not necessarily at 0.5, though that is a valid option). 11:39:44 From svend korsgaard To Troels Christian Petersen(privately) : Are we meant to just use scikitlearn.tree() function? or also make loops in python to make BDT models already? 11:40:13 From Alba : I am getting this error when I run my code : Input contains NaN, infinity or a value too large for dtype('float32') 11:40:39 From Alba : Does someone else got the same? 11:41:11 From Sofus Kjærsgaard Stray : What are you trying to run? 11:41:46 From Troels Christian Petersen : You should generally go look at the SciKit-Learn documentation and see, which methods there are. “Tree” is the simplest one (I believe), while e.g. AdaBoostClassifier is a boosted version. 11:43:03 From Rasmus Salmon : I have not tried all the algorithms but here are some of my results presented af wrong fractions. I let the algorithms do all the hyperparameter optimization. Does anybody else get comparable results: A single tree: 15.6 A bagged tree: 10.6 AdaBoost: 9.9 Random forrest: 9.9 xgBoost: 9.7 11:43:22 From Troels Christian Petersen : Alba (and others): When you get this type of problem (and we all do), then first debug yourself for 5-10 minutes (check Google!), then discuss it with a collaborator, and finally write out here or directly to us (probably Zoe/Carl). 11:43:33 From Sofus Kjærsgaard Stray : Roughly yeah. LightGBM is getting me around 10 - 10.5% 11:43:46 From Sofus Kjærsgaard Stray : but that's with default parameters 11:45:06 From Troels Christian Petersen To Rasmus Salmon(privately) : Hej Rasmus - gode resultater. Du lader til at have fået lavet et godt setup. Ville du have noget imod at dele din kode med de andre? Hvis ikke, så ville jeg sætte den på, som et eksempel (muligvis sammen med andre) på Week1 websiden. BH. 
11:47:30 From Haider Fadil Ali Hussein Al-Saadi : LightGBM gets me 8.74%
11:47:50 From Sofus Kjærsgaard Stray : I'm down to 9.5% after messing with learning_rate and num_leaves
11:48:22 From Rasmus Salmon To Troels Christian Petersen(privately) : Of course, should I just send it to you by email?
11:49:42 From Troels Christian Petersen To Rasmus Salmon(privately) : Yes, please do (petersen@nbi.dk), and it doesn't have to be right now - and I will check it through before posting… thanks.
11:50:45 From Andy Anker : Does anyone have a good algorithm for hyperparameter optimization? Currently I am doing a normal grid search, which might not be the most effective.
11:51:13 From Troels Christian Petersen To Haider Fadil Ali Hussein Al-Saadi(privately) : Hi Haider - good result. You seem to have built a good model. Would you mind sharing your code with the others? If not, I would put it up as an example (possibly along with others) on the Week1 webpage. Best regards, Troels
11:51:38 From Sofus Kjærsgaard Stray : I feel like I'm just changing the hyperparameters at random atm... like I don't know what a bunch of these things mean
11:52:06 From Haider Fadil Ali Hussein Al-Saadi To Troels Christian Petersen(privately) : how would you like me to send it?
11:52:06 From Troels Christian Petersen To Haider Fadil Ali Hussein Al-Saadi(privately) : We will get to hyperparameter optimisation on Wednesday next week - specifically this point, and how it is done. It is hard and CPU-intensive work, and there are no guarantees of getting it right.
11:52:11 From Emil Schou Martiny : I am looping over a bunch of different parameters to get an idea of what they do
11:52:21 From Troels Christian Petersen : We will get to hyperparameter optimisation on Wednesday next week - specifically this point, and how it is done. It is hard and CPU-intensive work, and there are no guarantees of getting it right.
11:52:45 From Troels Christian Petersen : Looping over a grid is an option, but certainly not the best one… :-)
11:53:01 From Sofus Kjærsgaard Stray : Can't you just use machine learning to optimize the hyperparameters, get another MLA to optimize the parameters of that, and so on until you have an infinite loop
11:53:38 From Troels Christian Petersen : However, for now don't worry too much about performance details, but rather see that you can run many different algorithms and get reasonable results, which you understand and can use. Make plots of these, so that you see the distributions of the scoring.
11:53:41 From Haider Fadil Ali Hussein Al-Saadi : I'm just changing them manually and seeing what's what, tbh. For example, too many leaf nodes leads to a poorer result, which I assume is due to overtraining. This makes sense, since if we had a number of nodes equal to the number of samples in the training set, the algorithm would just optimize to subdivide each sample into its own little node?
11:54:37 From Sofus Kjærsgaard Stray : Yeah, I'm getting the same. 20 nodes seems good for me. Learning rate is also another big change. It defaults to 0.1 in LightGBM and changing it to 0.25 helps
11:54:47 From kristoffer : LightGBM takes me to 9.6% as well, but it is a hell of a lot faster than sklearn :)
11:55:02 From Haider Fadil Ali Hussein Al-Saadi : "We will get to hyperparameter optimisation on Wednesday next week - specifically this point, and how it is done. It is hard and CPU-intensive work, and there are no guarantees of getting it right." Troels wrote the above to me privately by mistake
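A minimal sketch of the grid-style hyperparameter loop discussed above, using LightGBM's sklearn wrapper with GridSearchCV; the parameter values tried and the X_train/y_train names are illustrative:

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "num_leaves": [20, 31, 63],
    "learning_rate": [0.1, 0.25],
    "n_estimators": [100, 500],
}

# 3-fold cross-validated search over the small illustrative grid above
grid = GridSearchCV(LGBMClassifier(objective="binary"), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, 1.0 - grid.best_score_)   # best settings and their cross-validated error fraction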
11:55:09 From Jonathan : I get about 10% with AdaBoost
11:55:23 From Haider Fadil Ali Hussein Al-Saadi : oh nvm he fixed it
11:56:04 From Troels Christian Petersen : @Sofus: In principle you can, but it is hard to "categorise" a dataset. Perhaps you can estimate the best HPs by looking at the size of the data, the correlations in it, the missing values, how many variables are categorical and missing, and how many variables in total, and then have an ML-algorithm determine some HPs. But I doubt that it will work perfectly, and you would also need 1000s of very different datasets to train this "HyperPar-ML" :-)
11:57:10 From kristoffer : 9.5 with LightGBM, params = { 'boosting_type': 'gbdt', 'objective': 'binary', 'num_leaves': 63 }
11:57:27 From kristoffer : 9.5 %
11:57:35 From Sofus Kjærsgaard Stray : 9.3 with param = {'boosting': 'dart', 'num_leaves': 20, 'objective': 'regression', 'learning_rate': 0.25}
11:57:44 From Haider Fadil Ali Hussein Al-Saadi : params = { 'boosting_type': 'gbdt', 'objective': 'binary', 'num_leaves': 20, 'verbose': 0 }; gbm = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=lgb_eval, early_stopping_rounds=50)
11:57:55 From Haider Fadil Ali Hussein Al-Saadi : 8.74%
11:58:04 From Haider Fadil Ali Hussein Al-Saadi : test set size of 5000
11:58:18 From Sofus Kjærsgaard Stray : damn, I'm on 25 rounds only
11:58:35 From Haider Fadil Ali Hussein Al-Saadi : it stops after 150--
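A minimal, self-contained version of the LightGBM setup posted above, assuming X_train, X_test, y_train, y_test come from the earlier train/test split; note that recent LightGBM versions pass early stopping as a callback rather than the early_stopping_rounds keyword used in the chat:

import numpy as np
import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {'boosting_type': 'gbdt', 'objective': 'binary', 'num_leaves': 20, 'verbose': 0}

# Up to 1000 boosting rounds, stopping early if the validation loss has not improved for 50 rounds
gbm = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(50)])

preds = gbm.predict(X_test)                               # continuous scores in [0, 1]
wrong_fraction = np.mean((preds > 0.5).astype(int) != y_test)
print(f"Wrong fraction: {wrong_fraction:.3f}")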