13:16:10 From Haider Fadil Ali Hussein Al-Saadi : Yes
13:17:02 From Haider Fadil Ali Hussein Al-Saadi : wut
13:17:52 From Aske R. : sorry
13:27:40 From Andy Anker : Thx, that was what I tried to ask :D
13:30:35 From Haider Fadil Ali Hussein Al-Saadi : a cycle?
13:30:39 From Aske R. : 1 nanosecond?
13:30:52 From Sofus Kjærsgaard Stray : I found roughly 1 ns too
13:30:55 From Sofus Kjærsgaard Stray : oh rip
13:30:59 From Elias : 1/2.5 GHz
13:31:06 From Haider Fadil Ali Hussein Al-Saadi : 1/processor speed?
13:31:09 From Aske R. : from here https://gist.github.com/hellerbarde/2843375
13:31:58 From Haider Fadil Ali Hussein Al-Saadi : 3 GHz
13:32:02 From Haider Fadil Ali Hussein Al-Saadi : 3.4?
13:32:05 From Sofus Kjærsgaard Stray : Consumer CPU or specialized equipment?
13:32:08 From Michael : 5 GHz max
13:32:27 From Michael : 0.2 ns
13:32:51 From Aske R. : more
13:32:51 From Haider Fadil Ali Hussein Al-Saadi : more
13:32:54 From Haider Fadil Ali Hussein Al-Saadi : many more
13:32:57 From Aske R. : if it has multiple threads
13:34:02 From Rasmus Salmon : 2-4
13:34:06 From Mikkel Langgaard Lauritzen : 2-32
13:34:14 From Michael : Depending on float or integer?
13:34:18 From Haider Fadil Ali Hussein Al-Saadi : from 1 to infinity
13:34:40 From Mikkel Langgaard Lauritzen : https://en.wikipedia.org/wiki/FLOPS
13:34:47 From Troels Petersen : …Haider… your answer's range fits 1/2 of all questions :-)
13:34:56 From Haider Fadil Ali Hussein Al-Saadi : :D
13:35:30 From Michael : ~0.07 ns
13:35:32 From Haider Fadil Ali Hussein Al-Saadi : 1/(3*5 GHz)?
13:35:36 From Andy Anker : 66 ps
13:36:51 From Michael : ~1.98 cm
13:36:59 From Haider Fadil Ali Hussein Al-Saadi : To be fair, you technically asked how long it takes to make A calculation, not the average time per calculation, so the correct answer to that question is still 1/processor speed :)
13:38:22 From Aske R. : reading from memory is expensive
13:41:49 From Aske R. : 1 ms from SSD
13:42:00 From Ann-Sofie Priergaard Zinck : 8,000,000 ns??
13:42:45 From Aske R. : never mind, that was 1 MB
13:43:12 From Aske R. : sequentially
13:44:36 From Andy Anker : What about GPUs?
13:45:22 From Andy Anker : perfect, thanks
13:45:27 From Sofus Kjærsgaard Stray : It's pretty good for that Iris dataset example that scikit-learn always uses
13:45:45 From Sofus Kjærsgaard Stray : yes
13:46:03 From Haider Fadil Ali Hussein Al-Saadi : iris flowers
13:46:32 From Sofus Kjærsgaard Stray : It seems to work better in low-dimensional space and with clearer boundaries between cases
13:49:02 From Haider Fadil Ali Hussein Al-Saadi : But doesn't that become difficult if the parameters are of different types?
13:53:31 From kristoffer : thanks
13:55:12 From Sofus Kjærsgaard Stray : Any unsupervised system where you need to classify stuff
13:55:16 From Sofus Kjærsgaard Stray : like our small project
13:55:51 From Andy Anker : Regarding the problem with zip codes and finding the right distance matrix for that and the other features: could you use a dimensionality reduction tool such as PCA to get all these features into the same space and put the PCA components into KNN? (Same question for KNN clustering: is it smart to use PCA first?)
13:57:01 From Aske R. : how do you spell that?
13:57:15 From Andy Anker : T-sne
13:57:17 From Troels Petersen : The algorithm is called t-SNE… more on that!
13:58:32 From kristoffer : In the small project you say: use max 15 variables. Can those be transformed variables from PCA, for instance?
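As a concrete illustration of the scikit-learn Iris example mentioned at 13:45, here is a minimal k-NN sketch (assuming scikit-learn is installed; the 70/30 split and k=5 are arbitrary choices, not course code):

# Minimal k-NN on the Iris dataset, the standard scikit-learn example mentioned above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)   # an odd k avoids most voting ties
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))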
13:59:12 From Andy Anker : It says 15 variables from the variable list :(
13:59:26 From kristoffer : ah, thanks Andy
14:19:50 From Aske R. : can't you just add the largest negative value to all the values
14:19:54 From Aske R. : and then normalise after
14:20:15 From Troels Petersen : Or quantile transform, which works well for NNs
14:21:23 From Troels Petersen : @Andy: Yes, the PCA and t-SNE are simple, but work well in the unsupervised cases.
14:22:07 From Troels Petersen : A great example (that I should take up) is human genome data. This has 3B dimensions (!!!), so how do you see if you are related to anyone? What metric do you use?
14:23:01 From Troels Petersen : The answer is that since all bases are the same distance apart, no transformation is needed, but the PCA needs to figure out which "directions" are important, which it then does, and collects these…
14:23:43 From Ann-Sofie Priergaard Zinck : Does it make sense to use PCA to find the "best" variables and then take those "best" original variables, normalize them in some way, and feed them to the ML algorithm instead of using the transformed values?
14:23:48 From Troels Petersen : I'm not sure how you would do this with a NN, especially given that you might have no ground truth, or only some truth (i.e. "found in Italy near Rome") for a very small dataset (in the 100s max).
14:24:27 From Troels Petersen : @Ann-Sofie: Yes, that is certainly an option, especially for the very high-dimensional cases, such as above.
14:25:20 From Sofus Kjærsgaard Stray : Does anyone know what it means if I get an "End of HDF5 error back trace Problems reading records." error trying to read the train/test.h5 files?
14:25:42 From Sofus Kjærsgaard Stray : using pandas.read_hdf
14:26:08 From kristoffer : Where is the code for the exercises?
14:27:24 From Rasmus Salmon : What is the gain of getting the values in the range [0,1]? Why not just do normalization by subtracting the mean and dividing by the standard deviation?
14:28:54 From Ann-Sofie Priergaard Zinck : In a regression problem, what is the most common way to check the "accuracy" of the predicted y's? Explained variance score?
14:29:46 From kristoffer : Ah, reloading is what I needed...
14:31:00 From Andy Anker : Troels, if you used an autoencoder on the dataset with the human genome, I would expect the latent space to be a reduced dimension of the input features. So the encoder works as a dimensionality reduction, where we want to minimize the information loss, and the decoder reconstructs the input, i.e. the human genome, from the latent space. My question is whether the encoder in that type of architecture would beat methods such as PCA and t-SNE for clustering the data.
14:33:25 From Troels Petersen : @Ann-Sofie: The most used metric in regression is, I would think, the RMS or MAE (or MAD) of either the difference (E-T) or the relative difference (E-T)/T. That was in fact six answers… :-)
14:34:07 From Haider Fadil Ali Hussein Al-Saadi : Ah, so even if we have features that have the same type of units or sizes, we still have to select which of them we do k-NN on. So it becomes a way to look at the data in the same way we can do 2D plots of all the variables, but much easier to extend into higher dimensions?
14:34:13 From Troels Petersen : So, it depends on whether you care about outliers (then use RMS) or core resolution (then use MAE or MAD), and also whether the size of the absolute or relative error matters.
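To spell out the regression metrics Troels lists at 14:33, a small NumPy sketch (the arrays E and T are made-up placeholders for estimates and true values):

# RMS and MAE of the difference (E-T) and of the relative difference (E-T)/T.
import numpy as np

E = np.array([10.2, 4.9, 7.4, 3.1])   # placeholder predictions
T = np.array([10.0, 5.0, 8.0, 3.0])   # placeholder true values (non-zero, for the relative case)

diff = E - T              # absolute difference
rel  = (E - T) / T        # relative difference

rms_diff = np.sqrt(np.mean(diff**2))   # RMS: sensitive to outliers
mae_diff = np.mean(np.abs(diff))       # MAE: reflects the "core" resolution
rms_rel  = np.sqrt(np.mean(rel**2))
mae_rel  = np.mean(np.abs(rel))
# MAD (median absolute deviation) would use np.median instead of np.mean.

print(rms_diff, mae_diff, rms_rel, mae_rel)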
14:34:40 From Ann-Sofie Priergaard Zinck : Thanks
14:36:53 From Sofus Kjærsgaard Stray : How are people reading the train/test files? Pandas' read_hdf doesn't work
14:37:01 From Troels Petersen : @Andy: Your autoencoder approach to human genome data is very interesting. People have tried it, and it works nicely, but not better than t-SNE and the like… so they both reach close to the limit of the information available.
14:37:27 From Troels Petersen : However, my personal local source tells me that in fact UMAP seems to be doing slightly better and is preferred :-)
14:38:52 From Troels Petersen : @Sofus: We will provide a train/test reader before Wednesday… wanted you to focus on trying it on the other datasets first :-)
14:39:48 From Troels Petersen : @Haider: Yes, the k-NN becomes a good way to consider multi-dimensional data fast and easily. That is why it is good for error detection (e.g. in experiments), simply because the errors are so different that it will "easily" find them.
14:39:57 From Sofus Kjærsgaard Stray : Alright, I'll wait till then I guess
14:40:01 From Ann-Sofie Priergaard Zinck : @Sofus: https://www.nbi.dk/~petersen/Teaching/ML2020/SmallProject/ReadingData.py
14:40:07 From Andy Anker : Okay thanks, I will try UMAP out then!
14:40:35 From Troels Petersen : @Ann-Sofie: Yes, this is an "old style" reader… which we want to update.
14:41:11 From Sofus Kjærsgaard Stray : That reader doesn't work for me in any case
14:43:06 From Katja : Have you installed h5py?
14:43:30 From Sofus Kjærsgaard Stray : I have h5py, yes
14:44:04 From Sofus Kjærsgaard Stray : Yeah, okay, so I restarted my kernel around 5 times and it didn't work
14:44:06 From Sofus Kjærsgaard Stray : but the 6th time it did
14:44:07 From Sofus Kjærsgaard Stray : magic
14:45:31 From Brian Vinter : I'm working from a very small screen, so I cannot see the chat window at all times
14:45:35 From Runi : I'm going through the Cancer.py code, but I cannot get different results, as seems to be intended: "# Run it a few times and realize that success differ extremely. Several approaches can be tried to remedy this." I get the same (Wrong, Right) every time. Has anyone come across this?
14:45:44 From Brian Vinter : Please speak up if you need me to address something
14:46:39 From Runi : I get this every time: [ 66 503] (Wrong, Right)
14:46:57 From Brian Vinter : You will get the same end result for the version with interactions, yes - try to make it run only once and see the result
14:51:45 From Jonathan : So when you use KNN to see if there's anything worth exploring in terms of classification (with the goal of using a better algorithm down the line), how do you determine if you should go further? What results on KNN are good enough that we should go on with the project and use more advanced methods?
14:52:21 From Emil Martiny : First step is the comparison to just guessing.
14:52:41 From Jonathan : yeah, but how much better than guessing is good enough?
14:52:53 From Brian Vinter : That's a good question :)
14:53:33 From Brian Vinter : Again we venture into art-over-science - a typical rule of thumb would be that KNN should get to approximately 70% accuracy
14:53:53 From Troels Petersen : @Jonathan: Yes, Emil gives one point. Another could be if there is a class of events that clearly stands out. Maybe you want to take those out of the sample (errors). Further, if you see clear structure, but not perfect separation, then you might want to simply go explore with other algorithms, using the kNN as a baseline for how well they should do!
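On the file-reading question from 14:36-14:44, a minimal sketch of one way to inspect and read an HDF5 file (the filename "train.h5" is a placeholder, and this assumes the file is pandas-written HDF5; the course's ReadingData.py linked above remains the reference):

# Inspect the file structure first, then try pandas' read_hdf.
import h5py
import pandas as pd

with h5py.File("train.h5", "r") as f:   # list the top-level groups/keys
    print(list(f.keys()))

# If the file was written by pandas (to_hdf/HDFStore), read_hdf can load it;
# pass the key explicitly if there is more than one.
df = pd.read_hdf("train.h5")
print(df.shape)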
14:53:54 From Runi : @Brian I'm drawing a blank on "version with interactions". Can you clarify?
14:53:58 From Brian Vinter : But you really have to consider your context and (hopefully) prior knowledge of the area
14:54:45 From Brian Vinter : "interactions" should have been "iterations"
14:55:15 From Brian Vinter : So run it only with the initial random guess and watch how badly it goes
14:55:37 From Brian Vinter : Then add the iterations and observe how you converge to the same result (most times)
14:59:29 From Haider Fadil Ali Hussein Al-Saadi : How should I break a tie in the class vote in k-NN?
14:59:40 From Haider Fadil Ali Hussein Al-Saadi : choose only odd K?
15:00:52 From Jonathan : Let's say you have a skewed dataset consisting of 70% 1 and 30% 0; then just guessing 1 would get me 70% accuracy, right? How much better should my KNN then do before I say, OK, there's something here worth exploring further?
15:05:13 From Brian Vinter : Tiebreak: usually odd K - but if that's not enough, you simply extend K by one
15:08:11 From Brian Vinter : Skewed dataset…. The 70% should still hold - assuming that the validation set has the same distribution as your training set
15:39:22 From Brian Vinter To Troels Petersen(privately) : Things seem to have calmed down? I'll disappear now and then - expecting that you will take over in a moment to talk about the small project!
15:39:41 From Troels Petersen To Brian Vinter(privately) : That is quite alright…
16:05:38 From Sofus Kjærsgaard Stray : It's basically only 25 of them that have any feature importance anyway
16:06:16 From Emy Alerskans : I read that the PCA variables you obtain after doing a PCA do not strictly correspond to the original variables in the dataset, i.e. more than one original variable contributes to each PCA variable, so to speak. Is this true? Then how would we use the PCA in a meaningful way to find the best original variables?
16:11:17 From Svend : How do we read the h5 data?
16:12:01 From Rasmus Salmon : Should we submit source code?
16:13:30 From Jonathan : Can we use p_truth_E as a variable in the classification problem?
16:13:36 From Jonathan : as a training variable
16:13:48 From Aske R. : didn't work for me when I copy-pasted it
16:14:33 From Andy Anker : Jonathan: The p_truth_E variable is not in the test file, so that is not possible
16:15:25 From Michael : Can we just take the PCA and the 15 largest components of the original features?
16:15:35 From Aske R. : SHAP
16:15:51 From Sofus Kjærsgaard Stray : You can run your machine learning on all the variables and get the feature importance out
16:15:59 From Rasmus Salmon : https://scikit-learn.org/stable/modules/feature_selection.html
16:16:24 From Giorgos : Albert Alonso told me that he will use the test data to train the MLA in order to impress you, so have that in mind, Troels.
16:16:38 From Haider Fadil Ali Hussein Al-Saadi : genius
16:16:59 From Albert Alonso : what? xdd
16:17:16 From Giorgos : just a joke, just a joke
16:18:18 From Svend : If I invent an algorithm that successfully earns money in the stock market, should I share it or keep it to myself?
16:20:21 From kristoffer : ok
16:21:11 From Aske R. : should the clustering also aim at separating signal and background
16:21:15 From kristoffer : yeah it's ok
16:21:30 From Aske R. : or just a general clustering to find any relation
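On the feature-importance route mentioned at 16:15 (SHAP, or scikit-learn's feature_selection module), a sketch of one simple way to rank variables and keep the top 15. The synthetic data is a stand-in; in practice X and y would come from the small-project training set.

# Rank variables by a random forest's feature importances and keep the top 15.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=40, n_informative=10, random_state=0)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(X.shape[1])])   # placeholder variable names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))   # top-15 candidate variables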
16:21:53 From Aske R. : so terminating at a higher number of k wouldn't be a bad idea?
16:22:07 From Aske R. : k = clusters
16:30:43 From kristoffer : What is state of the art regarding clustering? (Which packages do you use?)
16:31:02 From kristoffer : @Troels
16:31:45 From Troels Petersen : I would think UMAP, with t-SNE following closely. It is simply that PCA is only linear, while the two above "break" with that.
16:34:28 From Troels Petersen To Brian Vinter(privately) : Are you there? Otherwise just end the recording and send it to me when convenient (preferably dropped in a public place on top.nbi.dk).
16:48:23 From Aske R. : Which variables weren't to be used? I didn't get it written down
16:49:47 From Carl-Johannes Johnsen : @Aske: the website for the small project has the variable list, which does not contain the variables you shouldn't use. So you should not use the variables not in the variable list :)
16:51:23 From Rasmus Ørsøe : Seems like the list was just updated
16:51:58 From Troels Petersen : You may use all the variables in the list given on the webpage, and this list contains all variables except the "bookkeeping variables": index, eventnumber, and runnumber.
16:52:50 From Troels Petersen : Yes, the list was just updated, as I admittedly had forgotten to remove the variables in the list it linked to (had it in another place, sigh!)…
16:53:32 From Aske R. : https://www.nbi.dk/~petersen/Teaching/ML2020/VariableList.html
16:53:36 From Aske R. : leads me to a "not found"
16:54:12 From Troels Petersen : Re-load… it is here:
16:54:14 From Troels Petersen : https://www.nbi.dk/~petersen/Teaching/ML2020/SmallProject/VariableList.html
16:56:04 From Aske R. : still not on my end :/
16:56:13 From Katja : How should I understand the data format? When I use the ReadingData.py (8 kB) you provided and say np.shape(data), I get (162500,) - I would have expected (xx, 166) or something like that.
16:56:57 From Aske R. : the link you posted works, but the one I can click on from the website doesn't
16:57:19 From Troels Petersen : Hi Aske. Good point - I'll update them.
16:58:38 From Troels Petersen : Hi Katja. This is an older version of a reader, which will be updated for Wednesday. But I'm still surprised that you don't get the right shape out. I'll make sure the shape is correct for "Wednesday's edition" :-)
16:59:05 From Andy Anker : What is a good strategy to determine the variables to use in the clustering example? As I see it, it is not possible to use SHAP here.
17:01:10 From Troels Petersen : Hi Andy. Well, since the unsupervised learning is about clustering, it aims at classification rather than regression, though the latter will also have a part to play. Thus, variables good for classification will probably be useful. Otherwise, you can simply try, say, N different random combinations and see which clusters the data the most.
17:01:24 From Sofus Kjærsgaard Stray : I'd just use the best ones from feature importance testing @Andy
17:02:18 From Andy Anker : Okay, that is also my current strategy
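A sketch of the "try N random combinations and see which clusters the data the most" idea from 17:01, using k-means and the silhouette score as one possible quality measure (the synthetic data, the subset size of 5, k=3 clusters, and N=20 trials are all placeholder assumptions):

# Try random variable subsets, cluster each with k-means, and keep the best-scoring one.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=2000, n_features=20, centers=3, random_state=0)   # stand-in data
X = StandardScaler().fit_transform(X)                                         # put variables on a common scale

rng = np.random.default_rng(0)
best_score, best_cols = -1.0, None
for _ in range(20):                                              # N = 20 random combinations
    cols = rng.choice(X.shape[1], size=5, replace=False)         # pick 5 variables at random
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols])
    score = silhouette_score(X[:, cols], labels)
    if score > best_score:
        best_score, best_cols = score, cols
print("Best silhouette:", best_score, "with columns:", sorted(best_cols))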