Decided to use a tree-based method for the classification (LightGBM), an ANN for the regression, and K-Means for the clustering problem.

1. Classification_MariaRasmussen_LGBM.txt: LightGBM (13.5 MB)
Feature selection: Ran a random search over 100 HP configurations (cv=5, early_stopping_rounds=30, max_depth=-1). For the best model I calculated the SHAP values (shap.TreeExplainer()) and kept the 15 highest-ranking features.
HPs: The random search was repeated on the 15 chosen features. Ended up with a lot of trees (best iteration at 1035 estimators). The accuracy gain may not be enough to justify the many extra parameters compared to a simpler model.

2. Regression_MariaRasmussen_NN.txt: Keras/TensorFlow (3.4 MB)
Feature selection: First removed zero-variance features and duplicate columns (141 features left). Using a neural network with 3 hidden layers of 150 neurons each, I calculated the SHAP values (shap.DeepExplainer()) and kept the 50 highest-ranking features. Repeated the procedure and ended with 10 features (the first 7 were clear choices, but beyond that it was less clear which features to include).
HPs: Used random search for HP optimization (keras_tuner.RandomSearch()). Tested: n_hidden_layers between 2 and 10, neurons per layer between 32 and 512 in steps of 32, learning rate in {0.01, 0.001, 0.0001}, activation function relu, loss function mean squared error. Tried 50 HP configurations and ended with 4 hidden layers (224, 192, 448, 320) and learning_rate = 0.001. Then optimized the number of epochs using early stopping with patience=100, saving the best model based on 'val_loss'.

3. Clustering_MariaRasmussen_KMeans.txt: SKLearn KMeans, n_clusters=6 (651.8 kB)
Feature selection: Removed features with variance=0, calculated the correlation matrix of the normalized features ([0,1]), and only included a feature if it had abs(correlation) below 0.8 with every feature already included. After this, 65 features were still left. Did not really know how to proceed, so I tried 50 random combinations of the 65 features and ran KMeans with n_clusters=5 on each. The average silhouette value (SKLearn, sample_size=10000) as well as the fraction of electrons in each cluster was monitored, and a feature combination that looked "okay" on both accounts was chosen.
HPs: Having chosen the features, n_clusters was varied (3, 4, 5, 6, 7, 8, 9) and n_clusters=6 was chosen based on the silhouette distribution of the clusters (this choice was not very obvious). The model seems to at least partly divide the entries into electron/non-electron clusters: clusters [0,1,2,3] contain an electron fraction > 0.82, while clusters [4,5] contain an electron fraction < 0.35.

Rough code sketches of the main steps in each item are given below; they are illustrations under stated assumptions, not the code from the submitted files.
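
Item 1, sketch: one way the random search with early stopping and the subsequent shap.TreeExplainer() ranking could look. The search space, the train/validation split, and the synthetic data are assumptions, not taken from the submitted file; the sketch also assumes a binary problem so that the SHAP values come back as a single (samples x features) array.

    import numpy as np
    import shap
    from lightgbm import LGBMClassifier, early_stopping
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    # synthetic stand-in for the real feature matrix and class labels
    X, y = make_classification(n_samples=5000, n_features=60, n_informative=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # 100 random HP configurations with 5-fold CV; this search space is made up
    param_dist = {
        "num_leaves": [31, 63, 127, 255],
        "learning_rate": [0.01, 0.05, 0.1],
        "min_child_samples": [10, 20, 50],
        "subsample": [0.6, 0.8, 1.0],
    }
    search = RandomizedSearchCV(
        LGBMClassifier(max_depth=-1, n_estimators=5000),
        param_dist, n_iter=100, cv=5, random_state=0)
    search.fit(X_train, y_train,
               eval_set=[(X_val, y_val)],
               callbacks=[early_stopping(stopping_rounds=30)])

    # rank features of the best model by mean |SHAP| and keep the top 15
    explainer = shap.TreeExplainer(search.best_estimator_)
    shap_values = explainer.shap_values(X_val)
    if isinstance(shap_values, list):      # some shap versions return one array per class
        shap_values = np.mean([np.abs(v) for v in shap_values], axis=0)
    mean_abs = np.abs(shap_values).mean(axis=0)
    top15_idx = np.argsort(mean_abs)[-15:]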
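
Item 2, sketch of the feature-selection step: a 3x150 reference network (as described above) plus a shap.DeepExplainer() ranking that keeps the 50 features with the largest mean |SHAP|. The background-set size, training settings, and synthetic data are assumptions; DeepExplainer behaviour can depend on the TensorFlow/shap versions.

    import numpy as np
    import shap
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from tensorflow import keras

    # synthetic stand-in data; the real input had 141 candidate features at this stage
    X, y = make_regression(n_samples=5000, n_features=141, n_informative=30, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # reference network: 3 hidden layers with 150 neurons each (as described above)
    model = keras.Sequential([
        keras.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(X_train, y_train, epochs=20, validation_split=0.1, verbose=0)

    # DeepExplainer needs a background sample; a few hundred rows is a common choice
    background = X_train[np.random.choice(len(X_train), 200, replace=False)]
    explainer = shap.DeepExplainer(model, background)
    shap_values = np.squeeze(np.array(explainer.shap_values(X_val[:2000])))

    # rank features by mean |SHAP| and keep the 50 highest
    mean_abs = np.abs(shap_values).mean(axis=0)
    top50_idx = np.argsort(mean_abs)[-50:]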
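
Item 2, sketch of the HP search: the keras_tuner.RandomSearch() space described above, followed by refitting the best configuration with early stopping (patience=100) and a checkpoint on 'val_loss'. The file name, epoch counts, project name, and synthetic data are placeholders.

    import keras_tuner as kt
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from tensorflow import keras

    # synthetic stand-in data with the 10 selected features
    X, y = make_regression(n_samples=5000, n_features=10, noise=0.1, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    def build_model(hp):
        # search space from the notes: 2-10 hidden layers, 32-512 units in steps of 32,
        # relu activations, MSE loss, learning rate in {0.01, 0.001, 0.0001}
        model = keras.Sequential()
        model.add(keras.Input(shape=(X.shape[1],)))
        for i in range(hp.Int("n_hidden_layers", 2, 10)):
            model.add(keras.layers.Dense(hp.Int(f"units_{i}", 32, 512, step=32),
                                         activation="relu"))
        model.add(keras.layers.Dense(1))
        model.compile(optimizer=keras.optimizers.Adam(
                          hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
                      loss="mean_squared_error")
        return model

    tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=50,
                            overwrite=True, project_name="regression_hp_search")
    tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=0)

    # refit the best configuration; early stopping with patience=100 plus a checkpoint
    # that keeps the weights with the lowest 'val_loss'
    best_model = tuner.hypermodel.build(tuner.get_best_hyperparameters(1)[0])
    best_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=3000,
                   callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=100),
                              keras.callbacks.ModelCheckpoint("best_model.keras",
                                                              monitor="val_loss",
                                                              save_best_only=True)])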
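
Item 3, sketch of the variance/correlation filter: normalize to [0,1], then greedily keep a feature only if its absolute correlation with every already-kept feature is below 0.8. The random stand-in DataFrame and the column-order greedy pass are assumptions.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # X_df: DataFrame with all candidate features; here a random stand-in
    rng = np.random.default_rng(0)
    X_df = pd.DataFrame(rng.normal(size=(10000, 120)),
                        columns=[f"f{i}" for i in range(120)])

    # drop zero-variance features, then normalize everything to [0, 1]
    X_df = X_df.loc[:, X_df.var() > 0]
    X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X_df), columns=X_df.columns)

    # greedy filter: keep a feature only if |corr| < 0.8 with every feature already kept
    corr = X_norm.corr().abs()
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, other] < 0.8 for other in kept):
            kept.append(col)
    X_sel = X_norm[kept]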
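
Item 3, sketch of the n_clusters scan: for each candidate number of clusters, compute the subsampled silhouette score (sample_size=10000) and the electron fraction per cluster. The synthetic data and the is_electron label array are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # synthetic stand-in for the selected features and the electron flag
    X_sel, blob_id = make_blobs(n_samples=50000, n_features=8, centers=6, random_state=0)
    is_electron = (blob_id < 4).astype(int)        # placeholder 0/1 label

    for k in range(3, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_sel)
        # subsampled silhouette score, as in the notes (sample_size=10000)
        sil = silhouette_score(X_sel, km.labels_, sample_size=10000, random_state=0)
        # electron fraction per cluster
        fractions = [is_electron[km.labels_ == c].mean() for c in range(k)]
        print(k, round(sil, 3), [round(f, 2) for f in fractions])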