Decided to use a tree-based method for the classification (LightGBM), an ANN for the regression, and K-Means for the clustering problem.

1. Classification_MariaRasmussen_LGBM.txt: LightGBM (13.5 MB)
Feature selection: Ran a random search over 100 HP configurations (cv=5, early_stopping_rounds=30, max_depth=-1). For the best model I calculated the SHAP values (shap.TreeExplainer()) and kept the 15 highest-ranking features.
HPs: The random search was repeated on the 15 chosen features. Ended up with a lot of trees (best iteration at 1035 estimators). The accuracy gain may not be enough to justify the many extra parameters compared to a simpler model.

2. Regression_MariaRasmussen_NN.txt: Keras/TensorFlow (3.4 MB)
Feature selection: First removed zero-variance features and duplicate columns (141 features left). Using a neural network with 3 hidden layers of 150 neurons each, I calculated the SHAP values (shap.DeepExplainer()) and kept the 50 highest-ranking features. Repeated the procedure and ended with 10 features (the first 7 were clear choices, but beyond that it was less clear which features to include).
HPs: Used random search for HP optimization (keras_tuner.RandomSearch()). Tested: n_hidden_layers between 2 and 10, neurons per layer between 32 and 512 in steps of 32, learning rate in {0.01, 0.001, 0.0001}, activation function relu, loss function mean squared error. Tried 50 HP configurations and ended with 4 hidden layers (224, 192, 448, 320) and learning_rate = 0.001. Then optimized the number of epochs using early stopping with patience=100, saving the best model based on 'val_loss'.

3. Clustering_MariaRasmussen_KMeans.txt: SKLearn KMeans, n_clusters=6 (651.8 kB)
Feature selection: Removed features with variance=0, calculated the correlation matrix of the normalized features ([0,1]), and only included a feature if it had abs(correlation) below 0.8 with every feature already included. After this, 65 features were still left. Did not really know how to proceed, so I tried 50 random combinations of the 65 features and ran KMeans with n_clusters=5 on each. The average silhouette value (SKLearn, sample_size=10000) as well as the fraction of electrons in each cluster was monitored, and a feature combination that looked "okay" on both accounts was chosen.
HPs: Having chosen the features, n_clusters was varied (3, 4, 5, 6, 7, 8, 9) and n_clusters=6 was chosen based on the silhouette distribution of the clusters (this choice was not very obvious). The model seems to at least partly divide the entries into electron/non-electron clusters: clusters [0,1,2,3] contain an electron fraction > 0.82, while clusters [4,5] contain an electron fraction < 0.35.

Rough code sketches of the main steps in each item are given below; they are illustrations under stated assumptions, not the code from the submitted files.
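
Item 1, sketch: one way the random search with early stopping and the subsequent shap.TreeExplainer() ranking could look. The search space, the train/validation split, and the synthetic data are assumptions, not taken from the submitted file; the sketch also assumes a binary problem so that the SHAP values come back as a single (samples x features) array.

    import numpy as np
    import shap
    from lightgbm import LGBMClassifier, early_stopping
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    # synthetic stand-in for the real feature matrix and class labels
    X, y = make_classification(n_samples=5000, n_features=60, n_informative=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # 100 random HP configurations with 5-fold CV; this search space is made up
    param_dist = {
        "num_leaves": [31, 63, 127, 255],
        "learning_rate": [0.01, 0.05, 0.1],
        "min_child_samples": [10, 20, 50],
        "subsample": [0.6, 0.8, 1.0],
    }
    search = RandomizedSearchCV(
        LGBMClassifier(max_depth=-1, n_estimators=5000),
        param_dist, n_iter=100, cv=5, random_state=0)
    search.fit(X_train, y_train,
               eval_set=[(X_val, y_val)],
               callbacks=[early_stopping(stopping_rounds=30)])

    # rank features of the best model by mean |SHAP| and keep the top 15
    explainer = shap.TreeExplainer(search.best_estimator_)
    shap_values = explainer.shap_values(X_val)
    if isinstance(shap_values, list):      # some shap versions return one array per class
        shap_values = np.mean([np.abs(v) for v in shap_values], axis=0)
    mean_abs = np.abs(shap_values).mean(axis=0)
    top15_idx = np.argsort(mean_abs)[-15:]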
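
Item 2, sketch of the feature-selection step: a 3x150 reference network (as described above) plus a shap.DeepExplainer() ranking that keeps the 50 features with the largest mean |SHAP|. The background-set size, training settings, and synthetic data are assumptions; DeepExplainer behaviour can depend on the TensorFlow/shap versions.

    import numpy as np
    import shap
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from tensorflow import keras

    # synthetic stand-in data; the real input had 141 candidate features at this stage
    X, y = make_regression(n_samples=5000, n_features=141, n_informative=30, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # reference network: 3 hidden layers with 150 neurons each (as described above)
    model = keras.Sequential([
        keras.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(X_train, y_train, epochs=20, validation_split=0.1, verbose=0)

    # DeepExplainer needs a background sample; a few hundred rows is a common choice
    background = X_train[np.random.choice(len(X_train), 200, replace=False)]
    explainer = shap.DeepExplainer(model, background)
    shap_values = np.squeeze(np.array(explainer.shap_values(X_val[:2000])))

    # rank features by mean |SHAP| and keep the 50 highest
    mean_abs = np.abs(shap_values).mean(axis=0)
    top50_idx = np.argsort(mean_abs)[-50:]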
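
Item 2, sketch of the HP search: the keras_tuner.RandomSearch() space described above, followed by refitting the best configuration with early stopping (patience=100) and a checkpoint on 'val_loss'. The file name, epoch counts, project name, and synthetic data are placeholders.

    import keras_tuner as kt
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from tensorflow import keras

    # synthetic stand-in data with the 10 selected features
    X, y = make_regression(n_samples=5000, n_features=10, noise=0.1, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    def build_model(hp):
        # search space from the notes: 2-10 hidden layers, 32-512 units in steps of 32,
        # relu activations, MSE loss, learning rate in {0.01, 0.001, 0.0001}
        model = keras.Sequential()
        model.add(keras.Input(shape=(X.shape[1],)))
        for i in range(hp.Int("n_hidden_layers", 2, 10)):
            model.add(keras.layers.Dense(hp.Int(f"units_{i}", 32, 512, step=32),
                                         activation="relu"))
        model.add(keras.layers.Dense(1))
        model.compile(optimizer=keras.optimizers.Adam(
                          hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
                      loss="mean_squared_error")
        return model

    tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=50,
                            overwrite=True, project_name="regression_hp_search")
    tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=0)

    # refit the best configuration; early stopping with patience=100 plus a checkpoint
    # that keeps the weights with the lowest 'val_loss'
    best_model = tuner.hypermodel.build(tuner.get_best_hyperparameters(1)[0])
    best_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=3000,
                   callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=100),
                              keras.callbacks.ModelCheckpoint("best_model.keras",
                                                              monitor="val_loss",
                                                              save_best_only=True)])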
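
Item 3, sketch of the variance/correlation filter: normalize to [0,1], then greedily keep a feature only if its absolute correlation with every already-kept feature is below 0.8. The random stand-in DataFrame and the column-order greedy pass are assumptions.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # X_df: DataFrame with all candidate features; here a random stand-in
    rng = np.random.default_rng(0)
    X_df = pd.DataFrame(rng.normal(size=(10000, 120)),
                        columns=[f"f{i}" for i in range(120)])

    # drop zero-variance features, then normalize everything to [0, 1]
    X_df = X_df.loc[:, X_df.var() > 0]
    X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X_df), columns=X_df.columns)

    # greedy filter: keep a feature only if |corr| < 0.8 with every feature already kept
    corr = X_norm.corr().abs()
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, other] < 0.8 for other in kept):
            kept.append(col)
    X_sel = X_norm[kept]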
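
Item 3, sketch of the n_clusters scan: for each candidate number of clusters, compute the subsampled silhouette score (sample_size=10000) and the electron fraction per cluster. The synthetic data and the is_electron label array are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # synthetic stand-in for the selected features and the electron flag
    X_sel, blob_id = make_blobs(n_samples=50000, n_features=8, centers=6, random_state=0)
    is_electron = (blob_id < 4).astype(int)        # placeholder 0/1 label

    for k in range(3, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_sel)
        # subsampled silhouette score, as in the notes (sample_size=10000)
        sil = silhouette_score(X_sel, km.labels_, sample_size=10000, random_state=0)
        # electron fraction per cluster
        fractions = [is_electron[km.labels_ == c].mean() for c in range(k)]
        print(k, round(sil, 3), [round(f, 2) for f in fractions])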