iui-group-l-name-zensiert/0-pilot-project/Process.md

# Using kNN
## Finding optimal k-Value
Testing on the original dataset (80:20 split) showed that the optimal k-value is 3.

Running kNN on the dataset without any preprocessing gives the following result (columns: precision, recall, F1-score, support):
> weighted avg       0.97      0.97      0.97     56000
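
The k search described above could look roughly like the following. This is a minimal sketch, not the code we actually ran: it assumes scikit-learn, uses the small bundled `load_digits` dataset as a stand-in for MNIST so that it runs without a download, and the candidate range for k and the `random_state` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, NOT the MNIST set used in these notes.
X, y = load_digits(return_X_y=True)

# 80:20 split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical candidate range; the notes do not record which values were tried.
candidate_ks = range(1, 11)
scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)  # accuracy on the held-out 20%

best_k = max(scores, key=scores.get)
print(f"best k on the stand-in data: {best_k} (accuracy {scores[best_k]:.3f})")

# Per-class precision/recall/F1 plus the "weighted avg" row quoted above.
best_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
report = classification_report(y_test, best_model.predict(X_test))
print(report)
```

Picking k on the same held-out split that is used for the final report tends to give a slightly optimistic score; a separate validation split or cross-validation would be a cleaner setup.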

# Dataset optimization
## Standardization
### Standard
Applying `StandardScaler` to the MNIST dataset did not seem to change the outcome, so we omitted standardization.  
The likely reason is that the MNIST dataset has already been prepared for processing.
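
A comparison like the one above can be sketched as follows. This is only an illustration under assumptions: it uses scikit-learn's bundled `load_digits` data instead of MNIST, fixes k at 3 as found earlier, and the variable names and `random_state` are made up for the example. The results on MNIST may differ.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (8x8 digits), NOT the MNIST set used in these notes.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline: kNN on the raw pixel values.
raw = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Same classifier behind a StandardScaler. The pipeline fits the scaler on the
# training data only and reuses those statistics for the test data.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled.fit(X_train, y_train)

acc_raw = raw.score(X_test, y_test)
acc_scaled = scaled.score(X_test, y_test)
print(f"accuracy without scaling: {acc_raw:.3f}")
print(f"accuracy with StandardScaler: {acc_scaled:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps test-set statistics out of the scaling step, which makes the two accuracies directly comparable.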

### MinMax
Needs to be updated.

## Feature selection
To be tested

## Feature reduction
### PCA
Testing with PCA and plotting the number of components against the explained variance, we found that 98.64% of the variance could be retained with only 300 components [^1].

Further testing showed that 709 components retain 99.99999999999992% of the variance, the same value as with all 784 components (the original number of features). This means that little to no variance, and therefore information, is lost when using 709 components instead of 784 [^2].

For now we will simply use `n_components=709`.
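
The component-versus-variance curve can be computed along the following lines. This is a minimal NumPy sketch under assumptions, not our actual test code: the data is synthetic (20 features driven by 5 underlying directions plus a little noise), and the helper names `cumulative_explained_variance` and `components_for` are hypothetical. It also shows why the curve can saturate well before the full feature count is reached.

```python
import numpy as np


def cumulative_explained_variance(X):
    """Cumulative share of the total variance explained by the first 1..n components."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # eigenvalues of its covariance matrix, i.e. the per-component variances.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances / variances.sum())


def components_for(cumulative, threshold):
    """Smallest number of components whose cumulative share reaches `threshold`."""
    index = int(np.searchsorted(cumulative, threshold))
    # Clamp in case rounding keeps the last value just below the threshold.
    return min(index + 1, len(cumulative))


# Synthetic stand-in data: 500 samples, 20 features, 5 underlying directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

cumulative = cumulative_explained_variance(X)
n_98 = components_for(cumulative, 0.98)
print(f"components needed for 98% of the variance: {n_98} of {X.shape[1]}")
# The last value is 1.0 only up to floating-point rounding.
print(f"cumulative share with all components: {cumulative[-1]!r}")
```

With scikit-learn, the same search can be left to `PCA` by passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`), which selects the number of components needed to reach that share of the variance.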

### LDA
To be tested

# TODO
- [ ] Look up the purpose of the covariance matrix and how it works
    - https://www.youtube.com/watch?v=152tSYtiQbw
    - Probably part of PCA
- [ ] Reference for standardization not changing results of classifier
- [ ] Reference for MNIST already being standardized
- [ ] Test standardization method other than `StandardScaler`
- [ ] Test feature reduction method other than `PCA` (e.g. LDA, Linear Discriminant Analysis)
    - https://en.wikipedia.org/wiki/Dimensionality_reduction
    - https://towardsdatascience.com/is-lda-a-dimensionality-reduction-technique-or-a-classifier-algorithm-eeed4de9953a
    - https://medium.com/machine-learning-researcher/dimensionality-reduction-pca-and-lda-6be91734f567
    - https://towardsdatascience.com/dimensionality-reduction-does-pca-really-improve-classification-outcome-6e9ba21f0a32
- [ ] Add feature selection process
    - https://scikit-learn.org/stable/modules/feature_selection.html


[^1]: https://medium.com/@miat1015/mnist-using-pca-for-dimension-reduction-and-also-t-sne-and-also-3d-visualization-55084e0320b5
[^2]: Could be due to floating-point rounding in Python

Commit note, 2021-05-17 03:34:11 +02:00: Too many changes to properly fit in a commit msg; changes will be discussed on Discord, or simply ask me.