Another Principal Component Analysis Post: A simple Python example

Carter Rhea
Nov 1, 2020

There are dozens (if not hundreds) of posts on principal component analysis on Medium. So why have I decided to write another post on this topic? The answer is simple: I wanted to provide the community with a very simple explanation and example of principal component analysis. This notebook was presented as part of a two-day workshop on machine learning for the McGill Hackathon (https://www.physics.mcgill.ca/hackathon/). After the positive responses I received from presenting PCA (Principal Component Analysis) in this manner, I decided to augment the post and include it here. I hope you enjoy it :)

To start off: there is nothing particularly complicated about PCA. At its heart, PCA simply reduces the number of attributes in a data set by projecting it onto its principal components (get it?). Each component encodes a certain amount of the variance in the initial dataset: the first principal component encodes the most variance, the second principal component encodes the second most variance, and so on…

A natural question is: why would I use PCA?
Well, I usually formulate the answer like this:
PCA helps your algorithms focus on what is important instead of unhelpful (and often harmful) noise
To me, it really is that simple :)

We can calculate our principal components in five easy steps:
1 — Calculate the mean of each feature
2 — Calculate the covariance matrix of the features
3 — Calculate the eigenvalues and eigenvectors of the covariance matrix
4 — Sort the eigenvalues and keep the top k
5 — Construct the transformation matrix from the eigenvectors

Let’s see what this looks like! We first need to import matplotlib.pyplot and numpy :) Then we can define and visualize our dataset!
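A minimal sketch of that setup might look like the following; the two correlated features (and the random seed) are illustrative choices of mine, nothing special:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy dataset: two correlated features (illustrative values of my choosing)
rng = np.random.default_rng(42)
x1 = rng.normal(0.0, 1.0, 200)
x2 = 0.8 * x1 + rng.normal(0.0, 0.4, 200)
X = np.column_stack([x1, x2])  # shape: (n_samples, n_features)

# Quick scatter plot of the raw data
plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```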

Here is my lazy visualization of the data :)

Great! Now we can get to the fun stuff! We are going to calculate the mean value of each feature, center the data by subtracting each feature's mean from its values, and then calculate the covariance matrix.
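In code, steps 1 and 2 boil down to a few lines (reusing the X defined above):

```python
# Step 1: mean of each feature
means = X.mean(axis=0)

# Center the data by subtracting the feature means
X_centered = X - means

# Step 2: covariance matrix of the features (features as columns)
cov = np.cov(X_centered, rowvar=False)
print(cov)
```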

At this point, we have our covariance matrix and can immediately calculate the eigenvalues and eigenvectors! We will be using the NumPy linear algebra eigenvalue solver, numpy.linalg.eig (https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html).
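Here is a sketch of steps 3 and 4; note that column i of the returned eigenvector matrix pairs with eigenvalue i, so we sort the columns alongside the values:

```python
# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Step 4: sort by eigenvalue, largest first; column i of
# `eigenvectors` pairs with eigenvalues[i]
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
print(eigenvalues)  # variance captured by each component
```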

Visualization of Principal Components
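A plot along those lines can be produced with something like this; scaling each arrow by the square root of its eigenvalue is purely a visibility choice on my part:

```python
# Draw each principal component over the centered data
plt.scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.5)
for val, vec in zip(eigenvalues, eigenvectors.T):
    plt.arrow(0, 0, np.sqrt(val) * vec[0], np.sqrt(val) * vec[1],
              color="red", width=0.02, length_includes_head=True)
plt.axis("equal")
plt.show()
```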

Now that we have calculated our principal components, we can construct our transformation matrix, which will enable us to project any other data set into our PCA subspace!
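Step 5 is just a matter of stacking the top k eigenvectors into a matrix and multiplying; keeping k = 1 here is an illustrative choice:

```python
# Step 5: transformation matrix from the top k eigenvectors
k = 1  # keeping one component is purely illustrative
W = eigenvectors[:, :k]  # shape: (n_features, k)

# Project the (centered) data into the PCA subspace
X_pca = X_centered @ W  # shape: (n_samples, k)
print(X_pca[:5])
```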

And with that, we have completed our PCA tutorial! Hopefully, you now have a working understanding of each step of principal component analysis. In practice, we simply use the sklearn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
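For reference, the scikit-learn equivalent of everything above fits in a few lines (n_components=1 mirrors the illustrative choice from the previous step):

```python
from sklearn.decomposition import PCA

# scikit-learn handles the centering internally
pca = PCA(n_components=1)
X_pca_sklearn = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```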


Carter Rhea

PhD Student in Astrophysics at the University of Montreal working on machine learning in astronomy. Co-founder of cadena.ca