Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2021

©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

Singular value decomposition

In the book the Singular Value Decomposition is explained differently. The concept of singular value decomposition is tightly connected to Principal Component Analysis, and the methods used to compute the SVD are how we obtain the components in PCA. We will therefore switch between explaining the ideas in the context of PCA and using the "geometric" concepts of linear algebra. That way, those of you who know PCA but not SVD will have a hook, and those of you who know linear algebra will understand the ideas of PCA.

Consider a set of points in $\mathbb{R}^d$, for instance the following plot (in $\mathbb{R}^2$):

When we studied linear regression we considered linear approximations and minimized the squared error to find our function. That is, we only measured the error in the y-component.

Background

Projections of vectors

Consider two vectors $a$ and $b$; we would like to project $a$ onto $b$, i.e.

From the above picture we can think of the decomposition of $a$ in terms of $b$ as two vectors, namely $a_1$ (along $b$) and $a_2$; they are orthogonal, which we write as $a_1 \perp a_2$.

Let's say that we wish to approximate the data using a low-dimensional subspace, think of a low-dimensional hyperplane. In the 2d case there are only 1d hyperplanes (lines), but if the data is, say, 100-dimensional we could consider the best fitting 10-dimensional hyperplane. What we mean by best fitting is that the distance from each point to its projection onto our subspace is as small as possible. Think of our 2d example above: we would like to find the line such that the orthogonal projection gives the smallest error. Just looking at the plot we would take the line $y=x$.

But how do we formulate this rigorously?

Consider a line given by the unit vector $v$ and a point $x$. Then the projection of $x$ onto $v$ is, as above, given by $$ (v \cdot x) v $$
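A minimal numpy sketch of this projection (the vectors `v` and `x` below are just illustrative values):

```python
import numpy as np

v = np.array([1.0, 1.0]) / np.sqrt(2)   # a unit vector along the line y = x
x = np.array([2.0, 1.0])                # a point to project (illustrative values)

proj = (v @ x) * v                      # (v . x) v, the projection of x onto v
print(proj)                             # [1.5 1.5]
```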

We will now apply these ideas to IID samples of points $\{X_1,\ldots,X_n\} \subset \mathbb{R}^m$ with zero expectation. Let $v$ be a unit vector. Consider the projection of each $X_i$ onto $v$, but keep only the scalar coefficient $X_i \cdot v$, and define $$ Y_i = (X_i \cdot v) $$ The line with maximal empirical variance can then be written as $$ v_1:=\arg\max_{\|v\|=1} \frac{1}{n} \sum_{i=1}^n (Y_i - \overline{Y}_n)^2 = \arg\max_{\|v\|=1} \frac{1}{n} \sum_{i=1}^n ((X_i \cdot v) - \overline{X \cdot v}_n)^2 = \arg\max_{\|v\|=1} \sum_{i=1}^n |X_i \cdot v|^2 $$ where the last equality uses that the data has zero mean (so the sample mean term drops out) and that the factor $1/n$ does not change the maximizer.

If we construct a matrix $A$ of size $n \times m$ with rows $X_i$ then we can rewrite $$ \sum_{i=1}^n |X_i \cdot v|^2 = |Av|^2 $$ and our problem reduces to the linear algebra problem of, given an $n \times m$ matrix $A$, finding the direction that is most "expanded/least contracted" by $A$, in the following sense $$ \arg\max_{\|v\|=1} |Av| $$
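As a quick sanity check of the identity $\sum_i |X_i \cdot v|^2 = |Av|^2$, here is a small sketch using a randomly generated, centred matrix `A_demo` (an illustrative stand-in for the data matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A_demo = rng.normal(size=(100, 2))         # rows play the role of the points X_i
A_demo = A_demo - A_demo.mean(axis=0)      # centre so the sample mean is zero

v = np.array([1.0, 0.0])                   # any unit vector
lhs = np.sum((A_demo @ v) ** 2)            # sum_i |X_i . v|^2
rhs = np.linalg.norm(A_demo @ v) ** 2      # |A v|^2
print(np.isclose(lhs, rhs))                # True
```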

Note that the singular vectors are not necessarily unique; in fact if $v$ is a singular vector, then so is $-v$. We can also have ties, in which case we arbitrarily pick one. In the following we will, as in the book, assume that the singular vectors can be picked uniquely, for instance by requiring no ties and fixing the sign so as to make the vector unique.

Definition

The vector $v_1 \in \mathbb{R}^m$ of the ($n \times m$) matrix $A$, defined as $$v_1:= \arg\max_{\|v\|=1} |Av|$$ is called the first singular vector of $A$. The value $\sigma_1(A)$ defined as $$\sigma_1(A) := |Av_1|$$ is called the first singular value of $A$.

Note

In the context where $A$ is constructed from our IID vectors $X_1,\ldots,X_n$, we see that $\sigma_1^2/n$ is the empirical variance in the direction of the first singular vector, i.e. $\sigma_1$ is, up to the factor $\sqrt{n}$, the standard deviation in that direction.

Now that we have defined the first singular vector, we can define the second singular vector. This is simply a vector orthogonal to $v_1$ that again solves our maximization problem, i.e. $$ v_2 := \arg\max_{\|v\|=1, v \perp v_1} |Av| $$

We can interpret this as follows: consider the plane that has the first singular vector $v_1$ as its normal. Then our new problem is to find the vector $v$ that maximizes $|(P_1A)v|$, where $P_1A$ is the matrix with rows $P_1A_{1,\cdot},\ldots,P_1A_{n,\cdot}$ and $P_1$ is the projection of a vector onto the plane $v_1 \cdot x = 0$.

Theorem (Greedy Algorithm)

Let $A$ be an $n \times m$ matrix with singular vectors $v_1,\ldots,v_r$. For $1 \leq k \leq r$, let $V_k$ be the subspace spanned by $v_1,\ldots,v_k$. For each $k$, $V_k$ is the best fit $k$-dimensional subspace for $A$.

What is $V_k$? $$ V_k = \{\alpha_1 v_1 + \ldots + \alpha_k v_k: (\alpha_1,\ldots,\alpha_k) \in \mathbb{R}^k\} =: span(\{v_1,\ldots,v_k\}). $$ What do we mean by best fit? Let $\tilde V_k$ be another $k$-dimensional subspace, spanned by an orthonormal basis $\tilde v_1, \ldots, \tilde v_k$. The distance from a point $p$ to $\tilde V_k$ can be seen to be $$ \|p - proj_{\tilde V_k} p\| = \|p - \sum_{i=1}^k (\tilde v_i \cdot p)\tilde v_i\| $$ By best fit we mean that $V_k$ is the $k$-dimensional subspace that minimizes $$ \sum_{i=1}^n \|X_i - proj_{\tilde V_k} X_i\|^2 $$

But we can use the Pythagorean theorem to get $$ \sum_{i=1}^n \left (\|proj_{\tilde V_k}X_i\|^2 + \|X_i - proj_{\tilde V_k}X_i\|^2 \right ) = \sum_{i=1}^n \|X_i\|^2 $$

and thus we get $$ \sum_{i=1}^n \left (\|X_i\|^2 - \|proj_{\tilde V_k}X_i\|^2 \right ) = \sum_{i=1}^n \|X_i - proj_{\tilde V_k}X_i\|^2 $$

Since $\sum_{i=1}^n \|X_i\|^2$ is fixed, we see from the above that the best fitting subspace is the subspace that maximizes the "variance" in the sense that we have seen. The point is that finding $v_1,\ldots,v_k$ greedily by maximizing the variance is the same as directly minimizing the total squared deviation from the subspace.

We can now use the code above to find the singular vectors. Let us take a look at the example data I plotted above.
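The original notebook cell is not reproduced here; below is a minimal sketch of how the first singular vector could be found for two-dimensional data by brute force, assuming the plotted example points are stored in an array `A` with one point per row:

```python
import numpy as np

def first_singular_vector_2d(A, n_angles=1000):
    """Brute-force search over directions in the plane for
    argmax_{|v|=1} |Av| (only feasible in very low dimensions)."""
    angles = np.linspace(0, np.pi, n_angles)              # v and -v give the same |Av|
    vs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    norms = np.linalg.norm(A @ vs.T, axis=0)              # |Av| for every candidate v
    return vs[np.argmax(norms)]

v1 = first_singular_vector_2d(A)   # A: the centred 2-d example data
print(v1)
```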

As we can see this is pretty much identical to the vector in the 45 degree direction. To find the second singular vector, we simply project our data onto the plane that has $v_1$ as its normal, as sketched below.
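A small sketch of that projection, assuming `A` and `v1` from the sketch above: removing the component along $v_1$ from every row leaves the data in the plane with normal $v_1$.

```python
# Remove the v1-component from every row of A (projection onto the plane v1 . x = 0)
A_perp = A - np.outer(A @ v1, v1)
```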

It's clear which direction this is headed. Let us also look at what happens when we project the data onto the plane with normal $v_2$.

Question: What have we done, two projections in a row? What is the projection of a projection?

It should be clear from the definition and the "Greedy Algorithm" theorem that $proj_{V_m} A = A$. That is, if we use all possible singular vectors, then we can represent the data from $A$ as points in $V_m$; in other words, any row of $A$ can be written as a linear combination of all the singular vectors.

We can now implement findSingularVectors using our methods above
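The original implementation is not shown here; one possible sketch greedily maximizes $|Xv|$ numerically with `scipy.optimize` over unit vectors orthogonal to the ones already found (not necessarily how the notebook did it):

```python
import numpy as np
from scipy.optimize import minimize

def findSingularVectors(X, k=1):
    """Greedily find k right singular vectors of X by numerically maximizing
    |Xv| over unit vectors v orthogonal to those already found (a sketch)."""
    n, m = X.shape
    found = []
    rng = np.random.default_rng(0)

    def orthonormalize(v):
        # Remove components along previously found vectors, then normalize.
        for u in found:
            v = v - (v @ u) * u
        return v / np.linalg.norm(v)

    for _ in range(k):
        def objective(v):
            return -np.linalg.norm(X @ orthonormalize(v))
        res = minimize(objective, rng.normal(size=m))
        found.append(orthonormalize(res.x))
    return np.array(found)
```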

Running findSingularVectors on X is as simple as
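Assuming the data from the plots is stored in an array `X` (one point per row):

```python
v1, v2 = findSingularVectors(X, k=2)   # X: the centred data matrix, one point per row
print(v1, v2)
```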

Question: Is this a good way of finding the singular vectors?

YouTry:

Test and see what happens if we try to apply our findSingularVectors to something with more dimensions, say 10 dimensions, and we would like 5 singular vectors. What happens? Why?

Singular Value Decomposition of a Matrix

Remember that we said that if we compute $m$ singular vectors of the $n \times m$ dimensional matrix $A$, then $$ proj_{V_m} A = A $$ This implies that we can write each row in $A$ as $X_i = \sum_{j=1}^m (X_i \cdot v_j) v_j$, which we can rewrite as $$ A = \sum_{j=1}^m A v_j v_j^T $$ Denoting $u_j:=\frac{A v_j}{\sigma_j}$ we see that the above expression becomes $$ A = \sum_{j=1}^m \sigma_j u_j v_j^T $$

This is the singular value decomposition, i.e. we have decomposed $A$ into a sum of matrices: each $u_j v_j^T$ is an $n \times m$ matrix.

Rewriting the above equation in matrix format we get $$ A = UDV^T $$ where $U$ is the matrix with $u_1,\ldots,u_m$ as columns, $D$ is the diagonal matrix with $\sigma_1,\ldots,\sigma_m$ on the diagonal, and $V$ is the matrix with $v_1,\ldots,v_m$ as columns.

Definition

The vectors $u_i$ are called the left singular vectors.

Power method

Now let us consider the matrix $A^TA$. This matrix has dimension $m \times m$, and using our decomposition above we obtain $$ A^T A = (UDV^T)^T(UDV^T) = VDU^T UDV^T = VD^2V^T $$ since $U^TU = I$, which comes from the fact that the columns of $U$ are orthonormal. This means that for any column $v_i$ of $V$ $$ A^T A v_i = VD^2V^T v_i = \sigma_i^2 v_i $$ so we see that $v_i$ is the $i$:th eigenvector of $A^T A$ with eigenvalue $\sigma_i^2$. We can thus find the singular vectors by finding the eigenvectors of $A^T A$.

How do we find the eigenvectors of $B = A^T A$? First note that $$ B^k = (V D^2 V^T)^k = V D^{2k} V^T $$ by the same argument as above, i.e. $V^T V = I$. Thus we see that if $\sigma_1 > \sigma_2$ and we let $k$ be large enough, then $$ B^k \approx (\sigma_1)^{2k} v_1 v_1^T. $$
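A sketch of this idea in numpy: instead of forming $B^k$ explicitly, an equivalent (and numerically friendlier) variant applies $B$ repeatedly to a random start vector and renormalizes. The data-matrix name `A` is assumed to be the centred example data from above.

```python
import numpy as np

def power_method(A, iterations=100):
    """Approximate the first right singular vector of A by power iteration
    on B = A^T A (a sketch of the idea described in the text)."""
    B = A.T @ A
    x = np.random.default_rng(1).normal(size=A.shape[1])
    for _ in range(iterations):
        x = B @ x
        x = x / np.linalg.norm(x)    # renormalize to avoid huge numbers
    return x

v1_approx = power_method(A)          # A: the centred data matrix, one point per row
print(v1_approx)
```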

Let us check what result we got

The vector we get is supposed to be an eigenvector of $B$, let us check if that is true
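A possible check, assuming `A` and `v1_approx` from the power-method sketch above: if $v$ is an eigenvector, $Bv - \lambda v$ should vanish, where $\lambda$ is estimated by the Rayleigh quotient.

```python
B = A.T @ A
lam = v1_approx @ (B @ v1_approx)        # Rayleigh quotient, estimates sigma_1^2
print(B @ v1_approx - lam * v1_approx)   # all components should be close to zero
```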

OK, so most components are of order $10^{-7}$, which is fairly accurate. Let us check with numpy's built-in SVD and see what we get

SVD from numpy returns $U,D,V^T$ so that we can write Z1 = U@np.diag(D)@V.T. Remember that the columns of $V$ are the singular vectors, so we can extract them as follows
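A sketch of that extraction, assuming the data matrix is called `Z1` as above (the thin SVD is used so the shapes match, and the rows of the returned $V^T$ are the singular vectors):

```python
import numpy as np

U, D, Vt = np.linalg.svd(Z1, full_matrices=False)   # U: n x m, D: length m, Vt: m x m
v1 = Vt[0]   # first right singular vector = first row of V^T = first column of V
v2 = Vt[1]
```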

We can also check if we get back our original matrix $Z1$ if we multiply everything together
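For instance, with the thin SVD from above the check could look like:

```python
print(np.allclose(U @ np.diag(D) @ Vt, Z1))   # True up to floating point error
```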

PCA

What is PCA? Basically it is a coordinate transformation from the original coordinates to the coordinate system given by the singular vectors. Since $V$ is orthonormal the transformation is as simple as a matrix product, i.e.

$$ A = UDV^T $$

Recall that each row of $A$ is a data point, i.e. an $m$-dimensional vector, and that the columns of $V$ form an orthonormal basis. We project each point onto each basis vector $v_j$ using dot products, as in $(X_i \cdot v_j)v_j$; the coordinate in the basis is just $X_i \cdot v_j$, and as such we get $$ PCA(A) = AV = UDV^TV = UD $$ Let's try it out
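A quick sketch of both sides of this identity, assuming `U`, `D`, `Vt` from the thin SVD of `Z1` above:

```python
coords_projection = Z1 @ Vt.T     # A V: project each row onto the singular vectors
coords_svd = U * D                # U D: scale column j of U by sigma_j
print(np.allclose(coords_projection, coords_svd))   # True
```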

PCA class

Since we don't all have sklearn, let's implement our own PCA class that we can use from now on.
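One possible implementation (a sketch mimicking a small subset of sklearn's interface; the original notebook class may differ):

```python
import numpy as np

class PCA:
    """Minimal PCA via the SVD of the centred data matrix."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.components_ = Vt[:self.n_components]       # rows are singular vectors
        self.singular_values_ = D[:self.n_components]
        return self

    def transform(self, X):
        # Coordinates of X in the basis of singular vectors
        return (X - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, Y):
        # Back to the original coordinates
        return Y @ self.components_ + self.mean_
```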

SVD in Action

This is all cool and such, but what can you do with it?

Singular value decomposition can be used in the following ways

Factor Analysis

Example on compressing data

What we can see is that even with only 10 components we were able to represent the digits fairly well, although it is clear that some are not so easy.

Explained variance

Explained variance is the fraction of the total variance that is captured by our singular vectors. Remember the interpretation of the singular values in terms of standard deviation; the variance explained by the first $k$ components is just the sum of the squares of the first $k$ singular values divided by the total variance (the sum of all squared singular values).
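A sketch of the computation, assuming `D` holds the singular values from the SVD above:

```python
k = 2
explained_ratio = np.sum(D[:k] ** 2) / np.sum(D ** 2)   # fraction of variance explained
print(explained_ratio)
```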

Decorrelation

Note that since the singular vectors are orthogonal, we immediately have that the transformed values always have zero correlation.
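A quick numerical check of this, assuming the centred data and its thin SVD from above: with $A$ centred, the covariance matrix of the coordinates $UD$ is $\frac{1}{n-1} D U^T U D = \frac{1}{n-1} D^2$, which is diagonal.

```python
coords = U * D                                      # the PCA coordinates
print(np.round(np.cov(coords, rowvar=False), 8))    # off-diagonal entries ~ 0
```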

Another example on compressing data, building a classifier

From the above plot, it should be possible to use logistic regression to solve this problem

Cool, let's create a class to do LogisticRegression

Ok, so what happens if we try our code on the full data-set?

OK, so that's a problem. Why? Basically it is because the log-likelihood function is not numerically stable, in the sense that even for moderate values $e^x$ is outside the computable range. There are ways around this using approximations, because remember that $\log(1+e^{-x})$ is actually not such a bad function
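For instance, a standard trick (a sketch, not necessarily what the notebook used) is to evaluate $\log(1+e^{-x})$ as `np.logaddexp(0, -x)`, which never forms the overflowing exponential:

```python
import numpy as np

x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

naive = np.log(1 + np.exp(-x))    # overflows to inf (with a warning) for x = -1000
stable = np.logaddexp(0, -x)      # log(e^0 + e^{-x}) = log(1 + e^{-x}), no overflow
print(naive)
print(stable)
```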

The implementation in sklearn solves much of this and it allows us to complete our model fitting on the full dataset.

Comments

You would think that since both PCA and Logistic regression are linear, PCA+LogisticRegression would be the same as just LogisticRegression in the first place. This is not true, as PCA and LogisticRegression are minimizing different things; remember, for instance, the way we measure error (vertical vs orthogonal).

We can also illustrate it here.

Intriguing, so we see that it is something else. Perhaps at this point you also remember that PCA does not use the labels; it just represents the data $X$ in different coordinates.

Scaling data

Conclusion

The PCA scaler does a better job than the standard scaler in this case. In fact it is more in tune with the shape of the data and as such it makes more sense. This is a key tool that you can use in your life as a data scientist.

Recommender engines

Let us say that we have a library of $m$ movies. Furthermore assume that we have $n$ users watching these movies, each indicating a like whenever they liked a movie. We can represent this as an $n \times m$ matrix with a $1$ in position $(i,j)$ if user $i$ liked movie $j$.

There is an idea that is called collaborative filtering:

Note that from the singular value decomposition and what we did with PCA we can do the following

$$ A = U \Sigma V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} = U_1 \left(\Sigma_r V_1^T\right) . $$

This method is a type of rank factorization: note that with $C = U_1$ and $F = \Sigma_r V_1^T$, the product $CF$ is a low-rank factorization (approximation) of $A$.

Let's choose a small value for the rank, say $100$.
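A sketch of the truncated factorization, assuming the like-matrix is stored in an array `A` of shape $n \times m$:

```python
import numpy as np

r = 100                                  # the chosen rank
U, S, Vt = np.linalg.svd(A, full_matrices=False)
C = U[:, :r]                             # n x r, one row of "taste" coordinates per user
F = np.diag(S[:r]) @ Vt[:r]              # r x m, one column per movie
A_approx = C @ F                         # the rank-r approximation C F of A
```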

Now the idea is that $CF$ is the prediction of the scores, and we would construct a recommendation for each user based on what is predicted in $CF$ but the user has not yet seen. We could set a threshold on the predicted score, say $>1/2$, or we could simply list the recommendations in order of descending predicted score.
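A sketch of the second option, assuming `A` and `A_approx` from above:

```python
# For each user, rank the movies they have not yet liked by predicted score
scores = A_approx.copy()
scores[A > 0] = -np.inf                    # mask movies the user already liked
recommended = np.argsort(-scores, axis=1)  # movie indices, best predictions first
top10 = recommended[:, :10]                # e.g. the ten best suggestions per user
```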

This is a simple recommender system; recommender systems are a fairly large field of study.