MaDL – Eigenvalue and Singular Value Decomposition
Let’s now talk about eigenvalue and singular value decomposition, which are very important tools that we’ll see in the deep learning lecture, the self-driving car lecture, and also the computer vision lecture. Many mathematical objects can be better understood by breaking them into parts. For example, integers can be decomposed into prime numbers, from which we can directly see which numbers a given integer is divisible by. Here we’re interested in decomposing matrices. An eigendecomposition decomposes a matrix into so-called eigenvectors and eigenvalues. An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A only alters its scale, and the scale factor is known as the corresponding eigenvalue. Here’s an example. We have a matrix A, the matrix we want to decompose. If there’s a vector v such that multiplying it with matrix A yields a scaled version of vector v, so A v = lambda v, then we call this vector v an eigenvector, and the corresponding factor lambda that scales v such that this equation holds, we call the eigenvalue. That’s what we mean by only altering the scale: we don’t change the direction of the vector, we only change the norm of the vector. As any scaled version of vector v is also an eigenvector, in order to make the eigendecomposition unique, we typically only consider unit eigenvectors. We require that these vectors are normalized such that they have Euclidean norm 1. We now concatenate all eigenvectors to form a matrix V, where all of the eigenvectors are in its columns. And similarly, we form a diagonal matrix Lambda from all eigenvalues, where there is always one eigenvalue that corresponds to each eigenvector: lambda 1 corresponds to v1, lambda 2 corresponds to v2, and so on.
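The defining property A v = lambda v is easy to check numerically. Here is a minimal sketch with NumPy, where the matrix and its eigenpair are made-up values chosen for illustration:

```python
import numpy as np

# A made-up symmetric 2x2 matrix, chosen for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# v = [1, 1] / sqrt(2) is a unit eigenvector of A with eigenvalue 3.
v = np.array([1.0, 1.0]) / np.sqrt(2.0)
lam = 3.0

# Multiplying by A only rescales v by lambda; the direction is unchanged.
print(np.allclose(A @ v, lam * v))  # True

# v is a unit eigenvector: its Euclidean norm is 1.
print(np.isclose(np.linalg.norm(v), 1.0))  # True
```

Note that any rescaled version of v, for example 5 * v, would satisfy the same equation, which is exactly why we normalize to unit norm.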
So we form this diagonal matrix where the eigenvalues corresponding to these eigenvectors are on the diagonal and the rest is 0, because it’s a diagonal matrix. As a little remark, to remove further ambiguity, by convention we typically sort the eigenvalues in descending order, which means that if lambda 1 is not the largest eigenvalue, then we swap it with a larger one until we achieve this descending order. And of course we need to swap the columns of V in the same way such that they still correspond. The eigendecomposition of A is then given by the following expression: A = V Lambda V^(-1), where Lambda is the diagonal matrix and V^(-1) is the inverse of V. Every real square matrix whose eigenvectors are linearly independent has such an eigendecomposition. Furthermore, every real and symmetric matrix A, which is also a very common case, can be decomposed into a simpler expression, Q Lambda Q^T. So we don’t have a minus one here, no inverse, we have just a transpose. That’s because Q in this case is an orthonormal matrix composed of the eigenvectors of A. It’s again a matrix with the columns being the eigenvectors, but in this case all of the columns are orthonormal to each other, which means that we can replace the inverse with the transpose. Let’s look at an example of what this transformation does to a set of vectors, a set of elements in space. What I’ve drawn here in black are the first eigenvector and the second eigenvector, v1 and v2. And I’ve also drawn a circle, which denotes a set of points, as well as two specific other vectors that have no specific meaning, just to see how they get transformed. And if we now apply matrix A to the circle and these vectors, we get this ellipse and these pink and orange vectors here.
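The decomposition A = V Lambda V^(-1), including the descending-order convention, can be verified numerically. A minimal sketch with NumPy, using a made-up non-symmetric matrix for illustration:

```python
import numpy as np

# A made-up non-symmetric but diagonalizable matrix.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

lams, V = np.linalg.eig(A)

# Sort eigenvalues in descending order, by convention,
# and reorder the columns of V in the same way.
order = np.argsort(lams)[::-1]
lams, V = lams[order], V[:, order]

Lambda = np.diag(lams)

# Reconstruct A = V Lambda V^(-1).
print(np.allclose(A, V @ Lambda @ np.linalg.inv(V)))  # True
```

For this non-symmetric A we really need the inverse of V; the simpler transpose form only applies in the symmetric case.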
But here we do it by decomposing this transformation into three sub-transformations. First we rotate: if we multiply a vector on the right-hand side, it gets multiplied first with Q^T. And because Q is an orthonormal matrix, we know that Q^T is just a rotation matrix, in 2D space for example. So we rotate this entire coordinate system such that it is aligned with the axes. Then we apply the diagonal matrix, which means we scale along the individual axes by the eigenvalues. Now you can see we have an ellipsoid that is stretched in the horizontal dimension and shrunk a little bit in the vertical dimension, because in this case we have assumed that lambda 1 is larger than 1 and lambda 2 is smaller than 1. And similarly, the example vectors, the orange and the purple one, get transformed as well. Then finally we apply Q, not Q^T this time, so we rotate back into the original coordinate system. What this transformation effectively does through these three sub-transformations is to distort this isotropic unit circle by scaling the space in the directions of v1 and v2 by lambda 1 and lambda 2, respectively. So in the direction of v1 we have stretched the space, and also the corresponding example vectors, and in the other direction we have shrunk the space, because lambda 1 is bigger than 1 and lambda 2 in this example is smaller than 1.

Okay, so here are some properties. A matrix is called singular if any of its eigenvalues is 0. If any of its eigenvalues is 0, we have a matrix where there are dependencies between the columns, and so it’s not invertible; it’s singular. The rank of a matrix equals the number of non-zero eigenvalues. That’s a convenient property.
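The three sub-transformations described above, rotate with Q^T, scale with Lambda, rotate back with Q, can be checked to reproduce the direct application of A. A minimal sketch with made-up values:

```python
import numpy as np

# A made-up real symmetric matrix, so A = Q Lambda Q^T with orthonormal Q.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lams, Q = np.linalg.eigh(A)       # eigh is for symmetric matrices
lams, Q = lams[::-1], Q[:, ::-1]  # descending order
Lambda = np.diag(lams)

# For symmetric A, the inverse of Q is just its transpose.
print(np.allclose(A, Q @ Lambda @ Q.T))  # True

# Points on the unit circle, stored as columns.
theta = np.linspace(0.0, 2.0 * np.pi, 100)
X = np.vstack([np.cos(theta), np.sin(theta)])

# Rotate (Q^T), scale along the axes (Lambda), rotate back (Q) ...
Y = Q @ (Lambda @ (Q.T @ X))

# ... which is exactly the same as applying A directly.
print(np.allclose(Y, A @ X))  # True
```

The circle points in X get mapped to an ellipse whose axes point along the eigenvectors, with half-axis lengths given by the eigenvalues.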
If we have the eigendecomposition, we know how many of these eigenvalues are 0 or close to 0. So we know whether it’s possible to invert that matrix or not, and we know how many solutions the corresponding linear system has. Only matrices with full rank can be inverted; that’s what we said before: only if all of the columns are independent, that is, only for a non-singular matrix, can we invert it. A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or 0 is called positive semidefinite. For positive semidefinite matrices we have that for all x, the product x^T A x must be greater than or equal to 0. Positive definite matrices additionally guarantee that if this product is 0, then the vector x must also be 0. Now, it’s of course possible to derive and calculate the eigendecomposition with pen and paper, but luckily in practice we don’t need to do that, because there are great numerical linear algebra packages that we can use. For example, in Python we have NumPy, where we can use numpy.linalg.eig to decompose a matrix A into its eigenvectors and eigenvalues with a single call. If you want to know a little bit more about this decomposition and how to do it by hand, I can recommend this link here.

Now, finally, there’s also something called the singular value decomposition. Why do we have yet another decomposition? The problem with the eigendecomposition is that it can only be applied to square matrices. For non-square matrices, however, we can use the singular value decomposition. The singular value decomposition looks similar to the eigendecomposition, but it can factorize a non-square matrix, for example a matrix out of the space of m times n matrices of real numbers. And you can see that on the left-hand side and on the right-hand side we have different matrices now, but we still have a diagonal matrix D in the center.
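Before moving on, the singularity and rank properties above can be illustrated with numpy.linalg.eig directly. A minimal sketch, with made-up matrices:

```python
import numpy as np

# A made-up singular matrix: the second column is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

lams, V = np.linalg.eig(A)

# One eigenvalue is (numerically) zero, so A is singular:
# its rank equals the number of non-zero eigenvalues.
rank = int(np.sum(np.abs(lams) > 1e-10))
print(rank)                      # 1
print(np.linalg.matrix_rank(A))  # 1

# For a symmetric matrix, all eigenvalues > 0 means positive definite,
# which implies x^T B x > 0 for every non-zero x.
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(bool(np.all(np.linalg.eigvalsh(B) > 0)))  # True
```

Because A has a zero eigenvalue, np.linalg.inv(A) would raise an error here, which matches the statement that only full-rank matrices can be inverted.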
So in this case we have U, an element of the space of m times m matrices, D, a diagonal matrix that is an element of the space of m times n matrices, and V, an element of the space of n times n matrices, so that A = U D V^T. U and V are orthogonal matrices and D is diagonal. It could be a square diagonal or a non-square diagonal matrix. Of course, the interesting case here is the non-square diagonal matrix, because then we can apply this to non-square matrices. The elements along the diagonal of D are known as singular values, and the columns of U and V are the so-called left and right singular vectors, respectively. And indeed, the eigendecomposition and the singular value decomposition are quite related. If you consider the singular value decomposition of an arbitrary real matrix A and you take A^T A, then using the rules for the transpose that we’ve seen earlier, we can write A^T A = (U D V^T)^T (U D V^T) = V D^T U^T U D V^T, and because U is an orthonormal matrix, the term U^T U disappears. So we have V D^2 V^T. As we can see, the right singular vectors V according to the singular value decomposition of A are the eigenvectors of A^T A. And similarly, the left singular vectors U are the eigenvectors of A A^T. And the eigenvalues of A^T A and A A^T are equal to the squared singular values of A; we have the D^2 here. So it’s basically just a squaring operation sitting between the eigendecomposition and the singular value decomposition.

Now let’s finally look at an application, something we’ll look into in more detail in lecture 11 of the deep learning course. It’s called principal component analysis, PCA. PCA is a technique for analyzing large, high-dimensional datasets, as we’ll see. We can project any high-dimensional dataset into a lower-dimensional space where we can look at it, and we do this by projecting it such that most of the information is retained.
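Both the reconstruction A = U D V^T and the relation between the eigenvalues of A^T A and the squared singular values can be checked with numpy.linalg.svd. A minimal sketch, using a made-up 3 times 2 matrix:

```python
import numpy as np

# A made-up non-square (3 x 2) real matrix.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# full_matrices=True returns U (3x3), the singular values, and V^T (2x2).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the non-square diagonal matrix D (3x2) from the singular values.
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)

# Reconstruction: A = U D V^T.
print(np.allclose(A, U @ D @ Vt))  # True

# The eigenvalues of A^T A equal the squared singular values of A.
lams = np.linalg.eigvalsh(A.T @ A)[::-1]  # flip to descending order
print(np.allclose(lams, s ** 2))  # True
```

Note that NumPy returns the singular values as a vector in descending order, so the non-square diagonal matrix D has to be assembled by hand as shown.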
So we want to project it such that we keep the directions of largest variance. And the nice property here is that if we compute the eigenvectors of the scatter matrix or the covariance matrix, these eigenvectors are called the principal components, and they point in the directions of largest variance. This is something you can show, and we’ll also look at it later. So here in this example we have the blue dataset, and we compute the eigenvectors, and we can see that the largest variance is in this direction here, because the data is spread like that. And the direction of next-largest variance, orthogonal to this first vector, is this second vector here, which is the second eigenvector. And so if we want to project this two-dimensional dataset onto a one-dimensional dataset, the best we can do is to project it along this line, because this is where the variance is spread maximally. For this illustration, the two eigenvectors v1 and v2 are scaled by the square root of the corresponding eigenvalue, so they indicate the spread, or the standard deviation, in each of these directions. Thank you.
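The PCA procedure just described can be sketched in a few lines: compute the covariance matrix, take its eigenvectors, and project onto the first one. The dataset here is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2D dataset: isotropic noise stretched along one axis, then rotated,
# so the direction of largest variance is no longer axis-aligned.
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s], [s, c]])
X = X @ R.T

# Eigenvectors of the covariance matrix are the principal components.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (len(Xc) - 1)
lams, V = np.linalg.eigh(C)
lams, V = lams[::-1], V[:, ::-1]  # descending: v1 = direction of largest variance

# Project onto the first principal component: the 2D data becomes 1D,
# and the variance of the projection equals the largest eigenvalue.
z = Xc @ V[:, 0]
print(np.isclose(z.var(ddof=1), lams[0]))  # True
```

Scaling each eigenvector by the square root of its eigenvalue, as in the figure, gives arrows whose lengths show one standard deviation of spread along each principal direction.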