An example is Latent Semantic Analysis (LSA, also known as Latent Semantic Indexing, LSI) with a term-document matrix.
The first data set is a list of documents, the second a list of terms. We build a matrix in which each cell answers the question "is term t in document d?": the cell is 1 if term t is found in document d, and 0 otherwise. In this setup, the documents are the features (columns) and the terms are the observations (rows).
This is commonly used in NLP (Natural Language Processing) to compute text similarity.
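As a quick sketch of how such a binary matrix can be built (the toy corpus and the helper code here are my own illustration, not from the course material below):

import numpy as np

# hypothetical toy corpus, each document already tokenized
docs = [["cat", "sat", "mat"],
        ["dog", "sat", "log"],
        ["cat", "dog"]]

# vocabulary, sorted for a stable row order
terms = sorted({t for d in docs for t in d})

# binary incidence matrix: rows = terms, columns = documents
M = np.array([[1.0 if t in d else 0.0 for d in docs] for t in terms])
print(terms)
print(M)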
I cite an example from the "Introduction to NLP" course (University of Michigan), module 3-4 "Dimensionality Reduction". That module uses Matlab for the calculation, though the course's main programming tool is Python with NLTK.
We define the SVD,
A = U Σ Vᵀ
– U is the matrix of orthonormal eigenvectors of AAᵀ (the left singular vectors)
– V is the matrix of orthonormal eigenvectors of AᵀA (the right singular vectors)
– the diagonal entries of Σ (the singular values) are the square roots of the eigenvalues of AᵀA
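These relations are easy to verify numerically; a quick sanity check on a small random matrix (my own addition, not part of the course material):

import numpy as np

M = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(M)

# eigenvalues of MᵀM, sorted in descending order
eig = np.sort(np.linalg.eigvalsh(np.dot(M.T, M)))[::-1]

# the squared singular values equal the eigenvalues of MᵀM
print(np.allclose(s**2, eig))  # True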
As an example, there are 7 documents with 9 terms,
D1: T6, T9
D2: T1, T2, T5
D3: T2, T5, T8
D4: T1, T4, T6, T8, T9
D5: T1, T7
D6: T3, T7
D7: T1, T3
The term-document matrix, in Python with NumPy,
import numpy as np
from sklearn.preprocessing import normalize

# term-document matrix: columns represent documents, rows represent terms
A0 = np.array([
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0]], dtype=float)

# normalize each document column to unit L2 norm
A = normalize(A0, axis=0, norm='l2')

np.set_printoptions(precision=4, suppress=True)
print(A)
[[ 0. 0.5774 0. 0.4472 0.7071 0. 0.7071]
[ 0. 0.5774 0.5774 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0.7071 0.7071]
[ 0. 0. 0. 0.4472 0. 0. 0. ]
[ 0. 0.5774 0.5774 0. 0. 0. 0. ]
[ 0.7071 0. 0. 0.4472 0. 0. 0. ]
[ 0. 0. 0. 0. 0.7071 0.7071 0. ]
[ 0. 0. 0.5774 0.4472 0. 0. 0. ]
[ 0.7071 0. 0. 0.4472 0. 0. 0. ]]
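Each document column now has unit L2 norm, so dot products between columns are cosine similarities. A quick check, continuing the session above:

# every column norm should be 1
print(np.linalg.norm(A, axis=0))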
------------------------------------------------------
(1) SVD with numpy.linalg.svd. In this example we reduce the matrix to a rank-4 approximation (dimensionality reduction),
from numpy.linalg import svd

DIM = 4

# note: numpy's svd returns the third factor already transposed (Vᵀ)
U, s, Vt = svd(A)

# build the full 9x7 diagonal matrix S from the singular values
S = np.zeros((U.shape[0], Vt.shape[0]), dtype=float)
S[:Vt.shape[0], :Vt.shape[0]] = np.diag(s)

# dimensions
print("Shape:")
print(U.shape, Vt.shape, S.shape)
print("U")
print(U)
print("S")
print(S)
print("V")
print(Vt.T)

# full reconstruction
A1 = np.dot(U, np.dot(S, Vt))
print("reconstruction")
print(A1)
# is it close to the original?
print(np.allclose(A, A1))

# now, reduce to DIM: the rank-4 approximation
print("REDUCE to", DIM)
U1 = U[:, :DIM]       # first DIM columns of U
S1 = S[:DIM, :DIM]    # top-left DIM x DIM block of S
Vt1 = Vt[:DIM, :]     # first DIM rows of Vᵀ
print("U1")
print(U1)
print("S1")
print(S1)
print("V1")
print(Vt1.T)

# reconstruction from the reduced matrices
A2 = np.dot(U1, np.dot(S1, Vt1))
print("reduced reconstruction:")
print(A2)
As expected, the output is,
Shape:
(9, 9) (7, 7) (9, 7)
U
[[-0.6977 -0.0931  0.0175 -0.6951 -0.      0.0157  0.1441 -0.      0.    ]
 [-0.2619  0.2966  0.4681  0.1969 -0.     -0.2468 -0.157  -0.6356  0.3099]
 [-0.3527 -0.4491 -0.1017  0.4013  0.7071 -0.0066 -0.0493 -0.      0.    ]
 [-0.1121  0.141  -0.1478 -0.0733 -0.      0.4842 -0.8402 -0.     -0.    ]
 [-0.2619  0.2966  0.4681  0.1969  0.     -0.2468 -0.157   0.6356 -0.3099]
 [-0.1874  0.3747 -0.5049  0.127  -0.     -0.2287  0.0338 -0.3099 -0.6356]
 [-0.3527 -0.4491 -0.1017  0.4013 -0.7071 -0.0066 -0.0493 -0.      0.    ]
 [-0.2104  0.3337  0.0954  0.282   0.      0.734   0.4657  0.      0.    ]
 [-0.1874  0.3747 -0.5049  0.127   0.     -0.2287  0.0338  0.3099  0.6356]]
S
[[ 1.5777 0. 0. 0. 0. 0. 0. ]
[ 0. 1.2664 0. 0. 0. 0. 0. ]
[ 0. 0. 1.189 0. 0. 0. 0. ]
[ 0. 0. 0. 0.7962 0. 0. 0. ]
[ 0. 0. 0. 0. 0.7071 0. 0. ]
[ 0. 0. 0. 0. 0. 0.5664 0. ]
[ 0. 0. 0. 0. 0. 0. 0.1968]
[ 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. ]]
V
[[-0.168 0.4184 -0.6005 0.2256 0. -0.571 0.2432]
[-0.4471 0.228 0.4631 -0.2185 -0. -0.4872 -0.4986]
[-0.2687 0.4226 0.5009 0.49 0. 0.2451 0.4451]
[-0.3954 0.3994 -0.3929 -0.1305 0. 0.6132 -0.3697]
[-0.4708 -0.3028 -0.0501 -0.2609 -0.7071 0.0113 0.3405]
[-0.3162 -0.5015 -0.121 0.7128 0. -0.0166 -0.3542]
[-0.4708 -0.3028 -0.0501 -0.2609 0.7071 0.0113 0.3405]]
reconstruction
[[ 0. 0.5774 0. 0.4472 0.7071 -0. 0.7071]
[-0. 0.5774 0.5774 0. 0. -0. 0. ]
[ 0. 0. 0. -0. 0. 0.7071 0.7071]
[ 0. 0. 0. 0.4472 0. 0. -0. ]
[ 0. 0.5774 0.5774 0. 0. 0. 0. ]
[ 0.7071 0. 0. 0.4472 0. 0. 0. ]
[ 0. 0. 0. -0. 0.7071 0.7071 0. ]
[ 0. 0. 0.5774 0.4472 0. -0. 0. ]
[ 0.7071 -0. 0. 0.4472 0. 0. 0. ]]
True
REDUCE to 4
U1
[[-0.6977 -0.0931 0.0175 -0.6951]
[-0.2619 0.2966 0.4681 0.1969]
[-0.3527 -0.4491 -0.1017 0.4013]
[-0.1121 0.141 -0.1478 -0.0733]
[-0.2619 0.2966 0.4681 0.1969]
[-0.1874 0.3747 -0.5049 0.127 ]
[-0.3527 -0.4491 -0.1017 0.4013]
[-0.2104 0.3337 0.0954 0.282 ]
[-0.1874 0.3747 -0.5049 0.127 ]]
S1
[[ 1.5777 0. 0. 0. ]
[ 0. 1.2664 0. 0. ]
[ 0. 0. 1.189 0. ]
[ 0. 0. 0. 0.7962]]
V1
[[-0.168 0.4184 -0.6005 0.2256]
[-0.4471 0.228 0.4631 -0.2185]
[-0.2687 0.4226 0.5009 0.49 ]
[-0.3954 0.3994 -0.3929 -0.1305]
[-0.4708 -0.3028 -0.0501 -0.2609]
[-0.3162 -0.5015 -0.121 0.7128]
[-0.4708 -0.3028 -0.0501 -0.2609]]
reduced reconstruction:
[[-0.0018 0.5958 -0.0148 0.4523 0.6974 0.0102 0.6974]
[-0.0723 0.4938 0.6254 0.0743 0.0121 -0.0133 0.0121]
[ 0.0002 -0.0067 0.0052 -0.0013 0.3569 0.7036 0.3569]
[ 0.1968 0.0512 0.0064 0.2179 0.0532 -0.054 0.0532]
[-0.0723 0.4938 0.6254 0.0743 0.0121 -0.0133 0.0121]
[ 0.6315 -0.0598 0.0288 0.5291 -0.0008 0.0002 -0.0008]
[ 0.0002 -0.0067 0.0052 -0.0013 0.3569 0.7036 0.3569]
[ 0.2151 0.2483 0.4347 0.2262 -0.0359 0.0394 -0.0359]
[ 0.6315 -0.0598 0.0288 0.5291 -0.0008 0.0002 -0.0008]]
Compare the reduced (rank-4 approximation) reconstruction matrix A2 with the original matrix A: the values are close, but no longer exact.
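As a numeric check (my own addition, continuing the session above), the size of that error matches what the Eckart-Young theorem predicts: the Frobenius norm of A - A2 equals the norm of the discarded singular values,

# Frobenius-norm error of the rank-4 reconstruction
print(np.linalg.norm(A - A2))
# by Eckart-Young, this equals the norm of the dropped singular values
print(np.sqrt(np.sum(s[DIM:]**2)))

------------------------------------------------------
(2) SVD with scipy.sparse.linalg.svds, which computes only the k largest singular values directly,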
from scipy.sparse.linalg import svds

# k=4 largest singular values; note svds returns them in ascending order
u, s4, vt = svds(A, k=4)

print("u")
print(u)
print("s4")
print(np.diag(s4))
print("vt")
print(vt)
print("u * s4 * vt")
print(np.dot(u, np.dot(np.diag(s4), vt)))
u
[[-0.6951 0.0175 -0.0931 0.6977]
[ 0.1969 0.4681 0.2966 0.2619]
[ 0.4013 -0.1017 -0.4491 0.3527]
[-0.0733 -0.1478 0.141 0.1121]
[ 0.1969 0.4681 0.2966 0.2619]
[ 0.127 -0.5049 0.3747 0.1874]
[ 0.4013 -0.1017 -0.4491 0.3527]
[ 0.282 0.0954 0.3337 0.2104]
[ 0.127 -0.5049 0.3747 0.1874]]
s4
[[ 0.7962 0. 0. 0. ]
[ 0. 1.189 0. 0. ]
[ 0. 0. 1.2664 0. ]
[ 0. 0. 0. 1.5777]]
vt
[[ 0.2256 -0.2185 0.49 -0.1305 -0.2609 0.7128 -0.2609]
[-0.6005 0.4631 0.5009 -0.3929 -0.0501 -0.121 -0.0501]
[ 0.4184 0.228 0.4226 0.3994 -0.3028 -0.5015 -0.3028]
[ 0.168 0.4471 0.2687 0.3954 0.4708 0.3162 0.4708]]
u * s4 * vt
[[-0.0018 0.5958 -0.0148 0.4523 0.6974 0.0102 0.6974]
[-0.0723 0.4938 0.6254 0.0743 0.0121 -0.0133 0.0121]
[ 0.0002 -0.0067 0.0052 -0.0013 0.3569 0.7036 0.3569]
[ 0.1968 0.0512 0.0064 0.2179 0.0532 -0.054 0.0532]
[-0.0723 0.4938 0.6254 0.0743 0.0121 -0.0133 0.0121]
[ 0.6315 -0.0598 0.0288 0.5291 -0.0008 0.0002 -0.0008]
[ 0.0002 -0.0067 0.0052 -0.0013 0.3569 0.7036 0.3569]
[ 0.2151 0.2483 0.4347 0.2262 -0.0359 0.0394 -0.0359]
[ 0.6315 -0.0598 0.0288 0.5291 -0.0008 0.0002 -0.0008]]
SciPy's rank-4 reconstruction is the same as NumPy's earlier result. Only the ordering differs: svds returns the singular values in ascending order, so the columns of u and the rows of vt appear reversed (and some signs flip, which does not affect the product).
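As a closing sketch of the text-similarity use mentioned at the beginning (my own illustration, not from the course): in LSA each document is represented by its row of Vt1.T scaled by the singular values, and document similarity is the cosine between these reduced vectors. Continuing the NumPy session above:

# document vectors in the reduced 4-dimensional "concept" space,
# one row per document
doc_vecs = np.dot(Vt1.T, S1)

# cosine similarity between every pair of documents
unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
sim = np.dot(unit, unit.T)
print(sim)  # e.g. sim[1, 2] compares D2 and D3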
In conclusion, as in the previous SVD post: given a large term-document matrix, we can reduce its dimensionality with SVD and still reproduce a good approximation of the original matrix from much smaller decomposed matrices.
--------------------------------------
Note: In R, normalization and then SVD can be done this way,
A0 = c(
0, 1, 0, 1, 1, 0, 1,
0, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0)
A = matrix(A0, nrow=9, ncol=7, byrow=T)
# l2 norm by matrix's column
A = apply(A, 2, FUN=function(x) {return(x / sqrt(sum(x^2)))})
# SVD
s = svd(A)
# left singular vectors,
s$u
# right singular vectors
s$v
# singular values, to diagonal matrix,
diag(s$d)
# helper function
reduce = function(s, d) {
  u = as.matrix(s$u[, 1:d])
  v = as.matrix(s$v[, 1:d])
  # create the diagonal matrix dm from the first d singular values
  dm = as.matrix(diag(s$d)[1:d, 1:d])
  return(list(u=u, v=v, d=dm))
}
# reduced to rank-4
reduced_s = reduce(s, 4)
reduced_s$u
reduced_s$v
reduced_s$d
# reconstruction,
recon_A = reduced_s$u %*% reduced_s$d %*% t(reduced_s$v)
recon_A