Everything should be made as simple as possible, but not simpler. (Albert Einstein)

Sunday, October 25, 2015

Singular Value Decomposition with Numpy & Scipy

Following the previous post, "Singular Value Decomposition and Dimensionality, Using R...", here is another approach, this time using Numpy and Scipy.

A typical example is Latent Semantic Analysis (LSA, or Latent Semantic Indexing, LSI) with a term-document matrix.

The first input is a list of documents, the second a list of terms. We build a matrix in which each cell answers the question "is term t in document d?": it is 1 if term t is found in document d, and 0 otherwise. In this setup, the documents are the features (columns) and the terms are the observations (rows).
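As a minimal sketch (not from the course material), such a binary incidence matrix can be built from tokenized documents like this, where docs and vocab are made-up example data,

import numpy as np

# made-up example data: the set of terms found in each document, plus the term list
docs = [{"t6", "t9"}, {"t1", "t2"}, {"t2", "t5", "t8"}]
vocab = ["t1", "t2", "t5", "t6", "t8", "t9"]

# rows are terms, columns are documents; 1 if the term occurs in the document
M = np.array([[1.0 if t in d else 0.0 for d in docs] for t in vocab])
print(M)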

This representation is commonly used in NLP (Natural Language Processing) to compute text similarity; a small similarity sketch appears near the end of this post.

I cite an example from the "Introduction to NLP" course (University of Michigan), module 3-4 "Dimensionality Reduction". The course's module uses Matlab for the calculation, though the course's main programming tool is Python with NLTK.

We define the SVD,
A = U Σ V^T
U is the matrix of orthonormal eigenvectors of A A^T
V is the matrix of orthonormal eigenvectors of A^T A
The diagonal components of Σ (the singular values) are the square roots of the eigenvalues of A^T A
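These definitions are easy to check numerically. A quick sketch (not from the course material): compare the singular values of a small random matrix with the square roots of the eigenvalues of A^T A,

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))

# singular values, returned in descending order
_, s, _ = np.linalg.svd(A)

# eigenvalues of A^T A, returned in ascending order by eigvalsh
eigvals = np.linalg.eigvalsh(A.T @ A)

print(s)
print(np.sqrt(eigvals[::-1]))   # same values as s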
 

As an example, there are 7 documents and 9 terms,

D1: T6, T9
D2: T1, T2
D3: T2, T5, T8
D4: T1, T4, T6, T8, T9
D5: T1, T7
D6: T3, T7
D7: T1, T3


The Term-Document matrix, in Python with Numpy, 

import numpy as np
# term-document matrix: columns represent documents, rows represent terms
A0 = np.array([
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0]], dtype=float)

# l2-normalize each column (document)
from sklearn.preprocessing import normalize

A = normalize(A0, axis=0, norm='l2')

np.set_printoptions(precision=4, suppress=True)
print(A)

[[ 0.     0.5774 0.     0.4472 0.7071 0.     0.7071] 
 [ 0.     0.5774 0.5774 0.     0.     0.     0.    ] 
 [ 0.     0.     0.     0.     0.     0.7071 0.7071] 
 [ 0.     0.     0.     0.4472 0.     0.     0.    ] 
 [ 0.     0.5774 0.5774 0.     0.     0.     0.    ] 
 [ 0.7071 0.     0.     0.4472 0.     0.     0.    ] 
 [ 0.     0.     0.     0.     0.7071 0.7071 0.    ] 
 [ 0.     0.     0.5774 0.4472 0.     0.     0.    ] 
 [ 0.7071 0.     0.     0.4472 0.     0.     0.    ]]
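As an aside, the same column-wise l2 normalization can be done in plain Numpy, in case scikit-learn is not available (a small sketch, continuing from the code above),

norms = np.linalg.norm(A0, axis=0)
norms[norms == 0] = 1.0           # guard against all-zero columns
A_manual = A0 / norms
print(np.allclose(A, A_manual))   # True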



 ------------------------------------------------------
(1) SVD with numpy.linalg.svd. In this example we reduce the matrix to a rank-4 approximation (dimensionality reduction),



    from numpy.linalg import svd

    DIM = 4
    # note: numpy's svd returns the third factor already transposed,
    # so the variable V here actually holds V^T
    U, s, V = svd(A)

    # build the full 9x7 diagonal matrix of singular values
    S = np.zeros((U.shape[0], V.shape[0]), dtype=float)
    S[:V.shape[0], :V.shape[0]] = np.diag(s)

    # dimensions
    print("Shape:")
    print(U.shape, V.shape, S.shape)

    print("U")
    print(U)
    print("S")
    print(S)
    print("V")
    print(V.T)   # transposed back, so columns are the right singular vectors

    # full reconstruction
    A1 = np.dot(U, np.dot(S, V))
    print("reconstruction")
    print(A1)

    # is the reconstruction close to the original?
    print(np.allclose(A, A1))

    # now reduce to DIM, a rank-4 approximation
    print("REDUCE to", DIM)
    U1 = U[:, :DIM]
    S1 = S[:DIM, :DIM]
    V1 = V.T[:, :DIM].T

    print("U1")
    print(U1)
    print("S1")
    print(S1)
    print("V1")
    print(V1.T)

    # reconstruction from the reduced matrices
    A2 = np.dot(U1, np.dot(S1, V1))
    print("reduced reconstruction:")
    print(A2)


As expected, the output is,

Shape: 
(9, 9) (7, 7) (9, 7)

U
[[-0.6977 -0.0931  0.0175 -0.6951 -0.      0.0157  0.1441 -0.      0.    ]
 [-0.2619  0.2966  0.4681  0.1969 -0.     -0.2468 -0.157  -0.6356  0.3099]
 [-0.3527 -0.4491 -0.1017  0.4013  0.7071 -0.0066 -0.0493 -0.      0.    ]
 [-0.1121  0.141  -0.1478 -0.0733 -0.      0.4842 -0.8402 -0.     -0.    ]
 [-0.2619  0.2966  0.4681  0.1969  0.     -0.2468 -0.157   0.6356 -0.3099]
 [-0.1874  0.3747 -0.5049  0.127  -0.     -0.2287  0.0338 -0.3099 -0.6356]
 [-0.3527 -0.4491 -0.1017  0.4013 -0.7071 -0.0066 -0.0493 -0.      0.    ]
 [-0.2104  0.3337  0.0954  0.282   0.      0.734   0.4657  0.      0.    ]
 [-0.1874  0.3747 -0.5049  0.127   0.     -0.2287  0.0338  0.3099  0.6356]]

S
[[ 1.5777  0.      0.      0.      0.      0.      0.    ]
 [ 0.      1.2664  0.      0.      0.      0.      0.    ]
 [ 0.      0.      1.189   0.      0.      0.      0.    ]
 [ 0.      0.      0.      0.7962  0.      0.      0.    ]
 [ 0.      0.      0.      0.      0.7071  0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.5664  0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.1968]
 [ 0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.    ]]
 

V
[[-0.168   0.4184 -0.6005  0.2256  0.     -0.571   0.2432]
 [-0.4471  0.228   0.4631 -0.2185 -0.     -0.4872 -0.4986]
 [-0.2687  0.4226  0.5009  0.49    0.      0.2451  0.4451]
 [-0.3954  0.3994 -0.3929 -0.1305  0.      0.6132 -0.3697]
 [-0.4708 -0.3028 -0.0501 -0.2609 -0.7071  0.0113  0.3405]
 [-0.3162 -0.5015 -0.121   0.7128  0.     -0.0166 -0.3542]
 [-0.4708 -0.3028 -0.0501 -0.2609  0.7071  0.0113  0.3405]]
 

reconstruction
[[ 0.      0.5774  0.      0.4472  0.7071 -0.      0.7071]
 [-0.      0.5774  0.5774  0.      0.     -0.      0.    ]
 [ 0.      0.      0.     -0.      0.      0.7071  0.7071]
 [ 0.      0.      0.      0.4472  0.      0.     -0.    ]
 [ 0.      0.5774  0.5774  0.      0.      0.      0.    ]
 [ 0.7071  0.      0.      0.4472  0.      0.      0.    ]
 [ 0.      0.      0.     -0.      0.7071  0.7071  0.    ]
 [ 0.      0.      0.5774  0.4472  0.     -0.      0.    ]
 [ 0.7071 -0.      0.      0.4472  0.      0.      0.    ]]
True
 

REDUCE to 4
U1
[[-0.6977 -0.0931  0.0175 -0.6951]
 [-0.2619  0.2966  0.4681  0.1969]
 [-0.3527 -0.4491 -0.1017  0.4013]
 [-0.1121  0.141  -0.1478 -0.0733]
 [-0.2619  0.2966  0.4681  0.1969]
 [-0.1874  0.3747 -0.5049  0.127 ]
 [-0.3527 -0.4491 -0.1017  0.4013]
 [-0.2104  0.3337  0.0954  0.282 ]
 [-0.1874  0.3747 -0.5049  0.127 ]]
 

S1
[[ 1.5777  0.      0.      0.    ]
 [ 0.      1.2664  0.      0.    ]
 [ 0.      0.      1.189   0.    ]
 [ 0.      0.      0.      0.7962]]
 

V1
[[-0.168   0.4184 -0.6005  0.2256]
 [-0.4471  0.228   0.4631 -0.2185]
 [-0.2687  0.4226  0.5009  0.49  ]
 [-0.3954  0.3994 -0.3929 -0.1305]
 [-0.4708 -0.3028 -0.0501 -0.2609]
 [-0.3162 -0.5015 -0.121   0.7128]
 [-0.4708 -0.3028 -0.0501 -0.2609]]
 

reduced reconstruction:
[[-0.0018  0.5958 -0.0148  0.4523  0.6974  0.0102  0.6974]
 [-0.0723  0.4938  0.6254  0.0743  0.0121 -0.0133  0.0121]
 [ 0.0002 -0.0067  0.0052 -0.0013  0.3569  0.7036  0.3569]
 [ 0.1968  0.0512  0.0064  0.2179  0.0532 -0.054   0.0532]
 [-0.0723  0.4938  0.6254  0.0743  0.0121 -0.0133  0.0121]
 [ 0.6315 -0.0598  0.0288  0.5291 -0.0008  0.0002 -0.0008]
 [ 0.0002 -0.0067  0.0052 -0.0013  0.3569  0.7036  0.3569]
 [ 0.2151  0.2483  0.4347  0.2262 -0.0359  0.0394 -0.0359]
 [ 0.6315 -0.0598  0.0288  0.5291 -0.0008  0.0002 -0.0008]]


Compare this reduced (rank-4 approximation) reconstruction with the original matrix A: the overall pattern is preserved, with small non-zero values appearing where the original had exact zeros.
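How good is the approximation? By the Eckart-Young theorem, the Frobenius norm of the error A - A2 equals the root of the sum of squares of the discarded singular values. A quick check (not from the course material, continuing from the code above),

err = np.linalg.norm(A - A2)            # Frobenius norm by default
print(err)
print(np.sqrt(np.sum(s[DIM:] ** 2)))    # the same value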

(2) SVD with scipy.sparse.linalg.svds, the same computation as the previous example: we reduce the matrix to a rank-4 approximation (dimensionality reduction). Note that, as the output below shows, svds returns the singular values in ascending order, so the columns of u and the rows of vt appear reversed compared with the Numpy result, and some signs flip (singular vectors are only determined up to sign),


    from scipy.sparse.linalg import svds

    # truncated SVD: compute only the k=4 largest singular values/vectors
    u, s, vt = svds(A, k=4)
    print("u")
    print(u)
    print("s4")
    print(np.diag(s))
    print("vt")
    print(vt)

    print("u * s4 * vt")
    print(np.dot(u, np.dot(np.diag(s), vt)))

u
[[-0.6951  0.0175 -0.0931  0.6977]
 [ 0.1969  0.4681  0.2966  0.2619]
 [ 0.4013 -0.1017 -0.4491  0.3527]
 [-0.0733 -0.1478  0.141   0.1121]
 [ 0.1969  0.4681  0.2966  0.2619]
 [ 0.127  -0.5049  0.3747  0.1874]
 [ 0.4013 -0.1017 -0.4491  0.3527]
 [ 0.282   0.0954  0.3337  0.2104]
 [ 0.127  -0.5049  0.3747  0.1874]]
 

s4
[[ 0.7962  0.      0.      0.    ]
 [ 0.      1.189   0.      0.    ]
 [ 0.      0.      1.2664  0.    ]
 [ 0.      0.      0.      1.5777]]
 

vt
[[ 0.2256 -0.2185  0.49   -0.1305 -0.2609  0.7128 -0.2609]
 [-0.6005  0.4631  0.5009 -0.3929 -0.0501 -0.121  -0.0501]
 [ 0.4184  0.228   0.4226  0.3994 -0.3028 -0.5015 -0.3028]
 [ 0.168   0.4471  0.2687  0.3954  0.4708  0.3162  0.4708]]
 

u * s4 * vt
[[-0.0018  0.5958 -0.0148  0.4523  0.6974  0.0102  0.6974]
 [-0.0723  0.4938  0.6254  0.0743  0.0121 -0.0133  0.0121]
 [ 0.0002 -0.0067  0.0052 -0.0013  0.3569  0.7036  0.3569]
 [ 0.1968  0.0512  0.0064  0.2179  0.0532 -0.054   0.0532]
 [-0.0723  0.4938  0.6254  0.0743  0.0121 -0.0133  0.0121]
 [ 0.6315 -0.0598  0.0288  0.5291 -0.0008  0.0002 -0.0008]
 [ 0.0002 -0.0067  0.0052 -0.0013  0.3569  0.7036  0.3569]
 [ 0.2151  0.2483  0.4347  0.2262 -0.0359  0.0394 -0.0359]
 [ 0.6315 -0.0598  0.0288  0.5291 -0.0008  0.0002 -0.0008]]


Apart from the reversed ordering and the sign flips, Scipy's reconstruction matrix of the rank-4 approximation is the same as the previous Numpy result.
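To close the loop with the LSA motivation: document similarity is usually computed in the reduced space, where each document is represented by its row of V1.T scaled by the singular values. A minimal sketch (not from the course material, continuing from the Numpy example above),

# each document as a 4-dimensional vector in the reduced (latent) space
docs_reduced = np.dot(V1.T, S1)     # shape (7, 4), one row per document

# cosine similarity between every pair of documents
dnorms = np.linalg.norm(docs_reduced, axis=1, keepdims=True)
unit = docs_reduced / dnorms
sims = np.dot(unit, unit.T)
print(sims)                         # sims[i, j] is the similarity of Di and Dj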

In conclusion, as in the previous SVD post: if we have a large document-term matrix, we can reduce its dimensionality with SVD and still reproduce a good approximation of the original matrix from much smaller decomposed matrices. For an m x n matrix and rank k, the decomposition stores k(m + n) + k numbers instead of m*n, a large saving when m and n are big.
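A rough storage comparison, with made-up large dimensions for illustration,

m, n, k = 10000, 50000, 100            # hypothetical large term-document matrix, rank 100
full = m * n
reduced = k * (m + n) + k
print(full, reduced, reduced / full)   # the reduced form is about 1.2% of the full size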

--------------------------------------
Note: In R, normalization and then SVD can be done this way, 

A0 = c(
  0, 1, 0, 1, 1, 0, 1,
  0, 1, 1, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 1, 1,
  0, 0, 0, 1, 0, 0, 0,
  0, 1, 1, 0, 0, 0, 0,
  1, 0, 0, 1, 0, 0, 0,
  0, 0, 0, 0, 1, 1, 0,
  0, 0, 1, 1, 0, 0, 0,
  1, 0, 0, 1, 0, 0, 0) 
 
A = matrix(A0, nrow=9, ncol=7, byrow=T) 
 
# l2 norm by matrix's column 
A = apply(A, 2, FUN=function(x) {return(x/sqrt(sum(x^2)))})
 
# SVD 
s = svd(A)

# left singular vectors,
s$u 
# right singular vectors 
s$v 
# singular values, to diagonal matrix,
diag(s$d) 
 
# helper function
reduce = function(s, d) {
  u = as.matrix(s$u[, 1:d])
  v = as.matrix(s$v[, 1:d])
  # Create the diagonal matrix dm,  
  dm = as.matrix(diag(s$d)[1:d, 1:d])
  return(list(u=u, v=v, d=dm)) 
}

# reduced to rank-4 
reduced_s = reduce(s, 4)
reduced_s$u 
reduced_s$v
reduced_s$d
 
# reconstruction, 
recon_A = reduced_s$u %*% reduced_s$d %*% t(reduced_s$v)
recon_A