Research Article: A constrained singular value decomposition method that integrates sparsity and orthogonality

Date Published: March 13, 2019

Publisher: Public Library of Science

Author(s): Vincent Guillemot, Derek Beaton, Arnaud Gloaguen, Tommy Löfstedt, Brian Levine, Nicolas Raymond, Arthur Tenenhaus, Hervé Abdi, Shyamal D Peddada.


We propose a new sparsification method for the singular value decomposition—called the constrained singular value decomposition (CSVD)—that can incorporate multiple constraints such as sparsification and orthogonality for the left and right singular vectors. The CSVD can combine different constraints because it implements each constraint as a projection onto a convex set, and because it integrates these constraints as projections onto the intersection of multiple convex sets. We show that, with appropriate sparsification constants, the algorithm is guaranteed to converge to a stable point. We also propose and analyze the convergence of an efficient algorithm for the specific case of the projection onto the balls defined by the norms L1 and L2. We illustrate the CSVD and compare it to the standard singular value decomposition and to a non-orthogonal related sparsification method with: 1) a simulated example, 2) a small set of face images (corresponding to a configuration with a number of variables much larger than the number of observations), and 3) a psychometric application with a large number of observations and a small number of variables. The companion R-package, csvd, that implements the algorithms described in this paper, along with reproducible examples, are available for download from

Partial Text

The singular value decomposition (SVD) [1–3]—the tool “par excellence” of multivariate statistics—constitutes the core of many multivariate methods such as, to name but a few, principal component analysis [4], canonical correlation analysis [5], multiple correspondence analysis [6], and partial least squares methods [7]. To analyze data tables whose rows typically correspond to observations and columns to variables, these statistical methods use the SVD to generate orthogonal optimal linear combinations of the variables—called components or factor scores—that extract the most important information in the original data. In most cases, only the components that explain the largest proportion of the data variance are kept for further investigation. The coefficients—called loadings—of the linear combination used to compute a component are also often used to understand or “interpret” the corresponding components and this interpretation is greatly facilitated (particularly when the number of variables is large) when, for a given component, only a few variables have large loadings and the other variables have negligible loadings. If this pattern does not naturally hold, several procedures can be used to select the variables that are important for a component. The early psychometric school, for example, would use rotations, such as VARIMAX, [8] of the components in a low dimensional subspace; whereas recent approaches favor computationally based methods such as bootstrap ratios [7], or select important variables using an explicit non-linear optimization method such as the LASSO [9]. Closely related to the current work, in the specific case of principal component analysis (for an extensive review of sparsification for PCA see [10]), Witten et al. (see Section 3.2 in [11]) propose to implement either an orthogonality constraint, or a sparsity constraint (but not both simultaneously, see also, for related ideas, [12, 13]). Along the same lines, Benidis et al. [14] proposed, recently, an algorithm, based on Procrustes approach, for sparse principal component analysis that includes an orthogonality constraint on the loadings.

Matrices are denoted by uppercase bold letters (e.g., X), vectors by lowercase bold (e.g., x), and their elements by lower case italic (e.g., xi,j). Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by the superscript “⊤”, the inverse of a square matrix A is denoted by A−1. The identity matrix is denoted I, vectors or matrices of ones are denoted by 1, matrices or vectors of zeros are denoted by 0 (when multiplied with or added to other matrices, matrices I, 1, and 0 are assumed to be conformable, in case of doubt their size is specified). When provided with a square matrix, the diag(⋅) operator returns a vector with the diagonal elements of the matrix, and when provided with a vector, the diag(⋅) operator returns a diagonal matrix with the elements of the vector as the diagonal elements of this matrix. When provided with a square matrix, the trace(⋅) operator gives the sum of the diagonal elements of this matrix. The L2 norm of vector x, denoted ‖x‖2 is defined as ‖x‖2=x⊤x, The L1 norm of vector x, denoted ‖x‖1 is defined as ‖x‖1 = ∑(|xn|). A vector x is normalized by dividing this vector by its L2 norm (and so a normalized vector has an L2 norm equal to 1). The Frobenius norm of matrix X, denoted ‖X‖F is defined as ‖X‖F2=trace(X⊤X). The Frobenius inner product of two rectangular matrices A and B of same dimensions, denoted 〈A, B〉F is defined as 〈A,B〉F=trace(AB⊤). The concatenation of an I by J matrix X and an I by 1 vector y is the I by J + 1 matrix denoted [X, y] obtained by the juxtaposition of y on the right side of matrix X. The orthogonal complement of the space linearly spanned by the columns of a rectangular matrix M is denoted M⊥. Two rectangular matrices A and B of same dimensions are said to be orthogonal if and only if 〈A, B〉F = 0.

The SVD of a data matrix X∈RI×J of rank L ≤ min(I, J) gives the solution of the following problem: Find a least-squares optimal, rank R (with R ≤ L) approximation of X, denoted X^[R]. Specifically, the SVD solves the following optimization problem [1, 2, 15]:
where MI,J(R) is the set of all real I × J matrices of rank R.

In this section, we empirically evaluate, illustrate the constrained singular value decomposition (CSVD), and compare its performance to the performance of the plain SVD and the closely related sparsification method of Witten et al. [11], the PMD method. To do so, we used: 1) some simulated datasets, 2) one simulated dataset mimicking a real dataset, and 3) one real dataset (the characteristics of these datasets are listed in Table 1).

The constrained singular value decomposition is a computationally efficient new method that sparsifies the SVD while preserving the orthogonality of the singular vectors. To do so, the CSVD expresses each constraint as a projection onto a convex set and integrates multiple constraints as the projection onto the intersection of the convex sets expressing the constraints (POCS). As shown in Appendix D, the CSVD algorithm is guaranteed to converge to a stable point because it is a member of the more general class of the block successive upper-bound minimization (BSUM) algorithms. The CSVD can easily be extended to incorporate additional constraints (e.g., group LASSO, metrics constraints of the rows or columns, spatial constraints) as long as these constraints can be expressed as projections onto convex sets.

In this section, we show that repeatedly using the deflation operation will generate a set of left (respectively right) orthogonal singular vectors ordered by their singular value.

As stated by Witten et al. ([11], page 519 ff.) the constraint parameters c1 and c2 lead to solutions only when they are in the range:
Fig 1 in [11] describes the geometric intuition behind this range in R2.

In this section we describe a fast and exact algorithm for the projection onto the intersection of the L1-ball of radius c (i.e., BL1(c)) and the L2-ball of radius 1 (i.e., BL2(1)). This projection is defined by the following equation:
where x∈RN, N is the number of variables of the dimension of interest (i.e., I or J), and c is the sparsity parameter (i.e., c1 or c2) with 1≤c≤N.

In this appendix we prove the convergence of the CSVD. To do so, we show that the CSVD is an instance of the block successive upper-bound minimization (BSUM) algorithm (introduced in [27]) and, as such, converges to a stationary point.




Leave a Reply

Your email address will not be published.