Research Article: Fast optimization of non-negative matrix tri-factorization

Date Published: June 11, 2019

Publisher: Public Library of Science

Author(s): Andrej Čopar, Blaž Zupan, Marinka Zitnik, Holger Fröhlich.


Non-negative matrix tri-factorization (NMTF) is a popular technique for learning low-dimensional feature representation of relational data. Currently, NMTF learns a representation of a dataset through an optimization procedure that typically uses multiplicative update rules. This procedure has had limited success, and its failure cases have not been well understood. We here perform an empirical study involving six large datasets comparing multiplicative update rules with three alternative optimization methods, including alternating least squares, projected gradients, and coordinate descent. We find that methods based on projected gradients and coordinate descent converge up to twenty-four times faster than multiplicative update rules. Furthermore, alternating least squares method can quickly train NMTF models on sparse datasets but often fails on dense datasets. Coordinate descent-based NMTF converges up to sixteen times faster compared to well-established methods.

Partial Text

Extracting patterns from relational data is a key task in natural language processing [1], bioinformatics [2], and digital humanities [3]. We typically represent a relational dataset with a data matrix, encoding, for example, information on document-term frequencies, gene-disease associations, or user-item ratings. Non-negative matrix tri-factorization (NMTF) is a general technique that takes a data matrix and compresses, or embeds, the matrix into a compact latent space. The learned embedding space can be used to identify clusters [4, 5], reveal interesting patterns [6, 7], and generate feature representations for downstream analytics [8, 9]. NMTF has been used to discover disease-disease associations [10]. identify cancer driver genes from patient data [11], and to model topics in text data [12]. However, despite numerous applications, training NMTF models on large datasets can be slow and has remained computationally challenging [13].

We first describe datasets and their preprocessing. We then continue with a formal presentation of optimization methods, focusing on derivation of three optimization methods that are new for non-negative matrix tri-factorization.

We empirically study the convergence of the algorithms on six datasets of varying size and density. We find that traditional multiplicative update rules method has the worst performance. In contrast, coordinate descent converges 5 to 24 times faster than multiplicative update rules (Table 2) and up to 16 times faster when comparing the runtime (Table 3). Multiplicative update rules method outperforms alternating least squares on dense datasets, whereas alternating least squares achieves most promising results on sparse datasets.

Currently, multiplicative update rules represent a popular off-the-shelf optimization approach for non-negative matrix tri-factorization (NMTF) that is used in diverse applications, ranging from bioinformatics to natural language processing (e.g., [4–12]). We derived three new optimization methods for NMTF and demonstrated their convergence and scalability on six datasets of varying size and density. Importantly, we observe that coordinate descent, the newly derived method, converges fast and is stable on datasets of varying size and density. Our results suggest that coordinate descent might be a preferred off-the-shelf optimization method to train NMTF models on large datasets. These findings together with complete mathematical derivations (see S1 File) and a scalable public implementation of the algorithms (see section Implementation) represent primary contributions of this paper.

A traditional optimization approach to non-negative matrix tri-factorization uses multiplicative update rules. We described three alternative algorithms that train a non-negative matrix tri-factorization model based on alternating least squares, projected gradients, or coordinate descent. We conducted an empirical study comparing convergence and runtime of the training algorithms on six datasets. Our results show that the new approaches converge faster than multiplicative update rules and that coordinate descent achieves the best average performance.