Research Article: From patterned response dependency to structured covariate dependency: Entropy based categorical-pattern-matching

Date Published: June 14, 2018

Publisher: Public Library of Science

Author(s): Hsieh Fushing, Shan-Yu Liu, Yin-Chen Hsieh, Brenda McCowan, Quanquan Gu.


Data generated from a system of interest typically consist of measurements on many covariate features and possibly multiple response features across all subjects in a designated ensemble. Such data are naturally represented by one response matrix against one covariate matrix. A matrix lattice is an advantageous platform for simultaneously accommodating heterogeneous data types (continuous, discrete and categorical) and for exploring hidden dependency among and between features and subjects. After each feature is individually renormalized with respect to its own histogram, the categorical version of mutual conditional entropy is evaluated for all pairs of response and covariate features according to combinatorial information theory. Then, by applying Data Cloud Geometry (DCG) algorithmic computations to this mutual conditional entropy matrix, the features are partitioned into multiple synergistic feature-groups. Distinct synergistic feature-groups embrace distinct structures of dependency. The explicit details of dependency among members of a synergistic feature-group are seen through multiscale compositions of blocks computed by a computing paradigm called Data Mechanics. We then propose a categorical pattern matching approach to establish a directed associative linkage: from the patterned response dependency to serial structured covariate dependency. The graphic display of such a directed associative linkage is termed an information flow, and the degrees of association are evaluated via tree-to-tree mutual conditional entropy. This new universal way of discovering system knowledge is illustrated through five data sets. In each case, the emergent visible heterogeneity is an organization of discovered knowledge.
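The pairwise entropy step described in the abstract can be illustrated with a minimal Python sketch. It rests on two assumptions that are not spelled out in this excerpt: equal-width binning stands in for the histogram-based renormalization of each feature, and mutual conditional entropy is taken to be the average of the two conditional entropies, each rescaled by the corresponding marginal entropy. The function names (categorize, entropy, conditional_entropy, mutual_conditional_entropy) are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def categorize(values, n_bins=3):
    """Map numeric values to bin labels 0..n_bins-1.

    Assumption: equal-width bins are a simple stand-in for the
    histogram-based renormalization mentioned in the abstract.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (natural log) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(y, x):
    """H(Y | X): entropy of y within each category of x, weighted by category size."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def mutual_conditional_entropy(x, y):
    """Assumed definition: mean of H(X|Y)/H(X) and H(Y|X)/H(Y).

    Values near 0 suggest strong dependency; values near 1 suggest none.
    """
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        raise ValueError("a constant feature has zero entropy; the ratio is undefined")
    return 0.5 * (conditional_entropy(x, y) / hx + conditional_entropy(y, x) / hy)


if __name__ == "__main__":
    # Hypothetical toy data: one continuous covariate, one categorical response.
    covariate = categorize([0.1, 0.4, 2.2, 2.9, 4.8, 5.0], n_bins=3)
    response = ["low", "low", "mid", "mid", "high", "high"]
    print(mutual_conditional_entropy(covariate, response))
```

Under this assumed definition, two features that induce the same partition of the subjects score 0, and two independent features score 1. Evaluating the function over every response-covariate pair would fill the kind of matrix on which a grouping algorithm such as DCG operates. The grouping step itself is not sketched here.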

Partial Text

Nearly all scientific research is geared toward acquiring knowledge and understanding of systems of interest. Data generated from a target system therefore typically consist of measurements on many covariate features and possibly multiple response features belonging to subjects who constitute a representative ensemble of the system. As such, a system data set typically consists of one response data matrix against one covariate data matrix. The two matrices share the common ensemble of subjects, arranged along the row axis, while each matrix's own features are arranged along its column axis. The matrix lattice is indeed an advantageous platform for revealing patterned structures, particularly dependency within the response side or within the covariate side. Moreover, these two platforms become the joint foundation for all the directed associative linkages going from the response side to the covariate side.

Evaluation of the amount of information conveyed by X with regard to Y. In his 1965 paper [9], titled “Three approaches to the quantitative definition of information,” A. N. Kolmogorov said that

Motivations and goals. Before introducing our computational paradigms for extracting the information contained in a system data set, we motivate our developments by explaining why modeling methodologies in statistics have limited merit in many system studies. Here we use the popular logistic regression as an illustrative example. We explicitly demonstrate why this modeling is not mathematically expandable: it is strictly constrained by the underlying homogeneity assumption, which goes against the heterogeneity naturally embedded within almost all systems of scientific interest. It is worth emphasizing that similar explanations apply to the linear regression model as well.

In this section we analyze five simple system data sets from the UCI Machine Learning Repository.

In this paper we develop one universal platform for system data analysis: an algorithmic computing protocol plus graphic display techniques. Our computing protocol is developed under the guiding principle that a system contains multiple synergistic mechanisms. Our first goal is to extract the authentic information content contained in a system data set. Our second goal is to stimulate proper understanding of the computed information through categorical pattern matching via graphic display, and our third is to discover pertinent knowledge about the system under study. The resultant system knowledge on one single response mechanism is visible and explainable through a series of covariate mechanisms. Such knowledge is organized and represented through one single information flow, and a system is likely better understood through multiple information flows.



