PhD Jan Lemeire, 17 December 2007

Learning Causal Models of Multivariate Systems and the Value of it for the Performance Modeling of Computer Programs

The work presented in this thesis consists of a philosophical, theoretical and practical exploration of causal inference and its benefits for the performance analysis of computer programs.

%*************************************************************************

== Promotion Text ==

Jan Lemeire's multidisciplinary work originated from the idea of introducing causal learning algorithms into the world of performance analysis. Besides this practical and applied research, it also comprises a theoretical and philosophical study of causal inference. A performance analysis aims at understanding the execution of programs on computer systems in terms of resource utilization. The causal learning algorithms allow the automatic construction, from experimental data, of models showing which variables influence the overall performance and how. Learning the causal mechanisms of a system from observations alone is an ambitious, and therefore controversial, subject. The new concept of the Kolmogorov Minimal Sufficient Statistic (KMSS) provides a means to evaluate the validity of causal inference. The idea is that a model should capture all patterns or regularities of the observations. A recurring theme throughout this work is the notion of qualitative properties. The proposal is to define them as the regularities of the KMSS: the properties of the data that allow compression. Qualitative properties provide a different kind of knowledge than purely quantitative information. They are even indispensable for a complete understanding of a system.

%*************************************************************************

== Abstract ==

The PhD research of Jan Lemeire consists of the introduction of causal inference into the field of performance analysis, as well as a theoretical and philosophical study of causal inference itself. Algorithms for causal inference reveal the causal structure of a system from experimental data: they try to detect the cause-effect relations among the observed quantities.
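To make this concrete, here is a minimal sketch of the skeleton phase of a constraint-based learner in the style of the PC algorithm. It illustrates the general approach rather than the implementation used in the thesis; ci_test stands for any conditional-independence test, such as the mutual-information test sketched further below.

    from itertools import combinations

    def learn_skeleton(variables, ci_test, max_cond_size=2):
        """Skeleton phase of a PC-style algorithm (illustrative sketch).

        Starts from a complete undirected graph and removes the edge X-Y
        whenever X and Y are found conditionally independent given some
        subset S of the variables still adjacent to X. `ci_test(x, y, s)`
        must return True if X is independent of Y given the set S.
        """
        adjacency = {v: set(variables) - {v} for v in variables}
        separating_sets = {}
        for cond_size in range(max_cond_size + 1):
            for x in variables:
                for y in list(adjacency[x]):
                    # Candidate conditioning sets: neighbours of x, except y.
                    for s in combinations(adjacency[x] - {y}, cond_size):
                        if ci_test(x, y, set(s)):
                            # X _||_ Y | S: drop the edge and remember S,
                            # which is later used to orient colliders.
                            adjacency[x].discard(y)
                            adjacency[y].discard(x)
                            separating_sets[(x, y)] = set(s)
                            break
        return adjacency, separating_sets

Edges that survive all tests form the skeleton of the causal graph; the recorded separating sets are then used to orient colliders and, where possible, the remaining edges.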
Applied to the data of a performance analysis, the automatically learned causal models show which variables affect the overall performance and how. A performance analysis aims at understanding the execution of programs on computer systems in terms of resource utilization. The results show that correct and useful models are learned about the performance of sequential and parallel programs. Unexpected dependencies and potential explanations for outliers could be detected. The analysis also allows the validation of independencies and of the genericity of performance characteristics. Causal models can be viewed as qualitative models, expressing the relational information of the system. They can be turned into quantitative models by a regression analysis.

Practical deployment of the learning algorithms required overcoming three limitations. A general, form-free independence test is used, based on the information-theoretic concept of mutual information. Conditional independencies are the key information for causal inference: indirectly related variables become conditionally independent when conditioning on the variables that mediate their relation. The underlying probability distribution is estimated by kernel density estimation. This allows the analysis of data containing non-linear relations and a mixture of continuous, discrete and categorical variables, which solves the first two limitations.
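As a sketch of how such a form-free test can be set up (a simplified illustration for continuous variables only; the test in the thesis also covers discrete and categorical data, and the fixed threshold here stands in for a proper significance criterion): conditional mutual information satisfies the identity I(X;Y|S) = H(X,S) + H(Y,S) - H(X,Y,S) - H(S), where each entropy term can be estimated from a kernel density fit to the samples.

    import numpy as np
    from scipy.stats import gaussian_kde

    def kde_entropy(samples):
        """Resubstitution entropy estimate H ~ -mean(log p(x_i)), with the
        density p obtained from a Gaussian kernel fit to the samples.
        `samples` has shape (n_points, n_dims)."""
        kde = gaussian_kde(samples.T)            # gaussian_kde expects (dims, n)
        return -np.mean(np.log(kde(samples.T)))

    def conditional_mi(x, y, s=None):
        """Estimate I(X;Y|S); x and y have shape (n,), s has shape (n, k)."""
        x = x.reshape(-1, 1)
        y = y.reshape(-1, 1)
        if s is None:
            # Unconditional case: I(X;Y) = H(X) + H(Y) - H(X,Y).
            return kde_entropy(x) + kde_entropy(y) - kde_entropy(np.hstack([x, y]))
        return (kde_entropy(np.hstack([x, s])) + kde_entropy(np.hstack([y, s]))
                - kde_entropy(np.hstack([x, y, s])) - kde_entropy(s))

    def mi_ci_test(x, y, s=None, threshold=0.05):
        # Declare (conditional) independence when the estimated mutual
        # information is negligibly small.
        return conditional_mi(x, y, s) < threshold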
The third limitation is the presence of deterministic relations in performance data. Deterministic relations result in what I call 'information equivalences': two variables contain the same information about a third, reference variable. Which of the two directly relates to the reference variable cannot be determined from independence information alone. The solution I propose is to relate the variable having the simplest relation with the reference variable. This makes sense under the assumption that complexity increases along causal paths. An augmented Bayesian network was defined which also represents information equivalences. Their consequences, such as the independencies that follow from them, were added to the theory. By extending the definition, the faithfulness of the models could be re-established. The faithfulness property means that the model represents all conditional independencies. Finally, the learning algorithms could easily be extended to learn augmented models from data containing deterministic relations.

The theoretical part of this work contributes to the causal interpretation of the learned models and to the discussion of the validity of causal inference. Causal inference is studied with the general principles of inductive inference, more specifically that of Kolmogorov complexity. In this view, learning is equated with finding compressed descriptions of the data, where it is the patterns or regularities of the data that allow the compression. The Kolmogorov Minimal Sufficient Statistic allows the formal separation of the meaningful information, containing the regularities of the data, from the accidental, random information. This corresponds to an extended interpretation of the faithfulness property: a model should capture all regularities of the data. Applied to causal model theory, I show that a Bayesian network provides the KMSS of a probability distribution, and is faithful to it, when it is based on the minimal factorization and results in a random Directed Acyclic Graph (DAG) and random, mutually unrelated Conditional Probability Distributions (CPDs).

Faithfulness is violated when a CPD exhibits a strict regularity, such as defining a deterministic relation, or when a regularity among the CPDs gives rise to a conditional independence that does not follow from the Markov condition. The causal interpretation of Bayesian networks, as defined by Pearl's interventions, corresponds to the canonical decomposition of the model into independent CPDs. The concepts of causal model theory, such as d-separation and interventions, rely on this decomposition. However, the assumption that these submodels correspond to the underlying physical mechanisms of the system under study is not always valid, as demonstrated by the many counterexamples that can be found in the literature. These counterexamples can be analyzed by two aspects: does the minimal BN capture all regularities, and does the CPD decomposition correspond to the mechanisms? Fortunately, unfaithfulness or the presence of regularities among the CPDs indicates that the real structure is based on another decomposition.
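As an illustration of the kind of counterexample meant here (a constructed, textbook-style example, not one taken from the thesis), consider a network with edges X → Y, X → Z and Y → Z whose mechanisms are

\[
Y = X + E, \qquad Z = Y - X, \qquad E \perp\!\!\!\perp X .
\]

Then Z = E, so X and Z are marginally independent: an independence that does not follow from the Markov condition. Faithfulness is violated by the regularity linking the CPDs of Y and Z, and that same regularity signals that the chosen decomposition does not match the real mechanisms.

%*************************************************************************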