Author : Pritam Chanda
Publisher :
ISBN 13 :
Total Pages : 216 pages
Book Synopsis AN INFORMATION THEORETIC FRAMEWORK FOR IDENTIFICATION AND MODELING OF GENE-GENE AND GENE-ENVIRONMENT INTERACTIONS by : Pritam Chanda
AN INFORMATION THEORETIC FRAMEWORK FOR IDENTIFICATION AND MODELING OF GENE-GENE AND GENE-ENVIRONMENT INTERACTIONS, written by Pritam Chanda, was released in 2010 and runs to 216 pages. Book excerpt: Many applications in scientific research, economics, finance and marketing produce high-dimensional data sets in which the data attributes are interdependent. Data mining techniques have been employed to make sense of these data sets and to discover useful patterns and models that help explain how the system being represented works. To discover key patterns in the data, it is necessary to find relationships between the variables (or attributes) that help explain the interdependencies (such as independence, synergy and redundancy) among the attributes, which are important for understanding an appropriate probabilistic model of the data.

In a biological or genetic context, statistical interactions between two or more genes (called gene-gene interactions or GGI), and those also involving several non-genetic or environmental factors (called gene-environment interactions or GEI), are manifestations of the underlying complex biological interactions. The risk of developing many common and complex diseases such as cancer, autoimmune disease and cardiovascular disease involves complex interactions between multiple genes and several endogenous and exogenous environmental factors (or covariates).
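To make the notions of synergy and redundancy concrete, here is a minimal sketch that is not taken from the dissertation. It estimates entropies from observed counts and compares the information two variables carry jointly about a trait with what each carries alone. The function names, the toy data, and the sign convention (positive for synergy, negative for redundancy) are assumptions chosen for illustration; other sign conventions for interaction information are in use.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable observations."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def synergy(x1, x2, y):
    """I(X1,X2;Y) - I(X1;Y) - I(X2;Y): above 0 suggests synergy, below 0 redundancy."""
    joint = list(zip(x1, x2))
    return (mutual_information(joint, y)
            - mutual_information(x1, y)
            - mutual_information(x2, y))


# Hypothetical toy data: two binary variables, all four combinations equally frequent.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
xor_trait = [i ^ j for i, j in zip(a, b)]  # trait depends only on the pair jointly

synergy_xor = synergy(a, b, xor_trait)  # neither variable alone is informative
synergy_copy = synergy(a, a, a)         # second variable duplicates the first

print(f"XOR trait: {synergy_xor:+.2f} bits, duplicated variable: {synergy_copy:+.2f} bits")
```

In the XOR case each variable on its own tells nothing about the trait while the pair determines it, so the measure is positive; in the duplicated case the second variable adds nothing, so the measure is negative. Plug-in estimates like these are biased for small samples, so real analyses would need significance testing.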
The successful detection of critical gene-gene and gene-environment statistical interactions can provide the scientific basis for many underlying biological interactions, improve the prospects for uncovering previously undiscovered genes involved in the disease process, and help to develop preventative and curative measures for particular genetic susceptibilities.

More specifically, the identification of interactions from available genotype data is crucial because GEI and GGI analysis (1) can highlight important interactions among genetic variations in different regions of the genome and non-genetic or environmental factors, which can be used to identify and prioritize regions for sequencing studies; (2) can be employed for directing study design so that the relevant informative environmental variables can be collected; and (3) can provide evidence in support of specific mechanisms of causality. In this dissertation, we develop, extend, validate and apply information theoretic metrics for identification and characterization of interactions among genetic variations in epidemiological studies, as studies have linked complex epidemiological associations between genetic variations to the risk of developing many diseases.

We investigate interactions between genes (GGI) and between genes and non-genetic factors or environmental variables (GEI), and systematically investigate the dependence of our metrics on genetic and study-design factors to identify the GGI/GEI and enable a visual presentation of the results. We also develop several simulation strategies to be used extensively for performance evaluation, because the underlying structure and true relationships between genetic and environmental factors in experimental data sets are rarely known with certainty. The high dimensionality of large data sets (e.g. from genome-wide studies) and the presence of confounding factors such as multiple correlations (or linkage disequilibrium among genes) and genetic heterogeneity result in a combinatorial explosion of the number of possible interactions present in the data.

This combinatorial growth makes it computationally difficult, if not impossible, to exhaustively assess the full range of predictor variables for potential interactions associated with the trait or phenotype variables and diseases in epidemiological studies. Therefore, we develop and evaluate a set of algorithms capable of efficiently searching the combinatorial space to mine significant and non-redundant interactions for both discrete and quantitative phenotypes, and conduct detailed power, false-discovery rate and sample size analyses for epidemiological studies. In GEI analysis, the presence of a high degree of linkage disequilibrium among the genetic variables causes several interactions to contain redundant information regarding the phenotype variable.

Therefore it is essential to prune a set of GEI using a modeling step, which we define as the process of identifying a parsimonious set of combinations or variables capable of explaining the disease phenotype/trait variable while avoiding over- and under-fitted models. We develop a novel algorithm that uses information theoretic metrics and their properties to efficiently perform the model synthesis task. Another principal challenge in GEI analyses is to develop metrics for prioritization of genetic variables for sequencing studies that incorporate knowledge from interactions between the genes.
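The combinatorial growth mentioned above can be illustrated with a short count of candidate variable combinations at each interaction order. This is only a sketch: the figure of 500,000 variables is a hypothetical size for a genome-wide panel, not a number from the dissertation, and `interaction_counts` is an illustrative helper name.

```python
from math import comb


def interaction_counts(num_variables, max_order):
    """Number of distinct variable combinations at each order from 2 to max_order."""
    return {k: comb(num_variables, k) for k in range(2, max_order + 1)}


# Hypothetical panel size chosen only to show the scale of the search space.
counts = interaction_counts(500_000, 4)
for order, count in counts.items():
    print(f"order {order}: {count:.3e} candidate combinations")
```

Even at order two the count is already above 10^11 for this panel size, which is why exhaustive evaluation becomes impractical as the order increases.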
The gene-environment associations identified from large-scale genotyping studies require large follow-on studies that comprehensively sequence the disease-associated regions to enable discovery of less common genetic variations that may be contributing to disease.

Such comprehensive follow-up studies are resource intensive and require large sample sizes, so it is essential to leverage the available information from existing genotyping studies to identify the most promising disease-associated regions and the possible environmental factors. Prioritizing genetic regions involved in GGI or GEI for sequencing studies can be difficult because the number of interactions, the order of interactions and their magnitudes can vary considerably, making it difficult to decide on the relative importance of, e.g., a few large-magnitude interactions vis-a-vis numerous interactions of moderate magnitude.

In this research, we develop a novel metric for effectively visualizing and ranking the genetic and environmental variables involved in numerous statistical interactions. Finally, in genetic data sets the phenotype or trait variable is often absent, and it is useful to mine statistical interactions among the genetic variables in an unsupervised fashion that can highlight the underlying biological interactions among the genes and proteins present in pathways. To address such analyses, in this dissertation we study the problem of mining statistically significant correlation patterns and interaction information in genetic data.
We develop novel concepts of combinations of variables containing highly significant, moderately significant and non-significant correlation information, present some bounds on correlation information, and develop several pruning strategies that use these bounds to efficiently prune the combinatorial search space.

Using the bounds and pruning strategies, we develop efficient search algorithms to mine such associations and critically examine the performance of our proposed mining algorithms.
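The general idea of bound-based pruning can be sketched as follows. The sketch assumes that "correlation information" can be read as total correlation (the sum of marginal entropies minus the joint entropy), and it uses one elementary bound: the joint entropy is at least the largest marginal entropy. The dissertation's own bounds and algorithms are not reproduced here; the function names, the threshold, and the toy genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable observations."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def total_correlation(columns):
    """Sum of marginal entropies minus the joint entropy (in bits)."""
    return sum(entropy(col) for col in columns) - entropy(list(zip(*columns)))


def upper_bound(columns):
    """Joint entropy >= largest marginal entropy, so TC <= sum(H) - max(H)."""
    marginals = [entropy(col) for col in columns]
    return sum(marginals) - max(marginals)


def mine_pairs(data, threshold):
    """Pairs whose total correlation reaches the threshold, skipping hopeless pairs."""
    hits = []
    for i, j in combinations(sorted(data), 2):
        cols = [data[i], data[j]]
        if upper_bound(cols) < threshold:
            continue  # bound shows the pair cannot qualify; skip the joint entropy
        if total_correlation(cols) >= threshold:
            hits.append((i, j))
    return hits


# Hypothetical toy genotypes (binary for brevity).
data = {
    "snp_a": [0, 0, 1, 1, 0, 0, 1, 1],
    "snp_b": [0, 0, 1, 1, 0, 0, 1, 1],  # identical to snp_a
    "snp_c": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of snp_a in this sample
    "snp_d": [0, 0, 0, 0, 0, 0, 0, 1],  # low entropy, so its pairs are pruned
}
hits = mine_pairs(data, threshold=0.9)
print(hits)
```

The bound needs only marginal entropies, which can be computed once per variable, so pairs involving low-entropy variables are discarded without touching the joint distribution. The same pattern extends to higher-order combinations, where the savings matter most.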