Author : Dongrui Zhong
Book Synopsis: A Model-based Framework for High-throughput Genomic Data Enhancement, by Dongrui Zhong
A Model-based Framework for High-throughput Genomic Data Enhancement, written by Dongrui Zhong, was released in 2021. Book excerpt: Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including quality metadata with the data. When necessary metadata are missing, most researchers will consider the data useless. We developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using common metrics such as the F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
The performance of conventional machine learning methods relies greatly on the quality of the data. In the field of bioinformatics, researchers often need to work with large amounts of data from different sources, such as Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA).
Gene expression data collected in different studies may be somewhat inconsistent, for reasons such as differences in experimental design or in the platform (different microarray or RNA-seq platforms) used to measure gene expression. This complicates cross-study analysis, for example when transferring models among studies. Normalization methods such as quantile normalization and rank normalization are widely used to deal with such inconsistency. However, models trained on normalized data can still suffer from extreme values or large variation of errors in the original data, since conventional normalization methods may not completely overcome these issues. We propose to use the pairwise rank information among the gene expression values (paired predictive variables, or PPV), calculated from the original gene expression values, to train machine learning models. Our results show that PPV gives more statistically robust models, while the interpretability of variables is largely maintained and model performance is equivalent to or better than that of models trained on original gene expression data with conventional normalization methods.
The raw gene expression values can be decomposed into four components: the true value, systematic errors, random errors, and unexpected errors (producing outliers when large enough). Normalization methods are used to remove systematic errors, and outlier detection methods can be applied to detect outliers. Few, if any, methods have been developed to remove random errors. In this study, we developed an efficient method based on probabilistic principal component analysis (PPCA) to remove both random and unexpected errors in gene expression data by borrowing information across genes and samples using an internal prediction strategy.
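The pairwise rank idea behind PPV, described above, can be illustrated with a minimal sketch. The excerpt does not spell out how PPV is encoded, so this assumes the simplest reading: one binary feature per gene pair, set to 1 when the first gene is expressed above the second within the same sample. The function name `pairwise_rank_features` and the gene labels `G1` to `G3` are hypothetical, and the values are made up for illustration.

```python
from itertools import combinations

def pairwise_rank_features(expr, genes):
    """Map each gene pair (a, b) to 1 if a is expressed above b in this sample, else 0.

    Assumed encoding for illustration only; the source does not give the exact PPV definition.
    """
    return {(a, b): int(expr[a] > expr[b]) for a, b in combinations(genes, 2)}

genes = ["G1", "G2", "G3"]                      # hypothetical gene labels
sample = {"G1": 5.2, "G2": 3.1, "G3": 8.7}      # made-up expression values

# The same sample under a hypothetical monotone distortion (different scale and offset),
# standing in for a platform or study effect.
distorted = {g: 2.0 ** v + 10.0 for g, v in sample.items()}

features = pairwise_rank_features(sample, genes)
features_distorted = pairwise_rank_features(distorted, genes)
```

Because the features depend only on within-sample ordering, any strictly increasing transformation applied to a whole sample leaves them unchanged, which is one plausible reason such features are less sensitive to extreme values than the raw measurements.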
The proposed method, gene-expression-data-enhancement (GEDE), first estimates a covariance matrix for all the genes using an existing dataset with a relatively large number of samples. It then infers enhanced gene expression values for the dataset to be analyzed by predicting each gene's expression from the expression values of other genes. Using both simulation studies and real data analysis, we showed that the enhanced version of a gene expression profile has higher quality than the original profile. In real data analysis, enhanced gene expression profiles gave better differential gene expression analysis results than the original profiles. We recommend gene expression enhancement as an essential step in gene expression data analysis pipelines.
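The two steps described above (estimate a covariance matrix from a large reference set, then predict each gene from the others) can be sketched roughly as follows. This is not the GEDE implementation: instead of PPCA it uses a plain Gaussian conditional mean with a ridge-regularized sample covariance, and the names `fit_reference`, `enhance`, and `ridge` are hypothetical. The data are simulated from a low-rank signal plus independent noise.

```python
import numpy as np

def fit_reference(ref, ridge=1e-2):
    """Estimate gene means and a ridge-regularized precision matrix from a reference dataset."""
    mu = ref.mean(axis=0)
    cov = np.cov(ref, rowvar=False) + ridge * np.eye(ref.shape[1])
    return mu, np.linalg.inv(cov)

def enhance(x, mu, precision):
    """Replace each gene's value with its Gaussian conditional mean given all other genes.

    For precision matrix P and centered data d = x - mu, the leave-one-gene-out
    conditional mean of gene g equals x_g - (P d)_g / P_gg.
    """
    d = x - mu
    return x - (d @ precision) / np.diag(precision)

# Simulated data (an assumption for illustration): rank-3 signal over 20 genes plus unit noise.
rng = np.random.default_rng(0)
n_genes, n_factors = 20, 3
loadings = rng.normal(size=(n_factors, n_genes))

def simulate(n_samples):
    signal = rng.normal(size=(n_samples, n_factors)) @ loadings
    return signal, signal + rng.normal(size=(n_samples, n_genes))

_, reference = simulate(500)             # large existing dataset
true_signal, target = simulate(50)       # smaller dataset to be enhanced

mu, precision = fit_reference(reference)
enhanced = enhance(target, mu, precision)

mse_original = float(np.mean((target - true_signal) ** 2))
mse_enhanced = float(np.mean((enhanced - true_signal) ** 2))
```

In this simulation the noise is independent across genes, so predicting a gene from the others discards that gene's own noise while keeping the shared signal. Real expression data will not follow this toy model exactly, which is part of why the source uses PPCA rather than this simplified stand-in.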