Development of Statistical Methods for the Analysis of Single-cell RNA-seq Data


Book Synopsis Development of Statistical Methods for the Analysis of Single-cell RNA-seq Data by : Constantin Ahlmann-Eltze

Download or read book Development of Statistical Methods for the Analysis of Single-cell RNA-seq Data written by Constantin Ahlmann-Eltze. This book was released in 2023*. Available in PDF, EPUB and Kindle. Book excerpt:

Statistical Methods Development for the Analysis of Single Cell RNA-seq Data


Book Synopsis Statistical Methods Development for the Analysis of Single Cell RNA-seq Data by : Xiuyu Ma

Download or read book Statistical Methods Development for the Analysis of Single Cell RNA-seq Data written by Xiuyu Ma. This book was released in 2020. Available in PDF, EPUB and Kindle. Book excerpt: Single-cell analysis is a rapidly evolving approach to characterize genome-wide gene expression at the individual cell level. Overcoming the unique variational structure underlying the data and studying cellular heterogeneity require statistical tools. In this dissertation, I develop and improve statistical methods focused on identifying genes with differential distributions across conditions. The first method uses a compositional structure which explicitly accounts for the cellular subtypes to characterize gene expression as a mixture over subtypes and quantify the distributional change between conditions. We also extend the distributional comparison to more than two conditions. The second method accelerates the inference for patterns of how means vary among multiple groups. It scales up the first method when more mixing components are considered. The first method, called scDDboost, introduces an empirical Bayesian mixture approach and leverages cell-subtype structure revealed in cluster analysis in order to boost gene-level information on expression changes. Cell clustering informs gene-level analysis through a specially-constructed prior distribution over pairs of multinomial probability vectors; this prior meshes with available model-based tools that score patterns of differential expression over multiple subtypes. We derive an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition. Advantage is gained by the compositional structure of the model, in which a host of gene-specific mixture components are allowed, but also in which the mixing proportions are constrained at the whole-cell level.
This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution. The result, according to our numerical experiments, is improved sensitivity compared to several standard approaches for detecting distributional expression changes. The compositional model has great flexibility and we further extend it to more than two conditions. The second method, called EBSeq.v2, accelerates the widely used package EBSeq. The number of patterns of equivalent/differential means among groups grows fast with the number of groups, which poses challenges for memory and computation. We provide a pruning algorithm that eliminates unlikely patterns, which we can assess through preliminary checks over local Bayes factors. Further improvements are gained through a more efficient one-step EM for hyperparameter optimization and a C++ implementation.
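
The growth in the number of mean patterns mentioned in this excerpt can be made concrete. The following is a minimal sketch, assuming that each pattern of equivalent/differential means corresponds to one way of partitioning the groups into blocks that share a mean; under that assumption the pattern count for K groups is the Bell number B(K). The function name `count_mean_patterns` is hypothetical and is not part of EBSeq or EBSeq.v2.

```python
def count_mean_patterns(num_groups: int) -> int:
    """Count the ways to partition `num_groups` groups into blocks of equal means.

    Assumption: one pattern == one set partition, so the result is the
    Bell number B(num_groups), computed here with the Bell triangle.
    """
    if num_groups < 0:
        raise ValueError("num_groups must be non-negative")
    if num_groups == 0:
        return 1  # the empty set has exactly one partition
    row = [1]  # first row of the Bell triangle
    for _ in range(num_groups - 1):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to the running value.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(f"{k} groups -> {count_mean_patterns(k)} patterns")
```

Under this assumption, 3 groups give 5 patterns while 10 groups already give 115,975, which illustrates why pruning unlikely patterns matters for memory and computation.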

Statistical Methods for Bulk and Single-cell RNA Sequencing Data

Total Pages : 207 pages


Book Synopsis Statistical Methods for Bulk and Single-cell RNA Sequencing Data by : Wei Li

Download or read book Statistical Methods for Bulk and Single-cell RNA Sequencing Data written by Wei Li. This book was released in 2019 with a total of 207 pages. Available in PDF, EPUB and Kindle. Book excerpt: Since the invention of next-generation RNA sequencing (RNA-seq) technologies, they have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies on bulk tissues. Recently, the emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at a single-cell resolution, providing a chance to characterize stochastic heterogeneity within a cell population. The analysis of bulk and single-cell RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date. The first part of this dissertation focuses on the statistical challenges in the transcript-level analysis of bulk RNA-seq data. The next-generation RNA-seq technologies have been widely used to assess full-length RNA isoform structure and abundance in a high-throughput manner, enabling us to better understand the alternative splicing process and transcriptional regulation mechanism. However, accurate isoform identification and quantification from RNA-seq data are challenging due to the information loss in sequencing experiments. In Chapter 2, given the fast accumulation of multiple RNA-seq datasets from the same biological condition, we develop a statistical method, MSIQ, to achieve more accurate isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. The MSIQ method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples and allowing for higher weights on the consistent group.
We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy of MSIQ compared with alternative methods through both simulation and real data studies. In Chapter 3, we introduce a novel method, AIDE, the first approach that directly controls false isoform discoveries by implementing the statistical model selection principle. Solving the isoform discovery problem in a stepwise manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. Our results demonstrate that AIDE has the highest precision compared to the state-of-the-art methods, and it is able to identify isoforms with biological functions in pathological conditions. The second part of this dissertation discusses two statistical methods to improve scRNA-seq data analysis, which is complicated by the excess missing values, the so-called dropouts due to low amounts of mRNA sequenced within individual cells. In Chapter 5, we introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. The scImpute method automatically identifies likely dropouts, and only performs imputation on these values by borrowing information across similar cells. Evaluation based on both simulated and real scRNA-seq data suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropouts, enhance the clustering of cell subpopulations, and improve the accuracy of differential expression analysis. In Chapter 6, we propose a flexible and robust simulator, scDesign, to optimize the choices of sequencing depth and cell number in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. It is the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. 
In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings.

Statistical Simulation and Analysis of Single-cell RNA-seq Data


Book Synopsis Statistical Simulation and Analysis of Single-cell RNA-seq Data by : Tianyi Sun

Download or read book Statistical Simulation and Analysis of Single-cell RNA-seq Data written by Tianyi Sun. This book was released in 2023. Available in PDF, EPUB and Kindle. Book excerpt: The recent development of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies by revealing the genome-wide gene expression levels within individual cells. In contrast to bulk RNA sequencing, scRNA-seq technology captures cell-specific transcriptome landscapes, which can reveal crucial information about cell-to-cell heterogeneity across different tissues, organs, and systems and enable the discovery of novel cell types and new transient cell states. According to search results from PubMed, from 2009 to 2023, over 5,000 published studies have generated datasets using this technology. Such large volumes of data call for high-quality statistical methods for their analysis. In the three projects of this dissertation, I have explored and developed statistical methods to model the marginal and joint gene expression distributions and determine the latent structure type for scRNA-seq data. In all three projects, synthetic data simulation plays a crucial role. My first project focuses on the exploration of the Beta-Poisson hierarchical model for the marginal gene expression distribution of scRNA-seq data. This model is a simplified mechanistic model with biological interpretations. Through data simulation, I demonstrate three typical behaviors of this model under different parameter combinations, one of which can be interpreted as one source of the sparsity and zero inflation that is often observed in scRNA-seq datasets. Further, I discuss parameter estimation methods of this model and its other applications in the analysis of scRNA-seq data. My second project focuses on the development of a statistical simulator, scDesign2, to generate realistic synthetic scRNA-seq data.
Although dozens of simulators have been developed before, they lack the capacity to simultaneously achieve the following three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill in this gap, scDesign2 is developed as a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copula. We verify that scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do. Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs. My third project focuses on deciding latent structure types for scRNA-seq datasets. Clustering and trajectory inference are two important data analysis tasks that can be performed for scRNA-seq datasets and will lead to different interpretations. However, as of now, there is no principled way to tell which one of these two types of analysis results is more suitable to describe a given dataset. In this project, we propose two computational approaches that aim to distinguish cluster-type vs. 
trajectory-type scRNA-seq datasets. The first approach is based on building a classifier using eigenvalue features of the gene expression covariance matrix, drawing inspiration from random matrix theory (RMT). The second approach is based on comparing the similarity of real data and simulated data generated by assuming the cell latent structure as clusters or a trajectory. While both approaches have limitations, we show that the second approach gives more promising results and has room for further improvements.
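
The Beta-Poisson hierarchical model named in this excerpt can be illustrated with a small simulation. This is a minimal sketch, assuming one common parameterization in which a per-cell fraction p is drawn from Beta(alpha, beta) and the observed count is drawn from Poisson(scale * p); the dissertation's own parameterization may differ. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's multiplication method.

    Suitable for modest rates only (exp(-rate) must not underflow).
    """
    if rate <= 0.0:
        return 0
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_beta_poisson(alpha, beta, scale, n_cells, seed=0):
    """Simulate counts for one gene: p ~ Beta(alpha, beta), X ~ Poisson(scale * p)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cells):
        p = rng.betavariate(alpha, beta)  # assumed per-cell expression fraction
        counts.append(sample_poisson(scale * p, rng))
    return counts


def zero_fraction(counts):
    """Fraction of cells with a zero count."""
    return sum(1 for c in counts if c == 0) / len(counts)


if __name__ == "__main__":
    # Both settings share the same expected count (scale * alpha / (alpha + beta) = 2),
    # but the small-alpha setting concentrates p near zero and yields many more zeros.
    sparse = simulate_beta_poisson(0.1, 0.9, 20.0, 2000, seed=1)
    dense = simulate_beta_poisson(10.0, 90.0, 20.0, 2000, seed=1)
    print("zero fraction, alpha=0.1:", round(zero_fraction(sparse), 3))
    print("zero fraction, alpha=10 :", round(zero_fraction(dense), 3))
```

Comparing two settings with the same mean but different Beta shapes is one way to see how the model alone can produce the sparsity and zero inflation described in the excerpt, without a separate zero-inflation component.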

Statistical Methods for RNA-sequencing Data


Book Synopsis Statistical Methods for RNA-sequencing Data by : Rhonda Bacher

Download or read book Statistical Methods for RNA-sequencing Data written by Rhonda Bacher. This book was released in 2017. Available in PDF, EPUB and Kindle. Book excerpt: Major methodological and technological advances in sequencing have inspired ambitious biological questions that were previously elusive. Addressing such questions with novel and complex data requires statistically rigorous tools. In this dissertation, I develop, evaluate, and apply statistical and computational methods for analysis of high-throughput sequencing data. A unifying theme of this work is that all these methods are aimed at RNA-seq data. The first method focuses on characterizing gene expression in RNA-seq experiments with ordered conditions. The second focuses on single-cell RNA-seq data, where we develop a method for normalization to account for a previously unknown technical artifact in the data. Finally, we develop a simulation in order to recapitulate the source of the artifact in silico.

Statistical Methods in Single Cell and Spatial Transcriptomics Data


Book Synopsis Statistical Methods in Single Cell and Spatial Transcriptomics Data by : Roopali Singh

Download or read book Statistical Methods in Single Cell and Spatial Transcriptomics Data written by Roopali Singh. This book was released in 2021. Available in PDF, EPUB and Kindle. Book excerpt: Single cell RNA-sequencing (scRNA-seq) allows one to study the transcriptomics of different cell types in heterogeneous samples (e.g. tissues) at a single cell level. Most scRNA-seq protocols experience high levels of dropout due to the small amount of starting material, leading to a majority of reported expression levels being zero. Though missing data contain information about reproducibility, they are often excluded in the reproducibility assessment, potentially generating misleading assessments. In the first part of my dissertation, we develop a copula-based regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors (e.g., platform or sequencing depth) when a large number of measurements are missing. Simulations show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method by comparing the reproducibility of different library preparation platforms and studying the effect of sequencing depth on reproducibility, thereby determining the cost-effective sequencing depth that is required to achieve sufficient reproducibility. The spatial locations of these single cells are lost in scRNA-seq data. A recently emerging technology, Spatial Transcriptomics (ST), measures the gene expression in a tissue slice in situ, maintaining cells' spatial information in the tissue. However, ST does not have single-cell resolution; instead, each spot captures a group of potentially heterogeneous cells, which needs to be deconvolved to learn the cell composition at each spot.
In the second part of my dissertation, we develop a reference-free deconvolution method, based on Bayesian non-negative matrix factorization, to infer the cell type composition of each spot. Unlike the existing deconvolution methods, which all take reference-based approaches, our approach does not rely on scRNA-seq references. Simulations show that our method is more accurate in detecting the cell-type compositions than existing deconvolution techniques in case of varying spot size, heterogeneity, and imperfect single-cell reference. We illustrate the usefulness of our method using Mouse Brain Cerebellum data and Human Intestine Developmental data.

Statistical Methods for Improving Data Quality in Modern Rna Sequencing Experiments


Book Synopsis Statistical Methods for Improving Data Quality in Modern Rna Sequencing Experiments by : Zijian Ni (Ph.D.)

Download or read book Statistical Methods for Improving Data Quality in Modern Rna Sequencing Experiments written by Zijian Ni (Ph.D.). This book was released in 2022. Available in PDF, EPUB and Kindle. Book excerpt: RNA sequencing (RNA-seq) has revolutionized the possibility of measuring transcriptome-wide gene expression in the last two decades. Modern RNA sequencing techniques such as single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have been developed in recent years, allowing researchers to quantify gene expression at single-cell resolution or to profile gene activity patterns in 2-dimensional space across tissue. While useful, data collected from these techniques always come with noise, and appropriate filtering and cleaning are required for reliable downstream analyses. In this dissertation, I investigate multiple quality-related issues in scRNA-seq and ST experiments, and I develop, implement, evaluate and apply statistical methods to adjust for them. A unifying theme of this work is that all these methods aim at improving data quality and allowing for better power and precision in downstream analyses. For scRNA-seq data, the quality issue we discuss in this dissertation is distinguishing barcodes associated with real cells from those binding background noise. In droplet-based scRNA-seq experiments, raw data contains both cell barcodes that should be retained for downstream analysis as well as background barcodes that are uninformative and should be filtered out. Because ambient RNAs are present in all barcodes, cell barcodes are not easily distinguished from background barcodes. Both misclassified background barcodes and cell barcodes induce misleading results in downstream analyses. Existing filtering methods test barcodes individually and consequently do not leverage the strong cell-to-cell correlation present in most datasets.
To improve cell detection, we introduce CB2, a cluster-based approach for distinguishing real cells from background barcodes. As demonstrated in simulated and case study datasets, CB2 has increased power for identifying real cells which allows for the identification of novel subpopulations and improves downstream differential expression analyses. We then present a benchmark study to evaluate the performance of cell detection methods, including CB2, on public scRNA-seq datasets covering a variety of experiment protocols. In recent years, variants of scRNA-seq techniques have been developed for specialized biological tasks. While the data structures remain the same as the standard scRNA-seq experiment, the underlying data properties can differ substantially. Here, we propose the first benchmark study to provide a thorough comparison across existing cell detection methods in scRNA-seq data, and to guide users to choose the appropriate methods for their experiments. Evaluation metrics include power, precision, computational efficiency, robustness, and accessibility. In addition, we provide investigation and guidance on appropriately choosing filtering parameters in order to improve data quality. For ST data, we uncover, for the first time, a novel quality issue: genes expressed in one tissue region bleed out and contaminate nearby tissue regions. ST is a powerful and widely-used approach for profiling transcriptome-wide gene expression across a tissue with emerging applications in molecular medicine and tumor diagnostics. Recent ST experiments utilize slides containing thousands of spots with spot-specific barcodes that bind RNAs. Ideally, unique molecular identifiers at a spot measure spot-specific expression, but this is often not the case owing to bleed from nearby spots, an artifact we refer to as spot swapping. We design a creative human-mouse chimeric ST experiment to validate the existence of spot swapping.
Spot swapping hinders inferences of region-specific gene activities and tissue annotations. In order to decontaminate ST data, we propose SpotClean, a probabilistic model that measures the spot swapping effect and estimates gene expression using an EM algorithm. SpotClean is shown to provide a more accurate estimation of the underlying gene expression, increase the specificity of marker gene signals, and, more importantly, allow for improved tumor diagnostics.

Computational Methods for Single-Cell Data Analysis

Publisher : Humana Press
ISBN 13 : 9781493990566
Total Pages : 271 pages


Book Synopsis Computational Methods for Single-Cell Data Analysis by : Guo-Cheng Yuan

Download or read book Computational Methods for Single-Cell Data Analysis written by Guo-Cheng Yuan and published by Humana Press. This book was released on 2019-02-14 with total page 271 pages. Available in PDF, EPUB and Kindle. Book excerpt: This detailed book provides state-of-the-art computational approaches to further explore the exciting opportunities presented by single-cell technologies. Chapters each detail a computational toolbox aimed to overcome a specific challenge in single-cell analysis, such as data normalization, rare cell-type identification, and spatial transcriptomics analysis, all with a focus on hands-on implementation of computational methods for analyzing experimental data. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Computational Methods for Single-Cell Data Analysis aims to cover a wide range of tasks and serves as a vital handbook for single-cell data analysis.

Statistical Methods for Whole Transcriptome Sequencing


Book Synopsis Statistical Methods for Whole Transcriptome Sequencing by : Cheng Jia

Download or read book Statistical Methods for Whole Transcriptome Sequencing written by Cheng Jia. This book was released in 2017. Available in PDF, EPUB and Kindle. Book excerpt: RNA-Sequencing (RNA-Seq) has enabled detailed unbiased profiling of whole transcriptomes with incredible throughput. Recent technological breakthroughs have pushed back the frontiers of RNA expression measurement to single-cell level (scRNA-Seq). With both bulk and single-cell RNA-Seq analyses, modeling of the noise structure embedded in the data is crucial for drawing correct inference. In this dissertation, I developed a series of statistical methods to account for the technical variations specific to RNA-Seq experiments in the context of isoform- or gene-level differential expression analyses. In the first part of my dissertation, I developed MetaDiff (https://github.com/jiach/MetaDiff), a random-effects meta-regression model, that allows the incorporation of uncertainty in isoform expression estimation in isoform differential expression analysis. This framework was further extended to detect splicing quantitative trait loci with RNA-Seq data. In the second part of my dissertation, I developed TASC (Toolkit for Analysis of Single-Cell data; https://github.com/scrna-seq/TASC), a hierarchical mixture model, to explicitly adjust for cell-to-cell technical differences in scRNA-Seq analysis using an empirical Bayes approach. This framework can be adapted to perform differential gene expression analysis. In the third part of my dissertation, I developed TASC-B, a method extended from TASC to model transcriptional bursting-induced zero-inflation. This model can identify and test for the difference in the level of transcriptional bursting. Compared to existing methods, these new tools that I developed have been shown to better control the false discovery rate in situations where technical noise cannot be ignored.
They also display superior power in both our simulation studies and real world applications.

Statistical Methods for Alternative Splicing Using RNA Sequencing


Book Synopsis Statistical Methods for Alternative Splicing Using RNA Sequencing by : Yu Hu

Download or read book Statistical Methods for Alternative Splicing Using RNA Sequencing written by Yu Hu. This book was released in 2018. Available in PDF, EPUB and Kindle. Book excerpt: The emergence of RNA-seq technology has made it possible to estimate isoform-specific gene expression and detect differential alternative splicing between conditions, thus providing us with an effective way to discover disease susceptibility genes. Analysis of alternative splicing, however, is challenging because various biases present in RNA-seq data complicate the analysis and, if not appropriately corrected, will affect gene expression estimation and downstream modeling. Motivated by these issues, my dissertation focused on statistical problems related to the analysis of alternative splicing in RNA-seq data. In Part I of my dissertation, I developed PennSeq, a method that aims to account for non-uniform read distribution in isoform expression estimation. PennSeq models non-uniformity using the empirical read distribution in RNA-seq data. It is the first time that non-uniformity is modeled at the isoform level. Compared to existing approaches, PennSeq allows bias correction at a much finer scale and achieved higher estimation accuracy. In Part II of my dissertation, I developed PennDiff, a method that aims to detect differential alternative splicing by RNA-seq. This approach avoids multiple testing for exons originating from the same isoform(s) and is able to detect differential alternative splicing at both exon and gene level, with more flexibility and higher sensitivity than existing methods. In Part III of my dissertation, I focused on problems arising from single-cell RNA-seq (scRNA-seq), a newly developed technology that allows the measurement of cellular heterogeneity of gene expression in single cells.
Compared to bulk tissue RNA-seq, analysis of scRNA-seq data is more challenging due to high technical variability across cells and extremely low sequencing depth. To overcome these challenges, I developed SCATS, a method that aims to detect differential alternative splicing with scRNA-seq data. SCATS employs an empirical Bayes approach to model technical noise by use of external RNA spike-ins and groups informative reads sharing the same isoform(s) to detect splicing change. SCATS showed superior performance in both simulation and real data analyses. In summary, methods developed in my dissertation provide biomedical researchers a set of powerful tools for transcriptomic data analysis and will aid novel scientific discovery.

RNA-Seq Analysis: Methods, Applications and Challenges

Publisher : Frontiers Media SA
ISBN 13 : 2889637050
Total Pages : 169 pages


Book Synopsis RNA-Seq Analysis: Methods, Applications and Challenges by : Filippo Geraci

Download or read book RNA-Seq Analysis: Methods, Applications and Challenges written by Filippo Geraci and published by Frontiers Media SA. This book was released on 2020-06-08 with total page 169 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Statistical Analysis of Next Generation Sequencing Data

Publisher : Springer
ISBN 13 : 3319072129
Total Pages : 438 pages


Book Synopsis Statistical Analysis of Next Generation Sequencing Data by : Somnath Datta

Download or read book Statistical Analysis of Next Generation Sequencing Data written by Somnath Datta and published by Springer. This book was released on 2014-07-03 with total page 438 pages. Available in PDF, EPUB and Kindle. Book excerpt: Next Generation Sequencing (NGS) is the latest high throughput technology to revolutionize genomic research. NGS generates massive genomic datasets that play a key role in the big data phenomenon that surrounds us today. To extract signals from high-dimensional NGS data and make valid statistical inferences and predictions, novel data analytic and statistical techniques are needed. This book contains 20 chapters written by prominent statisticians working with NGS data. The topics range from basic preprocessing and analysis with NGS data to more complex genomic applications such as copy number variation and isoform expression detection. Research statisticians who want to learn about this growing and exciting area will find this book useful. In addition, many chapters from this book could be included in graduate-level classes in statistical bioinformatics for training future biostatisticians who will be expected to deal with genomic data in basic biomedical research, genomic clinical trials and personalized medicine. About the editors: Somnath Datta is Professor and Vice Chair of Bioinformatics and Biostatistics at the University of Louisville. He is Fellow of the American Statistical Association, Fellow of the Institute of Mathematical Statistics and Elected Member of the International Statistical Institute. He has contributed to numerous research areas in Statistics, Biostatistics and Bioinformatics. Dan Nettleton is Professor and Laurence H. Baker Endowed Chair of Biological Statistics in the Department of Statistics at Iowa State University. He is Fellow of the American Statistical Association and has published research on a variety of topics in statistics, biology and bioinformatics.

Benchmarking Statistical and Machine-Learning Methods for Single-cell RNA Sequencing Data

Total Pages : 203 pages


Book Synopsis Benchmarking Statistical and Machine-Learning Methods for Single-cell RNA Sequencing Data by : Nan Xi

Download or read book Benchmarking Statistical and Machine-Learning Methods for Single-cell RNA Sequencing Data written by Nan Xi. This book was released in 2021 with a total of 203 pages. Available in PDF, EPUB and Kindle. Book excerpt: Large-scale, high-dimensional, and sparse single-cell RNA sequencing (scRNA-seq) data pose great challenges throughout the data analysis pipeline. A large number of statistical and machine learning methods have been developed to analyze scRNA-seq data and answer related scientific questions. Although different methods claim advantages in certain circumstances, it is difficult for users to select appropriate methods for their analysis tasks. Benchmark studies aim to provide recommendations for method selection based on an objective, accurate, and comprehensive comparison among cutting-edge methods. They can also offer suggestions for further methodological development through massive evaluations conducted on real data. In Chapter 2, we conduct the first systematic benchmark study of nine cutting-edge computational doublet-detection methods. In scRNA-seq, doublets form when two cells are encapsulated into one reaction volume by chance. The existence of doublets, which appear to be but are not real cells, is a key confounder in scRNA-seq data analysis. Computational methods have been developed to detect doublets in scRNA-seq data; however, the scRNA-seq field lacks a comprehensive benchmarking of these methods, making it difficult for researchers to choose an appropriate method for their specific analysis needs. Our benchmark study compares doublet-detection methods in terms of their detection accuracy under various experimental settings, impacts on downstream analyses, and computational efficiency. Our results show that existing methods exhibit diverse performance and distinct advantages in different aspects.
In Chapter 3, we develop an R package, DoubletCollection, to integrate the installation and execution of different doublet-detection methods. Traditional benchmark studies can quickly become out of date due to their static design and the rapid growth of available methods. DoubletCollection addresses this issue in benchmarking doublet-detection methods for scRNA-seq data. DoubletCollection provides a unified interface to perform and visualize downstream analysis after doublet detection. Additionally, we create a protocol using DoubletCollection to execute and benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field. In Chapter 4, we conduct the first comprehensive empirical study to explore the best modeling strategy for autoencoder-based imputation methods specific to scRNA-seq data. Autoencoder-based imputation is a family of promising methods to denoise sparse scRNA-seq data; however, the design of autoencoders has not been formally discussed in the literature. Current autoencoder-based imputation methods either borrow the practice from other fields or design the model on an ad hoc basis. We find that method performance is sensitive to the key hyperparameters of autoencoders, including architecture, activation function, and regularization. Their optimal settings on scRNA-seq data are largely different from those on other data types. Our results emphasize the importance of exploring hyperparameter space in such complex and flexible methods. Our work also points out the future direction of improving current methods.
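
The doublet formation described in the excerpt above (two cells captured in one reaction volume) can be illustrated with a minimal Python sketch that builds artificial doublets by summing the count profiles of two distinct, randomly chosen cells. This is only an illustration under stated assumptions: the function name `simulate_doublets` and the toy matrix are hypothetical, and the code is not taken from DoubletCollection or from any of the benchmarked methods.

```python
import numpy as np

def simulate_doublets(counts, n_doublets, seed=0):
    """Create artificial doublets from a cells-by-genes count matrix.

    Each artificial doublet is the sum of the counts of two distinct cells.
    Returns the doublet matrix and the index pairs that were combined.
    Assumes the matrix has at least two cells (rows).
    """
    counts = np.asarray(counts)
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    first = rng.integers(0, n_cells, size=n_doublets)
    # A non-zero offset guarantees the partner differs from the first cell.
    offset = rng.integers(1, n_cells, size=n_doublets)
    second = (first + offset) % n_cells
    doublets = counts[first] + counts[second]
    return doublets, np.stack([first, second], axis=1)

# Toy example: 4 cells, 3 genes (hypothetical data).
toy_counts = np.arange(12).reshape(4, 3)
toy_doublets, toy_pairs = simulate_doublets(toy_counts, n_doublets=5)
```

Simulated doublets of this kind are a common way to obtain labeled examples for evaluating a detector, since the true doublet status of real droplets is usually unknown.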

Statistical Methods for Reliable Inference in RNA-seq Experiments to Facilitate Regenerative Medicine

Total Pages : 0 pages


Book Synopsis Statistical Methods for Reliable Inference in RNA-seq Experiments to Facilitate Regenerative Medicine by :

Download or read book Statistical Methods for Reliable Inference in RNA-seq Experiments to Facilitate Regenerative Medicine. This book was released in 2015 with a total of 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: The last decade of genome research has led to major technological advances in sequencing, genotyping, and phenotyping. However, how best to derive useful information from them still remains to be explored by statistical scientists. In this dissertation, I develop, implement, evaluate and apply three statistical methods for high-dimensional data analysis to facilitate efforts in regenerative medicine. The first method is an empirical Bayes model called EBSeq for identifying differentially expressed (DE) genes and isoforms. Unlike microarrays, RNA-seq experiments allow for the identification of not only DE genes, but also their corresponding isoforms on a genome-wide scale. Taking advantage of the merits of empirical Bayesian methods, we developed EBSeq, which models the uncertainty groups via different priors. Our results demonstrate substantially improved power and performance of EBSeq for identifying DE isoforms compared to other competing methods. The second method is an autoregressive hidden Markov model called EBSeq-HMM for identifying expression changes across ordered conditions. With improvements in next-generation sequencing technologies and reductions in price, ordered RNA-seq experiments are becoming common. Of primary interest in these experiments is identifying genes that are changing over time or space, for example, and then characterizing the specific expression changes. In EBSeq-HMM, an autoregressive hidden Markov model is implemented to accommodate dependence in gene expression across ordered conditions.
As demonstrated in simulation and case studies, the output proves useful in identifying DE genes, characterizing their changes over conditions, and classifying genes into particular expression paths. The third method is a statistical pipeline called Oscope for identifying oscillatory gene sets using unsynchronized single-cell RNA-seq data. Recent advances in single-cell RNA-seq enable precise quantification of gene expression among individual cells. This provides the potential to uncover oscillatory systems at the single-cell level. However, methods to identify candidate oscillatory gene sets in an unsynchronized cell population are still lacking. Here we developed a statistical pipeline with three main modules: a paired-sine model to identify co-oscillating gene pairs, a K-medoid clustering module to group gene pairs into oscillatory gene sets, and an extended nearest insertion algorithm to recover the base cycle profile of oscillatory genes.
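
The idea behind a paired-sine model can be sketched as follows. If two genes oscillate as x = sin(t) and y = sin(t + psi), then every cell satisfies x^2 + y^2 - 2xy cos(psi) = sin^2(psi) regardless of its unknown time t, so the phase shift can be estimated without ordering the cells. The Python sketch below does this with a simple grid search. It is an illustration under assumptions, not Oscope's implementation: the function name `paired_sine_fit`, the grid search, and the requirement that expression is already rescaled to the range [-1, 1] are all choices made for this example.

```python
import numpy as np

def paired_sine_fit(x, y, n_grid=360):
    """Estimate the phase shift psi between two rescaled oscillating genes.

    x, y: expression of two genes across cells, rescaled to [-1, 1].
    Cells may be in any order (unsynchronized).
    Returns (psi, score), where score is the mean squared residual of
    x^2 + y^2 - 2*x*y*cos(psi) - sin(psi)^2 at the best grid point.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # psi and -psi give the same ellipse, so searching [0, pi) is enough.
    psis = np.linspace(0.0, np.pi, n_grid, endpoint=False)
    cos_psi = np.cos(psis)[:, None]
    sin2_psi = np.sin(psis)[:, None] ** 2
    resid = x[None, :] ** 2 + y[None, :] ** 2 - 2.0 * x[None, :] * y[None, :] * cos_psi - sin2_psi
    scores = np.mean(resid ** 2, axis=1)
    best = int(np.argmin(scores))
    return float(psis[best]), float(scores[best])
```

A small score suggests the pair is consistent with co-oscillation at a common frequency; a pipeline could then rank gene pairs by this score before clustering them.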

Statistical Methods for the Analysis of RNA Sequencing Data

Total Pages : 340 pages


Book Synopsis Statistical Methods for the Analysis of RNA Sequencing Data by : Man-Kee Maggie Chu

Download or read book Statistical Methods for the Analysis of RNA Sequencing Data written by Man-Kee Maggie Chu. This book was released in 2014 with a total of 340 pages. Available in PDF, EPUB and Kindle. Book excerpt: The next generation sequencing technology, RNA-sequencing (RNA-seq), has become increasingly popular relative to traditional microarrays in transcriptome analyses. Statistical methods used for gene expression analyses with these two technologies are different because the array-based technology measures intensities using continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using counts of reads. There is a need for reliable statistical methods to exploit the information from the rapidly evolving sequencing technologies, and limited work has been done on expression analysis of time-course RNA-seq data. Functional clustering is an important method for examining gene expression patterns and thus discovering co-expressed genes to better understand the biological systems. Clustering-based approaches to analyze repeated digital gene expression measures are in demand. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to postulate the over-dispersed time-course gene count data. The effectiveness of the proposed clustering method is assessed using simulated data and is illustrated by real data from time-course genomic experiments. Due to the complexity and size of genomic data, the choice of good starting values is an important issue for the proposed clustering algorithm. There is a need for a reliable initialization strategy for cluster-wise regression specifically for time-course discrete count data.
We modify existing common initialization procedures to suit our model-based clustering algorithm and the procedures are evaluated through a simulation study on artificial datasets and are applied to real genomic examples to identify the optimal initialization method. Another common issue in gene expression analysis is the presence of missing values in the datasets. Various treatments to missing values in genomic datasets have been developed but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method which is specifically suitable for dealing with missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach.
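
To make the negative binomial mixture idea in the excerpt above concrete, the following Python sketch computes, for one gene's time-course counts, the posterior probability of belonging to each cluster (the E-step of an EM-style mixture fit). It is a generic illustration, not the dissertation's longitudinal model: the cluster mean profiles, the mixing weights, the shared `size` (inverse dispersion) parameter, and the function names are all assumptions made for this example.

```python
import math
import numpy as np

def nb_logpmf(y, mu, size):
    """Log-probability of count y under a negative binomial with mean mu
    and inverse dispersion `size`, so that variance = mu + mu**2 / size."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))

def responsibilities(counts, profiles, weights, size):
    """Posterior cluster probabilities for one gene's time-course counts.

    counts:   observed counts at each time point
    profiles: one mean trajectory (positive values) per cluster
    weights:  prior mixing proportions, one per cluster
    """
    log_post = []
    for weight, profile in zip(weights, profiles):
        loglik = sum(nb_logpmf(y, mu, size) for y, mu in zip(counts, profile))
        log_post.append(math.log(weight) + loglik)
    log_post = np.array(log_post)
    log_post -= log_post.max()  # subtract the maximum for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical example: a rising and a falling cluster over four time points.
example_profiles = [[5.0, 10.0, 20.0, 40.0], [40.0, 20.0, 10.0, 5.0]]
example_post = responsibilities([4, 11, 22, 38], example_profiles, [0.5, 0.5], size=5.0)
```

In a full clustering algorithm these responsibilities would alternate with updates of the cluster profiles and weights, which is why the choice of starting values discussed above matters.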

Development and Benchmarking of Imputation Methods for Micriobome and Single-cell Sequencing Data

Total Pages : 175 pages


Book Synopsis Development and Benchmarking of Imputation Methods for Micriobome and Single-cell Sequencing Data by : Ruochen Jiang

Download or read book Development and Benchmarking of Imputation Methods for Micriobome and Single-cell Sequencing Data written by Ruochen Jiang. This book was released in 2021 with a total of 175 pages. Available in PDF, EPUB and Kindle. Book excerpt: Next generation sequencing (NGS) has revolutionized biomedical research and has a broad impact and applications. Since its advent around 15 years ago, this highly scalable DNA sequencing technology has generated vast amounts of biological data with new features and brought new challenges to data analysis. For example, researchers utilize RNA sequencing (RNA-seq) technology to more accurately quantify gene expression levels. However, NGS technology involves many processing steps and technical variations when measuring the expression values in biological samples. In other words, the NGS data researchers observe could be biased due to the randomness and constraints in the NGS technology. This dissertation will mainly focus on microbiome sequencing data and single-cell RNA-seq (scRNA-seq) data. Both of them are highly sparse matrix-form count data. The zeros could either be biological or non-biological, and the high sparsity in the data has brought challenges to data analysis. The missing data imputation problem has been studied in statistics and social science, as survey data often contain non-responses to some of the survey questions, and those unanswered questions are marked as "NA" or missing values in the data. Imputation methods are used to provide a sophisticated guess for the missing values; the purpose is to avoid discarding the collected samples and to ease the use of state-of-the-art statistical methods. In machine learning, the famous Netflix data challenge regarding a film recommendation system also falls into the missing data imputation problem category. Netflix wants to find a way to predict users' fondness for the movies they have not watched.
The potential scores these users would give to the unwatched films are regarded as missing values in the data. The NGS data imputation problem is different from the previous two cases in that the missing values in NGS data are not so well-defined. The zeros in NGS data could either come from a biological origin (and should not be regarded as missing values) or a non-biological origin (due to the limitations of the sequencing technology, and should be regarded as missing values). The size (number of samples and features) of the NGS matrix data is usually larger than the size of survey data but smaller than the size of the recommendation system data. In addition, in most cases, the percentage of missing values in survey data is less than the percentage of zeros in NGS data, and the missing values in the film recommendation system data have the highest percentage (> 99.9%). As a result, the commonly used missing data imputation methods in statistics and machine learning are not directly applicable to NGS data. In recent years, numerous imputation methods have been proposed to deal with the highly sparse scRNA-seq data. In light of this, this dissertation aims to address two questions. First, microbiome sequencing data, which have additional information compared to scRNA-seq data, lack an imputation method. Second, whether to use imputation in scRNA-seq data analysis is still a controversial problem. The first part of this dissertation focuses on the first imputation method developed for microbiome sequencing data: mbImpute. Microbiome studies have gained increased attention since many discoveries revealed connections between human microbiome compositions and diseases. A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries.
To address this issue, we propose the first imputation method for microbiome data---mbImpute---to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. Comprehensive simulations verify that mbImpute achieves better imputation accuracy under multiple metrics, compared with five state-of-the-art imputation methods designed for non-microbiome data. In real data applications, we demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances. The second part of this dissertation focuses on how to deal with high sparsity in the scRNA-seq data. ScRNA-seq technologies have revolutionized biomedical sciences by enabling genome-wide profiling of gene expression levels at an unprecedented single-cell resolution. A distinct characteristic of scRNA-seq data is the vast proportion of zeros unseen in bulk RNA-seq data. Researchers view these zeros differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as false signals or missing data to be corrected. As a result, the scRNA-seq field faces much controversy regarding how to handle zeros in data analysis. We first discuss the sources of biological and non-biological zeros in scRNA-seq data. Second, we evaluate the impacts of non-biological zeros on cell clustering and differential gene expression analysis. Third, we summarize the advantages, disadvantages, and suitable users of three input data types: observed counts, imputed counts, and binarized counts and evaluate the performance of downstream analysis on these three input data types. Finally, we discuss the open questions regarding non-biological zeros, the need for benchmarking, and the importance of transparent analysis.
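
Of the three input data types named in the excerpt above, observed counts and binarized counts can be illustrated with a short Python sketch that measures sparsity and converts a count matrix to detected / not detected. Imputed counts are not shown because they depend on the chosen imputation method. The toy matrix and the function names `zero_fraction` and `binarize` are hypothetical.

```python
import numpy as np

def zero_fraction(counts):
    """Fraction of entries in the count matrix that are exactly zero."""
    counts = np.asarray(counts)
    return float(np.mean(counts == 0))

def binarize(counts):
    """Binarized counts: 1 where a gene is detected in a cell, else 0."""
    return (np.asarray(counts) > 0).astype(int)

# Toy cells-by-genes matrix of observed counts (hypothetical data).
observed = np.array([[0, 3, 0, 1],
                     [2, 0, 0, 0],
                     [0, 0, 5, 0]])

sparsity = zero_fraction(observed)       # overall proportion of zeros
binarized = binarize(observed)           # detection indicator matrix
detection_rate = binarized.mean(axis=0)  # per-gene fraction of cells with a non-zero count
```

Binarization keeps only whether a gene was detected, so it cannot distinguish biological from non-biological zeros; it only changes how the zeros enter downstream analysis.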

Gene Expression Data Analysis

Publisher : CRC Press
ISBN 13 : 1000425754
Total Pages : 276 pages
Book Rating : 4.0/5 (4 download)


Book Synopsis Gene Expression Data Analysis by : Pankaj Barah

Download or read book Gene Expression Data Analysis written by Pankaj Barah and published by CRC Press. This book was released on 2021-11-08 with total page 276 pages. Available in PDF, EPUB and Kindle. Book excerpt: Development of high-throughput technologies in molecular biology during the last two decades has contributed to the production of tremendous amounts of data. Microarray and RNA sequencing are two such widely used high-throughput technologies for simultaneously monitoring the expression patterns of thousands of genes. Data produced from such experiments are voluminous (both in dimensionality and numbers of instances) and evolving in nature. Analysis of huge amounts of data toward the identification of interesting patterns that are relevant for a given biological question requires high-performance computational infrastructure as well as efficient machine learning algorithms. Cross-communication of ideas between biologists and computer scientists remains a big challenge. Gene Expression Data Analysis: A Statistical and Machine Learning Perspective has been written with a multidisciplinary audience in mind. The book discusses gene expression data analysis from molecular biology, machine learning, and statistical perspectives. Readers will be able to acquire both theoretical and practical knowledge of methods for identifying novel patterns of high biological significance. To measure the effectiveness of such algorithms, we discuss statistical and biological performance metrics that can be used in real life or in a simulated environment. This book discusses a large number of benchmark algorithms, tools, systems, and repositories that are commonly used in analyzing gene expression data and validating results. This book will benefit students, researchers, and practitioners in biology, medicine, and computer science by enabling them to acquire in-depth knowledge in statistical and machine-learning-based methods for analyzing gene expression data. 
Key Features:
An introduction to the Central Dogma of molecular biology and information flow in biological systems
A systematic overview of the methods for generating gene expression data
Background knowledge on statistical modeling and machine learning techniques
Detailed methodology of analyzing gene expression data with an example case study
Clustering methods for finding co-expression patterns from microarray, bulkRNA, and scRNA data
A large number of practical tools, systems, and repositories that are useful for computational biologists to create, analyze, and validate biologically relevant gene expression patterns
Suitable for multidisciplinary researchers and practitioners in computer science and the biological sciences