Background: Understanding the location and cell-type-specific binding of transcription factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging because TFs often bind only to short DNA motifs, and cell-type-specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif descriptions in the form of position-specific energy matrices (PSEMs).
Methods: We use TF ChIP-seq data as a gold-standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge we consider different learning setups.
Results: Our results indicate that the ensemble learning approach generalizes better across tissues and cell types than individual tissue-specific classifiers or a classifier applied to data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal.
Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein-protein interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697).
Background: Bidirectional promoters (BPs) are prevalent in eukaryotic genomes. However, it is poorly understood how the cell integrates different epigenomic information, such as transcription factor (TF) binding and chromatin marks, to drive gene expression at BPs. Single-cell sequencing technologies are revolutionizing the field of genome biology. Therefore, this study focuses on the integration of single-cell RNA-seq data with bulk ChIP-seq and other epigenetics data, for which single-cell technologies are not yet established, in the context of BPs.
Results: We performed integrative analyses of novel human single-cell RNA-seq (scRNA-seq) data with bulk ChIP-seq and other epigenetics data. scRNA-seq data revealed distinct transcription states of BPs that were previously not recognized. We find associations between these transcription states and distinct patterns in structural gene features, DNA accessibility, histone modification, DNA methylation and TF binding profiles.
Conclusions: Our results suggest that a complex interplay of all of these elements is required to achieve BP-specific transcriptional output in this specialized promoter configuration. Further, our study implies that novel statistical methods can be developed to deconvolute masked subpopulations of cells measured with different bulk epigenomic assays using scRNA-seq data.
An ontology-based method for assessing batch effect adjustment approaches in heterogeneous datasets
(2018)
Motivation: International consortia such as the Genotype-Tissue Expression (GTEx) project, The Cancer Genome Atlas (TCGA) or the International Human Epigenome Consortium (IHEC) have produced a wealth of genomic datasets with the goal of advancing our understanding of cell differentiation and disease mechanisms. However, utilizing all of these data effectively through integrative analysis is hampered by batch effects, large cell type heterogeneity and low replicate numbers. To study whether batch effects across datasets can be observed and adjusted for, we analyze RNA-seq data of 215 samples from ENCODE, Roadmap, BLUEPRINT and DEEP as well as 1336 samples from GTEx and TCGA. While batch effects are a considerable issue, it is non-trivial to determine whether batch adjustment leads to an improvement in data quality, especially in cases of low replicate numbers.
Results: We present a novel method for assessing the performance of batch effect adjustment methods on heterogeneous data. Our method borrows information from the Cell Ontology to establish whether batch adjustment leads to a better agreement between observed pairwise similarity and similarity of cell types inferred from the ontology. A comparison of state-of-the-art batch effect adjustment methods suggests that batch effects in heterogeneous datasets with low replicate numbers cannot be adequately adjusted. Better methods need to be developed, which can be assessed objectively in the framework presented here.
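The assessment idea can be illustrated with a minimal sketch. It is not the published method: the toy ontology, the sample names, the use of Pearson correlation as expression similarity, and the use of negative tree path length as ontology similarity are all assumptions made for illustration. The sketch scores a dataset by how well pairwise expression similarity tracks ontology similarity, so that the score can be compared before and after batch adjustment.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ontology_distance(parent, a, b):
    """Number of edges between terms a and b in a tree given as child -> parent."""
    def path_to_root(term):
        path = [term]
        while term in parent:
            term = parent[term]
            path.append(term)
        return path

    path_a, path_b = path_to_root(a), path_to_root(b)
    steps_from_a = {term: i for i, term in enumerate(path_a)}
    for j, term in enumerate(path_b):
        if term in steps_from_a:  # first shared ancestor
            return steps_from_a[term] + j
    raise ValueError("terms do not share a root")


def agreement(expr, cell_type, parent):
    """Correlate pairwise expression similarity with ontology similarity."""
    expr_sim, onto_sim = [], []
    for s, t in combinations(sorted(expr), 2):
        expr_sim.append(pearson(expr[s], expr[t]))
        # Assumption: closer terms in the tree count as more similar.
        onto_sim.append(-ontology_distance(parent, cell_type[s], cell_type[t]))
    return pearson(expr_sim, onto_sim)


# Hypothetical toy ontology and samples.
parent = {"T cell": "lymphocyte", "B cell": "lymphocyte",
          "lymphocyte": "cell", "hepatocyte": "cell"}
cell_type = {"s1": "T cell", "s2": "B cell", "s3": "hepatocyte"}

# Before adjustment, s1 and s3 (same hypothetical batch) look alike.
raw = {"s1": [5.0, 1.0, 4.0, 2.0],
       "s2": [1.0, 5.0, 2.0, 4.0],
       "s3": [5.2, 1.1, 3.9, 2.2]}
# After adjustment, the two lymphocyte samples look alike.
adjusted = {"s1": [5.0, 1.0, 4.0, 2.0],
            "s2": [4.8, 1.2, 4.1, 2.1],
            "s3": [1.0, 5.0, 2.0, 4.0]}

score_raw = agreement(raw, cell_type, parent)
score_adjusted = agreement(adjusted, cell_type, parent)
```

In this toy case a higher score after adjustment would indicate that the adjusted data agree better with the ontology. Because the score needs no replicates, it can be computed even when each cell type is represented by a single sample.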
Specialized de novo assemblers for diverse datatypes have been developed and are in widespread use for the analyses of single-cell genomics, metagenomics and RNA-seq data. However, assembly of large sequencing datasets produced by modern technologies is challenging and computationally intensive. In silico read normalization has been suggested as a computational strategy to reduce redundancy in read datasets, which leads to significant speedups and memory savings in assembly pipelines. Previously, we presented ORNA, an approach based on set multi-cover optimization, in which reads are reduced without losing important k-mer connectivity information, as used in assembly graphs. Here we propose extensions to ORNA, named ORNA-Q and ORNA-K, which consider a weighted set multi-cover optimization formulation for the in silico read normalization problem. These novel formulations make use of the base quality scores obtained from sequencers (ORNA-Q) or k-mer abundances of reads (ORNA-K) to improve normalization further. We devise efficient heuristic algorithms for solving both formulations. In applications to human RNA-seq data, ORNA-Q and ORNA-K are shown to assemble more or equally many full-length transcripts compared to other normalization methods at similar or higher read reduction values. The algorithms are implemented in the latest version of ORNA (v2.0, https://github.com/SchulzLab/ORNA).
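To make the set multi-cover view of read normalization concrete, the following is a minimal greedy sketch and not the ORNA implementation. It assumes a fixed coverage cap per k-mer (`cap`), whereas the actual tools may derive the requirement differently. The optional `weights` argument is a hypothetical per-read score, for example a mean base quality, used only to decide which reads are visited first.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k, cap, weights=None):
    """Greedy set multi-cover heuristic for read reduction.

    Each k-mer must be covered min(abundance, cap) times by the kept reads,
    where abundance is the number of reads that contain the k-mer.
    A read is kept only if it still contributes to an unmet requirement.
    Returns the indices of the kept reads in input order.
    """
    abundance = Counter(km for read in reads for km in set(kmers(read, k)))
    need = {km: min(count, cap) for km, count in abundance.items()}
    have = Counter()

    order = list(range(len(reads)))
    if weights is not None:
        # Assumption: visit reads with larger weights first.
        order.sort(key=lambda i: -weights[i])

    kept = []
    for i in order:
        read_kmers = set(kmers(reads[i], k))
        if any(have[km] < need[km] for km in read_kmers):
            kept.append(i)
            have.update(read_kmers)
    return sorted(kept)


# Toy example: three identical reads and one read that extends them.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "GTACGG"]
kept_plain = normalize_reads(reads, k=4, cap=1)
kept_weighted = normalize_reads(reads, k=4, cap=1, weights=[10, 30, 20, 25])
```

In the toy example, redundant copies of the first read are dropped while the read that introduces new k-mers is retained. With weights, the copy with the highest score is the one that survives.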
Hepatic lipid deposition and inflammation represent risk factors for hepatocellular carcinoma (HCC). The mRNA-binding protein tristetraprolin (TTP, gene name ZFP36) has been suggested as a tumor suppressor in several malignancies, but it increases insulin resistance. The aim of this study was to elucidate the role of TTP in hepatocarcinogenesis and HCC progression. Employing liver-specific TTP-knockout (lsTtp-KO) mice in the diethylnitrosamine (DEN) hepatocarcinogenesis model, we observed a significantly reduced tumor burden compared to wild-type animals. Upon short-term DEN treatment, modelling early inflammatory processes in hepatocarcinogenesis, lsTtp-KO mice exhibited a reduced monocyte/macrophage ratio as compared to wild-type mice. While short-term DEN strongly induced an abundance of saturated and poly-unsaturated hepatic fatty acids, lsTtp-KO mice did not show these changes. These findings suggested anti-carcinogenic actions of TTP deletion due to effects on inflammation and metabolism. Interestingly, though, investigating effects of TTP on different hallmarks of cancer suggested tumor-suppressing actions: TTP inhibited proliferation, attenuated migration, and slightly increased chemosensitivity. In line with a tumor-suppressing activity, we observed a reduced expression of several oncogenes in TTP-overexpressing cells. Accordingly, ZFP36 expression was downregulated in tumor tissues in three large human data sets. Taken together, this study suggests that hepatocytic TTP promotes hepatocarcinogenesis, while it shows tumor-suppressive actions during hepatic tumor progression.
Background/Aims: Hepatocellular carcinoma (HCC) represents the second most common cause of cancer-related deaths worldwide, not least due to its high chemoresistance. The long non-coding RNA nuclear paraspeckle assembly transcript 1 (NEAT1), localised in nuclear paraspeckles, has been shown to enhance chemoresistance in several cancer types. Since data on NEAT1 in HCC chemosensitivity are completely lacking and chemoresistance is linked to poor prognosis, we aimed to study NEAT1 expression in HCC chemoresistance and its link to HCC prognosis.
Methods: NEAT1 expression was determined by qPCR in sensitive, sorafenib-resistant, and doxorubicin-resistant HepG2, PLC/PRF/5, and Huh7 cells. Paraspeckles were detected by immunostaining of paraspeckle component 1 (PSPC1) in cell culture and in a cohort of HCC patients. PSPC1 expression was correlated with clinical data. The expression of transcript variants of NEAT1 and transcripts encoding the paraspeckle-associated proteins was analysed in the TCGA liver cancer data set.
Results: NEAT1 was overexpressed in all three sorafenib and doxorubicin resistant cell lines. Paraspeckles were present in all chemoresistant cells, whereas no signal was detected in the sensitive cells. Expression of NEAT1 transcripts as well as transcripts encoding PSPC1, NONO, and RBM14 was increased in tumour tissue. Expression of PSPC1, NONO, and RBM14 transcripts was significantly associated with poor survival, whereas NEAT1 expression was not. Immunohistochemical analysis revealed that nuclear and cytoplasmic PSPC1-positivity was significantly associated with shorter overall survival of HCC patients.
Conclusion: Our data show an induction of NEAT1 in HCC chemoresistance and a high correlation of transcripts encoding paraspeckle-associated proteins with poor survival in HCC. Therefore, NEAT1, PSPC1, NONO, and RBM14 might be promising targets for novel HCC therapies, and the paraspeckle-associated proteins might be clinical markers and predictors for poor survival in HCC.
KDEL receptors (KDELRs) are transmembrane proteins of the secretory pathway which regulate the retention of soluble ER residents as well as retrograde and anterograde vesicle trafficking. In addition, KDELRs are involved in the regulation of cellular stress response and ECM degradation. For a deeper insight into KDELR1-specific functions, we characterised a KDELR1-KO cell line (HAP1) through whole transcriptome analysis by comparing KDELR1-KO cells with their respective HAP1 wild-type. Our data indicate more than 300 significantly differentially expressed genes whose gene products are mainly involved in developmental processes such as cell adhesion and ECM composition, pointing to severe cellular disorders due to a loss of KDELR1. Impaired adhesion capacity of KDELR1-KO cells was further demonstrated through in vitro adhesion assays, while collagen- and/or laminin-coating nearly doubled the adhesion property of KDELR1-KO cells compared to wild-type, confirming a transcriptional adaptation to improve or restore the cellular adhesion capability. Perturbations within the secretory pathway were verified by an increased secretion of ER-resident PDI and decreased cell viability under ER stress conditions, suggesting that KDELR1-KO cells are severely impaired in maintaining cellular homeostasis.
Cancer-associated fibroblasts (CAFs) in the tumor microenvironment contribute to all stages of tumorigenesis and are usually considered to be tumor-promoting cells. CAFs show a remarkable degree of heterogeneity, which is attributed to developmental origin or to local environmental niches, resulting in distinct CAF subsets within individual tumors. While CAF heterogeneity is frequently investigated in late-stage tumors, data on longitudinal CAF development in tumors are lacking. To this end, we used the transgenic polyoma middle T oncogene-induced mouse mammary carcinoma model and performed whole transcriptome analysis in FACS-sorted fibroblasts from early- and late-stage tumors. We observed a shift in fibroblast populations over time towards a subset previously shown to negatively correlate with patient survival, which was confirmed by multispectral immunofluorescence analysis. Moreover, we identified a transcriptomic signature distinguishing CAFs from early- and late-stage tumors. Importantly, the signature of early-stage CAFs correlated well with tumor stage and survival in human mammary carcinoma patients. A random forest analysis suggested predictive value of the complete set of differentially expressed genes between early- and late-stage CAFs on bulk tumor patient samples, supporting the clinical relevance of our findings. In conclusion, our data show transcriptome alterations in CAFs during tumorigenesis in the mammary gland, which suggest that CAFs are educated by the tumor over time to promote tumor development. Moreover, we show that murine CAF gene signatures can harbor predictive value for human cancer.
Summary: Understanding the role of short-interfering RNA (siRNA) in diverse biological processes is of current interest and often approached through small RNA sequencing. However, analysis of these datasets is difficult due to the complexity of biological RNA processing pathways, which differ between species. Properties such as strand specificity, length distribution, and distribution of soft-clipped bases are among the few parameters known to guide researchers in understanding the role of siRNAs. We present RAPID, a generic eukaryotic siRNA analysis pipeline, which captures information inherent in the datasets and automatically produces numerous visualizations as user-friendly HTML reports, covering multiple categories required for siRNA analysis. RAPID also facilitates an automated comparison of multiple datasets, with one of the normalization techniques dedicated to siRNA knockdown analysis, and integrates differential expression analysis using DESeq2. RAPID is available under the MIT license at https://github.com/SchulzLab/RAPID. We recommend using it as a conda environment available from https://anaconda.org/bioconda/rapid.
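The read properties named above can be summarized with a few lines of code. The sketch below is not part of RAPID and does not reflect its interface. It assumes alignments are already available as hypothetical `(read_length, strand, cigar)` tuples, with CIGAR strings in SAM notation, and it reports the length distribution, the fraction of antisense reads, and the fraction of reads with soft-clipped bases.

```python
import re
from collections import Counter


def soft_clipped(cigar):
    """Return (leading, trailing) soft-clipped base counts of a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    # Hard clips (H) may sit outside soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


def summarize(alignments):
    """Summarize (read_length, strand, cigar) tuples; strand is '+' or '-'."""
    lengths = Counter()
    antisense = 0
    clipped = 0
    total = 0
    for length, strand, cigar in alignments:
        total += 1
        lengths[length] += 1
        if strand == "-":
            antisense += 1
        if sum(soft_clipped(cigar)) > 0:
            clipped += 1
    return {
        "length_distribution": dict(lengths),
        "antisense_fraction": antisense / total if total else 0.0,
        "soft_clipped_fraction": clipped / total if total else 0.0,
    }


# Hypothetical toy alignments.
alignments = [
    (21, "-", "21M"),
    (21, "-", "1S20M"),
    (22, "+", "22M"),
    (21, "-", "19M2S"),
]
summary = summarize(alignments)
```

A dominant read length together with a strong strand bias, as in this toy input, is the kind of pattern such summaries are meant to expose.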