OPUS 4 | Search

34 results

Results 1 to 10

Predicting transcription factor binding using ensemble random forest models [version 2; peer review: 2 approved] (2019)

Ardakani, Fatemeh Behjati ; Schmidt, Florian ; Schulz, Marcel Holger

Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif descriptions in the form of position specific energy matrices (PSEMs). Methods: We use TF ChIP-seq data as a gold standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge we consider different learning setups. Results: Our results indicate that the ensemble learning approach is able to better generalize across tissues and cell-types compared to individual tissue-specific classifiers or a classifier built on data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal. Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein-protein interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697).

Integrative prediction of gene expression with chromatin accessibility and conformation data (2020)

Schmidt, Florian ; Kern, Fabian ; Schulz, Marcel Holger

Background: Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter–enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organization of the chromatin paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability. Results: We have extended our TEPIC framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We designed a novel machine learning approach that allows the prioritization of TFs binding to distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer–promoter loops involving YY1 in different cell lines. Conclusion: We present a novel approach that can be used to prioritize TFs involved in distal and promoter-proximal regulatory events by integrating chromatin accessibility, conformation, and gene expression data. We show that the integration of chromatin conformation data can improve gene expression prediction and aid model interpretability.

Predicting transcription factor binding using ensemble random forest models [version 1; peer review: 2 approved with reservations] (2018)

Ardakani, Fatemeh Behjati ; Schmidt, Florian ; Schulz, Marcel Holger

Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif descriptions in the form of position specific energy matrices (PSEMs). Methods: We use TF ChIP-seq data as a gold standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge we consider different learning setups. Results: Our results indicate that the ensemble learning approach is able to better generalize across tissues and cell-types compared to individual tissue-specific classifiers or a classifier applied to the data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal. Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein-protein interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697).

Improving in-silico normalization using read weights (2019)

Durai, Dilip A. ; Schulz, Marcel Holger

Specialized de novo assemblers for diverse datatypes have been developed and are in widespread use for the analyses of single-cell genomics, metagenomics and RNA-seq data. However, assembly of large sequencing datasets produced by modern technologies is challenging and computationally intensive. In-silico read normalization has been suggested as a computational strategy to reduce redundancy in read datasets, which leads to significant speedups and memory savings of assembly pipelines. Previously, we presented a set multi-cover optimization based approach, ORNA, where reads are reduced without losing important k-mer connectivity information, as used in assembly graphs. Here we propose extensions to ORNA, named ORNA-Q and ORNA-K, which consider a weighted set multi-cover optimization formulation for the in-silico read normalization problem. These novel formulations make use of the base quality scores obtained from sequencers (ORNA-Q) or k-mer abundances of reads (ORNA-K) to improve normalization further. We devise efficient heuristic algorithms for solving both formulations. In applications to human RNA-seq data, ORNA-Q and ORNA-K are shown to assemble more or equally many full length transcripts compared to other normalization methods at similar or higher read reduction values. The algorithm is implemented under the latest version of ORNA (v2.0, https://github.com/SchulzLab/ORNA).
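
The idea of reducing reads while keeping every k-mer covered can be illustrated with a short sketch. This is a simplified greedy heuristic written for illustration only, not the ORNA implementation: the function names, the `target` coverage parameter, and the use of a single per-read weight to decide processing order are all assumptions that do not come from the abstract.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def normalize_reads(reads, k, target, weights=None):
    """Greedy multi-cover sketch (hypothetical helper, not ORNA's API).

    A read is kept only if it contains at least one k-mer that occurs in
    fewer than `target` of the reads kept so far. Reads shorter than k
    have no k-mers and are dropped.
    """
    order = list(range(len(reads)))
    if weights is not None:
        # Assumption: visit higher-weight reads first (stable for ties),
        # e.g. weights derived from base quality or k-mer abundance.
        order.sort(key=lambda i: -weights[i])
    coverage = Counter()  # number of kept reads containing each k-mer
    kept = []
    for i in order:
        read_kmers = set(kmers(reads[i], k))
        if any(coverage[km] < target for km in read_kmers):
            kept.append(i)
            coverage.update(read_kmers)
    # Report the kept reads in their original input order.
    return [reads[i] for i in sorted(kept)]


reads = ["ACGT", "ACGTT", "GGGG"]
unweighted = normalize_reads(reads, k=3, target=1)
weighted = normalize_reads(reads, k=3, target=1, weights=[1, 5, 1])
```

In this toy input the unweighted pass keeps all three reads, whereas visiting the higher-weight read "ACGTT" first makes "ACGT" redundant, so the processing order alone changes how far the dataset is reduced.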

A hierarchical regulatory network analysis of the vitamin D induced transcriptome reveals novel regulators and complete VDR dependency in monocytes (2021)

Warwick, Timothy ; Schulz, Marcel Holger ; Günther, Stefan ; Gilsbach, Ralf ; Neme, Antonio ; Carlberg, Carsten ; Brandes, Ralf ; Seuter, Sabine

The transcription factor vitamin D receptor (VDR) is the high affinity nuclear target of the biologically active form of vitamin D3 (1,25(OH)2D3). In order to identify pure genomic transcriptional effects of 1,25(OH)2D3, we used VDR cistrome, transcriptome and open chromatin data, obtained from the human monocytic cell line THP-1, for a novel hierarchical analysis applying three bioinformatics approaches. We predicted 75.6% of all early 1,25(OH)2D3-responding (2.5 or 4 h) and 57.4% of the late differentially expressed genes (24 h) to be primary VDR target genes. VDR knockout led to a complete loss of 1,25(OH)2D3–induced genome-wide gene regulation. Thus, there was no indication of any VDR-independent non-genomic actions of 1,25(OH)2D3 modulating its transcriptional response. Among the predicted primary VDR target genes, 47 were coding for transcription factors and thus may mediate secondary 1,25(OH)2D3 responses. CEBPA and ETS1 ChIP-seq data and RNA-seq following CEBPA knockdown were used to validate the predicted regulation of secondary vitamin D target genes by both transcription factors. In conclusion, a directional network containing 47 partly novel primary VDR target transcription factors describes secondary responses in a highly complex vitamin D signaling cascade. The central transcription factor VDR is indispensable for all transcriptome-wide effects of the nuclear hormone.

Two Piwis with Ago-like functions silence somatic genes at the chromatin level (2020)

Drews, Franziska ; Karunanithi, Sivarajan ; Götz, Ulrike ; Marker, Simone ; Wijn, Raphael de ; Pirritano, Marcello ; Rodrigues-Viana, Angela M. ; Jung, Martin ; Gasparoni, Gilles ; Schulz, Marcel Holger ; Simon, Martin

Most sRNA biogenesis mechanisms involve either RNAseIII cleavage or ping-pong amplification by different Piwi proteins harboring slicer activity. Here, we ask why the mechanism of transgene-induced silencing in the ciliate Paramecium needs both Dicer activity and two Ptiwi proteins. This pathway involves primary siRNAs produced from non-translatable transgenes and secondary siRNAs from endogenous remote loci. Our data do not indicate any signatures of ping-pong amplification but rather of Dicer cleavage of long dsRNA. We show that Ptiwi13 and 14 have different preferences for primary and secondary siRNAs but do not load them in a mutually exclusive manner. Both Piwis enrich for antisense RNAs and Ptiwi14-loaded siRNAs show a 5′-U signature. In addition, both Ptiwis show a general preference for uridine-rich sRNAs along the entire sRNA length. Our data indicate that both Ptiwis and 2′-O-methylation contribute to strand selection of Dicer-cleaved siRNAs. This unexpected function of two distinct vegetative Piwis extends the increasing knowledge of the diversity of Piwi functions in diverse silencing pathways. As both Ptiwis show differential subcellular localisation, Ptiwi13 in the cytoplasm and Ptiwi14 in the vegetative macronucleus, we conclude that cytosolic and nuclear silencing factors are necessary for efficient chromatin silencing.

Comparative analysis of common alignment tools for single-cell RNA sequencing (2022)

Schulze Brüning, Ralf ; Tombor, Lukas ; Schulz, Marcel Holger ; Dimmeler, Stefanie ; John, David

Background: With the rise of single-cell RNA sequencing new bioinformatic tools have been developed to handle specific demands, such as quantifying unique molecular identifiers and correcting cell barcodes. Here, we benchmarked several datasets with the most common alignment tools for single-cell RNA sequencing data. We evaluated differences in the whitelisting, gene quantification, overall performance, and potential variations in clustering or detection of differentially expressed genes. We compared the tools Cell Ranger version 6, STARsolo, Kallisto, Alevin, and Alevin-fry on 3 published datasets for human and mouse, sequenced with different versions of the 10X sequencing protocol. Results: Striking differences were observed in the overall runtime of the mappers. Besides that, Kallisto and Alevin showed variances in the number of valid cells and detected genes per cell. Kallisto reported the highest number of cells; however, we observed an overrepresentation of cells with low gene content and unknown cell type. Conversely, Alevin rarely reported such low-content cells. Further variations were detected in the set of expressed genes. While STARsolo, Cell Ranger 6, Alevin-fry, and Alevin produced similar gene sets, Kallisto detected additional genes from the Vmn and Olfr gene family, which are likely mapping artefacts. We also observed differences in the mitochondrial content of the resulting cells when comparing a prefiltered annotation set to the full annotation set that includes pseudogenes and other biotypes. Conclusion: Overall, this study provides a detailed comparison of common single-cell RNA sequencing mappers and shows their specific properties on 10X Genomics data.

Integrative prediction of gene expression with chromatin accessibility and conformation data (2019)

Schmidt, Florian ; Kern, Fabian ; Schulz, Marcel Holger

Background: Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter-enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organisation of the chromatin paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability. Results: We have extended our TEPIC framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We found that including long-range PEIs deduced from both HiC and HiChIP data indeed improves model performance. We designed a novel machine learning approach that allows the prioritization of TFs in distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer-promoter loops involving YY1 in different cell lines. Conclusion: We show that the integration of chromatin conformation data improves gene expression prediction, underlining the importance of enhancer looping for gene expression regulation. Our general approach can be used to prioritize TFs that are involved in distal and promoter-proximal regulation using accessibility, conformation and expression data.

Automated analysis of small RNA datasets with RAPID (2019)

Karunanithi, Sivarajan ; Simon, Martin ; Schulz, Marcel Holger

Summary: Understanding the role of short-interfering RNA (siRNA) in diverse biological processes is of current interest and often approached through small RNA sequencing. However, analysis of these datasets is difficult due to the complexity of biological RNA processing pathways, which differ between species. Properties such as strand specificity, length distribution, and the distribution of soft-clipped bases are among the few parameters known to guide researchers in understanding the role of siRNAs. We present RAPID, a generic eukaryotic siRNA analysis pipeline, which captures information inherent in the datasets and automatically produces numerous visualizations as user-friendly HTML reports, covering multiple categories required for siRNA analysis. RAPID also facilitates an automated comparison of multiple datasets, with one of the normalization techniques dedicated to siRNA knockdown analysis, and integrates differential expression analysis using DESeq2. RAPID is available under the MIT license at https://github.com/SchulzLab/RAPID. We recommend using it as a conda environment available from https://anaconda.org/bioconda/rapid.

Broad domains of histone marks in the highly compact Paramecium macronuclear genome (2021)

Drews, Franziska ; Salhab, Abdulrahman ; Karunanithi, Sivarajan ; Cheaib, Miriam ; Jung, Martin ; Schulz, Marcel Holger ; Simon, Martin

The unicellular ciliate Paramecium contains a large vegetative macronucleus with several unusual characteristics including an extremely high coding density and high polyploidy. As macronuclear chromatin is devoid of heterochromatin, our study characterizes the functional epigenomic organisation necessary for gene regulation and proper Pol II activity. Histone marks (H3K4me3, H3K9ac, H3K27me3) revealed no narrow peaks but broad domains along gene bodies, whereas intergenic regions were devoid of nucleosomes. Our data implicate H3K4me3 levels inside ORFs as the main factor associated with gene expression, and H3K27me3 appears to occur as a bistable domain with H3K4me3 in plastic genes. Surprisingly, silent and lowly expressed genes show low nucleosome occupancy, suggesting that gene inactivation does not involve increased nucleosome occupancy and chromatin condensation. Due to a high occupancy of Pol II along highly expressed ORFs, transcriptional elongation appears to be quite different from that in other species. This is supported by missing heptameric repeats in the C-terminal domain of Pol II and a divergent elongation system. Our data imply that unoccupied DNA is the default state, whereas gene activation requires nucleosome recruitment together with broad domains of H3K4me3. This could represent a buffer for paused Pol II along ORFs in the absence of the elongation factors of higher eukaryotes.
