The unicellular ciliate Paramecium contains a large vegetative macronucleus with several unusual characteristics, including an extremely high coding density and high polyploidy. As macronuclear chromatin is devoid of heterochromatin, our study characterizes the functional epigenomic organization necessary for gene regulation and proper Pol II activity. Histone marks (H3K4me3, H3K9ac, H3K27me3) reveal no narrow peaks but broad domains along gene bodies, whereas intergenic regions are devoid of nucleosomes. Our data implicate H3K4me3 levels inside ORFs as the main factor associated with gene expression, and H3K27me3 appears in association with H3K4me3 in plastic genes. Silent and lowly expressed genes show low nucleosome occupancy, suggesting that gene inactivation does not involve increased nucleosome occupancy and chromatin condensation. Because of a high occupancy of Pol II along highly expressed ORFs, transcriptional elongation appears to be quite different from that of other species. This is supported by missing heptameric repeats in the C-terminal domain of Pol II and a divergent elongation system. Our data imply that unoccupied DNA is the default state, whereas gene activation requires nucleosome recruitment together with broad domains of H3K4me3. In summary, gene activation and silencing in Paramecium run counter to the current understanding of chromatin biology.
Most sRNA biogenesis mechanisms involve either RNAseIII cleavage or ping-pong amplification by different Piwi proteins harboring slicer activity. Here, we address the question of why the mechanism of transgene-induced silencing in the ciliate Paramecium needs both Dicer activity and two Ptiwi proteins. This pathway involves primary siRNAs produced from non-translatable transgenes and secondary siRNAs from endogenous remote loci. Our data do not indicate any signatures of ping-pong amplification but rather Dicer cleavage of long dsRNA. We show that Ptiwi13 and 14 have different preferences for primary and secondary siRNAs but do not load them in a mutually exclusive manner. Both Piwis enrich for antisense RNAs, and Ptiwi14-loaded siRNAs show a 5′-U signature. In addition, both Ptiwis show a general preference for uridine-rich sRNAs along the entire sRNA length. Our data indicate that both Ptiwis and 2′-O-methylation contribute to strand selection of Dicer-cleaved siRNAs. This unexpected function of two distinct vegetative Piwis extends the increasing knowledge of the diversity of Piwi functions in diverse silencing pathways. As both Ptiwis show differential subcellular localisation, Ptiwi13 in the cytoplasm and Ptiwi14 in the vegetative macronucleus, we conclude that cytosolic and nuclear silencing factors are necessary for efficient chromatin silencing.
Improved integration of single cell transcriptome data demonstrated on heart failure in mice and men
(2023)
Biomedical research frequently uses murine models to study disease mechanisms. However, the translation of these findings to human disease remains a significant challenge. In order to improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays.
The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as the UniProt and NCBI databases. For genes with multiple matches, we employed Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity.
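The tie-breaking step for genes with multiple candidate orthologs can be illustrated with a minimal Python sketch of Needleman-Wunsch global alignment scoring. This is not the OrthoIntegrate implementation: the function names, the simple match/mismatch/gap scores, and the example sequences are assumptions chosen for illustration, and a real pipeline would typically score amino acids with a substitution matrix.

```python
from typing import Mapping


def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of a and b with a linear gap penalty.

    The scoring values are illustrative assumptions, not those of the paper.
    """
    # prev holds the previous row of the dynamic programming matrix.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two residues, gap in b, gap in a.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def best_ortholog(query: str, candidates: Mapping[str, str]) -> str:
    """Return the (hypothetical) candidate name with the highest alignment score."""
    return max(candidates,
               key=lambda name: needleman_wunsch_score(query, candidates[name]))


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    candidates = {"geneA": "MKTAYIAK", "geneB": "MKTWWIAK"}
    print(best_ortholog("MKTAYIAK", candidates))
```

Keeping only two rows of the matrix is enough here because only the final score is needed to rank candidates, not the alignment itself.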
The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure.
In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R-package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells.
Understanding how epigenetic variation in non-coding regions is involved in distal gene-expression regulation is an important problem. Regulatory regions can be associated with genes using large-scale datasets of epigenetic and expression data. However, for regions of complex epigenomic signals and enhancers that regulate many genes, it is difficult to understand these associations. We present StitchIt, an approach to dissect epigenetic variation in a gene-specific manner for the detection of regulatory elements (REMs) without relying on peak calls in individual samples. StitchIt segments epigenetic signal tracks over many samples to generate the location and the target genes of a REM simultaneously. We show that this approach leads to a more accurate and refined REM detection compared to standard methods, even on heterogeneous datasets, which are challenging to model. Also, StitchIt REMs are highly enriched in experimentally determined chromatin interactions and expression quantitative trait loci. We validated several newly predicted REMs using CRISPR-Cas9 experiments, thereby demonstrating the reliability of StitchIt. StitchIt is able to dissect regulation in superenhancers and predicts thousands of putative REMs that go unnoticed using peak-based approaches, suggesting that a large part of the regulome might be uncharted waters.
Summary: Understanding the role of short-interfering RNA (siRNA) in diverse biological processes is of current interest and often approached through small RNA sequencing. However, analysis of these datasets is difficult due to the complexity of biological RNA processing pathways, which differ between species. Properties such as strand specificity, length distribution, and distribution of soft-clipped bases are among the few parameters known to guide researchers in understanding the role of siRNAs. We present RAPID, a generic eukaryotic siRNA analysis pipeline, which captures information inherent in the datasets and automatically produces numerous visualizations as user-friendly HTML reports, covering multiple categories required for siRNA analysis. RAPID also facilitates an automated comparison of multiple datasets, with one of the normalization techniques dedicated to siRNA knockdown analysis, and integrates differential expression analysis using DESeq2. RAPID is available under the MIT license at https://github.com/SchulzLab/RAPID. We recommend using it as a conda environment available from https://anaconda.org/bioconda/rapid.
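Two of the properties named above, length distribution and strand specificity, can be computed with very little code. The following Python sketch is illustrative only and does not reproduce RAPID's implementation: the function names and the input format (a list of sequence and strand pairs, with "-" taken to mean antisense) are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Read = Tuple[str, str]  # (sequence, strand), strand is "+" or "-"


def length_distribution(reads: List[Read]) -> Dict[int, float]:
    """Fraction of reads observed at each read length."""
    counts = Counter(len(seq) for seq, _ in reads)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def antisense_fraction(reads: List[Read]) -> float:
    """Fraction of stranded reads mapping to the minus strand."""
    strands = Counter(strand for _, strand in reads)
    total = strands["+"] + strands["-"]
    return strands["-"] / total if total else 0.0


if __name__ == "__main__":
    # Toy reads, invented for this example.
    reads = [("U" * 23, "-"), ("A" * 23, "-"), ("G" * 21, "+"), ("C" * 23, "-")]
    print(length_distribution(reads))
    print(antisense_fraction(reads))
```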
Background: Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter-enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organisation of the chromatin paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability.
Results: We have extended our Tepic framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We found that including long-range PEIs deduced from both HiC and HiChIP data indeed improves model performance. We designed a novel machine learning approach that allows the prioritization of TFs in distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer-promoter loops involving YY1 in different cell lines.
Conclusion: We show that the integration of chromatin conformation data improves gene expression prediction, underlining the importance of enhancer looping for gene expression regulation. Our general approach can be used to prioritize TFs that are involved in distal and promoter-proximal regulation using accessibility, conformation and expression data.
Non-coding variations located within regulatory elements may alter gene expression by modifying Transcription Factor (TF) binding sites and thereby lead to functional consequences like various traits or diseases. To understand these molecular mechanisms, different TF models are being used to assess the effect of DNA sequence variations, such as Single Nucleotide Polymorphisms (SNPs). However, only a few statistical approaches exist to compute the statistical significance of results, and they are often slow for large sets of SNPs, such as data obtained from a genome-wide association study (GWAS) or allele-specific analysis of chromatin data.
Results: We investigate the distribution of maximal differential TF binding scores for general computational models that assess TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo data sets showed that our new approach improves on an existing method in terms of performance and speed. In applications on large sets of eQTL and GWAS SNPs, we could illustrate the usefulness of the novel statistic to highlight cell type specific regulators and TF target genes.
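To illustrate why a Laplace-type approximation makes significance testing fast, the following Python sketch fits a standard Laplace distribution to a set of scores and evaluates a closed-form two-sided tail probability. The abstract does not specify the modification used, so this sketch assumes the unmodified Laplace distribution; it is not the SNEEP implementation, and the function names and toy scores are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def fit_laplace(scores: Sequence[float]) -> Tuple[float, float]:
    """Maximum likelihood fit of a standard Laplace distribution.

    The location is the sample median and the scale is the mean absolute
    deviation from that median.
    """
    mu = statistics.median(scores)
    b = sum(abs(s - mu) for s in scores) / len(scores)
    if b == 0:
        raise ValueError("scores have zero spread; scale cannot be estimated")
    return mu, b


def laplace_two_sided_pvalue(x: float, mu: float, b: float) -> float:
    """P(|X - mu| >= |x - mu|) for X ~ Laplace(mu, b)."""
    return math.exp(-abs(x - mu) / b)


if __name__ == "__main__":
    # Toy background scores, invented for this example.
    mu, b = fit_laplace([-2.0, -1.0, 0.0, 1.0, 2.0])
    print(mu, b, laplace_two_sided_pvalue(3.0, mu, b))
```

Because the tail probability is a single exponential, each additional SNP costs constant time once the parameters are fitted, which is the property that makes such an approximation attractive for large SNP sets.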
Conclusions: Our approach allows the evaluation of DNA changes that induce differential TF binding in a fast and accurate manner, permitting computations on large mutation data sets. An implementation of the novel approach is freely available at https://github.com/SchulzLab/SNEEP.
Understanding the complexity of transcriptional regulation is a major goal of computational biology. Because experimental linkage of regulatory sites to genes is challenging, computational methods considering epigenomics data have been proposed to create tissue-specific regulatory maps. However, we showed that these approaches are not well suited to account for the variations of the regulatory landscape between cell types. To overcome these drawbacks, we developed a new method called STITCHIT, which identifies and links putative regulatory sites to genes. Within STITCHIT, we consider the chromatin accessibility signal of all samples jointly to identify regions exhibiting a signal variation related to the expression of a distinct gene. STITCHIT outperforms previous approaches in various validation experiments and was used with a genome-wide CRISPR-Cas9 screen to prioritize novel doxorubicin-resistance genes and their associated non-coding regulatory regions. We believe that our work paves the way for a more refined understanding of transcriptional regulation at the gene level.
The mechanisms by which specific histone modifications regulate distinct gene regulatory networks remain poorly understood. We investigated how H3K79me2, a modification catalyzed by DOT1L and previously considered a general transcriptional activation mark, regulates gene expression in mammalian cardiogenesis. Early embryonic cardiomyocyte ablation of Dot1l revealed that H3K79me2 does not act as a general transcriptional activator, but rather regulates highly specific gene regulatory networks at two critical cardiogenic junctures: left ventricle patterning and postnatal cardiomyocyte cell cycle withdrawal. Mechanistic analyses revealed that H3K79me2 in two distinct domains, gene bodies and regulatory elements, synergized to promote expression of genes activated by DOT1L. Surprisingly, these analyses also revealed that H3K79me2 in specific regulatory elements contributed to silencing genes usually not expressed in cardiomyocytes. As DOT1L mutants had increased numbers of postnatal mononuclear cardiomyocytes and prolonged cardiomyocyte cell cycle activity, controlled inhibition of DOT1L might be a strategy to promote cardiac regeneration post-injury.