Highlights
• Single nucleotide variants (SNVs) may affect transcription factor (TF) binding
• Fast statistical approach to assess significance of differential TF binding for SNVs
• Validate new approach on in vitro and in vivo TF binding assays
• Applications on GWAS SNVs and large eQTL studies illustrate utility
Summary
Non-coding variants located within regulatory elements may alter gene expression by modifying transcription factor (TF) binding sites, thereby leading to functional consequences. Different TF models are being used to assess the effect of DNA sequence variants, such as single nucleotide variants (SNVs). Existing methods are often slow and do not assess the statistical significance of their results. We investigated the distribution of absolute maximal differential TF binding scores for general computational models that assess TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo datasets showed that our approach improves upon an existing method in terms of performance and speed. Applications on eQTLs and on a genome-wide association study illustrate the usefulness of our statistics by highlighting cell type-specific regulators and target genes. An implementation of our approach is freely available on GitHub and as a Bioconda package.
Background: Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., ChIP-seq, ATAC-seq, or DNase-seq) and RNA-seq data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results.
Results: We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multi-modal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE data sets for cell lines K562 and MCF-7, including twelve histone modification ChIP-seq as well as ATAC-seq and DNase-seq datasets, where we observe and discuss assay-specific differences.
Conclusion: TF-Prioritizer accepts ATAC-seq, DNase-seq, or ChIP-seq and RNA-seq data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.
For medicine to fulfill its promise of personalized treatments based on a better understanding of disease biology, computational and statistical tools must exist to analyze the increasing amount of patient data that becomes available. A particular challenge is that several types of data are being measured to cope with the complexity of the underlying systems, enhance predictive modeling and enrich molecular understanding.
Here we review a number of recent approaches that specialize in the analysis of multimodal data in the context of predictive biomedicine. We focus on methods that combine different OMIC measurements with image or genome variation data. Our overview shows the diversity of methods that address analysis challenges and reveals new avenues for novel developments.
Improved integration of single cell transcriptome data demonstrated on heart failure in mice and men
(2023)
Biomedical research frequently uses murine models to study disease mechanisms. However, the translation of these findings to human disease remains a significant challenge. In order to improve the comparability of mouse and human data, we present a cross-species integration pipeline for single-cell transcriptomic assays.
The pipeline merges expression matrices and assigns clear orthologous relationships. Starting from Ensembl ortholog assignments, we allocated 82% of mouse genes to unique orthologs by using additional publicly available resources such as the UniProt and NCBI databases. For genes with multiple matches, we employed Needleman-Wunsch global alignment based on either amino acid or nucleotide sequence to identify the ortholog with the highest degree of similarity.
The workflow was tested for its functionality and efficiency by integrating scRNA-seq datasets from heart failure patients with the corresponding mouse model. We were able to assign unique human orthologs to up to 80% of the mouse genes, utilizing the known 17,492 orthologous pairs. Curiously, the integration process enabled the identification of both common and unique regulatory pathways between species in heart failure.
In conclusion, our pipeline streamlines the integration process, enhances gene nomenclature alignment and simplifies the translation of mouse models to human disease. We have made the OrthoIntegrate R-package accessible on GitHub (https://github.com/MarianoRuzJurado/OrthoIntegrate), which includes the assignment of ortholog definitions for human and mouse, as well as the pipeline for integrating single cells.
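The tie-breaking step described above, choosing among several candidate orthologs by global alignment, can be illustrated with a minimal Python sketch. It is not the OrthoIntegrate implementation. The function names, the unit scoring scheme (match +1, mismatch -1, gap -1), and the toy sequences are assumptions made for illustration. A real pipeline would more likely use a substitution matrix for amino acids and a length-normalized similarity.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Uses the standard Needleman-Wunsch recurrence with a linear gap
    penalty and keeps only two rows of the dynamic programming table.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def best_ortholog(query_seq, candidates):
    """Pick the candidate name whose sequence aligns best to query_seq.

    `candidates` maps a (hypothetical) gene name to its sequence.
    """
    return max(
        candidates,
        key=lambda name: needleman_wunsch_score(query_seq, candidates[name]),
    )


if __name__ == "__main__":
    # Toy sequences, not real genes.
    toy_candidates = {"geneA": "ACGA", "geneB": "ACGT"}
    print(best_ortholog("ACGT", toy_candidates))
```

Because raw alignment scores grow with sequence length, comparing candidates of very different lengths would call for a normalized score. The sketch omits this for brevity.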
Non-coding variations located within regulatory elements may alter gene expression by modifying Transcription Factor (TF) binding sites and thereby lead to functional consequences such as altered traits or diseases. To understand these molecular mechanisms, different TF models are being used to assess the effect of DNA sequence variations, such as Single Nucleotide Polymorphisms (SNPs). However, only a few approaches exist to compute the statistical significance of results, and they are often slow for large sets of SNPs, such as data obtained from a genome-wide association study (GWAS) or allele-specific analysis of chromatin data.
Results We investigate the distribution of maximal differential TF binding scores for general computational models that assess TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo data sets showed that our new approach improves on an existing method in terms of performance and speed. In applications on large sets of eQTL and GWAS SNPs we could illustrate the usefulness of the novel statistic to highlight cell type specific regulators and TF target genes.
Conclusions Our approach allows the evaluation of DNA changes that induce differential TF binding in a fast and accurate manner, permitting computations on large mutation data sets. An implementation of the novel approach is freely available at https://github.com/SchulzLab/SNEEP.
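To make the idea of a significance test for a maximal differential binding score concrete, here is a minimal Python sketch. It does not reproduce the modified Laplace distribution used in the paper, whose exact form is not given in this abstract. Instead it assumes that each differential score D follows a plain Laplace(0, b) distribution and that the scores of the n sequence windows overlapping a variant are independent. The function names and the parameters `scale` and `n_windows` are hypothetical.

```python
import math


def laplace_abs_tail(d, scale):
    """P(|D| >= d) for D ~ Laplace(0, scale).

    The absolute value of a centered Laplace variable is exponential
    with mean `scale`, so the tail is exp(-d / scale).
    """
    return math.exp(-max(d, 0.0) / scale)


def max_abs_pvalue(d_max, scale, n_windows):
    """P(max_i |D_i| >= d_max) for n_windows independent Laplace(0, scale) scores.

    Independence is an illustrative assumption. Scores of overlapping
    windows are correlated in practice.
    """
    cdf_single = 1.0 - laplace_abs_tail(d_max, scale)
    return 1.0 - cdf_single ** n_windows


if __name__ == "__main__":
    # Hypothetical observed maximal score, scale, and window count.
    print(max_abs_pvalue(d_max=4.0, scale=1.0, n_windows=10))
```

A closed-form tail like this is what makes such a test fast: once the distribution parameters are known, each p-value costs a constant number of arithmetic operations and needs no sampling.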
Motivation DNA CpG methylation (CpGm) has proven to be a crucial epigenetic factor in the gene regulatory system. Assessment of DNA CpG methylation values via whole-genome bisulfite sequencing (WGBS) is, however, computationally extremely demanding.
Results We present FAst MEthylation calling (FAME), the first approach to quantify CpGm values directly from bulk or single-cell WGBS reads without intermediate output files. FAME is very fast but as accurate as standard methods, which first produce BS alignment files before computing CpGm values. We present experiments on bulk and single-cell bisulfite datasets in which we show that data analysis can be significantly sped up, helping to address the current WGBS analysis bottleneck for large-scale datasets without compromising accuracy.
Availability An implementation of FAME is open source and licensed under GPL-3.0 at https://github.com/FischerJo/FAME.
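For readers unfamiliar with the quantity being computed, the CpGm value of a site is conventionally the fraction of reads covering the cytosine that remain unconverted (methylated) after bisulfite treatment. The Python sketch below only illustrates that ratio. It is not FAME's code, and the function and argument names are hypothetical.

```python
def cpg_methylation_level(methylated_reads, unmethylated_reads):
    """Return the methylation level of one CpG site, or None without coverage.

    methylated_reads: reads showing C (unconverted) at the site.
    unmethylated_reads: reads showing T (converted) at the site.
    """
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None  # no coverage, so the level is undefined
    return methylated_reads / total


if __name__ == "__main__":
    print(cpg_methylation_level(3, 1))
```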
Background: With the rise of single-cell RNA sequencing new bioinformatic tools have been developed to handle specific demands, such as quantifying unique molecular identifiers and correcting cell barcodes. Here, we benchmarked several datasets with the most common alignment tools for single-cell RNA sequencing data. We evaluated differences in the whitelisting, gene quantification, overall performance, and potential variations in clustering or detection of differentially expressed genes. We compared the tools Cell Ranger version 6, STARsolo, Kallisto, Alevin, and Alevin-fry on 3 published datasets for human and mouse, sequenced with different versions of the 10X sequencing protocol.
Results: Striking differences were observed in the overall runtime of the mappers. Besides that, Kallisto and Alevin showed variances in the number of valid cells and detected genes per cell. Kallisto reported the highest number of cells; however, we observed an overrepresentation of cells with low gene content and unknown cell type. Conversely, Alevin rarely reported such low-content cells. Further variations were detected in the set of expressed genes. While STARsolo, Cell Ranger 6, Alevin-fry, and Alevin produced similar gene sets, Kallisto detected additional genes from the Vmn and Olfr gene family, which are likely mapping artefacts. We also observed differences in the mitochondrial content of the resulting cells when comparing a prefiltered annotation set to the full annotation set that includes pseudogenes and other biotypes.
Conclusion: Overall, this study provides a detailed comparison of common single-cell RNA sequencing mappers and shows their specific properties on 10X Genomics data.
Background Eukaryotic gene expression is controlled by cis-regulatory elements (CREs) including promoters and enhancers which are bound by transcription factors (TFs). Differential expression of TFs and their putative binding sites on CREs cause tissue and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and thus gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined ChIP-seq and RNA-seq data exist, they do not offer good usability, have limited support for large-scale data processing, and provide only minimal functionality for visual result interpretation.
Results We developed TF-Prioritizer, an automated Java pipeline to prioritize condition-specific TFs derived from multi-modal data. TF-Prioritizer creates an interactive, feature-rich, and user-friendly web report of its results. To showcase the potential of TF-Prioritizer, we identified known active TFs (e.g., Stat5, Elf5, Nfib, Esr1), their target genes (e.g., milk proteins and cell-cycle genes), and newly classified lactating mammary gland TFs (e.g., Creb1, Arnt).
Conclusion TF-Prioritizer accepts ChIP-seq and RNA-seq data as input and suggests TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.
Electrocardiograms (ECG) record the heart activity and are the most common and reliable method to detect cardiac arrhythmias, such as atrial fibrillation (AFib). Lately, many commercially available devices such as smartwatches are offering ECG monitoring. Therefore, there is increasing demand for deep learning models designed to be physically implemented on these small portable devices with limited energy supply. In this paper, a workflow for the design of a small, energy-efficient recurrent convolutional neural network (RCNN) architecture for AFib detection is proposed. The approach can, however, be generalized to any type of long time series. In contrast to previous studies, which demand thousands of additional network neurons and millions of extra model parameters, the logical steps for generating a CNN with only 114 trainable parameters are described. The model consists of a small segmented CNN in combination with an optimal energy classifier. The architectural decisions are made by treating energy consumption as a metric as important as accuracy. The optimization steps are focused on the software, which can be embedded afterwards on a physical chip. Finally, a comparison with some previous relevant studies suggests that the widely used huge CNNs for similar tasks are mostly redundant and unnecessarily computationally expensive.
The unicellular ciliate Paramecium contains a large vegetative macronucleus with several unusual characteristics, including an extremely high coding density and high polyploidy. As macronuclear chromatin is devoid of heterochromatin, our study characterizes the functional epigenomic organization necessary for gene regulation and proper Pol II activity. Histone marks (H3K4me3, H3K9ac, H3K27me3) reveal no narrow peaks but broad domains along gene bodies, whereas intergenic regions are devoid of nucleosomes. Our data implicate H3K4me3 levels inside ORFs to be the main factor associated with gene expression, and H3K27me3 appears in association with H3K4me3 in plastic genes. Silent and lowly expressed genes show low nucleosome occupancy, suggesting that gene inactivation does not involve increased nucleosome occupancy and chromatin condensation. Because of a high occupancy of Pol II along highly expressed ORFs, transcriptional elongation appears to be quite different from that of other species. This is supported by missing heptameric repeats in the C-terminal domain of Pol II and a divergent elongation system. Our data imply that unoccupied DNA is the default state, whereas gene activation requires nucleosome recruitment together with broad domains of H3K4me3. In summary, gene activation and silencing in Paramecium run counter to the current understanding of chromatin biology.
Mechanisms by which specific histone modifications regulate distinct gene regulatory networks remain little understood. We investigated how H3K79me2, a modification catalyzed by DOT1L and previously considered a general transcriptional activation mark, regulates gene expression in mammalian cardiogenesis. Early embryonic cardiomyocyte ablation of Dot1l revealed that H3K79me2 does not act as a general transcriptional activator, but rather regulates highly specific gene regulatory networks at two critical cardiogenic junctures: left ventricle patterning and postnatal cardiomyocyte cell cycle withdrawal. Mechanistic analyses revealed that H3K79me2 in two distinct domains, gene bodies and regulatory elements, synergized to promote expression of genes activated by DOT1L. Surprisingly, these analyses also revealed that H3K79me2 in specific regulatory elements contributed to silencing genes usually not expressed in cardiomyocytes. As DOT1L mutants had increased numbers of postnatal mononuclear cardiomyocytes and prolonged cardiomyocyte cell cycle activity, controlled inhibition of DOT1L might be a strategy to promote cardiac regeneration post-injury.
Long non-coding RNAs (lncRNAs) can act as regulatory RNAs which, by altering the expression of target genes, impact the cellular phenotype and cardiovascular disease development. Endothelial lncRNAs and their vascular functions are largely undefined. Deep RNA-Seq and FANTOM5 CAGE analysis revealed the lncRNA LINC00607 to be highly enriched in human endothelial cells. LINC00607 was induced in response to hypoxia, arteriosclerosis regression in non-human primates and also in response to propranolol used to induce regression of human arteriovenous malformations. siRNA knockdown or CRISPR/Cas9 knockout of LINC00607 attenuated VEGF-A-induced angiogenic sprouting. LINC00607-knockout endothelial cells also integrated less into newly formed vascular networks in an in vivo assay in SCID mice. Overexpression of LINC00607 in CRISPR knockout cells restored normal endothelial function. RNA- and ATAC-Seq after LINC00607 knockout revealed changes in the transcription of endothelial gene sets linked to the endothelial phenotype and in chromatin accessibility around ERG-binding sites. Mechanistically, LINC00607 interacted with the SWI/SNF chromatin remodeling protein BRG1. CRISPR/Cas9-mediated knockout of BRG1 in HUVEC followed by CUT&RUN revealed that BRG1 is required to secure a stable chromatin state, mainly on ERG-binding sites. In conclusion, LINC00607 is an endothelial-enriched lncRNA that maintains ERG target gene transcription by interacting with the chromatin remodeler BRG1.
Spatial genome organization is tightly controlled by several regulatory mechanisms and is essential for gene expression control. Nuclear receptors are ligand-activated transcription factors that modulate physiological and pathophysiological processes and are primary pharmacological targets. DNA binding of the important loop-forming insulator protein CCCTC-binding factor (CTCF) was modulated by 1α,25-dihydroxyvitamin D3 (1,25(OH)2D3). We performed CTCF HiChIP assays to produce the first genome-wide dataset of CTCF long-range interactions in 1,25(OH)2D3-treated cells, and to determine whether dynamic changes of spatial chromatin interactions are essential for fine-tuning of nuclear receptor signaling. We detected changes in 3D chromatin organization upon vitamin D receptor (VDR) activation at 3.1% of all observed CTCF interactions. VDR binding was enriched at both differential loop anchors and within differential loops. Differential loops were observed in several putative functional roles including TAD border formation, promoter-enhancer looping, and establishment of VDR-responsive insulated neighborhoods. Vitamin D target genes were enriched in differential loops and at their anchors. Secondary vitamin D effects related to dynamic chromatin domain changes were linked to location of downstream transcription factors in differential loops. CRISPR interference and loop anchor deletion experiments confirmed the functional relevance of nuclear receptor ligand-induced adjustments of the chromatin 3D structure for gene expression regulation.
The transcription factor vitamin D receptor (VDR) is the high affinity nuclear target of the biologically active form of vitamin D3 (1,25(OH)2D3). In order to identify pure genomic transcriptional effects of 1,25(OH)2D3, we used VDR cistrome, transcriptome and open chromatin data, obtained from the human monocytic cell line THP-1, for a novel hierarchical analysis applying three bioinformatics approaches. We predicted 75.6% of all early 1,25(OH)2D3-responding (2.5 or 4 h) and 57.4% of the late differentially expressed genes (24 h) to be primary VDR target genes. VDR knockout led to a complete loss of 1,25(OH)2D3–induced genome-wide gene regulation. Thus, there was no indication of any VDR-independent non-genomic actions of 1,25(OH)2D3 modulating its transcriptional response. Among the predicted primary VDR target genes, 47 were coding for transcription factors and thus may mediate secondary 1,25(OH)2D3 responses. CEBPA and ETS1 ChIP-seq data and RNA-seq following CEBPA knockdown were used to validate the predicted regulation of secondary vitamin D target genes by both transcription factors. In conclusion, a directional network containing 47 partly novel primary VDR target transcription factors describes secondary responses in a highly complex vitamin D signaling cascade. The central transcription factor VDR is indispensable for all transcriptome-wide effects of the nuclear hormone.
Endothelial cells play a critical role in the adaptation of tissues to injury. Tissue ischemia induced by infarction leads to profound changes in endothelial cell functions and can induce transition to a mesenchymal state. Here we explore the kinetics and individual cellular responses of endothelial cells after myocardial infarction by using single cell RNA sequencing. This study demonstrates a time dependent switch in endothelial cell proliferation and inflammation associated with transient changes in metabolic gene signatures. Trajectory analysis reveals that the majority of endothelial cells 3 to 7 days after myocardial infarction acquire a transient state, characterized by mesenchymal gene expression, which returns to baseline 14 days after injury. Lineage tracing, using the Cdh5-CreERT2;mT/mG mice followed by single cell RNA sequencing, confirms the transient mesenchymal transition and reveals additional hypoxic and inflammatory signatures of endothelial cells during early and late states after injury. These data suggest that endothelial cells undergo a transient mesenchymal activation concomitant with a metabolic adaptation within the first days after myocardial infarction but do not acquire a long-term mesenchymal fate. This mesenchymal activation may facilitate endothelial cell migration and clonal expansion to regenerate the vascular network.
Understanding how epigenetic variation in non-coding regions is involved in distal gene-expression regulation is an important problem. Regulatory regions can be associated with genes using large-scale datasets of epigenetic and expression data. However, for regions of complex epigenomic signals and enhancers that regulate many genes, it is difficult to understand these associations. We present StitchIt, an approach to dissect epigenetic variation in a gene-specific manner for the detection of regulatory elements (REMs) without relying on peak calls in individual samples. StitchIt segments epigenetic signal tracks over many samples to generate the location and the target genes of a REM simultaneously. We show that this approach leads to a more accurate and refined REM detection compared to standard methods even on heterogeneous datasets, which are challenging to model. Also, StitchIt REMs are highly enriched in experimentally determined chromatin interactions and expression quantitative trait loci. We validated several newly predicted REMs using CRISPR-Cas9 experiments, thereby demonstrating the reliability of StitchIt. StitchIt is able to dissect regulation in superenhancers and predicts thousands of putative REMs that go unnoticed using peak-based approaches, suggesting that a large part of the regulome might be uncharted waters.
The unicellular ciliate Paramecium contains a large vegetative macronucleus with several unusual characteristics, including an extremely high coding density and high polyploidy. As macronuclear chromatin is devoid of heterochromatin, our study characterizes the functional epigenomic organisation necessary for gene regulation and proper Pol II activity. Histone marks (H3K4me3, H3K9ac, H3K27me3) revealed no narrow peaks but broad domains along gene bodies, whereas intergenic regions were devoid of nucleosomes. Our data implicate H3K4me3 levels inside ORFs as the main factor associated with gene expression, and H3K27me3 appears to occur as a bistable domain with H3K4me3 in plastic genes. Surprisingly, silent and lowly expressed genes show low nucleosome occupancy, suggesting that gene inactivation does not involve increased nucleosome occupancy and chromatin condensation. Due to a high occupancy of Pol II along highly expressed ORFs, transcriptional elongation appears to be quite different from that of other species. This is supported by missing heptameric repeats in the C-terminal domain of Pol II and a divergent elongation system. Our data imply that unoccupied DNA is the default state, whereas gene activation requires nucleosome recruitment together with broad domains of H3K4me3. This could represent a buffer for paused Pol II along ORFs in the absence of the elongation factors of higher eukaryotes.
Background: Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter–enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organization of the chromatin paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability.
Results: We have extended our TEPIC framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We designed a novel machine learning approach that allows the prioritization of TFs binding to distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer–promoter loops involving YY1 in different cell lines.
Conclusion: We present a novel approach that can be used to prioritize TFs involved in distal and promoter-proximal regulatory events by integrating chromatin accessibility, conformation, and gene expression data. We show that the integration of chromatin conformation data can improve gene expression prediction and aids model interpretability.
The aging process is characterized by a chronic, low‐grade inflammatory state, termed “inflammaging.” It has been suggested that macrophage activation plays a key role in the induction and maintenance of this state. In the present study, we aimed to elucidate the mechanisms responsible for aging‐associated changes in the myeloid compartment of mice. The aging phenotype, characterized by elevated cytokine production, was associated with a dysfunction of the hypothalamic–pituitary–adrenal (HPA) axis and diminished serum corticosteroid levels. In particular, the concentration of corticosterone, the major active glucocorticoid in rodents, was decreased. This could be explained by an impaired expression and activity of 11β‐hydroxysteroid dehydrogenase type 1 (11β‐HSD1), an enzyme that determines the extent of cellular glucocorticoid responses by reducing the corticosteroids cortisone/11‐dehydrocorticosterone to their active forms cortisol/corticosterone, in aged macrophages and peripheral leukocytes. These changes were accompanied by a downregulation of the glucocorticoid receptor target gene glucocorticoid‐induced leucine zipper (GILZ) in vitro and in vivo. Since GILZ plays a central role in macrophage activation, we hypothesized that the loss of GILZ contributed to the process of macroph‐aging. The phenotype of macrophages from aged mice was indeed mimicked in young GILZ knockout mice. In summary, the current study provides insight into the role of glucocorticoid metabolism and GILZ regulation during aging.