Refine
Document Type
- Doctoral Thesis (3)
Language
- English (3)
Has Fulltext
- yes (3)
Is part of the Bibliography
- no (3)
Keywords
- Paramecium (1)
- RNA biology (1)
- RNA interference (1)
- bioinformatics (1)
- epigenome (1)
- graph theory (1)
- macronucleus (1)
- protein assembly (1)
- protein structure (1)
- respiratory complex I (1)
Institute
- Informatik (2)
- Informatik und Mathematik (1)
In the last two decades, our understanding of human gene regulation has improved tremendously. There are plentiful computational methods which focus on integrative data analysis of humans, and model organisms, like mouse and drosophila. However, these tools are not directly employable by researchers working on non-model organisms to answer fundamental biological, and evolutionary questions. We aimed to develop new tools, and adapt existing software for the analysis of transcriptomic and epigenomic data of one such non-model organism, Paramecium tetraurelia, an unicellular eukaryote. Paramecium contains two diploid (2n) germline micronuclei (MIC) and a polyploid (800n) somatic macronuclei (MAC). The transcriptomic and epigenomic regulatory landscape of the MAC genome, which has 80% protein-coding genes and short intergenic regions, is poorly understood.
We developed a generic automated eukaryotic short interfering RNA (siRNA) analysis tool, called RAPID. Our tool captures diverse siRNA characteristics from small RNA sequencing data and provides easily navigable visualisations. We also introduced a normalisation technique to facilitate comparison of multiple siRNA-based gene knockdown studies. Further, we developed a pipeline to characterise novel genome-wide endogenous short interfering RNAs (endo-siRNAs). In contrary to many organisms, we found that the endo-siRNAs are not acting in cis, to silence their parent mRNA. We also predicted phasing of siRNAs, which are regulated by the RNA interference (RNAi) pathway.
Further, using RAPID, we investigated the aberrations of endo-siRNAs, and their respective transcriptomic alterations caused by an RNAi pathway triggered by feeding small RNAs against a target gene. We find that the small RNA transcriptome is altered, even if a gene unrelated to RNAi pathway is targeted. This is important in the context of investigations of genetically modified organisms (GMOs). We suggest that future studies need to distinguish transcriptomic changes caused by RNAi inducing techniques and actual regulatory changes.
Subsequently, we adapted existing epigenomics analysis tools to conduct the first comprehensive epigenomic characterisation of nucleosome positioning and histone modifications of the Paramecium MAC. We identified well positioned nucleosomes shifted downstream of the transcription start site. GC content seems to dictate, in cis, the positioning of nucleosomes, histone marks (H3K4me3, H3K9ac, and H3K27me3), and Pol II in the AT-rich Paramecium genome. We employed a chromatin state segmentation approach, on nucleosomes and histone marks, which revealed genes with active, repressive, and bivalent chromatin states. Further, we constructed a regulatory association network of all the aforementioned data, using the sparse partial correlation network technique. Our analysis revealed subsets of genes, whose expression is positively associated with H3K27me3, different to the otherwise reported negative association with gene expression in many other organisms.
Further, we developed a Random Forests classifier to predict gene expression using genic (gene length, intron frequency, etc.) and epigenetic features. Our model has a test performance (PR-AUC) of 0.83. Upon evaluating different feature sets, we found that genic features are as predictive, of gene expression, as the epigenetic features. We used Shapley local feature explanation values, to suggest that high H3K4me3, high intron frequency, low gene length, high sRNA, and high GC content are the most important elements for determining gene expression status.
In this thesis, we developed novel tools, and employed several bioinformatics and machine learning methods to characterise the regulatory landscape of the Paramecium’s (epi)genome.
A central concern in genetics is to identify mechanisms of transcriptional regulation. The aim is to unravel the mapping between the DNA sequence and gene expression. However, it turned out that this is extremely complex. Gene regulation is highly cell type-specific and even moderate changes in gene ex- pression can have functional consequences.
Important contributors to gene regulation are transcription factors (TFs), that are able to directly interact with the DNA. Often, a first step in understanding the effect of a TF on the gene’s regulation is to identify the genomic regions a TF binds to. Therefore, one needs to be aware of the TF’s binding preferences, which are commonly summarized in TF binding motifs. Although for many TFs the binding motif is experimentally validated, there is still a large number of TFs where no binding motif is known. There exist many tools that link TF binding motifs to TFs. We developed the method Massif that improves the performance of such tools by incorporating a domain score that uses the DNA binding domain of the studied TF as additional information.
TF binding sites are often enriched in regulatory elements (REMs) such as promoters or enhancers, where the latter can be located megabases away from its target gene. However, to understand the regulation of a gene it is crucial to know where the REMs of a gene are located. We introduced the EpiRegio webserver that holds REMs associated to target genes predicted across many cell types and tissues using STITCHIT, a previously established method. Our publicly available webserver enables to query for REMs associated to genes (gene query) and REMs overlapping genomic regions (region query). We illus- trated the usefulness of EpiRegio by pointing to a TF that occurs enriched in the REMs of differential expressed genes in circPLOD2 depleted pericytes. Further, we highlighted genes, which are affected by CRISPR-Cas induced mutations in non-coding genomic regions using EpiRegio’s region query. Non-coding genetic variants within REMs may alter gene expression by modifying TF binding sites, which can lead to various kinds of traits or diseases. To understand the underlying molecular mechanisms, one aims to evaluate the effect of such genetic variations on TF binding sites. We developed an accurate and fast statistical approach, that can assess whether a single nucleotide polymorphism (SNP) is regulatory. Further, we combined this approach with epigenetic data and additional analyses in our Sneep workflow. For instance, it enables to identify TFs whose binding preferences are affected by the analyzed SNPs, which is illustrated on eQTL datasets for different cell types. Additionally, we used our Sneep workflow to highlight cardiovascular disease genes using regulatory SNPs and REM-gene interactions.
Overall, the described results allow a better understanding of REM-gene interactions and their interplay with TFs on gene regulation.
Proteins are biological macromolecules playing essential roles in all living organisms.
Proteins often bind with each other forming complexes to fulfill their function. Such protein complexes assemble along an ordered pathway. An assembled protein complex can often be divided into structural and functional modules. Knowing the order of assembly and the modules of a protein complex is important to understand biological processes and treat diseases related to misassembly.
Typical structures of the Protein Data Bank (PDB) contain two to three subunits and a few thousand atoms. Recent developments have led to large protein complexes being resolved. The increasing number and size of the protein complexes demand for computational assistance for the visualization and analysis. One such large protein complex is respiratory complex I accounting for 45 subunits in Homo sapiens.
Complex I is a well understood protein complex that served as case study to validate our methods.
Our aim was to analyze time-resolved Molecular Dynamics (MD) simulation data, identify modules of a protein complex and generate hypotheses for the assembly pathway of a protein complex. For that purpose, we abstracted the topology of protein complexes to Complex Graphs of the Protein Topology Graph Library (PTGL). The subunits are represented as vertices, and spatial contacts as edges. The edges are weighted with the number of contacts based on a distance threshold. This allowed us to apply graph-theoretic methods to visualize and analyze protein complexes.
We extended the implementations of two methods to achieve a computation of Complex Graphs in feasible runtimes. The first method skipped checks for contacts using the information which residues are sequential neighbors. We extended the method to protein complexes and structures containing ligands. The second method introduced spheres encompassing all atoms of a subunit and skipped the check for contacts if the corresponding spheres do not overlap. Both methods combined allowed skipping up to 93 % of the checks for contacts for sample complexes of 40 subunits compared to up to 10 % of the previous implementation. We showed that the runtime of the combined method scaled linearly with the number of atoms compared to a non-linear scaling of the previous implementation We implemented a third method fixing the assignment of an orientation to secondary structure elements. We placed a three-dimensional vector in each secondary structure element and computed the angle between secondary structure elements to assign an orientation. This method sped up the runtime especially for large structures, such as the capsid of human immunodeficiency virus, for which the runtime decreased from 43 to less than 9 hours.
The feasible runtimes allowed us to investigate two data sets of MD trajectories of respiratory complex I of Thermus thermophilus that we received. The data sets differ only by whether ubiquinone is bound to the complex. We implemented a pipeline, PTGLdynamics, to compute the contacts and Complex Graphs for all time steps of the trajectories. We investigated different methods to track changes of contacts during the simulation and created a heat map put onto the three-dimensional structure visualizing the changes. We also created line plots to visualize the changes of contacts over the course of the simulation. Both visualizations helped spotting outstandingly flexible or rigid regions of the structure or time points of the simulation in which major dynamics occur.
We introduced normalizations of the edge weights of Complex Graphs for identi-fying modules and predicting the assembly pathway. The idea is to normalize the number of contacts for the number of residues of a subunit. We defined five different normalizations.
To identify structural and functional modules, we applied the Leiden graph clustering algorithm to the Complex Graphs of respiratory complex I and the respiratory supercomplex. We examined the results for the different normalizations of the weights of the Complex Graphs. The absolute edge weight produced the best result identifying three of four modules that have been defined in the literature for respiratory complex I.
We applied agglomerative hierarchical clustering to the edges of a Complex Graph to create hypotheses of the assembly pathway. The rationale was that subunits with an extensive interface in the final structure assemble early. We tested our method against two existing methods on a data set of 21 proteins with reported assembly pathways. Our prediction outperformed the other methods and ran in feasible runtimes of a few minutes at most.
We also tested our method on respiratory complex I, the respiratory supercomplex and the respiratory megacomplex. We compared the results for the different normalizations with an assembly pathway of respiratory complex I described in the literature. We transformed the assembly pathways to dendrograms and compared the predictions to the reference using the Robinson-Foulds distance and clustering information distance. We analyzed the landscape of the clustering information distance by generating random dendrograms and showed that our result is far better than expected at random. We showed in a detailed analysis that the assembly prediction using one normalization was able to capture key features of the assembly pathway that has been proposed in the literature.
In conclusion, we presented different applications of graph theory to automatically analyze the topology of protein complexes. Our programs run in feasible runtimes even for large complexes. We showed that graph-theoretic modeling of the protein structure can be used to analyze MD simulation data, identify modules of protein complexes and predict assembly pathways.