Refine
Year of publication
- 2021 (1) (remove)
Document Type
- Doctoral Thesis (1)
Language
- English (1)
Has Fulltext
- yes (1)
Is part of the Bibliography
- no (1) (remove)
Keywords
- macronucleus (1) (remove)
Institute
- Informatik (1)
In the last two decades, our understanding of human gene regulation has improved tremendously. There are plentiful computational methods which focus on integrative data analysis of humans, and model organisms, like mouse and drosophila. However, these tools are not directly employable by researchers working on non-model organisms to answer fundamental biological, and evolutionary questions. We aimed to develop new tools, and adapt existing software for the analysis of transcriptomic and epigenomic data of one such non-model organism, Paramecium tetraurelia, an unicellular eukaryote. Paramecium contains two diploid (2n) germline micronuclei (MIC) and a polyploid (800n) somatic macronuclei (MAC). The transcriptomic and epigenomic regulatory landscape of the MAC genome, which has 80% protein-coding genes and short intergenic regions, is poorly understood.
We developed a generic automated eukaryotic short interfering RNA (siRNA) analysis tool, called RAPID. Our tool captures diverse siRNA characteristics from small RNA sequencing data and provides easily navigable visualisations. We also introduced a normalisation technique to facilitate comparison of multiple siRNA-based gene knockdown studies. Further, we developed a pipeline to characterise novel genome-wide endogenous short interfering RNAs (endo-siRNAs). In contrary to many organisms, we found that the endo-siRNAs are not acting in cis, to silence their parent mRNA. We also predicted phasing of siRNAs, which are regulated by the RNA interference (RNAi) pathway.
Further, using RAPID, we investigated the aberrations of endo-siRNAs, and their respective transcriptomic alterations caused by an RNAi pathway triggered by feeding small RNAs against a target gene. We find that the small RNA transcriptome is altered, even if a gene unrelated to RNAi pathway is targeted. This is important in the context of investigations of genetically modified organisms (GMOs). We suggest that future studies need to distinguish transcriptomic changes caused by RNAi inducing techniques and actual regulatory changes.
Subsequently, we adapted existing epigenomics analysis tools to conduct the first comprehensive epigenomic characterisation of nucleosome positioning and histone modifications of the Paramecium MAC. We identified well positioned nucleosomes shifted downstream of the transcription start site. GC content seems to dictate, in cis, the positioning of nucleosomes, histone marks (H3K4me3, H3K9ac, and H3K27me3), and Pol II in the AT-rich Paramecium genome. We employed a chromatin state segmentation approach, on nucleosomes and histone marks, which revealed genes with active, repressive, and bivalent chromatin states. Further, we constructed a regulatory association network of all the aforementioned data, using the sparse partial correlation network technique. Our analysis revealed subsets of genes, whose expression is positively associated with H3K27me3, different to the otherwise reported negative association with gene expression in many other organisms.
Further, we developed a Random Forests classifier to predict gene expression using genic (gene length, intron frequency, etc.) and epigenetic features. Our model has a test performance (PR-AUC) of 0.83. Upon evaluating different feature sets, we found that genic features are as predictive, of gene expression, as the epigenetic features. We used Shapley local feature explanation values, to suggest that high H3K4me3, high intron frequency, low gene length, high sRNA, and high GC content are the most important elements for determining gene expression status.
In this thesis, we developed novel tools, and employed several bioinformatics and machine learning methods to characterise the regulatory landscape of the Paramecium’s (epi)genome.