Proteins are biological macromolecules that play essential roles in all living organisms.
Proteins often bind to each other, forming complexes to fulfill their function. Such protein complexes assemble along an ordered pathway, and an assembled complex can often be divided into structural and functional modules. Knowing the assembly order and the modules of a protein complex is important for understanding biological processes and treating diseases related to misassembly.
Typical structures in the Protein Data Bank (PDB) contain two to three subunits and a few thousand atoms. Recent methodological developments have allowed much larger protein complexes to be resolved. The increasing number and size of protein complexes call for computational assistance with visualization and analysis. One such large protein complex is respiratory complex I, which comprises 45 subunits in Homo sapiens.
Complex I is a well-understood protein complex that served as a case study to validate our methods.
Our aim was to analyze time-resolved Molecular Dynamics (MD) simulation data, identify modules of protein complexes, and generate hypotheses for their assembly pathways. For that purpose, we abstracted the topology of protein complexes into Complex Graphs of the Protein Topology Graph Library (PTGL): subunits are represented as vertices and spatial contacts as edges, with each edge weighted by the number of contacts determined by a distance threshold. This allowed us to apply graph-theoretic methods to visualize and analyze protein complexes.
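To illustrate this abstraction, the following sketch builds a toy Complex Graph from hypothetical coordinates. It is not the PTGL implementation: the real method counts atom-atom contacts, whereas this sketch uses a single point per residue and an assumed threshold for brevity.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: one 3D point per residue for each subunit.
# (The actual Complex Graphs are computed from all-atom contacts.)
subunits = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(1.5, 0.0, 0.0), (9.0, 0.0, 0.0)],
    "C": [(20.0, 0.0, 0.0)],
}

def complex_graph(subunits, threshold=4.0):
    """Vertices are subunits; an edge's weight is the number of
    residue pairs closer than the distance threshold."""
    edges = {}
    for (name_a, res_a), (name_b, res_b) in combinations(subunits.items(), 2):
        contacts = sum(1 for p in res_a for q in res_b if dist(p, q) < threshold)
        if contacts:
            edges[(name_a, name_b)] = contacts
    return edges

print(complex_graph(subunits))  # A and B are in contact; C is isolated
```

The resulting weighted edge dictionary is the input that graph-theoretic methods (clustering, weight normalization) operate on.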
We extended the implementations of two methods to compute Complex Graphs in feasible runtimes. The first method skips contact checks for residues that are known sequential neighbors; we extended it to protein complexes and to structures containing ligands. The second method encloses all atoms of a subunit in a sphere and skips the contact check if the corresponding spheres do not overlap. Combined, the two methods allowed skipping up to 93 % of the contact checks for sample complexes of 40 subunits, compared to up to 10 % with the previous implementation. We showed that the runtime of the combined method scales linearly with the number of atoms, whereas the previous implementation scaled non-linearly. We implemented a third method that fixes the assignment of an orientation to secondary structure elements: we placed a three-dimensional vector in each secondary structure element and computed the angle between these vectors to assign an orientation. This method sped up the runtime especially for large structures, such as the capsid of human immunodeficiency virus, for which the runtime decreased from 43 to less than 9 hours.
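The sphere-based skip test can be sketched as follows. This is a minimal illustration of the geometric idea, not the thesis code: if the bounding spheres, each grown by the contact threshold, do not overlap, no atom pair between the two subunits can be within the threshold, so the expensive pairwise check is safely skipped.

```python
from math import dist

def bounding_sphere(atoms):
    """Center = centroid of the atoms; radius = maximum distance
    from the centroid to any atom, so all atoms lie inside."""
    n = len(atoms)
    center = tuple(sum(a[i] for a in atoms) / n for i in range(3))
    radius = max(dist(center, a) for a in atoms)
    return center, radius

def spheres_may_contact(atoms_a, atoms_b, threshold=4.0):
    """Return False when the spheres are too far apart for any
    atom pair to be within the contact threshold."""
    (ca, ra), (cb, rb) = bounding_sphere(atoms_a), bounding_sphere(atoms_b)
    return dist(ca, cb) <= ra + rb + threshold
```

Because the sphere test is O(1) per subunit pair after a single pass over the atoms, the savings grow with the number of subunits.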
The feasible runtimes allowed us to investigate two data sets of MD trajectories of respiratory complex I of Thermus thermophilus that were provided to us. The data sets differ only in whether ubiquinone is bound to the complex. We implemented a pipeline, PTGLdynamics, to compute the contacts and Complex Graphs for all time steps of the trajectories. We investigated different methods to track changes in contacts during the simulation and created a heat map, projected onto the three-dimensional structure, visualizing these changes. We also created line plots to visualize the changes in contacts over the course of the simulation. Both visualizations helped us spot exceptionally flexible or rigid regions of the structure, as well as time points of the simulation at which major dynamics occur.
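One simple way to quantify such flexibility, sketched below under the assumption that each time step is summarized as a dictionary of per-edge contact counts, is the standard deviation of each edge's count over time: high values flag flexible interfaces, values near zero flag rigid ones. The edge labels are illustrative.

```python
from statistics import pstdev

def contact_variability(trajectory):
    """trajectory: list of {edge: contact_count} dicts, one per time
    step. Returns the population standard deviation of each edge's
    contact count over the whole trajectory; missing edges count as 0."""
    edges = set().union(*trajectory)
    return {e: pstdev([step.get(e, 0) for step in trajectory]) for e in edges}

trajectory = [
    {"A-B": 10, "B-C": 5},
    {"A-B": 10, "B-C": 1},
    {"A-B": 10, "B-C": 9},
]
print(contact_variability(trajectory))  # A-B is rigid, B-C fluctuates
```

Values like these could be mapped to a color scale per interface to produce a heat map of the kind described above.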
We introduced normalizations of the edge weights of Complex Graphs for identifying modules and predicting the assembly pathway. The idea is to normalize the number of contacts by the number of residues of a subunit. We defined five different normalizations.
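The exact formulas of the five normalizations are not given in this abstract, so the sketch below shows two plausible variants of the stated idea (dividing the raw contact count by a function of the two subunits' residue counts); both the function names and the formulas are illustrative assumptions.

```python
def normalized_weight(contacts, n_res_a, n_res_b, mode="min"):
    """Normalize a raw contact count by the subunit sizes.
    mode="min": relative to the smaller subunit (a small subunit fully
    buried in a large one still gets a high weight).
    mode="sum": relative to the combined size of both subunits."""
    if mode == "min":
        return contacts / min(n_res_a, n_res_b)
    if mode == "sum":
        return contacts / (n_res_a + n_res_b)
    raise ValueError(f"unknown mode: {mode}")
```

Normalizations like these make edge weights comparable between interfaces of very differently sized subunits.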
To identify structural and functional modules, we applied the Leiden graph clustering algorithm to the Complex Graphs of respiratory complex I and the respiratory supercomplex and examined the results for the different normalizations of the edge weights. The absolute edge weight produced the best result, identifying three of the four modules that have been defined in the literature for respiratory complex I.
We applied agglomerative hierarchical clustering to the edges of a Complex Graph to generate hypotheses for the assembly pathway. The rationale was that subunits with an extensive interface in the final structure assemble early. We tested our method against two existing methods on a data set of 21 proteins with reported assembly pathways. Our prediction outperformed the other methods and ran in feasible runtimes of a few minutes at most.
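The agglomerative idea can be sketched in a few lines: starting from singleton clusters, repeatedly merge the two clusters connected by the largest summed contact weight, so heavily interfacing subunits merge (i.e., are predicted to assemble) first. This is a generic greedy sketch of the stated rationale, not the thesis implementation; linkage details and tie-breaking are assumptions.

```python
def cluster_weight(edges, c1, c2):
    """Sum of edge weights with one endpoint in each cluster."""
    return sum(w for (a, b), w in edges.items()
               if (a in c1 and b in c2) or (a in c2 and b in c1))

def predict_assembly(subunits, edges):
    """Return the sequence of merged clusters; early merges are
    subunits predicted to assemble first."""
    clusters = [frozenset([s]) for s in subunits]
    merges = []
    while len(clusters) > 1:
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_weight(edges, clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] | clusters[j]
        merges.append(tuple(sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

edges = {("A", "B"): 10, ("B", "C"): 3, ("A", "C"): 1}
print(predict_assembly(["A", "B", "C"], edges))  # A and B merge first
```

The merge sequence directly defines a dendrogram, which is what the comparison against literature pathways below operates on.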
We also tested our method on respiratory complex I, the respiratory supercomplex, and the respiratory megacomplex. We compared the results for the different normalizations with an assembly pathway of respiratory complex I described in the literature. We transformed the assembly pathways into dendrograms and compared the predictions to the reference using the Robinson-Foulds distance and the clustering information distance. We analyzed the landscape of the clustering information distance by generating random dendrograms and showed that our result is far better than expected by chance. A detailed analysis showed that the assembly prediction using one normalization captured key features of the assembly pathway proposed in the literature.
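A minimal Robinson-Foulds computation on dendrograms, represented as nested tuples, is sketched below. Note this simplified variant counts every internal clade, including the root; textbook RF on unrooted trees ignores trivial splits and is often halved, so treat this as an illustration of the principle only.

```python
def leaf_sets(tree):
    """Return (leaves, set of clades) for a nested-tuple dendrogram,
    where a clade is the frozenset of leaves below an internal node."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves, clades = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades |= child_clades
    clades.add(leaves)
    return leaves, clades

def robinson_foulds(t1, t2):
    """Number of clades present in exactly one of the two trees."""
    _, c1 = leaf_sets(t1)
    _, c2 = leaf_sets(t2)
    return len(c1 ^ c2)

t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
print(robinson_foulds(t1, t2))  # the two (A,B)/(C,D) vs (A,C)/(B,D) clades differ
```

Identical trees yield distance 0, and the distance grows with the number of disagreeing clades, which is what makes it usable as a score against a reference pathway.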
In conclusion, we presented different applications of graph theory to automatically analyze the topology of protein complexes. Our programs run in feasible runtimes even for large complexes. We showed that graph-theoretic modeling of the protein structure can be used to analyze MD simulation data, identify modules of protein complexes and predict assembly pathways.
Despite advances in bioinformatics, custom scripts remain a source of difficulty, slowing workflow development and hampering reproducibility. Here, we introduce Vectools, a command-line tool suite that reduces reliance on custom scripts and improves reproducibility by offering a wide range of common, easy-to-use functions for table and vector manipulation. Vectools also offers a number of vector-related functions to speed up workflow development, such as simple machine learning and common statistics functions.
Biomedical data obtained during cell experiments, laboratory animal research, or human studies often display a complex distribution. Statistical identification of subgroups in research data poses an analytical challenge. Here we introduce an interactive R-based bioinformatics tool, called "AdaptGauss". It enables a valid identification of a biologically meaningful multimodal structure in the data by fitting a Gaussian mixture model (GMM) to the data. The interface allows a supervised selection of the number of subgroups. This enables the expectation maximization (EM) algorithm to fit more complex GMMs than usually obtained with a noninteractive approach. Interactively fitting a GMM to heat pain threshold data acquired from human volunteers revealed a distribution pattern with four Gaussian modes located at temperatures of 32.3, 37.2, 41.4, and 45.4 °C. Noninteractive fitting was unable to identify a meaningful data structure. The obtained results are compatible with known activity temperatures of different TRP ion channels, suggesting the mechanistic contribution of different heat sensors to the perception of thermal pain. Thus, sophisticated analysis of the modal structure of biomedical data provides a basis for the mechanistic interpretation of the observations. As it may reflect the involvement of different TRP thermosensory ion channels, the analysis provides a starting point for hypothesis-driven laboratory experiments.
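The EM algorithm the abstract relies on can be sketched for the one-dimensional case as follows. This is a plain, generic EM implementation for illustration, not the AdaptGauss code; the toy data and starting values are assumptions.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def em_gmm_1d(data, mus, sigmas, weights, steps=100):
    """Plain EM for a 1D Gaussian mixture, starting from the given
    initial means, standard deviations, and mixture weights."""
    for _ in range(steps):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate means, standard deviations, and weights
        for j in range(len(mus)):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            sigmas[j] = sqrt(var) or 1e-6  # guard against collapse
            weights[j] = nj / len(data)
    return mus, sigmas, weights

# Toy bimodal "threshold" data (assumed values, in °C)
data = [31.0, 32.0, 33.0, 44.0, 45.0, 46.0]
mus, sigmas, weights = em_gmm_1d(data, [30.0, 50.0], [2.0, 2.0], [0.5, 0.5])
```

The interactive aspect of AdaptGauss corresponds to a human choosing the number of components and the starting values passed in here, which is what lets EM reach fits a fully automatic initialization would miss.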
The aim of adaptive compound library design is to avoid the complete biological testing of a molecular screening library. Instead, guided by optimization algorithms, an "intelligent" navigation through chemical space preferentially selects compounds with desired properties. In a retrospective study, the optimization algorithms random search, simulated annealing, evolution strategies, and particle swarm optimization were systematically compared with respect to the design of libraries of serine protease inhibitors. The total number of available compound assays was limited to 300 to simulate laboratory conditions. The evolution strategies proved particularly suitable for use in a low-throughput screening campaign, as they worked efficiently with large populations and few iterations. The second part of this thesis describes the successful design of a focused library of RNA ligands. In a hybrid, prospective optimization study modeled on an iterative low-throughput screening campaign, computer-proposed molecules were tested in the laboratory. The compounds were assayed for inhibition of a specific molecular interaction in the replication cycle of HIV (the Tat-TAR interaction). Over four generations, 9 of 170 investigated compounds tested positive for inhibition of the Tat-TAR interaction (hit rate: 5.3%), while only 0.089% of the compounds of the screening library were examined. The two most potent candidates had IC50 values of 51 µM and 116 µM, respectively.
The abyssal seafloor is a mosaic of highly diverse habitats that represent the least known marine ecosystems on Earth. Some regions enriched in natural resources, such as polymetallic nodules in the Clarion-Clipperton Zone (CCZ), attract much interest because of their huge commercial potential. Since nodule mining will be destructive, baseline data are necessary to measure its impact on benthic communities. Hence, we conducted an environmental DNA and RNA metabarcoding survey of CCZ biodiversity targeting microbial and meiofaunal eukaryotes, the least known component of the deep-sea benthos. We analyzed two 18S rRNA gene regions targeting eukaryotes with a focus on Foraminifera (37F) and metazoans (V1V2), sequenced from 310 surface-sediment samples from the CCZ and other abyssal regions. Our results confirm a vast unknown deep-sea biodiversity. Over 60% of benthic foraminiferal and almost a third of eukaryotic operational taxonomic units (OTUs) could not be assigned to a known taxon. Benthic Foraminifera are more common in CCZ samples than metazoans and are dominated by clades that are known only from environmental surveys. The most striking result is the uniqueness of CCZ areas: both datasets are characterized by a high number of OTUs exclusive to the CCZ, as well as greater beta diversity compared to other abyssal regions. The alpha diversity in the CCZ is high and correlated with water depth and terrain complexity. Topography was important at a local scale, with communities at CCZ stations located in depressions being more diverse and heterogeneous than those located on slopes. This could result from eDNA accumulation, justifying the interim use of eRNA for more accurate biomonitoring surveys. Our descriptions not only support previous findings and consolidate our general understanding of deep-sea ecosystems, but also provide a data resource inviting further taxon-specific and large-scale modeling studies.
We foresee that metabarcoding will be useful for deep-sea biomonitoring efforts to capture the diversity of small taxa, but it must be validated against ground-truthing data or experimental studies.
Here, we present a peptide-based linear mixed models tool, PBLMM, a standalone desktop application for differential expression analysis of proteomics data. We also provide a Python package that allows streamlined data analysis workflows implementing the PBLMM algorithm. PBLMM is easy to use without scripting experience and calculates differential expression by peptide-based linear mixed regression models. We show that peptide-based models outperform classical methods of statistical inference of differentially expressed proteins. In addition, PBLMM exhibits superior statistical power in situations of low effect size and/or low sample size. Taken together, our tool provides an easy-to-use, high-statistical-power method to infer differentially expressed proteins from proteomics data.
Precise knowledge of the binding sites of an RNA-binding protein (RBP) is key to understanding the complex post-transcriptional regulation of gene expression. This information can be obtained from individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments. Here, we present a complete data analysis workflow to reliably detect RBP binding sites from iCLIP data. The workflow covers all steps from the initial quality control of the sequencing reads up to peak calling and quantification of RBP binding. For each tool, we explain the specific requirements for iCLIP data analysis and suggest optimised parameter settings.