Understanding protein complexes by graph-theoretical analysis of their topology

Wolf, Jan Niclas

doi:10.21248/gups.77274

Proteins are biological macromolecules that play essential roles in all living organisms. Proteins often bind to each other, forming complexes to fulfill their function. Such protein complexes assemble along an ordered pathway. An assembled protein complex can often be divided into structural and functional modules. Knowing the order of assembly and the modules of a protein complex is important for understanding biological processes and treating diseases related to misassembly. Typical structures in the Protein Data Bank (PDB) contain two to three subunits and a few thousand atoms. Recent developments have led to large protein complexes being resolved. The increasing number and size of protein complexes demand computational assistance for visualization and analysis. One such large protein complex is respiratory complex I, which comprises 45 subunits in Homo sapiens. Complex I is a well-understood protein complex that served as a case study to validate our methods.
Our aim was to analyze time-resolved Molecular Dynamics (MD) simulation data, identify modules of a protein complex and generate hypotheses for the assembly pathway of a protein complex. For that purpose, we abstracted the topology of protein complexes to Complex Graphs of the Protein Topology Graph Library (PTGL). The subunits are represented as vertices and spatial contacts as edges. The edges are weighted with the number of contacts, which are determined using a distance threshold. This allowed us to apply graph-theoretic methods to visualize and analyze protein complexes.
We extended the implementations of two methods to compute Complex Graphs in feasible runtimes. The first method skipped checks for contacts using the information on which residues are sequential neighbors. We extended the method to protein complexes and structures containing ligands. The second method introduced spheres encompassing all atoms of a subunit and skipped the check for contacts if the corresponding spheres do not overlap.
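The sphere-based shortcut described above can be illustrated with a minimal sketch. It is not the PTGL implementation: the function names, the centroid-based (non-minimal) sphere, the atom-pair contact definition and the default threshold of 4.0 are all assumptions made for illustration. The spheres are compared with the threshold added to the sum of their radii so that no contact can be missed by the skip.

```python
import math
from itertools import combinations

def bounding_sphere(atoms):
    """Return (center, radius) of a sphere enclosing all atoms.

    Assumption: the center is the centroid, so the sphere is valid
    but not necessarily the smallest enclosing sphere.
    """
    n = len(atoms)
    center = tuple(sum(a[i] for a in atoms) / n for i in range(3))
    radius = max(math.dist(center, a) for a in atoms)
    return center, radius

def count_contacts(atoms_a, atoms_b, threshold):
    """Count atom pairs closer than the threshold (hypothetical contact definition)."""
    return sum(1 for a in atoms_a for b in atoms_b if math.dist(a, b) <= threshold)

def complex_graph_edges(subunits, threshold=4.0):
    """Build weighted edges between subunits, skipping pairs whose spheres are too far apart.

    subunits: dict mapping a subunit name to a list of (x, y, z) atom coordinates.
    Returns (edges, skipped), where edges maps (u, v) to the number of contacts
    and skipped is the number of subunit pairs that needed no pairwise check.
    """
    spheres = {name: bounding_sphere(atoms) for name, atoms in subunits.items()}
    edges, skipped = {}, 0
    for u, v in combinations(sorted(subunits), 2):
        (cu, ru), (cv, rv) = spheres[u], spheres[v]
        # No atom pair can be within the threshold if the centers are
        # farther apart than both radii plus the threshold.
        if math.dist(cu, cv) > ru + rv + threshold:
            skipped += 1
            continue
        n = count_contacts(subunits[u], subunits[v], threshold)
        if n:
            edges[(u, v)] = n
    return edges, skipped
```

For a complex with one distant subunit, both pairs involving that subunit are skipped without any atom-level distance computation, which is where the saving comes from in large complexes.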
Both methods combined allowed skipping up to 93 % of the checks for contacts for sample complexes of 40 subunits, compared to up to 10 % for the previous implementation. We showed that the runtime of the combined method scaled linearly with the number of atoms, compared to a non-linear scaling of the previous implementation.
We implemented a third method fixing the assignment of an orientation to secondary structure elements. We placed a three-dimensional vector in each secondary structure element and computed the angle between secondary structure elements to assign an orientation. This method reduced the runtime, especially for large structures such as the capsid of human immunodeficiency virus, for which the runtime decreased from 43 to less than 9 hours.
The feasible runtimes allowed us to investigate two data sets of MD trajectories of respiratory complex I of Thermus thermophilus that we received. The data sets differ only by whether ubiquinone is bound to the complex. We implemented a pipeline, PTGLdynamics, to compute the contacts and Complex Graphs for all time steps of the trajectories. We investigated different methods to track changes of contacts during the simulation and created a heat map projected onto the three-dimensional structure to visualize the changes. We also created line plots to visualize the changes of contacts over the course of the simulation. Both visualizations helped to spot exceptionally flexible or rigid regions of the structure and time points of the simulation at which major dynamics occur.
We introduced normalizations of the edge weights of Complex Graphs for identifying modules and predicting the assembly pathway. The idea is to normalize the number of contacts by the number of residues of a subunit. We defined five different normalizations. To identify structural and functional modules, we applied the Leiden graph clustering algorithm to the Complex Graphs of respiratory complex I and the respiratory supercomplex.
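The vector-based orientation assignment mentioned above can be sketched as follows. The abstract does not say how the vector is placed or which angle cut-offs are used, so everything specific here is an assumption: the vector runs from the first to the last C-alpha atom of an element, and the labels and the cut-offs of 65 and 115 degrees are hypothetical.

```python
import math

def sse_vector(ca_coords):
    """Vector from the first to the last C-alpha atom of a secondary structure element.

    Assumption: this is only one possible way to place a vector in an element.
    """
    first, last = ca_coords[0], ca_coords[-1]
    return tuple(l - f for f, l in zip(first, last))

def angle_deg(u, v):
    """Angle between two 3D vectors in degrees."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0.0:
        raise ValueError("cannot compute an angle for a zero-length vector")
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def orientation(ca_a, ca_b, parallel_max=65.0, antiparallel_min=115.0):
    """Classify the relative orientation of two elements (hypothetical cut-offs)."""
    angle = angle_deg(sse_vector(ca_a), sse_vector(ca_b))
    if angle <= parallel_max:
        return "parallel"
    if angle >= antiparallel_min:
        return "antiparallel"
    return "mixed"
```

Each element pair costs one dot product and two norms, independent of the number of atoms in the elements, which is consistent with the kind of speed-up reported for large structures.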
We examined the results for the different normalizations of the weights of the Complex Graphs. The absolute edge weight produced the best result, identifying three of the four modules that have been defined in the literature for respiratory complex I.
We applied agglomerative hierarchical clustering to the edges of a Complex Graph to create hypotheses of the assembly pathway. The rationale was that subunits with an extensive interface in the final structure assemble early. We tested our method against two existing methods on a data set of 21 proteins with reported assembly pathways. Our prediction outperformed the other methods and ran in a few minutes at most. We also tested our method on respiratory complex I, the respiratory supercomplex and the respiratory megacomplex. We compared the results for the different normalizations with an assembly pathway of respiratory complex I described in the literature. We transformed the assembly pathways to dendrograms and compared the predictions to the reference using the Robinson-Foulds distance and the clustering information distance. We analyzed the landscape of the clustering information distance by generating random dendrograms and showed that our result is far better than expected at random. We showed in a detailed analysis that the assembly prediction using one normalization was able to capture key features of the assembly pathway that has been proposed in the literature.
In conclusion, we presented different applications of graph theory to automatically analyze the topology of protein complexes. Our programs run in feasible runtimes even for large complexes. We showed that graph-theoretic modeling of the protein structure can be used to analyze MD simulation data, identify modules of protein complexes and predict assembly pathways.
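The rationale that extensive interfaces assemble early can be illustrated with a small greedy sketch of agglomerative clustering on a weighted Complex Graph. It is not the method of the thesis: the function names are hypothetical, and the linkage rule (sum of contact weights between two clusters) and the use of unnormalized weights are assumptions chosen for simplicity.

```python
def link_weight(a, b, edges):
    """Sum of edge weights between clusters a and b (assumed linkage rule)."""
    return sum(w for (u, v), w in edges.items()
               if (u in a and v in b) or (u in b and v in a))

def predict_assembly(edges):
    """Greedily merge the two clusters sharing the largest interface.

    edges: dict mapping (subunit, subunit) to a contact count.
    Returns the merge steps in order as (members_a, members_b, weight);
    read top to bottom, they form a hypothetical assembly order.
    """
    clusters = sorted({frozenset([n]) for pair in edges for n in pair}, key=sorted)
    steps = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                w = link_weight(clusters[i], clusters[j], edges)
                if w > 0 and (best is None or w > best[0]):
                    best = (w, i, j)
        if best is None:
            break  # the remaining clusters share no contacts
        w, i, j = best
        steps.append((sorted(clusters[i]), sorted(clusters[j]), w))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return steps
```

For the toy graph `{("A", "B"): 120, ("B", "C"): 40, ("A", "C"): 10, ("C", "D"): 30}`, the sketch merges A with B first, then adds C, then D. The sequence of merges is a dendrogram, which is the form in which predictions can be compared to a reference pathway.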
Proteins are biological macromolecules that play an essential role in all living organisms. To fulfill their function, proteins usually form complexes. Such protein complexes assemble along an ordered pathway. An assembled protein complex can often be divided into structural and functional modules. Knowing the order of assembly and the modules of a protein complex is important for understanding biological processes and curing diseases. Typical structures of the Protein Data Bank (PDB) contain two to three subunits. Recent developments have led to large protein complexes being resolved. The size of the protein complexes requires computational support for visualization and analysis. One example is human respiratory complex I with its 45 subunits. Complex I serves as our case study for validating our methods. Our aim was to analyze time-resolved Molecular Dynamics (MD) simulation data, identify modules of a protein complex and generate hypotheses for the assembly pathway of a protein complex. To this end, we abstracted the topology of protein complexes to Complex Graphs of the Protein Topology Graph Library (PTGL). The subunits are represented by vertices and spatial contacts by edges. The edges are weighted with the number of contacts, which are based on a distance threshold. This allowed us to apply graph-theoretic methods to visualize and analyze protein complexes. We extended the implementations of two methods to avoid superfluous contact computations. Combined, the two methods allowed skipping up to 93 % of the contact computations in the test data set, compared to up to 10 % for the previous implementation. We showed that the runtime of the combined methods scales linearly with the number of atoms.
We implemented a third method for determining the orientation between secondary structure elements. The method reduced the runtime, especially for large structures. The acceptable runtimes enabled us to investigate two data sets of MD trajectories of respiratory complex I of Thermus thermophilus, which differed only in the presence of ubiquinone. We implemented a pipeline, PTGLdynamics, to compute the contacts and Complex Graphs for all time steps of the trajectories. We tracked the changes of contacts over the simulation and created a heat map, projected onto the three-dimensional structure, to visualize the changes. We also visualized the changes of contacts in line plots. Both visualizations helped to find exceptionally flexible or rigid regions of the structure and time points of the simulation at which major dynamics occur. We introduced five different normalizations of the edge weights of Complex Graphs to normalize the number of contacts by the lengths of the subunits. We applied the Leiden graph clustering algorithm to Complex Graphs of respiratory complex I and the respiratory supercomplex. The absolute edge weight gave the best result, identifying three of the four modules described in the literature. We applied agglomerative hierarchical clustering to the edges of a Complex Graph to generate hypotheses about the assembly pathway. The idea was that subunits with an extensive interface in the final structure assemble early. We tested our method against two existing methods on a data set of 21 proteins. Our predictions outperformed the other methods and ran in acceptable runtimes of a few minutes. We also tested our method on respiratory complex I, the supercomplex and the megacomplex.
We compared the results for the different normalizations with an assembly pathway from the literature. We transformed the assembly pathways to dendrograms and compared the predictions to the reference using the Robinson-Foulds distance and the clustering information distance. We analyzed the landscape of the clustering information distance by generating random dendrograms and showed that our result is far better than expected at random. We showed in a detailed analysis that the assembly prediction captured key features described in the literature. In summary, we presented different applications of graph theory to automatically analyze the topology of protein complexes. Our programs run in acceptable runtimes, even for large complexes. We showed that graph-theoretic modeling of protein structures can be used to analyze MD simulation data, identify modules of protein complexes and predict assembly pathways.

Metadata
Author:	Jan Niclas Wolf
URN:	urn:nbn:de:hebis:30:3-772744
DOI:	https://doi.org/10.21248/gups.77274
Place of publication:	Frankfurt am Main
Referee:	Ina Koch, Marcel Holger Schulz
Document Type:	Doctoral Thesis
Language:	English
Date of Publication (online):	2023/11/01
Year of first Publication:	2023
Publishing Institution:	Universitätsbibliothek Johann Christian Senckenberg
Granting Institution:	Johann Wolfgang Goethe-Universität
Date of final exam:	2023/04/04
Release Date:	2023/11/01
Tag:	bioinformatics; graph theory; protein assembly; protein structure; respiratory complex I
Page Number:	162
HeBIS-PPN:	512825092
Institutes:	Informatik und Mathematik
Dewey Decimal Classification:	0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
	5 Natural sciences and mathematics / 57 Life sciences; biology / 570 Life sciences; biology
Collections:	Universitätspublikationen
Licence (German):	Deutsches Urheberrecht

Open Access
