Heß, Martin Philipp (2018)
Visual Search and Analysis in Molecular Biology.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
The computation of protein sequence alignments is one of the most fundamental tasks in computational biology. Pairwise sequence alignments (PSA) form the basis for the detection of homologous protein sequences. Multiple sequence alignments (MSA) can provide insights into structural and functional relationships across a set of proteins. In the alignment process, evolutionarily, functionally, or structurally related regions between the sequences are identified and aligned according to a particular scoring model. Evolutionary substitution events are usually modeled by substitution matrices, while insertion and deletion events are modeled by specific gap penalties.
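To make the role of the scoring model concrete, the following minimal sketch computes a global pairwise alignment score with the classic Needleman-Wunsch recurrence. The function names, the toy match/mismatch scores, and the linear gap penalty of -4 are illustrative assumptions and are not taken from the thesis; a real application would plug in a full substitution matrix and typically affine gap penalties.

```python
def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if x == y else -1


def global_alignment_score(a, b, score=toy_score, gap=-4):
    """Needleman-Wunsch score of aligning a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # substitution
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]
```

Changing either `score` or `gap` can change which alignment is optimal, which is why the choice of substitution matrix and gap penalties matters for alignment quality.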
The quality of sequence alignments depends heavily on the chosen scoring model, the alignment algorithm, and the sequence data itself. The selection of the best parameters for a given alignment task is, however, non-trivial. Thus, many researchers regularly use potentially suboptimal default parameters, including biased and dated substitution matrices. In addition, the construction of MSAs is an NP-complete task, so the optimal alignment is generally unknown, even for a fixed parameter set. MSA algorithms thus rely on heuristics to approximate the optimal MSA, resulting in alignments of suboptimal quality that often require manual refinement. Assessing the quality of MSAs is also problematic, since most established quality measures are limited in their ability to detect poorly aligned regions.
In this thesis, we present several approaches and concepts to improve the accuracy of sequence alignments. In particular, this includes two novel substitution models to enable existing methods to produce better alignments as well as approaches to enable experts and non-experts to assess the quality of the computed MSAs and to effectively refine them to improve their accuracy.
We present the novel CorBLOSUM substitution model that fixes a substantial programming error in the original BLOSUM code. This error negatively affects the homologous sequence search performance of the original BLOSUM matrices as well as their revised RBLOSUM variants. Our exhaustive benchmark analysis based on 51 different ASTRAL subsets shows that CorBLOSUM matrices usually detect more true homologs when compared with their incorrect BLOSUM and RBLOSUM counterparts. For this reason, using CorBLOSUM matrices instead of BLOSUM can substantially improve the results of homologous sequence search.
Furthermore, we propose the novel PFASUM substitution model that is derived from Pfam seed alignments using our novel PFASUM algorithm. Unlike conventional substitution models, our PFASUM matrices are thus based on manually curated expert ground truth data that reflects the currently known sequence space. Additionally, our PFASUM algorithm incorporates several mechanisms to avoid oversampling while handling ambiguous amino acids in a reasonable way. As shown by our thorough performance evaluations, these features enable PFASUM matrices to significantly outperform widely used conventional matrices in homologous sequence search. Additionally, using PFASUM matrices for the construction of MSAs also results in more accurate MSAs in most cases.
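As a rough illustration of how a substitution matrix can be derived from trusted alignments, the sketch below computes BLOSUM-style log-odds scores in half-bit units from aligned columns. It is not the PFASUM algorithm: the function name and input format are assumptions, and it omits sequence weighting, the oversampling safeguards, and the handling of ambiguous amino acids mentioned above.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(columns):
    """Half-bit log-odds scores from alignment columns (gaps given as '-')."""
    pair_counts = Counter()
    for col in columns:
        residues = [r for r in col if r != "-"]
        for x, y in combinations(residues, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())

    # Observed frequency of each unordered residue pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, implied by the pair frequencies.
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Expected frequency if residues paired independently of each other.
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores
```

Pairs observed more often than expected by chance receive positive scores, and pairs observed less often receive negative scores; pairs that never occur in the input are simply absent from the result in this sketch.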
Besides the aforementioned substitution models, we present a novel visual analysis and comparison approach for protein MSAs. It allows users to detect reliably aligned and misaligned regions in protein MSAs without much effort. This is achieved by automatically comparing alternative MSAs of the same sequence set and visualizing consistently aligned regions and uncertain areas in the MSAs. Our evaluation shows that our system allows users to successfully assess the accuracy of MSAs and to effectively determine uncertain regions for further refinement. Additionally, it can be used to visually assess the impact of different alignment algorithms and parameterizations on the resulting alignments.
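One simple way to compare alternative MSAs of the same sequence set is to check, for every column of a reference alignment, how many of its aligned residue pairs also appear in the alternatives. The sketch below illustrates that idea only; the function names and the support measure are assumptions and do not reproduce the comparison or visualization implemented in the thesis.

```python
from itertools import combinations


def pairs_per_column(msa):
    """For each column, the residue pairs it aligns as ((seq, pos), (seq, pos))."""
    next_pos = [0] * len(msa)  # next ungapped residue index per sequence
    result = []
    for col in range(len(msa[0])):
        present = []
        for seq, row in enumerate(msa):
            if row[col] != "-":
                present.append((seq, next_pos[seq]))
                next_pos[seq] += 1
        result.append(list(combinations(present, 2)))
    return result


def column_support(reference, alternatives):
    """Per reference column, the fraction of its pairs confirmed by alternatives.

    Columns that align no residue pair (or calls without alternatives)
    yield None, since no support value is defined for them.
    """
    alt_pairs = [
        {pair for col in pairs_per_column(msa) for pair in col}
        for msa in alternatives
    ]
    support = []
    for col_pairs in pairs_per_column(reference):
        if not col_pairs or not alt_pairs:
            support.append(None)
            continue
        hits = sum(pair in pairs for pair in col_pairs for pairs in alt_pairs)
        support.append(hits / (len(col_pairs) * len(alt_pairs)))
    return support
```

Columns with support close to 1 are aligned consistently across the alternatives, while low values mark uncertain regions that may deserve manual inspection.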
In order to outsource the cumbersome task of manual MSA refinement, we present our scientific discovery game Bionigma. It abstracts the alignment problem in the form of a puzzle game. In these puzzles, the amino acids in the alignment are represented by different game tokens. Just as one would align beads of identical color on an abacus, the players must align similar tokens to improve their score. In doing so, the players successively refine the real MSA in a playful manner. Several user studies show that Bionigma is fun to play and delivers a true game experience to the players. Additionally, our results demonstrate that casual players can successfully refine protein MSAs. In particular, they can even produce more accurate alignments than automatic methods.
In summary, the approaches and concepts presented here can help to significantly improve the accuracy of protein sequence alignments. Notably, our methods enable biologists without in-depth knowledge of sequence alignments to generate better results without much effort.
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2018 | ||||
Author(s): | Heß, Martin Philipp | ||||
Type of entry: | First publication | ||||
Title: | Visual Search and Analysis in Molecular Biology | ||||
Language: | English | ||||
Referees: | Goesele, Prof. Dr. Michael ; Weigt, Prof. Dr. Martin ; Hamacher, Prof. Dr. Kay | ||||
Year of publication: | 2018 | ||||
Place: | Darmstadt | ||||
Date of oral examination: | 19 December 2017 | ||||
URL / URN: | http://tuprints.ulb.tu-darmstadt.de/7306 | ||||
URN: | urn:nbn:de:tuda-tuprints-73062 | ||||
Dewey Decimal Classification (DDC): | 000 Generalities, computer science, information science > 004 Computer science; 500 Natural sciences and mathematics > 570 Life sciences, biology |
||||
Department(s)/Division(s): | 20 Department of Computer Science; 20 Department of Computer Science > Graphics, Capture and Massively Parallel Computing |
||||
Date deposited: | 29 Jul 2018 19:55 | ||||
Last modified: | 29 Jul 2018 19:55 | ||||