da Silva Santos, Pedro Bispo (2021)
Multimodal Classification of Audiovisual Content.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00018590
Dissertation, first publication, publisher's version
Abstract
This thesis is concerned with multimodal machine learning for the digital humanities. Multimodal machine learning integrates vision, speech, and language to solve tasks such as sentiment analysis, emotion recognition, personality recognition, and deceptive behaviour detection. These tasks benefit from additional modalities because human communication is multimodal by nature. The intersection of the humanities and computational methods defines the so-called digital humanities, i.e., the subset of the humanities and social sciences that leverages digital methods to conduct research. This thesis supports the claim that using audiovisual modalities when training computational models in the digital humanities can improve performance on any labour-intensive task where annotators rely on audiovisual sources of information to annotate the data. We hypothesise that audiovisual content studied in areas of the humanities and social sciences such as psychology, pedagogy, and communication science can be explained and categorised by audiovisual processing techniques. These techniques can increase the productivity of humanities and social sciences researchers by bootstrapping their analyses with machine learning and allowing their research to scale to much larger amounts of data. Beyond that, such methods could also support more socially aware virtual agents, enabling more sophisticated human-computer interaction and enriching the user experience of commercial applications.

Problems tackled with natural language processing techniques sometimes hit an upper bound imposed by the limited knowledge present in textual information. Humans use prosody to convey meaning, so machine learning models that predict the sentiment of transcribed speech can lose much information when dealing solely with the text modality. Persuasiveness prediction is another good example, since factors beyond argumentation, such as prosody, visual appearance, and body language, can persuade people. Previous work in opinion mining and persuasiveness prediction has shown that multimodal approaches are quite successful when combining multiple modalities. However, textual transcripts and visual information might not be available due to technical constraints, so one may ask how accurately machine learning models can predict people's opinions using only prosodic information. Moreover, most work in computational paralinguistics relies on cumbersome feature-engineering approaches, so another question is whether domain-agnostic methods work in this field. Our results show that a simple recurrent neural architecture trained on Mel-Frequency Cepstral Coefficients (MFCCs) can predict speakers' opinions.
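The record does not spell out the prosody-only model. Purely as an illustration of the kind of "simple recurrent architecture over MFCCs" described above, the sketch below extracts MFCC frames with librosa and feeds them to a single-layer GRU classifier in PyTorch; the class name `OpinionGRU`, the hyperparameters, and the feature settings are assumptions, not the thesis's actual configuration.

```python
# Illustrative sketch only (not the thesis code): a GRU classifier over MFCC frames.
import librosa
import torch
import torch.nn as nn

def mfcc_frames(wav_path, n_mfcc=13, sr=16000):
    """Load an audio file and return a (time, n_mfcc) tensor of MFCC frames."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, time)
    return torch.tensor(mfcc.T, dtype=torch.float32)          # (time, n_mfcc)

class OpinionGRU(nn.Module):
    """Single-layer GRU whose final hidden state feeds a binary opinion classifier."""
    def __init__(self, n_mfcc=13, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, time, n_mfcc)
        _, h = self.gru(x)           # h: (1, batch, hidden)
        return self.out(h[-1])       # logits: (batch, n_classes)

# Toy usage with a random "utterance" in place of a real recording.
model = OpinionGRU()
dummy = torch.randn(1, 200, 13)      # 200 MFCC frames, 13 coefficients each
print(model(dummy).shape)            # torch.Size([1, 2])
```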
Speech is not the only channel besides text that carries critical information; the visual channel is also significant. Humans produce a wide range of facial expressions, defined as cues under Brunswik's Lens Model. Researchers in the humanities and social sciences try to understand how relevant those signals are by manually annotating information that might be present in the facial expressions of the subjects under analysis. However, such annotation is very time-consuming and prone to human error caused by fatigue or lack of training. We show that low- and high-level features extracted automatically with recent computer vision methods can explain visual data collected by researchers in the humanities and social sciences, especially in areas such as pedagogy and communication science. We also demonstrate that an end-to-end approach can automatically predict the psychological construct of intrinsic motivation.

Another problem widely studied in political science is understanding the persuasive factors in speeches and debates. For instance, Nagel et al. (2012) evaluated which features in all three modalities (text, speech, and vision) shaped the audience's impression in the national election debate between Angela Merkel and Gerhard Schroeder. However, no previous work in the literature presents an automated approach to predicting what impression a politician forms during a debate. Our results reveal that automatically extracted high-level features in a multimodal approach can indicate which elements of political communication mould an audience's impression and are also useful for training machine learning models to predict it.

We ran the experiments in this thesis on data from psychology, pedagogy, and communication science research, providing empirical evidence for the hypothesis that audiovisual content from the humanities and social sciences can be explained and automatically classified by audiovisual processing methods. This thesis presents new applications of multimodal machine learning in the digital humanities, presents different ways of modelling the tasks, and reinforces the well-known issue of fairness in artificial intelligence. In conclusion, this thesis strengthens the notion that audiovisual modalities are primary communication channels that should be carefully analysed and explored in multimodal machine learning for the digital humanities.
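The abstract does not say how the per-modality features for impression prediction are combined. A common baseline is feature-level fusion, i.e., concatenating text, audio, and vision feature vectors before a standard classifier; the sketch below illustrates that idea with random placeholder features and scikit-learn, and should not be read as the approach actually used in the thesis.

```python
# Illustrative feature-level fusion baseline (assumed, not from the thesis).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_clips = 120
text_feats  = rng.normal(size=(n_clips, 50))    # e.g., lexical / argumentative features
audio_feats = rng.normal(size=(n_clips, 30))    # e.g., prosodic statistics
video_feats = rng.normal(size=(n_clips, 40))    # e.g., facial-expression descriptors
labels      = rng.integers(0, 2, size=n_clips)  # e.g., positive vs. negative impression

# Concatenate the modality features into one vector per clip and train a linear classifier.
X = np.hstack([text_feats, audio_feats, video_feats])
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=5).mean())
```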
| Field | Value |
|---|---|
| Type of entry | Dissertation |
| Published | 2021 |
| Author(s) | da Silva Santos, Pedro Bispo |
| Kind of entry | First publication |
| Title | Multimodal Classification of Audiovisual Content |
| Language | English |
| Referees | Gurevych, Prof. Dr. Iryna ; Maurer, Prof. Dr. Marcus ; Mihalcea, Prof. Dr. Rada |
| Year of publication | 2021 |
| Place of publication | Darmstadt |
| Collation | xii, 156 pages |
| Date of oral examination | 8 March 2021 |
| DOI | 10.26083/tuprints-00018590 |
| URL / URN | https://tuprints.ulb.tu-darmstadt.de/18590 |
| Status | Publisher's version |
| URN | urn:nbn:de:tuda-tuprints-185904 |
| Dewey Decimal Classification (DDC) | 000 Generalities, computer science, information science > 004 Computer science |
| Department(s)/Research area(s) | 20 Department of Computer Science ; 20 Department of Computer Science > Ubiquitous Knowledge Processing |
| Date deposited | 28 Jul 2021 08:16 |
| Last modified | 03 Aug 2021 06:59 |
| PPN | |