A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Herbold, Steffen ; Trautsch, Alexander ; Ledel, Benjamin ; Aghamohammadi, Alireza ; Ghaleb, Taher Ahmed ; Chahal, Kuljit Kaur ; Bossenmaier, Tim ; Nagaria, Bhaveet ; Makedonski, Philip ; Ahmadabadi, Matin Nili ; Szabados, Kristof ; Spieker, Helge ; Madeja, Matej ; Hoy, Nathaniel ; Lenarduzzi, Valentina ; Wang, Shangwen ; Rodríguez-Perez, Gema ; Colomo-Palacios, Ricardo ; Verdecchia, Roberto ; Singh, Paramvir ; Qin, Yihao ; Chakroborti, Debasish ; Davis, Willard ; Walunj, Vijay ; Wu, Hongjun ; Marcilio, Diego ; Alam, Omar ; Aldaeej, Abdullah ; Amit, Idan ; Turhan, Burak ; Eismann, Simon ; Wickert, Anna-Katharina ; Malavolta, Ivano ; Sulir, Matus ; Fard, Fatemeh ; Henley, Austin Z. ; Kourtzanidis, Stratos ; Tuzun, Eray ; Treude, Christoph ; Shamasbi, Simin Maleki ; Pashchenko, Ivan ; Wyrich, Marvin ; Davis, James ; Serebrenik, Alexander ; Albrecht, Ella ; Aktas, Ethem Utku ; Strüber, Daniel ; Erbel, Johannes (2021)
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits.
doi: 10.48550/arXiv.2011.06244
Report, Bibliography

Abstract

Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant to the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowdsourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
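
The consensus rule described under Methods (four labels per line, consensus when at least three agree) can be illustrated with a minimal Python sketch. The function name `consensus_label`, the label strings, and the line identifiers are hypothetical and chosen only for illustration; this is not the tooling used in the study.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agreement` raters, else None.

    With four raters and a threshold of three, at most one label can reach
    the threshold, so the result is unambiguous.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical votes: four participants label each changed line.
votes = {
    "Foo.java:42": ["bugfix", "bugfix", "bugfix", "refactoring"],
    "Foo.java:43": ["bugfix", "whitespace", "bugfix", "documentation"],
}

# Lines without consensus map to None and would need separate handling.
result = {line: consensus_label(v) for line, v in votes.items()}
```

In this sketch, a line where the participants split two against two (or otherwise fail to reach three matching labels) yields `None`, which corresponds to the "no consensus" case mentioned in the abstract.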

Item type: Report
Published: 2021
Type of entry: Bibliography
Title: A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Language: English
Publication date: 13 October 2021
Publisher: arXiv
Series: Computer Science
Edition: Version 4
DOI: 10.48550/arXiv.2011.06244
URL / URN: https://arxiv.org/abs/2011.06244

Additional information:

Accepted at Empirical Software Engineering, Springer Publishing

Division(s): 20 Department of Computer Science
20 Department of Computer Science > Software Technology
Date deposited: 11 Jan 2022 10:37
Last modified: 10 Aug 2023 13:45