Self-Admitted Technical Debt in R: Detection and Causes

Abstract

Self-Admitted Technical Debt (SATD) is primarily studied in Object-Oriented (OO) languages and traditionally commercial software. However, scientific software coded in dynamically-typed languages such as R differs in paradigm, and the semantics of its source code comments are different (i.e., more aligned with algorithms and statistics when compared to traditional software). Additionally, many Software Engineering topics are understudied in scientific software development, with SATD detection remaining a challenge for this domain. This gap adds complexity since prior work found that SATD in scientific software does not match many of the keywords identified for OO SATD, possibly hindering its automated detection. Therefore, we investigated how classification models (traditional machine learning, deep neural networks, and deep neural Pre-Trained Language Models (PTMs)) automatically detect SATD in R packages. This study examines the capabilities of these models to classify different TD types in this domain and manually analyzes the causes of each in a representative sample. Our results show that PTMs (i.e., RoBERTa) outperform other models and work well even when few comments are labelled as a particular SATD type. We also found that some SATD types are more challenging to detect. We manually identified sixteen causes, including eight new causes detected by our study. The most common cause was failure to remember, in agreement with previous studies. These findings will help R package authors automatically identify SATD in their source code and improve their code quality. In the future, checklists for R developers can also be developed by scientific communities such as rOpenSci to guarantee a higher quality of packages before submission.

Publication
In Automated Software Engineering Journal, Vol. 29(2), p. 53

Contributions

  • This is the first automated detection analysis of SATD in R programming, specifically for R packages.
  • Likewise, this is the first use of PTMs for SATD detection.
  • An augmented corpus of plausible causes of SATD (expanded from 8 to 16 causes), extracted from 1,345 comments. It expands on previously proposed categories and is publicly shared.
  • The automated detection of 12 types of SATD compared to 5 types in other SATD studies.


Acknowledgements

This study is partly supported by the Natural Sciences and Engineering Research Council of Canada, RGPIN-2021-04232 and DGECR-2021-00283 at the University of Saskatchewan, and RGPIN-2019-05175 at the University of British Columbia. We thank ANU for the open access support.


Citation

@Article{Sharma2022,
author={Sharma, Rishab
and Shahbazi, Ramin
and Fard, Fatemeh H.
and Codabux, Zadia
and Vidoni, Melina},
title="{Self-Admitted Technical Debt in R: Detection and Causes}",
journal={Automated Software Engineering},
year={2022},
month={Aug},
day={25},
volume={29},
number={2},
pages={53},
issn={1573-7535},
doi={10.1007/s10515-022-00358-6},
url={https://doi.org/10.1007/s10515-022-00358-6}
}


Venue Impact

The following is the venue impact, according to the SCImago Journal Rank (SJR):

SCImago Journal & Country Rank