Sie befinden Sich nicht im Netzwerk der Universität Paderborn. Der Zugriff auf elektronische Ressourcen ist gegebenenfalls nur via VPN oder Shibboleth (DFN-AAI) möglich. mehr Informationen...
A Code Centric Evaluation of C/C++ Vulnerability Datasets for Deep Learning Based Vulnerability Detection Techniques
Ist Teil von
16th Innovations in Software Engineering Conference, 2023, p.1-10
Ort / Verlag
New York, NY, USA: ACM
Erscheinungsjahr
2023
Quelle
ACM Digital Library Complete
Beschreibungen/Notizen
Recent years have witnessed tremendous progress in NLP-based code comprehension via deep neural networks (DNN) learning, especially Large Language Models (LLMs). While the original application of LLMs is focused on code generation, there have been attempts to extend the application to more specialized tasks, like code similarity, author attribution, code repairs, and so on. As data plays an important role in the success of any machine learning approach, researchers have also proposed several benchmarks which are coupled with a specific task at hand.
It is well known in the machine learning (ML) community that the presence of biases in the dataset affects the quality of the ML algorithm in a real-world scenario. This paper evaluates several existing datasets from DNN’s application perspective. We specifically focus on training datasets of C/C++ language code. Our choice of language stems from the fact that while LLM-based techniques have been applied and evaluated on programming languages like Python, JavaScript, and Ruby, there is not much LLM research for C/C++. As a result, datasets generated synthetically or from real-world codes are in individual research work. Consequently, in the absence of a uniform dataset, such works are hard to compare with each other.
In this work, we aim to achieve two main objectives– 1. propose code-centric features that are relevant to security program analysis tasks like vulnerability detection; 2. a thorough (qualitative and quantitative) examination of the existing code datasets that demonstrate the main characteristics of the individual datasets to have a clear comparison. Our evaluation finds exciting facts about existing datasets highlighting gaps that need to be addressed.