Schemes for labeling semantic code clones using machine learning

Document Type

Conference Proceeding

Machine learning approaches to code clone detection often perform poorly because training samples are insufficient, and they have been limited to clones up to Type-III. Most publicly available code clone corpora are incomplete and lack labeled samples for semantic (Type-IV) clones. We present two schemes for labeling all types of clones, including Type-IV clones, restricting our study to Java code. The first scheme uses an unsupervised approach to label Type-IV clones, and the resulting labels are validated by expert Java programmers. The second is a supervised scheme that labels (or classifies) unknown samples based on the labeled samples derived from the first scheme. We evaluate both schemes on six well-known Java code clone corpora and report the quality of the produced clones in terms of kappa agreement, mean error, and accuracy scores. The results show that both schemes produce high-quality code clones, facilitating future use of machine learning to detect Type-IV clones.
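
The abstract names kappa agreement and accuracy as quality measures but does not give their formulas. As a rough illustration only, the sketch below assumes Cohen's kappa for two annotators who assign binary clone/non-clone labels to the same code pairs; the paper may use a different kappa variant. The class name, method names, and sample labels are hypothetical and are not taken from the authors' evaluation code.

```java
// Hypothetical sketch: agreement between two annotators on clone labels.
// Assumes binary labels (true = clone pair, false = not a clone pair).
final class AgreementSketch {

    // Fraction of code pairs on which both label arrays agree.
    static double accuracy(boolean[] a, boolean[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("label arrays must be non-empty and equal in length");
        }
        int agree = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                agree++;
            }
        }
        return (double) agree / a.length;
    }

    // Cohen's kappa: (po - pe) / (1 - pe), where po is the observed agreement
    // and pe is the agreement expected by chance from each annotator's label rates.
    static double cohenKappa(boolean[] a, boolean[] b) {
        double po = accuracy(a, b);
        int n = a.length;
        int aPos = 0;
        int bPos = 0;
        for (int i = 0; i < n; i++) {
            if (a[i]) aPos++;
            if (b[i]) bPos++;
        }
        double pA = (double) aPos / n;
        double pB = (double) bPos / n;
        double pe = pA * pB + (1.0 - pA) * (1.0 - pB);
        if (pe == 1.0) {
            // Both annotators used one identical label throughout; kappa is undefined.
            return Double.NaN;
        }
        return (po - pe) / (1.0 - pe);
    }

    public static void main(String[] args) {
        // Made-up labels for eight candidate clone pairs.
        boolean[] annotator1 = {true, true, true, true, false, false, false, false};
        boolean[] annotator2 = {true, true, true, false, false, false, false, true};
        System.out.println("accuracy = " + accuracy(annotator1, annotator2));
        System.out.println("kappa    = " + cohenKappa(annotator1, annotator2));
    }
}
```

With the sample labels in `main`, the annotators agree on 6 of 8 pairs, so the accuracy is 0.75. Each annotator marks half of the pairs as clones, so chance agreement is 0.5 and kappa is 0.5.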
