Developer communication through email, chat, or issue report comments consists largely of unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces, and identifiers. These technical artifacts are a valuable source of knowledge about the technical parts of the system, with applications ranging from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content makes the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique for extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well understood, fast, readily available across platforms, and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.
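The core idea of using a spell checker to separate natural language from technical content can be illustrated with a minimal sketch. This is not the authors' implementation: the word list `KNOWN_WORDS` is a hypothetical stand-in for a real spell checking tool, and the function names `is_natural_language` and `untangle` are assumptions made for illustration. Tokens the "spell checker" accepts are treated as natural language, and everything else is flagged as a candidate technical artifact.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for a real spell checker's dictionary (assumption).
KNOWN_WORDS = {
    "the", "method", "throws", "a", "when", "list", "is", "empty",
}

TOKEN_RE = re.compile(r"\S+")  # split on whitespace only


def is_natural_language(token: str) -> bool:
    """True if the token, minus surrounding punctuation, is a known word."""
    word = token.strip(".,;:!?()\"'")
    return word.isalpha() and word.lower() in KNOWN_WORDS


def untangle(text: str) -> Tuple[List[str], List[str]]:
    """Split text into (natural language tokens, candidate technical tokens)."""
    natural: List[str] = []
    technical: List[str] = []
    for token in TOKEN_RE.findall(text):
        if is_natural_language(token):
            natural.append(token)
        else:
            technical.append(token)
    return natural, technical


sentence = ("The method parseConfig() throws a NullPointerException "
            "when the list is empty.")
natural, technical = untangle(sentence)
print(technical)
```

In this toy input, the identifiers `parseConfig()` and `NullPointerException` fail the dictionary lookup and end up in the technical list. A realistic setup would replace the word list with an actual spell checker and would likely need extra handling for jargon and abbreviations.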
Download the Full Paper
To learn more about Lightweight Techniques for Mining Unstructured Data, download the full paper.
If you would like to cite the research in your own work, please use the following citation:
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.