Date Published: June 14, 2019
Publisher: Public Library of Science
Author(s): Jingxian Liao, Guowei Yang, David Kavaler, Vladimir Filkov, Prem Devanbu, Yong-Yeol Ahn.
Successful open source software (OSS) projects comprise freely observable, task-oriented social networks with hundreds or thousands of participants and large amounts of (textual and technical) discussion. The sheer volume of interactions and participants makes it challenging for participants to find relevant tasks, discussions and people. Tagging (e.g., @AmySmith) is a socio-technical practice that enables more focused discussion. By tagging important and relevant people, discussions can be advanced more effectively. However, for all but a few insiders, it can be difficult to identify important and/or relevant people. In this paper we study tagging in OSS projects from a socio-linguistic perspective. First, we argue that textual content per se reveals a great deal about the status and identity of who is speaking and who is being addressed. Next, we suggest that this phenomenon can be usefully modeled using modern deep-learning methods. Finally, we illustrate the value of these approaches with tools that could help people find the important and relevant people for a discussion.
In distributed software engineering, people work both individually and in teams to complete programming tasks, using collaborative platforms such as GitHub. These platforms support social interactions using mechanisms such as tagging. As an example, a user named evaphx uses “@” tagging to call on another user, djones, for help with issue 1189 in the puma project.
Social discussions in software projects are sizable, highly technical, and important for project success [6–8]. In GitHub, the norm is to use @-mentions, similar to a “tag” in other social networking systems. By @-mentioning another user, one can, inter alia, call their attention to a particular issue [9, 10], request feedback, or ask for help with a task-related action. The idea of assigning tasks to appropriate individuals has been discussed extensively in Software Engineering, especially in bug [11, 12] and pull request assignment [13, 14]. Proper assignment is of great importance, as it has been shown that a minority of individuals do most of the work in open source projects. It follows that a system which can automatically identify the most relevant and responsive individuals would help developers, especially during onboarding.
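To make the notion of an @-mention concrete, the following minimal sketch pulls mentioned usernames out of a comment body. It is an illustration only, not the extraction procedure used in the paper: the function name `extract_mentions` and the regular expression are assumptions, and the pattern is a simplification (for instance, a team mention such as @org/team would yield only the organization part).

```python
import re

# Assumed simplification of GitHub usernames: alphanumerics and hyphens,
# at most 39 characters, not starting or ending with a hyphen.
# The lookbehind skips "@" preceded by a word character, "." or "@",
# which filters out most email addresses.
MENTION_RE = re.compile(
    r"(?<![\w.@])@([A-Za-z0-9](?:[A-Za-z0-9-]{0,37}[A-Za-z0-9])?)"
)


def extract_mentions(text):
    """Return the usernames @-mentioned in text, in order of appearance."""
    return MENTION_RE.findall(text)


if __name__ == "__main__":
    comment = "@djones could you take a look? cc @AmySmith. Mail: eva@example.com"
    print(extract_mentions(comment))
```

The usernames in the usage example are taken from the text above; the email address is invented to show that it is not treated as a mention.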
We sought data relevant to two tasks: first, identifying the highest-status individuals (viz., the most active committers), and second, identifying individuals to be called upon, in both cases purely from the way they speak or are spoken to. All our data were collected through the GitHub public API using the Python package PyGithub. We randomly sampled 50 projects from the top 900 GitHub projects with the most stars and followers; this number dropped to 46 after removing projects with missing data. We chose the top 900 projects to ensure a sufficient amount of text in their issue communications; we chose to sample in order to a) reduce the time needed to build our models; and b) avoid bias that may exist when examining only the very top projects.
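The sampling and collection steps described above can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the authors' script: the helper names `sample_projects` and `fetch_issue_texts`, the fixed seed, and the choice to gather both issue bodies and comments are all hypothetical. The second helper assumes a recent PyGithub release (one that provides `Auth.Token`), a valid access token, and network access.

```python
import random


def sample_projects(ranked_repos, top_n=900, k=50, seed=0):
    """Uniformly sample k repositories from the top_n of a ranked list.

    ranked_repos is assumed to be sorted by popularity, most popular first.
    The seed is only there to make the sketch reproducible.
    """
    pool = list(ranked_repos)[:top_n]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def fetch_issue_texts(full_name, token):
    """Collect (author login, text) pairs from a repository's issues.

    Hypothetical helper; requires PyGithub and network access.
    """
    from github import Auth, Github  # imported lazily: optional dependency

    repo = Github(auth=Auth.Token(token)).get_repo(full_name)
    texts = []
    for issue in repo.get_issues(state="all"):
        texts.append((issue.user.login, issue.body or ""))
        for comment in issue.get_comments():
            texts.append((comment.user.login, comment.body or ""))
    return texts


if __name__ == "__main__":
    # Placeholder names standing in for a popularity-ranked repository list.
    ranked = [f"owner/repo{i}" for i in range(1000)]
    print(sample_projects(ranked)[:5])
```

Sampling from a fixed-size pool rather than taking the first 50 entries mirrors the stated goal of avoiding bias toward the very top projects.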
In this section we examine the utility of various language features, evaluate the performance of our LMs, and discuss the prediction accuracy and developer attributes for the sDAE-like model.
Modern tools such as GitHub support social coding, where developers interact via asynchronous textual media to develop software collaboratively. There is abundant textual communication data, which includes social tagging (@-mentions), where people are specifically addressed to request their attention to technical discussions. We describe a series of experiments exploring the use of this textual data to identify the status of both speakers and those being spoken to, using language models and stacked de-noising autoencoders to convert the text into continuous vector representations. We find good performance, and then closely examine the features of the text that might be leading to it. We experimentally discount non-content factors such as length, syntactic markers, etc., and find evidence suggesting that the semantic content is the primary factor behind our models’ good performance. In a follow-on experiment, we find that even the identity of who should be tagged can be predicted with a reasonable level of accuracy. Finally, we note that our work addresses a purely scientific question concerning the socio-linguistics of tagged exchanges in technical communities. We acknowledge the possibility of improving on our task performance using other features, such as related source code and prior social connections between tagger and (potential) tagee; we leave this for future work.