Date Published: February 7, 2019
Publisher: Public Library of Science
Author(s): Tony Allard, Paul Alvino, Leslie Shing, Allan Wollaber, Joseph Yuen, Ivan Olier.
Datasets that provide a ground truth to quantify the efficacy of automated algorithms are rare because manually annotating observations, although highly valuable, is time-consuming and expensive. Such datasets exist for niche problems in developed fields such as Natural Language Processing (NLP) and Business Process Mining (BPM); however, it is difficult to find a suitable dataset for use cases that span multiple fields, such as the one described in this study. The lack of established ground truth maps between cyberspace and the human-interpretable, persona-driven tasks that occur therein is one of the principal barriers preventing reliable, automated situation awareness of dynamically evolving events and the consequences of loss due to cybersecurity breaches. Automated workflow analysis (the machine-learning assisted identification of templates of repeated tasks) is the likely missing link between semantic descriptions of mission goals and observable events in cyberspace. We summarize our efforts to establish a ground truth for an email dataset pertaining to the operation of an open source software project. The ground truth defines semantic labels for each email and the arrangement of emails within sequences that describe actions observed in the dataset. Identified sequences are then used to define template workflows that describe the possible tasks undertaken for a project and their business process model. We present the overall purpose of the dataset, the methodology for establishing a ground truth, and lessons learned from the effort. Finally, we report on the proposed use of the dataset for the workflow discovery problem, and its effect on system accuracy.
The prevalence of Information and Communication Technology (ICT) systems, and their function as critical capability enablers, now poses a risk for organizations should those systems become degraded, compromised, or inoperable. In a military context, commanders want to develop risk management processes to protect their ICT capability enablers and provide mission assurance, where Mission Assurance (MA) is defined as “measures required to accomplish essential objectives of missions in a contested environment”.
Modern organizations often establish workflows to help control and monitor how their employees perform certain defined business functions or tasks (e.g., travel, leave approval, procurement, and recruitment). A defined workflow allows an organization to complete tasks in a somewhat predictable and measurable way. It can also aid in associating an organization’s mission essential tasks with the underlying resources used within the cyber terrain. It is often difficult to determine whether defined workflows are actually followed or, where none are defined, what actions and workflows are executed on an organization’s cyber resources. Workflow analysis is the systematic approach to identifying and characterizing the tasks that are executed within an organization.
Manually extracting workflows from a dataset is a resource-intensive task, especially when there are no workflows defined a priori. Thus, we seek to develop a process for automatically extracting workflows from natural language datasets, within the broader context of developing a mission map for MA. To empirically evaluate our techniques we require a dataset that has the following properties:
Here we describe the methods by which we extracted “ground truth” information from the 250 emails, along with examples from the dataset. Our analysis team consisted of 8 individuals from the Australian Defence Science and Technology Group (DST Group), the US Naval Research Lab (NRL), and MIT Lincoln Laboratory (MITLL).
In this section, we provide samples of the consensus results produced during the Delphi process; links to the full results are provided in Section 8. We note that we disallowed the use of JIRA issue numbers as valid consensus keywords early in the Delphi meetings, although those labels do appear in the individual votes.
Ninety percent of the observed workflow instances in the dataset mapped to the two dominant workflows, support and bugfix. The fidelity of these workflows was heavily dependent on the granularity of the chosen actions. There is a tension between over-generalizing actions and assigning actions that are too specific. The former may result in a single digraph that encapsulates all workflow instances, whereas the latter could result in an excessive number of overly specific digraphs. This drove us to a consensus selection of 19 actions unique enough to differentiate workflow instances whilst still remaining general enough to construct meaningful workflow models. For example, the bugfix digraph contains workflow instances that result in creation, modification, and completion of code, regardless of the specific type (e.g., feature additions, improvements, or bugs); the support digraph contains workflow instances that result in user and developer questions, support, and updates; the build system update digraph contains workflow instances that result in updates to the Camel servers; and so on. The result was the identification of different processes occurring in the overall Camel system.
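The granularity trade-off can be illustrated with a minimal sketch. This is not the procedure used to build the ground truth; it assumes that a workflow digraph can be approximated by counting which action directly follows which within each trace. The action labels, the `build_digraph` and `generalize` helpers, and the coarse mapping are all hypothetical and are not drawn from the 19 consensus actions.

```python
from collections import defaultdict

def build_digraph(traces):
    """Count directly-follows edges (src, dst) across all traces."""
    edges = defaultdict(int)
    for trace in traces:
        # Pair each action with the one that immediately follows it.
        for src, dst in zip(trace, trace[1:]):
            edges[(src, dst)] += 1
    return dict(edges)

def generalize(trace, mapping):
    """Replace fine-grained actions with coarser ones where a mapping exists."""
    return [mapping.get(action, action) for action in trace]

# Hypothetical traces: each is an ordered list of action labels for one
# workflow instance (labels are illustrative, not from the dataset).
traces = [
    ["report_bug", "discuss", "submit_patch", "commit"],
    ["request_feature", "discuss", "submit_patch", "commit"],
]

# Fine-grained actions keep the two entry points distinct.
fine = build_digraph(traces)

# A coarser (hypothetical) action vocabulary merges them into one path.
coarse_map = {"report_bug": "open_issue", "request_feature": "open_issue"}
coarse = build_digraph([generalize(t, coarse_map) for t in traces])

print(len(fine), "edges with fine-grained actions")
print(len(coarse), "edges with generalized actions")
```

With the fine-grained labels the two instances share only part of a graph; after generalization they collapse onto a single path. Pushing the mapping further would eventually merge every instance into one digraph, which is the over-generalization risk described above.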
The driving component of our work is the combination of event classification and process mining techniques employed to perform workflow analysis. Existing research in both the NLP and process mining fields discusses the use of labeled ground truth datasets to validate the proposed approaches. However, we have yet to find a single labeled ground truth dataset that can be used to validate the entire workflow analysis pipeline.
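To make the role of a labeled ground truth concrete, the following sketch scores the output of a hypothetical event classifier against ground truth labels using per-label precision and recall. It is an illustration under stated assumptions rather than the evaluation used in this study: the `precision_recall` helper and the example label sequences are invented for the purpose, and only the workflow names support and bugfix come from the dataset.

```python
from collections import Counter

def precision_recall(truth, predicted):
    """Return {label: (precision, recall)} for paired label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, predicted):
        if t == p:
            tp[t] += 1   # correct prediction for label t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the true label and was missed
    scores = {}
    for label in set(truth) | set(predicted):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        scores[label] = (precision, recall)
    return scores

# Hypothetical ground truth and classifier output for four emails.
truth = ["support", "bugfix", "bugfix", "support"]
predicted = ["support", "bugfix", "support", "support"]

scores = precision_recall(truth, predicted)
for label, (precision, recall) in sorted(scores.items()):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

The same comparison could in principle be applied at each stage of the pipeline (keywords, actions, traces, and workflows), provided the ground truth carries labels at that stage.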
In this work we constructed a ground truth dataset that describes a subset of the business functions for an open source software project in order to facilitate methods for automated workflow analysis. This ground truth contains keywords, metalabels, traces, and actions, manually annotated via the Delphi consensus method, that serve as meaningful descriptors to construct the workflows that best describe these business functions. This provides the missing link between semantic descriptions of mission goals and observable events in cyberspace, enabling researchers to quantify the efficacy of automated algorithms for workflow discovery analysis, as well as other automated analysis techniques within the fields of natural language processing and business process mining. This dataset enables future researchers in the area of workflow analysis to understand the methodology that was used to establish the ground truth, including our novel presentation of mapping sequences of natural language events to workflows. We also provide a way to replicate the approach using an extension of our source data or any other source data, and a foundation for testing and constructing automated workflow analysis techniques. The Delphi consensus method results, labeled data, and final workflow results are available through the links in Section 8.