If you suspend your transcription on amara.org, please add a timestamp below to indicate how far you progressed! This will help others resume your work!
Please do not press “publish” on amara.org to save your progress; use “save draft” instead. Only press “publish” when you're done with quality control.
Anonymity is a topic researched in detail at the Privacy, Security, and Automation Lab at Drexel University. We study how to effectively identify the author of a text of unknown authorship and how to anonymize text of known authorship. In our previous talks at CCC, we presented methods to identify authors of regular text, translated text, and users (a.k.a. cyber-criminals) of online underground forums, and we introduced our authorship anonymization framework ‘Anonymouth’. We have often been asked how these de-anonymization techniques would apply to source code and to other domains. In this year’s talk, we focus on identifying the authors of source code and on cross-domain stylometry.
Can the authors of source code be identified automatically through features of their programming style? Do they leave coding “footprints”? This question holds important implications for protecting intellectual property as well as for identifying malware authors and tracking how malware spreads and evolves, and it spurred a cross-cutting research project involving NLP and machine learning. Code stylometry requires features unique to coding and to the programming language: source code has different properties than ordinary writing, such as its layout, keywords, comments, the way functions and variables are created, and the grammar of the program.
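To make the feature categories concrete, here is a minimal sketch of how layout and lexical features might be extracted from a source file. The specific feature names (indentation ratios, keyword frequencies, comment density) are illustrative assumptions, not the exact feature set used in the research.

```python
import re

def layout_lexical_features(source: str) -> dict:
    """Illustrative layout and lexical features for code stylometry.

    The feature names below are hypothetical examples of the feature
    categories discussed in the text, not the actual research feature set.
    """
    lines = source.splitlines()
    n_lines = max(len(lines), 1)
    n_chars = max(len(source), 1)
    keywords = ("if", "else", "for", "while", "do", "switch", "return")
    return {
        # Layout: how the author formats code on the page.
        "tab_lead_ratio": sum(l.startswith("\t") for l in lines) / n_lines,
        "space_lead_ratio": sum(l.startswith(" ") for l in lines) / n_lines,
        "empty_line_ratio": sum(not l.strip() for l in lines) / n_lines,
        "avg_line_length": n_chars / n_lines,
        "comment_ratio": source.count("//") / n_lines,
        # Lexical: token-level habits such as keyword usage frequency.
        **{f"kw_{k}": len(re.findall(rf"\b{k}\b", source)) / n_chars
           for k in keywords},
    }

features = layout_lexical_features("for (int i = 0; i < n; i++) {\n\treturn i;\n}\n")
```

Syntactic features would go further, parsing the code into an abstract syntax tree and counting node types and tree shapes, which plain string processing cannot capture.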
Aware that methods from text analytics can strengthen cyber analytics, this project sought to advance the potential of automated linguistic-style analysis, or stylometry, for authorship attribution of source code. A corpus of tens of thousands of users was built by scraping the Google Code Jam competition dataset. Specifically investigated were new ways of representing coding style through NLP-inspired syntactic, lexical, and layout features. Random forests with 300 trees were used, with fewer than ten decision features per tree. The main dataset had 173 authors, each with six source code files of less than 100 lines of C++ code. A series of experiments was performed to discover the feature set that yielded the highest recognition accuracy: 91%. Of the features with information gain, 57% were syntactic and the rest were lexical and layout features. Tests on a validation dataset of the exact same size showed 86% accuracy with the same features. The features that had information gain in the validation experiments all had information gain in the original dataset, which shows that the method and feature set are robust, and that abstract syntax trees show the most promise.
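The classification setup described above can be sketched as follows. The synthetic data stands in for the Google Code Jam feature vectors (it is not the actual dataset), and the author count is scaled down; the 300-tree random forest with a small number of decision features per split mirrors the configuration stated in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_authors, files_per_author, n_features = 20, 6, 50

# Fake feature matrix: each author gets a distinct mean vector,
# mimicking a consistent personal style across their six files.
X = np.vstack([
    rng.normal(loc=rng.normal(size=n_features), scale=0.5,
               size=(files_per_author, n_features))
    for _ in range(n_authors)
])
y = np.repeat(np.arange(n_authors), files_per_author)

# 300 trees; "sqrt" limits each split to ~7 of the 50 features,
# roughly matching the "fewer than ten decision features" setting.
clf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                             random_state=0)
scores = cross_val_score(clf, X, y, cv=3)
```

With six samples per author, cross-validation (here 3-fold) is the natural way to estimate the per-author recognition accuracy reported in the text.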
Source code is just one domain studied in authorship attribution. We also study the problem of domain adaptation in stylometry. Can we identify the author of an anonymous blog from a suspect group of Twitter accounts? The ability to do so would lead to the ability to link accounts and identities across the Internet. We can achieve high accuracy at identifying authors of documents within the same domain, including blogs, Twitter feeds, and Reddit comments, even when classifying with up to 200 authors. Identifying the author of a group of tweets from among 200 tweeters yields an accuracy of 94%, and identifying the author of a blog entry from among 200 bloggers yields an accuracy of 71%. When we try to identify the author of a collection of tweets based on a collection of blogs from 200 authors, however, accuracy drops to 7% using the same method and features.
We are able to increase the accuracy, however, by applying an augmented version of doppelganger finder, a stylometric approach for multiple account detection that can handle small stylistic changes. This provides significant improvements in each of the cross-domain cases.
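The core idea behind doppelganger finder is pairwise linking: for two accounts A and B, train a classifier to recognize A against background authors and score B's documents (and vice versa), then combine the two directions into a single link score. The sketch below illustrates that idea only; the function names are hypothetical, logistic regression stands in for whatever classifier is actually used, and the simple average at the end replaces the paper's more principled probability combination.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def link_score(docs_a, docs_b, docs_other):
    """Illustrative combined probability that accounts A and B share an author.

    docs_a, docs_b, docs_other are (n_docs, n_features) stylometric
    feature matrices; docs_other is a background of other authors.
    """
    def prob(target, probe, background):
        # Train "target author vs. everyone else", then score the probe docs.
        X = np.vstack([target, background])
        y = np.array([1] * len(target) + [0] * len(background))
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        return clf.predict_proba(probe)[:, 1].mean()

    p_ab = prob(docs_a, docs_b, docs_other)  # do B's docs look like A's?
    p_ba = prob(docs_b, docs_a, docs_other)  # do A's docs look like B's?
    # Simplified combination of the two directions (the actual method
    # combines the probabilities more carefully).
    return (p_ab + p_ba) / 2
```

In the cross-domain setting, docs_a might be blog feature vectors and docs_b tweet feature vectors; the augmentation described above is what lets the score tolerate the small stylistic shifts between domains.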
Advances in authorship attribution offer both positive and negative repercussions for security. However, it is important to understand the assumptions that underlie these results. Blind application of stylometric methods could be dangerous if the domain is not understood. This work shows that stylometric methods are domain dependent. Whether used defensively or offensively, this is certain to impact user account security.