It is possible to identify individuals by de-anonymizing different types of large datasets. Once individuals are de-anonymized, different types of personal details can be detected from the data that belong to them, and their identities across different platforms can be linked. This is possible by using machine learning methods that represent human data as a numeric vector of features. A classifier then learns the patterns of each individual and assigns a previously unseen feature vector to the most likely individual.
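To make that pipeline concrete, here is a minimal sketch of feature-vector-based attribution. The feature values, author names, and the choice of a random forest classifier are illustrative assumptions for the example, not our actual feature set or model.

```python
# Minimal sketch of feature-vector-based authorship attribution.
# The feature values below are placeholders, not real stylometric features.
from sklearn.ensemble import RandomForestClassifier

# Each row is one sample represented as a numeric feature vector,
# e.g. keyword frequencies, average line length, nesting depth, ...
X_train = [
    [0.12, 4.0, 2.1],   # sample by author "alice"
    [0.10, 3.8, 2.3],   # sample by author "alice"
    [0.45, 9.2, 0.7],   # sample by author "bob"
    [0.50, 8.9, 0.9],   # sample by author "bob"
]
y_train = ["alice", "alice", "bob", "bob"]

# The classifier learns each author's patterns from labeled samples.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# A previously unseen sample is assigned to the author whose patterns it matches best.
unseen = [[0.11, 4.1, 2.0]]
print(clf.predict(unseen))  # expected: ['alice']
```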
Tor users, social networks, underground cyber forums, and the Netflix dataset have all been de-anonymized in the past five years. Advances in machine learning and improvements in computational power, such as cloud computing services, make these large-scale de-anonymization tasks feasible. As data aggregators collect vast amounts of data from all possible digital media channels and as computing power becomes cheaper, de-anonymization threatens privacy on a daily basis.
Last year, we showed how we can de-anonymize programmers from their source code. This is an immediate concern for programmers who would like to remain anonymous. (Remember Saeed Malekpour, who was sentenced to death after the Iranian government identified him as the web programmer of a porn site.) Since last year’s talk on identifying source code authors via stylometry, we have scaled our method to 1,600 programmers, reaching 94% accuracy in correctly identifying the authors of 14,400 source code samples. These results are a breakthrough in accuracy and magnitude compared to related work.
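To illustrate what stylometric features of source code can look like, here is a hedged sketch that computes a few simple lexical and layout statistics. These example features (average line length, comment ratio, and so on) are placeholders for illustration only; the actual method relies on a much richer feature set, including syntactic features.

```python
# Illustrative lexical/layout features only; real code stylometry uses many
# more features (e.g. features derived from the abstract syntax tree).
import re

def simple_style_features(source: str) -> list[float]:
    lines = source.splitlines() or [""]
    tokens = re.findall(r"\w+", source)
    return [
        sum(len(l) for l in lines) / len(lines),                              # average line length
        source.count("\t") / len(lines),                                      # tabs per line
        len([l for l in lines if l.strip().startswith("//")]) / len(lines),   # line-comment ratio (C-style)
        len(set(tokens)) / max(len(tokens), 1),                               # lexical diversity
        source.count("{") / len(lines),                                       # brace density
    ]

print(simple_style_features("int main() {\n\treturn 0; // done\n}\n"))
```

Vectors like this, computed per code sample, are what the classifier in the earlier sketch would be trained on.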
This year we have been focusing on de-anonymizing programmers from the binaries of their compiled code. Identifying stylistic fingerprints in binaries is much more difficult than in source code: source code goes through compilation to generate binaries, and some stylistic fingerprints are lost in translation while others survive. We reach 65% accuracy, again a breakthrough, in de-anonymizing the binaries of 100 authors.
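As a hedged illustration of what a binary-level feature could look like, the sketch below turns instruction-mnemonic frequencies from a disassembled code snippet into a fixed-length vector. The use of the capstone disassembler and the tiny mnemonic vocabulary are assumptions made for this example, not our actual binary feature extraction, which draws on much richer representations.

```python
# Hypothetical sketch: instruction-mnemonic frequencies as binary "style" features.
from collections import Counter
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def mnemonic_features(code_bytes: bytes, vocab: list[str]) -> list[float]:
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    counts = Counter(insn.mnemonic for insn in md.disasm(code_bytes, 0x1000))
    total = sum(counts.values()) or 1
    # Fixed-length vector over a chosen mnemonic vocabulary, so it can feed a classifier.
    return [counts[m] / total for m in vocab]

# Example: a few bytes of x86-64 machine code (push rbp; mov rbp, rsp; xor eax, eax; ret)
sample = bytes.fromhex("554889e531c0c3")
print(mnemonic_features(sample, ["push", "mov", "xor", "ret", "call"]))
```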
De-anonymization is a threat to privacy, but it also has many security-enhancing applications. Identifying authors of source code helps resolve plagiarism issues, forensic investigations, and copyright-copyleft disputes. Identifying authors of binaries can help identify the author of a suspicious executable file, or even be extended to malware classification. We show how source code and binary authorship attribution works on real-world datasets collected from GitHub.
I hope this talk raises awareness of the dangers of de-anonymization while showing how it can be helpful in resolving conflicts in other areas. Binary de-anonymization could potentially enhance security by identifying malicious actors such as malware writers or software thieves.
I would like to conclude by mentioning two future directions. First, can binary de-anonymization be used for malware family classification and be incorporated into virus detectors? Second, obfuscators are not a countermeasure to de-anonymizing programmers: we can identify the authors of obfuscated code with high accuracy. There is an immediate need for a code anonymization framework, especially for open source software developers who would like to remain anonymous.