Gibberish Detection 102

DGAs (Domain Generation Algorithms) have become a trusty fallback mechanism for malware and a headache for defenders, but they have one big drawback: they draw a lot of attention to themselves with their many DNS requests for gibberish domains.

When basic entropy-based Machine Learning methods rose to the challenge of automatically detecting DGAs, DGAs responded by subtly changing their output to be /just/ plausible enough to fool those methods. In this talk we’ll harness the might of the English dictionary, cut corners to achieve sane running times for insane computations, and use fancy Machine Learning® methods – all in order to build a classifier with a higher standard for gibberish plausibility.

In recent years, there has been a rising trend in malware’s use of Domain Generation Algorithms (DGAs) as a fallback mechanism in case the campaign is shut down at the DNS level. DGAs are a headache to deal with, but they have one big drawback: they make a lot of noise. To be more precise, they generate a very large number of DNS requests, and the requested domains are often complete gibberish.

This situation looks ripe to be exploited with your favorite Cyber™ Machine Learning® Big Data© solution; and indeed, advances were made by basic language processing methods that could detect and stop the outright gibberish. These worked well until DGAs mutated and started producing more plausible gibberish. A milestone in this regard was the introduction of KWYJIBO, a DGA that generates gibberish where every other letter is a vowel (e.g. “garolimoja”), which stumps the old methods completely.
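To make the contrast concrete, here is a minimal sketch, not the actual KWYJIBO implementation: a hypothetical generator that alternates consonants and vowels, next to two crude features of the kind a basic detector might use. The feature choice (character entropy and longest consonant run) and all function names are assumptions made for illustration; the abstract does not specify what the older methods computed.

```python
import math
import random
from collections import Counter

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def alternating_label(length, rng):
    """Hypothetical KWYJIBO-style label: consonant, vowel, consonant, ..."""
    return "".join(
        rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(length)
    )


def shannon_entropy(label):
    """Character-level Shannon entropy in bits per character."""
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_consonant_run(label):
    """Length of the longest run of consecutive non-vowel characters."""
    longest = current = 0
    for ch in label:
        current = 0 if ch in VOWELS else current + 1
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for label in ("garolimoja", "xkqzvbnmtr", alternating_label(10, rng)):
        print(label, round(shannon_entropy(label), 3), longest_consonant_run(label))
```

An alternating label never has two consonants in a row, so a consonant-run feature that flags strings like “xkqzvbnmtr” sees nothing unusual in “garolimoja”.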

How do you thwart KWYJIBO and other DGAs of its sophistication? How do you look for meaninglessness in string-space? In this talk we’ll harness the might of the English dictionary; cheat mathematics to cut running times from impossible to reasonable; and demonstrate a fancy Cyber™ Machine Learning® Big Data© tool based on all the above to tell apart meaningful domain names from nonsense. Where is this arms race going, anyway? Is there such a thing as undetectable gibberish?
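One way to picture “harnessing the dictionary” is to measure how much of a domain label can be covered by real words. The sketch below is an assumption about the general idea, not the method presented in the talk: the tiny word list, the `dictionary_coverage` name and the `min_len` parameter are all hypothetical, and a real system would load a full English dictionary.

```python
# Tiny illustrative word list (an assumption); a real one would be a full dictionary.
WORDS = {"cloud", "mail", "server", "book", "shop", "my", "photo", "store"}


def dictionary_coverage(label, words=WORDS, min_len=2):
    """Fraction of characters in `label` coverable by non-overlapping dictionary words."""
    n = len(label)
    if n == 0:
        return 0.0
    max_len = max((len(w) for w in words), default=0)
    # best[i] = maximum number of covered characters within label[:i]
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option: leave character i-1 uncovered
        # option: end a dictionary word at position i, starting at j
        for j in range(max(0, i - max_len), i - min_len + 1):
            if label[j:i] in words:
                best[i] = max(best[i], best[j] + (i - j))
    return best[n] / n


if __name__ == "__main__":
    print(dictionary_coverage("mailserver"))  # 1.0
    print(dictionary_coverage("garolimoja"))  # 0.0
```

The dynamic program runs in time proportional to the label length times the longest word length, so it avoids enumerating every possible segmentation. A coverage score like this would be one input feature to a classifier rather than a verdict on its own.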

Room: Hall 6
Start: 4 p.m.
Speaker: Ben H