Machine Learning for Cyber Security

Dr Miranda Mowbray

~900 words

Published: August 29th 12,018 HE

Last modified: January 24th 12,021 HE

Summary

Dr Miranda Mowbray, Lecturer in Computer Science at the University of Bristol, came to the University of Lancaster to give a presentation about some of the difficulties with using machine learning for detecting cyber security attacks on enterprise networks. The work in question was undertaken whilst she was working at HP Labs.

Mowbray began by explaining just what is meant by machine learning, in case any of her inter-faculty audience were unaware: software that converts into other software by updating itself. She de-technicalised this with the excellent analogy of a recipe that requests “salt to taste”, and the repeated experimentation that a given cook must then perform in order to find out their ideal amount of salt. She also detailed the distinction between supervised and unsupervised learning; in the former, the software is given a pre-labelled training set of data (e.g. images of birds and of bees, with the relevant descriptors attached), whilst in the latter there are no hints and the software must produce its own distinction.

The issue with using machine learning in cyber security, said Mowbray, is that it is hard to get to the ground truth, as “one [doesn’t] know what an attack’s going to look like, but [one knows] it’s going to look weird”. Traditional anti-virus tools rely on signatures to identify suspect pieces of software and intercept them. This worked fine for a time, but nowadays we see around 28 million new malware variants per month, so it clearly doesn’t scale. These strains of malware are obviously not all hand-written, unless the global population of cybercriminals rivals that of Mozambique. Rather, they are the result of the rise of polymorphic malware, which can change its own code in order to evade signature detection. Clearly, higher-level pattern detection is now required.
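To see why exact-match signatures struggle against polymorphic code, here is a minimal sketch. It assumes, purely for illustration, that a signature is a SHA-256 hash of the whole file; real anti-virus signatures are often byte patterns rather than whole-file hashes, and the sample bytes and function names below are made up.

```python
import hashlib


def signature(payload: bytes) -> str:
    """Return a SHA-256 digest, standing in for an anti-virus signature."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical "known bad" sample; the bytes are invented for this example.
known_sample = b"PAYLOAD:encrypt_files();KEY=0001"
signature_db = {signature(known_sample)}


def is_flagged(payload: bytes) -> bool:
    """Exact-match lookup against the signature database."""
    return signature(payload) in signature_db


# A "polymorphic" variant: same behaviour, one trivially changed byte.
variant = known_sample.replace(b"0001", b"0002")

print(is_flagged(known_sample))  # True
print(is_flagged(variant))       # False
```

Every new variant needs its own database entry under this scheme, which is the scaling problem described above.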

Mowbray then moved on to detail the major problems with the use of machine learning for cyber security. First, the false alarm problem: if bad events occur at a rate of 1 in a million, and your method has a 0.1% false alarm rate, you can then expect around 1,000 false alarms for every true alarm. Conversely, the true alarm problem: if a DNS service (for example) can expect to see around 18 billion events per day, and if that service has the same rate of bad events, the bad event detection system will fire off roughly 12.5 true alarms every minute. This will overwhelm any human security team, and can only be mitigated by clustering alarms (e.g. by source machine) and by visualisation methods.
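Both problems are base-rate arithmetic, which can be checked in a few lines. This sketch assumes the detector catches every bad event (an assumption not stated in the talk) and uses the rounded inputs quoted above; the variable names are illustrative. With those inputs the true alarm rate comes to 12.5 per minute, though the exact figures used in the talk may have differed.

```python
from fractions import Fraction

bad_event_rate = Fraction(1, 1_000_000)   # 1 bad event per million events
false_alarm_rate = Fraction(1, 1000)      # 0.1% of benign events raise an alarm
events_per_day = 18_000_000_000           # rounded DNS event volume

# False alarm problem: false alarms per true alarm,
# assuming every bad event is detected.
false_per_true = (1 - bad_event_rate) * false_alarm_rate / bad_event_rate

# True alarm problem: genuine alarms per minute at that event volume.
true_alarms_per_minute = events_per_day * bad_event_rate / (24 * 60)

print(float(false_per_true))          # 999.999
print(float(true_alarms_per_minute))  # 12.5
```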

Machine learning in cyber security is so fun, according to Mowbray, because one competes with an adversary: one must try to second-guess how malware authors will get around one’s detection methods. Mowbray’s work began by examining the distribution of top-level domains (TLDs) visited by the PCs within a company, looking for sudden changes. This led to some false alarms, such as .ee being flagged as suspicious due to a small group of Estonian employees phoning home for banking and the like. Additionally, Polish domains were particularly prone to being flagged as the output of a domain generation algorithm (DGA).
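A rough sketch of what “looking for sudden changes” in a TLD distribution could mean is given below. It is not Mowbray’s actual method: the function names, the thresholds (`min_share`, `factor`, `floor`) and the sample domains are all assumptions chosen for illustration. It also reproduces the kind of false alarm described above, where a handful of legitimate .ee lookups stand out.

```python
from collections import Counter


def tld_of(domain: str) -> str:
    """Return the last label of a domain name, lower-cased."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def tld_shares(domains):
    """Map each TLD to its share of the queries in `domains`."""
    counts = Counter(tld_of(d) for d in domains)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


def sudden_changes(baseline, current, min_share=0.05, factor=3.0, floor=0.01):
    """Flag TLDs whose share of queries grew by at least `factor`.

    `floor` stands in for the baseline share of TLDs never seen before,
    so that new TLDs do not cause a division by zero.
    """
    base = tld_shares(baseline)
    now = tld_shares(current)
    flagged = []
    for tld, share in now.items():
        previous = max(base.get(tld, 0.0), floor)
        if share >= min_share and share / previous >= factor:
            flagged.append(tld)
    return sorted(flagged)


# Invented traffic: a few employees start visiting an Estonian bank.
baseline = ["a.com"] * 80 + ["b.org"] * 15 + ["c.uk"] * 5
current = ["a.com"] * 70 + ["b.org"] * 15 + ["c.uk"] * 5 + ["pank.ee"] * 10

print(sudden_changes(baseline, current))  # ['ee']
```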

However, this approach to detecting infections was not prophylactic, as it requires that a device be infected first. To counteract this, Mowbray looked for unusual 2-length distributions, with 2-lengths referring to the length of “[t]he substring between the last and second-last dots before the public suffix”. A different approach was applied for Chinese employees, as common ways of representing Mandarin in Latin-character domain names often caused them to throw false alarms. This was supplemented by methods to test whether devices identified as suspicious were making a number of DNS queries that failed to resolve, exploiting the fact that only a small proportion of the domains generated by most DGAs are ever registered by the attacker.
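The two signals described here can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the paper: `PUBLIC_SUFFIXES` is a tiny hand-written stand-in for the real Public Suffix List, the helper names are hypothetical, and the example domains are invented.

```python
from collections import Counter

# Tiny stand-in for the Public Suffix List; a real system would load the full list.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk", "ee"}


def label_before_suffix(domain: str):
    """Return the label immediately before the longest matching public suffix."""
    labels = domain.rstrip(".").lower().split(".")
    # Smaller i means a longer candidate suffix, so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return labels[i - 1]
    return None  # the name is itself a suffix, or the suffix is unknown


def length_distribution(domains):
    """Count how often each label length occurs across the given domains."""
    labels = (label_before_suffix(d) for d in domains)
    return Counter(len(label) for label in labels if label is not None)


def unresolved_fraction(responses):
    """Fraction of DNS responses that were NXDOMAIN (name does not exist)."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if r == "NXDOMAIN") / len(responses)


queries = [
    "www.example.co.uk", "mail.example.com",
    "kq3vxpzt1mwa.com", "zr8ynqhd2lbc.com", "pt5wgjxk9sfe.org",
]
print(length_distribution(queries))
print(unresolved_fraction(["NOERROR", "NXDOMAIN", "NXDOMAIN", "NXDOMAIN"]))  # 0.75
```

A device whose lookups pile up at one unusual length and mostly come back as NXDOMAIN is the kind of candidate the two tests together are meant to surface.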

As described in her 2014 paper with Josiah Hagen, Mowbray’s testing discovered 19 DGAs within 5 days, including 9 that were previously unknown to the cyber security industry. Following this, Mowbray detailed where she believed her findings might lead in the future, citing Rios & Butts’ 2017 BlackHat presentation “When IoT Attacks” and CyberX’s 2017 “Global ICS & IIoT Risk Report”, and stating that IoT attacks will almost certainly be the biggest growth area going forward, along with robot swarm security.

The Q&A began with a rather unanswerable question asking Mowbray “what malware didn’t you catch?” She gave it a good effort, however, pointing out that one property of machine learning is that you don’t see its effects, but that machine learning can be applied in an immune system-esque way, being primarily effects-driven. There were a handful of other questions, mostly on minor technical points, including one from me regarding how well Mowbray’s methods could be applied to Unicode URLs containing characters such as Cyrillic, Arabic or Emoji. As excited as Mowbray was about the prospect of Emoji domain names, she pointed out that their uptake is likely to remain limited and that, even if they did take off, part of her method was to exclude the most popular domains from her analyses first off. Finally, someone pointed out that where many argue for incredibly complex solutions to most given problems, Mowbray demonstrated that simply monitoring DNS queries and analysing 2-lengths could be just as efficacious. “Amen,” concluded Mowbray.