This blog post is written/edited by CDT Students Daniel Collins and Matt Clifford
BIAS 22 – Day 2, Dr Oliver Ray: “Knowledge-driven AI”
The second talk of day two was delivered by Dr Oliver Ray (University of Bristol), on the topic of human-in-the-loop machine learning using Inductive Logic Programming (ILP) and its application in cyber threat elucidation.
Cyber threat elucidation is the task of analysing network activity to identify ransomware attacks, and to better understand how they unfold. Ransomware is a type of malware which infects victims’ devices, encrypts their data, and demands money from them to restore access. Infection typically occurs through human error. For example, a person may be tricked into downloading and running a “trojan” – malware that has been disguised as a legitimate and benign file. The executed ransomware encrypts data, and backups of that data, on the infected system, and the perpetrator can then demand a ransom payment for decryption services. However, ransomware does not always start encrypting data immediately. Instead, it may lie relatively dormant whilst it spreads to other networked systems, gathers sensitive information, and creates backups of itself to block data recovery. If an attack can be identified at this stage, or soon after it has started encrypting data, the ransomware can be removed before most of the data has been affected.
Ransomware is a persistent threat to cyber security, and each new attack can be developed to behave in unpredictable ways. Dr Ray outlined the need for better tools to prepare for new attacks – when faced with a new attack, there should be systems that help a user understand what is happening and what has already happened, so that the ransomware can be found and removed as quickly as possible and relevant knowledge can be gained from the attack.
To identify and monitor threats, security experts may perform forensic analysis of Network Monitoring System (NMS) data from around the time of infection. This data exists in the form of network logs – relational databases containing a time-labelled record of events and activity occurring across the networked systems. However, the volume of log data is very large, and most of it is associated with benign activity unrelated to the threat, making it difficult to find examples of malicious activity. Further, in the case of new threats, there are few or no labelled examples of logs known to be related to an attack. Human knowledge and reasoning are therefore crucial for identifying relevant information in the logs.
Dr Ray then presented ILP-based machine learning (ML) as a promising alternative to more ‘popular’ traditional ML methods for differentiating ransomware activity from benign activity in large network logs. This is because ILP is better suited to working with relational data, an area where deep learning and traditional ML methods can struggle since they often require tabular or vectorisable data formats. ILP not only makes predictions on relational data, but also produces human-interpretable logic rules through which it is possible to uncover and learn about the system itself. This could provide valuable insights into how the infection logs are generated, and which features of the logs are important for identification, as opposed to guessing which features might be important.
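To give a flavour of what learning an interpretable rule from relational data can look like, here is a minimal sketch in Python. It is a toy generate-and-test search, not a real ILP engine and not the method presented in the talk. The relation names, host names, labelled examples and candidate literals are all made up for illustration.

```python
from itertools import combinations

# Hypothetical background facts extracted from network logs:
# relation name -> set of argument tuples. Not real log data.
FACTS = {
    "opened_attachment": {("h1",), ("h3",)},
    "smb_connection": {("h1", "h2"), ("h3", "h4"), ("h5", "h6"), ("h1", "h7")},
    "mass_file_rename": {("h2",), ("h4",), ("h6",)},
}
HOSTS = sorted({arg for rows in FACTS.values() for row in rows for arg in row})

# Hypothetical analyst-labelled examples of the target concept suspicious(X).
POSITIVES = {"h2", "h4"}
NEGATIVES = {"h1", "h3", "h5", "h6", "h7"}

# Candidate body literals: printable form -> test applied to a host X.
# The last entry joins two relations through an existential variable Y.
LITERALS = {
    "mass_file_rename(X)": lambda x: (x,) in FACTS["mass_file_rename"],
    "opened_attachment(X)": lambda x: (x,) in FACTS["opened_attachment"],
    "smb_connection(Y, X), opened_attachment(Y)": lambda x: any(
        (y, x) in FACTS["smb_connection"] and (y,) in FACTS["opened_attachment"]
        for y in HOSTS
    ),
}


def covers(body, host):
    """A rule body covers a host if every literal in it holds for that host."""
    return all(LITERALS[literal](host) for literal in body)


def learn_rule(max_literals=2):
    """Return the shortest rule covering all positives and no negatives."""
    for size in range(1, max_literals + 1):
        for body in combinations(sorted(LITERALS), size):
            covers_all_pos = all(covers(body, p) for p in POSITIVES)
            covers_no_neg = not any(covers(body, n) for n in NEGATIVES)
            if covers_all_pos and covers_no_neg:
                return "suspicious(X) :- " + ", ".join(body) + "."
    return None  # no consistent rule within the size limit


rule = learn_rule()
print(rule)
```

On this toy data, no single literal separates the labelled hosts, so the search returns a two-literal rule: a host is flagged if it renamed many files and was contacted by a host that opened an attachment. Unlike a score from a black-box classifier, the output is a readable statement that an analyst can inspect and challenge.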
Dr Ray went on to detail the results of his work with Dr Steve Moyle (Amplify Intelligence UK and Cyber Security Centre, University of Oxford) on a novel proof-of-concept for an ILP-based “eXplanatory Interactive Relational Machine Learning” (XIRML) system called “Acuity”. This human-in-the-loop system allows ILP and cyber security experts to direct the cyber threat elucidation process, through interactive functionality for guided data-caching on large network logs, and hypothesis-shaping for rebutting or altering learned logic rules.
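The idea of rebutting a learned rule can be illustrated with a small sketch. This is not Acuity's interface or algorithm, which the talk summary does not specify. The candidate rules, their scores and the `expert_review` function below are hypothetical stand-ins for a learner's output and an interactive expert prompt.

```python
# Hypothetical candidate rules with made-up scores; not output from Acuity.
CANDIDATES = [
    ("suspicious(X) :- active_at_night(X).", 0.95),
    ("suspicious(X) :- mass_file_rename(X), contacted_by_infected(X).", 0.90),
    ("suspicious(X) :- mass_file_rename(X).", 0.70),
]


def shape_hypothesis(candidates, review):
    """Offer rules best-first; return the first accepted rule and the rebutted ones."""
    rebutted = []
    for rule, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if review(rule):
            return rule, rebutted
        rebutted.append(rule)
    return None, rebutted


def expert_review(rule):
    # Stand-in for an interactive prompt: this imagined expert rejects any
    # rule that relies on time of day, judging it a spurious correlation.
    return "active_at_night" not in rule


accepted, rebutted = shape_hypothesis(CANDIDATES, expert_review)
print("accepted:", accepted)
print("rebutted:", rebutted)
```

Here the highest-scoring rule is rebutted by the reviewer and the next candidate is accepted, which shows in miniature how domain knowledge can steer the learner away from rules that fit the data but make no sense to an expert.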
In his concluding remarks, Dr Ray shared his thoughts on the future of this technology. As he sees it, the goal is to develop safe, auditable systems that could be used in practice by domain experts alone, without the need for an ILP expert in the loop. To this end, he suggested that system usability and human-interpretable outputs are both crucial factors in the design of future systems.