Bitcoin Ransomware Detection with Scalable Graph Machine Learning
Ransomware is a type of malware that has become a major threat, rising to 600 million attacks per year, and the payments it extorts are very often made in cryptocurrency. Ransomware operators rely on pseudonymity to send and receive payments that are difficult to trace, but every transaction on the bitcoin blockchain is recorded publicly, which presents an opportunity to develop an analytics pipeline to detect such activity.
Graph Machine Learning is a rapidly developing research area which combines entity attributes and network structure to improve machine learning outcomes. These techniques are becoming increasingly popular, often outperforming traditional approaches when the underlying data can be naturally represented as a graph.
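To make "combining entity attributes and network structure" concrete, here is a minimal sketch in plain Python. The transaction tuples, the address names, and the helper functions `build_graph` and `aggregate` are all hypothetical and are not taken from the talk. Each address gets its own attributes (total sent and received) plus the mean of its neighbours' attributes, which is the basic idea that graph neural network layers build on.

```python
from collections import defaultdict

# Hypothetical toy data: (sender_address, receiver_address, amount_btc).
transactions = [
    ("addr_a", "addr_b", 0.5),
    ("addr_a", "addr_c", 1.5),
    ("addr_b", "addr_c", 2.0),
]


def build_graph(txs):
    """Return undirected adjacency sets and a [sent, received] vector per address."""
    neighbours = defaultdict(set)
    sent = defaultdict(float)
    received = defaultdict(float)
    for src, dst, amount in txs:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
        sent[src] += amount
        received[dst] += amount
    attributes = {addr: [sent[addr], received[addr]] for addr in neighbours}
    return dict(neighbours), attributes


def aggregate(neighbours, attributes):
    """Concatenate each node's own attributes with the mean of its neighbours'."""
    combined = {}
    for node, own in attributes.items():
        nbrs = sorted(neighbours[node])
        mean = [
            sum(attributes[n][i] for n in nbrs) / len(nbrs)
            for i in range(len(own))
        ]
        combined[node] = own + mean
    return combined


neighbours, attributes = build_graph(transactions)
features = aggregate(neighbours, attributes)
print(features["addr_a"])  # [2.0, 0.0, 1.0, 2.0]
```

A classifier trained on `features` sees both what an address did and what its counterparties did. A real model would learn the aggregation weights rather than use a fixed mean.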
This talk will highlight two main outcomes: 1) how a graph machine learning pipeline is formulated to detect bitcoin addresses that are suspected to be associated with ransomware, and 2) how this algorithm is scaled out to process over 1 billion transactions using Apache Spark.
Outline/Structure of the Talk
The presentation will begin with a brief introduction to ransomware and the problem statement of detecting bitcoin addresses associated with such activity. The rest of the talk will be structured in two main sections:
- Graph machine learning: This section introduces the audience to graph convolutional networks and how we can formulate a machine learning pipeline to solve the problem at hand. We show how the bitcoin transaction data is modelled as a graph, followed by a deeper dive into the neural network architecture and design.
- GraphSAGE and Spark: This section goes a step further and considers the scalability of our design. GraphSAGE, a variation of graph convolutional networks, is introduced, followed by an explanation of how we can leverage Apache Spark to develop a scalable implementation of this algorithm. Towards the end of the talk, we will also discuss some of the interesting challenges that needed to be tackled, such as dealing with randomness in Spark.
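The GraphSAGE idea mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation presented in the talk: the function names, the toy graph, and the weight matrices are all assumptions, and normalisation is omitted. GraphSAGE samples a fixed number of neighbours per node and aggregates their features, which bounds the work per node. The sketch also shows one possible way to keep that sampling reproducible in a distributed engine, where a partition may be recomputed on retry: derive the random seed from the node itself rather than from shared generator state. Whether the speakers handle randomness in Spark this way is not stated in the abstract.

```python
import hashlib
import random


def sample_neighbours(node, neighbours, k, seed, layer):
    """Pick exactly k neighbours of `node`, with replacement if it has fewer than k.

    The generator is seeded from (seed, layer, node), so the sample does not
    depend on which worker evaluates it or how often it is recomputed.
    """
    key = f"{seed}:{layer}:{node}".encode()
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))
    candidates = sorted(neighbours[node])
    if len(candidates) >= k:
        return rng.sample(candidates, k)
    return [rng.choice(candidates) for _ in range(k)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sage_mean_layer(features, neighbours, w_self, w_neigh, k, seed, layer=0):
    """One mean-aggregator layer: ReLU(W_self h_v + W_neigh mean(h_u))."""
    out = {}
    for node, h_v in features.items():
        sampled = sample_neighbours(node, neighbours, k, seed, layer)
        mean = [sum(features[u][i] for u in sampled) / k for i in range(len(h_v))]
        z = [a + b for a, b in zip(matvec(w_self, h_v), matvec(w_neigh, mean))]
        out[node] = [max(0.0, x) for x in z]
    return out


# Hypothetical toy graph and hand-picked weights (a real model learns these).
neighbours = {0: {1, 2, 3}, 1: {0}, 2: {0, 3}, 3: {0, 2}}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, 0.5]}
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_neigh = [[0.5, 0.0], [0.0, 0.5]]

first = sage_mean_layer(features, neighbours, w_self, w_neigh, k=2, seed=42)
second = sage_mean_layer(features, neighbours, w_self, w_neigh, k=2, seed=42)
print(first == second)  # True: identical seeds give identical samples
```

Because each node's computation depends only on its own sampled neighbourhood, the loop body is the kind of per-record work that can be distributed across a cluster. Stacking two such layers, with a different `layer` value for each, extends the receptive field to two hops.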
The talk focuses on two main takeaways for the audience:
- Exposure to graph machine learning and how it can be applied to real, practical use cases. The techniques discussed are not limited to any single application; they can be adapted to others, and may encourage attendees to take a second look at their own data to see whether it is naturally represented as a graph.
- An understanding of what is involved in scaling out these algorithms so that they are useful in a practical setting. Working with big data does not have to mean being restricted to more traditional analytics.
This talk is aimed at engineers and data scientists working with big data, and at anyone who would like to learn about scalable graph machine learning.
Prerequisites for Attendees
There are no specific prerequisites; however, a basic understanding of deep learning and of distributed cluster-computing frameworks like Spark will be useful.