Building a Multi-Backend Engine / Arbiter
This tutorial shows you how to combine multiple analysis backends and outlines a basic verdict distillation primitive.
The two backends will be ClamAV (from the previous tutorial) and YARA.
Adding YARA to the Mix
Before we add YARA support to our Scanner, we need some YARA rules first!
The Yara-Rules repo is a great resource for free rules.
So, let's download those rules and put them into the pkg directory of your participant:
cd <your participant's root directory>/pkg/
git clone https://github.com/Yara-Rules/rules.git
We will also need the yara-python module to interpret these rules. Install it if you don't have it:
pip3 install yara-python
When you created your participant, participant-template created a Scanner class in your participant's project_slug/package_slug/participant_name_slug.py file*.
We're going to edit the Scanner class in this file.
Scanner subclasses AbstractScanner, which is provided by polyswarm-client.
*These _slug variables that define your directory structure are based on your responses to the cookiecutter prompts.
We use yara-python to compile the rules found at a path we provide, and then match those rules against incoming data:
import logging
import os
import yara

from polyswarmclient.abstractscanner import AbstractScanner, ScanResult, ScanMode

logger = logging.getLogger(__name__)  # Initialize logger
RULES_DIR = os.getenv('RULES_DIR', 'docker/yara-rules')


class Scanner(AbstractScanner):
    def __init__(self):
        super(Scanner, self).__init__(ScanMode.ASYNC)
        # Compile the EICAR rule once, when the Scanner is created
        self.rules = yara.compile(os.path.join(RULES_DIR, "malware/MALW_Eicar"))

    async def scan_async(self, guid, artifact_type, content, metadata, chain):
        matches = self.rules.match(data=content)
        if matches:
            # At least one rule matched: assert that the artifact is malicious
            return ScanResult(bit=True, verdict=True)

        # No rule matched: assert that the artifact is benign
        return ScanResult(bit=True, verdict=False)
The YARA backend included with polyswarm-client accepts a RULES_DIR environment variable that lets you point to your YARA rules.
When you test this engine, set RULES_DIR to the directory containing the YARA rules you downloaded.
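As a minimal sketch of how that lookup behaves, the snippet below mirrors the os.getenv pattern used in the Scanner above. The helper name resolve_rule_path and the /opt/rules directory are hypothetical and exist only for illustration; they are not part of polyswarm-client.

```python
import os

DEFAULT_RULES_DIR = 'docker/yara-rules'  # same fallback the Scanner above uses


def resolve_rule_path(relative_rule, environ=os.environ):
    """Return the path of a rule file under RULES_DIR.

    Hypothetical helper for illustration; falls back to DEFAULT_RULES_DIR
    when the RULES_DIR variable is not set.
    """
    rules_dir = environ.get('RULES_DIR', DEFAULT_RULES_DIR)
    return os.path.join(rules_dir, relative_rule)


# With RULES_DIR unset, the default directory is used
print(resolve_rule_path('malware/MALW_Eicar', environ={}))

# With RULES_DIR set, the rule is looked up under that directory instead
print(resolve_rule_path('malware/MALW_Eicar', environ={'RULES_DIR': '/opt/rules'}))
```

In practice you would export RULES_DIR in the shell that starts your engine, pointing it at the directory where you cloned the rules.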
ClamAV Scanner
We are going to re-use the ClamAV scanner from the previous tutorial.
A finished solution can be found in clamav.py.
Multiple Analysis Backends
We will extend our Scanner class to use multiple analysis backends, which means we need a way to collect the results of both backends (YARA and ClamAV) and distill them into a single verdict.
import asyncio
import json
import logging

from polyswarmartifact.schema.verdict import Verdict  # used when distilling metadata
from polyswarmclient.abstractscanner import AbstractScanner, ScanResult, ScanMode
from polyswarm_myclamavmicroengine import Scanner as ClamavScanner
from polyswarm_myyaramicroengine import Scanner as YaraScanner

logger = logging.getLogger(__name__)  # Initialize logger
BACKENDS = [ClamavScanner, YaraScanner]


class Scanner(AbstractScanner):
    def __init__(self):
        super(Scanner, self).__init__(ScanMode.ASYNC)
        # Instantiate one scanner per backend class
        self.backends = [cls() for cls in BACKENDS]
This creates a list of backends containing instances of our YaraScanner and ClamavScanner.
Now that we can access both Scanners, let's use both of their results to distill a final verdict in our Scanner's scan_async() function.
    async def scan_async(self, guid, artifact_type, content, metadata, chain):
        # Run every backend concurrently; results come back in backend order
        results = await asyncio.gather(
            *[backend.scan(guid, artifact_type, content, metadata, chain) for backend in self.backends]
        )

        # Unpack the results
        bits = [r.bit for r in results]
        verdicts = [r.verdict for r in results]
        confidences = [r.confidence for r in results]
        metadatas = [r.metadata for r in results]

        # Average the confidence of the backends that asserted (guarding against none asserting)
        asserted_confidences = [c for b, c in zip(bits, confidences) if b]
        avg_confidence = sum(asserted_confidences) / len(asserted_confidences) if asserted_confidences else 0

        # author responsible for distilling multiple metadata values into a value for ScanResult
        metadata = metadatas[0]
        try:
            metadatas = [json.loads(m) for m in metadatas
                         if m and Verdict.validate(json.loads(m))]
            if metadatas:
                metadata = Verdict().set_malware_family(metadatas[0].get('malware_family', '')).json()
        except json.JSONDecodeError:
            logger.exception('Error decoding sub metadata')

        return ScanResult(bit=any(bits), verdict=any(verdicts), confidence=avg_confidence, metadata=metadata)
Here we run all of our backends' scans concurrently and then combine their results into our final verdict. In this simple example, we:
- assert that the artifact is malicious if any of the backends indicates that it is malicious
- assume all backends return confidences at comparable levels; we average the confidences of the backends that asserted into a single confidence score
- assume that the first backend returns the best result in terms of identifying the malware family of the artifact
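The three rules above can be tried out without ClamAV or YARA installed. The following is a minimal sketch, assuming stand-in classes: FakeResult and FakeBackend are hypothetical substitutes for ScanResult and the real scanners, and distill and scan_all are illustrative names, not part of polyswarm-client.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for ScanResult, for illustration only."""
    bit: bool
    verdict: bool
    confidence: float = 1.0
    metadata: str = ''


class FakeBackend:
    """Hypothetical backend that returns a canned result."""

    def __init__(self, result):
        self.result = result

    async def scan(self, content):
        return self.result


def distill(results):
    """Combine per-backend results using the three rules listed above."""
    asserted = [r.confidence for r in results if r.bit]
    # Average only the backends that asserted; avoid dividing by zero
    confidence = sum(asserted) / len(asserted) if asserted else 0.0
    return FakeResult(
        bit=any(r.bit for r in results),
        verdict=any(r.verdict for r in results),
        confidence=confidence,
        metadata=results[0].metadata,  # trust the first backend's metadata
    )


async def scan_all(backends, content):
    # gather() returns results in the same order as the awaitables passed in
    results = await asyncio.gather(*[b.scan(content) for b in backends])
    return distill(results)


backends = [
    FakeBackend(FakeResult(bit=True, verdict=True, confidence=0.5, metadata='first')),
    FakeBackend(FakeResult(bit=True, verdict=False, confidence=1.0, metadata='second')),
    FakeBackend(FakeResult(bit=False, verdict=False, confidence=0.0, metadata='third')),
]
print(asyncio.run(scan_all(backends, b'sample bytes')))
```

Because any() is used for the verdict, a single malicious result outweighs any number of benign ones. That is a deliberate simplification; a production engine might weight backends differently.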
A finished solution can be found in multi.py.
Note: the Python modules polyswarm_myclamavmicroengine and polyswarm_myyaramicroengine come from the previous examples; the exact names depend on your responses to the cookiecutter prompts.
In order for this Multi-engine to be able to use the ClamAV and YARA engines, they have to be available in your PYTHONPATH.
To achieve that, you can run the following command in the root of both the ClamAV and the YARA project directories:
pip3 install .
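After installing, you may want to confirm that Python can locate both packages before starting the Multi-engine. The snippet below is a minimal sketch using the standard library's importlib.util.find_spec; the helper name missing_modules is hypothetical, and the sketch only handles top-level module names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Standard-library modules are always found, so this prints an empty list
print(missing_modules(['json', 'asyncio']))
```

Call missing_modules with the names used in your own import statements (for example, the two Scanner modules imported by the Multi-engine). Any name it returns still needs to be installed.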
Test Your Participant
Once everything is in place, let's test our participant:
Next Steps
If You're Building an Engine
Now that you're familiar with developing several proof-of-concept engines, it's time to consider what it would take to build a production engine that connects to the PolySwarm marketplace.
If You're Building an Arbiter
Get in touch! If you're building an Arbiter, you're likely already in touch with the PolySwarm team. Please reach out and we'll help you plug your Arbiter into the PolySwarm marketplace.