This tutorial shows you how to combine multiple analysis backends and outlines a basic verdict distillation primitive.
The two backends will be
ClamAV (from the previous tutorial) and YARA.
Before we add YARA support to our
Scanner, we need some YARA rules.
The Yara-Rules repo is a great resource for free rules.
So, let's download those rules and put them into the
pkg directory of your participant:
cd <your participant's root directory>/pkg/
git clone https://github.com/Yara-Rules/rules.git
We will also need the
yara-python module to interpret these rules - install this if you don't have it:
pip3 install yara-python
When you created your participant,
participant-template created a
Scanner class in your participant's
We're going to edit the
Scanner class in this file.
The Scanner class extends AbstractScanner, which is provided by polyswarm-client.
The _slug variables that define your directory structure are based on your responses to the participant-template prompts.
The Scanner uses yara-python to compile rules (for which we provide the path) and then match these rules against incoming data:
import logging
import os

import yara
from polyswarmclient.abstractscanner import AbstractScanner, ScanResult, ScanMode

logger = logging.getLogger(__name__)  # Initialize logger
RULES_DIR = os.getenv('RULES_DIR', 'docker/yara-rules')


class Scanner(AbstractScanner):

    def __init__(self):
        super(Scanner, self).__init__(ScanMode.ASYNC)
        self.rules = yara.compile(os.path.join(RULES_DIR, "malware/MALW_Eicar"))

    async def scan_async(self, guid, artifact_type, content, metadata, chain):
        matches = self.rules.match(data=content)
        if matches:
            return ScanResult(bit=True, verdict=True)

        return ScanResult(bit=True, verdict=False)
The YARA backend included with
polyswarm-client accepts a
RULES_DIR environment variable that lets you point to your YARA rules.
So, you should set the
RULES_DIR environment variable to point to the YARA rules you downloaded when you test this engine.
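A minimal sketch of how the rule path could be resolved and checked before it is handed to yara.compile. The helper name resolve_rule_path is hypothetical and not part of polyswarm-client; the default directory and the relative rule path are the ones used in the Scanner above.

```python
import os
from pathlib import Path


def resolve_rule_path(relative_rule, default_dir="docker/yara-rules"):
    """Hypothetical helper: locate a rule file under RULES_DIR and fail early if it is missing."""
    # Same lookup as the Scanner: the environment variable wins, otherwise use the default.
    rules_dir = Path(os.getenv("RULES_DIR", default_dir))
    rule_path = rules_dir / relative_rule
    if not rule_path.is_file():
        raise FileNotFoundError(f"YARA rule not found: {rule_path} (check RULES_DIR)")
    return rule_path
```

With RULES_DIR pointing at the rules directory created by the git clone step, resolve_rule_path("malware/MALW_Eicar") returns the path the Scanner would compile. A wrong setting then surfaces as a clear error at startup instead of a failure inside yara.compile.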
We are going to re-use the ClamAV scanner from the previous tutorial.
A finished solution can be found in clamav.py.
We will extend our
Scanner class to utilize multiple analysis backends, which means we need to have some way to get the result of both backends (YARA and ClamAV) and distill that into our single verdict.
import asyncio
import json
import logging

# Verdict is assumed to come from the polyswarm-artifact package
from polyswarmartifact.schema.verdict import Verdict
from polyswarmclient.abstractscanner import AbstractScanner, ScanResult, ScanMode
from polyswarm_myclamavmicroengine import Scanner as ClamavScanner
from polyswarm_myyaramicroengine import Scanner as YaraScanner

logger = logging.getLogger(__name__)  # Initialize logger
BACKENDS = [ClamavScanner, YaraScanner]


class Scanner(AbstractScanner):

    def __init__(self):
        super(Scanner, self).__init__(ScanMode.ASYNC)
        # Instantiate one scanner per backend class
        self.backends = [cls() for cls in BACKENDS]
This creates a list of backends containing instances of our YaraScanner and ClamavScanner.
Now that we can access both Scanners, let's use both of their results to distill a final verdict in our Scanner's scan_async method:
    async def scan_async(self, guid, artifact_type, content, metadata, chain):
        # Run every backend concurrently, passing through all scan arguments
        results = await asyncio.gather(
            *[backend.scan(guid, artifact_type, content, metadata, chain) for backend in self.backends]
        )

        # Unpack the results
        bits = [r.bit for r in results]
        verdicts = [r.verdict for r in results]
        confidences = [r.confidence for r in results]
        metadatas = [r.metadata for r in results]

        # Average the confidences of the backends that asserted; fall back to 0.0 when none did
        asserted_confidences = [c for b, c in zip(bits, confidences) if b]
        avg_confidence = sum(asserted_confidences) / len(asserted_confidences) if asserted_confidences else 0.0

        # author responsible for distilling multiple metadata values into a value for ScanResult
        metadata = metadatas[0]
        try:
            metadatas = [json.loads(m) for m in metadatas if m and Verdict.validate(json.loads(m))]
            if metadatas:
                # Take the malware family from the first backend that returned valid metadata
                metadata = Verdict().set_malware_family(metadatas[0].get('malware_family', '')).json()
        except json.JSONDecodeError:
            logger.exception('Error decoding sub metadata')

        return ScanResult(bit=any(bits), verdict=any(verdicts), confidence=avg_confidence, metadata=metadata)
Here we calculate all of our Scanner's results asynchronously, and then combine them into our final verdict. In this simple example, we:
- assert that if any of the backends indicate that the artifact is malicious, then the artifact is malicious
- assume all backends return confidences at comparable levels; we average the confidences of the backends that asserted into a single confidence score
- assume that the first backend will return the best result in terms of identifying the malware family of the artifact
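The three rules above can be tried without polyswarm-client installed. The following is a sketch under stated assumptions: StubResult and StubBackend are hypothetical stand-ins for ScanResult and the real scanners, and the metadata is treated as a plain JSON string rather than a validated Verdict.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for ScanResult; field names mirror the tutorial."""
    bit: bool = False
    verdict: bool = False
    confidence: float = 1.0
    metadata: str = ""


class StubBackend:
    """Hypothetical backend that returns a canned result."""

    def __init__(self, result):
        self.result = result

    async def scan(self, content):
        return self.result


def first_family(results):
    """Malware family from the first result whose metadata parses as a JSON object."""
    for r in results:
        if not r.metadata:
            continue
        try:
            parsed = json.loads(r.metadata)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed.get("malware_family", "")
    return ""


def distill(results):
    """any() for bit and verdict, mean confidence of asserting backends, first family."""
    asserted = [r.confidence for r in results if r.bit]
    avg_confidence = sum(asserted) / len(asserted) if asserted else 0.0
    return StubResult(
        bit=any(r.bit for r in results),
        verdict=any(r.verdict for r in results),
        confidence=avg_confidence,
        metadata=json.dumps({"malware_family": first_family(results)}),
    )


async def scan_all(backends, content):
    # Run every backend concurrently, then combine the results
    results = await asyncio.gather(*(b.scan(content) for b in backends))
    return distill(results)
```

Calling asyncio.run(scan_all(backends, content)) with a list of StubBackend instances returns one combined StubResult. Backends that did not assert (bit=False) are left out of the confidence average, and if none asserted the confidence falls back to 0.0 instead of dividing by zero.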
A finished solution can be found in multi.py.
Note: the Python modules imported by the Multi-engine,
polyswarm_myclamavmicroengine and polyswarm_myyaramicroengine, come from the previous examples.
In order for this Multi-engine to be able to use the ClamAV and YARA engines, they have to be available in your PYTHONPATH.
To achieve that, you can run the following command in the root of both the ClamAV and the YARA project directories:
pip3 install .
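After installing, a quick check along these lines can confirm that both engine packages are importable. This is a sketch: missing_modules is a hypothetical helper, and the module names you pass in are assumed to match whatever your own engine packages are called.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the current import path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]
```

For example, missing_modules(["polyswarm_myclamavmicroengine", "polyswarm_myyaramicroengine"]) returns an empty list once both projects are installed. Any name it returns still needs pip3 install . run in its project directory.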
Once everything is in place, let's test our participant:
Now that you're familiar with developing several proof-of-concept engines, it's time to consider what it would take to build a production engine that connects to the PolySwarm marketplace.
Get in touch! If you're building an Arbiter, you're likely already in touch with the PolySwarm team. Please reach out and we'll help you plug your Arbiter into the PolySwarm marketplace.