
Building a Multi-Backend Engine / Arbiter

This tutorial shows how to combine multiple analysis backends and outlines a basic verdict distillation primitive. The two backends are ClamAV (from the previous tutorial) and YARA.

Adding YARA to the Mix

Before we add YARA support to our Scanner, we need some YARA rules.

The Yara-Rules repo is a great resource for free rules. So, let's download those rules and put them into the pkg directory of your participant:

cd <your participant's root directory>/pkg/
git clone https://github.com/Yara-Rules/rules.git

We will also need the yara-python module to interpret these rules - install this if you don't have it:

pip3 install yara-python

When you created your participant, participant-template created a Scanner class in your participant's project_slug/package_slug/participant_name_slug.py file*. We're going to edit the Scanner class in this file. Scanner subclasses AbstractScanner, which is provided by polyswarm-client.

*These _slug variables that define your directory structure are based on your responses to the cookiecutter prompts.

We use yara-python to compile the rules from a path we supply, and then match the compiled rules against incoming data:

import logging
import os
import yara

from polyswarmclient.abstractscanner import AbstractScanner, ScanResult, ScanMode

logger = logging.getLogger(__name__)  # Initialize logger
RULES_DIR = os.getenv('RULES_DIR', 'docker/yara-rules')

class Scanner(AbstractScanner):
    def __init__(self):
        super(Scanner, self).__init__(ScanMode.ASYNC)
        self.rules = yara.compile(os.path.join(RULES_DIR, "malware/MALW_Eicar"))

    async def scan_async(self, guid, artifact_type, content, metadata, chain):
        matches = self.rules.match(data=content)
        if matches:
            return ScanResult(bit=True, verdict=True)

        return ScanResult(bit=True, verdict=False)

The YARA backend included with polyswarm-client accepts a RULES_DIR environment variable that lets you point to your YARA rules. So, you should set the RULES_DIR environment variable to point to the YARA rules you downloaded when you test this engine.
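
The Scanner above compiles a single rule file. As a rough sketch of how a larger rule set under RULES_DIR might be gathered, the helper below walks a directory and builds a namespace-to-path mapping of the shape that yara-python's yara.compile(filepaths=...) accepts. The helper name collect_rule_files, the accepted file extensions, and the namespace scheme are illustrative assumptions and are not part of polyswarm-client. It uses only the standard library, so nothing is compiled here.

```python
import os


def collect_rule_files(rules_dir, extensions=(".yar", ".yara")):
    """Map a namespace (relative path without extension) to each rule file path.

    Hypothetical helper: the returned dict has the {namespace: path} shape
    that yara.compile(filepaths=...) expects.
    """
    rule_files = {}
    for root, _dirs, files in os.walk(rules_dir):
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            if ext.lower() not in extensions:
                continue  # skip READMEs and other files without a rule extension
            path = os.path.join(root, name)
            rel = os.path.relpath(os.path.join(root, stem), rules_dir)
            namespace = rel.replace(os.sep, "/")
            rule_files[namespace] = path
    return rule_files


if __name__ == "__main__":
    # Same environment variable and default as the Scanner above.
    rules_dir = os.getenv("RULES_DIR", "docker/yara-rules")
    for namespace, path in sorted(collect_rule_files(rules_dir).items()):
        print(namespace, "->", path)
```

Files without one of the listed extensions are ignored, so adjust the extensions argument if your rule files are named differently. A missing directory simply yields an empty mapping.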

ClamAV Scanner

We are going to re-use the ClamAV scanner from the previous tutorial.

A finished solution can be found in clamav.py.

Multiple Analysis Backends

We will extend our Scanner class to utilize multiple analysis backends, which means we need to have some way to get the result of both backends (YARA and ClamAV) and distill that into our single verdict.

import asyncio
import json
import logging

from polyswarmartifact.schema.verdict import Verdict
from polyswarmclient.abstractscanner import AbstractScanner, ScanResult, ScanMode
from polyswarm_myclamavmicroengine import Scanner as ClamavScanner
from polyswarm_myyaramicroengine import Scanner as YaraScanner

logger = logging.getLogger(__name__)  # Initialize logger
BACKENDS = [ClamavScanner, YaraScanner]

class Scanner(AbstractScanner):

    def __init__(self):
        super(Scanner, self).__init__(ScanMode.ASYNC)
        self.backends = [cls() for cls in BACKENDS]

This creates a list of backends containing instances of our YaraScanner and ClamavScanner.

Now that we can access both Scanners, let's use both of their results to distill a final verdict in our Scanner's scan_async() function.

    async def scan_async(self, guid, artifact_type, content, metadata, chain):
        # Run every backend concurrently; results come back in backend order
        results = await asyncio.gather(
            *[backend.scan(guid, artifact_type, content, chain) for backend in self.backends]
        )

        # Unpack the results
        bits = [r.bit for r in results]
        verdicts = [r.verdict for r in results]
        confidences = [r.confidence for r in results]
        metadatas = [r.metadata for r in results]

        # Average the confidences of only those backends that asserted;
        # fall back to 0.0 when none did, to avoid dividing by zero
        asserted_confidences = [c for b, c in zip(bits, confidences) if b]
        avg_confidence = (sum(asserted_confidences) / len(asserted_confidences)
                          if asserted_confidences else 0.0)

        # The author is responsible for distilling multiple metadata values into one value for ScanResult
        metadata = metadatas[0]
        try:
            metadatas = [json.loads(m) for m in metadatas
                         if m and Verdict.validate(json.loads(m))]
            if metadatas:
                metadata = Verdict().set_malware_family(metadatas[0].get('malware_family', '')).json()
        except json.JSONDecodeError:
            logger.exception('Error decoding sub metadata')

        return ScanResult(bit=any(bits), verdict=any(verdicts), confidence=avg_confidence, metadata=metadata)

Here we run all of our backend Scanners concurrently, and then combine their results into our final verdict. In this simple example, we:

  • assert that if any of the backends indicate that the artifact is malicious, then the artifact is malicious
  • assume all backends report confidences on a comparable scale; we average the confidences of the backends that asserted into a single confidence score
  • assume that the first backend will return the best result in terms of identifying the malware family of the artifact
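
The combination rules listed above can be tried out without any PolySwarm packages installed. The following is a minimal sketch under stated assumptions: StubResult and StubBackend are hypothetical stand-ins for ScanResult and a real backend Scanner, their scan() takes only the content, and metadata handling is left out.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for ScanResult with only the fields combined below."""
    bit: bool = False
    verdict: bool = False
    confidence: float = 1.0


class StubBackend:
    """Hypothetical backend that returns a canned result."""

    def __init__(self, result):
        self._result = result

    async def scan(self, content):
        await asyncio.sleep(0)  # yield to the event loop once before answering
        return self._result


def distill(results):
    """Combine results: any-of for bit and verdict, mean of asserted confidences."""
    bits = [r.bit for r in results]
    verdicts = [r.verdict for r in results]
    asserted = [r.confidence for r in results if r.bit]
    # Guard the empty case so a scan where no backend asserts does not divide by zero.
    confidence = sum(asserted) / len(asserted) if asserted else 0.0
    return StubResult(bit=any(bits), verdict=any(verdicts), confidence=confidence)


async def scan_all(backends, content):
    # gather() runs the scans concurrently and returns results in backend order.
    results = await asyncio.gather(*(b.scan(content) for b in backends))
    return distill(results)


if __name__ == "__main__":
    backends = [
        StubBackend(StubResult(bit=True, verdict=True, confidence=0.75)),
        StubBackend(StubResult(bit=True, verdict=False, confidence=0.25)),
        StubBackend(StubResult(bit=False)),  # this backend declines to assert
    ]
    print(asyncio.run(scan_all(backends, b"example bytes")))
```

With these stubs, the two asserting backends average to a confidence of 0.5, and the combined verdict is malicious because one backend said so. Whether any-of is the right policy depends on how much you trust each backend; a stricter engine might require a majority instead.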

A finished solution can be found in multi.py.

Note: the Python modules polyswarm_myclamavmicroengine and polyswarm_myyaramicroengine come from the previous examples. For this multi-backend engine to use the ClamAV and YARA engines, both modules have to be importable, that is, available on your PYTHONPATH. To achieve that, run the following command in the root of both the ClamAV and the YARA project directories:

pip3 install .
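
After installing, a quick way to confirm that both packages are visible to the interpreter is to look them up without importing them. This is a small sketch using only the standard library; missing_modules is a hypothetical helper, and the module names are taken from the import lines of the multi-backend Scanner, so adjust them to match your own projects.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed names; they depend on your answers to the cookiecutter prompts.
    required = ["polyswarm_myclamavmicroengine", "polyswarm_myyaramicroengine"]
    missing = missing_modules(required)
    if missing:
        print("Not importable; run `pip3 install .` in each project root:", ", ".join(missing))
    else:
        print("All backend packages are importable.")
```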

Test Your Participant

Once everything is in place, let's test our participant:

Next Steps

If You're Building an Engine

Now that you're familiar with developing several proof-of-concept engines, it's time to consider what it would take to build a production engine that connects to the PolySwarm marketplace.

If You're Building an Arbiter

Get in touch! If you're building an Arbiter, you're likely already in touch with the PolySwarm team. Please reach out and we'll help you plug your Arbiter into the PolySwarm marketplace.

2020 © PolySwarm Pte. Ltd.