PolySwarmPolySwarmPolySwarmPolySwarm
Go to PolySwarm
Home

Migrating a PolySwarm-Client-based Engine to a Webhook-based Engine

Existing engines need to migrate to a webhook-based Engine to continue operating on the PolySwarm marketplace. During the transition period, polyswarm-client compatible communities will continue to operate. At some point in the future the polyswarm-client compatible communities will shut down.

There a couple main steps to switch over to webhooks.

  1. Get the new engine template
  2. Convert the Engines Scanner class to microengine-webhooks-py's scan() function

Create new engine template

Follow the docs to get setup with a local microengine-webhooks-py project. Create the new engine template. Once familiar, move on to the following conversion.


Convert the code

Follow along with the explanation and examples to understand the conversion process. The example we use is a local service based engine that stores files in a temporary file and passes the files to the local service for scanning.

Here is the polyswarm-client code that we will be converting:

import asyncio
import shlex

from polyswarmartifact.schema import Verdict

from polyswarmclient.abstractscanner import AbstractScanner, ScanResult
from polyswarmclient.utils import AsyncArtifactTempfile

class Scanner(AbstractScanner):

    async def setup():
        process = await asyncio.create_subprocess_exec(*shlex.split('service scand start'), stdout=asyncio.subprocess.PIPE)
        stdout, _ = await process.communicate()
        return process.returncode == 0

    async def scan(self, guid, artifact_type, content, metadata, chain):
        if artifact_type == ArtifactType.URL:
           return return ScanResult()

        async with AsyncArtifactTempfile(content) as f:
            process = await asyncio.create_subprocess_exec(*shlex.split(f'scand-controller scan {f.name}'), stdout=asyncio.subprocess.PIPE)
            stdout, _ = await process.communicate()

        if 'INFECTED' in stdout.decode('utf-8)':
            metadata = Verdict.set_malware_family(stdout)
            return ScanResult(bit=True, verdict=True, confidence=1.0, metadata=metadata.json())
        else:
            metadata = Verdict.set_malware_family('')
            return ScanResult(bit=True, verdict=False, confidence=1.0, metadata=metadata.json())

To convert this sample engine, we need to understand the differences in Engines built with microengine-webhooks-py.

  1. There is no per-worker setup function
  2. scan() takes a Bounty object, no longer a set of fields
  3. Filtering is performed in the PolySwarm UI Engine configuration, not locally
  4. File content is not pre-downloaded and passed to the scan() function, it must be downloaded intentionally.
  5. The scan() function is called synchronously
  6. Verdict is now a 4 state ENUM value that can be unknown, benign, suspicious, or malicious
  7. The metadata class Verdict from polyswarm-artifacts, conflicts with the 4-state Verdict Enum

Keep these in mind while moving on to the conversion below.


No Setup() function

A pain point in the conversion process is the missing setup() function. The setup() function used to be called once per worker, allowing Engines to start services, create connections, etc. Now, each scan() call is independent of other calls, and sharing state is not an option by default.

There are a couple of recommended ways to workaround this limitation.

Setup before the worker starts

This is the method used in our example project microengine-webhooks-py. The worker uses a tool called Honcho to start the celery worker. It calls commands in a Procfile that are broken up by name: command tuples. In microengine-webhooks-py the Procfile is docker/Procfile.

docker/Profile

celery: celery -A microenginewebhookspy.tasks worker
integration: python -m flask run --host 0.0.0.0
test: pytest --cov --cov-report=term-missing --cov-report=html --no-cov-on-fail --log-cli-level=DEBUG

For this example, we can update the celery command to start the local scanner service before starting the celery worker.

celery:service scan start && celery -A microenginewebhookspy.tasks worker
integration: python -m flask run --host 0.0.0.0
test: pytest --cov --cov-report=term-missing --cov-report=html --no-cov-on-fail --log-cli-level=DEBUG

Celery task instantiation

This is another method that you can explore. With Cellery Task Instantiation, you can define one or more celery tasks that are instantiated once and stored in a global instance that is shared by all other celery tasks. We will leave it to you to explore this further if you wish to try it out.


Converting scan()

With the setup solution in place, developers can start converting the scan() function. To start, clear out the microengine-webhooks-py's scan() function body.

def scan(bounty: Bounty) -> ScanResult:
    pass

Notice in the example above, that the engine performs a check against the ArtifactType. This check is no longer needed for Engines that only handle a single ArtifactType (FILE or URL). PolySwarm will only send bounties for the ArtifactType's that you have configured in the PolySwarm UI. If the Engine supports both URL and FILE ArtifactType's, then you will need to update the detection logic to use the Bounty object field.

dual ArtifactType engine

def scan(bounty: Bounty) -> ScanResult:
    if bounty.artifact_type == ArtifactType.FILE:
        // do something on the FILE artifact
        pass
    else:
        // do something on the URL artifact
        pass

Next thing, we want to get the artifact. For FILE artifacts, it has to be downloaded, like this:

content = bounty.fetch_content().

For URL artifacts, it can be read from Bounty object directly, like this:

url = bounty.artifact_uri.

For our example engine, we will download the FILE artifact.

def scan(bounty: Bounty) -> ScanResult:
    content = bounty.fetch_content()

Next, we can convert the scanning code. The existing scan code will fail because polyswarm-client uses all async code, while microengines-webhooks-py uses all sync code. Every async/await call must be converted to sync counterparts. Many async libraries share the same API as the sync libraries. This makes it possible to just swap the identical sync library when possible. Some libraries have divergent APIs and will take more work to convert.

The example Engine uses the async methods AsyncArtifactTempfile and asyncio.create_subprocess_exec. Let's convert to the builtin tempfile and subprocess, which are sync methods.

First, let's do the tempfile conversion. The following code example shows the change:

def scan(bounty: Bounty) -> ScanResult:
    content = bounty.fetch_content()
    with tempfile.NamedTempfile() as f:
        f.write(content)
        f.seek(0)

In the example, tempfile.NamedTempFile replaces AsyncArtifactTempfile. Notice that the API is different for these two classes. AsyncArtifactTempfile takes the content in the constructor and writes it to the file behind the scenes. tempfile.NamedTempfile does not write anything until write() is called, and it requires a seek() call to ensure future uses of the file read from the beginning.

Next, let's do the subprocess conversion. The follow code example adds that change:

def scan(bounty: Bounty) -> ScanResult:
    content = bounty.fetch_content()
    with tempfile.NamedTempfile() as f:
        f.write(content)
        f.seek(0)
        process = subprocess.Popen(shlex.split(f'scand-controller scan {f.name}'), stdout=subprocess.PIPE)
        stdout, _ = process.communicate()

The subprocess call is similar to its async counterpart. The biggest difference is thatsubprocess.Popen() expects the command as list of args, while asyncio.create_subprocess_exec() expects the command as multiple positional arguments. shlex.split outputs a list, so subprocess.Popen can use the output as is, while asyncio.create_subprocess_exec() has to convert it with *.

Once the filtering has been removed, the content downloaded, and the async code replaced with sync code, it's time to return the results. As before, scan() should return a ScanResult object. The ScanResult class has changed between polyswarm-client and microengine-webhooks-py. Instead of returning a bit, verdict pair, now there is only a verdict that expects one state from a 4-state Verdict enum. So, ScanResult(bit=True, verdict=True) becomes ScanResult(verdict=Verdict.MALICIOUS).

Check the table below to determine the new Verdict enum value, given the existing bit-verdict pairs.

bit-verdict Verdict
False-False UNKNOWN
True-False BENIGN
True-True MALICIOUS
N/A SUSPICIOUS

One problem with this Verdict enum, is that it conflicts with the old metadata class Verdict in polyswarmartifact.schema. In polyswarmartifact.schema, there is a new child class of Verdict called ScanMetadata that you should use instead of Verdict to store your metadata content.

from polyswarmartifact.schema import ScanMetadata.

Lastly, ScanResult.metadata no longer expects a dict, it wants the ScanMetadata object directly.

The conversion is complete. Take a look at the finished example.

import shlex
import subprocess
import tempfile

from polyswarmartifact.schema import ScanMetadata

from microenginewebhookspy.models import Bounty, ScanResult, Verdict

def scan(bounty: Bounty) -> ScanResult:
    content = bounty.fetch_content()
    with tempfile.NamedTempfile() as f:
        f.write(content)
        f.seek(0)
        process = subprocess.Popen(shlex.split(f'scand-controller scan {f.name}'), stdout=subprocess.PIPE)
        stdout, _ = process.communicate()

    if 'INFECTED' in stdout.decode('utf-8'):
        metadata = ScanMetadata.set_malware_family(stdout)
        return ScanResult(verdict=Verdict.MALICIOUS, confidence=1.0, metadata=metadata)
    else:
        metadata = ScanMetadata.set_malware_family('')
        return ScanResult(verdict=Verdict.BENIGN, confidence=1.0, metadata=metadata)

Converting BidStrategy

The BidStategyBase class is no longer relevant with the new setup. Take a look at the new Bid Strategy docs to see how to make a custom bid strategy.

2021 © PolySwarm Pte. Ltd.