Migrating a PolySwarm-Client-based Engine to a Webhook-based Engine
Existing engines need to migrate to a webhook-based Engine to continue operating on the PolySwarm marketplace. During the transition period, polyswarm-client compatible communities will continue to operate. At some point in the future the polyswarm-client compatible communities will shut down.
There a couple main steps to switch over to webhooks.
- Get the new engine template
- Convert the Engines
Scanner
class tomicroengine-webhooks-py
'sscan()
function
Create new engine template
Follow the docs to get setup with a local microengine-webhooks-py
project. Create the new engine template.
Once familiar, move on to the following conversion.
Convert the code
Follow along with the explanation and examples to understand the conversion process. The example we use is a local service based engine that stores files in a temporary file and passes the files to the local service for scanning.
Here is the polyswarm-client
code that we will be converting:
import asyncio
import shlex
from polyswarmartifact.schema import Verdict
from polyswarmclient.abstractscanner import AbstractScanner, ScanResult
from polyswarmclient.utils import AsyncArtifactTempfile
class Scanner(AbstractScanner):
async def setup():
process = await asyncio.create_subprocess_exec(*shlex.split('service scand start'), stdout=asyncio.subprocess.PIPE)
stdout, _ = await process.communicate()
return process.returncode == 0
async def scan(self, guid, artifact_type, content, metadata, chain):
if artifact_type == ArtifactType.URL:
return return ScanResult()
async with AsyncArtifactTempfile(content) as f:
process = await asyncio.create_subprocess_exec(*shlex.split(f'scand-controller scan {f.name}'), stdout=asyncio.subprocess.PIPE)
stdout, _ = await process.communicate()
if 'INFECTED' in stdout.decode('utf-8)':
metadata = Verdict.set_malware_family(stdout)
return ScanResult(bit=True, verdict=True, confidence=1.0, metadata=metadata.json())
else:
metadata = Verdict.set_malware_family('')
return ScanResult(bit=True, verdict=False, confidence=1.0, metadata=metadata.json())
To convert this sample engine, we need to understand the differences in Engines built with microengine-webhooks-py
.
- There is no per-worker setup function
scan()
takes aBounty
object, no longer a set of fields- Filtering is performed in the PolySwarm UI Engine configuration, not locally
- File content is not pre-downloaded and passed to the
scan()
function, it must be downloaded intentionally. - The
scan()
function is called synchronously Verdict
is now a 4 state ENUM value that can beunknown
,benign
,suspicious
, ormalicious
- The metadata class
Verdict
frompolyswarm-artifacts
, conflicts with the 4-stateVerdict
Enum
Keep these in mind while moving on to the conversion below.
No Setup() function
A pain point in the conversion process is the missing setup()
function.
The setup()
function used to be called once per worker, allowing Engines to start services, create connections, etc.
Now, each scan()
call is independent of other calls, and sharing state is not an option by default.
There are a couple of recommended ways to workaround this limitation.
Setup before the worker starts
This is the method used in our example project microengine-webhooks-py
.
The worker uses a tool called Honcho to start the celery worker.
It calls commands in a Procfile
that are broken up by name: command
tuples.
In microengine-webhooks-py
the Procfile
is docker/Procfile
.
docker/Profile
celery: celery -A microenginewebhookspy.tasks worker
integration: python -m flask run --host 0.0.0.0
test: pytest --cov --cov-report=term-missing --cov-report=html --no-cov-on-fail --log-cli-level=DEBUG
For this example, we can update the celery
command to start the local scanner service before starting the celery
worker.
celery:service scan start && celery -A microenginewebhookspy.tasks worker
integration: python -m flask run --host 0.0.0.0
test: pytest --cov --cov-report=term-missing --cov-report=html --no-cov-on-fail --log-cli-level=DEBUG
Celery task instantiation
This is another method that you can explore. With Cellery Task Instantiation, you can define one or more celery tasks that are instantiated once and stored in a global instance that is shared by all other celery tasks. We will leave it to you to explore this further if you wish to try it out.
Converting scan()
With the setup
solution in place, developers can start converting the scan()
function.
To start, clear out the microengine-webhooks-py
's scan()
function body.
def scan(bounty: Bounty) -> ScanResult:
pass
Notice in the example above, that the engine performs a check against the ArtifactType
.
This check is no longer needed for Engines that only handle a single ArtifactType
(FILE
or URL
).
PolySwarm will only send bounties for the ArtifactType
's that you have configured in the PolySwarm UI.
If the Engine supports both URL and FILE ArtifactType
's, then you will need to update the detection logic to use the Bounty
object field.
dual ArtifactType engine
def scan(bounty: Bounty) -> ScanResult:
if bounty.artifact_type == ArtifactType.FILE:
// do something on the FILE artifact
pass
else:
// do something on the URL artifact
pass
Next thing, we want to get the artifact.
For FILE
artifacts, it has to be downloaded, like this:
content = bounty.fetch_content()
.
For URL
artifacts, it can be read from Bounty
object directly, like this:
url = bounty.artifact_uri
.
For our example engine, we will download the FILE
artifact.
def scan(bounty: Bounty) -> ScanResult:
content = bounty.fetch_content()
Next, we can convert the scanning code.
The existing scan code will fail because polyswarm-client
uses all async code, while microengines-webhooks-py
uses all sync code.
Every async/await call must be converted to sync counterparts.
Many async libraries share the same API as the sync libraries.
This makes it possible to just swap the identical sync library when possible.
Some libraries have divergent APIs and will take more work to convert.
The example Engine uses the async methods AsyncArtifactTempfile
and asyncio.create_subprocess_exec
.
Let's convert to the builtin tempfile
and subprocess
, which are sync methods.
First, let's do the tempfile
conversion.
The following code example shows the change:
def scan(bounty: Bounty) -> ScanResult:
content = bounty.fetch_content()
with tempfile.NamedTempfile() as f:
f.write(content)
f.seek(0)
In the example, tempfile.NamedTempFile
replaces AsyncArtifactTempfile
.
Notice that the API is different for these two classes.
AsyncArtifactTempfile
takes the content in the constructor and writes it to the file behind the scenes.
tempfile.NamedTempfile
does not write anything until write()
is called, and it requires a seek()
call to ensure future uses of the file read from the beginning.
Next, let's do the subprocess
conversion.
The follow code example adds that change:
def scan(bounty: Bounty) -> ScanResult:
content = bounty.fetch_content()
with tempfile.NamedTempfile() as f:
f.write(content)
f.seek(0)
process = subprocess.Popen(shlex.split(f'scand-controller scan {f.name}'), stdout=subprocess.PIPE)
stdout, _ = process.communicate()
The subprocess
call is similar to its async counterpart.
The biggest difference is thatsubprocess.Popen()
expects the command as list of args, while asyncio.create_subprocess_exec()
expects the command as multiple positional arguments.
shlex.split
outputs a list, so subprocess.Popen
can use the output as is, while asyncio.create_subprocess_exec()
has to convert it with *
.
Once the filtering has been removed, the content downloaded, and the async code replaced with sync code, it's time to return the results.
As before, scan()
should return a ScanResult
object.
The ScanResult
class has changed between polyswarm-client
and microengine-webhooks-py
.
Instead of returning a bit
, verdict
pair, now there is only a verdict
that expects one state from a 4-state Verdict
enum.
So, ScanResult(bit=True, verdict=True)
becomes ScanResult(verdict=Verdict.MALICIOUS)
.
Check the table below to determine the new Verdict enum value, given the existing bit-verdict pairs.
bit-verdict | Verdict |
---|---|
False-False | UNKNOWN |
True-False | BENIGN |
True-True | MALICIOUS |
N/A | SUSPICIOUS |
One problem with this Verdict
enum, is that it conflicts with the old metadata class Verdict
in polyswarmartifact.schema
.
In polyswarmartifact.schema
, there is a new child class of Verdict
called ScanMetadata
that you should use instead of Verdict
to store your metadata content.
from polyswarmartifact.schema import ScanMetadata
.
Lastly, ScanResult.metadata
no longer expects a dict, it wants the ScanMetadata
object directly.
The conversion is complete. Take a look at the finished example.
import shlex
import subprocess
import tempfile
from polyswarmartifact.schema import ScanMetadata
from microenginewebhookspy.models import Bounty, ScanResult, Verdict
def scan(bounty: Bounty) -> ScanResult:
content = bounty.fetch_content()
with tempfile.NamedTempfile() as f:
f.write(content)
f.seek(0)
process = subprocess.Popen(shlex.split(f'scand-controller scan {f.name}'), stdout=subprocess.PIPE)
stdout, _ = process.communicate()
if 'INFECTED' in stdout.decode('utf-8'):
metadata = ScanMetadata.set_malware_family(stdout)
return ScanResult(verdict=Verdict.MALICIOUS, confidence=1.0, metadata=metadata)
else:
metadata = ScanMetadata.set_malware_family('')
return ScanResult(verdict=Verdict.BENIGN, confidence=1.0, metadata=metadata)
Converting BidStrategy
The BidStategyBase
class is no longer relevant with the new setup.
Take a look at the new Bid Strategy docs to see how to make a custom bid strategy.