PolySwarm Customer API v2

An interface to the version 2 PolySwarm customer APIs.

Supports Python 2.7 and Python 3.5 or greater. The examples on this page use f-strings, which require Python 3.6 or greater.

Installation

From PyPI:

$ pip install polyswarm-api

If you get an error about a missing package named wheel, that means your version of pip is too old. You need pip version 19 or newer. To update pip, run pip install -U pip.

From source:

$ python setup.py install

If you get an error about a missing package named wheel, that means your version of setuptools is too old. You need setuptools version 40.8.0 or newer. To update setuptools, run pip install -U setuptools.

Create an API Client

from polyswarm_api.api import PolyswarmAPI

api_key = "317b21cb093263b701043cb0831a53b9"

api = PolyswarmAPI(key=api_key)

You will need to get your own API key from polyswarm.network/account/api-keys
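
Rather than hard-coding the key in a script, it can be read from the environment. The following is a minimal sketch; the variable name POLYSWARM_API_KEY and the helper load_api_key are illustrative assumptions, not something the library defines.

```python
import os


def load_api_key(env_var='POLYSWARM_API_KEY'):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            'Set the {} environment variable to your API key.'.format(env_var))
    return key


# Usage (requires polyswarm-api to be installed and the variable to be set):
# from polyswarm_api.api import PolyswarmAPI
# api = PolyswarmAPI(key=load_api_key())
```

This keeps the key out of source control and lets the same script run under different accounts.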

Perform Scans

# scan one file
import sys

FILE = '/home/user/malicious.bin'

positives = 0
total = 0

instance = api.submit(FILE)
result = api.wait_for(instance)

if result.failed:
    print('Failed to get results')
    sys.exit(1)

print('Engine Assertions:')
for assertion in result.assertions:
    if assertion.verdict:
        positives += 1
    total += 1
    print('\tEngine {} asserts {}'.\
            format(assertion.author_name,
                   'Malicious' if assertion.verdict else 'Benign'))

print(f'Positives: {positives}')
print(f'Total: {total}')
print(f'PolyScore: {result.polyscore}\n')

print(f'sha256: {result.sha256}')
print(f'sha1: {result.sha1}')
print(f'md5: {result.md5}')
print(f'Extended type: {result.extended_type}')
print(f'First Seen: {result.first_seen}')
print(f'Last Seen: {result.last_seen}\n')

print(f'Permalink: {result.permalink}')

# scan one URL
URL = 'https://polyswarm.io'

positives = 0
total = 0

instance = api.submit(URL, artifact_type='url')
result = api.wait_for(instance)

if result.failed:
    print('Failed to get results')
    sys.exit(1)

print('Engine Assertions:')
for assertion in result.assertions:
    if assertion.verdict:
        positives += 1
    total += 1
    print('\tEngine {} asserts {}'.\
            format(assertion.author_name,
                   'Malicious' if assertion.verdict else 'Benign'))

print(f'Positives: {positives}')
print(f'Total: {total}\n')

print(f'Permalink: {result.permalink}')

When scanning a URL, you should always include the protocol (http:// or https://).
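
The file and URL examples above repeat the same counting loop. As a sketch, the loop can be factored into a small helper. The helper name tally and the FakeAssertion stand-in are hypothetical and exist only so the snippet runs without an API key; the only thing assumed about real assertion objects is the verdict attribute used in the examples above.

```python
from collections import namedtuple


def tally(assertions):
    """Return (positives, total) for an iterable of objects with a .verdict."""
    positives = 0
    total = 0
    for assertion in assertions:
        if assertion.verdict:
            positives += 1
        total += 1
    return positives, total


# Stand-in for the library's assertion objects (hypothetical, local testing only).
FakeAssertion = namedtuple('FakeAssertion', ['author_name', 'verdict'])

sample = [
    FakeAssertion('EngineA', True),
    FakeAssertion('EngineB', False),
    FakeAssertion('EngineC', True),
]

positives, total = tally(sample)
print('Positives: {}'.format(positives))
print('Total: {}'.format(total))
```

With a real scan, the call would be tally(result.assertions).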

Lookup by Hash

# sha256, md5, and sha1 supported
EICAR_HASH = '275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f'

results = api.search(EICAR_HASH)

for result in results:
    # reset the counters so each result reports its own totals
    positives = 0
    total = 0

    if result.failed:
        print('Failed to get result.')
        break

    if not result.assertions:
        print('Artifact not scanned yet - Run rescan for Engine Assertions.')
    else:
        print('Engine Assertions:')

        for assertion in result.assertions:
            if assertion.verdict:
                positives += 1
            total += 1
            print('\tEngine {} asserts {}'.\
                  format(assertion.author_name,
                         'Malicious' if assertion.verdict else 'Benign'))

    print(f'Positives: {positives}')
    print(f'Total: {total}')
    print(f'PolyScore: {result.polyscore}\n')

    print(f'sha256: {result.sha256}')
    print(f'sha1: {result.sha1}')
    print(f'md5: {result.md5}')
    print(f'Extended type: {result.extended_type}')
    print(f'First Seen: {result.first_seen}')
    print(f'Last Seen: {result.last_seen}\n')

    print(f'Permalink: {result.permalink}')

Metadata Searching: Basics

PolySwarm's Metadata Search is a powerful and flexible means to discover previously unknown malware.

As PolySwarm ingests Artifacts (files, URLs, etc), various "Metadata Extractors" are run against the Artifacts, producing a rich set of Attributes that describe the Artifacts, the Scans conducted against the Artifacts and contextual information such as the timestamp of ingestion.

All of the Attributes produced by the Metadata extraction process are "pivotable" (searchable), empowering users to hunt for Artifacts that exhibit (or don't exhibit) various Attributes - and mix and match these queries with rich search syntax including Boolean logic and regular expressions.

For example, if you'd like to find all Artifacts that are:

  • 64 bit Windows executables (PEs)
  • AND were first seen within the last week
  • AND export a specified regular expression of function names
  • OR contain a specified C2 address

... then PolySwarm's Metadata Search is your tool!

PolySwarm's Metadata Search is backed by Elasticsearch and supports the full range of Elasticsearch search criteria to deliver flexible results quickly.

Common Metadata Search Pitfalls

Metadata Search inherits several idiosyncrasies from Elasticsearch. Users should be aware of these quirks when developing queries.

  1. Attribute fields are case-sensitive.
  2. If a query refers to a field that doesn't exist, Metadata Search will ignore this portion of the query.

For example, the following query is likely intended to identify all Artifacts detected as malicious by at least 2 Engines, one of which is ClamAV:

scan.latest_scan.ClamAV.assertion:malicious AND scan.latest_scan.detections.malicious:>1

As explained later, the scan Scope is optional and does not exist on all Artifacts, let alone scan.latest_scan.ClamAV.

On Artifacts for which scan.latest_scan.ClamAV does not exist, this portion of the query is ignored, effectively reducing the query to:

scan.latest_scan.detections.malicious:>1

The result set is therefore incorrect: it includes Artifacts found to be malicious by at least 2 Engines for which no ClamAV scan result exists.

Users must first verify the existence of an optional field before applying criteria to its contents.

This query will return the intended result set:

_exists_:scan.latest_scan.ClamAV AND scan.latest_scan.ClamAV.assertion:malicious AND scan.latest_scan.detections.malicious:>1
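
When queries are assembled in code, the existence check is easy to forget. The following sketch wraps a clause in its guard; the helper name guarded is hypothetical and only performs string formatting.

```python
def guarded(field, clause):
    """Prefix a clause with an existence check on the optional field it uses."""
    return '_exists_:{} AND {}'.format(field, clause)


query = guarded('scan.latest_scan.ClamAV',
                'scan.latest_scan.ClamAV.assertion:malicious')
print(query)
```

The resulting string can be combined with further clauses using AND and passed to the CLI or to api.search_by_metadata.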

"Pivotable" (Searchable) Attributes

Attributes belong to one of the following Scopes:

  1. Artifact: Attributes that describe Artifacts in context and at surface level, e.g. the Artifact's unique id, the time it was first_seen and its various hashes, e.g. sha256. These Attributes are available on all Artifacts.
  2. Scan: Information concerning the Scan(s) that have been executed on the Artifact. Users can address the first and latest (most recent) Scan Result Sets for each scanned Artifact, accessing information such as: what each Engine said about the scanned Artifact, whether the Engine detected the Artifact as malicious or benign and what malware_family the Artifact was found to belong to, as applicable. There is a 1 to many relationship between Artifacts and Scan Result Sets: each Artifact has 0 to N associated Scan Result Sets, whereas every Scan result set will have exactly 1 associated Artifact. These Attributes are available on all Artifacts that have been scanned at least once.
  3. Tool: Attributes that describe the content of Artifacts. A set of Metadata Extraction Tools ("Tools") are run on Artifacts under various criteria, e.g. the Artifact's file type. For example, strings (a Tool) is used to extract IP, domain & URL-like strings from all file Artifacts, whereas pefile (another Tool) exclusively extracts information from Artifacts that are Windows executables. Such Attributes are addressed via the name of the Tool itself. Tool-produced Attributes are available on Artifacts for which the given Tool is applicable.

Queries address Attributes hierarchically (based on their Scope). Results are similarly delivered hierarchically. For example, pefile produces an attribute named imphash. Users are able to access pefile's imphash via pefile.imphash. When constructing queries, it's important to understand this hierarchy and formulate queries accordingly.
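
Because results mirror the query hierarchy, a dotted path such as pefile.imphash can be resolved against a result that has been loaded as a nested dictionary (for example, from the CLI's JSON output). This is a sketch only: get_path is a hypothetical helper and the sample dictionary is made up for illustration.

```python
def get_path(data, dotted_path, default=None):
    """Walk a nested dict along a dotted Attribute path, e.g. 'pefile.imphash'."""
    current = data
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up result fragment; the imphash value is a placeholder.
result = {'pefile': {'imphash': '<IMPHASH>', 'is_dll': False}}

print(get_path(result, 'pefile.imphash'))
print(get_path(result, 'lief.libraries', default='(not present)'))
```

Returning a default instead of raising reflects the fact that Scan and Tool Attributes are optional.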

The following sections expand on each class of Attribute along with CLI examples that leverage the Attributes.

We're constantly expanding on the number of searchable Attributes, thereby providing even more flexibility to users.

This expansion makes documentation a bit of a moving target. A future release of PolySwarm's Web UI, CLI, Python API and Documentation will support auto-completion and Documentation of these Attributes, making it easier to leverage new Attributes as they're added.

If we're missing something you'd like to be able to search for, please get in touch!

Artifact Scope

Attributes that provide contextual and surface-level information about Artifacts are available under the artifact Scope.

The following artifact Attributes are supported:

  • artifact

    • id - a unique identifier for the Artifact within PolySwarm
    • first_seen - (See Deprecation Note Below) timestamp indicating when the Artifact became available on PolySwarm
    • created - (DEPRECATED) timestamp indicating when the Artifact became available on PolySwarm
    • sha256 - the SHA256 hash of the Artifact
    • md5 - the MD5 hash of the Artifact
    • sha1 - the SHA1 hash of the Artifact

Deprecations:

The artifact.created Attribute will soon be deprecated. It will be replaced with artifact.first_seen.

artifact.first_seen is not currently displayed in Metadata Search results via the CLI. This is a known issue that will be fixed shortly.

Querying via CLI

Querying for an instance of EICAR via Metadata Search:

$ polyswarm --fmt pretty-json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'
{
    "artifact": {
        "created": "2019-05-03T23:43:42.298385+00:00",
        "id": "90465088441197206",
        "md5": "44d88612fea8a8f36de82e1278abb02f",
        "sha1": "3395856ce81f2b7382dee72602f798b642f14140",
        "sha256": "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"
    },
    ...

Using jq to filter results for the artifact Scope:

$ polyswarm --fmt json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"' | jq .artifact
{
  "created": "2019-05-03T23:43:42.298385+00:00",
  "id": "90465088441197206",
  "md5": "44d88612fea8a8f36de82e1278abb02f",
  "sha1": "3395856ce81f2b7382dee72602f798b642f14140",
  "sha256": "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"
}

Querying via Python

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'

results = api.search_by_metadata(query)

# Our query is by cryptographic hash; we expect at most 1 result.
# Regardless, it's good practice to properly handle multiple results.
for result in results:
    print(f"Artifact Attributes: {result.artifact}")

Running this script will produce:

Artifact Attributes: {'created': '2019-05-03T23:43:42.298385+00:00', 'id': '90465088441197206', 'md5': '44d88612fea8a8f36de82e1278abb02f', 'sha1': '3395856ce81f2b7382dee72602f798b642f14140', 'sha256': '275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f'}

Scan Attributes

Attributes that describe Scans executed against Artifacts are available under the scan Scope.

The following scan Attributes are supported:

  • scan

    • filename - (DEPRECATED) observed filename(s) for the Artifact
    • url - (DEPRECATED; Exists for URL Artifacts Only) observed URLs for the Artifact
    • countries - (See Deprecation Note Below) countries from which a Scan was initiated on the Artifact
    • mimetype - mime type information

      • extended - extended mime information, e.g., "PE32 executable (GUI) Intel 80386, for MS Windows"
      • mime - mime type, e.g. "application/x-dosexec", see File Types for more examples
    • first_seen - (DEPRECATED) UTC date of when the Artifact was first scanned
    • last_seen - (DEPRECATED) UTC date of when the Artifact was last scanned
    • first_scan / latest_scan - the first / most recent Scan Result Set

      • artifact_instance_id: a unique identifier for the Scan Result Set
      • detections

        • benign - number of Engines that asserted that the Artifact is benign in this Scan Result Set
        • malicious - number of Engines that asserted that the Artifact is malicious in this Scan Result Set
        • unknown - number of Engines that did not finalize (reveal) their assertion
        • total - number of Engines that asserted in this Scan Result Set
      • An array of objects, each describing an Engine (e.g. ClamAV)'s response in the Result Set. See Engine Result Objects for a breakdown.

Deprecations:

The scan.filename Attribute will soon be deprecated. It will be replaced with artifact.filenames.

The scan.url Attribute will soon be deprecated. It will be replaced with artifact.urls.

The scan.first_seen and scan.last_seen Attributes will soon be deprecated. They will be replaced with scan.[first_scan|latest_scan].created.

The scan.countries Attribute will soon be replaced by artifact.countries.

PolySwarm maintains a record of all Scan Result Sets, but only the first Scan (scan.first_scan) and latest Scan (scan.latest_scan) are generally available via Metadata Search.

If you have a use case for querying against more/all Scan Result Sets, please get in touch and let's discuss!

File Types

The following is a list of common mimetypes useful for querying via scan.mimetype.

MIME Type | Kind of document | Extension
application/gzip | GZip Compressed Archive | .gz
application/octet-stream | Any kind of binary data | .bin
application/pdf | Adobe Portable Document Format | .pdf
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Microsoft Excel 2007+ (OpenXML) | .xlsx
application/vnd.openxmlformats-officedocument.wordprocessingml.document | Microsoft Word 2007+ (OpenXML) | .docx
application/x-dosexec | PE32 executable | .exe
application/x-java-applet | Compiled Java class data | .class
application/x-rar | RAR archive data | .rar
application/xml | XML | .xml
application/zip | ZIP archive | .zip
text/html | HyperText Markup Language (HTML) | .htm .html
text/plain | Text (generally ASCII or ISO 8859-n) | .txt

A list of all official MIME media types is maintained by IANA.

Engine Result Objects

Attributes that describe a given Engine's assertion (malicious or benign), identified malware_family (as applicable) and additional optional data are available under the scan.<NAME OF SCAN>.<ENGINE NAME> (e.g. scan.latest_scan.ClamAV) Scope.

Remember: In PolySwarm, Engines are not required to provide responses to Scan requests. For this and other reasons (e.g. in the case of errors), any particular Engine may or may not appear within the scan.<NAME OF SCAN> scope. An Engine that appeared in scan.first_scan may or may not appear in scan.latest_scan and vice versa.

Things to keep in mind:

  1. Although unlikely, 0 Engines may choose to respond to any given Scan Request, resulting in 0 Engine Result Objects for the given Scan Result Set. Do not assume there is at least 1 Engine Result Object on any given Scan Result Set.
  2. Do not assume that any particular Engine produced an Engine Result Object for a given Scan Result Set.
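
Both cautions can be handled in code by enumerating whichever Engine Result Objects are actually present rather than indexing a specific Engine. The sketch below assumes a Scan Result Set loaded as a dictionary, and assumes that Engine Result Objects are the dictionary-valued entries that carry an assertion key (as in the ClamAV samples in this section). The helper name engine_results is hypothetical and the sample data is abridged.

```python
def engine_results(scan_result_set):
    """Map Engine name -> Engine Result Object for one Scan Result Set."""
    return {
        name: value
        for name, value in scan_result_set.items()
        if isinstance(value, dict) and 'assertion' in value
    }


# Abridged sample shaped like a latest_scan object.
latest_scan = {
    'ClamAV': {
        'assertion': 'malicious',
        'metadata': {'malware_family': 'Win.Test.EICAR_HDB-1'},
    },
    'artifact_instance_id': 86531778368568037,
    'detections': {'benign': 1, 'malicious': 9, 'total': 10},
}

engines = engine_results(latest_scan)
if not engines:
    print('No Engine responded to this scan.')
for name in sorted(engines):
    print('{} asserted {}'.format(name, engines[name]['assertion']))
```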

ClamAV's Engine Result Object for the first_scan Scan Result Set for an EICAR file in context:

{
    ...
    "scan": {
        ...
        "first_scan": {
            ...
            "ClamAV": {
                "assertion": "malicious",
                "metadata": {
                    "malware_family": "Eicar-Test-Signature"
                }
            },

This Engine Result Object provides the following information:

  • scan.first_scan.ClamAV.assertion: malicious / benign determination
  • scan.first_scan.ClamAV.metadata.malware_family: the identified malware family name of the Artifact

The following query was run to produce the above:

$ polyswarm --fmt pretty-json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'

Querying via CLI

Querying for an instance of EICAR via Metadata Search:

$ polyswarm --fmt pretty-json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'
{
    ...
    "scan": {
        "countries": [
            ...
        ],
        ...
        "first_scan": {
            ...
        },
        ...
        "latest_scan": {
            ...
            "ClamAV": {
                "assertion": "malicious",
                "metadata": {
                    "malware_family": "Win.Test.EICAR_HDB-1"
                }
            },...
            "artifact_instance_id": 86531778368568037,
            "detections": {
                "benign": 1,
                "malicious": 9,
                "total": 10
            }
        },
        "mimetype": {
            "extended": "EICAR virus test files",
            "mime": "text/plain"
        }
    },
    ...
}

Use the CLI & jq to easily filter results for the scan Scope:

$ polyswarm --fmt json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"' | jq .scan

Use the CLI & jq to determine what ClamAV said about the Artifact:

$ polyswarm --fmt json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"' | jq .scan.latest_scan.ClamAV.assertion
"malicious"

$ polyswarm --fmt json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"' | jq .scan.latest_scan.ClamAV.metadata.malware_family
"Win.Test.EICAR_HDB-1"

Other Attributes can be similarly discovered by passing other paths to jq.

Querying via Python

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'

results = api.search_by_metadata(query)

# Our query is by cryptographic hash; we expect at most 1 result.
# Regardless, it's good practice to properly handle multiple results.
for result in results:

    print(f"Artifact: {result.sha256}")

    # Artifacts that have not been scanned will lack the scan object in Metadata results.
    if not hasattr(result, 'scan'):
        print("Artifact has not been scanned.")
        continue

    # If the scan object exists, the scan.latest_scan object will also exist.
    latest_scan = result.scan['latest_scan']

    # Engines choose whether to respond to each scan request.
    # We cannot assume any given Engine Result Object exists, so we must check first.
    if 'ClamAV' not in latest_scan:
        print("ClamAV did not deliver an assertion for the latest scan of this Artifact.")
        continue

    clamav = latest_scan['ClamAV']
    print(f"ClamAV asserted: {clamav['assertion']}")
    print(f"ClamAV identified the malware family as: {clamav['metadata']['malware_family']}")

Running this script will produce:

Artifact: 275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f
ClamAV asserted: malicious
ClamAV identified the malware family as: Win.Test.EICAR_HDB-1

Tool Attributes

Metadata Extraction Tools are applied on a per-Artifact basis according to some relevancy criteria.

Attributes produced by each Tool are addressed at the top level by the Tool's name. The following Tools are supported:

  • hash - computes various hash functions on Artifacts. Applies to all file Artifacts.

  • strings - extracts string-like sequences from Artifacts. Applies to all file Artifacts.

    • domains - (array of strings) strings that look like Domain names
    • urls - (array of strings) strings that look like URLs (including things like emails)
    • ipv4 - (array of strings) strings that look like IPv4 addresses
    • ipv6 - (array of strings) strings that look like IPv6 addresses
  • lief - extracts information from executables. Applies to Linux (ELF), Windows (PE) and Java executable Artifacts.

    • has_nx - (bool) indicates whether the executable was compiled with NX protections (referred to as DEP on Windows)
    • is_pie - (bool) indicates whether the executable is compatible with a fully randomized address space (full ASLR)
    • libraries - (array of strings) list of imported libraries
    • entrypoint - (string) entrypoint in decimal notation
    • virtual_size - (string) virtual size in decimal notation
    • exported_functions - (array of strings) list of exported functions
    • imported_functions - (array of strings) list of imported functions
    • virtual_size - (int) size of the executable when mapped into a process space
  • pefile - extracts information from Windows executables. Refer to pefile documentation for a description of these fields. Applies to Windows (PE) executable Artifacts.

    • app_container - (bool)
    • compile_date - (string)
    • compile_date_utc - (string)
    • exports - (array of strings)
    • force_integrity - (bool)
    • force_no_isolation - (bool)
    • has_debug_info - (bool)
    • has_export_table - (bool)
    • has_import_table - (bool)
    • high_entropy_aslr - (bool)
    • imphash - (string) "import hash". Use this to find other Windows executables that import the same functions. Examples provided later.
    • imports - (object) dictionary of imports in format dllname: [list, of, functions]
    • imported_functions - (array of strings)
    • is_dll - (bool)
    • is_driver - (bool)
    • is_exe - (bool)
    • is_probably_packed - (bool)
    • is_suspicious - (bool, can be null)
    • is_valid - (bool, can be null)
    • libraries - (array of strings) array of imported libraries (DLLs)
    • no_bind - (bool)
    • pdb - (array of strings) PDB paths referenced in the Artifact. Useful for finding attacker OPSEC mistakes / false flags. Examples provided later.
    • pdb_guids - (array of strings)
    • terminal_server_aware - (bool)
    • uses_aslr - (bool)
    • uses_cfg - (bool)
    • uses_dep - (bool)
    • uses_seh - (bool)
    • verify_checksum - (bool)
    • warnings - (array of strings) warnings generated during pefile execution
    • wdm_driver - (bool)
  • exiftool - pull assorted Attributes from a wide range of file types. Particularly useful for indicators in office documents. Applies to executables, documents, images, videos and fonts.

    • Author - (string) author of the file, e.g. Theresia Ward
    • build - (string) e.g. August 2018
    • CharacterSet - (string) character set of file, e.g. Unicode
    • codesize - (int) code size of PEs
    • comments - (string) e.g. background challenge Licensed
    • company - (string) e.g. Cremin LLC
    • entrypoint - (string) entry point of PEs as a hexadecimal string, e.g. 0x13f0
    • CreateDate - (string) creation time string from document
    • filedescription - (string) e.g. GlowEffect MFC Application
    • filename - (string) DO NOT USE
    • filetype - (string) e.g. Win64 DLL
    • filetypeextension - (string) file type extension (e.g., 'exe', 'dll', 'pdf')
    • initializeddatasize - (int) size of the initialized .data section in PEs
    • InternalName - (string) internal name extracted from executable, e.g. GlowEffect
    • Language - (string) language of file, e.g. 'en-GB'
    • LanguageCode - (string) language used by executable, e.g. English (U.S.)
    • lastmodifiedby - (string) e.g. Arturo Cole
    • linkerversion - (int) linker version used to compile executables, e.g. 14.0
    • manager - (string) e.g. Hayes
    • mimetype - (string)
    • ModifyDate - (string) last modified time string from document
    • OriginalFileName - (string) original name of the file
    • sourcefile - (string) DO NOT USE
    • Subject - (string) subject of the file, e.g. Plastic
    • timestamp - (string) timestamp of the file, e.g. 2018:03:13 07:19:15+00:00
    • Title - (string) title of the file, e.g. transmit
    • zipcrc - (string) the CRC checksum of ZIP files as a hexadecimal string
    • zipfilename - (string) e.g. [Content_Types].xml
    • zipmodifydate - (string) e.g. 2008:04:16 07:40:25

Several exiftool fields are marked as DO NOT USE.

These fields are currently being "clobbered" via our partner feed ingestion process and therefore should not be relied on yet.

We'll remove these notices when these fields contain descriptive values.

Querying via CLI

Querying for an instance of EICAR via Metadata Search:

$ polyswarm --fmt pretty-json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'
{
    ...
    "hash": {
        "md5": "44d88612fea8a8f36de82e1278abb02f",
        "sha1": "3395856ce81f2b7382dee72602f798b642f14140",
        "sha256": "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f",
        "sha3_256": "8b4c4e204a8a039198e292d2291f4c451d80e4c38bf0cc04ad3841fea8755bd8",
        "sha3_512": "a20290c6ebf01dc5182bb57718250f61ab11b418466714632a7d1474a02849641f7b78e4093e19ad12fdbedbe02f3bec4ca3ec3235557e82ab5ac02d061e7007",
        "sha512": "cc805d5fab1fd71a4ab352a9c533e65fb2d5b885518f4e565e68847223b8e6b85cb48f3afad842726d99239c9e36505c64b0dc9a061d9e507d833277ada336ab",
        "ssdeep": "3:a+JraNvsgzsVqSwHq9:tJuOgzsko",
        "ssdeep_chunk": "a+JraNvsgzsVqSwHq9",
        "ssdeep_chunk_size": 3,
        "ssdeep_double_chunk": "tJuOgzsko",
        "tlsh": "41a022003b0eee2ba20b00200032e8b00808020e2ce00a3820a020b8c83308803ec228",
        "tlsh_quartiles": [
            ...
        ],
        "tlsh_quartiles_minimum_match": 32
    },
    ...
    "strings": {
        "domains": [],
        "ipv4": [],
        "ipv6": [],
        "urls": []
    }
}

The EICAR test file is a plain text file, so only the hash and strings Tools are applied to it.

Use the CLI & jq to easily filter results for the hash Scope:

$ polyswarm --fmt json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"' | jq .hash

Use the CLI & jq to determine the Artifact's ssdeep fuzzy hash:

$ polyswarm --fmt json search metadata 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"' | jq .hash.ssdeep

Other Attributes can be similarly discovered by passing other paths to jq.

Querying via Python

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'artifact.sha256:"275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f"'

results = api.search_by_metadata(query)

# Our query is by cryptographic hash; we expect at most 1 result.
# Regardless, it's good practice to properly handle multiple results.
for result in results:

    print(f"Artifact: {result.sha256}")

    # The hash Tool applies to all Artifacts and is eventually run on all Artifacts.
    # There is a window, however, between submission and Tool execution, so we verify
    # the output from the Tool exists before attempting to access output members.
    if not hasattr(result, 'hash'):
        print("The hash Tool hasn't been run on this Artifact yet.")
        continue

    print(f"ssdeep fuzzy hash of Artifact: {result.hash['ssdeep']}")

Running this script will produce:

Artifact: 275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f
ssdeep fuzzy hash of Artifact: 3:a+JraNvsgzsVqSwHq9:tJuOgzsko

Metadata Searching: Syntax

Metadata Search supports Elasticsearch's query_string syntax, providing flexibility through:

  1. Boolean logic, e.g. AND, OR & NOT
  2. grouping Boolean terms
  3. ranges, e.g. date:[2012-01-01 TO 2012-12-31]
  4. comparison operators (>, <=, etc)
  5. wildcards
  6. regular expressions

... and a lot more.

Always enclose literals in double-quotations ("), or, alternatively, escape all Elasticsearch control characters in your query.

For example, URLs contain Elasticsearch control characters. Failing to either escape these control characters or enclose the queried URL in quotations will result in errors.

Doesn't work: strings.urls:https://polyswarm.network - due to the second exact value match operator (:)

Works: strings.urls:"https://polyswarm.network"

Also works: strings.urls:https\://polyswarm.network
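
When literals come from user input or a feed, quoting them programmatically avoids these errors. The following is a minimal sketch; quote_literal and exact_match are hypothetical helpers, and the only escaping performed is of backslashes and embedded double quotes inside the quoted literal.

```python
def quote_literal(value):
    """Wrap a literal in double quotes, escaping backslashes and quotes."""
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return '"{}"'.format(escaped)


def exact_match(field, value):
    """Build a field:"value" clause with the value safely quoted."""
    return '{}:{}'.format(field, quote_literal(value))


print(exact_match('strings.urls', 'https://polyswarm.network'))
```

Quoting is generally simpler than escaping each control character individually, since only two characters need special handling inside the quotes.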

Exact Value Match

Metadata query syntax supports exact value matching via the : operator.

The following will return all Artifacts for which the hash Tool reported an MD5 value of e90099d6f3078a9691ab8fe38f0f25e4.

hash.md5:"e90099d6f3078a9691ab8fe38f0f25e4"

This will return either 0 or 1 results, barring Artifacts whose MD5 hashes collide.

Boolean Operators

Metadata query syntax supports Boolean operators that allow users to combine clauses into more powerful queries.

Supported operators:

  1. AND
  2. OR
  3. NOT

... and a lot more.

You may combine multiple attribute matches in your query using Boolean operators:

scan.first_scan.ClamAV.assertion:benign AND scan.latest_scan.ClamAV.assertion:malicious

This query will return all Artifacts that were first flagged by ClamAV as benign, but on their latest scan are now found to be malicious.

You may also use Boolean operators for attribute values:

scan.latest_scan.ClamAV.metadata.scanner.environment.operating_system:(Linux OR Windows)

This will search for all Artifacts scanned by ClamAV running on Linux OR Windows.
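
The value-grouping form is convenient to generate from a list of candidate values. A small sketch follows; any_of is a hypothetical helper that only formats a string, and it assumes the values need no quoting.

```python
def any_of(field, values):
    """Build field:(A OR B OR ...) from a list of simple values."""
    return '{}:({})'.format(field, ' OR '.join(values))


field = 'scan.latest_scan.ClamAV.metadata.scanner.environment.operating_system'
print(any_of(field, ['Linux', 'Windows']))
```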

Comparison Operators

Metadata query syntax supports comparing values.

Supported comparisons:

  1. >: greater than
  2. <: less than
  3. >=: greater than or equal to
  4. <=: less than or equal to

The following will return all Artifacts with at least one benign assertion on their latest scan.

scan.latest_scan.detections.benign:>0

Grouping

Metadata query syntax supports grouping Boolean operators and clauses using parentheses.

The following will return all Artifacts that, on their latest scan, were detected by 1, 2 or 3 Engines as malicious OR were detected by ClamAV as malicious:

(scan.latest_scan.detections.malicious:>0 AND scan.latest_scan.detections.malicious:<=3) OR scan.latest_scan.ClamAV.assertion:malicious

Ranges

Metadata query syntax supports ranges using the TO operator.

This query:

scan.latest_scan.detections.benign:>=0 AND scan.latest_scan.detections.benign:<=10

... can be simplified to this:

scan.latest_scan.detections.benign:[0 TO 10]

Square brackets ([ & ]) match inclusively with respect to range boundaries, whereas curly brackets ({ & }) match exclusively. In the above query, 0 and 10 are included in the range of matching values.

If we replace the square bracket next to the 10 with a curly bracket:

scan.latest_scan.detections.benign:[0 TO 10}

... then 10 is not a matching value. The number of benign assertions on a scan is an integer value, so the matching range becomes 0 to 9, inclusive.

In addition to fields containing numbers, ranges can be applied to dates. The following will return all Artifacts with an exiftool-identified CreateDate falling in the year 2019. Attribute names are case-sensitive, so the name is written as it appears in the exiftool Attribute list.

exiftool.CreateDate:[2019-01-01 TO 2019-12-31]
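
The bracket rules are easy to mix up when building ranges by hand. The sketch below generates a range clause with each bound marked inclusive or exclusive; range_clause is a hypothetical helper that only formats a string, and the field name is taken from the detections Attributes listed under Scan Attributes.

```python
def range_clause(field, low, high, include_low=True, include_high=True):
    """Build field:[low TO high] using [ ] for inclusive and { } for exclusive."""
    left = '[' if include_low else '{'
    right = ']' if include_high else '}'
    return '{}:{}{} TO {}{}'.format(field, left, low, high, right)


field = 'scan.latest_scan.detections.benign'
print(range_clause(field, 0, 10))                      # both bounds included
print(range_clause(field, 0, 10, include_high=False))  # 10 excluded
```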

Wildcards

Metadata query syntax supports wildcard matching using the * character.

The following will return all Artifacts that, on their latest scan, were identified by ClamAV as belonging to a malware family that contains the string Trojan:

scan.latest_scan.ClamAV.metadata.malware_family:*Trojan*

This would match on Artifacts that belong to, e.g. Trojan-Ransom.Satan and Trojan.Packed2.39908 families.

Wildcards can be used for Attribute names as well, but must be escaped.

The following will return all Artifacts identified as belonging to a malware family that includes the string Trojan by any Engine:

scan.latest_scan.\*.metadata.malware_family:*Trojan*

Note: Do not escape * in values, only in Attribute names.
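
Once results come back, it is often useful to see which Engines actually produced the matching family name. The following sketch applies the same wildcard pattern locally to one Scan Result Set loaded as a dictionary. It uses Python's fnmatch as an approximation of the * wildcard; engines_matching_family is a hypothetical helper and the sample data is made up.

```python
from fnmatch import fnmatchcase


def engines_matching_family(scan_result_set, pattern):
    """Return sorted Engine names whose malware_family matches the wildcard."""
    hits = []
    for name, value in scan_result_set.items():
        if not isinstance(value, dict):
            continue  # skip scalars such as artifact_instance_id
        family = (value.get('metadata') or {}).get('malware_family')
        if family is not None and fnmatchcase(family, pattern):
            hits.append(name)
    return sorted(hits)


# Made-up sample; EngineB and EngineC are placeholder Engine names.
latest_scan = {
    'ClamAV': {'assertion': 'malicious',
               'metadata': {'malware_family': 'Trojan.Packed2.39908'}},
    'EngineB': {'assertion': 'malicious',
                'metadata': {'malware_family': 'Win.Test.EICAR_HDB-1'}},
    'EngineC': {'assertion': 'benign', 'metadata': {}},
    'detections': {'benign': 1, 'malicious': 2, 'total': 3},
    'artifact_instance_id': 86531778368568037,
}

print(engines_matching_family(latest_scan, '*Trojan*'))
```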

Regular Expressions

Metadata query syntax supports regular expression matching on Attribute values.

The following will return all Artifacts that are ELF executables AND export functions that match the regular expression /.*Shooter(Sound|Ping|Key|Image|File).*/, e.g. ShooterSound:

exiftool.filetype:"ELF executable" AND lief.exported_functions:/.*Shooter(Sound|Ping|Key|Image|File).*/

This includes EvilGnome samples.

Regular expressions containing ^ (beginning of line) or $ (end of line) matching are not supported.
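
Elasticsearch regular expressions are anchored: the pattern must match the entire value, which is why the example pattern is wrapped in .* on both sides. This behavior can be approximated locally with Python's re.fullmatch before running a query. Python's regular expression syntax differs from Lucene's in places, so treat this as a rough check only; the helper name matches_whole_value and the sample function names are made up.

```python
import re

PATTERN = r'.*Shooter(Sound|Ping|Key|Image|File).*'


def matches_whole_value(pattern, value):
    """Approximate anchored matching: the pattern must cover the entire value."""
    return re.fullmatch(pattern, value) is not None


for name in ['ShooterSound', 'libShooterKeyHook', 'Shooter']:
    print(name, matches_whole_value(PATTERN, name))

# Without the surrounding .* the pattern no longer matches a longer value.
print(matches_whole_value(r'Shooter(Sound|Ping)', 'xShooterSound'))
```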

Check if Attribute Exists

Metadata query syntax supports existence checking for Attributes.

The following query will return all Artifacts on which the lief Tool has been executed and has reported which libraries the Artifact imports:

_exists_:lief.libraries

Reminder: always check for the existence of an optional Attribute before attempting to match on the value of the optional Attribute.

Failure to do so will almost certainly produce unexpected results.
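One way to follow this advice consistently is to generate the existence check together with the value match. This is a minimal sketch: `guarded_query` is a hypothetical helper, and the `*libcrypto*` value is only an illustrative pattern.

```python
def guarded_query(attribute, value_pattern):
    """Prefix a value match with an _exists_ check on the same Attribute (hypothetical helper)."""
    return "_exists_:{attr} AND {attr}:{val}".format(attr=attribute, val=value_pattern)


print(guarded_query("lief.libraries", "*libcrypto*"))
# _exists_:lief.libraries AND lief.libraries:*libcrypto*
```

The resulting string can be passed to polyswarm search metadata or api.search_by_metadata like any other query.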

Metadata Searching: Hunting

PolySwarm's Metadata Search is a powerful tool for finding unique variants and even completely new versions of malware.

In this section, we'll run through several real-world example hunts in which new malware, including brand new versions that hadn't been previously published, was found using PolySwarm's Metadata Search.

Hunting Syrian Nation State Android Malware

Lookout published a blog post on COVID-19 related Android malware released by the Syrian Electronic Army.

The post discloses:

  1. Where the command and control (C2) addresses are stored within the malicious applications (within res/values/strings.xml)
  2. A list of SHA1 hashes of applications known to belong to this family of malware

First, we look up Lookout's first SHA1 hash on PolySwarm:

  • via the Web UI: link
  • via the CLI (hash search): polyswarm search hash 1aefc2ebaf1a78f23473ce6275b0b514bbcdfb08
  • via the CLI (metadata search using the hash): polyswarm search metadata 'hash.sha1:"1aefc2ebaf1a78f23473ce6275b0b514bbcdfb08"'
  • via the Python library:

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'hash.sha1:"1aefc2ebaf1a78f23473ce6275b0b514bbcdfb08"'

results = api.search_by_metadata(query)

for result in results:
    print(f"Artifact Attributes: {result.artifact}")

We have a hit!

Next, we download the Artifact (using your choice of Web UI, CLI or Python) and use apktool with the d flag to extract res/values/strings.xml:

<?xml version="1.0" encoding="utf-8"?>
<resources>
    ...
    <string name="MT_Bin_dup_0x7f0c0020">Android Telegram</string>
    <string name="MT_Bin_dup_0x7f0c0021">10000</string>
    <string name="MT_Bin_dup_0x7f0c0022">82.137.218.185</string>
    ...
</resources>

It appears as though the C2 address is 82.137.218.185. This information was not published in Lookout's blog post.
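Spotting the address by eye works for one file, but the same step can be scripted. The sketch below is an illustration only: it parses the excerpt shown above with the standard library and keeps every string value that is a valid dotted IPv4 address. `find_ipv4` and `pivot_query` are hypothetical helper names.

```python
import ipaddress
import xml.etree.ElementTree as ET

# The strings.xml excerpt shown above, with the elided entries left out.
STRINGS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string name="MT_Bin_dup_0x7f0c0020">Android Telegram</string>
    <string name="MT_Bin_dup_0x7f0c0021">10000</string>
    <string name="MT_Bin_dup_0x7f0c0022">82.137.218.185</string>
</resources>
"""


def find_ipv4(xml_bytes):
    """Return (name, address) pairs for <string> values that parse as IPv4."""
    found = []
    for node in ET.fromstring(xml_bytes).iter("string"):
        text = (node.text or "").strip()
        try:
            found.append((node.get("name"), str(ipaddress.IPv4Address(text))))
        except ValueError:
            continue  # not a dotted-quad address
    return found


def pivot_query(address):
    """Build the Metadata Search query used to pivot on an address."""
    return 'strings.ipv4:"{}"'.format(address)


for name, address in find_ipv4(STRINGS_XML):
    print(name, pivot_query(address))
```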

We can use Metadata Search to "pivot" using this IP(v4) address:

  • via the Web UI: link (must be logged in to view)
  • via the CLI: polyswarm search metadata 'strings.ipv4:"82.137.218.185"'
  • via the Python library:

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'strings.ipv4:"82.137.218.185"'

results = api.search_by_metadata(query)

for result in results:
    print(f"Artifact Attributes: {result.artifact}")

At the time of writing, we see:

  • 50 results,
  • at least 23 of which were not identified by Lookout in their blog post, and
  • at least 5 of which cannot be found on platforms similar to PolySwarm.

Using PolySwarm, researchers can quickly identify additional variants of malware and produce something that immediately expands on the public knowledge of the threat.

Hunting Iranian Nation State Spyware

ZDNet published an article on Iran's official COVID-19 tracker application that sends the real time location of installees to the Iranian government.

The article provides only a single IOC - a SHA256 hash (0f73ac8839f153cf0e830554d9b34af2ea90fd6514ed3992b66a96bc9c12bb4b) we can find on PolySwarm:

  • via the Web UI: link
  • via the CLI (hash search): polyswarm search hash 0f73ac8839f153cf0e830554d9b34af2ea90fd6514ed3992b66a96bc9c12bb4b
  • via the CLI (metadata search using the hash): polyswarm search metadata 'hash.sha256:"0f73ac8839f153cf0e830554d9b34af2ea90fd6514ed3992b66a96bc9c12bb4b"'
  • via the Python library:

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'hash.sha256:"0f73ac8839f153cf0e830554d9b34af2ea90fd6514ed3992b66a96bc9c12bb4b"'

results = api.search_by_metadata(query)

for result in results:
    print(f"Artifact Attributes: {result.artifact}")

Let's take a look at some of the Metadata Attributes from this Artifact:

{
    "artifact": {
        "created": "2020-03-10T10:16:50.900548+00:00",
        "id": "71592690635387748",
        "md5": "766e5ecf6b1d86abf401ad9223de857d",
        "sha1": "f1271aa0ccf79d16b036bac5320ed4349af69b65",
        "sha256": "0f73ac8839f153cf0e830554d9b34af2ea90fd6514ed3992b66a96bc9c12bb4b"
    },
    ...
    "strings": {
        "domains": [
            "V.mr",
            "",
            "covid-19-e9057.appspot.com",
            "p.to",
            "II1046766097017-4va56jc12ajt308tpbuge0tc5iqla179.apps.googleusercontent.com",
            "b.mc",
            "YJ.cz",
            "6.om",
            "6.gm",
            "covid-19-e9057.firebaseio.com"
        ],
        ...
    }
}

This excerpt was taken from polyswarm --fmt pretty-json search metadata 'hash.sha256:"0f73ac8839f153cf0e830554d9b34af2ea90fd6514ed3992b66a96bc9c12bb4b"'. The data is, of course, available to the Web UI and Python library interfaces as well.

Two of the domains extracted by the strings Tool are particularly interesting:

  1. covid-19-e9057.appspot.com
  2. covid-19-e9057.firebaseio.com

It appears as though some portion of the Iranian government's backend for this app is Google's Appspot and Firebase services. This is mildly interesting because Google removed the application from their Play Store.

Next, we conduct a Metadata Search for the unique portion of these domains (covid-19-e9057) + a wildcard (*) to find additional Artifacts that contain these strings:

$ polyswarm --fmt sha256 search metadata 'strings.domains:covid-19-e9057*'

This search nets 4 Artifacts, all of which have been identified as malicious by Engines on PolySwarm. 3 of these Artifacts were, of course, not mentioned in the ZDNet article. Perhaps they have new functionality worth investigating!
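The "unique portion plus wildcard" step can be derived mechanically from the two domains. This is a minimal sketch under the assumption that the shared part is the leading label of each domain; `wildcard_query_for` is a hypothetical helper, not part of the PolySwarm library.

```python
import os


def wildcard_query_for(domains):
    """Build a strings.domains wildcard query from the leading label shared by all domains."""
    prefix = os.path.commonprefix(domains)  # character-wise common prefix
    if "." not in prefix:
        raise ValueError("domains do not share a complete leading label")
    label = prefix.split(".")[0]
    return "strings.domains:{}*".format(label)


domains = ["covid-19-e9057.appspot.com", "covid-19-e9057.firebaseio.com"]
print(wildcard_query_for(domains))  # strings.domains:covid-19-e9057*
```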

Switching gears from Android to Windows malware, TrendMicro recently published a blog post on some malware exploiting the rise in Zoom popularity.

Among other tidbits mentioned in the article, this malware:

  1. is a Powershell script,
  2. that embeds a 7zip extractor, and
  3. 7zip-compressed Tor, coinminer (the actual malware) and (legitimate) Zoom installers

If the infected computer is powerful enough (notably, if it has a discrete GPU), the malware causes it to mine cryptocurrency over Tor.

TrendMicro published a handful of IOCs, including a C2 URL: https://2no.co/1O5aW.

Again, we turn to Metadata search to find Artifacts that contain this URL:

  • via the Web UI: link (must be logged in to view)
  • via the CLI: polyswarm search metadata 'strings.urls:*2no.co*1O5aW'
  • via the Python library:

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'strings.urls:*2no.co*1O5aW'

results = api.search_by_metadata(query)

for result in results:
    print(f"Artifact Attributes: {result.artifact}")

We get 1 result, which was not part of the IOCs published by TrendMicro, but is clearly a PowerShell script exactly as described in their blog!

Hunt for Exploits

Often when identifying malware families, Engines on PolySwarm will name the malware family and the exploited vulnerability using its associated CVE number.

Users interested in vulnerability analysis and exploitation can use PolySwarm to find Artifacts that exploit known vulnerabilities!

  • via the Web UI: link (must be logged in to view)
  • via the CLI: polyswarm search metadata 'scan.latest_scan.\*.metadata.malware_family:*cve*'
  • via the Python library:

from polyswarm_api.api import PolyswarmAPI

api_key = "<YOUR API KEY>"
api = PolyswarmAPI(key=api_key)

query = 'scan.latest_scan.\*.metadata.malware_family:*cve*'

results = api.search_by_metadata(query)

for result in results:
    print(f"Artifact Attributes: {result.artifact}")

At the time of writing, this query returns 12885 results!
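With this many results, it helps to normalize the CVE identifiers embedded in the family names. The sketch below is one possible approach: it assumes Engines write the identifier with -, _ or . as separators (or none), and the family names in the example are made up for illustration.

```python
import re

# CVE, a 4-digit year and a sequence number of 4+ digits, with optional separators.
CVE_RE = re.compile(r"CVE[-_.]?(\d{4})[-_.]?(\d{4,})", re.IGNORECASE)


def extract_cves(family):
    """Return the sorted, normalized CVE identifiers found in a malware family name."""
    return sorted({"CVE-{}-{}".format(year, num) for year, num in CVE_RE.findall(family)})


print(extract_cves("Exploit.CVE-2017-11882.Gen"))  # ['CVE-2017-11882']
print(extract_cves("exp.cve_2012_0158"))           # ['CVE-2012-0158']
print(extract_cves("Trojan.Generic"))              # []
```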

Hunt WannaCry via its Killswitch Domain

WannaCry would not execute malicious logic if a particular, seemingly nonsense domain name could be resolved. Once this domain was registered, WannaCry propagation came to a halt. The domain registration was a bit of a "killswitch".

We can easily find samples of WannaCry by Metadata searching for its "killswitch" domain.

  • via the CLI: polyswarm search metadata "strings.domains:*iuqerfsodp9ifjaposdfjhgosurijfaewrwergwea.com"

Metadata Searching: Advanced

In this section, we'll cover some examples of more complex Metadata search queries.

Malicious Artifacts ingested within a specified 5 minute time window

polyswarm --fmt sha256 search metadata 'artifact.first_seen:[2020-04-15T01:00 TO 2020-04-15T01:05] AND scan.latest_scan.detections.malicious:>0'

This particular time window saw 501 new malicious artifacts.
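A query of this shape can be generated for any window. The sketch below reuses the field names from the command above and formats timestamps at minute resolution; `window_query` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def window_query(start, minutes=5, min_malicious=0):
    """Build a first-seen window query; timestamps are formatted to the minute."""
    end = start + timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M"
    return (
        "artifact.first_seen:[{} TO {}] AND "
        "scan.latest_scan.detections.malicious:>{}"
    ).format(start.strftime(fmt), end.strftime(fmt), min_malicious)


print(window_query(datetime(2020, 4, 15, 1, 0)))
# artifact.first_seen:[2020-04-15T01:00 TO 2020-04-15T01:05] AND scan.latest_scan.detections.malicious:>0
```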

We can get even more creative and limit to Windows executables:

polyswarm --fmt sha256 search metadata 'exiftool.filetypeextension:exe AND scan.detections.malicious:>0 AND artifact.created:[2020-04-15T01:00 TO 2020-04-15T01:05]'

379 of the 501 artifacts previously returned are identified by exiftool as a Windows executable.

exiftool.filetypeextension is shown here because it is succinct. We also support filetype identification by matching on:

  1. exiftool.filetype (e.g. Win32 EXE or Win* EXE)
  2. scan.mimetype.extended (e.g. PE32console)
  3. scan.mimetype (e.g. application/x-dosexec)

The following Python script looks at Artifacts ingested in the 5 minutes before it runs and, for the first one marked malicious by more than 2 Engines, prints its assertion counts (malicious / benign) and PolyScore:

import os
from polyswarm_api.api import PolyswarmAPI
from datetime import datetime, timedelta

if __name__ == "__main__":
    api = PolyswarmAPI(os.getenv("POLYSWARM_API_KEY"))
    # ensure we don't have microseconds in our query string
    end = datetime.utcnow().replace(microsecond=0)

    start = end - timedelta(minutes=5)

    for metadata_result_obj in api.search_by_metadata('artifact.created:[{} TO {}]'.format(start.isoformat(), end.isoformat())):
        if metadata_result_obj.malicious and metadata_result_obj.malicious > 2:
            print("{} marked malicious by {} engines and benign by {}".format(metadata_result_obj.sha256, metadata_result_obj.malicious, metadata_result_obj.benign))
            artifact_instance = next(api.search(metadata_result_obj.sha256))
            print("Polyscore: {}".format(artifact_instance.polyscore))
            break

Process results with jq, combine with PolyScore

The following will find all Artifacts first seen in the first 15 minutes of April 14th, 2020 that were detected as malicious by more than 3 Engines, extract their SHA256s from JSON and write them to the file hash.txt:

polyswarm --fmt json search metadata 'scan.first_seen:[2020-04-14T00:00 TO 2020-04-14T00:15] AND scan.latest_scan.detections.malicious:>3' | jq -r '.artifact.sha256' > hash.txt

We see 879 results.

Next, we query these hashes for their PolyScore and create a CSV file enumerating hash and PolyScore:

polyswarm --fmt json search hash -r hash.txt | jq -r '[.sha256, .polyscore|tostring] | join(", ")' > polyscore.txt

The resultant polyscore.txt file contains:

$ head polyscore.txt 
8105b93e3acb65b19d43febbb139f3184f9feff2ea78a3a615e6ae58ca6dfd5d, 0.9974553968361014
f1967ed3583fed9ca3b326b8df3cc362ac65420082d93dcbda91069a8c91d7c2, 0.9364975545636711
7c6a626eb58f8324689f8d63924adc6bc9131574b801917ce5514adce3cd8fcf, 0.5883349845592594
d1fb5cc4ffd5df4bfe967edc9203a79032fc58bb76bdc0e63c9f203fa43f1eef, 0.9175460331627756
ea8701c24a9c05ab81f314468ff802cfbbe501d6048392c6e0400f038ca34d21, 0.9684573626545542
c6b975b98330feb6f04dbf262632e4027e9c9d1a5f5cdb3ef0190f592eef64e3, 0.9446378922543466
21ad837771588fb0617431a2e4d77ea16c0b6b9b1e6263c8df21d51696aab79b, 0.9974553968361014
a8472100e89a3c5e7f38c3df86fd7ff2a0af657513672226afe2fcec873e24b0, 0.9175460331627756
3883d38baeae67939b893dcb97245af2c9592a75d53f06a328a6041a65d4379a, 0.9650102417745068
e8ab71339736c76fdeafc8eda8612a502cb723fe3137427bdfc5d6ee2720b90a, 0.9995237243129788
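If jq is not available, the second step can be approximated in Python. This sketch assumes that --fmt json emits one JSON object per line and that each object carries top-level sha256 and polyscore fields, as the jq filter above suggests; `to_csv_rows` is a hypothetical helper.

```python
import json


def to_csv_rows(json_lines):
    """Turn JSON-lines hash search output into 'sha256, polyscore' rows."""
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append("{}, {}".format(record["sha256"], record["polyscore"]))
    return rows


# Example with a single, shortened record; real input would come from the CLI output.
sample = ['{"sha256": "8105b93e", "polyscore": 0.5}']
print("\n".join(to_csv_rows(sample)))
```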

Pivot on latent .pdb information in Windows Executables

FireEye published a blog post wherein they attribute threat actors using latent Windows debug information (.pdb paths) in malware executables.

We can use information extracted by the pefile Tool to search for Artifacts that include PDB paths.

via the CLI:

  • polyswarm search metadata 'pefile.pdb:*RControl*' - 10+ results
  • polyswarm search metadata 'pefile.pdb:*shellcode.pdb' - 142+ results
  • etc

Download Files

OUTPUT_DIR = '/tmp/'
EICAR_HASH = '275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f'

artifact = api.download(OUTPUT_DIR, EICAR_HASH)
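After a download, it can be worth confirming that the file on disk matches the hash that was requested. The sketch below uses only the standard library. The path of the downloaded file is an assumption here, since this section does not show how the file is named, so pass whatever path the download actually produced.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large samples do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_sha256):
    """Compare the file's SHA256 with the expected hex digest, ignoring case."""
    return sha256_of_file(path) == expected_sha256.lower()


# Usage (hypothetical path):
# matches_expected('/tmp/<downloaded file>', EICAR_HASH)
```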

Perform Hunts

Live Hunting

# create and start live hunt
YARA_RULE = 'banker_families.yar'
RULESET_NAME = 'Banker Families Live Hunt'
ACTIVE = True

live_hunt = api.live_create(rule=open(YARA_RULE).read(),
                            active=ACTIVE,
                            ruleset_name=RULESET_NAME)
print(f'ID: {live_hunt.id}')
print(f'Rule Set Name: {live_hunt.ruleset_name}')
print(f'Created: {live_hunt.created}')
print(f'Active: {live_hunt.active}')
print(f'Status: {live_hunt.status}')

# get live hunt list and IDs
hunt_list = api.live_list()

for hunt in hunt_list:
    print(f'HUNT ID: {hunt.id}')
    print(f'Rule Set Name: {hunt.ruleset_name}')
    print(f'Created: {hunt.created}')
    print(f'Active: {hunt.active}')
    print(f'Status: {hunt.status}')

# fetch results
HUNT_ID = 48079983714547442
SINCE = 2000 # How far back in minutes to request results (default: 1440)

results = api.live_results(hunt=HUNT_ID, since=SINCE)

for hunt in results:
    print(f'ID: {hunt.id}')
    print(f'Rule Name: {hunt.rule_name}')
    print(f'Tags: {hunt.tags}')
    print(f'Created: {hunt.created}')
    print(f'SHA256: {hunt.sha256}')
    print(f'Permalink: {hunt.artifact.permalink}')
    print(f'PolyScore: {hunt.artifact.polyscore}')

# delete live hunt
HUNT_ID = 48079983714547442

hunt_deleted = api.live_delete(HUNT_ID)

print(f'ID: {hunt_deleted.id}')
print(f'Rule Set Name: {hunt_deleted.ruleset_name}')
print(f'Created: {hunt_deleted.created}')
print(f'Active: {hunt_deleted.active}')
print(f'Status: {hunt_deleted.status}')

Historical Hunting

# create and start historical hunt
YARA_RULE = 'APT_signatures.yar'
RULESET_NAME = 'APT Historical Ruleset'

historical_hunt = api.historical_create(rule=open(YARA_RULE).read(),
                                        ruleset_name=RULESET_NAME)
print(f'ID: {historical_hunt.id}')
print(f'Rule Set Name: {historical_hunt.ruleset_name}')
print(f'Created: {historical_hunt.created}')
print(f'Active: {historical_hunt.active}')
print(f'Status: {historical_hunt.status}')

# get historical hunt list and IDs
hunt_list = api.historical_list()

for hunt in hunt_list:
    print(f'HUNT ID: {hunt.id}')
    print(f'Rule Set Name: {hunt.ruleset_name}')
    print(f'Created: {hunt.created}')
    print(f'Active: {hunt.active}')
    print(f'Status: {hunt.status}')

# fetch results
HUNT_ID = 41408929604057916

results = api.historical_results(hunt=HUNT_ID)

for hunt in results:
    print(f'ID: {hunt.id}')
    print(f'Rule Name: {hunt.rule_name}')
    print(f'Tags: {hunt.tags}')
    print(f'Created: {hunt.created}')
    print(f'SHA256: {hunt.sha256}')
    print(f'Permalink: {hunt.artifact.permalink}')
    print(f'PolyScore: {hunt.artifact.polyscore}')

# delete historical hunt
HUNT_ID = 41408929604057916

hunt_deleted = api.historical_delete(HUNT_ID)

print(f'ID: {hunt_deleted.id}')
print(f'Rule Set Name: {hunt_deleted.ruleset_name}')
print(f'Created: {hunt_deleted.created}')
print(f'Active: {hunt_deleted.active}')
print(f'Status: {hunt_deleted.status}')

Perform Rescans

import sys

instance = api.rescan("275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f")
result = api.wait_for(instance)

if result.failed:
    print('Failed to get results')
    sys.exit()

positives = 0
total = 0

print('Engine Assertions:')
for assertion in result.assertions:
    if assertion.verdict:
        positives += 1
    total += 1
    print('\tEngine {} asserts {}'.\
            format(assertion.author_name,
                   'Malicious' if assertion.verdict else 'Benign'))

print(f'Positives: {positives}')
print(f'Total: {total}')
print(f'PolyScore: {result.polyscore}\n')

print(f'sha256: {result.sha256}')
print(f'sha1: {result.sha1}')
print(f'md5: {result.md5}')
print(f'Extended type: {result.extended_type}')
print(f'First Seen: {result.first_seen}')
print(f'Last Seen: {result.last_seen}\n')

print(f'Permalink: {result.permalink}')

Get a Stream

SINCE = 60 # Fetch stream from the last 60 minutes

streams = api.stream(since=SINCE)

for stream in streams:
    print(f'ID: {stream.id}')
    print(f'URI: {stream.uri}')
    print(f'Created: {stream.created}')
    print(f'Community: {stream.community}')

Stream is a feature that is added to an account on a case-by-case basis. If you'd like to add this feature to your account, contact us at info@polyswarm.io.

2020 © PolySwarm Pte. Ltd.