A proof of concept using ArangoDB to store STIX objects

In another tutorial

I showed how to store objects generated by the STIX 2 Python library to the filestore.

When it comes to storing and retrieving STIX 2.1 data at scale in an easy and efficient manner using a database is much more suited than the filestore.

There are number of considerations when selecting a database to use.

The team at Sekoia did a great job detailing their decision to use ArangoDB to store STIX Objects.

ArangoDB uses a few concepts useful for this;

  • Databases: A database contains Collections which hold documents and/or edges
  • Collections: There are two types of Collections
    1. Vertex Collections: contain documents on a graph
    2. Edge Collections: links documents from Vertex Collections
  • Documents: the data

In STIX, Documents are the STIX objects (but not Relationship SROS), Vertex Collections are groupings of those objects, and Edge Collections are collections of Relationship objects.

Therefore, I can start up ArangoDB as follows;

## Install
brew install arangodb
## Run
brew services start arangodb
## will now be accessible in a browser at: http://127.0.0.1:8529 . Default username is root with no password set (leave blank) 

And log into the UI to create a:

  1. Database = arango_stix_demo
  2. Vertex Collection = stix_objects
  3. Edge Collection = stix_relationships

Now those are defined I can start populating the database with some STIX 2.1 data.

Storing STIX 2.1 Objects in ArangoDB

In this example I will create two SDOs and link them together with an SRO (taken from this OASIS STIX example bundle)

ArangoDB provides its own query language named Arango Query Language (AQL) which will allow me to do this:

AQL is mainly a declarative language, meaning that a query expresses what result should be achieved but not how it should be achieved. AQL aims to be human-readable and therefore uses keywords from the English language.[…] Further design goals of AQL were the support of complex query patterns and the different data models ArangoDB offers.

The syntax of AQL queries is different to SQL, even if some keywords overlap. Nevertheless, AQL should be easy to understand for anyone with an SQL background.

In the queries section of the ArangoDB UI I will start by creating two SDOs, one Indicator (indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f) and one Malware (malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b);

LET objects = [
  {
    "_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "type": "indicator",
    "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "created": "2016-04-06T20:03:48.000Z",
    "modified": "2016-04-06T20:03:48.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "labels": [
      "malicious-activity"
    ],
    "name": "Poison Ivy Malware",
    "description": "This file is part of Poison Ivy",
    "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
    "valid_from": "2016-01-01T00:00:00Z"
  },
  {
    "_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    "type": "malware",
    "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    "created": "2016-04-06T20:07:09.000Z",
    "modified": "2016-04-06T20:07:09.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "name": "Poison Ivy"
  }
]
FOR object IN objects
 INSERT object INTO stix_objects

In the query I use the STIX id attribute as the document primary key (_key) so I can easily retrieve the documents later.

Now I need to store the SRO relationship between these Objects in the Edge Collection stix_relationships. Three attributes must be set in the AQL query for this: _key, _from and _to.

For the _key I am going to use STIX SRO ID in the same way I did for the SDOs. For the _from and _to field I will use the STIX SRO properties source_ref and target_ref respectively. The format to use for the _from and _to fields is as follows: <collection>/<_key>. Where <collection> is the Collection is the Document Collection name (stix_objects) and the <_key> is the _key SDO ID (I set _key as the STIX Object id, so the _key value is equal to the STIX Object id)

To demonstrate, the AQL request for this example is as follow:

INSERT {
    "_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad+2016-04-06T20:06:37.000Z",
    "_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
    "type": "relationship",
    "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
    "created": "2016-04-06T20:06:37.000Z",
    "modified": "2016-04-06T20:06:37.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81"
    "relationship_type": "indicates",
    "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
    "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
} IN stix_relationships

As I talked about here

STIX objects also have embedded relationships, e.g. created_by_ref, sighting_of_ref, observed_data_refs, object_refs, etc.

An Object might have more than one embedded relationship property (identified where the property name ends with _ref or _refs. Embedded relationships can also contains lists of relationships. Here is an example of a Report SDO with three embedded relationships under the objects_refs property:

{
    "type": "report",
    "spec_version": "2.1",
    "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "created": "2015-12-21T19:59:11.000Z",
    "modified": "2015-12-21T19:59:11.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "name": "The Black Vine Cyberespionage Group",
    "description": "A simple report with an indicator and campaign",
    "published": "2015-12-21T19:59:11.000Z",
    "report_types": [
        "campaign"
    ],
    "object_refs": [
        "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
        "campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
        "relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a"
    ]
}

These embedded relationships need to be represented as edges too, but in a slightly different way to native STIX Objects. In the previous examples I added pure STIX Objects into ArangoDB (as they would be received), however, embedded relationships require a custom object (non STIX Object) to define the relationship that in turn needs to be parsed out of a STIX Object.

In our implementation, embedded relationship edges have the _key containing both the STIX Object IDs for the embedded relationship, joined with a +. The edges also contain two unique properties; 1) type which is matches the _ref property, and 2) relationship_description property which is equal to the property in the STIX Object.

Let me demonstrate for clarity. First I would create the Report SDO as a Document;

INSERT {
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "type": "report",
    "spec_version": "2.1",
    "id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "created": "2015-12-21T19:59:11.000Z",
    "modified": "2015-12-21T19:59:11.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "name": "The Black Vine Cyberespionage Group",
    "description": "A simple report with an indicator and campaign",
    "published": "2015-12-21T19:59:11.000Z",
    "report_types": [
        "campaign"
    ],
    "object_refs": [
        "indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
        "campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
        "relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a"
    ]
} IN stix_objects

This Report SDO has four embedded relationships;

  • created_by_ref = identity--c2aceda2-0e46-431d-be40-7b4a4e797f81
  • object_refs = indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2
  • object_refs = campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c
  • object_refs = relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a

I will assume these SDOs already exist in the stix_objects document collection.

As such, all that that is required is to create the four embedded relationship edges in the stix_relationships Edge Collection.

LET embedded_relationships  = [
{
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
    "relationship_type": "created_by_ref",
},
{
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+identity--a463ffb3-1bd9-4d94-b02d-74e4f1658283",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/identity--a463ffb3-1bd9-4d94-b02d-74e4f1658283",
    "relationship_type": "object_refs",
},
{
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
    "relationship_type": "object_refs",
},
{
    "_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
    "_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
    "_to": "stix_objects/relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
    "relationship_type": "object_refs",
}
]
FOR embedded_relationship IN embedded_relationships
 INSERT embedded_relationship INTO stix_relationships

Now I can filter the results by relationship_type. For example, I could filter the documents in the stix_relationships collection using the query relationship_type==created_by_ref.

I talked a bit about versioning here

This can be achieved done using updates to documents in ArangoDB. Lets start by creating a new object;

INSERT {
    "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "type": "report",
    "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "created": "2021-01-01T00:00:00.000Z",
    "modified": "2021-01-01T00:00:00.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "name": "Demoing version",
    "published": "2021-01-01T00:00:00.000Z",
    "report_types": ["campaign"]
} IN stix_objects

Then I will update it with the new properties;

UPDATE {
    "_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "type": "report",
    "id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
    "created": "2021-01-01T00:00:00.000Z",
    "modified": "2022-01-01T00:00:00.000Z",
    "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
    "name": "Demoing version",
    "description": "Adding a new field",
    "published": "2021-01-01T00:00:00.000Z",
    "report_types": ["campaign"]
} IN stix_objects

With each UPDATE to a document, ArangoDB tracks its revision under a _rev (Document Revision) property.

Querying STIX 2.1 Objects in ArangoDB

Now these STIX Objects and relationships have successfully been written, I can start to explore the database.

The ArangoDB UI contains a graph visualisation tool that is useful for browsing visually (Graphs > Your Graph).

I will need to create a Graph = stix_graph which contains;

  • stix_relationships as the Edge Definition (edge definition define a relation of the graph)
  • stix_objects for both the fromCollections (Collections that contain the start vertices of the relation) and toCollections (Collections that contain the end vertices of the relation)

Here is what the stix_graph looks like for the Objects I have added to the database;

As you can see from the first example, the two stix_objects in the Document Collection are represented as individual nodes on the graph, and the stix_relationships in the Edge Collection as an edge.

This is about the simplest example of a STIX Object Graph. Nodes (STIX SDO/SCOs) can have multiple edges (SROs) making for much more complex graphs.

In these cases, it is likely you will want to filter the information being returned using AQL queries. AQL is really well documented on the ArangoDB website, but I will cover some basic queries for this.

For example, here is a query using the first example to get the ID’s ( RETURN item.id) five (LIMIT 5) most recently created Objects (SORT item.created DESC) of type “indicator” (FILTER item.type == "indicator") in the stix_objects collection.

FOR item IN stix_objects
    FILTER item.type == "indicator"
    SORT item.created DESC
    LIMIT 5
    RETURN item.id

Which returns;

[
  "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f"
]

Before I showed you the out-of-the-box graph view using the UI query builder (that I used to create the Graph stix_graph) – this created an automatically generated graph traversal query. It is also possible to write these queries to customise the nodes and edges returned in the graph to modelled. For example;

FOR vertex, edge, path IN 1..5
    OUTBOUND 'stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f'
    GRAPH 'stix_graph'
    RETURN path

Here is how this query is formed;

  • The FOR takes three variables, the vertex in the traversal, the edge in the traversal and the current path. The current path contains two members, vertices and edges. Here I am asking to return the paths matching the request that’s why I can find the edges and vertices keys in the result.
  • IN 1..5 specifies the minimal and maximal depth for the traversal. 0 would have been a traversal starting from the original vertex.
  • OUTBOUND specifies the direction to follow Possible values are OUTBOUND OR INBOUND OR ANY. The object I used as the original vertex for the traversal is the source in the relationship object. For this reason INBOUND would no return any result in our example.
  • stix_objects/{_key} defines the vertex where the traversal originates from.
  • GRAPH stix_graph is the name identifying the named graph to use for the traversal.

Running the query gives us the following json output (which can also be modelled as a graph):

[
  {
    "edges": [
      {
        "_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
        "_id": "stix_relationships/relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
        "_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "_rev": "_ep_8O9C---",
        "type": "relationship",
        "id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
        "created": "2016-04-06T20:06:37.000Z",
        "modified": "2016-04-06T20:06:37.000Z",
        "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
        "relationship_type": "indicates",
        "source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
      }
    ],
    "vertices": [
      {
        "_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "_id": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "_rev": "_ep_1Eeq---",
        "type": "indicator",
        "id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
        "created": "2016-04-06T20:03:48.000Z",
        "modified": "2016-04-06T20:03:48.000Z",
        "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
        "labels": [
          "malicious-activity"
        ],
        "name": "Poison Ivy Malware",
        "description": "This file is part of Poison Ivy",
        "pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
        "valid_from": "2016-01-01T00:00:00Z"
      },
      {
        "_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "_id": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "_rev": "_ep_1Eeq--_",
        "type": "malware",
        "id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
        "created": "2016-04-06T20:07:09.000Z",
        "modified": "2016-04-06T20:07:09.000Z",
        "created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
        "name": "Poison Ivy"
      }
    ]
  }
]

As the graphs grow, these AQL queries prove very efficient at returning STIX 2.1 data with complex relationship structures.

You can see this proof-of-concept implemented here:

https://github.com/signalscorps/stix2arango