In another tutorial
I showed how to store objects generated by the STIX 2 Python library to the filestore.
When it comes to storing and retrieving STIX 2.1 data at scale in an easy and efficient manner using a database is much more suited than the filestore.
There are number of considerations when selecting a database to use.
The team at Sekoia did a great job detailing their decision to use ArangoDB to store STIX Objects.
ArangoDB uses a few concepts useful for this;
- Databases: A database contains Collections which hold documents and/or edges
- Collections: There are two types of Collections
- Vertex Collections: contain documents on a graph
- Edge Collections: links documents from Vertex Collections
- Documents: the data
In STIX, Documents are the STIX objects (but not Relationship SROS), Vertex Collections are groupings of those objects, and Edge Collections are collections of Relationship objects.
Therefore, I can start up ArangoDB as follows;
## Install
brew install arangodb
## Run
brew services start arangodb
## will now be accessible in a browser at: http://127.0.0.1:8529 . Default username is root with no password set (leave blank)
And log into the UI to create a:
- Database =
arango_stix_demo
- Vertex Collection =
stix_objects
- Edge Collection =
stix_relationships
Now those are defined I can start populating the database with some STIX 2.1 data.
Storing STIX 2.1 Objects in ArangoDB
In this example I will create two SDOs and link them together with an SRO (taken from this OASIS STIX example bundle)
ArangoDB provides its own query language named Arango Query Language (AQL) which will allow me to do this:
AQL is mainly a declarative language, meaning that a query expresses what result should be achieved but not how it should be achieved. AQL aims to be human-readable and therefore uses keywords from the English language.[…] Further design goals of AQL were the support of complex query patterns and the different data models ArangoDB offers.
The syntax of AQL queries is different to SQL, even if some keywords overlap. Nevertheless, AQL should be easy to understand for anyone with an SQL background.
In the queries section of the ArangoDB UI I will start by creating two SDOs, one Indicator (indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f
) and one Malware (malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b
);
LET objects = [
{
"_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"type": "indicator",
"id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"created": "2016-04-06T20:03:48.000Z",
"modified": "2016-04-06T20:03:48.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"labels": [
"malicious-activity"
],
"name": "Poison Ivy Malware",
"description": "This file is part of Poison Ivy",
"pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
"valid_from": "2016-01-01T00:00:00Z"
},
{
"_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"type": "malware",
"id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"created": "2016-04-06T20:07:09.000Z",
"modified": "2016-04-06T20:07:09.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"name": "Poison Ivy"
}
]
FOR object IN objects
INSERT object INTO stix_objects
In the query I use the STIX id
attribute as the document primary key (_key
) so I can easily retrieve the documents later.
Now I need to store the SRO relationship between these Objects in the Edge Collection stix_relationships
. Three attributes must be set in the AQL query for this: _key
, _from
and _to
.
For the _key
I am going to use STIX SRO ID in the same way I did for the SDOs. For the _from
and _to
field I will use the STIX SRO properties source_ref
and target_ref
respectively. The format to use for the _from
and _to
fields is as follows: <collection>/<_key>
. Where <collection>
is the Collection is the Document Collection name (stix_objects
) and the <_key>
is the _key
SDO ID (I set _key
as the STIX Object id
, so the _key
value is equal to the STIX Object id
)
To demonstrate, the AQL request for this example is as follow:
INSERT {
"_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad+2016-04-06T20:06:37.000Z",
"_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"type": "relationship",
"id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
"created": "2016-04-06T20:06:37.000Z",
"modified": "2016-04-06T20:06:37.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81"
"relationship_type": "indicates",
"source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
} IN stix_relationships
As I talked about here
STIX objects also have embedded relationships, e.g. created_by_ref
, sighting_of_ref
, observed_data_refs
, object_refs
, etc.
An Object might have more than one embedded relationship property (identified where the property name ends with _ref
or _refs
. Embedded relationships can also contains lists of relationships. Here is an example of a Report SDO with three embedded relationships under the objects_refs
property:
{
"type": "report",
"spec_version": "2.1",
"id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"created": "2015-12-21T19:59:11.000Z",
"modified": "2015-12-21T19:59:11.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"name": "The Black Vine Cyberespionage Group",
"description": "A simple report with an indicator and campaign",
"published": "2015-12-21T19:59:11.000Z",
"report_types": [
"campaign"
],
"object_refs": [
"indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
"campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
"relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a"
]
}
These embedded relationships need to be represented as edges too, but in a slightly different way to native STIX Objects. In the previous examples I added pure STIX Objects into ArangoDB (as they would be received), however, embedded relationships require a custom object (non STIX Object) to define the relationship that in turn needs to be parsed out of a STIX Object.
In our implementation, embedded relationship edges have the _key
containing both the STIX Object IDs for the embedded relationship, joined with a +
. The edges also contain two unique properties; 1) type
which is matches the _ref
property, and 2) relationship_description
property which is equal to the property in the STIX Object.
Let me demonstrate for clarity. First I would create the Report SDO as a Document;
INSERT {
"_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"type": "report",
"spec_version": "2.1",
"id": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"created": "2015-12-21T19:59:11.000Z",
"modified": "2015-12-21T19:59:11.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"name": "The Black Vine Cyberespionage Group",
"description": "A simple report with an indicator and campaign",
"published": "2015-12-21T19:59:11.000Z",
"report_types": [
"campaign"
],
"object_refs": [
"indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
"campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
"relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a"
]
} IN stix_objects
This Report SDO has four embedded relationships;
created_by_ref
=identity--c2aceda2-0e46-431d-be40-7b4a4e797f81
object_refs
=indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2
object_refs
=campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c
object_refs
=relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a
I will assume these SDOs already exist in the stix_objects
document collection.
As such, all that that is required is to create the four embedded relationship edges in the stix_relationships
Edge Collection.
LET embedded_relationships = [
{
"_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
"_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"_to": "stix_objects/indicator--26ffb872-1dd9-446e-b6f5-d58527e5b5d2",
"relationship_type": "created_by_ref",
},
{
"_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+identity--a463ffb3-1bd9-4d94-b02d-74e4f1658283",
"_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"_to": "stix_objects/identity--a463ffb3-1bd9-4d94-b02d-74e4f1658283",
"relationship_type": "object_refs",
},
{
"_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
"_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"_to": "stix_objects/campaign--83422c77-904c-4dc1-aff5-5c38f3a2c55c",
"relationship_type": "object_refs",
},
{
"_key": "report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3+relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
"_from": "stix_objects/report--84e4d88f-44ea-4bcd-bbf3-b2c1c320bcb3",
"_to": "stix_objects/relationship--f82356ae-fe6c-437c-9c24-6b64314ae68a",
"relationship_type": "object_refs",
}
]
FOR embedded_relationship IN embedded_relationships
INSERT embedded_relationship INTO stix_relationships
Now I can filter the results by relationship_type
. For example, I could filter the documents in the stix_relationships
collection using the query relationship_type==created_by_ref
.
I talked a bit about versioning here
This can be achieved done using updates to documents in ArangoDB. Lets start by creating a new object;
INSERT {
"_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
"type": "report",
"id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
"created": "2021-01-01T00:00:00.000Z",
"modified": "2021-01-01T00:00:00.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"name": "Demoing version",
"published": "2021-01-01T00:00:00.000Z",
"report_types": ["campaign"]
} IN stix_objects
Then I will update it with the new properties;
UPDATE {
"_key": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
"type": "report",
"id": "report--02ee5fc1-6321-4007-b6b8-c3c5c8d5e1a1",
"created": "2021-01-01T00:00:00.000Z",
"modified": "2022-01-01T00:00:00.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"name": "Demoing version",
"description": "Adding a new field",
"published": "2021-01-01T00:00:00.000Z",
"report_types": ["campaign"]
} IN stix_objects
With each UPDATE to a document, ArangoDB tracks its revision under a _rev
(Document Revision) property.
Querying STIX 2.1 Objects in ArangoDB
Now these STIX Objects and relationships have successfully been written, I can start to explore the database.
The ArangoDB UI contains a graph visualisation tool that is useful for browsing visually (Graphs > Your Graph).
I will need to create a Graph = stix_graph
which contains;
stix_relationships
as the Edge Definition (edge definition define a relation of the graph)stix_objects
for both thefromCollections
(Collections that contain the start vertices of the relation) andtoCollections
(Collections that contain the end vertices of the relation)
Here is what the stix_graph
looks like for the Objects I have added to the database;
As you can see from the first example, the two stix_objects
in the Document Collection are represented as individual nodes on the graph, and the stix_relationships
in the Edge Collection as an edge.
This is about the simplest example of a STIX Object Graph. Nodes (STIX SDO/SCOs) can have multiple edges (SROs) making for much more complex graphs.
In these cases, it is likely you will want to filter the information being returned using AQL queries. AQL is really well documented on the ArangoDB website, but I will cover some basic queries for this.
For example, here is a query using the first example to get the ID’s ( RETURN item.id
) five (LIMIT 5
) most recently created Objects (SORT item.created DESC
) of type “indicator” (FILTER item.type == "indicator"
) in the stix_objects
collection.
FOR item IN stix_objects
FILTER item.type == "indicator"
SORT item.created DESC
LIMIT 5
RETURN item.id
Which returns;
[
"indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f"
]
Before I showed you the out-of-the-box graph view using the UI query builder (that I used to create the Graph stix_graph
) – this created an automatically generated graph traversal query. It is also possible to write these queries to customise the nodes and edges returned in the graph to modelled. For example;
FOR vertex, edge, path IN 1..5
OUTBOUND 'stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f'
GRAPH 'stix_graph'
RETURN path
Here is how this query is formed;
- The
FOR
takes three variables, thevertex
in the traversal, theedge
in the traversal and the currentpath
. The current path contains two members, vertices and edges. Here I am asking to return the paths matching the request that’s why I can find the edges and vertices keys in the result. IN
1..5
specifies the minimal and maximal depth for the traversal. 0 would have been a traversal starting from the original vertex.OUTBOUND
specifies the direction to follow Possible values areOUTBOUND
ORINBOUND
ORANY
. The object I used as the original vertex for the traversal is the source in the relationship object. For this reasonINBOUND
would no return any result in our example.stix_objects/{_key}
defines the vertex where the traversal originates from.GRAPH stix_graph
is the name identifying the named graph to use for the traversal.
Running the query gives us the following json output (which can also be modelled as a graph):
[
{
"edges": [
{
"_key": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
"_id": "stix_relationships/relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
"_from": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"_to": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"_rev": "_ep_8O9C---",
"type": "relationship",
"id": "relationship--44298a74-ba52-4f0c-87a3-1824e67d7fad",
"created": "2016-04-06T20:06:37.000Z",
"modified": "2016-04-06T20:06:37.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"relationship_type": "indicates",
"source_ref": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"target_ref": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b"
}
],
"vertices": [
{
"_key": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"_id": "stix_objects/indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"_rev": "_ep_1Eeq---",
"type": "indicator",
"id": "indicator--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f",
"created": "2016-04-06T20:03:48.000Z",
"modified": "2016-04-06T20:03:48.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"labels": [
"malicious-activity"
],
"name": "Poison Ivy Malware",
"description": "This file is part of Poison Ivy",
"pattern": "[ file:hashes.'SHA-256' = '4bac27393bdd9777ce02453256c5577cd02275510b2227f473d03f533924f877' ]",
"valid_from": "2016-01-01T00:00:00Z"
},
{
"_key": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"_id": "stix_objects/malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"_rev": "_ep_1Eeq--_",
"type": "malware",
"id": "malware--31b940d4-6f7f-459a-80ea-9c1f17b5891b",
"created": "2016-04-06T20:07:09.000Z",
"modified": "2016-04-06T20:07:09.000Z",
"created_by_ref": "identity--c2aceda2-0e46-431d-be40-7b4a4e797f81",
"name": "Poison Ivy"
}
]
}
]
As the graphs grow, these AQL queries prove very efficient at returning STIX 2.1 data with complex relationship structures.