Dustin Demuth edited this page Feb 22, 2017 · 11 revisions

DRAFT

X-ARF

Aim: Define and document

  • A transformation of IntelMQ events into X-ARF format (emails).
  • (later) a transformation of X-ARF format (emails) into IntelMQ events.

X-ARF is an email-based format. The core unit is a single report in one of the available X-ARF schemas.

IntelMQ's events have harmonised internal values.

Thus our transformation only has to be defined for a single X-ARF report. All other X-ARF variants can be derived from it.

Data

As a first example we consider Shadowserver-Botnet-drone data. (Attention: the description of the format on the Shadowserver site itself is sometimes outdated compared to the data that is actually sent.)

# EXAMPLE DATA -- IPs and ASNs were pseudonymized

"timestamp","ip","port","asn","geo","region","city","hostname","type","infection","url","agent","cc_ip","cc_port","cc_asn","cc_geo","cc_dns","count","proxy","application","p0f_genre","p0f_detail","machine_name","id","naics","sic","cc_naics","cc_sic","sector","cc_sector","ssl_cipher","family","tag","public_source"
"2016-07-24 00:00:01","198.51.100.4",,31334,"DE","BREMEN","BREMEN","198-51-100-4.example.net",,"bitdefender-ramnit",,,"198.51.100.182",,8075,"US",,,,,,,,,0,0,334111,357101,,"Communications",,,,
"2016-07-24 00:00:01","198.51.100.176",7960,3320,"DE","NORDRHEIN-WESTFALEN","BONN","198-51-100-176.example.net","udp","zeroaccess",,,"198.51.100.221",16471,22773,"US",,,,,,,,,0,0,517510,737415,,"Commercial Facilities",,,,

The current configuration of IntelMQ (as of 2017-02-08) will parse the above data into IntelMQ "events" like

# RESULT IN INTELMQ

# Dataset 1
{
	"classification.identifier": "botnet",
	"classification.taxonomy": "Malicious Code",
	"classification.type": "botnet drone",
	"destination.asn": 8075,
	"destination.geolocation.cc": "US",
	"destination.ip": "198.51.100.182",
	"extra": "{\"cc_naics\": \"334111\", \"cc_sector\": \"Communications\", \"cc_sic\": \"357101\"}",
	"feed.accuracy": 100.0,
	"feed.name": "Botnet-Drone-Hadoop",
	"feed.url": "file://localhost/tmp/sserver.csv",
	"malware.name": "bitdefender-ramnit",
	"raw": "THIS IS A VERY LONG BASE64 VALUE CONTAINING THE ORIGINAL CSV-ROW",
	"source.asn": 31334,
	"source.geolocation.cc": "DE",
	"source.geolocation.city": "BREMEN",
	"source.geolocation.region": "BREMEN",
	"source.ip": "198.51.100.4",
	"source.reverse_dns": "198-51-100-4.example.net",
	"time.observation": "2017-02-07T08:14:05+00:00",
	"time.source": "2016-07-24T00:00:01+00:00"
}

# Dataset 2
{
	"classification.identifier": "botnet",
	"classification.taxonomy": "Malicious Code",
	"classification.type": "botnet drone",
	"destination.asn": 22773,
	"destination.geolocation.cc": "US",
	"destination.ip": "198.51.100.221",
	"destination.port": 16471,
	"extra": "{\"cc_naics\": \"517510\", \"cc_sector\": \"Commercial Facilities\", \"cc_sic\": \"737415\"}",
	"feed.accuracy": 100.0,
	"feed.name": "Botnet-Drone-Hadoop",
	"feed.url": "file://localhost/tmp/sserver.csv",
	"malware.name": "zeroaccess",
	"protocol.transport": "udp",
	"raw": "THIS IS A VERY LONG BASE64 VALUE CONTAINING THE ORIGINAL CSV-ROW",
	"source.asn": 3320,
	"source.geolocation.cc": "DE",
	"source.geolocation.city": "BONN",
	"source.geolocation.region": "NORDRHEIN-WESTFALEN",
	"source.ip": "198.51.100.176",
	"source.port": 7960,
	"source.reverse_dns": "198-51-100-176.example.net",
	"time.observation": "2017-02-07T08:14:05+00:00",
	"time.source": "2016-07-24T00:00:01+00:00"
}

As we can see, this data reports malicious-code activity. We assume it can be mapped to the X-ARF Report-Type malware-attack.

Mapping

Format independent Mapping

Known stable X-ARF schemas share a set of fields. We assume those fields to be the same across all X-ARF schemas and suggest that the next iteration of the X-ARF specification clearly state this common subset of fields.

Reported-From           reports@example.com
Report-ID               UUID@example.com
Date                    time.source # This is the value of IntelMQ's time.source field; conversion to RFC 3339 is not necessary
TLP                     none # This field cannot be determined yet. The integration of TLP into IntelMQ is under discussion: https://github.com/certtools/intelmq/issues/252
User-Agent              IntelMQ-Mailgen # The User-Agent of the software generating the X-ARF report
Attachment              none # If no attachment exists, this must be none
Version                 0.2 # Most likely always 0.2; Version is optional
Occurences              none # This field cannot be determined; it is optional
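
The common fields above could be produced by a small helper like the one below. This is a minimal sketch, not the actual Mailgen implementation: the function name `common_xarf_fields`, its parameters, and the use of a random UUID for Report-ID are assumptions made for illustration. The fields that cannot be determined (TLP, Occurences) and the optional Version are left out.

```python
import uuid


def common_xarf_fields(event, reported_from="reports@example.com",
                       domain="example.com"):
    """Return the schema-independent X-ARF fields for one IntelMQ event.

    `event` is assumed to be a plain dict keyed by IntelMQ field names.
    """
    return {
        "Reported-From": reported_from,
        # Assumption: a random UUID plus the reporter's domain forms the ID.
        "Report-ID": "{}@{}".format(uuid.uuid4(), domain),
        # time.source is passed through unchanged, as stated in the mapping.
        "Date": event["time.source"],
        "User-Agent": "IntelMQ-Mailgen",
        # "none" is required when the report has no attachment.
        "Attachment": "none",
    }


event = {"time.source": "2016-07-24T00:00:01+00:00"}
fields = common_xarf_fields(event)
print(fields["Date"])
```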

Format specific Map

In addition to the common fields above, the malware-attack format contains the following fields:

Category:                       abuse # This is a constant field, no Mapping to an IntelMQ Field is necessary
Report-Type:                    malware-attack # This is a constant field, no Mapping to an IntelMQ Field is necessary
Schema-URL:                     http://x-arf.org/schema/abuse_malware-attack_0.1.4.json
Source:                         source.ip # This is the value of the IntelMQ-Field source.ip
Source-Type:                    calculated_field # This field needs to be set to ipv4 or ipv6 depending on source.ip
Destination-System:             none # Cannot be determined
Download-Link:                  none # Cannot be determined
Download-Port:                  none # Cannot be determined
Malware-MD5:                    malware.hash.md5 # This is the value of the IntelMQ-Field malware.hash.md5, it does not exist in shadowserver data
Antivirus-Result:               none # Cannot be determined
Antivirus-Vendor:               none # Cannot be determined
Feedback-Link:                  none # Cannot be determined
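
The only calculated field in this map is Source-Type. The sketch below shows one way to derive it with Python's standard `ipaddress` module and to fill in the remaining malware-attack fields. The function names `source_type` and `malware_attack_fields` are hypothetical, and the event is assumed to be a plain dict keyed by IntelMQ field names.

```python
import ipaddress

SCHEMA_URL = "http://x-arf.org/schema/abuse_malware-attack_0.1.4.json"


def source_type(ip):
    """Return "ipv4" or "ipv6" depending on the address family of `ip`."""
    return "ipv4" if ipaddress.ip_address(ip).version == 4 else "ipv6"


def malware_attack_fields(event):
    """Return the malware-attack specific X-ARF fields for one event."""
    fields = {
        "Category": "abuse",               # constant
        "Report-Type": "malware-attack",   # constant
        "Schema-URL": SCHEMA_URL,
        "Source": event["source.ip"],
        "Source-Type": source_type(event["source.ip"]),
    }
    # malware.hash.md5 does not exist in the Shadowserver data,
    # so the field is only emitted when the event carries it.
    if "malware.hash.md5" in event:
        fields["Malware-MD5"] = event["malware.hash.md5"]
    return fields


print(malware_attack_fields({"source.ip": "198.51.100.4"})["Source-Type"])
```

Fields marked "Cannot be determined" are simply not emitted in this sketch.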

Result

When the aforementioned data is mapped according to these two maps, the following two datasets result (without X-ARF specific headers and MIME encoding / boundaries):

Dataset 1:

Schema-URL: http://x-arf.org/schema/abuse_malware-attack_0.1.4.json
Category: abuse
Report-Type: malware-attack
Reported-From: mail@example.com
Report-ID: TicketNumber#4711@example.com
User-Agent: IntelMQ-Mailgen
Date: 2016-07-24T00:00:01+00:00
Source: 198.51.100.4
Source-Type: ipv4
Attachment: none

Dataset 2:

Schema-URL: http://x-arf.org/schema/abuse_malware-attack_0.1.4.json
Category: abuse
Report-Type: malware-attack
Reported-From: mail@example.com
Report-ID: TicketNumber#0815@example.com
User-Agent: IntelMQ-Mailgen
Date: 2016-07-24T00:00:01+00:00
Source: 198.51.100.176
Source-Type: ipv4
Attachment: none
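
For illustration, the listings above can be rendered from a dict of mapped fields as plain "Name: value" lines. This is a sketch only: `FIELD_ORDER` and `render_report` are hypothetical names, the field order simply follows the listings above, and MIME packaging is deliberately left out.

```python
# Order in which fields appear in the listings above.
FIELD_ORDER = [
    "Schema-URL", "Category", "Report-Type", "Reported-From", "Report-ID",
    "User-Agent", "Date", "Source", "Source-Type", "Attachment",
]


def render_report(fields):
    """Render mapped X-ARF fields as "Name: value" lines.

    Fields missing from `fields` are skipped; fields not listed in
    FIELD_ORDER are ignored in this sketch.
    """
    return "\n".join(
        "{}: {}".format(name, fields[name])
        for name in FIELD_ORDER
        if name in fields
    )


dataset_1 = {
    "Schema-URL": "http://x-arf.org/schema/abuse_malware-attack_0.1.4.json",
    "Category": "abuse",
    "Report-Type": "malware-attack",
    "Reported-From": "mail@example.com",
    "Report-ID": "TicketNumber#4711@example.com",
    "User-Agent": "IntelMQ-Mailgen",
    "Date": "2016-07-24T00:00:01+00:00",
    "Source": "198.51.100.4",
    "Source-Type": "ipv4",
    "Attachment": "none",
}
report_text = render_report(dataset_1)
print(report_text)
```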

We can see that some information is lost in the transformation from an IntelMQ event to X-ARF as defined in abuse_malware-attack_0.1.4. Of particular interest are:

  • classification.type or other classification details not covered in Category and Report-Type.
  • malware.name
  • destination.ip
  • destination.port
  • protocol.transport
  • source.port
  • source.reverse_dns
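
Which event fields a mapping drops can be checked mechanically with a set difference. The sketch below assumes the malware-attack mapping described above reads only `source.ip`, `time.source` and `malware.hash.md5`; the names `MAPPED_EVENT_FIELDS` and `unmapped_fields` are hypothetical.

```python
# IntelMQ fields read by the malware-attack mapping described above.
MAPPED_EVENT_FIELDS = {"source.ip", "time.source", "malware.hash.md5"}


def unmapped_fields(event):
    """Return the sorted names of event fields the mapping does not carry."""
    return sorted(set(event) - MAPPED_EVENT_FIELDS)


# Excerpt of Dataset 2 from above.
event = {
    "source.ip": "198.51.100.176",
    "source.port": 7960,
    "destination.ip": "198.51.100.221",
    "destination.port": 16471,
    "malware.name": "zeroaccess",
    "protocol.transport": "udp",
    "time.source": "2016-07-24T00:00:01+00:00",
}
lost = unmapped_fields(event)
print(lost)
```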

Alternatively, with abuse_bot-infection_0.1.0.json, we would still be missing:

  • classification.type or other classification details not covered in Category and Report-Type.
  • destination.port
  • protocol.transport
  • source.reverse_dns

Some values in the original report offer additional information, but can also be derived from others:

  • destination.ip and time.observation determine
      • destination.asn
      • destination.geolocation.cc
  • source.ip and time.observation determine
      • source.asn
      • source.geolocation.cc
      • source.geolocation.city
      • source.geolocation.region

Some values are internal to IntelMQ and others would usually be left out of a report to be sent externally:

  • time.observation is the datetime when the report data entered IntelMQ.
  • extra potentially contains internal information
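
A report generator could filter an event along the lines of the two lists above before mapping it. The following is a minimal sketch under the assumption that an event is a plain dict; the names `INTERNAL_FIELDS`, `DERIVABLE_FIELDS` and `external_view` are hypothetical, and whether derivable fields should be dropped is left as an option.

```python
# Fields named above as internal or unsuitable for external reports.
INTERNAL_FIELDS = {"time.observation", "extra"}

# Fields named above as derivable from an IP address and a timestamp.
DERIVABLE_FIELDS = {
    "destination.asn", "destination.geolocation.cc",
    "source.asn", "source.geolocation.cc",
    "source.geolocation.city", "source.geolocation.region",
}


def external_view(event, drop_derivable=False):
    """Return a copy of `event` without fields that should stay internal."""
    excluded = set(INTERNAL_FIELDS)
    if drop_derivable:
        excluded |= DERIVABLE_FIELDS
    return {key: value for key, value in event.items() if key not in excluded}


event = {
    "source.ip": "198.51.100.4",
    "source.asn": 31334,
    "time.observation": "2017-02-07T08:14:05+00:00",
    "time.source": "2016-07-24T00:00:01+00:00",
    "extra": "{\"cc_sector\": \"Communications\"}",
}
print(sorted(external_view(event)))
print(sorted(external_view(event, drop_derivable=True)))
```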

New schema for bot-infection

As some fields are missing, we've created an updated schema for X-ARF bot-infections. See: https://github.com/Intevation/xarf-schemata/blob/master/abuse_bot-infection_0.2.0_unstable.json

The changes relative to 0.1.0 are documented in https://github.com/Intevation/xarf-schemata/blob/master/abuse_bot-infection_0.2.0_unstable.json.README.md

To use this schema, one can use this URL: https://raw.githubusercontent.com/Intevation/xarf-schemata/master/abuse_bot-infection_0.2.0_unstable.json

Other X-ARF schemas

Can we find or define other schemas that provide a better fit?

Considerations on X-ARF BULK, based on example data

We've had the chance to see some real-world X-ARF messages using the abuse_bot-infection_0.1.0.json schema and the BULK format. The BULK format seems to carry a large amount of duplicated data, such as the email texts.

The X-ARF message within the BULK message carries some additional payload, such as the destination port, in the data attached to the X-ARF message. Our proposed schema carries this value as a proper X-ARF field, Destination-Port. Other data, such as the destination IP, which the 0.1.0 schema supports (as destination), is left out of the X-ARF message itself but is likewise included in the attachment.