Malware Training Sets: A machine learning dataset for everyone

One of the most challenging tasks during Machine Learning processing is to define a great training (and possible dynamic) dataset. Assuming a well known learning algorithm and a periodic learning supervised process what you need is a classified dataset to best train your machine. Thousands of training datasets are available out there from "flowers" to "dices" passing through "genetics", but I was not able to find a great classified dataset for malware analyses. So, I decided to do it by myself and to share the dataset with the scientific community (and everybody interested on it) in order to give to everyone a base point to start with Machine Learning for Malware Analysis. The first challenge I faced was to define features and how to extract them.  Basically I had two choices:
  1. Extracting features directly from samples. This is the easiest solution since the possible extracted features would be directly related to the sample such as (but not limited to): file "sections", "entropy", "Syscalls" and decompiled assembly n-grams.
  2. Extracting features on samples analysis. This is the hardest solution since it would include both static analysis such as (but not limited to): file sections, entropy, "Syscall/API" and dynamic analysis such as (but not limited to): "Contacted IP", "DNS Queries", "execution processes", "AV signatures" , etc. etc. Plus I needed a complex system of dynamic analysis including multiple sandboxes and static analysers.   
I decided to follow the hardest path by extracting features from both: static analysis and dynamic analysis of samples detonation in order to collect as much features as I can letting to the data scientist the freedom to decide what feature to use and what feature to drop in his data mining process. The analyses where performed through the sample detonation in several SandBoxes (free and commercial ones) which defined a first stage of ontologically homogeneous blocks called "Analyses Results" (AR). AR are too much verbose and they are not  performing well in any text algorithm of my knowledge.

After more readings on the topic I came up with Malware Instruction Set for Behaviour Analysis ( MIST) described in Philipp Trinius et Al. (document available here).  MIST is basically a result based  optimised representation for effective and efficient analysis of behaviour using data mining and machine learning techniques. It can be obtained automatically during analysis of malware with a behaviour monitoring tool or by converting existing behaviour reports. The representation is not restricted to a particular monitoring tool and thus can also be used as a meta language to unify behaviour reports of different sources. The following image shows the MIST encoding structure. 



A simple example coming directly from the aforementioned paper is showed in the following image where "load.dll" has been detected. The ‘load dll’ system call is executed by every software during process initialisation and run-time several times, since under Windows, dynamic-link libraries (DLLs) are used to implement the Windows subsystem and offer an interface to the operating system. Following how the load.dll has been encoded into MIST meta language.

I decided to use the same concept of "meta language" but with auto-descriptive logic (without encoding the category operation since it would not afflict the analyses) and every information organised into a well formed JSON File rather then into a line based text file in order to be used in external environments with zero effort.  The produced datasets looks like following:

DataSet Snippest (click to enlarge)
Each JSON Property could be used as an algorithmic feature of your desired Machine Learning algorithm, but the most significative ones would be the "properties" ones (the one labelled properties). Each property, by meaning of each field placed under the "properties" section of the produced JSON file, is optional and is structured as follows:

category_action_with_description |  "sanitized" involved subjects with spaces

So for example:

"sig_copies_self": "e5ed769a e5ed769a 98e83379"

It means the category is sig (stands for signature) and the action is "copies itself".  e5ed769a e5ed769a 98e83379 are 3 sanitize evidences of where the samples copies itself (see the Sanitization Procedure) 

 "sig_antimalware_metascan": ""

It means the category is sig (stands for signature) and the action is "antimalware_metascan". The evidences are empty by meaning no signature found from metascan (in such a case).

"sig_antivirus_virustotal": "ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b"

It means the signature virus_total found 5 evidences (ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b).

A fundamental property is the "label" property which classifies the malware family. I decided to name this field "label" rather than: "malware_name", "malware_family" or "classification" in order to let the compatibility with many implemented machine learning algorithms which use the field "label" to properly work (it seems to be a defacto standards for many engine implementations).

Sanitization Procedures

Aim of the project is to provide an useful and classified dataset to researchers who want to investigate deeper in malware analysis by using Machine Learning techniques. It is essential to give a speed up in performances on text mining and for such a reason I decided to use some well known sanitization techniques in order to "hash" the evidences letting unchanged the meaning but drastically improving the speed for an algorithm point of view. The following picture shows the sanitization procedures:


Sanitization Procedures (click to enlarge)

From a developer prospective the cited (and showed) procedures are not well written; for example are not protected and ".replace" could be not safe within specific inputs. For such a reason I will not release such a code. But please keep in mind that the result of my project is not the "sanitization code" but the outcome of it: the classified malware analyses datased, so I focused my attention on features extraction, samples collection,  aggregation, conversion, and of course analyses, not really in developing production code.

Training DataSets Generation: The Simplified Process

The whole process to obtain the training datasets is described in the following flowchart. The detonation of a classified Malware into multiple sandboxes produces multiple static and dynamic analyses colliding into an analyses results artefact (AR).  AR would be translated into a MIST elaborated meta language to be software agnostic and to give freedom to data scientists.



Data Samples

Today (please refers to blog post date) the collected classified datasets is composed by the following samples:
  • APT1 292 Samples
  • Crypto 2024 Samples
  • Locker 434 Samples
  • Zeus 2014 Samples
If you own classified Malware samples and you want to share it with me in order to contribute at the Machine Learning Training Datasets you are welcome, just drop me an email !
I will definitely process the samples and build new datasets to share to everybody.

Where can I download the training datasets ?  HERE


Available Features and Frequency

The following list enumerates the available features per each sample. The features, as mentioned, are optional by meaning you might have no all the same features for every sample. If the sample you are analysing does not have a specific feature you want consider it as None (or undefined) since that feature was not available for the specified sample. So if you are writing your of machine learning algorithm you should include a "purification procedure" which will ignore None features from training and or query.

List of current available features with occurrences counter. :

   'file_access': 138759,
   'sig_infostealer_ftp': 13114,
   'sig_modifies_hostfile': 5,
   'sig_removes_zoneid_ads': 16,
   'sig_disables_uac': 33,
   'sig_static_versioninfo_anomaly': 0,
   'sig_stealth_webhistory': 417,
   'reg_write': 11942,
   'sig_network_cnc_http': 132,
   'api_resolv': 954690,
   'sig_stealth_network': 71,
   'sig_antivm_generic_bios': 6,
   'sig_polymorphic': 705,
   'sig_antivm_generic_disk': 7,
   'sig_antivm_vpc_keys': 0,
   'sig_antivm_xen_keys': 5,
   'sig_creates_largekey': 16,
   'sig_exec_crash': 6,
   'sig_antisandbox_sboxie_libs': 144,
   'sig_mimics_icon': 2,
   'sig_stealth_hidden_extension': 9,
   'sig_modify_proxy': 384,
   'sig_office_security': 20,
   'sig_bypass_firewall': 29,
   'sig_encrypted_ioc': 476,
   'sig_dropper': 671,
   'reg_delete': 2545,
   'sig_critical_process': 3,
   'service_start': 312,
   'net_dns': 486,
   'sig_ransomware_files': 5,
   'sig_virus': 781,
   'file_write': 20218,
   'sig_antisandbox_suspend': 2,
   'sig_sniffer_winpcap': 16,
   'sig_antisandbox_cuckoocrash': 11,
   'file_delete': 5405,
   'sig_antivm_vmware_devices': 1,
   'sig_ransomware_recyclebin': 0,
   'sig_infostealer_keylog': 44,
   'sig_clamav': 1350,
   'sig_packer_vmprotect': 1,
   'sig_antisandbox_productid': 18,
   'sig_persistence_service': 5,
   'sig_antivm_generic_diskreg': 162,
   'sig_recon_checkip': 4,
   'sig_ransomware_extensions': 4,
   'sig_network_bind': 190,
   'sig_antivirus_virustotal': 175975,
   'sig_recon_beacon': 23,
   'sig_deletes_shadow_copies': 24,
   'sig_browser_security': 216,
   'sig_modifies_desktop_wallpaper': 83,
   'sig_network_torgateway': 1,
   'sig_ransomware_file_modifications': 23,
   'sig_antivm_vbox_files': 7,
   'sig_static_pe_anomaly': 2194,
   'sig_copies_self': 591,
   'sig_antianalysis_detectfile': 51,
   'sig_antidbg_devices': 6,
   'file_drop': 6627,
   'sig_driver_load': 72,
   'sig_antimalware_metascan': 1045,
   'sig_modifies_certs': 46,
   'sig_antivm_vpc_files': 0,
   'sig_stealth_file': 1566,
   'sig_mimics_agent': 131,
   'sig_disables_windows_defender': 3,
   'sig_ransomware_message': 10,
   'sig_network_http': 216,
   'sig_injection_runpe': 474,
   'sig_antidbg_windows': 455,
   'sig_antisandbox_sleep': 271,
   'sig_stealth_hiddenreg': 13,
   'sig_disables_browser_warn': 20,
   'sig_antivm_vmware_files': 6,
   'sig_infostealer_mail': 617,
   'sig_ipc_namedpipe': 13,
   'sig_persistence_autorun': 2355,
   'sig_stealth_hide_notifications': 19,
   'service_create': 62,
   'sig_reads_self': 14460,
   'mutex_access': 15017,
   'sig_antiav_detectreg': 4,
   'sig_antivm_vbox_libs': 0,
   'sig_antisandbox_sunbelt_libs': 2,
   'sig_antiav_detectfile': 2,
   'reg_access': 774910,
   'sig_stealth_timeout': 1024,
   'sig_antivm_vbox_keys': 0,
   'sig_persistence_ads': 3,
   'sig_mimics_filetime': 3459,
   'sig_banker_zeus_url': 1,
   'sig_origin_langid': 71,
   'sig_antiemu_wine_reg': 1,
   'sig_process_needed': 137,
   'sig_antisandbox_restart': 24,
   'sig_recon_programs': 5318,
   'str': 1443775,
   'sig_antisandbox_unhook': 1364,
   'sig_antiav_servicestop': 78,
   'sig_injection_createremotethread': 311,
   'pe_imports': 301256,
   'sig_process_interest': 295,
   'sig_bootkit': 25,
   'reg_read': 458477,
   'sig_stealth_window': 1267,
   'sig_downloader_cabby': 50,
   'sig_multiple_useragents': 101,
   'pe_sec_character': 22180,
   'sig_disables_windowsupdate': 0,
   'sig_antivm_generic_system': 6,
   'cmd_exec': 2842,
   'net_con': 406,
   'sig_bcdedit_command': 14,
   'pe_sec_entropy': 22180,
   'pe_sec_name': 22180,
   'sig_creates_nullvalue': 1,
   'sig_packer_entropy': 3603,
   'sig_packer_upx': 1210,
   'sig_disables_system_restore': 6,
   'sig_ransomware_radamant': 0,
   'sig_infostealer_browser': 7,
   'sig_injection_rwx': 3613,
   'sig_deletes_self': 600,
    'file_read': 50632,
   'sig_fraudguard_threat_intel_api': 226,
   'sig_deepfreeze_mutex': 1,
   'sig_modify_uac_prompt': 1,
   'sig_api_spamming': 251,
   'sig_modify_security_center_warnings': 18,
   'sig_antivm_generic_disk_setupapi': 25,
   'sig_pony_behavior': 159,
   'sig_banker_zeus_mutex': 442,
   'net_http': 223,
   'sig_dridex_behavior': 0,
   'sig_internet_dropper': 3,
   'sig_cryptAM': 0,
   'sig_recon_fingerprint': 305,
   'sig_antivm_vmware_keys': 0,
   'sig_infostealer_bitcoin': 207,
   'sig_antiemu_wine_func': 0,
   'sig_rat_spynet': 3,
   'sig_origin_resource_langid': 2255


Cite The DataSet

If you find those results useful please cite them :

@misc{ MR,
author = "Marco Ramilli",
title = "Malware Training Sets: a machine learning dataset for everyone",
year = "2016",
url = "http://marcoramilli.blogspot.it/2016/12/malware-training-sets-machine-learning.html",
note = "[Online; December 2016]"
}


Again, if you want to contribute ad you own classified Samples please drop them to me I will empower the dataset.

Enjoy your new researches!