In 2016 I was working hard to find a way to classify Malware families through artificial intelligence (machine learning). One of the first difficulties I ran into was finding a classified test set on which to run new algorithms and test specific features. So I came up with this blog post and this GitHub repository, where I proposed a new test set based on a modified version of the Malware Instruction Set for Behavior-Based Analysis, also referred to as MIST. Since then I have received hundreds of emails from students, researchers and practitioners all around the world asking how to follow up on that research and how to contribute to expanding the training set.

Dataset Generation Process

I am so glad that many international researchers have used my classified Malware dataset as a building block for great analyses and for improving the state of the art in Malware research. Some of them are listed here, but many other papers, articles and studies have been released (just ask Google).

Today I finally had the chance to follow up by adding a scripting section that is useful to: (i) generate the modified version of the MIST files (the ones in the training sets) and (ii) convert the obtained results to ARFF (Attribute-Relation File Format), the format developed at the University of Waikato. The first script, mist_json.py, is a reporting module that can be integrated into a running Cuckoo Sandbox environment. It takes the Cuckoo report and converts it into a modified version of a MIST file. To use it, drop mist_json.py into your running instance of Cuckoo Sandbox V1 (modules/reporting/) and add the corresponding configuration section to conf/reporting.conf. Alternatively, you can force it to run without a configuration entry by editing the source code directly. The result is one MIST file for each sample analysed by Cuckoo. The MIST file contains the generated features as described in the original post here.

The second script, fromMongoToARFF.py, converts your JSON objects into ARFF, which can then be imported into WEKA to test your favorite algorithms.

Now, if you wish, you can generate training sets yourself and test new algorithms directly in WEKA. The creation process follows these steps:

  • Upload the samples into a running Cuckoo Sandbox patched with mist_json.py
  • mist_json.py produces a MIST.json file for each submitted sample
  • Use a simple script to import your desired MIST.json files into MongoDB. For example (in bash, enable globstar first so that ** recurses into subdirectories): for i in **/*.json; do mongoimport --db test --collection test --file "$i"; done
  • Use fromMongoToARFF.py to generate the ARFF file
  • Import the generated ARFF file into WEKA
  • Start your experimental sessions
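
To make the JSON-to-ARFF step above more concrete, here is a minimal sketch of what such a conversion can look like. It is not the code of fromMongoToARFF.py: the function name to_arff, the flat record layout, the feature names and the "label" key are all assumptions made for illustration, and the real MIST.json structure may differ. It also assumes feature names contain no spaces or commas, and it writes a feature that is absent from a record as 0 rather than as the ARFF missing-value marker "?".

```python
def to_arff(relation, records, class_key="label"):
    """Render flat dict records as ARFF text.

    Hypothetical layout: every record maps feature names to numbers
    and carries its Malware family under `class_key`.
    """
    # Collect every feature name seen in any record; sort for stable output.
    features = sorted({k for r in records for k in r if k != class_key})
    # The class attribute is nominal: list every family that appears.
    classes = sorted({str(r[class_key]) for r in records})

    lines = ["@relation " + relation, ""]
    for name in features:
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute " + class_key + " {" + ",".join(classes) + "}")
    lines += ["", "@data"]

    for r in records:
        # A feature absent from this record is written as 0 (assumption:
        # features are counts, so "not observed" means zero).
        row = [str(r.get(name, 0)) for name in features]
        row.append(str(r[class_key]))
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up records standing in for documents read back from MongoDB.
    samples = [
        {"file_access": 3, "reg_write": 1, "label": "family_a"},
        {"file_access": 0, "net_connect": 2, "label": "family_b"},
    ]
    print(to_arff("mist", samples))
```

Saving the returned text to a file with the .arff extension should give something WEKA can open, with the last attribute acting as the class to predict.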

If you want to share your new MIST classified files with the community, please feel free to open pull requests directly on GitHub. Everybody using this set will appreciate it.
