Marco Ramilli Web Corner

“Collection #I” Data Breach Analysis – Part 1

A few weeks ago I wrote about “How Data Breaches Happen”, where I shared some publicly available “pasties” pointing to apparently (not tested) SQLi-vulnerable websites. One of the most famous data breaches of the past few years is unfolding these days. I am not saying that the two events are linked, but I enjoy thinking that events happen in bursts. Many magazines all around the world wrote about the data breach published by Troy Hunt, with 773 million new records (here). Today I’d like to share a quick partial analysis that I’ve been able to extract from those records (I grabbed the data from publicly available pastie websites). First of all, let me say that the work has been very difficult (at least for me), since it required a huge amount of computational power and very high-speed internet access because of the sheer volume of collected data. To analyze such a large data breach I used a powerful Elasticsearch cloud instance and I wrote a tiny Python script to import very dirty data into a common format. Some records could not be loaded because of their format, charset or other problems, so please consider a relative error of about 4 to 5% in the following analyses.
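The import script itself is not shown here, so the following is only a minimal sketch of what such a normalization step could look like. It assumes each raw line holds an email and a password split by one of a few common separators; the function names (`parse_record`, `load`), the separator set and the output field names are hypothetical choices made for illustration.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed separators between email and password; the real data may use others.
_SEPARATORS = re.compile(r"[:;|\t]")
# Deliberately loose email check: something@something.something
_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def parse_record(raw: bytes) -> Optional[dict]:
    """Turn one raw line into a common format, or return None if unusable."""
    try:
        line = raw.decode("utf-8").strip()
    except UnicodeDecodeError:
        return None  # charset problem: counted as a load error
    # Split only on the first separator so passwords may contain separators.
    parts = _SEPARATORS.split(line, maxsplit=1)
    if len(parts) != 2:
        return None
    email, password = parts[0].strip().lower(), parts[1]
    if not _EMAIL.match(email) or not password:
        return None
    return {"email": email, "domain": email.split("@", 1)[1], "password": password}


def load(lines: Iterable[bytes]) -> Tuple[List[dict], int]:
    """Parse all lines and return (clean records, number of rejected lines)."""
    records, errors = [], 0
    for raw in lines:
        record = parse_record(raw)
        if record is None:
            errors += 1
        else:
            records.append(record)
    return records, errors
```

Dividing the rejected count by the total number of lines gives the kind of relative error mentioned above.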

PARTIAL Analysis on Collection#1

One of the first questions I wanted to answer was: “What are the most used passwords?” I am aware that many researchers have written about the most used passwords, but now I have the opportunity to measure them directly on real leaked passwords. So let’s see what the most used passwords out there are!

PARTIAL Analysis on used passwords

So far the most used passwords are: “123456”, “q1w2e3r4t5y6”, “123456789”, “1qaz2wsx3edc”, followed by common passwords like “12345678” and “qwerty”. By observing the graph and comparing it to common research on frequently used passwords such as here, here, and here, we can appreciate a significant difference: the pattern complexity! In fact, while years ago the most used passwords were names, dates or simple patterns such as “qwerty”, today we observe a significant increase in pattern complexity, although the passwords are still too easy to brute-force.
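The ranking above was produced on Elasticsearch; as a rough local equivalent, a frequency count over already-parsed records could be sketched as follows. The record layout (a dict with a "password" key) and the names `top_passwords` and `classify` are assumptions made for this example, and the three character classes are only a crude proxy for pattern complexity.

```python
from collections import Counter


def top_passwords(records, n=10):
    """Return the n most frequent passwords as (password, count) pairs."""
    counts = Counter(record["password"] for record in records)
    return counts.most_common(n)


def classify(password):
    """Very coarse complexity label based on the character classes used."""
    if password.isdigit():
        return "digits"
    if password.isalpha():
        return "letters"
    return "mixed"
```

With this labeling, “123456” falls under digits, “qwerty” under letters, and a keyboard walk such as “q1w2e3r4t5y6” under mixed.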

A second question came from looking at the leaked emails: “What are the domain names of the most leaked emails?” Those domains are not the most vulnerable domains but rather the most used ones. So I’m not saying that those domains are or have been vulnerable or pwned; I am trying to find which email providers are the most leaked. In other words, if you receive an email from “@gmail.com”, what is the probability that the account has been leaked and potentially compromised? Again, I cannot answer such a question since I do not know the total number of “@gmail.com” accounts around the world, but I think it is a useful indicator to find out which email domain names are the most leaked.

PARTIAL Analysis on most leaked domain

The most leaked emails come from “yahoo.com”, “gmail.com”, “aol.com” and “hotmail.com”. This is quite interesting, since we are mostly facing personal email providers (domains) rather than professional email providers (such as company.com). So apparently attackers are mostly focused on targeting people rather than companies (maybe by attacking non-professional websites and/or distributing malware to people rather than to company domain names). Another interesting figure is the number of unique leaked email domain names: 4426, so far!
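Both figures in this paragraph, the top providers and the number of unique domains, come from the same grouping by domain. A minimal sketch, assuming a plain list of email strings as input and using the hypothetical helper names `domain_of` and `domain_stats`:

```python
from collections import Counter


def domain_of(email):
    """Return the lowercased domain part of an email, or None if malformed."""
    local, sep, domain = email.strip().lower().rpartition("@")
    return domain if sep and local and domain else None


def domain_stats(emails, n=4):
    """Return (top n domains with counts, number of unique domains)."""
    counts = Counter(d for d in map(domain_of, emails) if d)
    return counts.most_common(n), len(counts)
```

Note that this only ranks domains by how often they appear in the leak; without the total number of accounts per provider it says nothing about how likely a given account is to be compromised.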

Finally, it would be great to know which sources the data is coming from! At this point I have no evidence for what I am going to write, but I made some deductions from the structure of the leaked data. The following image shows the collection-1 structure.

PARTIAL Analysis Collection#1 Structure

Each folder holds .TXT files whose names look like domain names. Some of those really are domain names (tested), some others are on sale right now, and many others just seem to look like a domain, but I have no evidence about them. Anyway, I decided to assume that the file names that look like domain names are the domains from which the attacker leaked the information. With that in mind, we might deduce where the attacker extracted the data (usernames and passwords) and make a personal evaluation of the leaked information.
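The deduction above depends on deciding which file names look like domain names. One possible way to make that check mechanical is sketched below; the regular expression and the name `looks_like_domain` are illustrative assumptions, and a match only means the name is syntactically domain-like, not that the domain exists or was actually breached.

```python
import re
from pathlib import PurePosixPath

# Labels of letters, digits and hyphens separated by dots, ending in an
# alphabetic top-level label. This is a syntactic check only.
_DOMAIN = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


def looks_like_domain(filename):
    """Return the candidate domain for a file such as 'example.com.txt', else None."""
    stem = PurePosixPath(filename).stem.lower()  # drop the final extension
    return stem if _DOMAIN.match(stem) else None
```

For example, a file named `forum.example.org.txt` would yield the candidate `forum.example.org`, while a file named `combo list.txt` would be skipped.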

Conclusions

I downloaded a copy of the collection-1, collection-2, collection-3, collection-4 and collection-5 data leaks from 5 public pasties linking to MegaUpload (after a few hours MegaUpload removed those links). According to many of you it is one of the biggest leaked compilations in history. I used a complete ELK stack and a local Python script to process this huge amount of data over a few sleepless days in order to answer some questions I had. I found that the folders named “NEW combo semi private” and “MAIL ACCESS combos” contain some unknown data. I checked them against the “Have I Been Pwned” API to see whether each requested email was marked exclusively as “Collection#1”.
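The exclusivity check mentioned above can be sketched as a small pure function. It assumes the list of breach names for an email has already been obtained from Have I Been Pwned by some other means; the function name `exclusive_to` and the identifier "Collection1" used as the default are assumptions made for illustration, not confirmed details of the service.

```python
def exclusive_to(breach_names, target="Collection1"):
    """True when the only breach listed for an account is `target`.

    `breach_names` is an iterable of breach name strings for one email;
    the default `target` value is an assumed identifier.
    """
    names = {name.strip().lower() for name in breach_names}
    return names == {target.lower()}
```

An email that shows up only in the target breach is the interesting case here, because it was not already known from earlier leaks.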

As next steps, my colleagues (Yoroi) and I will drill down, focusing on nation and company data, and release public information and statistics about our findings.

If you are wondering whether you are in those collections, well, probably yes! But if you need to know for sure, please visit Have I Been Pwned; I will not check it for you. Have I Been Pwned is a great project born to answer exactly that question. Moreover, in order to avoid spreading the collection, I will not give it away.