Scraping the “TOR hidden world” is a quite complex topic. First of all you need an exceptional computational power (RAM mostly) for letting multiple runners grab web-pages, extracting new links and re-run the scraping-code against the just extracted links. Plus a queue manager system to manage scrapers conflicts and a database to store scraped data need to be consistent. Second, you need great starting points. In other words you need the
.onion addresses where your scrapers start from. You might decide to begin from common and well-known onion links such as The TOR-hidden-wiki or to start from great reddit threads such this one, but seldom those approaches bring you to what I refer as “interesting links”. For this post “interesting links” means specific links that are rare or not very widespread and mostly focused on cyber-attacks and/or cyber-espionage. Another approach needs be used in order to reach better results. One of the most profitable way to search for “interesting links” is to look for
.onion addresses in temporal and up-to-date spots such as: temporal pasties, IRC chats, slack or telegram groups, and so on and so forth. In there you might find links that bring you to more rare contents and to less spread information.
Today I want to start from here by showing some simple stats about scraped
.onion links in my domestic scraping cluster. From the following graph you might appreciate some statistics of active-and-inactive scraped hidden services. The represented week is actually a great stereotype of what I’ve got in the last whole quarter. What is interesting, at least in my personal point of view, is the percentage of offline (green) onion services versus the percentage of online (yellow) onion services.
This scenario changed dramatically in the past few months. While during Q1 (2019) most of the scraped websites were absolutely up-and-running on Q2 (2019) I see, most of the scraped hidden services, dismissed and/or closed even if they persists in the communication channels (IRC chat, Pasties, Telegram, etc.).
I think there are dual factors that so much affected last quarter in spotting active hidden service. (1) Old content revamping. For example bots pushing “interesting links” back online even after months of inactivity. This activity is not new at all, but during the past quarter has been abused too many time respect to previous quarters. (2) Hidden services are changing address much more fast respect to few months ago. In order to make hard to spot malicious actors, they might decide to keep up-and-running their hidden services only for few hours and then change address/location. Is that way to enumerate hidden-services passing away or is it a simple weird time-frame ? We will see it during the next “Scraping” months, stay tuned !