DarkBERT may sound like a character from a sci-fi film, but it’s actually a powerful new tool designed to help combat cybercrime. Developed by a team of South Korean researchers, DarkBERT is a large language model (LLM) trained solely on data collected from the dark web.
This may sound sinister, but the researchers behind DarkBERT believe it could actually be an antidote to the problem of AI-powered cybercrime. According to the team, modern AI technology has made it easier than ever for cybercriminals to commit illegal activities and steal data from organizations. DarkBERT could help us fight back by monitoring sites on the dark web that sell or publish confidential data obtained by ransomware groups.
The LLM is based on the RoBERTa architecture, which was developed by researchers from Facebook AI and the University of Washington in 2019. The key difference is that DarkBERT has been trained specifically on data collected from the dark web via the Tor network.
To train the model, the researchers used Tor to crawl the dark web and collect millions of pages of text. They then pre-processed the data to remove duplicate pages and balance the content categories, before feeding the resulting corpus into RoBERTa’s pre-training pipeline.
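The paper’s actual filtering pipeline is more involved, but the deduplication step can be illustrated with a minimal sketch: hash a normalized copy of each page and keep only the first occurrence. The function name and normalization choices here are assumptions for illustration, not the researchers’ exact method.

```python
import hashlib

def deduplicate_pages(pages):
    """Drop exact-duplicate pages by hashing their normalized text.

    This is a simplified sketch: real corpus cleaning would also handle
    near-duplicates, boilerplate, and non-text content.
    """
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(page.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

# Pages differing only in case/whitespace collapse to one entry.
deduplicate_pages(["Hello World", "hello world ", "Something else"])
# → ['Hello World', 'Something else']
```

Balancing the categories would then be a matter of sampling pages so that no single topic (e.g. one large forum) dominates the training mix.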
The researchers chose RoBERTa as the base model because it drops the Next Sentence Prediction (NSP) objective and trains on masked language modelling alone. That suits dark-web text, which contains far fewer well-formed, sentence-like structures than the surface web.
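The masked-language-modelling objective RoBERTa keeps can be sketched in a few lines: randomly hide tokens and ask the model to predict them, with no sentence-pair task involved. This is an illustrative toy, not DarkBERT’s actual tokenizer or training code; the 15% masking rate matches RoBERTa’s default, while the token strings are made up.

```python
import random

MASK_TOKEN = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with <mask>, as in masked language modelling.

    Returns the masked sequence plus labels: the original token where a
    mask was placed (the model's prediction target), None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)  # no loss computed at this position
    return masked, labels
```

Because the objective operates on single spans of text rather than sentence pairs, it works even on the fragmented listings and dumps typical of dark-web pages.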
DarkBERT’s ability to monitor illicit exchanges based on keywords is particularly useful. It could identify conversations or postings related to cybercrime, helping authorities detect and prevent illegal activity on the dark web.
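At its simplest, the keyword side of such monitoring is a watch-list scan over incoming posts; a real deployment would pair this with DarkBERT’s learned representations rather than plain string matching. The watch-list terms and function name below are hypothetical examples, not from the paper.

```python
# Hypothetical watch-list for illustration; an actual system would be
# driven by model scores, not a hand-written term set.
WATCHLIST = {"ransomware", "leak", "database dump"}

def flag_post(text, watchlist=WATCHLIST):
    """Return the watch-list terms found in a post (case-insensitive)."""
    lower = text.lower()
    return sorted(term for term in watchlist if term in lower)

flag_post("New RANSOMWARE leak posted today")
# → ['leak', 'ransomware']
```

Posts that match could then be escalated for human review or fed to a classifier for confirmation.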
While DarkBERT is still a new and untested technology, it could have enormous potential in the fight against cybercrime. With AI-powered attacks becoming increasingly common, we need all the tools we can get to protect ourselves and our data.