Magika: AI-Powered File Type Identification System Now Open Source
Google has open-sourced Magika, its AI-powered file-type identification system, with the aim of helping others accurately detect binary and textual file types. This move is expected to greatly improve file identification accuracy for various software applications and contribute to the field of cybersecurity.
Accurate file-type detection has always been critical in determining how files should be processed. However, it is a challenging task due to the diverse structures of different file formats, particularly when it comes to textual formats and programming languages. Existing file-type identification tools heavily rely on manually crafted rules and heuristics, which are time-consuming to develop and prone to errors. Moreover, attackers often try to fool these detection systems with malicious payloads.
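To make the limitation of hand-written rules concrete, here is a minimal sketch of a signature-based detector. The signature table, the sample contents, and the function name detect_by_signature are illustrative assumptions and are not taken from any real tool. Binary formats with a fixed leading signature are easy to match, while source code and configuration files have no such marker and fall through.

```python
# Hypothetical rule-based detector: maps leading "magic" bytes to a label.
# Every entry is a hand-maintained rule; this table is illustrative only.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),  # also matches DOCX, JAR and other ZIP containers
    (b"\x7fELF", "elf"),
]


def detect_by_signature(data: bytes) -> str:
    """Return the label of the first signature that prefixes the data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# Made-up sample contents: one binary format, two textual ones.
samples = {
    "report.pdf": b"%PDF-1.7\n...",
    "script.py": b"import os\nprint(os.getcwd())\n",
    "config.ini": b"[server]\nport = 8080\n",
}

for name, content in samples.items():
    print(name, "->", detect_by_signature(content))
# report.pdf -> pdf
# script.py -> unknown
# config.ini -> unknown
```

Covering the textual cases would require a growing set of per-language heuristics, which is the maintenance burden described above.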
To address these challenges, Google developed Magika, an AI-powered file-type detector. Magika uses a custom, highly optimized deep-learning model built with Keras. The model weighs only about 1MB, which allows fast and precise file identification within milliseconds, even on a CPU. For inference, Magika uses ONNX.
In terms of performance, Magika surpasses existing tools by approximately 20%, based on a benchmark evaluation of 1 million files spanning over 100 file types. Notably, Magika outperforms other tools significantly when it comes to identifying textual files, including code files and configuration files. This improvement in performance is particularly valuable for security applications.
Internally, Google has been using Magika to enhance the safety of its users. Magika helps route Gmail, Drive, and Safe Browsing files to the appropriate security and content policy scanners. On average, it improves file-type identification accuracy by 50% compared to the previous system, which relied on manually written rules. This increased accuracy enables Google to scan 11% more files with its specialized AI scanners for malicious documents and reduces the share of unidentified files to just 3%.
Magika is also set to integrate with VirusTotal, complementing the platform’s existing Code Insight functionality, which uses Google’s generative AI to detect malicious code. Magika will act as a pre-filter for files before Code Insight analyzes them, improving the efficiency and precision of that analysis. This collaboration further bolsters the global cybersecurity ecosystem, making the digital environment safer for users.
By open-sourcing Magika, Google intends to help other software applications improve their file identification accuracy. The code and model for Magika are now freely available on GitHub under the Apache 2.0 license. Magika can be installed with pip as both a standalone command-line utility and a Python library. In addition, an experimental npm package is available for those interested in the TensorFlow.js (TFJS) version.
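As a rough sketch of what using the Python library might look like after running "pip install magika", the snippet below identifies an in-memory byte string. The helper name identify_label is hypothetical. The attribute that holds the detected label appears to have changed between releases (ct_label in early versions, label in later ones), so the helper tries both; treat that as an assumption and check it against the documentation of the installed version.

```python
# Sketch only: requires `pip install magika`. If the package is not
# installed, identify_label() returns None instead of raising.
try:
    from magika import Magika

    _detector = Magika()  # loads the bundled model once
except ImportError:
    _detector = None


def identify_label(data: bytes):
    """Return Magika's predicted content-type label for data, or None."""
    if _detector is None:
        return None
    output = _detector.identify_bytes(data).output
    # Assumption: newer releases expose `label`, early ones `ct_label`.
    label = getattr(output, "label", None) or getattr(output, "ct_label", None)
    return None if label is None else str(label)


print(identify_label(b"# Example\nThis is an example of markdown!"))
```

Creating the Magika object once and reusing it avoids reloading the model on every call, which matters when scanning many files.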
With Magika, Google has taken a significant step toward more accurate, AI-based file-type detection. This development is poised to benefit numerous organizations and researchers by allowing reliable identification of file types at large scale.