Introducing State-of-the-Art in Multimodal Deepfake Detection

Oct 30, 2024

Today, we present our research on Multimodal Deepfake Detection, which expands our industry-leading deepfake detection platform to support image and video analysis. Our approach builds on our established audio detection system to deliver comprehensive protection across all media types. Detect Multimodal maintains the same enterprise-grade scalability and real-time processing capabilities that our customers rely on, while introducing sophisticated new detection algorithms for images and video.

Breaking New Ground in Multimodal Detection

The rapid advancement of generative AI technologies, particularly in text-to-image and text-to-video models, has created an urgent need for robust detection methods. We present a unified approach to synthetic media detection that addresses the challenges posed by modern AI generators such as DALL-E 3, Midjourney, FLUX, SORA, MovieGen, and others.

Resemble AI created a testing dataset that serves as a comprehensive benchmark for evaluating synthetic image detection across modern AI generation methods. We refer to it as the “Modern Dataset” in Table 1. Unlike many existing benchmarks that focus primarily on GAN-generated images or older models, our dataset incorporates content from the latest text-to-image models, including DALL-E 3, Midjourney, and FLUX, ensuring relevance to current real-world challenges in synthetic content detection.

The dataset is meticulously curated to maintain high quality and balanced representation across different generation methods. We employ a careful sampling strategy, taking up to 100 samples from each source to prevent any single generation method from dominating the evaluation metrics. Real images are sourced from established datasets including the Kaggle DALL-E recognition dataset and ArtiFact dataset, providing a robust foundation for testing false positive rates. Importantly, we maintain strict separation between our training and testing data – while our training set utilizes the complete collections of these sources, our test set uses distinct samples that never appear in training.
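The sampling strategy described above can be illustrated with a short sketch. This is not Resemble AI's actual pipeline: the function name `build_test_set`, the `(sample_id, source)` record layout, and the fixed random seed are all assumptions made for illustration. The only details taken from the text are the cap of 100 samples per source and the rule that test samples never appear in training.

```python
import random
from collections import defaultdict

MAX_PER_SOURCE = 100  # cap stated in the text; everything else here is illustrative


def build_test_set(samples, training_ids, max_per_source=MAX_PER_SOURCE, seed=0):
    """Draw up to `max_per_source` test samples from each source.

    `samples` is an iterable of (sample_id, source) pairs (hypothetical schema).
    Any sample whose id appears in `training_ids` is excluded, so the test set
    stays disjoint from the training data.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    by_source = defaultdict(list)
    for sample_id, source in samples:
        if sample_id not in training_ids:
            by_source[source].append(sample_id)

    test_set = {}
    for source, ids in sorted(by_source.items()):
        k = min(max_per_source, len(ids))  # small sources contribute all they have
        test_set[source] = rng.sample(ids, k)
    return test_set


# Toy data: one large source, one small source, and some ids reserved for training.
samples = [(f"dalle3-{i}", "DALL-E 3") for i in range(250)]
samples += [(f"flux-{i}", "FLUX") for i in range(40)]
training_ids = {f"dalle3-{i}" for i in range(50)}

test_set = build_test_set(samples, training_ids)
for source, ids in test_set.items():
    print(source, len(ids))
```

Capping each source at the same size means a generator with thousands of available images carries no more weight in the aggregate metric than one with only a few hundred.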

A key distinguishing feature of our dataset is its focus on high-resolution, high-quality images that reflect real-world usage. We specifically excluded lower-resolution datasets (such as CIFAKE with its 32×32 images) to ensure our testing represents practical deployment scenarios.

Detector       SIDBench    Modern Dataset
Resemble AI    92.5%       96.4%
RINE           91.5%       65.9%
CNNDetect      71.5%       N/A
LGrad          82.3%       N/A
DIMD           78.1%       N/A
FreqDetect     71.5%       N/A
GramNet        81.4%       N/A
UnivFD         80.8%       N/A
PatchCraft     81.7%       N/A

Table 1. The Modern Dataset incorporates content from the latest text-to-image models, including DALL-E 3, Midjourney, and FLUX.

Detecting Modern AI Text-to-Image Generators

Our image detection system leverages state-of-the-art neural networks to identify AI-generated content from all major image generation platforms, including DALL-E 3, Midjourney, Google’s Gemini, Grok, FLUX, and Stable Diffusion.

Capturing Nuances and Detecting Text-to-Video AI Models

The emergence of sophisticated text-to-video AI models like OpenAI’s SORA and Meta’s MovieGen has ushered in a new era of synthetic video content. In response, we’ve developed a comprehensive video analysis system that operates at multiple levels. Our technology processes video content frame by frame, analyzing both spatial artifacts within individual frames and temporal consistency across frames to identify AI-generated footage with high accuracy.
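To make the idea of combining spatial and temporal signals concrete, here is a toy sketch. It is not our production method. It assumes a hypothetical upstream model has already produced, for every frame, a synthetic-probability score and a feature vector; the names `spatial_score`, `temporal_inconsistency`, `video_score`, and the `temporal_weight` parameter are illustrative, as is the simplifying assumption that large frame-to-frame feature jumps count as evidence of generation.

```python
from statistics import mean


def spatial_score(frame_scores):
    """Average the per-frame synthetic probabilities (each in [0, 1])."""
    return mean(frame_scores)  # raises StatisticsError on an empty video


def temporal_inconsistency(frame_features):
    """Mean absolute difference between consecutive frame feature vectors."""
    if len(frame_features) < 2:
        return 0.0  # a single frame carries no temporal signal
    diffs = [
        mean(abs(a - b) for a, b in zip(prev, curr))
        for prev, curr in zip(frame_features, frame_features[1:])
    ]
    return mean(diffs)


def video_score(frame_scores, frame_features, temporal_weight=0.5):
    """Blend the spatial and temporal signals into one score in [0, 1]."""
    s = spatial_score(frame_scores)
    t = min(1.0, temporal_inconsistency(frame_features))  # clamp to [0, 1]
    return (1 - temporal_weight) * s + temporal_weight * t


# Toy input: three frames with high per-frame scores and jittery features.
frame_scores = [0.9, 0.8, 1.0]
frame_features = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
score = video_score(frame_scores, frame_features)
print(round(score, 2))
```

A weighted blend is only one way to fuse the two signals; a learned temporal model over frame features is a common alternative.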

What sets our video analysis apart is its ability to adapt to emerging generation techniques. As new text-to-video models emerge, our system’s architecture allows for rapid adaptation and continued accuracy in detection. This adaptability is crucial in maintaining effective protection against increasingly sophisticated video generation technologies.

Built for Enterprise Scale

Resemble Detect’s multimodal platform has been architected from the ground up to meet the demanding requirements of enterprise deployments. At its core, our system features an API-first design philosophy that enables seamless integration into existing enterprise workflows, whether you’re processing content through custom applications, content management systems, or security platforms. This flexibility allows organizations to implement detection capabilities exactly where they’re needed, without disrupting established processes.

Our scalable architecture adapts dynamically to processing demands, handling everything from individual file analysis to high-volume batch processing across all content types. The system’s distributed processing capabilities ensure consistent performance even under heavy loads, making it suitable for organizations dealing with massive content volumes across audio, image, and video formats. Whether you’re analyzing thousands of customer service calls, monitoring social media feeds, or validating assets in a content management system, our platform maintains its speed and accuracy without compromise.

Security-conscious organizations will appreciate our comprehensive deployment options, including fully air-gapped installations for environments with the strictest security requirements. This flexibility extends to hybrid deployments that can balance security needs with operational efficiency. Our on-premises solutions provide the same advanced detection capabilities as our cloud-based offerings, ensuring that organizations never have to choose between data security and functionality.
