Leaked Evidence Suggests OpenAI’s AI Training Data Was Stolen by China’s DeepSeek

By Harry Negron, February 2, 2025

OpenAI is investigating allegations that Chinese AI startup DeepSeek and Alibaba have misappropriated its proprietary technology to develop their own advanced reasoning models. According to reports, OpenAI suspects that DeepSeek utilized a technique known as "model distillation" to replicate the capabilities of OpenAI's models, potentially violating intellectual property rights.

Model distillation involves training a smaller model to mimic the behavior of a larger, more complex one, often requiring access to the original model's outputs. OpenAI believes that DeepSeek may have exploited its API to facilitate this process, thereby infringing upon OpenAI's terms of service.
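To make the technique concrete, here is a minimal, generic sketch of the distillation loss described above: a student model is trained to match the teacher's temperature-softened output distribution. This is an illustration of distillation in general, not OpenAI's or DeepSeek's actual pipeline; the function names and numbers are hypothetical.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Convert logits to a probability distribution, softened by temperature T."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between the teacher's soft labels and the student's outputs.

    In distillation, the student is trained to minimize this quantity, which
    requires access to the teacher's outputs (e.g., via an API).
    """
    p = softmax(teacher_logits, T)  # teacher's "soft labels"
    q = softmax(student_logits, T)  # student's predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student whose logits already match the teacher's incurs ~zero loss...
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))       # ~0.0
# ...while a mismatched student incurs a positive loss to minimize.
print(distillation_loss([0.0, 0.0, 0.0], teacher))
```

In practice the student would be a full neural network and this loss would be minimized by gradient descent over many teacher-generated outputs, which is why large-scale API access to the teacher model matters.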

We received evidence from another insider that presents a contrasting view. A worker who has been training OpenAI's reasoning models for several months, and whose identity will remain confidential, revealed that the model they are developing behaves essentially identically to DeepSeek's. This individual reported running the same problems through OpenAI's teacher model (the version available to them for training) and found that the reasoning text and final answers from DeepSeek's model and OpenAI's teacher model were virtually indistinguishable.

It remains unclear how these Chinese companies acquired such information. However, there is a possibility that the Chinese government has installed proxies—whether workers, administrators, or both—within OpenAI and on training platforms like DataAnnotation.Tech, which may have facilitated the breach.

In a related development, security researchers have reportedly jailbroken DeepSeek's AI model, uncovering internal information that suggests a connection to OpenAI's technology. This finding has intensified concerns about the unauthorized use of proprietary data in developing competing AI models.

In response to these developments, OpenAI has expedited the release of its o3 reasoning models, aiming to maintain its competitive edge in the AI industry. The o3 models are designed to enhance reasoning capabilities, offering improved problem-solving across a range of complex fields, much like what DeepSeek already offers, but without the censorship and bias.

These incidents underscore the challenges of protecting intellectual property in the rapidly evolving field of artificial intelligence. As AI models become increasingly sophisticated, ensuring the integrity and security of proprietary technologies remains a critical concern for industry leaders.
