Why Did I Criticize That AI Paper Featured on the Cover of Nature?
Published in Computational Sciences
Last September, the DeepSeek-R1 study made the cover of Nature, hailed as "the world's first peer-reviewed mainstream large language model." As a researcher focused on AI governance, I felt both excitement at the advance and a touch of unease: when large models start being published as "scientific achievements," what criteria should we use to evaluate them?
So I wrote a critical commentary, "Examining methodological rigor, ethical governance, and scientific claims: a critical review of the DeepSeek-R1 study." It was eventually published in the "Review" section of Discover Artificial Intelligence.
What did I see?
DeepSeek-R1 is indeed impressive: it uses pure reinforcement learning to train powerful reasoning abilities, performs remarkably well on tasks such as mathematics and programming, and has even open-sourced its model weights. However, when I carefully read the paper and its 83-page supplementary materials, some issues came to light:
The main text of the paper describes the training data in a single sentence: "All from the internet." Data is the "raw material" of a model; if it is not transparent, it is like running a chemistry experiment without disclosing which reagents were used.
Although the supplementary materials list in detail how many math problems and how much code were used, key information such as where the data specifically came from, how it was collected, whether it contains bias, or whether there are copyright issues—all of this is absent.
The safety evaluation is thorough, but it is entirely static testing. In the real world, malicious users will probe the model with adaptive attacks that static tests cannot anticipate.
Most importantly, the paper does not discuss at all questions like "Who is responsible if the model generates problematic content?"
My criticism is not meant to "nitpick"
Someone asked me: Did you write this article to dismiss DeepSeek's work? On the contrary. I believe that DeepSeek-R1's technical contributions are tremendous, but precisely because of its influence, we need to scrutinize its transparency more rigorously. If even a top-tier journal paper has disclosure gaps, how far is the entire AI field from the scientific standards of "reproducibility and responsibility"?
I shifted the focus of my critique from "whether disclosures exist" to "how disclosures are made." I proposed three original frameworks:
Data Disclosure Template: Guides researchers on how to document data sources, cleaning processes, and legal compliance information in the future.
AI Accountability Dimension Framework: Expands safety assessment from technical metrics to governance aspects such as accountability, ethics, and risk warnings.
Multi-Stakeholder Action Agenda: Provides concrete and actionable recommendations for the academic publishing community, policymakers, and corporate labs, respectively.
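As a rough illustration of the first framework, here is a sketch of the kinds of fields a data disclosure template might ask researchers to fill in. This is my own illustrative mock-up, not the actual template from the published article; the field names are assumptions.

```python
# Hypothetical sketch of a data disclosure record (illustrative only;
# the actual template appears in the published commentary).
data_disclosure = {
    "sources": ["web crawl (domains, date range)", "licensed corpora"],
    "collection_method": "how the data was gathered and filtered",
    "cleaning": ["deduplication", "quality filtering", "PII removal"],
    "bias_assessment": "known demographic or topical skews, if any",
    "legal_compliance": ["copyright status", "license terms", "consent"],
}

# A reviewer could then check that no required field is left empty.
missing = [k for k, v in data_disclosure.items() if not v]
print("Missing fields:", missing)
```

The point is not the exact schema but that each field above corresponds to information the DeepSeek-R1 paper leaves undisclosed.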
The Most Thrilling Moment: The 'Single-Word Dispute' During the Proof Stage
After the article was accepted, I discovered during the final proof that the term "data sheets" had been mistakenly rendered as "data tables" in one table. The two terms differ by a single word but mean entirely different things: the former refers to the standard "Datasheets for Datasets" framework proposed by Gebru and colleagues, while the latter is just an ordinary data table. The journal stipulates that only errors that "directly compromise the integrity of the science" may be corrected at the proof stage. Anxiously, I wrote to the editor, explaining that this terminological confusion could mislead readers about what I was actually advocating. Fortunately, the editor understood and agreed to the correction.
At that moment, I realized: the rigor of academic writing sometimes lies hidden in the choice of a single word.
Why is this article worth writing?
Today, AI research is being commercialized at an unprecedented speed. Many laboratories, when publishing papers, only flaunt scores or demos, yet remain evasive about the origins of their data and ethical considerations. This is not how science should be. My article may not change the entire industry, but if it can prompt more researchers to start asking "Where is your dataset documentation?", then it is worthwhile.
I am especially grateful to the editors of Discover Artificial Intelligence for being willing to publish such a critical article. It shows that the academic publishing community is also reflecting: for technologies like AI that will profoundly shape society, it is not enough to ask "Is the performance strong?"; we must also ask "Are you responsible and transparent?"
In the future, I hope to see more researchers not only releasing model weights but also providing complete "Datasheets for Datasets" and "model cards." Scientific progress requires not only speed but also greater reliability and reproducibility. And the path to that begins with each piece of openly transparent documentation.
Recently, DeepSeek has experienced several crashes. This is consistent with the situations analyzed in my commentary; however, since there was no supporting evidence at the time, I removed this part from the published version. The following is the content I deleted:
More importantly, the theoretical foundation on which its core algorithm design relies may itself be flawed. DeepSeek-R1 employs the GRPO algorithm, whose objective function explicitly includes the reverse KL divergence D_KL(π_θ ∥ π_ref) as a regularization term. The paper adopts the common intuition from generative modeling, treating this design as a way to keep the policy from deviating too far from the reference distribution. However, recent theoretical work indicates that directly transferring the mode-covering/mode-seeking characteristics of forward/reverse KL to the reinforcement learning setting is a misconception: in reward-based optimization, the primary tendency of any KL regularization is to induce mode collapse, with its behavior largely determined by the regularization strength and the reward scale rather than by the direction of the KL divergence itself [11]. This implies that the regularization design DeepSeek-R1 relies on may be theoretically flawed and biased against diversity from the outset.
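To make the mode-collapse tendency concrete, here is a minimal numeric sketch (my own illustration, not taken from the paper or from [11]). For the KL-regularized objective E_π[r] − β·D_KL(π ∥ π_ref) over a discrete action set, the maximizer has the closed form π*(a) ∝ π_ref(a)·exp(r(a)/β); as the regularization strength β shrinks, the optimum concentrates on the highest-reward action no matter how diverse the reference distribution is. The action set and reward values below are invented for illustration.

```python
import numpy as np

def optimal_policy(p_ref, rewards, beta):
    # Closed-form maximizer of E_pi[r] - beta * KL(pi || p_ref):
    # pi*(a) is proportional to p_ref(a) * exp(r(a) / beta).
    w = p_ref * np.exp(rewards / beta)
    return w / w.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_ref = np.ones(5) / 5                          # uniform reference over 5 actions
rewards = np.array([1.0, 0.9, 0.5, 0.2, 0.0])   # toy reward values

# As beta decreases, the optimal policy collapses toward the argmax action.
for beta in [1.0, 0.1, 0.01]:
    pi = optimal_policy(p_ref, rewards, beta)
    print(f"beta={beta:<5} pi={pi.round(3)} entropy={entropy(pi):.3f}")
```

The direction of the KL term never enters this calculation; only β and the reward scale control how sharply the optimum concentrates, which is the point the deleted passage makes.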
These concerns are not unfounded. A substantial body of empirical research in recent years has repeatedly confirmed that reinforcement learning training can systematically reduce the diversity of language model outputs, leading to the phenomenon known as entropy collapse [11]. This loss of diversity is broadly manifested across multiple dimensions, including formatting, random generation, creative exploration, and reasoning paths. Theoretical analyses further indicate that in policy gradient optimization, if an agent's action probabilities are highly correlated with their advantage values, entropy tends to decrease. This suggests that the tendency for chains of thought (CoT) in DeepSeek-R1 to be lengthy is likely not an isolated case, but rather a systemic risk inherent to the KL-regularized reinforcement learning framework it employs when faced with complex reward signals.
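The entropy-decrease mechanism described above can be illustrated with a toy REINFORCE simulation (a sketch under invented assumptions: the four-action bandit, reward values, and learning rate are all my own, not DeepSeek's training setup). Once action probabilities and advantages become positively correlated, policy entropy drifts downward as training proceeds.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                          # softmax policy over 4 actions
rewards = np.array([1.0, 0.8, 0.6, 0.4])      # toy per-action rewards

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

hist = []
for step in range(300):
    p = softmax(logits)
    a = rng.choice(4, p=p)
    adv = rewards[a] - (p * rewards).sum()    # advantage vs. expected-value baseline
    # REINFORCE gradient on logits: adv * grad_log_pi(a) = adv * (one_hot(a) - p)
    logits += 0.5 * adv * (np.eye(4)[a] - p)
    hist.append(entropy(softmax(logits)))

print(f"entropy after 1 step: {hist[0]:.3f}, after 300 steps: {hist[-1]:.3f}")
```

High-advantage actions gain probability at every step, so the policy concentrates and entropy falls, mirroring the "entropy collapse" phenomenon the deleted passage describes.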
11. Chen, A.G. et al.: KL-Regularized Reinforcement Learning is Designed to Mode Collapse. (2025) https://doi.org/10.48550/arXiv.2510.20817
Crucially, the theoretical foundation underlying its methodology is being challenged by recent research. The paper uses reverse KL divergence as the core regularization term. However, recent studies have systematically rejected the intuition of simply transferring "mode-seeking" from generative models to reinforcement learning, showing that under common settings of low regularization strength and equivalent rewards, the KL-regularized objective itself leads to a unimodal optimal distribution, structurally suppressing output diversity [11]. This suggests that the claimed potential to "explore advanced, non-human reasoning paths" may be inherently constrained by the algorithm's own tendency toward "mode collapse." Consistently, a large body of empirical research also confirms that reinforcement learning training systematically induces "entropy collapse" in large language models, manifested as diminished output diversity, which is particularly pronounced in tasks requiring creativity.
Therefore, the phenomena observed in the paper, interpreted as the "emergence of reasoning ability" (such as the continuously increasing length of thought chains), need to be rigorously distinguished from another possibility: is this truly a deepening of the model's intelligence, or is it the model exhibiting saturation-style optimization within a limited pattern, or even falling into repetitive expression? The peer review process of the study has also pointed out that several of its design decisions lack corresponding empirical evidence, indicating that to support its core claims, more thorough methodological validation and controlled experiments are necessary (for example, ablation studies on the choice of regularization terms and components of the reward function) to rule out the above alternative explanations.
This may suggest that future research aiming to build truly robust and creative reasoning models must confront the challenge of entropy collapse head-on. Promising approaches are beginning to emerge, including count-based intrinsic exploration rewards that encourage models to cover a broader solution space, and dynamic regularization strategies that balance reward optimization against distributional breadth during training. These ideas, emerging from recent community research, offer finer-grained control over training the next generation of large language models than KL regularization alone.
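As a sketch of the first idea, a count-based intrinsic bonus can be added to the task reward so that novel solutions earn more than repeated ones. The function name, the 1/√N bonus form, and the coefficient below are illustrative assumptions, not a published recipe.

```python
import numpy as np
from collections import Counter

counts = Counter()  # visit counts per (hashed) solution

def shaped_reward(solution_key, base_reward, c=0.5):
    # Count-based exploration bonus: the bonus decays as 1/sqrt(visit count),
    # so a solution seen for the first time earns the full bonus c.
    counts[solution_key] += 1
    return base_reward + c / np.sqrt(counts[solution_key])

print(shaped_reward("proof_A", 1.0))  # first visit: full bonus
print(shaped_reward("proof_A", 1.0))  # repeat: bonus shrinks
print(shaped_reward("proof_B", 1.0))  # novel solution: full bonus again
```

In practice the key would be a hash of a normalized reasoning trace; the design question is choosing a representation under which "same solution" is detectable, which this sketch leaves open.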