Why Did I Criticize That AI Paper Featured on the Cover of Nature?
Published in Computational Sciences
Last September, the DeepSeek-R1 study made the cover of Nature and was hailed as "the world's first peer-reviewed mainstream large language model." As a researcher focused on AI governance, I felt both excitement at this advance and a touch of unease: when large models start being published as "scientific achievements," what criteria should we use to evaluate them?
So I wrote a critical commentary, "Examining methodological rigor, ethical governance, and scientific claims: a critical review of the DeepSeek-R1 study." The article was eventually published in the "Review" section of Discover Artificial Intelligence.
What did I see?
DeepSeek-R1 is indeed impressive: it uses pure reinforcement learning to train powerful reasoning abilities, performs remarkably well on tasks such as mathematics and programming, and has even open-sourced its model weights. However, when I carefully read the paper and its 83-page supplementary materials, some issues came to light:
The main text of the paper only describes the training data in one sentence: "All from the internet." Data is the "raw material" of the model; if it is not transparent, it is like a chemical experiment without revealing which reagents were used.
The supplementary materials list in detail how many math problems and how much code were used, but key information is absent: where the data specifically came from, how it was collected, whether it contains bias, and whether there are copyright issues.
The safety evaluation is very thorough, but it is all static testing. In the real world, malicious users will find ways to attack the model, which static tests cannot detect.
Most importantly, the paper does not discuss at all questions like "Who is responsible if the model generates problematic content?"
My criticism is not meant to "nitpick"
Someone asked me: Did you write this article to dismiss DeepSeek's work? On the contrary. I believe that DeepSeek-R1's technical contributions are tremendous, but precisely because of its influence, we need to scrutinize its transparency more rigorously. If even a top-tier journal paper has disclosure gaps, how far is the entire AI field from the scientific standards of "reproducibility and responsibility"?
I shifted the focus of my critique from "whether disclosures exist" to "how disclosures are made." I proposed three original frameworks:
Data Disclosure Template: Guides researchers on how to document data sources, cleaning processes, and legal compliance information in the future.
AI Accountability Dimension Framework: Expands safety assessment from technical metrics to governance aspects such as accountability, ethics, and risk warnings.
Multi-Stakeholder Action Agenda: Provides concrete and actionable recommendations for the academic publishing community, policymakers, and corporate labs, respectively.
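To make the idea of a data disclosure template concrete, here is a minimal sketch of what a machine-checkable disclosure record could look like. It is not the template from the published article: the class name `DataDisclosure`, its field names, and the `missing_fields` helper are all hypothetical, chosen only to mirror the gaps discussed above (sources, collection, cleaning, bias, and copyright).

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class DataDisclosure:
    """Hypothetical disclosure record; field names are illustrative assumptions."""
    dataset_name: str
    sources: List[str] = field(default_factory=list)         # where the data came from
    collection_method: str = ""                               # how it was collected
    cleaning_steps: List[str] = field(default_factory=list)  # filtering, deduplication, etc.
    known_biases: List[str] = field(default_factory=list)    # documented bias concerns
    license_or_copyright_status: str = ""                     # legal compliance information

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Usage: a record that only says "from the internet" leaves most fields empty.
record = DataDisclosure(dataset_name="example-math-set", sources=["the internet"])
print(record.missing_fields())
```

A reviewer or journal could use a check like `missing_fields` to ask "how are disclosures made?" rather than only "do disclosures exist?".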
The Most Thrilling Moment: The 'Single-Word Dispute' During the Proof Stage
After the article was accepted, I discovered during the final proof that the term 'data sheets' had been mistakenly written as 'data tables' in a table. In English, the two terms differ by a single word but mean completely different things: the former refers to the standard 'Datasheets for Datasets' framework proposed by Gebru and colleagues, while the latter is just an ordinary data table. The journal stipulates that only errors that 'directly compromise the integrity of the science' may be corrected at the proof stage. Anxiously, I wrote an email to the editor, explaining that this terminological confusion could mislead readers about what I was truly advocating. Fortunately, the editor understood and agreed to the correction.
At that moment, I realized: the rigor of academic writing sometimes lies hidden in the choice of a single word.
Why is this article worth writing?
Today, AI research is being commercialized at an unprecedented speed. Many laboratories, when publishing papers, only flaunt scores or demos, yet remain evasive about the origins of their data and ethical considerations. This is not how science should be. My article may not change the entire industry, but if it can prompt more researchers to start asking "Where is your dataset documentation?", then it is worthwhile.
I am especially grateful to the editors of Discover Artificial Intelligence for being willing to publish such a critical article. It indicates that the academic publishing community is also engaging in reflection: for technologies like AI that will profoundly impact society, it is far from enough to merely ask 'Is the performance strong?'; we must also ask 'Are you responsible and transparent?'
In the future, I hope to see more researchers not only releasing model weights but also providing complete "Datasheets for Datasets" and "model cards." Scientific progress requires not only speed but also greater reliability and reproducibility. And the path to that begins with each piece of openly transparent documentation.
Recently, DeepSeek has experienced several crashes. This is consistent with the situation analyzed in my paper; however, because I had no supporting evidence at the time, I removed this part from the published version. The deleted content follows:
More importantly, the theoretical foundation on which its core algorithm design relies may itself be flawed. DeepSeek-R1 employs the GRPO algorithm, whose objective function explicitly includes the reverse KL divergence D_KL(π_θ ∥ π_ref) as a regularization term. The paper follows the common intuition from generative modeling and treats this term as a constraint that keeps the policy from deviating excessively from the reference distribution. However, recent theoretical studies indicate that directly transferring the mode-covering/mode-seeking characterization of forward/reverse KL to the reinforcement learning framework is a misconception. Research points out that in reward-based optimization, the primary tendency of any KL regularization is to induce mode collapse, and that its behavior is largely determined by the strength of regularization and the scale of rewards rather than by the direction of the KL divergence itself [11]. This implies that the regularization design relied upon by DeepSeek-R1 may be theoretically flawed and may work against output diversity by construction.
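The role of regularization strength and reward scale can be seen in the textbook closed form for a reverse-KL-regularized reward objective. The derivation below is a simplified, sequence-level sketch: the symbols r, β, and Z are notation assumed here for illustration, and the actual GRPO objective has additional components (such as clipping and a sampled KL estimate) that are not modeled.

```latex
% Idealized KL-regularized objective (assumed notation: reward r, strength \beta)
J(\pi) \;=\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
        \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its maximizer is the reference distribution tilted by the exponentiated reward
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \,
        \exp\!\Big( \frac{r(x, y)}{\beta} \Big),
\qquad
Z(x) \;=\; \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x) \, \exp\!\Big( \frac{r(x, y')}{\beta} \Big)

% Relative mass of two responses depends only on the reference ratio and (r_1 - r_2)/\beta
\frac{\pi^{*}(y_1 \mid x)}{\pi^{*}(y_2 \mid x)}
  \;=\; \frac{\pi_{\mathrm{ref}}(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
        \, \exp\!\Big( \frac{r(x, y_1) - r(x, y_2)}{\beta} \Big)
```

In this idealized form, the shape of the optimum is set by the ratio of reward differences to β and by the reference distribution. Two responses with equal reward keep exactly the relative mass the reference model gave them, however small β becomes.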
These concerns are not unfounded. A substantial body of empirical research in recent years has repeatedly confirmed that reinforcement learning training can systematically reduce the diversity of language model outputs, a phenomenon known as entropy collapse [11]. This loss of diversity shows up across multiple dimensions, including formatting, random generation, creative exploration, and reasoning paths. Theoretical analyses further indicate that in policy gradient optimization, entropy tends to decrease when a policy's action probabilities are highly correlated with the corresponding advantage values. This suggests that the tendency for chains of thought (CoT) in DeepSeek-R1 to be lengthy is likely not an isolated case, but rather a systemic risk inherent to the KL-regularized reinforcement learning framework it employs when faced with complex reward signals.
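The link between probability-advantage correlation and entropy can be illustrated with a toy calculation. The sketch below is an assumption-laden example, not an experiment from any cited work: it takes a three-action softmax policy with made-up logits and rewards, applies one exact policy-gradient step, and compares the entropy before and after.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def policy_gradient_step(logits, rewards, lr=1.0):
    """One exact (expected) policy-gradient step for a softmax policy."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    advantages = [r - baseline for r in rewards]
    # For a softmax policy, dJ/dz_a = pi_a * A_a.
    return [z + lr * p * a for z, p, a in zip(logits, probs, advantages)]


# Hypothetical toy values, chosen only for illustration.
logits = [2.0, 1.0, 0.0]
aligned = [1.0, 0.5, 0.0]   # likely actions also have the highest advantage
opposed = [0.0, 0.5, 1.0]   # likely actions have the lowest advantage

h_before = entropy(softmax(logits))
h_aligned = entropy(softmax(policy_gradient_step(logits, aligned)))
h_opposed = entropy(softmax(policy_gradient_step(logits, opposed)))
print(f"before={h_before:.4f} aligned={h_aligned:.4f} opposed={h_opposed:.4f}")
```

In this toy setting, entropy drops when the already-likely actions are the ones rewarded and rises in the opposite case, which matches the direction of the theoretical claim.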
11. Chen, A.G. et al.: KL-Regularized Reinforcement Learning is Designed to Mode Collapse. (2025) https://doi.org/10.48550/arXiv.2510.20817
The reverse KL divergence: D_KL(π_θ ∥ π_ref)
Crucially, the theoretical foundation underlying its methodology is being challenged by recent research. The paper uses reverse KL divergence as the core regularization term. However, recent studies reject the intuition of simply transferring "mode-seeking" behavior from generative models to reinforcement learning, showing that under common settings of low regularization strength and equal rewards, the KL-regularized objective itself leads to a unimodal optimal distribution, structurally suppressing output diversity [11]. This suggests that the claimed potential to "explore advanced, non-human reasoning paths" may be inherently constrained by the algorithm's own tendency toward "mode collapse." Consistent with this, a large body of empirical research confirms that reinforcement learning training systematically induces "entropy collapse" in large language models, manifested as diminished output diversity, which is particularly pronounced in tasks requiring creativity.
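The "low regularization strength and equal rewards" setting can be explored numerically on a toy problem. The sketch below assumes the idealized objective E_π[r] − β·D_KL(π ∥ π_ref), whose maximizer over a finite set is π*(y) ∝ π_ref(y)·exp(r(y)/β). The three candidate answers, their reference probabilities, and their rewards are hypothetical. It illustrates the closed form only and does not reproduce the analysis in [11].

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) on a finite set."""
    r_max = max(rewards)  # shift rewards for numerical stability
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical toy problem: answers 0 and 1 are equally correct, answer 2 is wrong.
ref_probs = [0.50, 0.05, 0.45]
rewards = [1.0, 1.0, 0.0]

results = {}
for beta in (1.0, 0.1, 0.01):
    pi_star = kl_regularized_optimum(ref_probs, rewards, beta)
    results[beta] = pi_star
    print(beta, [round(p, 4) for p in pi_star], round(entropy(pi_star), 4))
```

As β shrinks, the wrong answer loses its mass, but the two equally rewarded answers keep the 10:1 ratio they had under the reference model. In this toy case the regularizer never restores diversity among correct answers, so the optimum stays dominated by the mode the reference model already preferred.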
Therefore, the phenomena that the paper interprets as the "emergence of reasoning ability" (such as the continuously increasing length of chains of thought) need to be rigorously distinguished from another possibility: is this truly a deepening of the model's intelligence, or is the model exhibiting saturation-style optimization within a limited pattern, or even falling into repetitive expression? The peer review of the study also pointed out that several of its design decisions lack corresponding empirical evidence. To support its core claims and rule out these alternative explanations, more thorough methodological validation and controlled experiments are necessary, for example ablation studies on the choice of regularization term and on the components of the reward function.
This may suggest that future research aiming to build truly robust and creative reasoning models must directly address the challenge of entropy collapse. Promising approaches are beginning to emerge, including count-based intrinsic exploration rewards that encourage models to cover a broader solution space, and dynamic regularization strategies that balance reward optimization against distribution breadth during training. These ideas, drawn from recent community research, offer finer-grained control over the training of the next generation of large language models than relying solely on KL regularization.
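As a rough illustration of the count-based idea, the sketch below adds a bonus that shrinks as the same solution is seen again. It is a generic exploration-bonus pattern rather than a method taken from any specific paper: the class name `CountBasedBonus`, the `scale` parameter, the 1/sqrt(count) schedule, and the use of a string key to identify a solution are all assumptions made for the example.

```python
import math
from collections import Counter


class CountBasedBonus:
    """Hypothetical reward shaper: rarely seen solutions earn a larger bonus."""

    def __init__(self, scale=0.5):
        self.scale = scale
        self.counts = Counter()

    def shaped_reward(self, solution_key, task_reward):
        # Record the visit, then add a bonus that decays as 1 / sqrt(visit count).
        self.counts[solution_key] += 1
        bonus = self.scale / math.sqrt(self.counts[solution_key])
        return task_reward + bonus


# Usage: the same correct strategy earns less each time it is repeated,
# while a new correct strategy receives the full bonus.
shaper = CountBasedBonus(scale=0.5)
repeated = [shaper.shaped_reward("induction-proof", 1.0) for _ in range(4)]
novel = shaper.shaped_reward("generating-function-proof", 1.0)
print(repeated, novel)
```

In practice, how to define `solution_key` (for example by clustering reasoning paths) is the hard part and is left open here.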