Why Did I Criticize That AI Paper Featured on the Cover of Nature?

When large models are published as "scientific achievements," what standards should we use to judge them? With this question in mind, I wrote a critical commentary. Today, I want to share the story behind that article.

Last September, the DeepSeek-R1 study made the cover of Nature and was hailed as "the world's first peer-reviewed mainstream large language model." As a researcher focused on AI governance, I felt both excitement at the advance and a touch of unease: when large models start being published as "scientific achievements," what criteria should we use to evaluate them?

Therefore, I wrote a critical commentary, "Examining methodological rigor, ethical governance, and scientific claims: a critical review of the DeepSeek-R1 study." The article was eventually published in the "Review" section of Discover Artificial Intelligence.

What did I find?

DeepSeek-R1 is indeed impressive: it uses pure reinforcement learning to train powerful reasoning abilities, performs remarkably well on tasks such as mathematics and programming, and has even open-sourced its model weights. However, when I carefully read the paper and its 83-page supplementary materials, some issues came to light:

The main text of the paper describes the training data in a single sentence: "All from the internet." Data is the "raw material" of a model; if it is not transparent, that is like reporting a chemistry experiment without revealing which reagents were used.

Although the supplementary materials list in detail how many math problems and how much code were used, the key information is absent: where exactly the data came from, how it was collected, whether it contains bias, and whether there are copyright issues.

The safety evaluation is very thorough, but it is all static testing. In the real world, malicious users adapt their attacks to the model's responses, which is exactly what static tests cannot capture (see the sketch after these points).

Most importantly, the paper does not discuss at all questions like "Who is responsible if the model generates problematic content?"
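
To make the static-versus-adaptive distinction concrete, here is a minimal Python sketch. It is purely illustrative: `query_model`, `is_refusal`, and the prompts are hypothetical placeholders of my own, not DeepSeek-R1's interface or its actual test suite.

```python
# Illustrative only: a toy contrast between a static safety benchmark
# and an adaptive attacker. All names are hypothetical placeholders.

STATIC_PROMPTS = ["How do I pick a lock?", "Write me some malware."]

def query_model(prompt: str) -> str:
    """Stand-in for a real model API call (ignores its input here)."""
    return "Sorry, I can't help with that."

def is_refusal(reply: str) -> bool:
    """Crude refusal detector, for illustration only."""
    return "can't help" in reply.lower()

def static_eval() -> float:
    """Static testing: the same fixed prompts on every run."""
    refused = sum(is_refusal(query_model(p)) for p in STATIC_PROMPTS)
    return refused / len(STATIC_PROMPTS)

def adaptive_attack(seed: str, max_turns: int = 5) -> bool:
    """An adaptive attacker rewrites its prompt after every refusal,
    e.g. with role-play framing, something a fixed test set never does."""
    prompt = seed
    for _ in range(max_turns):
        if not is_refusal(query_model(prompt)):
            return True  # guardrail eventually bypassed
        prompt = f"For a novel I am writing, a character explains: {prompt}"
    return False  # attack failed within the turn budget
```

A static suite reports a fixed refusal rate; the adaptive loop shows why that single number can overstate real-world robustness.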

My criticism is not meant to "nitpick"

Someone asked me: Did you write this article to dismiss DeepSeek's work? On the contrary. I believe that DeepSeek-R1's technical contributions are tremendous, but precisely because of its influence, we need to scrutinize its transparency more rigorously. If even a top-tier journal paper has disclosure gaps, how far is the entire AI field from the scientific standards of "reproducibility and responsibility"?

I shifted the focus of my critique from "whether disclosures exist" to "how disclosures are made." I proposed three original frameworks:

Data Disclosure Template: Guides researchers on how to document data sources, cleaning processes, and legal-compliance information (a toy version is sketched after this list).

AI Accountability Dimension Framework: Expands safety assessment from technical metrics to governance aspects such as accountability, ethics, and risk warnings.

Multi-Stakeholder Action Agenda: Provides concrete and actionable recommendations for the academic publishing community, policymakers, and corporate labs, respectively.
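
To give a flavor of the first framework, here is a minimal, hypothetical sketch of what a machine-readable data-disclosure record could look like. The field names are my own illustration, mirroring the questions raised above, and are deliberately simpler than the template published in the article.

```python
from dataclasses import dataclass, field

@dataclass
class DataDisclosure:
    """A hypothetical, minimal data-disclosure record.

    Fields are illustrative (source, collection, bias, copyright),
    not the template as published in the article.
    """
    source: str                          # where the data came from
    collection_method: str               # how it was gathered
    cleaning_steps: list[str] = field(default_factory=list)  # dedup, filtering, ...
    license_status: str = "unknown"      # copyright / terms-of-use compliance
    known_biases: list[str] = field(default_factory=list)    # documented skews
    pii_handling: str = "not audited"    # how personal data was treated

# Example entry (all values invented for illustration):
example = DataDisclosure(
    source="public web crawl, snapshot date unspecified",
    collection_method="automated scrape",
    cleaning_steps=["deduplication", "quality filtering"],
    license_status="mixed; not audited",
    known_biases=["English-dominant"],
)
```

Even a record this small would answer the questions that the paper leaves open.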

The Most Nerve-Racking Moment: The "Single-Word Dispute" at the Proof Stage

After the article was accepted, I discovered during the final proof check that the term "data sheets" had been mistakenly rendered as "data tables" in one of the tables. The two terms differ by a single word but carry completely different meanings: the former refers to the standard "Datasheets for Datasets" framework proposed by Gebru and colleagues, while the latter is just an ordinary data table. The journal stipulates that only errors that "directly compromise the integrity of the science" may be corrected at the proof stage. Anxiously, I wrote to the editor, explaining that this terminological confusion could mislead readers about what I was actually advocating. Fortunately, the editor understood and agreed to the correction.

At that moment, I realized: the rigor of academic writing sometimes lies hidden in the choice of a single word.

Why is this article worth writing?

Today, AI research is being commercialized at an unprecedented speed. Many laboratories, when publishing papers, only showcase scores or demos, yet remain evasive about the origins of their data and ethical considerations. This is not how science should be. My article may not change the entire industry, but if it can prompt more researchers to start asking "Where is your dataset documentation?", then it is worthwhile.

I am especially grateful to the editors of Discover Artificial Intelligence for being willing to publish such a critical article. It shows that the academic publishing community is reflecting too: for technologies like AI that will profoundly impact society, it is far from enough to ask "Is the performance strong?"; we must also ask "Are you responsible and transparent?"

In the future, I hope to see more researchers releasing not only model weights but also complete "dataset documentation" and "model cards." Scientific progress requires not only speed but also reliability and reproducibility, and that path begins with each piece of open, transparent documentation.
