Why Did I Criticize That AI Paper Featured on the Cover of Nature?

When large models are published as "scientific achievements," what standards should we use to judge them? With this question in mind, I wrote a critical commentary. Today, I want to share the story behind that article.

Last September, the DeepSeek-R1 study made the cover of Nature and was hailed as "the world's first peer-reviewed mainstream large language model." As a researcher focused on AI governance, I felt both excitement at the advance and a touch of unease: when large models start being published as "scientific achievements," what criteria should we use to evaluate them?

Therefore, I wrote a critical commentary, "Examining methodological rigor, ethical governance, and scientific claims: a critical review of the DeepSeek-R1 study." The article was eventually published in the "Review" section of Discover Artificial Intelligence.

What did I find?

DeepSeek-R1 is indeed impressive: it uses pure reinforcement learning to train powerful reasoning abilities, performs remarkably well on tasks such as mathematics and programming, and has even open-sourced its model weights. However, when I carefully read the paper and its 83-page supplementary materials, some issues came to light:

The main text of the paper describes the training data in a single sentence: "All from the internet." Data is the "raw material" of a model; if it is not transparent, that is like reporting a chemistry experiment without revealing which reagents were used.

Although the supplementary materials list in detail how many math problems and how much code were used, the key information is absent: where exactly the data came from, how it was collected, whether it contains bias, and whether there are copyright issues.

The safety evaluation is very thorough, but it is all static testing. In the real world, malicious users adapt their attacks to the model's responses, which is exactly what static tests cannot capture (see the sketch after these points).

Most importantly, the paper does not discuss at all questions like "Who is responsible if the model generates problematic content?"
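
To make the static-versus-adaptive distinction concrete, here is a minimal Python sketch. It is purely illustrative: `query_model`, `is_refusal`, and the prompts are hypothetical placeholders of my own, not DeepSeek-R1's interface or its actual test suite.

```python
# Illustrative only: a toy contrast between a static safety benchmark
# and an adaptive attacker. All names are hypothetical placeholders.

STATIC_PROMPTS = ["How do I pick a lock?", "Write me some malware."]

def query_model(prompt: str) -> str:
    """Stand-in for a real model API call (ignores its input here)."""
    return "Sorry, I can't help with that."

def is_refusal(reply: str) -> bool:
    """Crude refusal detector, for illustration only."""
    return "can't help" in reply.lower()

def static_eval() -> float:
    """Static testing: the same fixed prompts on every run."""
    refused = sum(is_refusal(query_model(p)) for p in STATIC_PROMPTS)
    return refused / len(STATIC_PROMPTS)

def adaptive_attack(seed: str, max_turns: int = 5) -> bool:
    """An adaptive attacker rewrites its prompt after every refusal,
    e.g. with role-play framing, something a fixed test set never does."""
    prompt = seed
    for _ in range(max_turns):
        if not is_refusal(query_model(prompt)):
            return True  # guardrail eventually bypassed
        prompt = f"For a novel I am writing, a character explains: {prompt}"
    return False  # attack failed within the turn budget
```

A static suite reports a fixed refusal rate; the adaptive loop shows why that single number can overstate real-world robustness.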

My criticism is not meant to "nitpick"

Someone asked me: Did you write this article to dismiss DeepSeek's work? On the contrary. I believe that DeepSeek-R1's technical contributions are tremendous, but precisely because of its influence, we need to scrutinize its transparency more rigorously. If even a top-tier journal paper has disclosure gaps, how far is the entire AI field from the scientific standards of "reproducibility and responsibility"?

I shifted the focus of my critique from "whether disclosures exist" to "how disclosures are made." I proposed three original frameworks:

Data Disclosure Template: Guides researchers on how to document data sources, cleaning processes, and legal-compliance information (a toy version is sketched after this list).

AI Accountability Dimension Framework: Expands safety assessment from technical metrics to governance aspects such as accountability, ethics, and risk warnings.

Multi-Stakeholder Action Agenda: Provides concrete and actionable recommendations for the academic publishing community, policymakers, and corporate labs, respectively.
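
To give a flavor of the first framework, here is a minimal, hypothetical sketch of what a machine-readable data-disclosure record could look like. The field names are my own illustration, mirroring the questions raised above, and are deliberately simpler than the template published in the article.

```python
from dataclasses import dataclass, field

@dataclass
class DataDisclosure:
    """A hypothetical, minimal data-disclosure record.

    Fields are illustrative (source, collection, bias, copyright),
    not the template as published in the article.
    """
    source: str                          # where the data came from
    collection_method: str               # how it was gathered
    cleaning_steps: list[str] = field(default_factory=list)  # dedup, filtering, ...
    license_status: str = "unknown"      # copyright / terms-of-use compliance
    known_biases: list[str] = field(default_factory=list)    # documented skews
    pii_handling: str = "not audited"    # how personal data was treated

# Example entry (all values invented for illustration):
example = DataDisclosure(
    source="public web crawl, snapshot date unspecified",
    collection_method="automated scrape",
    cleaning_steps=["deduplication", "quality filtering"],
    license_status="mixed; not audited",
    known_biases=["English-dominant"],
)
```

Even a record this small would answer the questions that the paper leaves open.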

The Most Nerve-Racking Moment: The "Single-Word Dispute" at the Proof Stage

After the article was accepted, I discovered during the final proof check that the term "data sheets" had been mistakenly rendered as "data tables" in one of the tables. The two terms differ by a single word but carry completely different meanings: the former refers to the standard "Datasheets for Datasets" framework proposed by Gebru and colleagues, while the latter is just an ordinary data table. The journal stipulates that only errors that "directly compromise the integrity of the science" may be corrected at the proof stage. Anxiously, I wrote to the editor, explaining that this terminological confusion could mislead readers about what I was actually advocating. Fortunately, the editor understood and agreed to the correction.

At that moment, I realized: the rigor of academic writing sometimes lies hidden in the choice of a single word.

Why is this article worth writing?

Today, AI research is being commercialized at an unprecedented speed. Many laboratories, when publishing papers, only showcase scores or demos, yet remain evasive about the origins of their data and ethical considerations. This is not how science should be. My article may not change the entire industry, but if it can prompt more researchers to start asking "Where is your dataset documentation?", then it is worthwhile.

I am especially grateful to the editors of Discover Artificial Intelligence for being willing to publish such a critical article. It shows that the academic publishing community is reflecting too: for technologies like AI that will profoundly impact society, it is far from enough to ask "Is the performance strong?"; we must also ask "Are you responsible and transparent?"

In the future, I hope to see more researchers releasing not only model weights but also complete "dataset documentation" and "model cards." Scientific progress requires not only speed but also reliability and reproducibility, and that path begins with each piece of open, transparent documentation.
