Behind the Paper

Why I Tried to Measure How AI Speaks, Not Just What It Says

Most AI evaluations focus on whether answers are correct. My research started from a different question: even when answers are factually similar, do AI systems speak in the same way? Differences in tone and framing led me to design a method to measure discursive variation across models.

When we talk about artificial intelligence, we usually focus on correctness. Is the answer right or wrong? Is it hallucinated or accurate? Is it biased or fair?

While working with generative AI in educational and journalistic contexts, I kept noticing something different. Even when answers were factually acceptable, they did not sound the same. The tone changed. The perspective changed. The way responsibility, suffering, conflict, or legitimacy were described also changed.

This raised a simple but uncomfortable question: if two AI systems answer the same question with the same facts, but with different tone and framing, are they really neutral in the same way?

That question became the starting point of my research on the discursive behavior of large language models.

From impression to measurement

At first this was only a qualitative impression. Some models sounded more empathetic. Others more technical. Others more journalistic. Others more normative. But impressions are not enough in research. I needed a method.

The challenge was to move from “this sounds different” to “this difference can be classified, compared, and replicated.”

Instead of evaluating truthfulness or bias labels, I focused on two discursive dimensions:

Tone. How the answer is expressed. Is it cold, descriptive, empathic, technical, balanced, assertive?

Framing. From which interpretive angle the issue is presented. Is it legal, historical, humanitarian, ethical, journalistic?

These are concepts that come from discourse analysis and communication studies, but they are rarely applied in a structured way to AI outputs. I built a coding grid that allows responses to be categorized along these two axes.
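
To make this concrete, here is a minimal sketch of how such a two-axis grid can be represented as a data structure. The category labels come from the tone and framing axes above; the structure and the helper function are illustrative, not the exact instrument used in the study.

```python
# A minimal sketch of a two-axis coding grid for AI responses.
# Category labels follow the tone and framing axes described above;
# names and structure are illustrative, not the study's exact instrument.

TONE_CATEGORIES = {"cold", "descriptive", "empathic", "technical", "balanced", "assertive"}
FRAMING_CATEGORIES = {"legal", "historical", "humanitarian", "ethical", "journalistic"}

def code_response(model: str, prompt_id: int, tone: str, framing: str) -> dict:
    """Record one coded response, rejecting labels outside the grid."""
    if tone not in TONE_CATEGORIES:
        raise ValueError(f"Unknown tone label: {tone}")
    if framing not in FRAMING_CATEGORIES:
        raise ValueError(f"Unknown framing label: {framing}")
    return {"model": model, "prompt_id": prompt_id, "tone": tone, "framing": framing}

# Example: coding one answer from a hypothetical model
row = code_response("model_A", 3, "empathic", "humanitarian")
```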

The goal was not to prove that models are “good” or “bad,” but to see whether their discursive profiles are systematically different under identical conditions.

The experiment design

I selected five widely used language models and asked them the same ten open-ended questions on geopolitical and humanitarian topics. The prompts were written in Italian and covered controversial and value-laden issues. Each model received exactly the same prompts.

Every answer was then coded using the tone and framing grid. The full coding table was published openly so that anyone can verify, reuse, or challenge the classifications.
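
To illustrate the comparative step, coded rows like the ones above can be tallied into per-model discursive profiles. The tallying logic below is a generic sketch of what such a coding table supports; the rows are invented, not taken from the published data.

```python
from collections import Counter

# Invented rows in the shape produced by code_response() in the earlier sketch.
coded = [
    {"model": "model_A", "prompt_id": 1, "tone": "empathic", "framing": "humanitarian"},
    {"model": "model_A", "prompt_id": 2, "tone": "empathic", "framing": "ethical"},
    {"model": "model_B", "prompt_id": 1, "tone": "descriptive", "framing": "journalistic"},
    {"model": "model_B", "prompt_id": 2, "tone": "technical", "framing": "legal"},
]

def discursive_profile(rows, model):
    """Count tone and framing labels for one model across all prompts."""
    tones = Counter(r["tone"] for r in rows if r["model"] == model)
    framings = Counter(r["framing"] for r in rows if r["model"] == model)
    return {"tone": dict(tones), "framing": dict(framings)}

for m in ("model_A", "model_B"):
    print(m, discursive_profile(coded, m))
```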

Methodological transparency was a central design choice. If we want to talk about AI neutrality, our own method must be inspectable.

What surprised me

It is not surprising that models differ. They are trained differently and aligned differently. What was more interesting was that the differences were structured and recurrent at the discursive level.

Some models consistently adopted a journalistic and descriptive stance. Others showed a stronger humanitarian or ethical framing. Some preferred legal and institutional reasoning. Others leaned toward empathic language.

Under identical prompts, discursive style was not random noise. It showed model-specific tendencies.

This does not mean that a model has an ideology in a human sense. It means that discursive positioning emerges from training data, alignment strategies, and safety tuning. Neutrality, in practice, is not a built-in property. It is an outcome that must be examined.

Why this matters beyond research

Many discussions about AI risk focus on hallucinations and factual errors. Those are important. But discursive style also shapes interpretation.

In journalism, tone influences how responsibility and legitimacy are perceived.
In education, framing influences how students understand conflicts and moral dilemmas.
In policy contexts, legal or humanitarian framing can shift how decisions are justified.

An answer that sounds neutral may still guide interpretation in subtle ways.

This suggests that evaluating AI systems should not stop at fact checking. We also need discursive checking.

From research method to classroom tool

One of the most rewarding developments after the study was translating the coding grid into a didactic tool.

I created structured evaluation sheets that students can use to classify AI answers by tone and framing. Instead of passively accepting responses, learners can ask:

What tone is the system using?
Which perspective is being emphasized?
Which dimensions are absent?
Would another framing change the interpretation?

This turns AI from an oracle into an object of critical analysis. It supports digital literacy and critical thinking. Students learn not only to use AI, but to read it.
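
Those four questions can also be packaged into a simple, reusable sheet. The sketch below renders a blank fill-in template for one AI answer; the layout is illustrative, not the published classroom material.

```python
# A sketch of a printable evaluation sheet built from the four questions above.
QUESTIONS = [
    "What tone is the system using?",
    "Which perspective is being emphasized?",
    "Which dimensions are absent?",
    "Would another framing change the interpretation?",
]

def evaluation_sheet(model: str, prompt: str) -> str:
    """Render a blank fill-in sheet for one AI answer."""
    lines = [f"Model: {model}", f"Prompt: {prompt}", ""]
    for i, question in enumerate(QUESTIONS, start=1):
        lines.append(f"{i}. {question}")
        lines.append("   Notes: ____________________")
    return "\n".join(lines)

print(evaluation_sheet("model_A", "Explain the causes of a current humanitarian crisis."))
```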

A reproducible framework

A key contribution of the study is not only the findings but the protocol. I proposed a reproducible framework for discursive auditing of AI systems. It includes prompt design, model selection, coding rules, transparency requirements, and comparative analysis steps.
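
One way to picture the protocol is as a declarative configuration that an auditing run follows. The fields below mirror the components just listed; the keys and values are an assumed sketch, not the study's published specification.

```python
# An assumed sketch of the auditing protocol as configuration; values are placeholders.
AUDIT_PROTOCOL = {
    "prompt_design": {
        "language": "it",          # the original prompts were written in Italian
        "n_prompts": 10,
        "topics": ["geopolitical", "humanitarian"],
        "style": "open-ended, identical wording for every model",
    },
    "model_selection": ["model_A", "model_B", "model_C", "model_D", "model_E"],
    "coding_rules": {
        "tone": ["cold", "descriptive", "empathic", "technical", "balanced", "assertive"],
        "framing": ["legal", "historical", "humanitarian", "ethical", "journalistic"],
    },
    "transparency": {"publish_prompts": True, "publish_coding_table": True},
    "analysis": ["per-model tallies", "cross-model comparison of tone and framing profiles"],
}
```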

The framework is intentionally lightweight. It can be adapted across languages, domains, and model families. Researchers, educators, and even newsrooms can reuse it.

All prompts, coding schemes, and aggregated data are publicly available. Reproducibility is not an afterthought. It is part of the method.

Limits and next steps

The study has limits. Each model was queried once per prompt, so the results do not capture the full stochastic variability of generation. Coding was performed by a single expert coder, which means the classifications reflect one interpretive perspective. Models also evolve over time, so discursive profiles may drift.

Future work should include multi-coder annotation, repeated sampling, and longitudinal tracking. But even with these limits, the study shows that discursive variation can be measured, not only perceived.
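
For the multi-coder step, agreement between annotators can be quantified with a standard statistic such as Cohen's kappa. The sketch below computes it from scratch for two coders' tone labels; the labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders' categorical labels on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Invented example: two coders labeling the tone of six answers
coder_1 = ["empathic", "technical", "empathic", "descriptive", "balanced", "technical"]
coder_2 = ["empathic", "technical", "descriptive", "descriptive", "balanced", "technical"]
print(round(cohens_kappa(coder_1, coder_2), 3))  # agreement above chance, below perfect
```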

The bigger picture

Generative AI systems are becoming participants in our discursive ecosystem. They help write, summarize, explain, and recommend. They are already shaping how issues are described and understood.

If language shapes perception, then the language of AI matters.

Measuring how AI speaks is not only a technical exercise. It is part of building accountable, transparent, and socially responsible AI systems.