Internet-mediated research has transformed the way we collect data. Over the past two decades, and especially since 2020, online methods have enabled researchers to continue their work when in‑person studies were impossible, while also opening doors to large, diverse, and hard‑to‑reach populations.
However, this convenience comes with its own set of challenges. Online data collection introduces risks to data integrity, participant privacy, and ethical research conduct that are less visible than in traditional face‑to‑face research. And as everyday internet users, we also leave behind digital traces that may end up in someone else’s dataset, sometimes without our awareness.
Therefore, today we will focus on both sides of the equation: how you as a researcher can protect the integrity of your own internet-based studies, and how to remain mindful of the digital footprints you leave behind.
In online environments, researchers lose many of the natural protections that come from in-person interaction. Data integrity risks often fall into the following categories:
- Non‑genuine participants: individuals misrepresenting their identity or lived experience
- Repeat responders: the same person completing a survey multiple times
- Misrepresentation: exaggeration or fabrication of details
- Lack of engagement: speeding, random clicking, not reading questions
- Bots: automated scripts completing surveys
- NEW: LLM‑generated responses: homogenized, inauthentic text produced by large language models (LLMs) such as ChatGPT, Claude.ai, Copilot and many others
As a researcher, you can adopt several tools and techniques that add an extra level of security. For instance, reCAPTCHA is a free service that uses risk analysis techniques to tell humans and bots apart. You can also run IP checks to spot duplicate submissions. Alternatively, you could gather your data through a crowdsourcing site, which has the additional benefit of applying its own strategies to mitigate the data integrity risks listed above. Examples of these sites include Prolific, CloudResearch, and Amazon Mechanical Turk (MTurk). All of these options help, but none is foolproof, especially as automated and AI‑generated responses become more sophisticated.
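To make the IP check mentioned above concrete, here is a minimal sketch in Python. It assumes your survey tool can export one record per submission; the field names `id` and `ip` and the sample records are hypothetical, not taken from any particular platform.

```python
from collections import defaultdict


def find_duplicate_ips(responses):
    """Group response IDs by IP address and keep only IPs seen more than once."""
    by_ip = defaultdict(list)
    for response in responses:
        by_ip[response["ip"]].append(response["id"])
    return {ip: ids for ip, ids in by_ip.items() if len(ids) > 1}


# Hypothetical export: the field names "id" and "ip" are assumptions.
responses = [
    {"id": "r1", "ip": "203.0.113.5"},
    {"id": "r2", "ip": "198.51.100.7"},
    {"id": "r3", "ip": "203.0.113.5"},
]

duplicates = find_duplicate_ips(responses)
print(duplicates)  # {'203.0.113.5': ['r1', 'r3']}
```

A shared IP address is a prompt for manual review rather than proof of fraud: households, university networks, and VPNs can all place several genuine participants behind one address. IP addresses can also count as personal data under laws such as GDPR, so collecting them should be covered by your ethics approval and consent materials.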
On the ethical side, internet‑based research raises additional questions:
- Is publicly accessible data truly “public”?
- What does informed consent look like on social platforms?
- Can individuals be re‑identified even after anonymization?
- Are vulnerable groups (e.g., minors) unknowingly included?
- How should researchers handle global participants when ethics approval is local?
Just because data is publicly accessible does not mean it can be processed for any purpose. Researchers must be prepared to justify their decisions, document their safeguards, and demonstrate respect for participants' expectations.
It's time to test yourself on your online data integrity knowledge. Read through the research designs below. Which one do you think presents the biggest data integrity red flag?
- A: Scraping 10,000 public tweets about a political event. No informed consent is obtained as the data is publicly available.
- B: Running a web‑based survey on chronic pain, recruiting through social media ads. The resulting data set shows unusually fast completion times and nearly identical open‑ended responses across dozens of participants.
- C: Conducting qualitative interviews via Zoom and verifying participants' identities by mailing compensation to a physical address.
Take a moment. Sip your coffee. Put on your “data integrity” hat.
[image description: the words “fake” and “fact” intertwined like in a crossword puzzle]
The correct answer is B! While all three scenarios raise important considerations, B is the clearest case of compromised data integrity. Fast completion times, patterned responses, and identical open-ended answers are classic indicators of bots, repeat responders, low‑effort participation, and possibly LLM‑generated text.
If you encounter this in your own research, you should be prepared to:
- Establish your data cleaning thresholds BEFORE obtaining the data
- Describe how you detected fraudulent or low-quality responses
- Justify your final sample
- Discuss limitations transparently
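The first three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thresholds (`MIN_SECONDS`, `MAX_SIMILARITY`), the field names (`id`, `seconds`, `text`), and the sample answers are all hypothetical, and character-level similarity from the standard library's `difflib` is just one simple way to spot near-identical open-ended responses.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative thresholds, fixed BEFORE data collection. Both values are assumptions.
MIN_SECONDS = 120      # completions faster than this are flagged as speeding
MAX_SIMILARITY = 0.9   # open-ended answers at least this similar are flagged


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide duplicates."""
    return " ".join(text.lower().split())


def flag_responses(responses, min_seconds=MIN_SECONDS, max_similarity=MAX_SIMILARITY):
    """Return {response_id: set of reasons} for every flagged response."""
    flags = {}
    for r in responses:
        if r["seconds"] < min_seconds:
            flags.setdefault(r["id"], set()).add("too_fast")
    # Pairwise comparison is quadratic in sample size; fine for a small study.
    for a, b in combinations(responses, 2):
        ratio = SequenceMatcher(None, normalize(a["text"]), normalize(b["text"])).ratio()
        if ratio >= max_similarity:
            flags.setdefault(a["id"], set()).add("near_duplicate_text")
            flags.setdefault(b["id"], set()).add("near_duplicate_text")
    return flags


# Hypothetical survey export.
responses = [
    {"id": "p1", "seconds": 540,
     "text": "The pain is worst in the morning and eases after I start moving around."},
    {"id": "p2", "seconds": 45,
     "text": "Chronic pain affects my daily life and makes it hard to work."},
    {"id": "p3", "seconds": 50,
     "text": "Chronic pain affects my daily life and makes it hard to work!"},
    {"id": "p4", "seconds": 610,
     "text": "I manage flare-ups with stretching, heat packs and pacing my activities."},
]

flags = flag_responses(responses)
retained = [r for r in responses if r["id"] not in flags]
print(f"{len(responses)} collected, {len(flags)} flagged, {len(retained)} retained")
```

Recording the reason for each flag makes it easier to describe your detection method and justify the final sample. Flags are best treated as candidates for review: very short answers such as "no" or "n/a" will legitimately match each other, and some genuine participants simply read quickly.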
These practices are essential for maintaining trust in online research. Now let us evaluate the other scenarios.
- A (Public social‑media data): Not automatically unethical, but not automatically permissible either. Researchers must consider the participants' reasonable expectations of privacy, GDPR and local data‑protection laws, the risk of re‑identification, the presence of minors or vulnerable users, and platform terms of service. A strong ethics justification, alongside consideration of appropriate ethics approval and permissions, would also be needed.
- C (Identity verification in online interviews): Methods like videoconferencing or mailing incentives can help deter fraudulent participation. They are not perfect and introduce privacy considerations, but they demonstrate an active attempt to maintain data integrity.
Now that we have discussed how to gather data with integrity and appropriate consideration of the ethical implications, we also want to draw your attention to the flip side of the coin: when the data gatherer becomes the data giver. Here are some quick and easy tips for reducing the likelihood that your own digital traces end up in someone’s dataset:
- Using privacy‑friendly browsers (e.g. Firefox, Brave)
- Rejecting cookies whenever possible
- Enabling Global Privacy Control
- Tightening privacy settings on platforms you use regularly
- Using a VPN on public Wi-Fi
- Turning off location services when not needed
- Remembering that Incognito Mode does not prevent tracking
- Covering your webcam when not in use
This isn’t about “having something to hide”; it’s about limiting how companies aggregate your data and infer your behaviours without your explicit consent.
Further Reading & Resources
- Springer Nature Editor training course: Ethics of Internet Research for external editors, also free for researchers
- Annette Markham & Elizabeth Buchanan. Ethical Decision-Making and Internet Research. Recommendations from the AoIR Ethics Working Committee (Version 2), p. 3.
- Matthew Zook et al. (2017). Ten simple rules for responsible big data research. Editorial. PLOS Computational Biology, March 30, 2017, p. 1.
- Leanne Townsend & Claire Wallace. Social Media Research: A Guide to Ethics. University of Aberdeen.