#SciData19 Writing Competition: Winning Entry #2

We are proud to publish the second of this year's four winning entries for the Better Science through Better Data writing competition. Congratulations to Jazlynn Tan!

Better Science through Better Data 2019

At ‘Better Science through Better Data’ (#scidata19), Springer Nature and the Wellcome Trust partner to bring together researchers to discuss innovative approaches to data sharing, open science, and reproducible research, together with demonstrations of exemplary projects and tools. If you are a researcher, this event will give you the chance to learn how research data skills can aid career progression, including how good practice in data sharing can enable you to publish stronger peer-reviewed publications. Tickets for the event have now sold out, but you can register for the live stream to watch our keynote talks as they happen from wherever you are in the world.

Keynote speakers

Shelley Stall, Senior Director, Data Leadership, American Geophysical Union (AGU)

Shelley Stall is the Senior Director for the American Geophysical Union’s Data Leadership Program. She works with AGU’s members, their organizations, and the broader research community to improve data and digital object practices, with the ultimate goal of elevating how research data is managed and valued. Better data management results in better science. Shelley’s diverse experience as a program and project manager, software architect, database architect, performance and optimization analyst, data product provider, and data integration architect for international communities, both non-profit and commercial, provides her with a core capability to guide the development of practical and sustainable data policies and practices ready for adoption and adaptation by the broad research community.

Shelley’s recent work includes the Enabling FAIR Data project, which engaged over 300 stakeholders in the Earth, space, and environmental sciences to make data open and FAIR. The project targeted the publishing and repository communities to change practices: rather than archiving data in the supplemental information of a paper, researchers deposit the data supporting their research into a trusted repository where it can be discovered, managed, and preserved.

Her talk is entitled: Your Digital Presence

Mikko Tolonen, Assistant Professor, Faculty of Arts, University of Helsinki

Mikko Tolonen is an assistant professor of Digital Humanities at the University of Helsinki. He is the PI of the Helsinki Computational History Group (COMHIS). In 2015-17 he also worked at the National Library of Finland on digitized newspapers as professor of research on digital resources. He is the chair of Digital Humanities in the Nordic Countries (DHN). His current main research focus is an integrated study of early modern public discourse and knowledge production that combines bibliographic metadata and full-text sources. In 2016, he was awarded an Open Science and Research Award by the Finnish Ministry of Education and Culture.

His talk is entitled: Integrating Open Science in the Humanities: the Case of Computational History

David Stillwell, Lecturer in Big Data Analytics and Quantitative Social Science, Judge Business School, University of Cambridge

David is Lecturer in Big Data Analytics and Quantitative Social Science at Cambridge University’s Judge Business School. David’s research uses big data to understand psychology. He has published papers using social media data from millions of consenting individuals to show that a computer can predict a user’s personality as accurately as their spouse can. This research has important public policy implications. How should consumers’ data be used to target them? Should regulators step in, and if so, how?

David has spoken at workshops at the EU Parliament and to UK government regulators. He has also published research using various big data sources, such as credit card data and textual data, to show that spending money on products that match one’s personality leads to greater life satisfaction, that people tend to date others whose personality is similar to their own, and that people who swear seem to be more honest.

His talk is entitled: Getting Big Data: Social scientists must strive to be autonomous from corporate charity

Tomas Knapen, Assistant Professor, Vrije Universiteit Amsterdam - Cognitive Psychology

Tomas is a cognitive neuroscientist whose research focuses on the role that sensory topographies (visual retinotopy, auditory tonotopy, and bodily somatotopy) play in the detailed organization of the human brain and cognition. For this work, Tomas uses state-of-the-art 7-Tesla MRI techniques. Early-career experiences in which he ‘failed to replicate’ previous findings impressed upon him the need to make research reproducible from top to bottom. Because of this, his lab uses only open methods and puts all of its data and methods online. Having invested in these methods, Tomas is convinced that, in the end, performing open science is not a burden; rather, it provides researchers with great opportunities for ground-breaking science.

His talk is entitled: How I learned to stop worrying and love Open Science

See the event programme. Meet the Programme Committee. Register for the live stream.

Question: What are the benefits and risks of unrestricted data use?


Jazlynn Tan - Imperial College London

As the wave of machine learning techniques sweeps across nearly every field of science, data availability and quality have found their way into many debates. Reusing data saves time and money. Sharing data enables us to learn from failed ventures while building upon successes. Furthermore, the advancement of machine learning unlocks the potential of rich databases to yield new findings when data are re-analysed. For example, deep learning applied to The Cancer Genome Atlas revealed new cancer-causing mutations, while the critical features of a class of semiconductors were predicted using the Novel Materials Discovery repository.

However, unrestricted data reuse has its pitfalls. Giving credit where it is due is not only ethical but also important for quality control. Even now, plagiarism rates are nearly 25% in some US states, and a similar issue will likely surface in data reuse. When data are reused, whether by the original author in a new publication or by other scholars, complications may arise: a lack of metadata, force-fitting existing data to new hypotheses, or making invalid assumptions in the absence of the bigger picture. van Raaji identified 18 problems associated with data reuse, some of which, such as contradicting previous conclusions drawn from the data without explanation, are seriously worrying. These actions compromise the quality of research and mislead others, even if unintentionally.

Laboratories have much to gain from data reuse, but only if it is done methodically and under regulation. Published data must meet standards for reusability while providing licensing details to safeguard intellectual property. The FAIR (findability, accessibility, interoperability, reusability) principles are good guidelines for sharing data. However, better structural support and incentives are needed to achieve this. With over 28 million public repositories of source code, GitHub encourages users to document code and authorship based on standard practices.
An equivalent host for data sharing could be established. While similar infrastructure exists (e.g. ToxFx for pharmacology), a cross-disciplinary, universally standardised database with well-documented data handling procedures is lacking.

As data grow beyond petabytes, traditional peer-review systems cannot keep up. A public reporting system is crucial to prevent the perpetuation of wrong data, so that scientists who reuse data can look out for red flags and report them. Others need to be alerted to mistakes and to suspicious or incomplete data. We also need a culture in which scientists who reuse data are responsive to clarification and debate should others detect potential misinterpretations.

Despite the sheer volume of existing data, diversity is severely lacking. Most existing data were generated by well-funded, developed countries and hence target their populations and problems. For example, genomic data on European populations are comprehensive, whereas data on ethnic minorities are sparse. Manrai et al. found that black Americans were more often misdiagnosed with hypertrophic cardiomyopathy than white Americans due to severe under-representation in genomic data. Reusing existing data is certainly convenient, but it remains our responsibility to gather new data to correct this disparity.

Data inform most of humankind’s decisions. For better decisions we need better science, and for that, we need better data.
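The essay's point about reusability standards and licensing details can be made concrete. The sketch below shows what a minimal machine-checkable metadata record for a shared dataset might look like; the field names and the required-field list are illustrative assumptions, not drawn from any formal metadata standard, though each field loosely maps onto one of the FAIR principles.

```python
# Hypothetical metadata record for a shared dataset. Field names are
# illustrative only; each maps loosely onto one FAIR principle.
record = {
    "identifier": "doi:10.xxxx/example",  # findable: persistent identifier
    "title": "Example genomic variant calls",
    "access_url": "https://repository.example.org/datasets/1234",  # accessible
    "format": "text/csv",                 # interoperable: open, common format
    "license": "CC-BY-4.0",               # reusable: explicit licensing terms
    "creators": ["J. Tan"],
    "description": "Variant calls from cohort X, pipeline Y v2.1",
}

# Fields a repository might insist on before accepting a deposit
# (an assumed policy, for illustration).
REQUIRED = {"identifier", "access_url", "format", "license", "creators"}

def missing_fields(rec):
    """Return the required metadata fields a record fails to provide."""
    present = {key for key, value in rec.items() if value}
    return sorted(REQUIRED - present)

print(missing_fields(record))             # empty list: record is complete
print(missing_fields({"title": "bare"}))  # lists every missing required field
```

A repository that rejects deposits with non-empty `missing_fields` output would enforce, mechanically, the kind of reusability baseline the essay argues for.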
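The "red flags" a data reuser should look out for can also be partially automated. The sketch below runs a few simple sanity checks (duplicate identifiers, missing values, implausible ranges) over a toy table; the record shape and the hard-coded rules are assumptions for illustration, since real checks depend on each dataset's documented schema.

```python
# Illustrative "red flag" checks a data reuser might run before analysis.
# The record shape ({id, age}) and the rules are examples only.
def red_flags(rows):
    """Return human-readable warnings for a list of record dicts."""
    warnings = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("id") in seen_ids:
            warnings.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row.get("id"))
        if row.get("age") is None:
            warnings.append(f"row {i}: missing age")
        elif not 0 <= row["age"] <= 120:
            warnings.append(f"row {i}: implausible age {row['age']}")
    return warnings

rows = [
    {"id": "a1", "age": 34},
    {"id": "a1", "age": 200},   # duplicate id and impossible age
    {"id": "a2", "age": None},  # missing value
]
for warning in red_flags(rows):
    print(warning)
```

A public reporting system of the kind the essay proposes would route warnings like these back to the original depositors, rather than leaving each reuser to rediscover the same problems.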

Don't forget to register for Better Science through Better Data on November 6th at the Wellcome Collection in London to learn about data sharing and open science.

Meet the other writing competition winners here.
