I nearly made it through two engineering degrees with few medical thoughts beyond “mitochondria, the powerhouse of the cell”. Thankfully, mentors in hospital labs let me help with impressive studies by implementing data processing and statistical models. I found myself publishing in optimization at national conferences while stumbling over allele frequencies and feeling intimidated by FASTQ files in an introductory course. I dreamed of synergies beyond my skills at the intersection of data science and personalized medicine. As LLMs skyrocketed past the information age, I found just the laboratory at this intersection. I admire the field-bridging contribution we were able to make.
Distributing intelligent computing and education through data-driven medicine
The promise of the information age, all knowledge for everyone, everywhere, has taken time to materialize. Access to education, work opportunities, and the latest research hasn’t spread as far or as quickly as the World Wide Web. Advancing on all of these fronts, the multidisciplinary team of researchers at the Stanford Deep Data Research Center built a platform that containerizes various biomedical and data science tools. More than a set of hardware-agnostic, portable research tools, Stanford Data Ocean (SDO) [1] offers modules introducing research methods. These include datasets spanning lipidomics, metabolites, dozens of genomes, and wearables’ data from COVID early-detection research. SDO also includes a customized AI tutor that leverages Large Language Models (LLMs) with guardrails to support student progress through the material, and a novel AI data visualization feature that makes it easier to engage with the data and tweak analyses in either Python or R. Unlike existing intelligent visualization AI, SDO’s automatic data analysis tool is grammar-agnostic. Varied data types across medical research modalities are integrated seamlessly. Moreover, the processing code for generating figures is available and thus editable for additional experiments and illustrations. The result is revolutionary educational potential, expedited reproducibility, and a lowered barrier to precision medicine research across fields. Researchers from both biology and computing departments can seek correlations between genetic markers, wearables’ data, and metabolites. We’ve built an early compass for navigating data and uniting work throughout the broad field of omics, the study of the molecules involved in regular cell function, from proteins to mRNA to enzymes. Simultaneous data availability across omics and walk-through processing in SDO’s live notebooks invite hands-on education in multiple subjects rarely available to the general public.
The SDO platform also follows HIPAA-compliant best practices, giving users access to real health data while keeping patient privacy safe and secure. The serverless platform is deployed through third-party models with strict data privacy requirements. Hosted through scalable applications, learning modules are flexible and centrally managed by SDO. Containers provide low-maintenance, disposable environments for the learning modules, and virtual machines provide environments that are continuously in operation. Moreover, all data on the platform follows the FAIR principles: the datasets are findable, accessible, interoperable, and reusable. And these datasets and tools have provided an impressive education to thousands.
Impactful innovation
My personal favorite is the work’s broad accessibility. There was no programming in my grade school, and here we’ve built a system exposing students to guided, code-based projects in transcriptomics, cloud computing, data science, and beyond. SDO offers a starter certificate for completing its modules and related quizzes. Certificates for comparable programs at other R1 universities are often priced in the thousands, beyond the reach of most of the globe. SDO, priced similarly due to computing costs and dedicated research hours, has offered over 3500 scholarships to learners in 92 countries. The modules we designed, based on our research and coursework in the highly technical and fortunate Bay Area, crossed the world. The course was accompanied by office hours held by members of the development team, and at times whole cohorts from other institutions participated. Regular workshops and seminars from SDO developers also accompanied the program. Over 80% of all students recommend the courses, and the average completion rate is more than quadruple that of general asynchronous online courses. More than 20% of students also reported that the program helped them secure positions in STEM fields.
While developing the AI tutor for the platform, we explored LLMs’ capabilities as trustworthy, manageable educational tools. Ten popular LLMs were evaluated on nearly 300 multiple-choice questions written by the SDO team and over 2000 student-sourced questions. Ideally, for the integrity of the application, the AI tutor shouldn’t provide erroneous information and shouldn’t stray far from the course material. We designed guardrails for this, and SDO tested the AI tutor’s capacity to identify course content from the student questions. The model’s guardrails had perfect precision, never straying from course-relevant content. The F1 score, which balances precision and recall, was 96.6%. Aligning our metrics with impact, nearly three quarters of the students reported that the AI tutor enhanced their understanding of course material.
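For readers less familiar with these metrics, here is a minimal Python sketch of how precision, recall, and F1 relate. It assumes the 96.6% figure is the standard F1 (the harmonic mean of precision and recall). The function names and the confusion counts are hypothetical, chosen only for illustration, and are not taken from the SDO evaluation. Under that assumption, perfect precision together with an F1 of 96.6% implies a recall of roughly 93.4%.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def recall_from_f1(f1, precision):
    """Invert F1 = 2PR / (P + R) to solve for recall R."""
    return f1 * precision / (2 * precision - f1)


# Figures reported in the post: precision = 1.0, F1 = 0.966.
implied_recall = recall_from_f1(0.966, 1.0)
print(f"Implied recall: {implied_recall:.3f}")

# Hypothetical counts (NOT from the SDO study) that reproduce those figures:
# 934 course-relevant questions recognized, 0 off-topic questions let through,
# 66 course-relevant questions missed.
p, r, f1 = precision_recall_f1(tp=934, fp=0, fn=66)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In other words, a guardrail with perfect precision can still miss some relevant questions, and the F1 score is what captures that shortfall.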
For its social goals and innovative service, SDO’s team received several prestigious awards, including the Anthem Award, the Don Norman Design Award, and Stanford’s Walter J. Gores Award. We hope the research tools built into the platform will empower future discoveries.
A new standard for research collaboration and accessibility
More than an education tool, SDO includes research modules reproducing the methods and findings of cutting-edge work. One such example is the NightSignal [2] algorithm, which detects SARS-CoV-2 in presymptomatic and asymptomatic patients based on wearable data collected by the Stanford MyPHD platform [3]. These research modules promote interaction with the frontier of research, enhancing SDO’s educational experience. Furthermore, an interactive model and the relevant data to reproduce recent findings give users the opportunity to contribute to future breakthroughs, benefiting researchers and the entire field. Similarly, the data visualization helper can do much more than support scientific communication and statistics education. Medical researchers can combine data from a variety of file formats, inviting new insights without manual processing of omics-specific file types. Academic and industry researchers can easily replicate results and insights from published datasets. The added transparency bolsters existing findings and facilitates future work. The platform’s grammar-agnostic data analyses also lower the barrier to using and building on cutting-edge research beyond one’s field.
Stanford Data Ocean is an education platform. It is a novel, secure, scalable data processing structure that can empower researchers. It is a testbed for the interactions between LLMs, multi-omics research, and the future of experience-based education in STEM fields. The platform continues, now with over 4000 cumulative users, and we are developing new courses around Computational Cancer Biology, Fundamentals of Data-Driven Precision Medicine for Diabetes, and How to Become a Bioinformatician using LLMs. I, for one, am excited to see its impact and achievements diffuse through industry and academia, empowering the collaboration and multi-omics analyses that will deliver on the ambitious promises of precision medicine.
Also, if anyone is interested, the graphs in this blog’s poster image are SDO AI-generated UMAP projections of gene expression across different cell types. Learn more in our paper [1].
References