Behind the Paper

T3 Talk2Text - A model for near real-time voice transcription in virtual group meetings

Group projects thrive on communication, but how can students revisit discussions effortlessly? We present T3 Talk2Text, an open-source tool for near real-time meeting transcription, enabling reflection and collaboration analysis. Discover the tech behind it!

The Challenge: Capturing Group Discussions for Reflection

Group projects are a cornerstone of modern education, but remote collaboration introduces challenges like unequal participation, miscommunication, and the difficulty of recalling past discussions. While video conferencing tools facilitate real-time interaction, they lack built-in support for documenting conversations. Manual note-taking is cumbersome, and post-meeting transcriptions from recordings are time-consuming.

We asked: How can we provide students and educators with an effortless way to capture and reflect on group discussions in near real-time?

Introducing T3 Talk2Text

Our solution, T3 Talk2Text, is an open-source web application that integrates:

  • WebRTC-based video conferencing (peer-to-peer, no expensive licenses)
  • Automatic Speech Recognition (ASR) with near real-time transcription via OpenAI’s Whisper model
  • On-demand summaries (using Llama3 for AI-generated insights; see the sketch below)

Unlike commercial tools (e.g., Microsoft Teams), T3 prioritizes privacy (self-hosted), multilingual support, and accessibility (works on any device with a browser).
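To give a flavor of the summarization step, here is a minimal sketch of how a finished transcript could be condensed with a locally hosted Llama3 model. The paper does not tie T3 to a particular serving stack, so the example assumes an Ollama-style local deployment via its Python client; the model tag and the prompt wording are placeholders.

```python
# Minimal sketch (not T3's actual code): condensing a transcript into key
# points with a locally hosted Llama3 model. Assumes the `ollama` Python
# client talking to a local Ollama server that has pulled the "llama3" model.
import ollama

def summarize_transcript(transcript: str) -> str:
    """Ask a local Llama3 instance for a short, bullet-style summary."""
    response = ollama.chat(
        model="llama3",
        messages=[
            {"role": "system",
             "content": "Summarize this meeting transcript as concise bullet points."},
            {"role": "user", "content": transcript},
        ],
    )
    return response["message"]["content"]
```

Because summaries are generated on demand, a call like this only runs when a group explicitly requests one.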


Key Innovations

  1. Voice Activity Detection (VAD) Pipeline

    • Filters background noise and segments speech for accurate ASR input.

    • Achieves a Word Error Rate (WER) of 8.04% in German, comparable to human transcribers.

  2. Dynamic Transcript Formats
    Users can download transcripts as:

    • PDFs (messenger-style, with speaker alignment)

    • CSVs (structured for analysis)

    • AI summaries (condensed key points)

  3. Scalable Architecture

    • Lightweight SQLite storage for transcripts (see the sketch after this list).

    • Peer-to-peer media streaming reduces server load.
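To make the storage and export side concrete, the sketch below keeps transcript segments in a small SQLite table and writes one meeting's rows out as CSV. The table layout and column names are illustrative assumptions, not the schema T3 actually uses.

```python
# Illustrative sketch only: a possible SQLite layout for transcript segments
# plus a CSV export, using just the Python standard library. The schema is an
# assumption, not T3's actual one.
import csv
import sqlite3

def init_db(path: str = "transcripts.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segments (
               meeting_id TEXT,
               speaker    TEXT,
               started_at REAL,   -- seconds since the meeting started
               text       TEXT
           )"""
    )
    return conn

def add_segment(conn, meeting_id, speaker, started_at, text):
    """Store one transcribed utterance as it arrives."""
    conn.execute("INSERT INTO segments VALUES (?, ?, ?, ?)",
                 (meeting_id, speaker, started_at, text))
    conn.commit()

def export_csv(conn, meeting_id, out_path):
    """Dump one meeting's segments to a CSV file for later analysis."""
    rows = conn.execute(
        "SELECT speaker, started_at, text FROM segments "
        "WHERE meeting_id = ? ORDER BY started_at",
        (meeting_id,))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["speaker", "started_at", "text"])
        writer.writerows(rows)
```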

Behind the Scenes: Overcoming Technical Hurdles

Challenge 1: Real-Time Processing
WebRTC’s low-latency streams were ideal for video chat but required careful buffering to feed Whisper ASR without delays. Our VAD component ensured only speech segments were processed, optimizing resource use.
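As a rough sketch of that hand-off, the snippet below segments buffered audio with a voice activity detector and only forwards the speech frames to Whisper. It uses the open-source webrtcvad and openai-whisper packages as stand-ins, with an assumed 16 kHz, 16-bit mono PCM buffer, a "small" Whisper model, and German as the target language; T3's actual VAD component and buffering strategy may differ.

```python
# Rough sketch of the VAD -> ASR hand-off, using webrtcvad and openai-whisper
# as stand-ins for T3's components. Expects 16 kHz, 16-bit mono PCM audio.
import numpy as np
import webrtcvad
import whisper

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each

vad = webrtcvad.Vad(2)                             # 0 = permissive ... 3 = aggressive
model = whisper.load_model("small")                # model size is an assumption

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames the VAD classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame

def transcribe_buffer(pcm: bytes, language: str = "de") -> str:
    """Run Whisper on the speech-only portion of a buffered audio chunk."""
    speech = b"".join(speech_frames(pcm))
    if not speech:
        return ""                                  # nothing but silence or noise
    audio = np.frombuffer(speech, np.int16).astype(np.float32) / 32768.0
    return model.transcribe(audio, language=language)["text"]
```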

Challenge 2: Multilingual Support
Whisper’s multilingual capabilities let T3 adapt to diverse classrooms, though future work will explore fine-tuning for non-native accents.
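For multilingual rooms, Whisper can either be pinned to a language or left to detect it on its own, and its result includes the detected language code. A brief illustration, reusing the model loaded in the sketch above (the file name is a placeholder):

```python
# Let Whisper auto-detect the language instead of pinning it to German.
result = model.transcribe("meeting_chunk.wav")   # no `language=` argument
print(result["language"], result["text"])        # e.g. "de", "en", ...
```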

Challenge 3: Privacy-First Design
All data stays on institutional servers, and temporary audio files are deleted post-transcription, which is critical for GDPR compliance.
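One straightforward way to honor the delete-after-transcription rule is to keep buffered audio in a temporary file and remove it as soon as the transcript is stored. A minimal sketch, with the transcription step passed in as a callable (the helper names are hypothetical):

```python
# Minimal sketch: write buffered audio to a temp file, transcribe it, and
# guarantee the file is removed afterwards. Helper names are hypothetical.
import os
import tempfile

def transcribe_and_discard(wav_bytes: bytes, transcribe) -> str:
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    try:
        tmp.write(wav_bytes)
        tmp.close()
        return transcribe(tmp.name)    # e.g. Whisper run on the temp path
    finally:
        os.remove(tmp.name)            # the audio never outlives the request
```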

Impact and Future Directions

Initial tests with student groups showed that T3 integrated seamlessly into discussions without disrupting collaboration. Educators highlighted its potential for:

  • Identifying participation gaps (via speaker-labeled transcripts).

  • Conflict resolution (revisiting past dialogue).

  • Research (analyzing communication patterns across courses).

Next steps include:

  • Deploying Whisper Large Turbo for faster, even more accurate transcriptions.

  • Longitudinal studies in university courses to measure learning outcomes.

Try It Out!

T3 is open-source and available for institutions to adapt. We welcome collaborations to explore its use in classrooms and beyond.

Read the full paper: The Paper
Code repository: GitHub
Contact: Thomas.Kasakowskij@fernuni-hagen.de