Serafin Sanchez

Audio Transcriber with Python, Streamlit, and AssemblyAI

A web app for fast, accurate multi-language audio transcription and translation, featuring speaker diarization, custom speaker naming, and instant subtitle export.

Python
Streamlit
AssemblyAI
Google Cloud Translate
AI

Project Overview

I built this transcription tool because I was tired of paying monthly subscriptions for basic audio transcription services. It's a simple tool that I first used to transcribe NotebookLM audio overviews, but it also works with other recordings such as interviews.

This tool provides straightforward audio-to-text conversion with speaker identification and custom labeling. It supports multiple audio formats and languages, includes translation capabilities, and can export results in various formats including subtitle files for video projects.

Key Features

🎙️ Audio Upload & Transcription

  • Upload audio files in multiple formats (mp3, wav, ogg)
  • Instant transcription with high accuracy
  • Support for 20+ languages with selectable dropdown interface
  • Secure temporary file handling with automatic cleanup

👥 Speaker Diarization

  • Automatically detects and labels different speakers in audio
  • Custom speaker naming for personalized transcripts
  • Clear speaker separation in final transcripts
  • Handles multi-speaker conversations and interviews
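
As a rough illustration of custom speaker naming, the sketch below relabels diarized utterances before rendering the transcript. It assumes each utterance carries a speaker label and its text, similar to what a diarization service returns; the `Utterance` type, the `render_transcript` function, and the sample names are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical container for one diarized utterance."""
    speaker: str  # label assigned by diarization, e.g. "A"
    text: str


def render_transcript(utterances, speaker_names=None):
    """Return a transcript with one 'Name: text' line per utterance.

    Speakers without a custom name fall back to 'Speaker <label>'.
    """
    speaker_names = speaker_names or {}
    lines = []
    for utt in utterances:
        name = speaker_names.get(utt.speaker) or f"Speaker {utt.speaker}"
        lines.append(f"{name}: {utt.text}")
    return "\n".join(lines)
```

Calling `render_transcript(utterances, {"A": "Host"})` renames speaker A and leaves any unnamed speaker with its default label.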

🌐 Multi-Language Support

  • Transcription available in 20+ languages
  • User-friendly language selection dropdown
  • Instant translation to any Google Translate-supported language
  • Side-by-side display of original and translated text

📄 Multiple Export Formats

  • Download transcripts in TXT format
  • Export subtitles in SRT and VTT formats
  • Translated versions available for all formats
  • Instant download with proper file naming
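
To make the subtitle export concrete, here is a minimal sketch of building SRT and VTT text from timed segments. It assumes the segments are available as `(start_ms, end_ms, text)` tuples with times in milliseconds; the function names are hypothetical, and the project may generate its subtitles differently.

```python
def format_srt_time(ms):
    """Convert milliseconds to the SRT timestamp form HH:MM:SS,mmm."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(segments):
    """Build SRT text: numbered cues separated by a blank line."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start_ms)} --> {format_srt_time(end_ms)}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)


def build_vtt(segments):
    """Build WebVTT text: a WEBVTT header, then cues with '.' before millis."""
    cues = []
    for start_ms, end_ms, text in segments:
        start = format_srt_time(start_ms).replace(",", ".")
        end = format_srt_time(end_ms).replace(",", ".")
        cues.append(f"{start} --> {end}\n{text}\n")
    return "WEBVTT\n\n" + "\n".join(cues)
```

The two formats differ mainly in the header line, the cue numbering, and the decimal separator in timestamps, so one segment list can feed both.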

🔄 Session Management

  • Clean session state management
  • Start new sessions seamlessly
  • View both sample and full transcripts
  • No login required for frictionless experience

Architecture and Backend Design

Streamlit Web Framework

  • Streamlit for rapid, interactive web interface development
  • Python Backend leveraging AssemblyAI and Google Cloud Translate APIs
  • Session State Management for maintaining user data and UI state
  • Modular Functions for transcription, translation, and file generation

API Integration

  • AssemblyAI API for high-quality multi-language transcription
  • Google Cloud Translate API for comprehensive translation support
  • Secure API Key Management using python-dotenv
  • Rate Limiting and Error Handling for robust API interactions
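
A minimal sketch of the key loading and error handling described above, using only the standard library. It assumes that python-dotenv's `load_dotenv()` has already copied values from a `.env` file into the environment; the helper names and the retry policy (exponential backoff on any exception) are illustrative assumptions, not the project's actual code.

```python
import os
import time


def load_api_key(name):
    """Read an API key from the environment and fail early if it is missing.

    Assumption: load_dotenv() from python-dotenv ran earlier in the app.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff when it raises.

    The last failure is re-raised so the UI can show an error message.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Wrapping each API request in `call_with_retries` keeps transient failures away from the user while still surfacing persistent ones. A real app would likely catch only the specific exceptions its API clients raise.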

Data Processing

  • No Database Required: All data processed in-memory for privacy
  • Temporary File Handling: Secure upload and processing of audio files
  • Dynamic File Generation: On-the-fly creation of downloadable files
  • Multi-Format Support: Handles various audio input and output formats
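
The temporary file handling can be sketched as follows. This is an assumption about the general pattern, not the project's code: the uploaded bytes (for example from a Streamlit upload's `getvalue()`) are written to a temp file, a caller-supplied `handler` receives the path, and the file is deleted even if processing fails. The `process_upload` name is hypothetical.

```python
import os
import tempfile


def process_upload(data, suffix, handler):
    """Write uploaded bytes to a temp file, run handler(path), then delete it."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        tmp.close()  # close first so other code can open the path
        return handler(tmp.name)
    finally:
        if not tmp.closed:
            tmp.close()
        if os.path.exists(tmp.name):
            os.remove(tmp.name)  # cleanup runs on success and on error
```

Keeping the cleanup in a `finally` block means a failed transcription request does not leave audio files behind on disk.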

Security Measures

Data Privacy & Protection

  • No Persistent Storage: All user data and audio files are processed transiently
  • Secure API Key Management: All API keys loaded from environment variables
  • Temporary File Cleanup: Automatic deletion of uploaded files after processing
  • No User Accounts: Privacy-first approach with no authentication required

Compliance & Best Practices

  • Local Processing Preference: Transcription metadata processed locally where possible
  • Secure File Handling: Industry-standard temporary file management

Technical Challenges Overcome

Multi-Language Integration

  • Seamlessly combined AssemblyAI's multi-language transcription with Google Translate
  • Created unified language selection interface supporting both services
  • Handled language code mapping between different API systems
  • Optimized API calls to minimize latency and costs
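
One way to handle the language code mapping is a single table keyed by the name shown in the dropdown. The sketch below is illustrative only: the `LANGUAGES` table, its three entries, and the `codes_for` helper are assumptions, and the codes should be checked against each provider's documentation before use.

```python
# Illustrative mapping only; verify every code against the provider docs.
LANGUAGES = {
    "English": {"transcribe": "en", "translate": "en"},
    "Spanish": {"transcribe": "es", "translate": "es"},
    "Chinese": {"transcribe": "zh", "translate": "zh-CN"},
}


def codes_for(language_name):
    """Return (transcription_code, translation_code) for a dropdown label."""
    try:
        entry = LANGUAGES[language_name]
    except KeyError:
        raise ValueError(f"Unsupported language: {language_name}") from None
    return entry["transcribe"], entry["translate"]
```

Driving the dropdown from `LANGUAGES.keys()` keeps the interface and both API calls in sync from one source.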

Speaker Diarization Implementation

  • Robust speaker detection and labeling system
  • Custom speaker naming with persistent state management
  • Clear visual separation of speakers in transcript output
  • Handled edge cases with single-speaker or unclear speaker audio

Efficient State Management

  • Leveraged Streamlit's session state for smooth user experience
  • Maintained conversation state across different app sections
  • Implemented clean session reset functionality
  • Optimized memory usage for large audio files
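
A session reset of this kind might look like the sketch below. The functions take any dict-like object, so `st.session_state` could be passed in the app while a plain dict works for testing; the key names in `DEFAULTS` are hypothetical.

```python
import copy

# Hypothetical keys the app might keep between reruns.
DEFAULTS = {"transcript": None, "translation": None, "speaker_names": {}}


def init_state(state):
    """Add any missing keys without overwriting existing values."""
    for key, value in DEFAULTS.items():
        if key not in state:
            state[key] = copy.deepcopy(value)


def reset_state(state):
    """Clear everything and restore defaults to start a new session."""
    for key in list(state.keys()):
        del state[key]
    init_state(state)
```

The deep copy matters for mutable defaults such as the speaker name dictionary, so that edits in one session do not leak into the defaults used for the next one.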

Dynamic File Generation

  • Created downloadable files on-the-fly without server storage
  • Generated multiple format types (TXT, SRT, VTT) from single transcription
  • Implemented proper file naming conventions with timestamps
  • Handled both original and translated content generation
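
The timestamped file naming could be implemented along these lines. This is a sketch under assumptions: the `build_filename` helper, the name pattern, and the optional language suffix for translated versions are all hypothetical.

```python
import re
from datetime import datetime


def build_filename(source_name, extension, language=None, now=None):
    """Return a download name like 'interview_es_20240131_093000.srt'."""
    now = now or datetime.now()
    stem = source_name.rsplit(".", 1)[0]  # drop the audio extension
    # Replace characters that are awkward in file names.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_") or "transcript"
    parts = [stem]
    if language:
        parts.append(language)  # marks translated versions
    parts.append(now.strftime("%Y%m%d_%H%M%S"))
    return "_".join(parts) + "." + extension.lstrip(".")
```

In a Streamlit app, the generated text can be encoded with `text.encode("utf-8")` and handed to `st.download_button` through its `data` and `file_name` parameters, so nothing needs to be written to server storage.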

UI/UX Details

Streamlit Modern Interface

  • Clean, intuitive workflow with clear step-by-step process
  • Responsive design working across desktop and mobile devices
  • Real-time processing indicators and progress feedback
  • Large, readable text areas for transcript review

User Experience Design

  • Drag-and-drop file upload with visual feedback
  • Clear separation between original and translated content
  • Prominent download buttons for all output formats
  • Spinner indicators for processing states

Accessibility Features

  • High contrast text for improved readability
  • Keyboard navigation support
  • Clear error messages and user guidance
  • Mobile-responsive design for on-the-go use

Impact and Problem Solved

Global Accessibility

  • Makes professional transcription accessible to users worldwide
  • Removes language barriers for international content creators
  • Enables accurate transcription without expensive software or subscriptions

Professional Applications

  • Researchers: Convert interviews and focus groups into analyzable text
  • Journalists: Transcribe interviews and press conferences with speaker identification
  • Educators: Create accessible content from lectures and presentations
  • Global Teams: Bridge language gaps in international communications

Technical Innovation

  • No-Code Solution: Technical transcription made accessible to non-technical users
  • Privacy-First Design: No user data storage or account requirements
  • Multi-Format Output: Comprehensive export options for various use cases
  • Real-Time Processing: Fast transcription and translation workflows

Key Results:

  • Fast, accurate multi-language transcription without technical setup
  • Speaker diarization with custom naming for professional use cases
  • Instant translation expanding content accessibility globally
  • Multiple export formats supporting diverse professional workflows
