Audio Transcriber with Python, Streamlit, and AssemblyAI

Project Overview

I built this transcription tool because I was tired of paying monthly subscriptions for basic audio transcription services. It's a simple tool that I first used to transcribe audio from NotebookLM audio overviews, but it also works with interview recordings and other audio.

This tool provides straightforward audio-to-text conversion with speaker identification and custom labeling. It supports multiple audio formats and languages, includes translation capabilities, and can export results in various formats including subtitle files for video projects.

Key Features

🎙️ Audio Upload & Transcription

Upload audio files in multiple formats (mp3, wav, ogg)
Instant transcription with high accuracy
Support for 20+ languages, selectable from a dropdown
Secure temporary file handling with automatic cleanup
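
The temporary file handling with automatic cleanup can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the app's actual code; the helper name temporary_audio_file is hypothetical, and it assumes the uploaded audio is available as raw bytes.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_audio_file(data: bytes, suffix: str = ".mp3"):
    """Write uploaded bytes to a temp file and delete it afterwards.

    Hypothetical helper: the file exists only inside the `with` block.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Close the handle before yielding so other code can open the path.
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so no upload is left behind.
        if os.path.exists(path):
            os.remove(path)
```

Inside the with block the path can be handed to whatever performs the transcription; once the block exits, the file is gone even if transcription raised an exception.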

👥 Speaker Diarization

Automatically detects and labels different speakers in audio
Custom speaker naming for personalized transcripts
Clear speaker separation in final transcripts
Handles multi-speaker conversations and interviews
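
Custom speaker naming comes down to mapping the detected speaker labels to user-chosen names when the transcript is rendered. The sketch below assumes the diarized result has already been reduced to (speaker_label, text) pairs; the function name and the "Speaker X" fallback are illustrative choices, not the app's confirmed behavior.

```python
def format_transcript(utterances, speaker_names=None):
    """Render (speaker_label, text) pairs with optional custom names.

    Labels without a custom name fall back to "Speaker <label>".
    """
    names = speaker_names or {}
    lines = []
    for speaker, text in utterances:
        label = names.get(speaker, f"Speaker {speaker}")
        lines.append(f"{label}: {text}")
    # A blank line between turns keeps speakers visually separated.
    return "\n\n".join(lines)
```

For example, format_transcript([("A", "Hi"), ("B", "Hello")], {"A": "Host"}) labels the first turn "Host" and leaves the second as "Speaker B".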

🌍 Multi-Language Support

Transcription available in 20+ languages
User-friendly language selection dropdown
Instant translation to any Google Translate-supported language
Side-by-side display of original and translated text

📄 Multiple Export Formats

Download transcripts in TXT format
Export subtitles in SRT and VTT formats
Translated versions available for all formats
Instant download with proper file naming
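
SRT and VTT differ mainly in the timestamp separator (comma versus period), the numbered cues in SRT, and the WEBVTT header line. A minimal sketch of generating both from one list of timed segments, assuming each segment is a (start_ms, end_ms, text) tuple; the function names are hypothetical:

```python
def format_timestamp(ms: int, separator: str) -> str:
    """Format milliseconds as HH:MM:SS<separator>mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def to_srt(segments) -> str:
    """Build SRT text: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments) -> str:
    """Build WebVTT text: header line, period before the milliseconds."""
    cues = []
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{timing}\n{text}")
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Because both writers take the same segment list, a translated subtitle file only needs the text of each segment swapped out while the timings stay unchanged.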

🔄 Session Management

Clean session state management
Start new sessions seamlessly
View both sample and full transcripts
No login required for frictionless experience

Architecture and Backend Design

Streamlit Web Framework

Streamlit for rapid, interactive web interface development
Python Backend leveraging AssemblyAI and Google Cloud Translate APIs
Session State Management for maintaining user data and UI state
Modular Functions for transcription, translation, and file generation

API Integration

AssemblyAI API for high-quality multi-language transcription
Google Cloud Translate API for comprehensive translation support
Secure API Key Management using python-dotenv
Rate Limiting and Error Handling for robust API interactions
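
As a rough sketch of the key management and error handling described above: keys are read from environment variables (python-dotenv's load_dotenv() would populate them from a .env file first), and API calls are wrapped in a retry with exponential backoff. The helper names, the retry policy, and any variable name such as ASSEMBLYAI_API_KEY are assumptions for illustration, not details taken from the app.

```python
import os
import time


def require_env(name: str) -> str:
    """Return an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def call_with_retry(func, attempts: int = 3, base_delay: float = 1.0,
                    sleep=time.sleep):
    """Call func(), retrying with exponential backoff on any exception.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The last failure is re-raised so the UI can show an error message.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A real implementation would likely catch only the API client's transient errors rather than every exception; the broad except here just keeps the sketch independent of any particular SDK.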

Data Processing

No Database Required: Transcripts are held in memory for the session and never written to a database
Temporary File Handling: Secure upload and processing of audio files
Dynamic File Generation: On-the-fly creation of downloadable files
Multi-Format Support: Handles various audio input and output formats

Security Measures

Data Privacy & Protection

No Persistent Storage: All user data and audio files are processed transiently
Secure API Key Management: All API keys loaded from environment variables
Temporary File Cleanup: Automatic deletion of uploaded files after processing
No User Accounts: Privacy-first approach with no authentication required

Compliance & Best Practices

Local Processing Preference: Transcription metadata processed locally where possible
Secure File Handling: Industry-standard temporary file management

Technical Challenges Overcome

Multi-Language Integration

Seamlessly combined AssemblyAI's multi-language transcription with Google Translate
Created unified language selection interface supporting both services
Handled language code mapping between different API systems
Optimized API calls to minimize latency and costs
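
Language code mapping between two services usually means normalizing region variants and keeping a small override table for codes that differ. The sketch below is illustrative only: the override entry is an assumed example, and the real codes should be checked against each service's supported-language list.

```python
# Illustrative override table (assumption): codes whose spelling differs
# between the transcription service and the translation service.
OVERRIDES = {"zh": "zh-CN"}


def to_translate_code(transcription_code: str) -> str:
    """Map a transcription language code to a translation language code.

    Region variants such as "en_us" collapse to their base language
    unless the override table says otherwise.
    """
    normalized = transcription_code.strip().lower().replace("-", "_")
    if normalized in OVERRIDES:
        return OVERRIDES[normalized]
    base = normalized.split("_", 1)[0]
    return OVERRIDES.get(base, base)
```

Keeping the mapping in one function means the language dropdown can store a single code per language and convert it only at the point where the translation call is made.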

Speaker Diarization Implementation

Robust speaker detection and labeling system
Custom speaker naming with persistent state management
Clear visual separation of speakers in transcript output
Handled edge cases with single-speaker or unclear speaker audio

Efficient State Management

Leveraged Streamlit's session state for smooth user experience
Maintained conversation state across different app sections
Implemented clean session reset functionality
Optimized memory usage for large audio files
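
A clean session reset can be as simple as removing the app's own keys from the session state. The sketch below works on any dict-like object, so it can be tried without Streamlit; in the app, st.session_state would be passed in. The key names are hypothetical.

```python
from collections.abc import MutableMapping

# Hypothetical keys the app might keep in session state.
APP_KEYS = ("transcript", "translation", "speaker_names")


def reset_session(state: MutableMapping, keys=APP_KEYS) -> None:
    """Remove the app's keys from a dict-like session state.

    Keys that are absent are skipped, so calling this twice is safe.
    Unrelated keys (for example widget state) are left untouched.
    """
    for key in keys:
        state.pop(key, None)
```

Dropping the transcript and translation from the state also releases the references to the largest objects in the session, which helps with memory after long recordings.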

Dynamic File Generation

Created downloadable files on-the-fly without server storage
Generated multiple format types (TXT, SRT, VTT) from single transcription
Implemented proper file naming conventions with timestamps
Handled both original and translated content generation

UI/UX Details

Streamlit Modern Interface

Clean, intuitive workflow with clear step-by-step process
Responsive design working across desktop and mobile devices
Real-time processing indicators and progress feedback
Large, readable text areas for transcript review

User Experience Design

Drag-and-drop file upload with visual feedback
Clear separation between original and translated content
Prominent download buttons for all output formats
Spinner indicators for processing states

Accessibility Features

High contrast text for improved readability
Keyboard navigation support
Clear error messages and user guidance
Mobile-responsive design for on-the-go use

Impact and Problem Solved

Global Accessibility

Makes professional transcription accessible to users worldwide
Removes language barriers for international content creators
Enables accurate transcription without expensive software or subscriptions

Professional Applications

Researchers: Convert interviews and focus groups into analyzable text
Journalists: Transcribe interviews and press conferences with speaker identification
Educators: Create accessible content from lectures and presentations
Global Teams: Bridge language gaps in international communications

Technical Innovation

No-Code Solution: Technical transcription made accessible to non-technical users
Privacy-First Design: No user data storage or account requirements
Multi-Format Output: Comprehensive export options for various use cases
Real-Time Processing: Fast transcription and translation workflows

Key Results

Fast, accurate multi-language transcription without technical setup
Speaker diarization with custom naming for professional use cases
Instant translation expanding content accessibility globally
Multiple export formats supporting diverse professional workflows