Google Cloud Speech-to-Text: Simpler Than AWS Transcribe
We needed accurate speech transcription for a Flutter mobile app where users record voice memories. The transcription had to be fast, accurate, and cost-effective at scale.
Google Cloud Speech-to-Text beat AWS Transcribe on two key factors: simpler integration (direct API calls vs. an S3 upload requirement) and better long-term economics (a permanent free tier vs. AWS's 12-month limit). Both services cost $0.024 per minute after the free tier, but Google's integration is significantly cleaner.
Implementation
The Google Cloud approach is straightforward. Encode the audio as base64 and send it directly in the API request:
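import 'dart:convert';
import 'dart:io';
import 'package:flutter/services.dart' show rootBundle;
import 'package:googleapis_auth/auth_io.dart';
// The statements below run inside an async function.
// Load service account credentials.
// Caution: a key bundled in app assets can be extracted from the shipped app.
final credentialsJson = await rootBundle.loadString('assets/google-credentials.json');
final credentials = ServiceAccountCredentials.fromJson(jsonDecode(credentialsJson));
final client = await clientViaServiceAccount(
  credentials,
  ['https://www.googleapis.com/auth/cloud-platform'],
);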
// Read and encode audio
final audioBytes = await File(audioFilePath).readAsBytes();
final audioBase64 = base64Encode(audioBytes);
// Configure and send request
final requestBody = {
  'config': {
    // AAC is not among the documented AudioEncoding values; declaring MP3
    // worked for our M4A files, so treat this as observed behavior.
    'encoding': 'MP3',
    'sampleRateHertz': 16000,
    'languageCode': 'en-US',
    'enableAutomaticPunctuation': true,
  },
  'audio': {'content': audioBase64},
};
final response = await client.post(
  Uri.parse('https://speech.googleapis.com/v1/speech:recognize'),
  headers: {'Content-Type': 'application/json'},
  body: jsonEncode(requestBody),
);
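// Extract transcript; 'results' is absent when no speech is recognized
if (response.statusCode != 200) {
  throw HttpException('Speech API error ${response.statusCode}: ${response.body}');
}
final results = (jsonDecode(response.body)['results'] as List<dynamic>?) ?? const [];
final transcript = results
    .map((result) => result['alternatives'][0]['transcript'] as String)
    .join(' ');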
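The extraction step is easier to unit test when it is pulled out into a pure function. The sketch below is one way to do that, not necessarily how the app structures it: `extractTranscript` is a hypothetical helper name, and it assumes the response body follows the `results` / `alternatives` / `transcript` shape used above. It returns an empty string when the response has no results or a result has no alternatives.

```dart
import 'dart:convert';

/// Joins the top alternative of each result into a single transcript.
/// Returns an empty string when the response carries no results.
/// (Hypothetical helper; the name is not part of any Google library.)
String extractTranscript(String responseBody) {
  final decoded = jsonDecode(responseBody) as Map<String, dynamic>;
  final results = (decoded['results'] as List<dynamic>?) ?? const <dynamic>[];
  final pieces = <String>[];
  for (final result in results) {
    final alternatives =
        (result as Map<String, dynamic>)['alternatives'] as List<dynamic>?;
    // Skip results that carry no alternatives at all.
    if (alternatives == null || alternatives.isEmpty) continue;
    final text = (alternatives.first as Map<String, dynamic>)['transcript'];
    // Keep only non-empty strings, trimmed so the join adds single spaces.
    if (text is String && text.trim().isNotEmpty) pieces.add(text.trim());
  }
  return pieces.join(' ');
}
```

With this in place, the call site reduces to `final transcript = extractTranscript(response.body);`.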
Audio Configuration
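Recording at a 16 kHz sample rate suits speech recognition well: it is lower than music quality (44.1 kHz) but sufficient for voice. This reduces file size and upload time without sacrificing transcription accuracy. It does not change API cost, since billing is by audio duration rather than file size.
We use AAC-LC encoding in M4A containers at 128 kbps. In our experience, Google's API accepts these files when the request declares 'MP3' encoding, so no conversion is needed. AAC is not among the documented encodings for this endpoint, so this is observed behavior rather than a guarantee.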
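To get a feel for how large the request payload becomes, the arithmetic can be sketched as below. Both function names are hypothetical. The estimate assumes a constant bitrate and ignores M4A container overhead, so real files will be somewhat larger.

```dart
/// Approximate encoded audio size in bytes for a constant-bitrate stream.
/// Ignores M4A container overhead, so real files are slightly larger.
/// (Hypothetical helper for illustration.)
int estimatedAudioBytes(int bitsPerSecond, Duration length) {
  // bits = bitsPerSecond * milliseconds / 1000; bytes = bits / 8.
  return (bitsPerSecond * length.inMilliseconds) ~/ 8000;
}

/// Length of the padded base64 text produced for [byteCount] input bytes.
/// Base64 emits 4 characters for every 3 bytes, rounding up.
int base64Length(int byteCount) => 4 * ((byteCount + 2) ~/ 3);
```

At 128 kbps, a one-minute recording comes to roughly 960 kB of audio, which grows to about 1.28 MB once base64-encoded into the JSON body.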
Why Not AWS Transcribe
AWS Transcribe requires uploading audio files to S3 before transcription, adding complexity and latency. You need to:
- Configure S3 bucket with proper permissions
- Upload audio file
- Start transcription job with S3 URI
- Poll for completion
- Retrieve results
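Google's synchronous API eliminates the bucket setup, the upload, and the polling. For recordings under 1 minute (our typical use case), the response comes back in 2-3 seconds. The synchronous endpoint accepts only about one minute of audio per request; longer recordings have to go through the asynchronous longrunningrecognize method instead.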
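Because the synchronous endpoint only suits short audio, an app may want to choose the REST method from the recording length. The following is a minimal sketch under stated assumptions: `syncAudioLimit` and `recognizeEndpointFor` are hypothetical names, and the 60-second cutoff is an approximation that should be checked against the current quota documentation. Longer audio may also need to be referenced by a Cloud Storage URI rather than sent inline, which this sketch does not handle.

```dart
/// Assumed cutoff for the synchronous endpoint; verify against current quotas.
const Duration syncAudioLimit = Duration(seconds: 60);

/// Returns the v1 REST endpoint suited to a recording of [audioLength].
/// (Hypothetical helper; only selects the URL, it does not send anything.)
Uri recognizeEndpointFor(Duration audioLength) {
  final method =
      audioLength <= syncAudioLimit ? 'recognize' : 'longrunningrecognize';
  return Uri.parse('https://speech.googleapis.com/v1/speech:$method');
}
```

The asynchronous method returns an operation that has to be polled, so recordings over the cutoff lose part of the simplicity advantage described above.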
Cost Comparison
Both services cost $0.024 per minute after the free tier. The difference is how long the free tier lasts:
- Google: 60 minutes/month, permanent
- AWS: 60 minutes/month, first 12 months only
For a 2-minute recording:
- Cost: $0.048
- Free tier covers: 30 recordings/month
- After free tier: $4.80 per 100 recordings
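The figures above can be reproduced with a small calculator. This is a sketch using the numbers quoted in this post (60 free minutes, $0.024 per minute); `monthlyCostUsd` is a hypothetical helper, and it ignores any per-request rounding the provider applies when billing.

```dart
import 'dart:math' as math;

/// Estimated monthly bill in USD for [minutesTranscribed] minutes of audio.
/// Defaults mirror the prices quoted in this post; adjust if pricing changes.
double monthlyCostUsd(
  double minutesTranscribed, {
  double freeMinutes = 60,
  double pricePerMinute = 0.024,
}) {
  // Only minutes beyond the free allocation are billed.
  final billableMinutes = math.max(0.0, minutesTranscribed - freeMinutes);
  return billableMinutes * pricePerMinute;
}
```

For example, 130 two-minute recordings in a month total 260 minutes; 200 of those are billable, which comes to about $4.80, matching the per-100-recordings figure above.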
Results
Transcription accuracy is excellent for conversational speech. Latency averages 2-3 seconds for typical 1-2 minute recordings. The automatic punctuation feature saves post-processing work, since transcripts are immediately readable.
The permanent free tier means development and testing at low volume incur no cost. In production, the free allocation (60 minutes, or 30 two-minute recordings, per month) is shared across the whole billing account rather than granted per user, so it mostly offsets early, low-volume usage. The service account authentication integrates cleanly with our existing Google Cloud setup for Gemini AI, using a single credential file for both services.