Google Cloud Speech-to-Text: Simpler Than AWS Transcribe

We needed accurate speech transcription for a Flutter mobile app where users record voice memories. The transcription had to be fast, accurate, and cost-effective at scale.

Google Cloud Speech-to-Text beat AWS Transcribe on two key factors: simpler integration (direct API calls instead of a required S3 upload) and better long-term economics (a permanent free tier instead of AWS's 12-month limit). Both services cost $0.024 per minute after the free tier, but Google's implementation is significantly cleaner.

Implementation

The Google Cloud approach is straightforward: encode the audio as base64 and send it directly in the API request.

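import 'dart:convert';
import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:googleapis_auth/auth_io.dart';

Future<String> transcribe(String audioFilePath) async {
  // Load service account credentials.
  // Caution: a key bundled in app assets can be extracted from the app binary.
  final credentialsJson =
      await rootBundle.loadString('assets/google-credentials.json');
  final credentials =
      ServiceAccountCredentials.fromJson(json.decode(credentialsJson));
  final client = await clientViaServiceAccount(
    credentials,
    ['https://www.googleapis.com/auth/cloud-platform'],
  );

  try {
    // Read and encode audio
    final audioBytes = await File(audioFilePath).readAsBytes();
    final audioBase64 = base64Encode(audioBytes);

    // Configure and send request
    final requestBody = {
      'config': {
        'encoding': 'MP3', // Setting we use for our AAC/M4A recordings
        'sampleRateHertz': 16000,
        'languageCode': 'en-US',
        'enableAutomaticPunctuation': true,
      },
      'audio': {'content': audioBase64},
    };

    final response = await client.post(
      Uri.parse('https://speech.googleapis.com/v1/speech:recognize'),
      headers: {'Content-Type': 'application/json'},
      body: jsonEncode(requestBody),
    );
    if (response.statusCode != 200) {
      throw HttpException(
          'Speech API error ${response.statusCode}: ${response.body}');
    }

    // Extract transcript; 'results' is absent when no speech was recognized
    final decoded = jsonDecode(response.body) as Map<String, dynamic>;
    final results = (decoded['results'] as List<dynamic>?) ?? const [];
    return results
        .map((result) => result['alternatives'][0]['transcript'] as String)
        .join(' ');
  } finally {
    client.close(); // Release the underlying HTTP connection
  }
}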

Audio Configuration

Recording at a 16 kHz sample rate is optimal for speech recognition: lower than music quality (44.1 kHz) but enough to capture voice. This keeps recordings small without sacrificing transcription accuracy. It does not change the API cost, since billing is per minute of audio rather than per byte.

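We record AAC-LC in M4A containers at 128 kbps. In our setup the API accepts these files when the request specifies 'MP3' encoding, so no conversion is needed. AAC is not among the encodings listed in the v1 API reference, so treat this as observed behavior rather than a documented guarantee.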
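As a rough check that a recording fits in a single inline request, the sketch below estimates the encoded payload size from the bitrate. The constant 128 kbps bitrate comes from the text above. The 10 MB request cap and the helper names are assumptions for illustration, and container overhead is ignored.

```dart
// Assumption: constant bitrate of 128 kbps, container overhead ignored.
const int bitrateBitsPerSecond = 128000;

// Assumption: inline requests are capped at roughly 10 MB; verify against
// the current Speech-to-Text limits before relying on this number.
const int assumedRequestLimitBytes = 10 * 1024 * 1024;

/// Approximate size in bytes of [seconds] of audio at the assumed bitrate.
int estimatedAudioBytes(int seconds) => seconds * bitrateBitsPerSecond ~/ 8;

/// Length of the base64 text for [byteCount] raw bytes.
/// Base64 emits 4 characters for every 3 input bytes, rounded up.
int base64Length(int byteCount) => ((byteCount + 2) ~/ 3) * 4;

void main() {
  final bytes = estimatedAudioBytes(60);
  print(bytes); // 960000
  print(base64Length(bytes)); // 1280000
  print(base64Length(bytes) < assumedRequestLimitBytes); // true
}
```

Under these assumptions a one-minute recording is a little over 1 MB once base64-encoded, well inside the assumed cap.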

Why Not AWS Transcribe

AWS Transcribe requires uploading audio files to S3 before transcription, adding complexity and latency. You need to:

1. Configure an S3 bucket with proper permissions
2. Upload the audio file
3. Start a transcription job with the S3 URI
4. Poll for completion
5. Retrieve the results

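Google's synchronous API eliminates steps 1, 2, and 4: there is no bucket to configure, no upload, and no polling. For recordings under 1 minute (our typical use case), the response comes back in 2-3 seconds. The synchronous endpoint accepts only about one minute of audio per request, so longer recordings need the asynchronous longrunningrecognize method instead.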
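Because the synchronous method only covers short clips, an app can choose the endpoint from the recording length. The following is a minimal sketch of that choice. The helper name recognitionEndpoint and the exact one-minute cutoff are assumptions, and the long-running method returns an operation that still has to be polled.

```dart
const String speechApiBase = 'https://speech.googleapis.com/v1';

// Assumption: synchronous recognition handles roughly one minute of audio.
const Duration syncLimit = Duration(minutes: 1);

/// Picks the REST method for a clip of [audioLength].
/// Hypothetical helper, not part of any Google client library.
Uri recognitionEndpoint(Duration audioLength) {
  final method = audioLength <= syncLimit
      ? 'speech:recognize' // Result arrives in the HTTP response
      : 'speech:longrunningrecognize'; // Returns an operation to poll
  return Uri.parse('$speechApiBase/$method');
}

void main() {
  print(recognitionEndpoint(const Duration(seconds: 45)));
  // https://speech.googleapis.com/v1/speech:recognize
  print(recognitionEndpoint(const Duration(minutes: 2)));
  // https://speech.googleapis.com/v1/speech:longrunningrecognize
}
```

To my understanding, audio beyond the inline limit is normally referenced by a Cloud Storage URI rather than sent as base64, so the simplicity advantage applies mainly to short recordings.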

Cost Comparison

Both services cost $0.024 per minute after the free tier. The difference is how long the free tier lasts:

Google: 60 minutes/month, permanent
AWS: 60 minutes/month, first 12 months only

For a 2-minute recording:

Cost: $0.048
Free tier covers: 30 recordings/month
After free tier: $4.80 per 100 recordings
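The arithmetic above can be reproduced with a small helper. This is a sketch that uses the figures quoted in this section ($0.024 per minute, 60 free minutes per month). The function name and the whole-minute granularity are simplifying assumptions, and real billing increments may differ.

```dart
// Figures from the comparison above; check current pricing before reuse.
const int pricePerMinuteMilliUsd = 24; // $0.024 per minute
const int freeMinutesPerMonth = 60;

/// Monthly cost in USD for [recordings] clips of [minutesPerRecording]
/// minutes each, after subtracting [freeMinutes] of free usage.
double monthlyCostUsd(
  int recordings,
  int minutesPerRecording, {
  int freeMinutes = freeMinutesPerMonth,
}) {
  final totalMinutes = recordings * minutesPerRecording;
  final billableMinutes =
      totalMinutes > freeMinutes ? totalMinutes - freeMinutes : 0;
  // Integer math in thousandths of a dollar avoids rounding drift.
  return billableMinutes * pricePerMinuteMilliUsd / 1000;
}

void main() {
  print(monthlyCostUsd(30, 2).toStringAsFixed(2)); // 0.00
  print(monthlyCostUsd(100, 2).toStringAsFixed(2)); // 3.36
  print(monthlyCostUsd(100, 2, freeMinutes: 0).toStringAsFixed(2)); // 4.80
}
```

The last call matches the $4.80 per 100 recordings figure, which applies once the free minutes are used up. In a month that still has its full free allowance, the same 100 recordings cost $3.36.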

Results

Transcription accuracy is excellent for conversational speech. Latency averages 2-3 seconds for our typical recordings of under 1 minute. The automatic punctuation feature saves post-processing work: transcripts are immediately readable.

The permanent free tier means development and testing stay free as long as usage remains under 60 minutes a month. In production the allowance is shared across the whole project rather than granted per user, so it covers about 30 two-minute recordings a month in total. The service account authentication integrates cleanly with our existing Google Cloud setup for Gemini AI, using a single credential file for both services.