If you've deployed Amazon Lex behind Amazon Connect, you've probably hit this: a caller is at a busy intersection, a coffee shop, or driving with the windows down, and your bot completely falls apart. It mis-transcribes utterances, matches the wrong intent, falls back repeatedly, and eventually transfers the caller to an agent who has to start from scratch.
The problem isn't Lex's NLU. It's the built-in speech-to-text. And there's a surprisingly simple pattern to fix it.
The Problem: Lex's STT Wasn't Built for Noise
When you send audio to Lex via RecognizeUtterance, here's what happens internally:
Caller Audio (8kHz PCM) --> Lex Built-in STT --> Text --> NLU --> Intent Match
Lex's internal STT engine works fine in quiet environments. But in real-world contact center scenarios, callers are often in noisy places. Background traffic, music, crowds, wind, car noise, and other environmental sounds degrade transcription accuracy significantly.
The result is a cascade of failures:
- Garbled transcription -- Lex hears "I need help with my bill" as "I need held mitt pile"
- Wrong intent match -- The garbled text matches `FallbackIntent` instead of `AccountLookup`
- Retry loop -- The bot asks the caller to repeat; they do, same noise, same failure
- Agent transfer -- After 2-3 fallbacks, the bot gives up and transfers to an agent
You can lower the NLU confidence threshold, but that doesn't fix the root cause. The text going into NLU is wrong.
The Pattern: Decouple STT from NLU
The fix is to stop using Lex's built-in speech-to-text and instead pre-process the audio through Amazon Transcribe before sending the cleaned text to Lex:
Standard Path (noisy environments fail):
Caller Audio --> Lex RecognizeUtterance (built-in STT + NLU)
Improved Path (noise-resilient):
Caller Audio --> Amazon Transcribe Streaming --> Clean Text --> Lex RecognizeText (NLU only)
Why does this work? Amazon Transcribe's telephony models are specifically trained on real-world call audio with background noise, and they have been battle-tested across thousands of Amazon Connect deployments. Given the same noisy audio that trips up Lex's internal STT, Transcribe produces dramatically better transcriptions.
The key insight is that RecognizeText gives you the same NLU pipeline as RecognizeUtterance, minus the STT step. You get the same intent matching, slot filling, and dialog management. You're just feeding it better input.
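As a sketch of what the text-only call looks like with boto3 (the bot ID, alias ID, and session ID below are placeholders for your own deployment):

```python
def lex_recognize_text(text, bot_id, bot_alias_id,
                       locale_id="en_US", session_id="demo-session",
                       client=None):
    """Send already-transcribed text straight to Lex NLU.

    Same intent matching, slot filling, and dialog management as
    RecognizeUtterance -- minus the built-in STT step.
    """
    if client is None:
        import boto3  # deferred so the function is easy to stub in tests
        client = boto3.client("lexv2-runtime")
    resp = client.recognize_text(
        botId=bot_id,
        botAliasId=bot_alias_id,
        localeId=locale_id,
        sessionId=session_id,
        text=text,
    )
    intent = resp["sessionState"]["intent"]["name"]
    interpretations = resp.get("interpretations") or [{}]
    score = interpretations[0].get("nluConfidence", {}).get("score")
    return intent, score
```

The response also carries the full session state, so multi-turn slot elicitation works exactly as it would over audio.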
Architecture
We built a complete Terraform module that deploys both paths side-by-side with an A/B comparison test harness:
+-------------------+
| Microphone / |
| Audio Source |
+--------+----------+
|
+-------------+-------------+
| |
v v
+----------------------------+ +----------------------------+
| Path A: Direct Audio | | Path B: Transcribe Pre- |
| | | processor |
| Resample to 8kHz PCM | | Resample to 16kHz PCM |
| | | | | |
| v | | v |
| Lex RecognizeUtterance | | Transcribe Streaming |
| (built-in STT + NLU) | | (telephony-optimized) |
| | | | | |
| v | | v |
| Intent + Confidence | | Clean text |
| | | | |
| | | v |
| | | Lex RecognizeText |
| | | (NLU only) |
| | | | |
| | | v |
| | | Intent + Confidence |
+----------------------------+ +----------------------------+
| |
+-------------+-------------+
|
v
+-------------------+
| Compare Results |
| Transcript, |
| Intent, Score |
+-------------------+
What Gets Deployed
The Terraform module creates:
- Amazon Lex V2 Bot with 5 intents (Greeting, AccountLookup, TransferToAgent, EndCall, Fallback)
- Lambda fulfillment handler with slot validation, retry tracking, and fallback escalation
- S3 bucket for audio/text conversation logging (30-day lifecycle to IA, 90-day expiry)
- CloudWatch dashboard tracking fallback count, agent transfer rate, and retry count
- IAM roles scoped for least privilege
The Lex Bot
The bot is designed to exercise real-world contact center patterns:
| Intent | Purpose | Slots |
|---|---|---|
| `GreetingIntent` | Opening the conversation | None |
| `AccountLookup` | Multi-turn slot filling | `AccountNumber` (6-10 digits), `IssueType` (billing/technical/general), `CallerName` |
| `TransferToAgent` | Explicit agent request | None |
| `EndCallIntent` | Closing the conversation | None |
| `FallbackIntent` | Catch-all with escalation | None (auto-transfers after 2 consecutive fallbacks) |
The AccountLookup intent is deliberately complex. It requires multi-turn dialog with slot validation, which is exactly where noisy audio causes the most failures. When the bot asks "What is your account number?" and the caller says "1234567" in a noisy environment, the standard path frequently mis-transcribes the digits.
The Lambda Fulfillment Handler
The fulfillment handler implements noise-aware retry logic:
```python
def handle_dialog_code_hook(event):
    """Validate slots during dialog -- retry-aware."""
    session_attrs = event.get('sessionState', {}).get('sessionAttributes', {}) or {}
    retry_count = int(session_attrs.get('retryCount', '0'))
    max_retries = int(session_attrs.get('maxRetries', '3'))
    slots = event.get('sessionState', {}).get('intent', {}).get('slots', {}) or {}

    # Validate AccountNumber: must be 6-10 digits
    account_number = slots.get('AccountNumber', {})
    if account_number and account_number.get('value', {}).get('interpretedValue'):
        value = account_number['value']['interpretedValue']
        if not value.isdigit() or not (6 <= len(value) <= 10):
            retry_count += 1
            if retry_count >= max_retries:
                # Too many retries -- transfer to agent
                return transfer_to_agent(session_attrs)
            return elicit_slot('AccountNumber', 'Please provide a valid account number.',
                               session_attrs, retry_count)
```
Key behaviors:
- Retry tracking via session attributes persisted across turns
- Progressive escalation -- after configurable max retries (default 3), transfers to agent
- Fallback counting -- 2 consecutive fallbacks triggers agent transfer
- Metric emission -- CloudWatch metric filters track fallback, retry, and transfer rates
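Since the metrics come from CloudWatch metric filters rather than direct API calls, the handler only needs to log structured events. A minimal sketch of that emission (the field names here are illustrative, not necessarily the module's exact schema):

```python
import json
from datetime import datetime, timezone

def emit_metric_event(event_type, session_id, retry_count=0):
    """Print one JSON line per event; CloudWatch metric filters on the
    Lambda's log group count FALLBACK / RETRY / TRANSFER occurrences."""
    print(json.dumps({
        "metric_event": event_type,   # "FALLBACK", "RETRY", or "TRANSFER"
        "session_id": session_id,
        "retry_count": retry_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))
```

Logging instead of calling `PutMetricData` per event keeps the hot path cheap and avoids extra IAM permissions in the fulfillment Lambda.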
Running the A/B Comparison
The test harness records audio from your microphone and sends the same recording through both paths simultaneously:
# Clone and deploy
git clone https://github.com/AIOpsCrew/terraform-module-lexbot-noisy-caller-poc.git
cd ai-lex-bot/terraform
terraform init && terraform apply
# Set up the test harness
cd ..
python3 -m venv venv && source venv/bin/activate
pip install boto3 pyaudio numpy scipy amazon-transcribe
# Run the comparison
python3 record_and_send.py
The output shows both paths side by side:
[Path A -- Direct to Lex (0.8s)]
Heard: i need help mitt my account
Intent: FallbackIntent (0.42)
Bot says: I didn't understand that. Could you try again?
[Path B -- Transcribe (0.3s) + Lex text (0.2s)]
Transcribe heard: i need help with my account
Intent: AccountLookup (0.97)
Bot says: I'd be happy to help! What is your account number?
In noisy environments, the difference is stark. Path B consistently outperforms Path A on intent accuracy, especially for multi-turn dialogs where slot values contain numbers or proper nouns.
Latency Trade-off
Path B adds an extra hop. In our testing:
| Metric | Path A | Path B | Delta |
|---|---|---|---|
| STT Latency | ~0ms (built into Lex) | 200-500ms (Transcribe) | +200-500ms |
| NLU Latency | ~300ms | ~200ms | -100ms |
| Total | ~300ms | ~400-700ms | +100-400ms |
The additional latency is noticeable but acceptable for most contact center scenarios. The accuracy improvement far outweighs the speed cost, especially when you factor in the time wasted on retries and agent transfers in the standard path.
Deploying in Production
The test harness proves the pattern. For production deployment in Amazon Connect, the architecture looks like this:
Caller --> Connect --> Kinesis Video Stream --> Lambda --> Transcribe Streaming
|
v
Clean Transcription
|
v
Lex RecognizeText --> Contact Flow
The Lambda function sits between Connect's Kinesis Video Stream and Lex. It:
- Receives the audio stream from Connect
- Forwards it to Transcribe Streaming
- Takes the transcribed text
- Calls Lex `RecognizeText` with the clean transcription
- Returns the bot response back to the contact flow
This is a drop-in replacement for the standard Connect + Lex integration. The contact flow doesn't change. The caller experience doesn't change. The only difference is better accuracy.
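The bridge Lambda's per-utterance control flow can be sketched like this, with the Transcribe and Lex calls injected as plain functions (in the real handler these would wrap Transcribe Streaming and `lexv2-runtime` `recognize_text`; the names here are illustrative):

```python
def bridge_utterance(audio_chunks, transcribe_fn, lex_fn):
    """One caller utterance through the Connect bridge (sketch).

    transcribe_fn: audio chunks -> final transcript (Transcribe Streaming)
    lex_fn: transcript -> Lex response dict (RecognizeText, NLU only)
    """
    transcript = transcribe_fn(audio_chunks)
    if not transcript.strip():
        # Nothing intelligible heard -- let the contact flow reprompt
        return {"action": "reprompt"}
    lex_response = lex_fn(transcript)
    return {
        "action": "respond",
        "transcript": transcript,
        "messages": lex_response.get("messages", []),
        "intent": lex_response.get("sessionState", {})
                              .get("intent", {}).get("name"),
    }
```

Keeping the two calls behind injected functions also makes the bridge testable without live AWS services.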
Cost Analysis
For a contact center handling 10,000 calls/month with an average of 5 utterances per call:
| Service | Standard Path | Transcribe Pre-process Path |
|---|---|---|
| Lex | $3.75 (50k audio reqs) | $0.75 (50k text reqs) |
| Transcribe | $0 | $50 (833 minutes) |
| Lambda | $0.50 | $1.00 |
| Total | $4.25/mo | $51.75/mo |
The Transcribe pre-processing path costs more. But consider the cost of agent time:
- If the standard path causes 500 unnecessary agent transfers per month
- At an average handle time of 5 minutes and $25/hour agent cost
- That's $1,042/month in wasted agent time
The Transcribe path pays for itself many times over by keeping callers in the self-service flow.
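The arithmetic behind that figure, for anyone who wants to plug in their own numbers:

```python
transfers_per_month = 500      # unnecessary agent transfers avoided
handle_time_min = 5            # average handle time, minutes
agent_cost_per_hour = 25.0     # fully loaded agent cost

wasted_agent_cost = transfers_per_month * handle_time_min / 60 * agent_cost_per_hour
extra_pipeline_cost = 51.75 - 4.25   # monthly delta from the cost table above

print(round(wasted_agent_cost, 2))                        # 1041.67
print(round(wasted_agent_cost - extra_pipeline_cost, 2))  # 994.17 net savings
```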
Getting Started
The entire project is open source and deploys with a single terraform apply:
Prerequisites
- AWS account with Lex V2, Transcribe, Lambda, and S3 access
- Terraform 1.5+ with both the `aws` and `awscc` providers
- A microphone (for live A/B testing)
Quick Start
# Deploy the infrastructure
cd terraform
terraform init
terraform apply
# Run the A/B test
cd ..
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python3 record_and_send.py
The test harness supports interactive commands:
- Enter -- Record and send through both paths
- a -- Path A only (direct Lex audio)
- b -- Path B only (Transcribe pre-process)
- n -- Start new sessions
- q -- Quit
When to Use This Pattern
This pattern is most valuable when:
- Your callers are frequently in noisy environments (field workers, drivers, public spaces)
- You're seeing high fallback rates or unnecessary agent transfers
- Your bot handles slot filling with numbers, names, or specific values
- You need measurable accuracy data to justify the investment
It's less necessary when:
- Callers are primarily in quiet office environments
- Your bot only handles simple yes/no intents
- Latency is more critical than accuracy (rare in contact centers)
Conclusion
The standard Amazon Lex audio pipeline works fine in ideal conditions. But contact centers don't operate in ideal conditions. By decoupling speech-to-text from natural language understanding, you can plug in a purpose-built STT engine that handles real-world noise.
The pattern is simple: Transcribe Streaming for STT, Lex RecognizeText for NLU. Two services, each doing what they do best.
The module is open source, deploys in minutes, and includes a test harness so you can measure the improvement in your own environment before committing to production changes.