
If you've deployed Amazon Lex behind Amazon Connect, you've probably hit this: a caller is at a busy intersection, a coffee shop, or driving with the windows down, and your bot completely falls apart. It mis-transcribes utterances, matches the wrong intent, falls back repeatedly, and eventually transfers the caller to an agent who has to start from scratch.

The problem isn't Lex's NLU. It's the built-in speech-to-text. And there's a surprisingly simple pattern to fix it.


The Problem: Lex's STT Wasn't Built for Noise

When you send audio to Lex via RecognizeUtterance, here's what happens internally:

Caller Audio (8kHz PCM) --> Lex Built-in STT --> Text --> NLU --> Intent Match

Lex's internal STT engine works fine in quiet environments. But in real-world contact center scenarios, callers are often in noisy places. Background traffic, music, crowds, wind, car noise, and other environmental sounds degrade transcription accuracy significantly.

The result is a cascade of failures:

  1. Garbled transcription -- Lex hears "I need help with my bill" as "I need held mitt pile"
  2. Wrong intent match -- The garbled text matches FallbackIntent instead of AccountLookup
  3. Retry loop -- The bot asks the caller to repeat, they do, same noise, same failure
  4. Agent transfer -- After 2-3 fallbacks, the bot gives up and transfers to an agent

You can lower the NLU confidence threshold, but that doesn't fix the root cause. The text going into NLU is wrong.


The Pattern: Decouple STT from NLU

The fix is to stop using Lex's built-in speech-to-text and instead pre-process the audio through Amazon Transcribe before sending the cleaned text to Lex:

Standard Path (noisy environments fail):
  Caller Audio --> Lex RecognizeUtterance (built-in STT + NLU)

Improved Path (noise-resilient):
  Caller Audio --> Amazon Transcribe Streaming --> Clean Text --> Lex RecognizeText (NLU only)

Why does this work? Amazon Transcribe's telephony models are trained on real-world call audio with background noise, and the service has been battle-tested across thousands of Amazon Connect deployments. Given the same noisy audio that trips up Lex's internal STT, Transcribe produces dramatically better transcriptions.

The key insight is that RecognizeText gives you the same NLU pipeline as RecognizeUtterance, minus the STT step. You get the same intent matching, slot filling, and dialog management. You're just feeding it better input.


Architecture

We built a complete Terraform module that deploys both paths side-by-side with an A/B comparison test harness:

                     +-------------------+
                     |   Microphone /    |
                     |   Audio Source    |
                     +--------+----------+
                              |
                +-------------+-------------+
                |                           |
                v                           v
  +----------------------------+  +----------------------------+
  |    Path A: Direct Audio    |  |  Path B: Transcribe Pre-   |
  |                            |  |     processor              |
  |  Resample to 8kHz PCM      |  |  Resample to 16kHz PCM     |
  |           |                |  |           |                |
  |           v                |  |           v                |
  |  Lex RecognizeUtterance    |  |  Transcribe Streaming      |
  |  (built-in STT + NLU)      |  |  (telephony-optimized)     |
  |           |                |  |           |                |
  |           v                |  |           v                |
  |  Intent + Confidence       |  |  Clean text                |
  |                            |  |           |                |
  |                            |  |           v                |
  |                            |  |  Lex RecognizeText         |
  |                            |  |  (NLU only)                |
  |                            |  |           |                |
  |                            |  |           v                |
  |                            |  |  Intent + Confidence       |
  +----------------------------+  +----------------------------+
                |                           |
                +-------------+-------------+
                              |
                              v
                     +-------------------+
                     |  Compare Results  |
                     |  Transcript,      |
                     |  Intent, Score    |
                     +-------------------+

What Gets Deployed

The Terraform module creates:

  • Amazon Lex V2 Bot with 5 intents (Greeting, AccountLookup, TransferToAgent, EndCall, Fallback)
  • Lambda fulfillment handler with slot validation, retry tracking, and fallback escalation
  • S3 bucket for audio/text conversation logging (30-day lifecycle to IA, 90-day expiry)
  • CloudWatch dashboard tracking fallback count, agent transfer rate, and retry count
  • IAM roles scoped for least privilege

The Lex Bot

The bot is designed to exercise real-world contact center patterns:

  Intent            Purpose                     Slots
  ----------------  --------------------------  ----------------------------------------
  GreetingIntent    Opening the conversation    None
  AccountLookup     Multi-turn slot filling     AccountNumber (6-10 digits), IssueType
                                                (billing/technical/general), CallerName
  TransferToAgent   Explicit agent request      None
  EndCallIntent     Closing the conversation    None
  FallbackIntent    Catch-all with escalation   None (auto-transfers after 2
                                                consecutive fallbacks)

The AccountLookup intent is deliberately complex. It requires multi-turn dialog with slot validation, which is exactly where noisy audio causes the most failures. When the bot asks "What is your account number?" and the caller says "1234567" in a noisy environment, the standard path frequently mis-transcribes the digits.


The Lambda Fulfillment Handler

The fulfillment handler implements noise-aware retry logic:

MAX_RETRIES = 3  # configurable via environment variable in the deployed handler

def handle_dialog_code_hook(event):
    """Validate slots during dialog -- retry-aware"""
    session_state = event.get('sessionState', {})
    session_attrs = session_state.get('sessionAttributes', {}) or {}
    slots = session_state.get('intent', {}).get('slots', {}) or {}
    retry_count = int(session_attrs.get('retryCount', '0'))

    # Validate AccountNumber: must be 6-10 digits
    account_number = slots.get('AccountNumber') or {}
    if account_number.get('value', {}).get('interpretedValue'):
        value = account_number['value']['interpretedValue']
        if not value.isdigit() or not (6 <= len(value) <= 10):
            retry_count += 1
            if retry_count >= MAX_RETRIES:
                # Too many retries -- transfer to agent
                return transfer_to_agent(session_attrs)
            return elicit_slot('AccountNumber',
                               'Please provide a valid account number.',
                               session_attrs, retry_count)

    # Slot valid (or not yet filled) -- hand control back to Lex
    return delegate(session_attrs)

Key behaviors:

  • Retry tracking via session attributes persisted across turns
  • Progressive escalation -- after configurable max retries (default 3), transfers to agent
  • Fallback counting -- 2 consecutive fallbacks triggers agent transfer
  • Metric emission -- CloudWatch metric filters track fallback, retry, and transfer rates
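Since the dashboard metrics come from CloudWatch metric filters rather than direct API calls, the emission side can be as simple as structured log lines. A sketch, with illustrative field names (the module's actual filter patterns may differ):

```python
# Hedged sketch of metric emission: print one JSON log line per event and
# let a CloudWatch metric filter (e.g. matching {$.metric = "FallbackCount"})
# turn it into a dashboard metric. Field names here are illustrative.
import json


def emit_metric(metric_name, session_id, value=1):
    """Print a JSON log line that a CloudWatch metric filter can match."""
    print(json.dumps({
        "metric": metric_name,   # FallbackCount | RetryCount | AgentTransfer
        "value": value,
        "sessionId": session_id,
    }))


emit_metric("FallbackCount", "session-1")
```

Logging instead of calling PutMetricData keeps the hot path of the Lambda free of extra API latency.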

Running the A/B Comparison

The test harness records audio from your microphone and sends the same recording through both paths simultaneously:

# Clone and deploy
git clone https://github.com/AIOpsCrew/terraform-module-lexbot-noisy-caller-poc.git
cd terraform-module-lexbot-noisy-caller-poc/terraform
terraform init && terraform apply

# Set up the test harness
cd ..
python3 -m venv venv && source venv/bin/activate
pip install boto3 pyaudio numpy scipy amazon-transcribe

# Run the comparison
python3 record_and_send.py

The output shows both paths side by side:

[Path A -- Direct to Lex (0.8s)]
  Heard: i need help mitt my account
  Intent: FallbackIntent (0.42)
  Bot says: I didn't understand that. Could you try again?

[Path B -- Transcribe (0.3s) + Lex text (0.2s)]
  Transcribe heard: i need help with my account
  Intent: AccountLookup (0.97)
  Bot says: I'd be happy to help! What is your account number?

In noisy environments, the difference is stark. Path B consistently outperforms Path A on intent accuracy, especially for multi-turn dialogs where slot values contain numbers or proper nouns.

Latency Trade-off

Path B adds an extra hop. In our testing:

  Metric        Path A                  Path B                   Delta
  ------------  ----------------------  -----------------------  -----------
  STT latency   ~0ms (built into Lex)   200-500ms (Transcribe)   +200-500ms
  NLU latency   ~300ms                  ~200ms                   -100ms
  Total         ~300ms                  ~400-700ms               +100-400ms

The additional latency is noticeable but acceptable for most contact center scenarios. The accuracy improvement far outweighs the speed cost, especially when you factor in the time wasted on retries and agent transfers in the standard path.


Deploying in Production

The test harness proves the pattern. For production deployment in Amazon Connect, the architecture looks like this:

Caller --> Connect --> Kinesis Video Stream --> Lambda --> Transcribe Streaming
                                                             |
                                                             v
                                                    Clean Transcription
                                                             |
                                                             v
                                          Lex RecognizeText --> Contact Flow

The Lambda function sits between Connect's Kinesis Video Stream and Lex. It:

  1. Receives the audio stream from Connect
  2. Forwards it to Transcribe Streaming
  3. Takes the transcribed text
  4. Calls Lex RecognizeText with the clean transcription
  5. Returns the bot response back to the contact flow

This is a drop-in replacement for the standard Connect + Lex integration. The contact flow doesn't change. The caller experience doesn't change. The only difference is better accuracy.


Cost Analysis

For a contact center handling 10,000 calls/month with an average of 5 utterances per call:

  Service     Standard Path            Transcribe Pre-process Path
  ----------  -----------------------  ---------------------------
  Lex         $3.75 (50k audio reqs)   $0.75 (50k text reqs)
  Transcribe  $0                       $50 (833 minutes)
  Lambda      $0.50                    $1.00
  Total       $4.25/mo                 $51.75/mo

The Transcribe pre-processing path costs more. But consider the cost of agent time:

  • If the standard path causes 500 unnecessary agent transfers per month
  • At an average handle time of 5 minutes and $25/hour agent cost
  • That's $1,042/month in wasted agent time
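The arithmetic behind that figure:

```python
# Back-of-the-envelope check of the wasted-agent-time estimate above
transfers_per_month = 500
handle_time_minutes = 5
agent_cost_per_hour = 25

wasted_hours = transfers_per_month * handle_time_minutes / 60
wasted_dollars = wasted_hours * agent_cost_per_hour
print(f"${wasted_dollars:,.0f}/month")  # $1,042/month
```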

The Transcribe path pays for itself many times over by keeping callers in the self-service flow.


Getting Started

The entire project is open source and deploys with a single terraform apply.

Prerequisites

  • AWS account with Lex V2, Transcribe, Lambda, and S3 access
  • Terraform 1.5+ with both aws and awscc providers
  • Python 3.10+ for the test harness
  • A microphone (for live A/B testing)

Quick Start

# Deploy the infrastructure
cd terraform
terraform init
terraform apply

# Run the A/B test
cd ..
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python3 record_and_send.py

The test harness supports interactive commands:

  • Enter -- Record and send through both paths
  • a -- Path A only (direct Lex audio)
  • b -- Path B only (Transcribe pre-process)
  • n -- Start new sessions
  • q -- Quit

When to Use This Pattern

This pattern is most valuable when:

  • Your callers are frequently in noisy environments (field workers, drivers, public spaces)
  • You're seeing high fallback rates or unnecessary agent transfers
  • Your bot handles slot filling with numbers, names, or specific values
  • You need measurable accuracy data to justify the investment

It's less necessary when:

  • Callers are primarily in quiet office environments
  • Your bot only handles simple yes/no intents
  • Latency is more critical than accuracy (rare in contact centers)

Conclusion

The standard Amazon Lex audio pipeline works fine in ideal conditions. But contact centers don't operate in ideal conditions. By decoupling speech-to-text from natural language understanding, you can plug in a purpose-built STT engine that handles real-world noise.

The pattern is simple: Transcribe Streaming for STT, Lex RecognizeText for NLU. Two services, each doing what they do best.

The module is open source, deploys in minutes, and includes a test harness so you can measure the improvement in your own environment before committing to production changes.