
Ever wondered how the CloudWatch AI Agent investigates alarms autonomously? Let's dive into the technical details of how our AI-powered investigation works.

The Investigation Flow

When a CloudWatch alarm triggers, here's what happens:

1. Alarm Reception (< 1 second)

CloudWatch Alarm → SNS Topic → Lambda Function

Our Lambda function receives the SNS notification containing:

  • Alarm name and description
  • Metric namespace and name
  • Current state and threshold
  • Resource dimensions (instance ID, etc.)
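
To make the notification contents concrete, here is a minimal sketch of how a handler might pull those fields out of the SNS event. The `AlarmContext` class and `parse_alarm_event` function are hypothetical names for illustration, not our actual handler code; the key names follow the JSON that CloudWatch publishes to SNS for metric alarms.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AlarmContext:
    """Fields the investigation needs from a CloudWatch alarm notification."""
    name: str
    description: str
    namespace: str
    metric_name: str
    state: str
    threshold: float
    dimensions: dict = field(default_factory=dict)


def parse_alarm_event(event: dict) -> AlarmContext:
    """Extract alarm details from an SNS-triggered Lambda event."""
    # SNS delivers the alarm as a JSON string inside Records[0].Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})
    # Dimensions arrive as a list of {"name": ..., "value": ...} pairs.
    dimensions = {d["name"]: d["value"] for d in trigger.get("Dimensions", [])}
    return AlarmContext(
        name=message["AlarmName"],
        description=message.get("AlarmDescription") or "",
        namespace=trigger.get("Namespace", ""),
        metric_name=trigger.get("MetricName", ""),
        state=message["NewStateValue"],
        threshold=float(trigger.get("Threshold", 0.0)),
        dimensions=dimensions,
    )


# Hand-built sample event for illustration (values are made up).
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "prod-web-high-cpu",
                "AlarmDescription": "CPU above 80% for 5 minutes",
                "NewStateValue": "ALARM",
                "Trigger": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Threshold": 80.0,
                    "Dimensions": [
                        {"name": "InstanceId", "value": "i-1234567890abcdef0"}
                    ],
                },
            })
        }
    }]
}

context = parse_alarm_event(sample_event)
print(context.name, context.state, context.dimensions)
```

Using `.get` for the `Trigger` fields keeps the parser from failing on alarms whose trigger block lacks a single metric name, such as metric-math alarms.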

2. AI Agent Initialization (1-2 seconds)

The Lambda function initializes an AI agent powered by Amazon Bedrock's Nova Lite model with access to 6 specialized AWS tools:

agent = StrandsAgent(
    model="amazon.nova-lite-v1:0",
    tools=[
        get_metric_statistics,
        describe_ec2_instance,
        describe_rds_instance,
        describe_lambda_function,
        search_cloudwatch_logs,
        get_alarm_history
    ]
)

3. Autonomous Investigation (8-15 seconds)

The AI agent automatically decides which tools to use based on the alarm type. Here's a typical investigation sequence:

Step 1: Check Current Metrics

# Agent calls: get_metric_statistics()
metrics = get_metric_statistics(
    namespace="AWS/EC2",
    metric_name="CPUUtilization",
    instance_id="i-1234567890abcdef0",
    period_minutes=60
)

Result: Sees the actual CPU values over the last hour, not just the fact that a threshold was crossed

Step 2: Inspect Resource Configuration

# Agent calls: describe_ec2_instance()
instance = describe_ec2_instance(
    instance_id="i-1234567890abcdef0"
)

Result: Discovers the instance type, launch time, and networking configuration

Step 3: Check Historical Patterns

# Agent calls: get_alarm_history()
history = get_alarm_history(
    alarm_name="prod-web-high-cpu",
    hours=24
)

Result: Identifies if this is a recurring issue

Step 4: Search for Errors (if relevant)

# Agent calls: search_cloudwatch_logs()
logs = search_cloudwatch_logs(
    log_group="/aws/ec2/i-1234567890abcdef0",
    filter_pattern="ERROR",
    hours=1
)

Result: Finds any recent errors in application logs

4. AI Analysis (2-3 seconds)

The agent synthesizes all the collected data and generates:

  • What happened: Plain-language explanation of the alarm
  • Current status: Real data from AWS queries
  • Root cause: Data-driven diagnosis
  • Remediation: Specific, actionable steps

5. Slack Delivery (1-2 seconds)

The formatted message is posted to Slack with:

  • Color-coded severity (green/yellow/red)
  • Structured blocks for easy reading
  • All investigation findings
  • Actionable recommendations
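
As a rough illustration of the message structure, the sketch below builds a Slack incoming-webhook payload with a colored attachment and one Block Kit section per finding. The function name, the severity labels, and the color mapping are assumptions for this example; the exact layout our agent produces may differ.

```python
import json

# Hypothetical mapping from severity to attachment colors (green/yellow/red).
SEVERITY_COLORS = {"ok": "#2eb886", "warning": "#daa038", "critical": "#a30200"}


def build_slack_payload(alarm_name, severity, findings):
    """Build an incoming-webhook payload from investigation findings.

    `findings` maps a section title (e.g. "Root cause") to its text.
    """
    blocks = [{
        "type": "header",
        "text": {"type": "plain_text", "text": f"Alarm: {alarm_name}"},
    }]
    for title, text in findings.items():
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*{title}*\n{text}"},
        })
    # Unknown severities fall back to the most urgent color.
    color = SEVERITY_COLORS.get(severity, SEVERITY_COLORS["critical"])
    return {"attachments": [{"color": color, "blocks": blocks}]}


payload = build_slack_payload(
    "prod-web-high-cpu",
    "warning",
    {
        "What happened": "CPU utilization crossed the alarm threshold.",
        "Remediation": "Review recent deployments and consider scaling out.",
    },
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized to JSON and sent as the body of a POST request to the webhook URL.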

The 6 AWS Investigation Tools

Tool 1: get_metric_statistics

Purpose: Get recent CloudWatch metric values

Example Use:

{
  "namespace": "AWS/RDS",
  "metric_name": "DatabaseConnections",
  "db_instance_id": "prod-database",
  "statistic": "Average",
  "period_minutes": 60
}

Returns:

Recent DatabaseConnections (last 60 minutes):
  2025-12-16 14:00 UTC: 245
  2025-12-16 14:05 UTC: 289
  2025-12-16 14:10 UTC: 312
  2025-12-16 14:15 UTC: 298
Average: 286
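
A tool like this can be a thin wrapper around the CloudWatch `GetMetricStatistics` API. The following is a sketch under assumptions rather than our production code: the `client` parameter (anything exposing boto3's `get_metric_statistics` call, such as `boto3.client("cloudwatch")`), the `dimensions` dictionary, and the fixed 5-minute bucket size are choices made for this example.

```python
from datetime import datetime, timedelta, timezone


def get_metric_statistics(client, namespace, metric_name, dimensions,
                          statistic="Average", period_minutes=60):
    """Fetch recent datapoints and format them as text for the agent."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=period_minutes)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{"Name": k, "Value": v} for k, v in dimensions.items()],
        StartTime=start,
        EndTime=end,
        Period=300,  # assumed 5-minute buckets, in seconds
        Statistics=[statistic],
    )
    # CloudWatch does not guarantee datapoint order, so sort by time.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    if not points:
        return f"No {metric_name} datapoints in the last {period_minutes} minutes"
    lines = [f"Recent {metric_name} (last {period_minutes} minutes):"]
    for p in points:
        lines.append(f"  {p['Timestamp']:%Y-%m-%d %H:%M} UTC: {p[statistic]:g}")
    mean = sum(p[statistic] for p in points) / len(points)
    lines.append(f"Average: {mean:g}")
    return "\n".join(lines)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching a real AWS account.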

Tool 2: describe_ec2_instance

Purpose: Get EC2 instance details

Returns:

  • Instance type and size
  • State (running, stopped, etc.)
  • Private/public IP addresses
  • Launch time
  • VPC and security groups
  • Tags
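
The fields listed above map closely onto the response of the EC2 `DescribeInstances` API. As a sketch, a hypothetical helper such as `summarize_ec2_instance` could flatten the response that boto3's `describe_instances` returns into a compact dictionary for the model. The output key names are assumptions made for this example.

```python
def summarize_ec2_instance(response):
    """Flatten a boto3 ec2.describe_instances response for one instance."""
    reservations = response.get("Reservations", [])
    if not reservations or not reservations[0].get("Instances"):
        return None  # no such instance
    inst = reservations[0]["Instances"][0]
    return {
        "instance_type": inst.get("InstanceType"),
        "state": inst.get("State", {}).get("Name"),
        "private_ip": inst.get("PrivateIpAddress"),
        "public_ip": inst.get("PublicIpAddress"),  # absent without a public IP
        "launch_time": inst.get("LaunchTime"),
        "vpc_id": inst.get("VpcId"),
        "security_groups": [g["GroupId"] for g in inst.get("SecurityGroups", [])],
        "tags": {t["Key"]: t["Value"] for t in inst.get("Tags", [])},
    }
```

Trimming the response this way keeps the text handed to the model short, since the raw API response carries many fields that are rarely relevant to an alarm.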

Tool 3: describe_rds_instance

Purpose: Get RDS database details

Returns:

  • Database engine and version
  • Instance class
  • Storage type and size
  • Multi-AZ configuration
  • Endpoint and availability zone

Tool 4: describe_lambda_function

Purpose: Get Lambda function details

Returns:

  • Runtime and handler
  • Memory and timeout settings
  • Last modified time
  • VPC configuration
  • Code size

Tool 5: search_cloudwatch_logs

Purpose: Search logs for patterns

Example:

{
  "log_group": "/aws/lambda/api-handler",
  "filter_pattern": "ERROR",
  "hours": 1
}

Returns:

Found 5 matching entries:
[2025-12-16 14:25:33] ERROR: Connection timeout
[2025-12-16 14:26:15] ERROR: Retry failed
...
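
For readers curious how such a search can be built, here is a minimal sketch on top of the CloudWatch Logs `FilterLogEvents` API. It assumes `client` exposes boto3's `filter_log_events` call (for example `boto3.client("logs")`). The `limit` default is an arbitrary choice for this example, and only the first page of results is read.

```python
from datetime import datetime, timedelta, timezone


def search_cloudwatch_logs(client, log_group, filter_pattern, hours=1, limit=50):
    """Return recent log events matching a filter pattern as text."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = client.filter_log_events(
        logGroupName=log_group,
        filterPattern=filter_pattern,
        # The Logs API expects milliseconds since the Unix epoch.
        startTime=int(start.timestamp() * 1000),
        endTime=int(end.timestamp() * 1000),
        limit=limit,
    )
    events = response.get("events", [])
    lines = [f"Found {len(events)} matching entries:"]
    for event in events:
        ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        lines.append(f"[{ts:%Y-%m-%d %H:%M:%S}] {event['message'].strip()}")
    return "\n".join(lines)
```

Capping the number of events keeps the tool output small enough to fit comfortably in the model's context.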

Tool 6: get_alarm_history

Purpose: Get alarm state change history

Returns:

State changes (last 24 hours):
[2025-12-16 14:30] ALARM (from OK)
[2025-12-16 12:15] OK (from ALARM)
[2025-12-16 10:45] ALARM (from OK)
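
A sketch of how this history could be gathered with the CloudWatch `DescribeAlarmHistory` API follows. It assumes `client` exposes boto3's `describe_alarm_history` call and relies on the summary text CloudWatch writes for state updates (for example "Alarm updated from OK to ALARM"). The function name and the recurrence count line are additions for this example, so the output format differs slightly from the sample above.

```python
from datetime import datetime, timedelta, timezone


def summarize_alarm_history(client, alarm_name, hours=24):
    """List recent state changes and count how often the alarm fired."""
    end = datetime.now(timezone.utc)
    response = client.describe_alarm_history(
        AlarmName=alarm_name,
        HistoryItemType="StateUpdate",  # skip configuration and action entries
        StartDate=end - timedelta(hours=hours),
        EndDate=end,
    )
    # Newest first, matching how the history is usually read.
    items = sorted(response.get("AlarmHistoryItems", []),
                   key=lambda item: item["Timestamp"], reverse=True)
    lines = [f"State changes (last {hours} hours):"]
    for item in items:
        lines.append(f"[{item['Timestamp']:%Y-%m-%d %H:%M}] {item['HistorySummary']}")
    fired = sum(1 for item in items if item["HistorySummary"].endswith("to ALARM"))
    lines.append(f"Entered ALARM {fired} time(s)")
    return "\n".join(lines)
```

A count above one within the window is a simple signal that the alarm is recurring rather than a one-off.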

Why This Approach Works

1. Data-Driven, Not Guesswork

Traditional alerts tell you that a threshold was crossed and, at best, offer generic suggestions. Our agent bases its conclusions on the actual state of your AWS resources at the time of the alarm.

2. Autonomous Decision Making

The AI decides which tools to use and in what order based on the alarm context. No manual runbooks needed.

3. Parallel Execution

When possible, the agent runs multiple tool queries in parallel to minimize investigation time.
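
One common way to run independent queries concurrently in Python is a thread pool, which suits I/O-bound AWS API calls. The sketch below illustrates the idea only and is not a description of how our agent schedules tools internally; `run_tools_in_parallel` and the placeholder lambdas are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools_in_parallel(calls, max_workers=4):
    """Run independent tool calls concurrently.

    `calls` maps a label to a zero-argument callable. Returns a dict of
    label -> result, or label -> the exception raised by that call.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {label: pool.submit(fn) for label, fn in calls.items()}
        for label, future in futures.items():
            try:
                results[label] = future.result()
            except Exception as exc:  # one failed query should not sink the rest
                results[label] = exc
    return results


# Placeholder callables standing in for real tool invocations.
results = run_tools_in_parallel({
    "metrics": lambda: "CPU at 92%",
    "history": lambda: "2 state changes",
})
print(results)
```

With this approach the total wait is roughly that of the slowest call rather than the sum of all calls.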

4. Read-Only & Safe

All tools are strictly read-only. The agent can query but never modify resources.

Performance Metrics

  • Average investigation time: 12-15 seconds
  • Tool calls per alarm: 2-4 (average)
  • Accuracy rate: 95%+ for common scenarios
  • Cost per alarm: ~$0.001 (Bedrock + API calls)
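
As a sanity check on the timing figure, the per-stage ranges quoted in the investigation flow can be added up to bound the end-to-end latency. The short calculation below treats "< 1 second" as a range of 0 to 1 seconds, which is an assumption.

```python
# Per-stage latency ranges in seconds, as quoted in the investigation flow.
# "< 1 second" is treated as the range 0-1.
STAGES = {
    "alarm reception": (0, 1),
    "agent initialization": (1, 2),
    "autonomous investigation": (8, 15),
    "AI analysis": (2, 3),
    "Slack delivery": (1, 2),
}

best_case = sum(low for low, _ in STAGES.values())
worst_case = sum(high for _, high in STAGES.values())
print(f"End-to-end: {best_case}-{worst_case} seconds")  # prints "End-to-end: 12-23 seconds"
```

The 12-15 second average sits at the low end of that 12-23 second envelope, which suggests most investigations finish near the fast end of each stage.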

Security Considerations

IAM Permissions

The Lambda function's IAM role follows least privilege and is limited to the read-only actions the six tools need:

Action = [
  "cloudwatch:GetMetricStatistics",
  "ec2:DescribeInstances",
  "rds:DescribeDBInstances",
  "lambda:GetFunction",
  "logs:FilterLogEvents",
  "cloudwatch:DescribeAlarmHistory"
]

No write, delete, or modify permissions are granted.
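
A claim like this can be checked mechanically. The sketch below flags any action whose operation name does not start with a typical read-only verb. The prefix list is a heuristic chosen for this example, not an authoritative classification of IAM actions, and `non_read_only_actions` is a hypothetical helper.

```python
# Heuristic: operation names starting with these verbs are treated as read-only.
READ_ONLY_PREFIXES = ("Get", "Describe", "List", "Filter")

ACTIONS = [
    "cloudwatch:GetMetricStatistics",
    "ec2:DescribeInstances",
    "rds:DescribeDBInstances",
    "lambda:GetFunction",
    "logs:FilterLogEvents",
    "cloudwatch:DescribeAlarmHistory",
]


def non_read_only_actions(actions):
    """Return the actions that do not look read-only under the heuristic."""
    offenders = []
    for action in actions:
        # Split "service:Operation" and inspect only the operation name.
        _, _, operation = action.partition(":")
        if not operation.startswith(READ_ONLY_PREFIXES):
            offenders.append(action)
    return offenders


print(non_read_only_actions(ACTIONS))
```

Running a check like this in CI would catch a write permission or a wildcard such as `ec2:*` slipping into the policy later.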

Data Handling

  • No sensitive data is logged
  • All queries use encrypted connections
  • Slack webhook URLs are stored securely
  • No data is retained after analysis

What's Next

We're working on enhancing the investigation capabilities:

  • Cross-resource correlation: Analyze related resources together
  • Pattern detection: Identify common failure patterns
  • Predictive analysis: Warn before alarms trigger
  • Custom tools: Let users add their own investigation tools

Try It Yourself

Want to see AI-powered investigation in action?

Get started today and transform your CloudWatch monitoring experience.


Questions about the technical implementation? Contact us or read our full documentation.