
How AI Investigates AWS Alarms - Under the Hood

CloudWatch AI Team • December 16, 2025 • 5 min read

Tags: technical, aws, ai, architecture

Ever wondered how the CloudWatch AI Agent investigates alarms autonomously? Let's dive into the technical details of how our AI-powered investigation works.

The Investigation Flow

When a CloudWatch alarm triggers, here's what happens:

1. Alarm Reception (< 1 second)

CloudWatch Alarm → SNS Topic → Lambda Function

Our Lambda function receives the SNS notification containing:

Alarm name and description
Metric namespace and name
Current state and threshold
Resource dimensions (instance ID, etc.)
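As a rough illustration of this step, the sketch below pulls those fields out of an SNS-triggered Lambda event. The function name `parse_alarm_event` and the shape of the returned dictionary are hypothetical; the field names (`AlarmName`, `NewStateValue`, `Trigger`, and so on) follow the JSON that CloudWatch publishes to SNS for alarm state changes.

```python
import json
from typing import Any


def parse_alarm_event(event: dict[str, Any]) -> dict[str, Any]:
    """Extract the alarm fields listed above from an SNS-triggered Lambda event."""
    # SNS delivers the alarm as a JSON string in Records[0].Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})
    # Dimension keys are lowercase in alarm notifications; accept either case.
    dimensions = {
        d.get("name", d.get("Name")): d.get("value", d.get("Value"))
        for d in trigger.get("Dimensions", [])
    }
    return {
        "alarm_name": message["AlarmName"],
        "description": message.get("AlarmDescription"),
        "namespace": trigger.get("Namespace"),
        "metric_name": trigger.get("MetricName"),
        "state": message.get("NewStateValue"),
        "threshold": trigger.get("Threshold"),
        "dimensions": dimensions,
    }
```

The resulting dictionary is small enough to hand to the agent as the starting context for its investigation.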

2. AI Agent Initialization (1-2 seconds)

The Lambda function initializes an AI agent backed by the Amazon Nova Lite model on Amazon Bedrock, with access to six specialized AWS tools:

agent = StrandsAgent(
    model="amazon.nova-lite-v1:0",
    tools=[
        get_metric_statistics,
        describe_ec2_instance,
        describe_rds_instance,
        describe_lambda_function,
        search_cloudwatch_logs,
        get_alarm_history
    ]
)

3. Autonomous Investigation (8-15 seconds)

The AI agent automatically decides which tools to use based on the alarm type. Here's a typical investigation sequence:

Step 1: Check Current Metrics

# Agent calls: get_metric_statistics()
metrics = get_metric_statistics(
    namespace="AWS/EC2",
    metric_name="CPUUtilization",
    instance_id="i-1234567890abcdef0",
    period_minutes=60  # lookback window: the last 60 minutes of datapoints
)

Result: Sees the most recent CPU datapoints, not just the alarm threshold

Step 2: Inspect Resource Configuration

# Agent calls: describe_ec2_instance()
instance = describe_ec2_instance(
    instance_id="i-1234567890abcdef0"
)

Result: Discovers instance type, size, launch time, networking

Step 3: Check Historical Patterns

# Agent calls: get_alarm_history()
history = get_alarm_history(
    alarm_name="prod-web-high-cpu",
    hours=24
)

Result: Identifies whether this is a recurring issue

Step 4: Search for Errors (if relevant)

# Agent calls: search_cloudwatch_logs()
logs = search_cloudwatch_logs(
    log_group="/aws/ec2/i-1234567890abcdef0",
    filter_pattern="ERROR",
    hours=1
)

Result: Finds any recent errors in application logs

4. AI Analysis (2-3 seconds)

The agent synthesizes all the collected data and generates:

What happened: Plain-language explanation of the alarm
Current status: Real data from AWS queries
Root cause: Data-driven diagnosis
Remediation: Specific, actionable steps

5. Slack Delivery (1-2 seconds)

The formatted message is posted to Slack with:

Color-coded severity (green/yellow/red)
Structured blocks for easy reading
All investigation findings
Actionable recommendations
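One way to build such a message is sketched below, assuming delivery through a Slack incoming webhook. The function names, the severity labels, and the specific hex colors are illustrative choices, not taken from the actual implementation; the payload uses Slack's `attachments` field for the colored bar and Block Kit `header` and `section` blocks for the body.

```python
import json
import urllib.request

# Hypothetical severity-to-color mapping (green / yellow / red).
SEVERITY_COLORS = {"ok": "#2eb886", "warning": "#daa038", "critical": "#a30200"}


def build_slack_payload(alarm_name, severity, findings, recommendations):
    """Build a webhook payload with a colored bar and structured blocks."""
    def bullet_list(title, items):
        text = f"*{title}*\n" + "\n".join(f"• {item}" for item in items)
        return {"type": "section", "text": {"type": "mrkdwn", "text": text}}

    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": f"Alarm: {alarm_name}"}},
        bullet_list("Findings", findings),
        bullet_list("Recommendations", recommendations),
    ]
    color = SEVERITY_COLORS.get(severity, SEVERITY_COLORS["warning"])
    return {"attachments": [{"color": color, "blocks": blocks}]}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON; returns the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Keeping payload construction separate from the HTTP call makes the formatting easy to test without touching Slack.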

The 6 AWS Investigation Tools

Tool 1: get_metric_statistics

Purpose: Get recent CloudWatch metric values

Example Use:

{
  "namespace": "AWS/RDS",
  "metric_name": "DatabaseConnections",
  "db_instance_id": "prod-database",
  "statistic": "Average",
  "period_minutes": 60
}

Returns:

Recent DatabaseConnections (last 60 minutes):
  2025-12-16 14:00 UTC: 245
  2025-12-16 14:05 UTC: 289
  2025-12-16 14:10 UTC: 312
  2025-12-16 14:15 UTC: 298
Average: 286
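A tool like this could be a thin wrapper around the CloudWatch `GetMetricStatistics` API. The sketch below is an assumption about how that might look, not the actual implementation: it takes a boto3-style CloudWatch client as a parameter, treats `period_minutes` as the lookback window, and hard-codes a 5-minute sample period to match the output above.

```python
from datetime import datetime, timedelta, timezone


def get_metric_statistics(client, namespace, metric_name, dimensions,
                          statistic="Average", period_minutes=60, now=None):
    """Fetch recent datapoints and format them as plain text for the agent.

    `client` is assumed to be a boto3 CloudWatch client; `dimensions` is a
    plain dict such as {"DBInstanceIdentifier": "prod-database"}.
    """
    end = now or datetime.now(timezone.utc)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{"Name": k, "Value": v} for k, v in dimensions.items()],
        StartTime=end - timedelta(minutes=period_minutes),
        EndTime=end,
        Period=300,  # assumed 5-minute granularity
        Statistics=[statistic],
    )
    # CloudWatch does not guarantee datapoint order, so sort by time.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    if not points:
        return f"No {metric_name} datapoints in the last {period_minutes} minutes"
    lines = [f"Recent {metric_name} (last {period_minutes} minutes):"]
    for point in points:
        lines.append(f"  {point['Timestamp']:%Y-%m-%d %H:%M} UTC: {point[statistic]:g}")
    mean = sum(p[statistic] for p in points) / len(points)
    lines.append(f"Average: {mean:g}")
    return "\n".join(lines)
```

Returning formatted text rather than raw JSON keeps the tool result compact, which matters when the result is fed back into a model prompt.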

Tool 2: describe_ec2_instance

Purpose: Get EC2 instance details

Returns:

Instance type and size
State (running, stopped, etc.)
Private/public IP addresses
Launch time
VPC and security groups
Tags
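For illustration, a minimal version of this tool might flatten the `DescribeInstances` response into just these fields. The sketch assumes a boto3-style EC2 client passed in by the caller; the output key names are hypothetical.

```python
def describe_ec2_instance(client, instance_id):
    """Return a compact summary of one EC2 instance.

    `client` is assumed to be a boto3 EC2 client.
    """
    response = client.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    return {
        "instance_type": instance["InstanceType"],
        "state": instance["State"]["Name"],
        # IP addresses and VPC are absent for some instance states.
        "private_ip": instance.get("PrivateIpAddress"),
        "public_ip": instance.get("PublicIpAddress"),
        "launch_time": instance["LaunchTime"].isoformat(),
        "vpc_id": instance.get("VpcId"),
        "security_groups": [g["GroupId"] for g in instance.get("SecurityGroups", [])],
        "tags": {t["Key"]: t["Value"] for t in instance.get("Tags", [])},
    }
```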

Tool 3: describe_rds_instance

Purpose: Get RDS database details

Returns:

Database engine and version
Instance class
Storage type and size
Multi-AZ configuration
Endpoint and availability zone
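A comparable sketch for RDS, again assuming a boto3-style client supplied by the caller and hypothetical output key names, reads these fields from the `DescribeDBInstances` response:

```python
def describe_rds_instance(client, db_instance_id):
    """Return a compact summary of one RDS instance.

    `client` is assumed to be a boto3 RDS client.
    """
    response = client.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    db = response["DBInstances"][0]
    # The endpoint is missing while an instance is still being created.
    endpoint = db.get("Endpoint", {})
    return {
        "engine": db["Engine"],
        "engine_version": db["EngineVersion"],
        "instance_class": db["DBInstanceClass"],
        "storage_type": db.get("StorageType"),
        "allocated_storage_gib": db.get("AllocatedStorage"),
        "multi_az": db.get("MultiAZ"),
        "endpoint": endpoint.get("Address"),
        "port": endpoint.get("Port"),
        "availability_zone": db.get("AvailabilityZone"),
    }
```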

Tool 4: describe_lambda_function

Purpose: Get Lambda function details

Returns:

Runtime and handler
Memory and timeout settings
Last modified time
VPC configuration
Code size
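The same pattern could apply to Lambda. This sketch assumes a boto3-style Lambda client and calls `get_function`, since `lambda:GetFunction` is the action listed in the IAM section below; only the `Configuration` part of the response is used, and the output key names are hypothetical.

```python
def describe_lambda_function(client, function_name):
    """Return a compact summary of one Lambda function.

    `client` is assumed to be a boto3 Lambda client.
    """
    config = client.get_function(FunctionName=function_name)["Configuration"]
    return {
        "runtime": config.get("Runtime"),  # absent for container-image functions
        "handler": config.get("Handler"),
        "memory_mb": config["MemorySize"],
        "timeout_seconds": config["Timeout"],
        "last_modified": config["LastModified"],
        # VpcConfig is only present for VPC-attached functions.
        "vpc_id": config.get("VpcConfig", {}).get("VpcId"),
        "code_size_bytes": config["CodeSize"],
    }
```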

Tool 5: search_cloudwatch_logs

Purpose: Search logs for patterns

Example:

{
  "log_group": "/aws/lambda/api-handler",
  "filter_pattern": "ERROR",
  "hours": 1
}

Returns:

Found 5 matching entries:
[2025-12-16 14:25:33] ERROR: Connection timeout
[2025-12-16 14:26:15] ERROR: Retry failed
...
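A log search like this maps naturally onto the CloudWatch Logs `FilterLogEvents` API. The sketch below is one possible shape, assuming a boto3-style Logs client passed in by the caller; the `limit` default is an arbitrary choice, and a single call is made without following pagination tokens.

```python
from datetime import datetime, timedelta, timezone


def search_cloudwatch_logs(client, log_group, filter_pattern, hours=1,
                           limit=50, now=None):
    """Search one log group and format matching events as plain text.

    `client` is assumed to be a boto3 CloudWatch Logs client.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = client.filter_log_events(
        logGroupName=log_group,
        filterPattern=filter_pattern,
        # The Logs API expects epoch milliseconds.
        startTime=int(start.timestamp() * 1000),
        endTime=int(end.timestamp() * 1000),
        limit=limit,
    )
    events = response.get("events", [])
    lines = [f"Found {len(events)} matching entries:"]
    for event in events:
        ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        lines.append(f"[{ts:%Y-%m-%d %H:%M:%S}] {event['message'].strip()}")
    return "\n".join(lines)
```

Capping the number of events keeps the tool result short; a busy log group can otherwise return far more text than is useful in a prompt.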

Tool 6: get_alarm_history

Purpose: Get alarm state change history

Returns:

State changes (last 24 hours):
[2025-12-16 14:30] ALARM (from OK)
[2025-12-16 12:15] OK (from ALARM)
[2025-12-16 10:45] ALARM (from OK)
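A sketch of how this tool might be built on the CloudWatch `DescribeAlarmHistory` API follows. It assumes a boto3-style CloudWatch client and reads the old and new states from the JSON in each item's `HistoryData` field. Returning a count of transitions into ALARM alongside the text is an illustrative addition that gives a simple signal for recurring issues.

```python
import json
from datetime import datetime, timedelta, timezone


def get_alarm_history(client, alarm_name, hours=24, now=None):
    """Return (formatted state changes, number of transitions into ALARM).

    `client` is assumed to be a boto3 CloudWatch client.
    """
    end = now or datetime.now(timezone.utc)
    response = client.describe_alarm_history(
        AlarmName=alarm_name,
        HistoryItemType="StateUpdate",  # state transitions only
        StartDate=end - timedelta(hours=hours),
        EndDate=end,
    )
    # Newest first, matching the output format shown above.
    items = sorted(response["AlarmHistoryItems"],
                   key=lambda item: item["Timestamp"], reverse=True)
    lines = [f"State changes (last {hours} hours):"]
    alarm_count = 0
    for item in items:
        data = json.loads(item["HistoryData"])
        new_state = data["newState"]["stateValue"]
        old_state = data["oldState"]["stateValue"]
        if new_state == "ALARM":
            alarm_count += 1
        lines.append(f"[{item['Timestamp']:%Y-%m-%d %H:%M}] {new_state} (from {old_state})")
    return "\n".join(lines), alarm_count
```

With the sample history above, the count would be 2, which suggests the alarm is flapping rather than firing once.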

Why This Approach Works

1. Data-Driven, Not Guesswork

Traditional alerts say only that a threshold was crossed, so any suggested cause is generic. Our agent bases its conclusions on the actual state of your AWS resources at the time of the alarm.

2. Autonomous Decision Making

The AI decides which tools to use and in what order based on the alarm context. No manual runbooks needed.

3. Parallel Execution

When possible, the agent runs multiple tool queries in parallel to minimize investigation time.
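Because the tool calls are independent, I/O-bound API requests, a thread pool is one straightforward way to overlap them. The helper below is a hypothetical sketch using only the standard library; it is not taken from the agent framework itself. Each tool call is passed as a zero-argument callable, and a failure in one call is captured instead of aborting the others.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools_in_parallel(calls, max_workers=4):
    """Run independent tool calls concurrently.

    `calls` maps a label to a zero-argument callable. The result maps each
    label to the call's return value, or to the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the calls overlap.
        futures = {name: pool.submit(fn) for name, fn in calls.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going if one tool fails
                results[name] = exc
    return results
```

With this shape, the total wait is roughly that of the slowest call rather than the sum of all calls.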

4. Read-Only & Safe

All tools are strictly read-only. The agent can query but never modify resources.

Performance Metrics

Average investigation time: 12-15 seconds
Tool calls per alarm: 2-4 (average)
Accuracy rate: 95%+ for common scenarios
Cost per alarm: ~$0.001 (Bedrock + API calls)

Security Considerations

IAM Permissions

The investigation tools run with least-privilege IAM permissions, limited to the read actions they need:

Action = [
  "cloudwatch:GetMetricStatistics",
  "ec2:DescribeInstances",
  "rds:DescribeDBInstances",
  "lambda:GetFunction",
  "logs:FilterLogEvents",
  "cloudwatch:DescribeAlarmHistory"
]

No write, delete, or modify permissions are granted.
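A claim like this can be guarded with a small check in CI. The sketch below is a hypothetical helper, not part of the described system: it flags any action whose operation name does not start with a read-style verb. The prefix list is a heuristic and an assumption; some `Get*` actions in AWS still return sensitive data, so this complements a policy review instead of replacing it.

```python
# Heuristic: verbs that conventionally indicate read-only AWS actions.
READ_ONLY_VERBS = ("Get", "Describe", "List", "Filter")


def non_read_only_actions(actions):
    """Return the actions that do not look read-only, preserving order."""
    offending = []
    for action in actions:
        _service, _, operation = action.partition(":")
        # Wildcards such as "ec2:*" are flagged too, since they allow writes.
        if not operation.startswith(READ_ONLY_VERBS):
            offending.append(action)
    return offending


POLICY_ACTIONS = [
    "cloudwatch:GetMetricStatistics",
    "ec2:DescribeInstances",
    "rds:DescribeDBInstances",
    "lambda:GetFunction",
    "logs:FilterLogEvents",
    "cloudwatch:DescribeAlarmHistory",
]
```

Running `non_read_only_actions(POLICY_ACTIONS)` on the list above returns an empty list; adding an action such as `ec2:TerminateInstances` would cause it to be reported.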

Data Handling

No sensitive data is logged
All queries use encrypted connections
Slack webhook URLs are stored securely
No data is retained after analysis

What's Next

We're working on enhancing the investigation capabilities:

Cross-resource correlation: Analyze related resources together
Pattern detection: Identify common failure patterns
Predictive analysis: Warn before alarms trigger
Custom tools: Let users add their own investigation tools

Try It Yourself

Want to see AI-powered investigation in action?

Get started today and transform your CloudWatch monitoring experience.

Questions about the technical implementation? Contact us or read our full documentation.