Ever wondered how the CloudWatch AI Agent investigates alarms autonomously? Let's dive into the technical details of how our AI-powered investigation works.
The Investigation Flow
When a CloudWatch alarm triggers, here's what happens:
1. Alarm Reception (< 1 second)
CloudWatch Alarm → SNS Topic → Lambda Function
Our Lambda function receives the SNS notification containing:
- Alarm name and description
- Metric namespace and name
- Current state and threshold
- Resource dimensions (instance ID, etc.)
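To make the reception step concrete, here is a minimal sketch of how a handler might pull those fields out of the SNS event. The function name `parse_alarm_event` and the shape of the returned dictionary are assumptions for illustration, not our production code; the field names read from the message (`AlarmName`, `Trigger`, `NewStateValue`, and so on) follow the JSON that CloudWatch publishes to SNS.

```python
import json


def parse_alarm_event(event):
    """Extract alarm fields from an SNS-triggered Lambda event (illustrative helper)."""
    # SNS delivers the CloudWatch alarm as a JSON string in Records[0].Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})
    # Alarm notifications list dimensions as {"name": ..., "value": ...} pairs.
    dimensions = {d["name"]: d["value"] for d in trigger.get("Dimensions", [])}
    return {
        "alarm_name": message.get("AlarmName"),
        "description": message.get("AlarmDescription"),
        "namespace": trigger.get("Namespace"),
        "metric_name": trigger.get("MetricName"),
        "state": message.get("NewStateValue"),
        "threshold": trigger.get("Threshold"),
        "dimensions": dimensions,
    }
```

Flattening the dimensions into a plain dictionary makes it easy to hand an instance ID or database identifier to the tools later on.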
2. AI Agent Initialization (1-2 seconds)
The Lambda function initializes an AI agent powered by the Amazon Nova Lite model on Amazon Bedrock, with access to six specialized AWS tools:
agent = StrandsAgent(
    model="amazon.nova-lite-v1:0",
    tools=[
        get_metric_statistics,
        describe_ec2_instance,
        describe_rds_instance,
        describe_lambda_function,
        search_cloudwatch_logs,
        get_alarm_history,
    ],
)
3. Autonomous Investigation (8-15 seconds)
The AI agent automatically decides which tools to use based on the alarm type. Here's a typical investigation sequence:
Step 1: Check Current Metrics
# Agent calls: get_metric_statistics()
metrics = get_metric_statistics(
    namespace="AWS/EC2",
    metric_name="CPUUtilization",
    instance_id="i-1234567890abcdef0",
    period_minutes=60
)
Result: The agent sees the actual CPU values from the last hour, not just the alarm threshold
Step 2: Inspect Resource Configuration
# Agent calls: describe_ec2_instance()
instance = describe_ec2_instance(
    instance_id="i-1234567890abcdef0"
)
Result: Discovers instance type, size, launch time, networking
Step 3: Check Historical Patterns
# Agent calls: get_alarm_history()
history = get_alarm_history(
    alarm_name="prod-web-high-cpu",
    hours=24
)
Result: Identifies if this is a recurring issue
Step 4: Search for Errors (if relevant)
# Agent calls: search_cloudwatch_logs()
logs = search_cloudwatch_logs(
    log_group="/aws/ec2/i-1234567890abcdef0",
    filter_pattern="ERROR",
    hours=1
)
Result: Finds any recent errors in application logs
4. AI Analysis (2-3 seconds)
The agent synthesizes all the collected data and generates:
- What happened: Plain-language explanation of the alarm
- Current status: Real data from AWS queries
- Root cause: Data-driven diagnosis
- Remediation: Specific, actionable steps
5. Slack Delivery (1-2 seconds)
The formatted message is posted to Slack with:
- Color-coded severity (green/yellow/red)
- Structured blocks for easy reading
- All investigation findings
- Actionable recommendations
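As a rough illustration of the delivery step, the sketch below builds a color-coded Slack payload with structured blocks and posts it to an incoming webhook. The state-to-color mapping, the function names, and the two section titles are assumptions made for this example; the payload shape (an attachment with `color` and `blocks`) is the standard Slack webhook format.

```python
import json
import urllib.request

# Assumed mapping from alarm state to the green/yellow/red scheme.
SEVERITY_COLORS = {
    "OK": "#2eb886",                 # green
    "INSUFFICIENT_DATA": "#daa038",  # yellow
    "ALARM": "#a30200",              # red
}


def build_slack_payload(alarm_name, state, findings, recommendations):
    """Build a webhook payload with a colored attachment and Block Kit sections."""

    def section(title, items):
        bullets = "\n".join(f"- {item}" for item in items)
        return {"type": "section", "text": {"type": "mrkdwn", "text": f"*{title}*\n{bullets}"}}

    return {
        "attachments": [
            {
                # Unknown states fall back to yellow.
                "color": SEVERITY_COLORS.get(state, "#daa038"),
                "blocks": [
                    {"type": "header", "text": {"type": "plain_text", "text": f"{state}: {alarm_name}"}},
                    section("Findings", findings),
                    section("Recommendations", recommendations),
                ],
            }
        ]
    }


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Keeping payload construction separate from the HTTP call means the message format can be unit-tested without touching Slack.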
The 6 AWS Investigation Tools
Tool 1: get_metric_statistics
Purpose: Get recent CloudWatch metric values
Example Use:
{
  "namespace": "AWS/RDS",
  "metric_name": "DatabaseConnections",
  "db_instance_id": "prod-database",
  "statistic": "Average",
  "period_minutes": 60
}
Returns:
Recent DatabaseConnections (last 60 minutes):
2025-12-16 14:00 UTC: 245
2025-12-16 14:05 UTC: 289
2025-12-16 14:10 UTC: 312
2025-12-16 14:15 UTC: 298
Average: 286
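For readers curious what sits behind a tool like this, here is a minimal sketch built on the boto3 `get_metric_statistics` call. It is a simplified stand-in, not our exact implementation: the generic `dimensions` dictionary (instead of per-service ID parameters), the fixed 5-minute period, and the optional `client` argument used for testing are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def get_metric_statistics(namespace, metric_name, dimensions, statistic="Average",
                          period_minutes=60, client=None):
    """Return recent datapoints for one metric as a short text report."""
    if client is None:
        import boto3  # imported lazily so a stub client can be injected in tests
        client = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=period_minutes)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{"Name": k, "Value": v} for k, v in dimensions.items()],
        StartTime=start,
        EndTime=end,
        Period=300,  # assumed 5-minute buckets, as in the sample output above
        Statistics=[statistic],
    )
    # CloudWatch does not guarantee datapoint order, so sort by timestamp.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    if not points:
        return f"No {metric_name} datapoints in the last {period_minutes} minutes"

    lines = [f"Recent {metric_name} (last {period_minutes} minutes):"]
    for p in points:
        lines.append(f"{p['Timestamp']:%Y-%m-%d %H:%M} UTC: {p[statistic]:g}")
    mean = sum(p[statistic] for p in points) / len(points)
    lines.append(f"Average: {mean:.0f}")
    return "\n".join(lines)
```

Returning plain text rather than raw JSON keeps the tool output compact, which matters when the result is fed back to the model as context.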
Tool 2: describe_ec2_instance
Purpose: Get EC2 instance details
Returns:
- Instance type and size
- State (running, stopped, etc.)
- Private/public IP addresses
- Launch time
- VPC and security groups
- Tags
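A tool like this can be a thin wrapper around the EC2 `describe_instances` call. The sketch below shows one way to reduce the response to the fields listed above; the output key names and the injectable `client` parameter are assumptions for illustration.

```python
def describe_ec2_instance(instance_id, client=None):
    """Summarize one EC2 instance as a small dictionary (illustrative sketch)."""
    if client is None:
        import boto3  # imported lazily so a stub client can be injected in tests
        client = boto3.client("ec2")

    response = client.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    return {
        "instance_type": instance["InstanceType"],
        "state": instance["State"]["Name"],
        # IP addresses and VPC ID can be absent, for example on stopped instances.
        "private_ip": instance.get("PrivateIpAddress"),
        "public_ip": instance.get("PublicIpAddress"),
        "launch_time": instance["LaunchTime"].isoformat(),
        "vpc_id": instance.get("VpcId"),
        "security_groups": [g["GroupId"] for g in instance.get("SecurityGroups", [])],
        "tags": {t["Key"]: t["Value"] for t in instance.get("Tags", [])},
    }
```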
Tool 3: describe_rds_instance
Purpose: Get RDS database details
Returns:
- Database engine and version
- Instance class
- Storage type and size
- Multi-AZ configuration
- Endpoint and availability zone
Tool 4: describe_lambda_function
Purpose: Get Lambda function details
Returns:
- Runtime and handler
- Memory and timeout settings
- Last modified time
- VPC configuration
- Code size
Tool 5: search_cloudwatch_logs
Purpose: Search logs for patterns
Example:
{
  "log_group": "/aws/lambda/api-handler",
  "filter_pattern": "ERROR",
  "hours": 1
}
Returns:
Found 5 matching entries:
[2025-12-16 14:25:33] ERROR: Connection timeout
[2025-12-16 14:26:15] ERROR: Retry failed
...
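Here is a minimal sketch of how a log search like this could be written with the CloudWatch Logs `filter_log_events` call. The `limit` default and the injectable `client` parameter are assumptions, and for brevity the sketch reads only the first page of results rather than following `nextToken`.

```python
from datetime import datetime, timedelta, timezone


def search_cloudwatch_logs(log_group, filter_pattern, hours=1, limit=50, client=None):
    """Search one log group for a pattern and return a short text report."""
    if client is None:
        import boto3  # imported lazily so a stub client can be injected in tests
        client = boto3.client("logs")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = client.filter_log_events(
        logGroupName=log_group,
        filterPattern=filter_pattern,
        startTime=int(start.timestamp() * 1000),  # the API expects epoch milliseconds
        endTime=int(end.timestamp() * 1000),
        limit=limit,
    )
    events = response.get("events", [])
    lines = [f"Found {len(events)} matching entries:"]
    for e in events:
        ts = datetime.fromtimestamp(e["timestamp"] / 1000, tz=timezone.utc)
        lines.append(f"[{ts:%Y-%m-%d %H:%M:%S}] {e['message'].strip()}")
    return "\n".join(lines)
```

Capping the number of events keeps the tool output small enough to pass back to the model without flooding its context.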
Tool 6: get_alarm_history
Purpose: Get alarm state change history
Returns:
State changes (last 24 hours):
[2025-12-16 14:30] ALARM (from OK)
[2025-12-16 12:15] OK (from ALARM)
[2025-12-16 10:45] ALARM (from OK)
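To show how history like this can answer the "is it recurring?" question, the sketch below counts transitions into ALARM in text formatted as above. In our system the model does this reasoning itself; the helper names and the threshold of two transitions are hypothetical and only illustrate the idea.

```python
import re

# Matches lines such as "[2025-12-16 14:30] ALARM (from OK)".
STATE_CHANGE = re.compile(r"\[(?P<time>[^\]]+)\]\s+(?P<new>\w+)\s+\(from\s+(?P<old>\w+)\)")


def count_alarm_transitions(history_text):
    """Count how many state changes in the report moved into ALARM."""
    count = 0
    for line in history_text.splitlines():
        match = STATE_CHANGE.search(line)
        if match and match["new"] == "ALARM":
            count += 1
    return count


def is_recurring(history_text, min_transitions=2):
    """Treat the alarm as recurring if it fired at least min_transitions times."""
    return count_alarm_transitions(history_text) >= min_transitions
```

Applied to the sample history above, the alarm entered ALARM twice in 24 hours, so this heuristic would flag it as recurring.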
Why This Approach Works
1. Data-Driven, Not Guesswork
Traditional alerts offer generic suggestions based only on the metric that fired. Our agent bases its conclusions on the actual state of your AWS resources at the time of the alarm.
2. Autonomous Decision Making
The AI decides which tools to use and in what order based on the alarm context. No manual runbooks needed.
3. Parallel Execution
When possible, the agent runs multiple tool queries in parallel to minimize investigation time.
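One simple way to run independent tool calls side by side is a thread pool, since the calls are I/O-bound AWS API requests. The sketch below is an illustration of that pattern, not a description of how the agent framework schedules tools internally; the helper name and the `calls` structure are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools_in_parallel(calls, max_workers=4):
    """Run independent tool calls concurrently.

    calls maps a label to a (function, kwargs) pair.
    Returns a dict of label -> result, or the exception if that call failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {label: pool.submit(fn, **kwargs) for label, (fn, kwargs) in calls.items()}
        for label, future in futures.items():
            try:
                results[label] = future.result()
            except Exception as exc:
                # One failed query should not abort the whole investigation.
                results[label] = exc
    return results
```

With this pattern the total wait is roughly the duration of the slowest call rather than the sum of all of them.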
4. Read-Only & Safe
All tools are strictly read-only. The agent can query but never modify resources.
Performance Metrics
- Average investigation time: 12-15 seconds
- Tool calls per alarm: 2-4 (average)
- Accuracy rate: 95%+ for common scenarios
- Cost per alarm: ~$0.001 (Bedrock + API calls)
Security Considerations
IAM Permissions
The Lambda function has least-privilege IAM permissions:
Action = [
  "cloudwatch:GetMetricStatistics",
  "ec2:DescribeInstances",
  "rds:DescribeDBInstances",
  "lambda:GetFunction",
  "logs:FilterLogEvents",
  "cloudwatch:DescribeAlarmHistory"
]
No write, delete, or modify permissions are granted.
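If you want to guard against a write permission slipping into a policy like this, a small check in CI can help. The sketch below flags any action whose operation name does not start with a read-style verb. The prefix list and function name are assumptions, and a name-based check is only a heuristic, so treat it as a safety net rather than a full policy audit.

```python
# Assumed set of verbs that indicate a read-only operation.
READ_ONLY_PREFIXES = ("Get", "Describe", "List", "Filter")


def non_read_only_actions(actions):
    """Return the IAM actions that do not look read-only by name."""
    flagged = []
    for action in actions:
        _service, _, operation = action.partition(":")
        # Wildcards such as "ec2:*" or "*" fail the prefix test and are flagged too.
        if not operation.startswith(READ_ONLY_PREFIXES):
            flagged.append(action)
    return flagged
```

Run against the six actions listed above, the check returns an empty list; adding something like `ec2:TerminateInstances` would be flagged.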
Data Handling
- No sensitive data is logged
- All queries use encrypted connections
- Slack webhook URLs are stored securely
- No data is retained after analysis
What's Next
We're working on enhancing the investigation capabilities:
- Cross-resource correlation: Analyze related resources together
- Pattern detection: Identify common failure patterns
- Predictive analysis: Warn before alarms trigger
- Custom tools: Let users add their own investigation tools
Try It Yourself
Want to see AI-powered investigation in action?
Get started today and transform your CloudWatch monitoring experience.
Questions about the technical implementation? Contact us or read our full documentation.