← Back to Blog

We've been building AI agents for a while now — our CloudWatch alarm agent, Jenkins Sentinel, internal Slack bots. Every time, we hit the same friction: agent frameworks that add thousands of lines of dependency code, abstractions that fight against AWS-native services, and deployment pipelines that need Docker, Makefiles, and custom packaging scripts just to get a Lambda function running.

So we built a module that strips all of that away. Define your agent's behavior in markdown. Write tools as plain Python functions. Deploy with terraform apply. We're open-sourcing it.

The Problem with Agent Frameworks

If you want to build an AI agent today, you're probably looking at LangChain, AutoGen, CrewAI, or similar libraries. They're powerful — but they come with trade-offs:

  • Dependency weight: LangChain alone pulls in 50+ transitive dependencies. That's a lot of surface area for a Lambda function.
  • Abstraction overhead: Most frameworks wrap the LLM API in layers of classes, chains, and runnables. When something breaks, you're debugging the framework, not your logic.
  • Deployment complexity: Getting a framework-heavy agent into Lambda means Docker builds, large layers, and cold start penalties.
  • Vendor lock-in: Framework-specific concepts (chains, agents, runnables) don't transfer. Switch frameworks and you rewrite everything.

What if you could skip all of that and talk directly to Bedrock's Converse API — with just enough structure to make it production-ready?


How It Works

The module deploys a complete AI agent stack on AWS. The entire runtime engine is ~140 lines of Python that uses only boto3 — which is already in the Lambda runtime. Zero external dependencies for the core loop.

┌─────────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  Slack / HTTP    │────▸│ API Gateway  │────▸│   Lambda Function   │
└─────────────────┘     └──────────────┘     │                     │
                                              │  ┌───────────────┐  │
┌─────────────────┐                           │  │ Handler       │  │
│  EventBridge    │──────────────────────────▸│  │   ↓           │  │
└─────────────────┘                           │  │ Runtime Engine│  │
                                              │  │   ↓     ↓     │  │
                                              │  │ Skills  Tools │  │
                                              │  └───────────────┘  │
                                              │         ↓           │
                                              │  ┌───────────────┐  │
                                              │  │ Bedrock       │  │
                                              │  │ Converse API  │  │
                                              │  └───────────────┘  │
                                              └─────────────────────┘
                                                        ↕
                                              ┌─────────────────────┐
                                              │  DynamoDB (memory)  │
                                              └─────────────────────┘

The agent loop:

  1. User sends a message (Slack, HTTP, or EventBridge schedule)
  2. Handler loads conversation history from DynamoDB (if enabled)
  3. Runtime engine loads the skill markdown → becomes the system prompt
  4. Shared rules from rules/ are appended to every prompt
  5. Engine calls bedrock.converse() in a loop
  6. If the model calls a tool → route to your Python function → feed result back
  7. When the model returns text, save to memory and respond

That's it. No chains, no runnables, no graph DSL. Just a loop.


Skills as Markdown

This is the core idea. Instead of defining agent behavior in code — classes, decorators, configuration objects — you write a markdown file:

---
name: my-coordinator
version: 1.0.0
description: Routes requests to the right tools
tags: [coordinator, routing]
---

# Agent Coordinator

## When to Use
This is the default entry skill for all interactions.

## Available Tools
- **get_time**: Get the current UTC time
- **get_weather**: Get weather for a city
- **search_logs**: Search CloudWatch logs for errors

## Process
1. Read the incoming message
2. Classify the request type
3. Use the appropriate tool
4. Summarize findings in a clear response

## Guardrails
- Keep responses concise and helpful
- Never fabricate data — use tools when available
- If a tool fails, explain what happened

## Standalone Mode
Without tools, respond conversationally. Explain what
tools would be needed for tool-dependent questions.

Drop this file in skills/ and your agent knows what to do. The markdown becomes the system prompt — readable by developers and non-developers alike.

Why markdown?

  • Version-controllable: Skill behavior is tracked in git like any other code
  • Reviewable: Product managers and security teams can read and audit agent behavior without knowing Python
  • Composable: Multiple skills can be combined through delegation
  • Portable: Markdown files work across projects — swap one into a different agent

Shared Rules

Files in rules/ are appended to every skill's system prompt. Use them for company-wide policies:

# Formatting Rules

- Use Slack mrkdwn formatting
- Keep responses under 3000 characters
- Use bullet points for lists
- Bold key findings with *asterisks*

One formatting rule file. Every skill follows it. No duplication.


Tools as Plain Python Functions

No decorators. No base classes. No framework. Just functions:

Define the spec (tools/specs/my_tools.py):

MY_TOOL_SPECS = [
    {
        "toolSpec": {
            "name": "search_logs",
            "description": "Search CloudWatch logs for a pattern",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "log_group": {
                            "type": "string",
                            "description": "CloudWatch log group name"
                        },
                        "pattern": {
                            "type": "string",
                            "description": "Search pattern"
                        }
                    },
                    "required": ["log_group", "pattern"]
                }
            }
        }
    }
]

Implement the handler (tools/my_tools.py):

import boto3

def search_logs(log_group: str, pattern: str) -> str:
    client = boto3.client("logs")
    resp = client.filter_log_events(
        logGroupName=log_group,
        filterPattern=pattern,
        limit=20,
    )
    events = [e["message"] for e in resp.get("events", [])]
    return "\n".join(events) if events else "No matching log events found."

Register it (tools/registry.py):

from tools.specs.my_tools import MY_TOOL_SPECS
from tools.my_tools import search_logs

TOOL_HANDLERS = {
    "search_logs": lambda name, tool_input: search_logs(**tool_input),
}

def get_all_specs():
    return MY_TOOL_SPECS

def handle_tool(name, tool_input):
    if name not in TOOL_HANDLERS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOL_HANDLERS[name](name, tool_input)

That's three files. Your agent can now search CloudWatch logs. Add more tools by adding more functions and specs — no framework code to learn.


Multi-Agent Delegation

For complex workflows, a coordinator skill can delegate to specialized sub-skills. The coordinator doesn't need to know how the sub-skill works — it just hands off a task and gets a result.

User Message
    │
    ▼
┌──────────────────────┐
│  coordinator skill   │
│  "Route this request"│
└──────────┬───────────┘
           │ delegate_to_skill("log-analyst", "Find errors in prod")
           ▼
   ┌───────────────────┐
   │  log-analyst skill│
   │  (own tools,      │
   │   own prompt)     │
   └───────────────────┘
           │
           ▼
     Result flows back
     to coordinator

Each sub-skill gets its own system prompt and tool set. Delegation depth is limited (default: 3 levels) to prevent infinite recursion. This enables complex multi-agent workflows without service-to-service calls — everything runs in a single Lambda invocation.


Conversation Memory

The module optionally creates a DynamoDB table for multi-turn conversations. Messages are stored per thread with automatic TTL cleanup:

┌─────────────────────────────────────────────┐
│ DynamoDB: my-agent-conversations            │
├───────────────────┬─────────────────────────┤
│ PK                │ SK                      │
│ THREAD#slack-123  │ MSG#1710000001.000      │
│ THREAD#slack-123  │ MSG#1710000002.000      │
│ THREAD#slack-456  │ MSG#1710000003.000      │
└───────────────────┴─────────────────────────┘
  • History is capped at 100 messages per thread to prevent exceeding Bedrock context windows
  • Old messages auto-expire via DynamoDB TTL (configurable, default 30 days)
  • Collision-safe sort keys prevent race conditions in concurrent conversations

Enable it with one variable:

enable_memory_table = true

What the Module Deploys

One terraform apply creates everything:

Component Description
Lambda Function Your agent code + runtime engine, with create_before_destroy lifecycle
Lambda Layer Optional — pip dependencies from requirements.txt
IAM Role Least-privilege policies scoped to deployment region
API Gateway HTTP API with throttling and access logs (optional)
DynamoDB Table Conversation memory with encryption and TTL (optional)
EventBridge Rules Scheduled tasks for cron-based agent invocations (optional)
CloudWatch Logs Log groups with configurable retention

Security Built In

  • IAM policies are scoped to the deployment region and specific resource ARNs
  • Bedrock access is limited to Anthropic models in the deployed region
  • Slack signature verification (HMAC-SHA256) runs before processing events
  • Secrets stored as SecureString in SSM Parameter Store — Lambda only reads explicitly listed prefixes
  • Tool errors don't leak internal details to end users
  • Skill names are validated to prevent path traversal attacks

Cost Breakdown

Running an AI agent on this module is remarkably cheap:

Component Configuration Monthly Cost
Lambda 1024 MB, ~1000 invocations/day ~$3
Bedrock (Claude Sonnet) ~1000 conversations/day, avg 3 turns ~$30-80
DynamoDB On-demand, memory table ~$2
API Gateway HTTP API, ~30k requests/month ~$1
CloudWatch Logs 14-day retention ~$2
SSM Parameter Store Standard parameters ~$0
Total Moderate usage ~$38-88/month

Most of the cost is Bedrock inference — the infrastructure itself is negligible. Compare that to running a framework-heavy agent on ECS/Fargate ($50-100/month just for compute) or paying per-seat for a hosted agent platform.

Cost optimization tips:

  • Use lambda_reserved_concurrency to cap concurrent invocations and control Bedrock spend
  • Set memory_ttl_days to auto-clean old conversations
  • Choose the right model — Claude Haiku for simple routing, Sonnet for complex reasoning
  • API Gateway caching reduces duplicate Lambda invocations

Getting Started

Prerequisites

  • AWS account with Bedrock access (Claude models enabled)
  • Terraform >= 1.5
  • Python 3.12

Step 1: Get the Module

Want to try the module? Enter your email to get the GitHub repository URL.

Get Free Access to the Markdown Agent Module

Enter your email to get instant access to the GitHub repository.

No spam. Unsubscribe anytime. We respect your privacy.

Step 2: Create Your Agent

Set up your project structure:

my-agent/
├── main.tf
├── requirements.txt          # pip dependencies (e.g., requests)
└── src/
    ├── orchestrator/
    │   ├── handler.py        # Lambda entry point (copy from module examples)
    │   └── agent.py          # Calls runtime engine
    ├── runtime/
    │   ├── engine.py         # Copy from module runtime/
    │   └── memory.py         # Copy from module runtime/
    ├── skills/
    │   └── my-skill.md       # Your agent's behavior
    ├── rules/
    │   └── formatting.md     # Shared rules
    └── tools/
        ├── registry.py       # Tool registry
        ├── specs/
        │   └── my_tools.py   # Bedrock toolSpec definitions
        └── my_tools.py       # Tool implementations

Step 3: Configure Terraform

module "agent" {
  source = "github.com/AIOpsCrew/terraform-module-markdown-agent"

  name        = "my-agent"
  environment = "prod"

  bedrock_model_id = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"

  source_dir = "${path.module}/src"
  layer_path = "${path.module}/dist/layer.zip"

  ssm_parameter_prefixes = ["/my-agent/slack/*"]

  lambda_environment_variables = {
    MEMORY_TABLE = "my-agent-conversations"
    MODEL_ID     = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  }

  enable_api_gateway    = true
  enable_memory_table   = true

  tags = {
    Project = "my-agent"
  }
}

Step 4: Deploy

# Build the Lambda layer (if you have pip dependencies)
bash scripts/build_layer.sh .

# Deploy
terraform init
terraform apply

# Copy the API Gateway URL to your Slack app's Event Subscriptions
terraform output api_gateway_url

Step 5: Add Scheduled Tasks (Optional)

scheduled_tasks = [
  {
    name                = "daily-report"
    description         = "Generate daily summary"
    schedule_expression = "cron(0 13 * * ? *)"
    input = {
      source        = "scheduled"
      task          = "daily-report"
      slack_channel = "C123ABC"
      prompt        = "Generate the daily operations report"
    }
  }
]

Real-World Use Cases

Here's what we've built with this module:

1. Slack DevOps Assistant

Skill: devops-coordinator.md with tools for CloudWatch, EC2, RDS Result: Engineers ask questions in Slack, agent investigates AWS resources in real time

2. Scheduled Compliance Checker

Skill: compliance-auditor.md with tools for AWS Config and IAM Trigger: EventBridge cron, daily at 9am Result: Daily Slack report of security findings — no human has to remember to check

3. Customer Support Triage Bot

Skill: support-router.md with delegation to billing-agent.md and technical-agent.md Result: Multi-agent system that classifies and handles support requests in Slack

4. Incident Response Automation

Skill: incident-responder.md with tools for PagerDuty, Jira, and CloudWatch Trigger: HTTP webhook from monitoring system Result: Automatically gathers context, creates Jira ticket, and posts summary to incident channel


Troubleshooting

Slack 3-second timeout: Slack retries if it doesn't get a response within 3 seconds. The handler acknowledges retries with HTTP 200 immediately to prevent duplicate processing.

Bedrock throttling: The engine retries with exponential backoff. For sustained throttling, request a quota increase or set lambda_reserved_concurrency to limit concurrent invocations.

Cold starts: Keep lambda_memory at 1024+ MB for faster initialization. The runtime caches the Bedrock client and SSM secrets across warm invocations.


Contributing

We welcome contributions! Here's how you can help:

Bug Reports: Open an issue on GitHub with your Terraform version, module version, error messages, and expected vs. actual behavior.

Feature Requests: Describe your use case and why the feature would help.

Pull Requests: Fork the repository, create a feature branch, and submit a PR with a clear description of your changes.

Share Your Skills: Built a useful skill markdown file? We'd love to feature community skills in the documentation.


About AI Ops Crew

We build production-ready Terraform modules for AWS operations. Our mission: make infrastructure automation accessible to every engineering team.

Our Modules:

  • CloudWatch AI Agent (Premium - $5/mo): AI-powered alarm investigation with real-time AWS analysis
  • n8n Fargate Cluster (Free): Workflow automation platform on AWS
  • Jenkins Sentinel (Free): AI-powered pipeline failure analysis
  • API Gateway Custom Auth (Free): Plugin-based Lambda authorizer for API Gateway
  • Markdown Agent (Free): Deploy AI agents on AWS with markdown skills
  • More coming soon...

Follow our journey as we open-source more infrastructure tools. Subscribe to our newsletter for updates.


Ready to build your first markdown-driven agent? Get the module and have it running in minutes.


Have questions about the module or want to share your skills? Email us at info@aiopscrew.com or open a discussion on GitHub.