The AWS bill that made me rethink everything about infrastructure
Our third-month AWS bill was four times what I projected. The culprit was not compute. It was data transfer, NAT Gateway charges, and CloudWatch log ingestion.
I opened the AWS billing console on a Wednesday morning and stared at the number for about thirty seconds. Our projected budget was $410; the actual bill was $1,640. Four times what I expected. The kind of surprise that makes a bootstrapped startup founder reconsider every infrastructure choice they have ever made.
The frustrating part was that our compute costs were exactly what I projected. Two t3.medium instances, an RDS db.t3.small, and an ElastiCache t3.micro. The compute line items totaled $380. The remaining $1,260 came from three services I barely thought about when I set up the infrastructure.
A lot of my month-one leadership showed up in infrastructure choices that looked small from the outside. It also builds on what I learned earlier in “My first homelab rack: a mini PC, k3s, and the itch to self-host everything.” I was building the muscle memory that later fed the infrastructure and ctrlpane projects at home: reproducible defaults, cheap feedback loops, and enough observability that I did not have to guess under pressure.
The Three Hidden Cost Centers
I spent the rest of that Wednesday doing a forensic breakdown of every line item on the bill. The three culprits were hiding in plain sight.
- NAT Gateway: $0.045 per GB of data processed. Every byte our instances exchanged with the internet went through it.
- Data Transfer: $0.01 per GB in each direction between AZs. Our application made 40,000 database queries per day, each returning an average of 2 KB. That adds up. Plus, every API response to clients outside the region incurred standard data transfer out charges.
- CloudWatch Logs: $0.50 per GB ingested. The silent killer, because log volume grows with traffic rather than with provisioned resources.
The NAT Gateway alone was eating $340 per month for the privilege of letting our servers talk to the internet. That is almost as much as all our compute costs combined.
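Those per-GB rates turn into a few lines of arithmetic. A minimal sketch using the rates from the list above; the traffic figure is backed out of the $340 NAT line item, not measured:

```python
# Per-GB rates from the bill breakdown (us-east-1 at the time).
NAT_PER_GB = 0.045        # NAT Gateway data processing
CROSS_AZ_PER_GB = 0.01    # cross-AZ transfer, charged in each direction
LOGS_PER_GB = 0.50        # CloudWatch Logs ingestion

def nat_monthly_cost(gb_processed: float) -> float:
    """Monthly NAT Gateway data-processing charge (excludes the hourly fee)."""
    return gb_processed * NAT_PER_GB

def cross_az_round_trip_cost(gb: float) -> float:
    """Cross-AZ traffic is billed out of one AZ and into the other."""
    return gb * CROSS_AZ_PER_GB * 2

def log_ingest_cost(gb: float) -> float:
    return gb * LOGS_PER_GB

# Back out the traffic implied by the $340/month NAT line item:
implied_gb = 340 / NAT_PER_GB   # roughly 7,556 GB through the gateway per month
```

Seven and a half terabytes a month through a metered gateway is the kind of number you only discover by doing this division.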
The Architectural Changes
I spent the following week making four changes that brought the bill from $1,640 to $650. A 60% reduction without sacrificing any functionality or reliability.
- Moved to public subnets with security groups: Instead of private subnets behind a NAT Gateway, I moved our application instances to public subnets and used security groups to restrict inbound traffic to the load balancer only. Same security posture, zero NAT Gateway charges. This is not the right move for every workload, but for a startup with two instances it is a reasonable tradeoff.
- Colocated application and database in the same AZ: Cross-AZ redundancy is important for production resilience, but our RDS instance was single-AZ anyway. Putting the application in the same AZ eliminated cross-AZ data transfer charges. When we upgrade to Multi-AZ RDS, we will revisit this.
- Log level and retention policy: Changed application logging from INFO to WARN for request bodies. Kept INFO for payment processing paths where audit trails matter. Set log retention to 30 days instead of indefinite. Added log sampling for high-volume health check endpoints. CloudWatch costs dropped to roughly $40 per month.
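The health-check sampling can happen at the application layer, before a line ever reaches CloudWatch. A sketch using Python's stdlib logging; the `/healthz` path and the 1-in-100 rate are placeholder assumptions, not our actual values:

```python
import logging

class HealthCheckSampler(logging.Filter):
    """Keep 1 in `sample_rate` records that mention the health-check path;
    pass every other record through untouched."""

    def __init__(self, path: str = "/healthz", sample_rate: int = 100):
        super().__init__()
        self.path = path
        self.sample_rate = sample_rate
        self._seen = 0  # deterministic counter, so sampling is predictable

    def filter(self, record: logging.LogRecord) -> bool:
        if self.path not in record.getMessage():
            return True                              # non-health-check logs untouched
        self._seen += 1
        return self._seen % self.sample_rate == 1    # keep 1 in 100

access_logger = logging.getLogger("access")
access_logger.addFilter(HealthCheckSampler())
```

A counter instead of random sampling means the reduction factor is exact, which makes the cost projection exact too.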
- Switched to VPC endpoints for AWS services: S3 and SQS traffic now routes through VPC endpoints instead of the public internet. The S3 gateway endpoint is free; the SQS interface endpoint carries a small hourly and per-GB charge, but nothing close to NAT Gateway rates.
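Creating the S3 gateway endpoint is a single API call. A hedged sketch of the parameters involved: the helper function is mine for illustration, and the VPC and route table IDs are placeholders.

```python
def s3_gateway_endpoint_params(region: str, vpc_id: str, route_table_ids: list) -> dict:
    """Parameters for ec2.create_vpc_endpoint: a gateway endpoint for S3.
    Gateway endpoints (S3, DynamoDB) carry no per-GB or hourly charge."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",  # regional S3 service name
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),       # these tables get an S3 prefix-list route
    }

# Usage (requires boto3 and credentials; IDs are placeholders):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.create_vpc_endpoint(**s3_gateway_endpoint_params("us-east-1", "vpc-0abc", ["rtb-0def"]))
```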
The Spreadsheet Model
After the bill shock I built a spreadsheet that I now use before deploying anything new to AWS. It models four cost dimensions that are easy to overlook.
Cost Model Checklist:
1. COMPUTE: Instance type x hours x count → Straightforward. Everyone models this.
2. DATA TRANSFER: GB out to internet + GB cross-AZ + GB cross-region → Estimate daily API response volume x average payload size → Multiply by 30 for monthly projection
3. DATA PROCESSING: NAT Gateway + Load Balancer + VPC endpoints → Every GB through a NAT Gateway costs $0.045 → Every GB through an ALB costs $0.008
4. STORAGE & LOGGING: S3 + EBS + CloudWatch + RDS storage → Log volume grows every day. Model the 90-day projection. → RDS storage auto-scaling sounds free. It is not.

The key insight is that compute costs are the minority of most AWS bills at startup scale. Data transfer and logging are the costs that sneak up on you because they scale with traffic, not with provisioned resources. You cannot see them coming by looking at your infrastructure diagram.
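The checklist translates directly into a small cost function. A sketch using the rates named in this post; the $0.09/GB internet-egress rate, the 730-hour month, and the blended storage rate are my assumptions, not numbers from the bill:

```python
def monthly_cost_estimate(
    instance_hourly: float,       # on-demand $/hour per instance
    instance_count: int,
    gb_internet_out: float,       # monthly GB out to the internet
    gb_cross_az: float,           # monthly GB between AZs
    gb_nat: float,                # monthly GB through the NAT Gateway
    gb_alb: float,                # monthly GB through the ALB
    log_gb_per_day: float,        # CloudWatch Logs ingestion per day
    storage_gb: float,
    storage_rate: float = 0.10,   # assumed blended $/GB-month across EBS/S3/RDS
) -> dict:
    return {
        # 1. COMPUTE: instance type x hours x count
        "compute": instance_hourly * 730 * instance_count,
        # 2. DATA TRANSFER: internet egress (assumed $0.09/GB) + cross-AZ ($0.01 each direction)
        "transfer": gb_internet_out * 0.09 + gb_cross_az * 0.01 * 2,
        # 3. DATA PROCESSING: NAT Gateway ($0.045/GB) + ALB ($0.008/GB)
        "processing": gb_nat * 0.045 + gb_alb * 0.008,
        # 4. STORAGE & LOGGING: ingestion at $0.50/GB, storage at the blended rate
        "logs": log_gb_per_day * 30 * 0.50,
        "storage": storage_gb * storage_rate,
    }

# Example: two t3.medium-class instances (assumed $0.0416/hr) plus modest traffic.
estimate = monthly_cost_estimate(0.0416, 2, 100, 500, 7500, 2000, 3, 200)
total = sum(estimate.values())
```

Run it before deploying, and the "processing" line alone will tell you whether a NAT Gateway belongs in the design.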
The Broader Lesson
Cloud bills are not a technical problem. They are an architecture problem. Every architectural decision has a cost implication that is invisible until the bill arrives. Private subnets are a security best practice, but they come with a $340 monthly NAT Gateway tax. Verbose logging is an operational best practice, but it comes with a storage cost that compounds every day. Cross-AZ deployment is a reliability best practice, but it doubles your data transfer charges.
None of these costs are visible in the signup flow, the quick start guide, or the architecture tutorial. They are visible in the billing console thirty days later when the damage is already done. The only defense is modeling costs before deploying and reviewing the bill line by line every month.
The builder phase was less glamorous than people imagine. It was mostly a series of stubborn, unfashionable choices that kept future-me out of 2 a.m. incident calls. I still make the same kind of choices inside portfolio, pipeline-sdk, and dotfiles.
Startups die from cloud bills more often than they admit. The first time you open the billing console and flinch, take it seriously. Model the costs, make the changes, and never deploy infrastructure without a cost projection again.