The AWS bill that made me rethink everything about infrastructure
Our third-month AWS bill was four times what I projected. The culprit was not compute. It was data transfer, NAT Gateway charges, and CloudWatch log ingestion.
I opened the AWS billing console on a Wednesday morning and stared at the number for about thirty seconds. Our projected budget was $410; the actual bill was $1,640. Four times what I expected. The kind of surprise that makes a bootstrapped startup founder reconsider every infrastructure choice they have ever made.
The frustrating part was that our compute costs were exactly what I projected. Two t3.medium instances, an RDS db.t3.small, and an ElastiCache t3.micro. The compute line items totaled $380. The remaining $1,260 came from three services I barely thought about when I set up the infrastructure.
A lot of my month-one leadership showed up in infrastructure choices that looked small from the outside. It also builds on what I learned earlier in “My first homelab rack: a mini PC, k3s, and the itch to self-host everything.” I was building the muscle memory that later fed the infrastructure and ctrlpane projects at home: reproducible defaults, cheap feedback loops, and enough observability that I did not have to guess under pressure.
The Three Hidden Cost Centers
I spent the rest of that Wednesday doing a forensic breakdown of every line item on the bill. The three culprits were hiding in plain sight.
- NAT Gateway: $0.045 per GB of data processed. Every byte our instances exchanged with the internet went through it.
- Data Transfer: $0.01 per GB in each direction between AZs. Our application made 40,000 database queries per day, each returning an average of 2 KB. That adds up. Plus, every API response to clients outside the region incurred standard data transfer out charges.
- CloudWatch Logs: $0.50 per GB ingested. The silent killer, because log volume grows with traffic rather than with provisioned resources.
The NAT Gateway alone was eating $340 per month for the privilege of letting our servers talk to the internet. That is almost as much as all our compute costs combined.
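Those per-GB rates turn into a few lines of arithmetic. A minimal sketch using the rates from the list above; the traffic figure is backed out of the $340 NAT line item, not measured:

```python
# Per-GB rates from the bill breakdown (us-east-1 at the time).
NAT_PER_GB = 0.045        # NAT Gateway data processing
CROSS_AZ_PER_GB = 0.01    # cross-AZ transfer, charged in each direction
LOGS_PER_GB = 0.50        # CloudWatch Logs ingestion

def nat_monthly_cost(gb_processed: float) -> float:
    """Monthly NAT Gateway data-processing charge (excludes the hourly fee)."""
    return gb_processed * NAT_PER_GB

def cross_az_round_trip_cost(gb: float) -> float:
    """Cross-AZ traffic is billed out of one AZ and into the other."""
    return gb * CROSS_AZ_PER_GB * 2

def log_ingest_cost(gb: float) -> float:
    return gb * LOGS_PER_GB

# Back out the traffic implied by the $340/month NAT line item:
implied_gb = 340 / NAT_PER_GB   # roughly 7,556 GB through the gateway per month
```

Seven and a half terabytes a month through a metered gateway is the kind of number you only discover by doing this division.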
The Architectural Changes
I spent the following week making four changes that brought the bill from $1,640 to $650. A 60% reduction without sacrificing any functionality or reliability.
- Moved to public subnets with security groups: Instead of private subnets behind a NAT Gateway, I moved our application instances to public subnets and used security groups to restrict inbound traffic to the load balancer only. Same security posture, zero NAT Gateway charges. This is not the right move for every workload, but for a startup with two instances it is a reasonable tradeoff.
- Colocated application and database in the same AZ: Cross-AZ redundancy is important for production resilience, but our RDS instance was single-AZ anyway. Putting the application in the same AZ eliminated cross-AZ data transfer charges. When we upgrade to Multi-AZ RDS, we will revisit this.
- Log level and retention policy: Changed application logging from INFO to WARN for request bodies. Kept INFO for payment processing paths where audit trails matter. Set log retention to 30 days instead of indefinite. Added log sampling for high-volume health check endpoints. CloudWatch costs dropped to roughly $40 per month.
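The health-check sampling can happen at the application layer, before a line ever reaches CloudWatch. A sketch using Python's stdlib logging; the `/healthz` path and the 1-in-100 rate are placeholder assumptions, not our actual values:

```python
import logging

class HealthCheckSampler(logging.Filter):
    """Keep 1 in `sample_rate` records that mention the health-check path;
    pass every other record through untouched."""

    def __init__(self, path: str = "/healthz", sample_rate: int = 100):
        super().__init__()
        self.path = path
        self.sample_rate = sample_rate
        self._seen = 0  # deterministic counter, so sampling is predictable

    def filter(self, record: logging.LogRecord) -> bool:
        if self.path not in record.getMessage():
            return True                              # non-health-check logs untouched
        self._seen += 1
        return self._seen % self.sample_rate == 1    # keep 1 in 100

access_logger = logging.getLogger("access")
access_logger.addFilter(HealthCheckSampler())
```

A counter instead of random sampling means the reduction factor is exact, which makes the cost projection exact too.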
- Switched to VPC endpoints for AWS services: S3 and SQS traffic now routes through VPC endpoints instead of the public internet. The S3 gateway endpoint is free; the SQS interface endpoint carries a small hourly and per-GB charge, but nothing close to NAT Gateway rates.
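Creating the S3 gateway endpoint is a single API call. A hedged sketch of the parameters involved: the helper function is mine for illustration, and the VPC and route table IDs are placeholders.

```python
def s3_gateway_endpoint_params(region: str, vpc_id: str, route_table_ids: list) -> dict:
    """Parameters for ec2.create_vpc_endpoint: a gateway endpoint for S3.
    Gateway endpoints (S3, DynamoDB) carry no per-GB or hourly charge."""
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",  # regional S3 service name
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),       # these tables get an S3 prefix-list route
    }

# Usage (requires boto3 and credentials; IDs are placeholders):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# ec2.create_vpc_endpoint(**s3_gateway_endpoint_params("us-east-1", "vpc-0abc", ["rtb-0def"]))
```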
The Spreadsheet Model
After the bill shock I built a spreadsheet that I now use before deploying anything new to AWS. It models four cost dimensions that are easy to overlook.
Cost Model Checklist:
1. COMPUTE: Instance type x hours x count → Straightforward. Everyone models this.
2. DATA TRANSFER: GB out to internet + GB cross-AZ + GB cross-region → Estimate daily API response volume x average payload size → Multiply by 30 for monthly projection
3. DATA PROCESSING: NAT Gateway + Load Balancer + VPC endpoints → Every GB through a NAT Gateway costs $0.045 → Every GB through an ALB costs $0.008
4. STORAGE & LOGGING: S3 + EBS + CloudWatch + RDS storage → Log volume grows every day. Model the 90-day projection. → RDS storage auto-scaling sounds free. It is not.

The key insight is that compute costs are the minority of most AWS bills at startup scale. Data transfer and logging are the costs that sneak up on you because they scale with traffic, not with provisioned resources. You cannot see them coming by looking at your infrastructure diagram.
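The checklist translates directly into a small cost function. A sketch using the rates named in this post; the $0.09/GB internet-egress rate, the 730-hour month, and the blended storage rate are my assumptions, not numbers from the bill:

```python
def monthly_cost_estimate(
    instance_hourly: float,       # on-demand $/hour per instance
    instance_count: int,
    gb_internet_out: float,       # monthly GB out to the internet
    gb_cross_az: float,           # monthly GB between AZs
    gb_nat: float,                # monthly GB through the NAT Gateway
    gb_alb: float,                # monthly GB through the ALB
    log_gb_per_day: float,        # CloudWatch Logs ingestion per day
    storage_gb: float,
    storage_rate: float = 0.10,   # assumed blended $/GB-month across EBS/S3/RDS
) -> dict:
    return {
        # 1. COMPUTE: instance type x hours x count
        "compute": instance_hourly * 730 * instance_count,
        # 2. DATA TRANSFER: internet egress (assumed $0.09/GB) + cross-AZ ($0.01 each direction)
        "transfer": gb_internet_out * 0.09 + gb_cross_az * 0.01 * 2,
        # 3. DATA PROCESSING: NAT Gateway ($0.045/GB) + ALB ($0.008/GB)
        "processing": gb_nat * 0.045 + gb_alb * 0.008,
        # 4. STORAGE & LOGGING: ingestion at $0.50/GB, storage at the blended rate
        "logs": log_gb_per_day * 30 * 0.50,
        "storage": storage_gb * storage_rate,
    }

# Example: two t3.medium-class instances (assumed $0.0416/hr) plus modest traffic.
estimate = monthly_cost_estimate(0.0416, 2, 100, 500, 7500, 2000, 3, 200)
total = sum(estimate.values())
```

Run it before deploying, and the "processing" line alone will tell you whether a NAT Gateway belongs in the design.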
The Broader Lesson
Cloud bills are not a technical problem. They are an architecture problem. Every architectural decision has a cost implication that is invisible until the bill arrives. Private subnets are a security best practice, but they come with a $340 monthly NAT Gateway tax. Verbose logging is an operational best practice, but it comes with a storage cost that compounds every day. Cross-AZ deployment is a reliability best practice, but it doubles your data transfer charges.
None of these costs are visible in the signup flow, the quick start guide, or the architecture tutorial. They are visible in the billing console thirty days later when the damage is already done. The only defense is modeling costs before deploying and reviewing the bill line by line every month.
The builder phase was less glamorous than people imagine. It was mostly a series of stubborn, unfashionable choices that kept future-me out of 2 a.m. incident calls. I still make the same kind of choices inside portfolio, pipeline-sdk, and dotfiles.
Startups die from cloud bills more often than they admit. The first time you open the billing console and flinch, take it seriously. Model the costs, make the changes, and never deploy infrastructure without a cost projection again.