Building EduFly's Cloud Infrastructure
When I founded EduFly — an AI-powered School ERP serving multiple schools — I faced the classic startup infrastructure dilemma: build for scale from day one (and burn cash on unused capacity) or start minimal and rewrite later (and suffer downtime during growth). I chose a middle path: build with scalability in mind, but only pay for what you need today.
The result is an AWS infrastructure that serves 15,000+ users with 99.9% uptime for roughly $500 a month. These same architectural principles now inform how I build cloud infrastructure at Modelia.ai for our Shopify-integrated AI fashion platform.
The Architecture
EduFly's AWS setup, refined over 18 months of production operation:
- Compute: ECS Fargate, serverless containers that scale automatically. No EC2 instances to manage, no AMIs to maintain.
- Database: RDS PostgreSQL with automated backups, point-in-time recovery, and read replicas for analytics queries
- Cache: ElastiCache Redis for session management, API response caching, and real-time notifications
- Storage: S3 for student documents, report cards, school logos, and assessment attachments. Lifecycle policies automatically archive old files to Glacier.
- CDN: CloudFront for static assets (React bundles, images, fonts) with edge locations across India
- DNS: Route 53 with health checks and automatic failover
- Monitoring: CloudWatch with custom dashboards, alarms, and automated incident response
- Secrets: AWS Secrets Manager for database credentials, API keys, and encryption keys
Starting Small, Scaling Smart
Phase 1: MVP (0-1,000 users) — ~$50/month
The initial deployment was deliberately minimal:
# Initial ECS task definition
taskDefinition:
  family: edufly-api
  cpu: 256 # 0.25 vCPU
  memory: 512 # 512 MB
  containers:
    - name: api
      image: edufly-api:latest
      portMappings:
        - containerPort: 3000
# Single RDS instance
rdsInstance:
  instanceClass: db.t3.micro
  allocatedStorage: 20
  multiAZ: false
  backupRetentionPeriod: 7
At this stage, a single ECS task and a single RDS instance handled everything. Total monthly cost: approximately $50 (RDS micro + Fargate minimal + S3 + Route 53).
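Fargate accepts only certain CPU and memory pairings, and the 256 CPU / 512 MB task above is the smallest one available. The following is a minimal sketch of a size check that could run before a task definition is registered. The names `FARGATE_SIZES` and `isValidFargateSize` are illustrative and not part of any AWS SDK, and the table deliberately leaves out the larger 8 and 16 vCPU sizes.

```typescript
// Memory values in MiB from minMiB to maxMiB inclusive, in 1 GiB steps by default.
function memoryRange(minMiB: number, maxMiB: number, stepMiB: number = 1024): number[] {
  const values: number[] = [];
  for (let m = minMiB; m <= maxMiB; m += stepMiB) {
    values.push(m);
  }
  return values;
}

// Allowed memory settings per CPU setting (CPU units: 1024 = 1 vCPU).
// Illustrative table covering sizes up to 4 vCPU only.
const FARGATE_SIZES: Record<number, number[]> = {
  256: [512, 1024, 2048],
  512: memoryRange(1024, 4096),
  1024: memoryRange(2048, 8192),
  2048: memoryRange(4096, 16384),
  4096: memoryRange(8192, 30720),
};

// True when the cpu/memory pair is one Fargate would accept (within this table).
function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  const allowed = FARGATE_SIZES[cpu];
  return allowed !== undefined && allowed.indexOf(memoryMiB) !== -1;
}
```

A check like this catches an invalid combination such as 256 CPU units with 4096 MB in code review rather than at deploy time.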
Phase 2: Growth (1,000-10,000 users) — ~$200/month
As schools adopted EduFly, we needed autoscaling and caching:
# ECS Service with autoscaling
service:
  desiredCount: 2
  capacityProvider: FARGATE
  autoScaling:
    minCapacity: 2
    maxCapacity: 6
    targetCpuUtilization: 70
    targetMemoryUtilization: 75
    scaleInCooldown: 300
    scaleOutCooldown: 60
# Add ElastiCache for caching
elasticache:
  engine: redis
  nodeType: cache.t3.micro
  numCacheNodes: 1
The key addition was Redis caching. Student profiles, attendance records, and timetable data are read-heavy and write-light, which makes them a good fit for caching. Adding Redis reduced our RDS CPU usage by 40%.
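The usual way to put a cache in front of read-heavy data like this is the cache-aside pattern: check the cache, fall back to the database on a miss, then store the result with a TTL. Here is a minimal sketch of that pattern, not EduFly's actual code. The `CacheAside` class is hypothetical, and an in-memory `Map` stands in for Redis so the example stays self-contained.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// Cache-aside helper. The Map is a stand-in for Redis (an assumption of this sketch).
class CacheAside<T> {
  private store: Map<string, CacheEntry<T>> = new Map();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value if it is still fresh; otherwise load and cache it.
  async get(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.store.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = await load(); // e.g. a PostgreSQL query
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call after a write so the next read reloads from the database.
  invalidate(key: string): void {
    this.store.delete(key);
  }
}
```

Because writes are rare for this kind of data, invalidating the key on every update keeps the cache correct without needing a short TTL.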
Phase 3: Scale (10,000+ users) — ~$500/month
Multi-AZ deployment for high availability, read replicas for analytics, and CloudFront for global CDN:
# Multi-AZ RDS with read replica
rdsInstance:
  instanceClass: db.t3.small
  multiAZ: true # Automatic failover
  readReplica:
    instanceClass: db.t3.micro
    region: ap-south-1 # Same region for low latency
# CloudFront distribution
cloudfront:
  origins:
    - s3Bucket: edufly-static-assets
    - elbEndpoint: api.edufly.app
  defaultCacheBehavior:
    ttl: 86400 # 24 hours for static assets
  customErrorResponse:
    errorCode: 404
    responseCode: 200
    responsePage: /index.html # SPA fallback
Cost Optimization Strategies
Running a startup means every dollar matters. Here's how I keep AWS costs under control:
1. Fargate Spot for Non-Critical Workloads (up to 70% savings)
Background jobs like report generation, data exports, and AI model training run on Fargate Spot, which sells AWS's spare capacity at a steep discount in exchange for the risk of interruption on a two-minute warning:
// ECS task for background jobs uses Fargate Spot
import * as ecs from 'aws-cdk-lib/aws-ecs';
const backgroundTask = new ecs.FargateTaskDefinition(this, 'BackgroundTask', {
  cpu: 512,
  memoryLimitMiB: 1024,
});
new ecs.FargateService(this, 'BackgroundService', {
  cluster,
  taskDefinition: backgroundTask,
  capacityProviderStrategies: [
    // Weights split tasks roughly 4:1. ECS does not fall back to on-demand
    // automatically when Spot capacity is unavailable.
    { capacityProvider: 'FARGATE_SPOT', weight: 4 }, // About 80% of tasks on Spot
    { capacityProvider: 'FARGATE', weight: 1 }, // About 20% on on-demand as a baseline
  ],
});
2. RDS Reserved Instances (40% savings)
For the production database that runs 24/7, a 1-year reserved instance saves 40% compared to on-demand pricing.
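The reasoning behind reserving only the always-on database can be made concrete with a little arithmetic: a reservation is billed for every hour of the year, so it pays off only when the instance would otherwise run more than (1 - discount) of the time. The sketch below works that out. The function names are illustrative, and the hourly rate is a made-up placeholder, not a real AWS price.

```typescript
const HOURS_PER_YEAR = 24 * 365;

// Yearly on-demand cost for an instance running `utilization` (0..1) of the time.
function onDemandAnnualCost(hourlyRate: number, utilization: number): number {
  return hourlyRate * HOURS_PER_YEAR * utilization;
}

// Yearly reserved cost: every hour is billed, at the discounted rate.
function reservedAnnualCost(hourlyRate: number, discount: number): number {
  return hourlyRate * (1 - discount) * HOURS_PER_YEAR;
}

// Utilization above which the reservation is cheaper than on-demand.
function breakEvenUtilization(discount: number): number {
  return 1 - discount;
}

// Hypothetical on-demand rate in USD per hour, for illustration only.
const exampleHourlyRate = 0.1;
const exampleSavings =
  onDemandAnnualCost(exampleHourlyRate, 1) - reservedAnnualCost(exampleHourlyRate, 0.4);
```

With a 40% discount the break-even point is 60% utilization, which a 24/7 production database clears easily, while a staging database that runs only during working hours would not.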
3. S3 Lifecycle Policies
Student documents from previous academic years don't need instant access. Automatically transition to cheaper storage:
{
  "Rules": [
    {
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
4. CloudWatch-Based Right-Sizing
We review resource utilization monthly. If an ECS task consistently uses less than 50% of its allocated CPU/memory, we downsize it:
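# Check average CPU utilization for the last 7 days (GNU date syntax)
# ECS service metrics are keyed by ClusterName and ServiceName; the cluster name below is a placeholder
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=edufly-cluster Name=ServiceName,Value=edufly-api \
  --start-time "$(date -d '7 days ago' -Iseconds)" \
  --end-time "$(date -Iseconds)" \
  --period 86400 \
  --statistics Average
Docker + ECS Deployment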
The deployment discipline I learned at Bharat Electronics Limited (BEL), where every release for Air Force projects required rigorous review, carries over directly to AWS ECS:
# ECS Task Definition with health check
containerDefinitions:
  - name: api
    image: 123456.dkr.ecr.ap-south-1.amazonaws.com/edufly-api:latest
    memory: 512
    cpu: 256
    essential: true
    portMappings:
      - containerPort: 3000
    healthCheck:
      command: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
      interval: 30
      timeout: 5
      retries: 3
      startPeriod: 60
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/edufly-api
        awslogs-region: ap-south-1
        awslogs-stream-prefix: ecs
    environment:
      - name: NODE_ENV
        value: production
    secrets:
      - name: DATABASE_URL
        valueFrom: arn:aws:secretsmanager:ap-south-1:123456:secret:edufly/database-url
CI/CD Pipeline
Our deployment pipeline (now also used at Modelia.ai):
- GitHub Actions: run tests, lint, and type-check on every PR
- Docker build: multi-stage build for a minimal production image
- ECR push: push the tagged image to Amazon Elastic Container Registry
- ECS rolling update: new tasks start with the new image; old tasks drain connections gracefully
- Health verification: ECS waits for health checks to pass before routing traffic
- Rollback trigger: if the error rate rises above 5% in CloudWatch, automatically roll back to the previous task definition
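The rollback trigger in the last step comes down to a small decision function. What follows is a minimal sketch of it, assuming request and error counts for the periods since the deploy have already been fetched from CloudWatch. The `RequestWindow` type, the function names, and the `minRequests` guard (which avoids rolling back on a handful of requests) are assumptions of this sketch, not part of the pipeline described above.

```typescript
// Request counts for one metric period after a deploy.
interface RequestWindow {
  total: number;  // all requests in the period
  errors: number; // failed requests (for example HTTP 5xx) in the period
}

// Aggregate error rate across all windows; 0 when there was no traffic.
function errorRate(windows: RequestWindow[]): number {
  let total = 0;
  let errors = 0;
  for (const w of windows) {
    total += w.total;
    errors += w.errors;
  }
  return total === 0 ? 0 : errors / total;
}

// Roll back when the post-deploy error rate exceeds the threshold
// and enough requests were seen for the rate to be meaningful.
function shouldRollBack(
  windows: RequestWindow[],
  threshold: number = 0.05,
  minRequests: number = 50
): boolean {
  const seen = windows.reduce((sum, w) => sum + w.total, 0);
  if (seen < minRequests) {
    return false;
  }
  return errorRate(windows) > threshold;
}
```

Keeping the decision in a pure function like this makes the 5% rule easy to unit test separately from whatever mechanism actually redeploys the previous task definition.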
Monitoring and Alerting
The monitoring setup that keeps EduFly running at 99.9% uptime:
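// CloudWatch alarms via CDK
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';
// CPU alarm (named for the metric it actually watches)
const highCpu = new cloudwatch.Alarm(this, 'HighCpu', {
  metric: apiService.metricCpuUtilization(),
  threshold: 90,
  evaluationPeriods: 3,
  alarmDescription: 'API CPU above 90% for 3 consecutive periods',
  actionsEnabled: true,
});
// Alarm actions are attached with addAlarmAction, not passed as a prop
highCpu.addAlarmAction(new cwActions.SnsAction(snsTopic));
const highLatency = new cloudwatch.Alarm(this, 'HighLatency', {
  metric: new cloudwatch.Metric({
    namespace: 'EduFly',
    metricName: 'ApiResponseTime',
    statistic: 'p99',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 2000, // 2 seconds
  evaluationPeriods: 2,
  alarmDescription: 'P99 latency above 2s',
});
highLatency.addAlarmAction(new cwActions.SnsAction(snsTopic));
We get Slack notifications within 60 seconds of any anomaly. During school exam periods (our highest traffic), I watch the CloudWatch dashboard in real time.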
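The `evaluationPeriods` setting in the alarms above is what keeps a single noisy datapoint from paging anyone: the alarm fires only when the most recent N datapoints all breach the threshold. Below is a simplified model of that rule for illustration. It is not CloudWatch's implementation, it ignores missing-data handling, and the function name `isInAlarm` is hypothetical. It uses greater-than-or-equal because that is the CDK `Alarm` default comparison.

```typescript
// datapoints: one value per metric period, oldest first.
// Returns true when the latest `evaluationPeriods` datapoints all breach the threshold.
function isInAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number
): boolean {
  if (evaluationPeriods <= 0 || datapoints.length < evaluationPeriods) {
    return false; // not enough data to evaluate
  }
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value >= threshold);
}
```

Under this model a CPU series of 50, 91, 95, 92 trips a three-period alarm at a threshold of 90, while 91, 95, 80 does not, because the latest period recovered.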
Key Takeaways
- Start with the simplest architecture that works, then scale: EduFly started at $50/month and grew to $500/month serving 15K users
- Use Fargate to avoid managing EC2 instances and let AWS handle the underlying infrastructure
- Fargate Spot for background jobs saves up to 70%: accept occasional interruptions in exchange for the discount
- CloudFront CDN is the easiest performance win: one configuration change improves load times globally
- Multi-AZ is non-negotiable for production: a single availability zone failure shouldn't take down your app
- Invest in monitoring from day one: CloudWatch alarms saved us from multiple incidents at EduFly
- S3 lifecycle policies keep storage costs from growing linearly: archive old data automatically
- The deployment discipline from BEL carries over directly: health checks, rolling updates, and automatic rollback
- Right-size monthly: review CloudWatch metrics and downsize over-provisioned resources
