AWS Architecture for Startups: From Zero to Scale

Cost-effective cloud architecture that grows with your startup. How I architected EduFly to serve 15K+ users on AWS with 99.9% uptime.

Harsh Rastogi
Oct 20, 2024 · 16 min read
Tags: AWS, Cloud, Architecture, Startup, DevOps

Building EduFly's Cloud Infrastructure

When I founded EduFly — an AI-powered School ERP serving multiple schools — I faced the classic startup infrastructure dilemma: build for scale from day one (and burn cash on unused capacity) or start minimal and rewrite later (and suffer downtime during growth). I chose a middle path: build with scalability in mind, but only pay for what you need today.

The result is an AWS infrastructure that serves 15,000+ users with 99.9% uptime while keeping monthly costs under control. These same architectural principles now inform how I build cloud infrastructure at Modelia.ai for our Shopify-integrated AI fashion platform.

The Architecture

EduFly's AWS setup, refined over 18 months of production operation:

  • Compute: ECS Fargate. Serverless containers that scale through ECS Service Auto Scaling, with no EC2 instances to manage and no AMIs to maintain.
  • Database: RDS PostgreSQL with automated backups, point-in-time recovery, and read replicas for analytics queries
  • Cache: ElastiCache Redis for session management, API response caching, and real-time notifications
  • Storage: S3 for student documents, report cards, school logos, and assessment attachments. Lifecycle policies automatically archive old files to Glacier.
  • CDN: CloudFront for static assets (React bundles, images, fonts) with edge locations across India
  • DNS: Route 53 with health checks and automatic failover
  • Monitoring: CloudWatch with custom dashboards, alarms, and automated incident response
  • Secrets: AWS Secrets Manager for database credentials, API keys, and encryption keys

Starting Small, Scaling Smart

Phase 1: MVP (0-1,000 users) — ~$50/month

The initial deployment was deliberately minimal:

yaml
# Initial ECS task definition
taskDefinition:
  family: edufly-api
  cpu: 256        # 0.25 vCPU
  memory: 512     # 512 MB
  containers:
    - name: api
      image: edufly-api:latest
      portMappings:
        - containerPort: 3000

# Single RDS instance
rdsInstance:
  instanceClass: db.t3.micro
  allocatedStorage: 20       # GiB
  multiAZ: false             # single AZ is acceptable at MVP stage
  backupRetentionPeriod: 7   # days

At this stage, a single ECS task and a single RDS instance handled everything. Total monthly cost: approximately $50 (RDS micro + Fargate minimal + S3 + Route 53).

Phase 2: Growth (1,000-10,000 users) — ~$200/month

As schools adopted EduFly, we needed autoscaling and caching:

yaml
# ECS Service with autoscaling
service:
  desiredCount: 2
  capacityProvider: FARGATE
  autoScaling:
    minCapacity: 2
    maxCapacity: 6
    targetCpuUtilization: 70      # percent
    targetMemoryUtilization: 75   # percent
    scaleInCooldown: 300          # seconds; scale in slowly
    scaleOutCooldown: 60          # seconds; scale out quickly

# Add ElastiCache for caching
elasticache:
  engine: redis
  nodeType: cache.t3.micro
  numCacheNodes: 1

The key addition was Redis caching. Student profiles, attendance records, and timetable data are read-heavy and write-light — perfect for caching. Adding Redis reduced our RDS CPU usage by 40%.
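A cache-aside lookup for this kind of read-heavy data can be sketched as follows. This is a minimal illustration under stated assumptions, not EduFly's production code: the `CacheStore` interface, the `InMemoryCache` stand-in for Redis, the `student:<id>` key format, and the 300-second TTL used in the usage note are all hypothetical.

```typescript
// Cache-aside sketch. CacheStore covers only the operations used here;
// InMemoryCache is a hypothetical stand-in so the example runs without Redis.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  del(key: string): Promise<void>;
}

class InMemoryCache implements CacheStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  async del(key: string): Promise<void> {
    this.entries.delete(key);
  }
}

// Return the cached value if present; otherwise load it from the database,
// store it with a TTL, and return it.
async function getOrLoad<T>(
  cache: CacheStore,
  key: string,
  ttlSeconds: number,
  loadFromDb: () => Promise<T>
): Promise<T> {
  const cached = await cache.get(key);
  if (cached !== null) return JSON.parse(cached) as T;

  const fresh = await loadFromDb();
  await cache.set(key, JSON.stringify(fresh), ttlSeconds);
  return fresh;
}
```

A read would look like `getOrLoad(cache, "student:42", 300, () => loadStudent(42))`, where `loadStudent` is whatever function queries PostgreSQL. On a write, calling `cache.del("student:42")` lets the next read repopulate the entry, which keeps stale data bounded by the TTL at worst.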

Phase 3: Scale (10,000+ users) — ~$500/month

Phase 3 added Multi-AZ deployment for high availability, a read replica for analytics queries, and CloudFront as the CDN:

yaml
# Multi-AZ RDS with read replica
rdsInstance:
  instanceClass: db.t3.small
  multiAZ: true              # Automatic failover
  readReplica:
    instanceClass: db.t3.micro
    region: ap-south-1       # Same region for low latency

# CloudFront distribution
cloudfront:
  origins:
    - s3Bucket: edufly-static-assets
    - elbEndpoint: api.edufly.app
  defaultCacheBehavior:
    ttl: 86400               # 24 hours for static assets
  customErrorResponse:
    errorCode: 404             # a private S3 origin may return 403 for missing keys; map that code too if so
    responseCode: 200
    responsePage: /index.html  # SPA fallback

Cost Optimization Strategies

Running a startup means every dollar matters. Here's how I keep AWS costs under control:

1. Fargate Spot for Non-Critical Workloads (up to 70% savings)

Background jobs like report generation, data exports, and AI model training run on Fargate Spot, which sells spare AWS capacity at a steep discount and can reclaim it after a two-minute warning:

typescript
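// ECS service for background jobs, placed mostly on Fargate Spot
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Assumes `cluster` is an ecs.Cluster created with
// enableFargateCapacityProviders: true so both providers are available.
const backgroundTask = new ecs.FargateTaskDefinition(this, 'BackgroundTask', {
  cpu: 512,
  memoryLimitMiB: 1024,
});

new ecs.FargateService(this, 'BackgroundService', {
  cluster,
  taskDefinition: backgroundTask,
  capacityProviderStrategies: [
    // Weights set a placement ratio, not a failover order: roughly 4 of
    // every 5 tasks land on Spot and the rest on on-demand Fargate.
    { capacityProvider: 'FARGATE_SPOT', weight: 4 },
    { capacityProvider: 'FARGATE', weight: 1 },
  ],
});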
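Because a Spot task can be reclaimed, a background worker should stop picking up new jobs once it receives SIGTERM and leave the rest for another task. The sketch below shows one way to do that. It is an illustration under assumptions rather than EduFly's worker: `SignalSource`, `DrainableWorker`, and `stopOnSigterm` are hypothetical names, and jobs are modelled as a plain array instead of a real queue.

```typescript
// Graceful-shutdown sketch for a worker on Fargate Spot.
// SignalSource matches the part of Node's `process` object used here, so the
// handler can be exercised without sending a real signal.
interface SignalSource {
  on(event: "SIGTERM", listener: () => void): unknown;
}

class DrainableWorker<Job> {
  private stopping = false;

  // Called from the SIGTERM handler: finish the current job, start no new ones.
  requestStop(): void {
    this.stopping = true;
  }

  // Processes jobs in order and returns the jobs that were NOT started,
  // so the caller can leave them queued for another task.
  async run(jobs: Job[], handle: (job: Job) => Promise<void>): Promise<Job[]> {
    const pending = jobs.slice();
    while (pending.length > 0) {
      if (this.stopping) break;
      const job = pending.shift() as Job;
      await handle(job);
    }
    return pending;
  }
}

// Wire the worker to the termination signal.
function stopOnSigterm(source: SignalSource, worker: { requestStop(): void }): void {
  source.on("SIGTERM", () => worker.requestStop());
}
```

In a Node.js container this would be wired up with `stopOnSigterm(process, worker)`. Individual jobs need to finish well inside the two-minute warning window, so long exports are better split into smaller, resumable units.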

2. RDS Reserved Instances (about 40% savings)

For the production database that runs 24/7, a 1-year reserved instance saved us about 40% compared to on-demand pricing. The exact discount depends on the instance class and payment option.

3. S3 Lifecycle Policies

Student documents from previous academic years don't need instant access. A lifecycle rule transitions them to cheaper storage classes automatically:

json
{
  "Rules": [
    {
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
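To make the effect of the rule above concrete, here is a small sketch that maps an object's age to the storage class it should end up in. The `Transition` type and `expectedStorageClass` function are hypothetical helpers for illustration. S3 applies lifecycle actions asynchronously, so the real transition can lag the threshold slightly.

```typescript
type StorageClass = "STANDARD" | "STANDARD_IA" | "GLACIER";

interface Transition {
  days: number;
  storageClass: StorageClass;
}

// The two transitions from the lifecycle rule above.
const transitions: Transition[] = [
  { days: 90, storageClass: "STANDARD_IA" },
  { days: 365, storageClass: "GLACIER" },
];

// Returns the class an object is expected to be in once it is `ageInDays` old:
// the class of the latest transition whose threshold has passed.
function expectedStorageClass(ageInDays: number, rules: Transition[]): StorageClass {
  let current: StorageClass = "STANDARD";
  const sorted = rules.slice().sort((a, b) => a.days - b.days);
  for (const rule of sorted) {
    if (ageInDays >= rule.days) current = rule.storageClass;
  }
  return current;
}
```

A report card uploaded 30 days ago stays in `STANDARD`, one from 120 days ago is in `STANDARD_IA`, and anything older than a year is in `GLACIER`.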

4. CloudWatch-Based Right-Sizing

We review resource utilization monthly. If an ECS task consistently uses less than 50% of its allocated CPU/memory, we downsize it:

bash
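# Check average CPU utilization for the last 7 days (GNU date syntax).
# ECS service metrics need both ClusterName and ServiceName dimensions;
# "edufly-cluster" is a placeholder for the real cluster name.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=edufly-cluster Name=ServiceName,Value=edufly-api \
  --start-time "$(date -d '7 days ago' -Iseconds)" \
  --end-time "$(date -Iseconds)" \
  --period 86400 \
  --statistics Average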
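The downsizing rule can be written down as a small check over the datapoints that command returns. This is a sketch of the rule of thumb described above, not a tool we ship: `Datapoint` mirrors the fields of the CLI's JSON output that are used here, and `shouldDownsize` is a hypothetical helper that reads "consistently" as "every daily average in the window".

```typescript
// One entry of the `Datapoints` array returned by
// `aws cloudwatch get-metric-statistics ... --statistics Average`.
interface Datapoint {
  Timestamp: string;
  Average: number; // percent of the CPU allocated to the service
}

// Recommend downsizing only when every daily average in the window is below
// the threshold (50% by default). With no data, make no recommendation.
function shouldDownsize(datapoints: Datapoint[], thresholdPercent = 50): boolean {
  if (datapoints.length === 0) return false;
  return datapoints.every((d) => d.Average < thresholdPercent);
}
```

Daily averages hide short spikes, so it is worth checking the `Maximum` statistic for the same window before shrinking a task.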

Docker + ECS Deployment

The deployment discipline I learned at Bharat Electronics Limited (BEL), where every release for Air Force projects required rigorous review, carries over directly to ECS deployments:

yaml
# ECS Task Definition with health check
containerDefinitions:
  - name: api
    image: 123456.dkr.ecr.ap-south-1.amazonaws.com/edufly-api:latest
    memory: 512
    cpu: 256
    essential: true
    portMappings:
      - containerPort: 3000
    healthCheck:
      command: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
      interval: 30
      timeout: 5
      retries: 3
      startPeriod: 60
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/edufly-api
        awslogs-region: ap-south-1
        awslogs-stream-prefix: ecs
    environment:
      - name: NODE_ENV
        value: production
    secrets:
      - name: DATABASE_URL
        valueFrom: arn:aws:secretsmanager:ap-south-1:123456:secret:edufly/database-url

CI/CD Pipeline

Our deployment pipeline (now also used at Modelia.ai):

  • GitHub Actions — Run tests, lint, and type-check on every PR
  • Docker build — Multi-stage build for minimal production image
  • ECR push — Push tagged image to Amazon Elastic Container Registry
  • ECS rolling update — New tasks start with the new image; old tasks drain connections gracefully
  • Health verification — ECS waits for health checks to pass before routing traffic
  • Rollback trigger — If error rate spikes above 5% in CloudWatch, automatically roll back to previous task definition
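The rollback trigger in the last step comes down to a simple ratio check. The sketch below is illustrative only: `RequestCounts`, `errorRate`, and `shouldRollBack` are hypothetical names, and counting only 5xx responses as errors is an assumption on my part.

```typescript
interface RequestCounts {
  total: number;  // all requests in the observation window
  errors: number; // requests counted as failures (assumed here: 5xx responses)
}

// Fraction of failed requests; an empty window counts as healthy.
function errorRate(counts: RequestCounts): number {
  if (counts.total === 0) return 0;
  return counts.errors / counts.total;
}

// Roll back when the error rate is strictly above the threshold (5% by default).
function shouldRollBack(counts: RequestCounts, threshold = 0.05): boolean {
  return errorRate(counts) > threshold;
}
```

With 1,000 requests in the window, 51 failures trigger a rollback and 50 do not. In practice the window should contain enough traffic that a single failed request cannot trip the rule on its own.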

Monitoring and Alerting

The monitoring setup that keeps EduFly running at 99.9% uptime:

typescript
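// CloudWatch alarms via CDK
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';

// Assumes `apiService` is the ecs.FargateService and `snsTopic` is an sns.Topic.
const highCpu = new cloudwatch.Alarm(this, 'HighCpu', {
  metric: apiService.metricCpuUtilization(),
  threshold: 90,
  evaluationPeriods: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
  alarmDescription: 'API CPU above 90% for 3 consecutive periods',
});
// Alarm actions are attached with addAlarmAction, not a constructor prop
highCpu.addAlarmAction(new cwActions.SnsAction(snsTopic));

const highLatency = new cloudwatch.Alarm(this, 'HighLatency', {
  metric: new cloudwatch.Metric({
    namespace: 'EduFly',
    metricName: 'ApiResponseTime',
    statistic: 'p99',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 2000, // milliseconds (2 seconds)
  evaluationPeriods: 2,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
  alarmDescription: 'P99 latency above 2s',
});
highLatency.addAlarmAction(new cwActions.SnsAction(snsTopic));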

We get Slack notifications within about 60 seconds of an alarm firing. During school exam periods (our highest traffic), I watch the CloudWatch dashboard in real time.
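For intuition about what `evaluationPeriods` does, here is a simplified model of alarm evaluation. It is a sketch under assumptions: `isInAlarm` is a hypothetical function, a datapoint counts as breaching when it is strictly above the threshold, and CloudWatch features such as missing-data handling and `datapointsToAlarm` are ignored.

```typescript
// Simplified alarm model: the alarm fires when the most recent
// `evaluationPeriods` datapoints are all above the threshold.
function isInAlarm(datapoints: number[], threshold: number, evaluationPeriods: number): boolean {
  if (evaluationPeriods < 1 || datapoints.length < evaluationPeriods) return false;
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value > threshold);
}
```

For the latency alarm, `isInAlarm([800, 2400, 2600], 2000, 2)` is true, while a single slow period followed by a normal one is not. With a 5-minute period and two evaluation periods, that alarm needs about ten minutes of sustained high latency before it fires.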

Key Takeaways

  • Start with the simplest architecture that works, then scale: EduFly started at $50/month and grew to $500/month serving 15K users
  • Use Fargate to avoid managing EC2 instances — let AWS handle the underlying infrastructure
  • Fargate Spot for background jobs saves up to 70%: accept occasional interruptions in exchange for large savings
  • CloudFront CDN is the easiest performance win — one configuration change improves load times globally
  • Multi-AZ is non-negotiable for production — a single availability zone failure shouldn't take down your app
  • Invest in monitoring from day one — CloudWatch alarms saved us from multiple incidents at EduFly
  • S3 lifecycle policies prevent storage costs from growing linearly — archive old data automatically
  • The deployment discipline from BEL applies perfectly — health checks, rolling updates, and automatic rollback
  • Right-size monthly — review CloudWatch metrics and downsize over-provisioned resources

Harsh Rastogi

Full Stack Engineer building production AI systems at Modelia. Previously at Asynq and Bharat Electronics Limited. Published researcher.
