AWS Architecture for Startups: From Zero to Scale

Harsh Rastogi

Building EduFly's Cloud Infrastructure

When I founded EduFly, an AI-powered School ERP serving multiple schools, I faced the classic startup infrastructure dilemma: build for scale from day one (and burn cash on unused capacity) or start minimal and rewrite later (and suffer downtime during growth). I chose a middle path: design for scalability, but pay only for the capacity needed today.

The result is an AWS infrastructure that serves 15,000+ users with 99.9% uptime for roughly $500 a month. These same architectural principles now inform how I build cloud infrastructure at Modelia.ai for our Shopify-integrated AI fashion platform.

The Architecture

EduFly's AWS setup, refined over 18 months of production operation:

›Compute: ECS Fargate — serverless containers that scale automatically. No EC2 instances to manage, no AMIs to maintain.
›Database: RDS PostgreSQL with automated backups, point-in-time recovery, and read replicas for analytics queries
›Cache: ElastiCache Redis for session management, API response caching, and real-time notifications
›Storage: S3 for student documents, report cards, school logos, and assessment attachments. Lifecycle policies automatically archive old files to Glacier.
›CDN: CloudFront for static assets (React bundles, images, fonts) with edge locations across India
›DNS: Route 53 with health checks and automatic failover
›Monitoring: CloudWatch with custom dashboards, alarms, and automated incident response
›Secrets: AWS Secrets Manager for database credentials, API keys, and encryption keys

Starting Small, Scaling Smart

Phase 1: MVP (0-1,000 users) — ~$50/month

The initial deployment was deliberately minimal:

yaml

# Initial ECS task definition
taskDefinition:
  family: edufly-api
  cpu: 256        # 0.25 vCPU
  memory: 512     # 512 MB
  containers:
    - name: api
      image: edufly-api:latest
      portMappings:
        - containerPort: 3000

# Single RDS instance
rdsInstance:
  instanceClass: db.t3.micro
  allocatedStorage: 20
  multiAZ: false
  backupRetentionPeriod: 7

At this stage, a single ECS task and a single RDS instance handled everything. Total monthly cost: approximately $50 (RDS micro + Fargate minimal + S3 + Route 53).

Phase 2: Growth (1,000-10,000 users) — ~$200/month

As schools adopted EduFly, we needed autoscaling and caching:

yaml

# ECS Service with autoscaling
service:
  desiredCount: 2
  capacityProvider: FARGATE
  autoScaling:
    minCapacity: 2
    maxCapacity: 6
    targetCpuUtilization: 70
    targetMemoryUtilization: 75
    scaleInCooldown: 300
    scaleOutCooldown: 60

# Add ElastiCache for caching
elasticache:
  engine: redis
  nodeType: cache.t3.micro
  numCacheNodes: 1
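
To make the autoscaling numbers above concrete, here is a rough sketch of how a target-tracking policy could arrive at a task count. It is an approximation for intuition only, not the algorithm AWS runs internally; `ScalingPolicy` and `desiredTaskCount` are hypothetical names, and cooldowns are ignored.

```typescript
// Simplified model of target tracking: scale the task count in proportion
// to how far the observed utilization is from the target, then clamp.
// This is an illustrative assumption, not AWS's exact behavior.
interface ScalingPolicy {
  minCapacity: number;
  maxCapacity: number;
  targetUtilization: number; // percent, e.g. 70 for targetCpuUtilization
}

function desiredTaskCount(
  currentTasks: number,
  observedUtilization: number, // percent, averaged across running tasks
  policy: ScalingPolicy,
): number {
  // Enough tasks so that per-task utilization lands at or below the target.
  const raw = Math.ceil((currentTasks * observedUtilization) / policy.targetUtilization);
  // Never go below minCapacity or above maxCapacity.
  return Math.min(policy.maxCapacity, Math.max(policy.minCapacity, raw));
}

// Values taken from the Phase 2 configuration above.
const cpuPolicy: ScalingPolicy = { minCapacity: 2, maxCapacity: 6, targetUtilization: 70 };

console.log(desiredTaskCount(2, 95, cpuPolicy));  // busy: scales out
console.log(desiredTaskCount(4, 20, cpuPolicy));  // quiet: scales in to the minimum
console.log(desiredTaskCount(6, 140, cpuPolicy)); // overloaded: capped at the maximum
```

The short scale-out cooldown (60 seconds) and longer scale-in cooldown (300 seconds) bias the service toward adding capacity quickly and removing it cautiously.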

The key addition was Redis caching. Student profiles, attendance records, and timetable data are read-heavy and write-light — perfect for caching. Adding Redis reduced our RDS CPU usage by 40%.
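
The caching approach described here is commonly implemented as cache-aside: check the cache, fall back to the database on a miss, then store the result with a TTL. The sketch below is a minimal illustration, not EduFly's actual code. An in-memory `Map` stands in for Redis so the example runs on its own, and `TtlCache`, `getOrLoad`, and `loadTimetableFromDb` are hypothetical names.

```typescript
// In-memory stand-in for Redis with per-entry expiry (assumption for the sketch).
class TtlCache<T> {
  private readonly store = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Call after a write so readers do not see stale data.
  invalidate(key: string): void {
    this.store.delete(key);
  }
}

// Cache-aside read: return the cached value, or load it and cache it.
async function getOrLoad<T>(
  cache: TtlCache<T>,
  key: string,
  load: () => Promise<T>,
): Promise<T> {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await load();
  cache.set(key, fresh);
  return fresh;
}

// Hypothetical loader standing in for a PostgreSQL query.
async function loadTimetableFromDb(classId: string): Promise<string[]> {
  return [`${classId}: Maths`, `${classId}: Science`];
}

const timetableCache = new TtlCache<string[]>(5 * 60 * 1000); // 5-minute TTL
getOrLoad(timetableCache, "timetable:8A", () => loadTimetableFromDb("8A"))
  .then((rows) => console.log(rows));
```

Read-heavy, write-light data suits this pattern because most requests are served from the cache, and the rare write only needs to invalidate one key.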

Phase 3: Scale (10,000+ users) — ~$500/month

Phase 3 added Multi-AZ deployment for high availability, a read replica for analytics queries, and CloudFront as the CDN:

yaml

# Multi-AZ RDS with read replica
rdsInstance:
  instanceClass: db.t3.small
  multiAZ: true              # Automatic failover
  readReplica:
    instanceClass: db.t3.micro
    region: ap-south-1       # Same region for low latency

# CloudFront distribution
cloudfront:
  origins:
    - s3Bucket: edufly-static-assets
    - elbEndpoint: api.edufly.app
  defaultCacheBehavior:
    ttl: 86400               # 24 hours for static assets
  customErrorResponse:
    errorCode: 404
    responseCode: 200
    responsePage: /index.html  # SPA fallback

Cost Optimization Strategies

Running a startup means every dollar matters. Here's how I keep AWS costs under control:

1. Fargate Spot for Non-Critical Workloads (70% savings)

Background jobs like report generation, data exports, and AI model training run on Fargate Spot — AWS's unused capacity at a steep discount:

typescript

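// ECS service for background jobs, placed mostly on Fargate Spot
import * as ecs from 'aws-cdk-lib/aws-ecs';

// `cluster` is assumed to be an ecs.Cluster created with
// enableFargateCapacityProviders: true so both providers are available.
const backgroundTask = new ecs.FargateTaskDefinition(this, 'BackgroundTask', {
  cpu: 512,
  memoryLimitMiB: 1024,
});

new ecs.FargateService(this, 'BackgroundService', {
  cluster,
  taskDefinition: backgroundTask,
  capacityProviderStrategies: [
    // Weights set a placement ratio, not a strict fallback:
    // roughly 4 of every 5 tasks land on Spot, 1 on on-demand.
    { capacityProvider: 'FARGATE_SPOT', weight: 4 },
    { capacityProvider: 'FARGATE', weight: 1 },
  ],
});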
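
Capacity provider weights describe a ratio between providers. As a rough illustration of what a 4:1 split means for a given task count, the sketch below divides tasks proportionally. It approximates the idea only, ignores the optional `base` setting, and is not the ECS scheduler's actual placement logic; `ProviderWeight` and `splitTasksByWeight` are hypothetical names.

```typescript
interface ProviderWeight {
  capacityProvider: string;
  weight: number;
}

// Split a task count across providers in proportion to their weights.
function splitTasksByWeight(
  totalTasks: number,
  strategy: ProviderWeight[],
): Record<string, number> {
  const totalWeight = strategy.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("at least one provider needs a positive weight");
  }

  const counts: Record<string, number> = {};
  let assigned = 0;
  for (const s of strategy) {
    // Round down first; leftovers are handed out below.
    const share = Math.floor((totalTasks * s.weight) / totalWeight);
    counts[s.capacityProvider] = share;
    assigned += share;
  }

  // Give any remaining tasks to the highest-weighted providers first.
  const byWeight = [...strategy].sort((a, b) => b.weight - a.weight);
  for (let i = 0; assigned < totalTasks; i++, assigned++) {
    counts[byWeight[i % byWeight.length].capacityProvider] += 1;
  }
  return counts;
}

const backgroundStrategy: ProviderWeight[] = [
  { capacityProvider: "FARGATE_SPOT", weight: 4 },
  { capacityProvider: "FARGATE", weight: 1 },
];

console.log(splitTasksByWeight(10, backgroundStrategy));
```

Because some tasks always run on Spot, background jobs should tolerate interruption, for example by being idempotent and safe to retry.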

2. RDS Reserved Instances (40% savings)

For the production database that runs 24/7, a 1-year reserved instance saves 40% compared to on-demand pricing.
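
The arithmetic behind this trade-off is simple enough to sketch. A reserved instance is billed for every hour whether or not it is used, so a 40% discount only pays off if the database runs more than 60% of the time. The hourly rate below is a made-up placeholder, not an AWS price, and the function names are hypothetical.

```typescript
const HOURS_PER_YEAR = 24 * 365;

// On-demand: pay only for the fraction of the year the instance runs.
function annualOnDemandCost(hourlyRate: number, utilization: number = 1): number {
  return hourlyRate * HOURS_PER_YEAR * utilization;
}

// Reserved: pay the discounted rate for every hour of the year.
function annualReservedCost(hourlyRate: number, discount: number): number {
  return hourlyRate * HOURS_PER_YEAR * (1 - discount);
}

// Fraction of the year the instance must run for reserved pricing to win.
function breakEvenUtilization(discount: number): number {
  return 1 - discount;
}

const placeholderHourlyRate = 0.05; // hypothetical USD/hour, for illustration only
console.log(annualOnDemandCost(placeholderHourlyRate));
console.log(annualReservedCost(placeholderHourlyRate, 0.4));
console.log(breakEvenUtilization(0.4));
```

A production database that runs around the clock sits at 100% utilization, well past the break-even point, which is why the reservation makes sense there and not for intermittent workloads.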

3. S3 Lifecycle Policies

Student documents from previous academic years don't need instant access. Automatically transition to cheaper storage:

json

{
  "Rules": [
    {
      "ID": "archive-old-documents",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
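
To see what these transitions mean for a given file, the sketch below maps an object's age to the storage class it would have reached under the 90-day and 365-day rules. It is a simplified model that ignores S3's exact transition timing; `Transition` and `storageClassForAge` are hypothetical names.

```typescript
type StorageClass = "STANDARD" | "STANDARD_IA" | "GLACIER";

interface Transition {
  days: number;
  storageClass: StorageClass;
}

// Mirrors the transitions in the lifecycle policy above.
const documentTransitions: Transition[] = [
  { days: 90, storageClass: "STANDARD_IA" },
  { days: 365, storageClass: "GLACIER" },
];

// Return the storage class an object of the given age would be in.
function storageClassForAge(
  ageInDays: number,
  rules: Transition[] = documentTransitions,
): StorageClass {
  let current: StorageClass = "STANDARD";
  // Apply transitions in ascending order; the last one reached wins.
  for (const t of [...rules].sort((a, b) => a.days - b.days)) {
    if (ageInDays >= t.days) current = t.storageClass;
  }
  return current;
}

console.log(storageClassForAge(30));  // this term's report card
console.log(storageClassForAge(120)); // last term's documents
console.log(storageClassForAge(400)); // previous academic year
```

Keep in mind that Glacier objects need a restore step before they can be read, so this tiering fits documents that are rarely requested after the academic year ends.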

4. CloudWatch-Based Right-Sizing

We review resource utilization monthly. If an ECS task consistently uses less than 50% of its allocated CPU/memory, we downsize it:

bash

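# Check average daily CPU utilization for the last 7 days.
# ECS service metrics need both ClusterName and ServiceName dimensions;
# "edufly-cluster" is a placeholder for the actual cluster name.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=edufly-cluster Name=ServiceName,Value=edufly-api \
  --start-time "$(date -d '7 days ago' -Iseconds)" \
  --end-time "$(date -Iseconds)" \
  --period 86400 \
  --statistics Average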
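
The monthly review rule can be written down as a small decision function. This is a sketch of the rule as stated (downsize when usage stays below 50%), not a tool we run in production; `suggestCpuUnits` is a hypothetical name, and the CPU list is a subset of the sizes Fargate supports.

```typescript
// Subset of valid Fargate CPU sizes, in CPU units (1024 = 1 vCPU).
const FARGATE_CPU_UNITS = [256, 512, 1024, 2048, 4096];

// Suggest a CPU size given daily average utilization percentages.
function suggestCpuUnits(
  currentCpu: number,
  dailyAveragePercent: number[],
  thresholdPercent: number = 50,
): number {
  // No data means no evidence: leave the task as it is.
  if (dailyAveragePercent.length === 0) return currentCpu;

  // "Consistently" is read as: every day in the window is under the threshold.
  const consistentlyLow = dailyAveragePercent.every((p) => p < thresholdPercent);
  if (!consistentlyLow) return currentCpu;

  const index = FARGATE_CPU_UNITS.indexOf(currentCpu);
  // Already the smallest size, or a size this sketch does not know about.
  if (index <= 0) return currentCpu;

  return FARGATE_CPU_UNITS[index - 1];
}

console.log(suggestCpuUnits(512, [22, 31, 18, 40, 35, 27, 30])); // steps down one size
console.log(suggestCpuUnits(512, [22, 31, 72, 40, 35, 27, 30])); // one busy day: keep
```

Daily averages can hide short spikes, so it is worth checking the Maximum statistic as well before downsizing. Fargate also restricts which memory values pair with each CPU size, so a CPU change may force a memory change.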

Docker + ECS Deployment

The deployment discipline I learned at Bharat Electronics Limited (BEL), where every release for Air Force projects required rigorous review, carries over directly to AWS ECS:

yaml

# ECS Task Definition with health check
containerDefinitions:
  - name: api
    image: 123456.dkr.ecr.ap-south-1.amazonaws.com/edufly-api:latest
    memory: 512
    cpu: 256
    essential: true
    portMappings:
      - containerPort: 3000
    healthCheck:
      command: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
      interval: 30
      timeout: 5
      retries: 3
      startPeriod: 60
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/edufly-api
        awslogs-region: ap-south-1
        awslogs-stream-prefix: ecs
    environment:
      - name: NODE_ENV
        value: production
    secrets:
      - name: DATABASE_URL
        valueFrom: arn:aws:secretsmanager:ap-south-1:123456:secret:edufly/database-url

CI/CD Pipeline

Our deployment pipeline (now also used at Modelia.ai):

›GitHub Actions — Run tests, lint, and type-check on every PR
›Docker build — Multi-stage build for minimal production image
›ECR push — Push tagged image to Amazon Elastic Container Registry
›ECS rolling update — New tasks start with the new image; old tasks drain connections gracefully
›Health verification — ECS waits for health checks to pass before routing traffic
›Rollback trigger — If error rate spikes above 5% in CloudWatch, automatically roll back to previous task definition
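
The rollback trigger in the last step boils down to one comparison, sketched below as a pure function. The 5% threshold comes from the pipeline description; the minimum request count is an added assumption so that a handful of failed requests right after a deploy does not trigger a rollback. `RequestWindow`, `errorRate`, and `shouldRollBack` are hypothetical names.

```typescript
interface RequestWindow {
  totalRequests: number;
  serverErrors: number; // e.g. HTTP 5xx responses
}

function errorRate(w: RequestWindow): number {
  return w.totalRequests === 0 ? 0 : w.serverErrors / w.totalRequests;
}

// Decide whether the new deployment should be rolled back.
function shouldRollBack(
  windows: RequestWindow[],
  threshold: number = 0.05,  // 5% error rate, from the pipeline description
  minRequests: number = 100, // assumption: ignore tiny samples
): boolean {
  // Aggregate all post-deploy windows into one total.
  const total = windows.reduce<RequestWindow>(
    (acc, w) => ({
      totalRequests: acc.totalRequests + w.totalRequests,
      serverErrors: acc.serverErrors + w.serverErrors,
    }),
    { totalRequests: 0, serverErrors: 0 },
  );
  if (total.totalRequests < minRequests) return false;
  return errorRate(total) > threshold;
}

console.log(shouldRollBack([{ totalRequests: 1000, serverErrors: 80 }])); // 8% errors
console.log(shouldRollBack([{ totalRequests: 1000, serverErrors: 30 }])); // 3% errors
```

In practice this decision would be driven by a CloudWatch alarm on the error-rate metric, with the rollback itself redeploying the previous task definition revision.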

Monitoring and Alerting

The monitoring setup that keeps EduFly running at 99.9% uptime:

typescript

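// CloudWatch alarms via CDK
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';

// snsTopic is an existing sns.Topic; actions are attached with addAlarmAction
const notify = new cwActions.SnsAction(snsTopic);

const highCpu = new cloudwatch.Alarm(this, 'HighCpuUtilization', {
  metric: apiService.metricCpuUtilization(),
  threshold: 90,
  evaluationPeriods: 3,
  alarmDescription: 'API CPU at or above 90% for 3 consecutive periods',
  actionsEnabled: true,
});
highCpu.addAlarmAction(notify);

const highLatency = new cloudwatch.Alarm(this, 'HighLatency', {
  metric: new cloudwatch.Metric({
    namespace: 'EduFly',
    metricName: 'ApiResponseTime',
    statistic: 'p99',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 2000, // 2 seconds, assuming the metric is reported in milliseconds
  evaluationPeriods: 2,
  alarmDescription: 'P99 latency at or above 2s',
});
highLatency.addAlarmAction(notify);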

We get Slack notifications within about 60 seconds of an alarm firing. During school exam periods, our highest-traffic windows, I watch the CloudWatch dashboard in real time.
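
It helps to be clear about when alarms like these actually fire. The sketch below models the "N consecutive periods" rule in a simplified way: it uses a greater-than-or-equal comparison (the CDK default operator) and ignores missing-data handling. `isInAlarm` and `minutesUntilAlarm` are hypothetical names.

```typescript
// True when the most recent `evaluationPeriods` datapoints all breach the threshold.
function isInAlarm(
  datapoints: number[], // one value per period, oldest first
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (datapoints.length < evaluationPeriods) return false; // not enough data yet
  return datapoints.slice(-evaluationPeriods).every((v) => v >= threshold);
}

// Shortest time a sustained breach needs before the alarm can fire.
function minutesUntilAlarm(periodMinutes: number, evaluationPeriods: number): number {
  return periodMinutes * evaluationPeriods;
}

console.log(isInAlarm([50, 91, 95, 92], 90, 3)); // three breaching periods in a row
console.log(isInAlarm([95, 80, 95], 90, 3));     // a dip in the middle resets the streak
console.log(minutesUntilAlarm(5, 3));            // 5-minute periods, 3 evaluation periods
```

Requiring several consecutive periods trades detection speed for fewer false alarms: a brief CPU spike during a deploy will not page anyone, but a sustained problem takes a few periods to surface.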

Key Takeaways

›Start with the simplest architecture that works, then scale — EduFly started at $50/month and grew to $500/month serving 15K users
›Use Fargate to avoid managing EC2 instances — let AWS handle the underlying infrastructure
›Fargate Spot for background jobs saves 70% — accept occasional interruptions for massive savings
›CloudFront CDN is the easiest performance win — one configuration change improves load times globally
›Multi-AZ is non-negotiable for production — a single availability zone failure shouldn't take down your app
›Invest in monitoring from day one — CloudWatch alarms saved us from multiple incidents at EduFly
›S3 lifecycle policies prevent storage costs from growing linearly — archive old data automatically
›The deployment discipline from BEL applies perfectly — health checks, rolling updates, and automatic rollback
›Right-size monthly — review CloudWatch metrics and downsize over-provisioned resources

Written by Harsh Rastogi — AI Product Engineer leading AI product direction at Modelia. Connect with me on LinkedIn for more on Shopify, Generative AI, agentic systems, and production engineering.
