I’ve noticed that a lot of teams are quick to reach for the big guns (Databricks, a managed Spark cluster and so on) to process files that arrive in their object storage. In many cases, the processing required is something that coreutils or a well-considered Python script with modest resources can handle in almost no time.

Instead of breaking the bank on an overkill setup, I’d like to show you a pattern that I have been using for processing large files which:

  • Costs almost nothing
  • Is performant
  • Is scalable

and most importantly

  • Is dead simple

The architecture

graph TD
    A[S3 - file lands] --> B[EventBridge]
    B --> C[ECS Fargate Task]
    C --> D[S3 - output]
    C --> E[CloudWatch - observability]

When a file hits S3, a container will spin up and process it, writing the result back to S3 before dying. You pay for compute/storage/bandwidth used and nothing else.

How it works

A file lands in S3
This could be a CSV/TSV dump from a vendor, a Parquet export from another system or a media file. It doesn’t really matter; S3 will emit the event either way.

EventBridge picks it up
S3 sends notifications to EventBridge natively. An EventBridge rule matches Object Created events for your bucket and prefix (e.g. uploads/).

A Fargate task is launched
The rule’s target is an ECS task. EventBridge passes the bucket and object key to the container as overrides.

The work is done
A Python script (or whatever, could be a shell script calling ffmpeg, could be a custom binary) reads the file from S3, processes it and then writes results back to the bucket.
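
As a rough sketch, assuming boto3 is baked into the image and the environment variables are wired up by the Terraform further down, the processor can be as plain as:

import os
import boto3

s3 = boto3.client("s3")

def transform(src, dst):
    # Placeholder processing step: copy the file through unchanged.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())

def main():
    # Input location is injected per-run by EventBridge; output location is
    # static configuration on the task definition.
    bucket = os.environ["S3_BUCKET"]
    key = os.environ["S3_KEY"]
    out_bucket = os.environ["OUTPUT_BUCKET"]
    out_prefix = os.environ.get("OUTPUT_PREFIX", "processed/")

    s3.download_file(bucket, key, "/tmp/input")
    transform("/tmp/input", "/tmp/output")
    s3.upload_file("/tmp/output", out_bucket, out_prefix + os.path.basename(key))

if __name__ == "__main__":
    main()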

The container exits
Your resources have served their purpose. Let AWS reclaim the compute and save your pennies.

Boilerplate example

Here’s a boilerplate example of this infrastructure using Terraform:

S3 bucket with EventBridge notifications

resource "aws_s3_bucket" "pipeline" {
  bucket = "acme-file-pipeline"
}

resource "aws_s3_bucket_notification" "pipeline" {
  bucket      = aws_s3_bucket.pipeline.id
  eventbridge = true
}

Setting eventbridge = true is all you need for S3 to send object-level events to EventBridge on upload.
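
For reference, the event EventBridge receives looks roughly like this (abbreviated; the rule below matches on the source, detail-type, bucket name and key prefix):

{
  "source": "aws.s3",
  "detail-type": "Object Created",
  "detail": {
    "bucket": { "name": "acme-file-pipeline" },
    "object": { "key": "uploads/2026-02-14/transactions-9bc7fd.parquet", "size": 104857600 },
    "reason": "PutObject"
  }
}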

EventBridge rule

resource "aws_cloudwatch_event_rule" "file_uploaded" {
  name = "file-pipeline-trigger"

  event_pattern = jsonencode({
    source      = ["aws.s3"]
    detail-type = ["Object Created"]
    detail = {
      bucket = { name = [aws_s3_bucket.pipeline.id] }
      object = { key = [{ prefix = "uploads/" }] }
    }
  })
}

resource "aws_cloudwatch_event_target" "run_task" {
  rule     = aws_cloudwatch_event_rule.file_uploaded.name
  arn      = aws_ecs_cluster.pipeline.arn
  role_arn = aws_iam_role.eventbridge_role.arn

  ecs_target {
    task_definition_arn = aws_ecs_task_definition.processor.arn
    task_count          = 1
    launch_type         = "FARGATE"

    network_configuration {
      subnets         = var.private_subnet_ids
      security_groups = [aws_security_group.task.id]
    }
  }

  input_transformer {
    input_paths = {
      bucket = "$.detail.bucket.name"
      key    = "$.detail.object.key"
    }
    input_template = <<-EOF
      {
        "containerOverrides": [{
          "name": "processor",
          "environment": [
            { "name": "S3_BUCKET", "value": <bucket> },
            { "name": "S3_KEY", "value": <key> }
          ]
        }]
      }
    EOF
  }
}

input_transformer is the key bit here. It extracts the bucket and key from the S3 event and injects them into the container as environment variables. Your processing script can infer its input from S3_BUCKET and S3_KEY. If you need more granularity, e.g. to process data differently depending on the source, consider having inputs go to different prefixes or buckets altogether.
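
If you do go down the prefix route, the dispatch inside the script stays trivial. A sketch, with made-up prefixes and handler names:

import os

def process_vendor_feed(key): ...   # illustrative handlers,
def transcode_media(key): ...       # replace with real logic
def process_generic(key): ...

key = os.environ["S3_KEY"]

# Route on where the object landed in the bucket.
if key.startswith("uploads/vendor-feeds/"):
    process_vendor_feed(key)
elif key.startswith("uploads/media/"):
    transcode_media(key)
else:
    process_generic(key)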

ECS cluster and task definition

resource "aws_ecs_cluster" "pipeline" {
  name = "file-pipeline"
}

resource "aws_ecs_task_definition" "processor" {
  family                   = "file-processor"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "4096"
  memory                   = "16384"
  execution_role_arn       = aws_iam_role.ecs_execution_role.arn
  task_role_arn            = aws_iam_role.task_role.arn

  container_definitions = jsonencode([{
    name      = "processor"
    image     = "${aws_ecr_repository.processor.repository_url}:latest"
    essential = true

    environment = [
      { name = "OUTPUT_BUCKET", value = aws_s3_bucket.pipeline.id },
      { name = "OUTPUT_PREFIX", value = "processed/" }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        awslogs-group         = aws_cloudwatch_log_group.processor.name
        awslogs-region        = var.region
        awslogs-stream-prefix = "task"
      }
    }
  }])
}

resource "aws_cloudwatch_log_group" "processor" {
  name              = "/ecs/file-processor"
  retention_in_days = 30
}

IAM

resource "aws_iam_role" "eventbridge_role" {
  name = "file-pipeline-eventbridge"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = { Service = "events.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "eventbridge_ecs" {
  role = aws_iam_role.eventbridge_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecs:RunTask"]
        Resource = [aws_ecs_task_definition.processor.arn]
      },
      {
        Effect   = "Allow"
        Action   = ["iam:PassRole"]
        Resource = [
          aws_iam_role.ecs_execution_role.arn,
          aws_iam_role.task_role.arn
        ]
      }
    ]
  })
}

resource "aws_iam_role" "task_role" {
  name = "file-pipeline-task"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "task_s3" {
  role = aws_iam_role.task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.pipeline.arn,
        "${aws_s3_bucket.pipeline.arn}/*"
      ]
    }]
  })
}

That’s essentially it for permissions: EventBridge is allowed to run the task and pass its roles along, and the task role lets the container read from and write back to the S3 bucket. The execution role referenced above (ecs_execution_role) is the standard one with the AWS-managed AmazonECSTaskExecutionRolePolicy attached, so that Fargate can pull the image from ECR and ship logs to CloudWatch.

Observability

CloudWatch natively captures stdout/stderr from every task. Beyond that, I would recommend setting up a few bits:

Failed tasks
Have any failed tasks (be that due to OOM or anything else) trigger a notification to your Slack channel via SNS for manual inspection.
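
One low-effort way to catch these is a second EventBridge rule on ECS task state changes with an SNS topic as the target. A sketch of the event pattern (the cluster ARN is a placeholder, and the exit-code filter is the part you may want to tune):

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": ["<your cluster ARN>"],
    "lastStatus": ["STOPPED"],
    "containers": { "exitCode": [{ "anything-but": [0] }] }
  }
}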

Resource alarms
Using Container Insights you can easily get per-task resource metrics (CPU, memory), and you can create alarms based on these metrics to fit your needs.

Write status updates back to the bucket
You can have the processing script write status updates back to S3. Imagine a simple status.json output which updates throughout the process:

{
    "file": "uploads/2026-02-14/transactions-9bc7fd.parquet",
    "status": "completed",
    "output": "s3://acme-file-pipeline/processed/2026-02-14/transactions-9bc7fd_aggregated.parquet",
    "rows_processed": 47200000,
    "started_at": "2026-02-14T09:15:03Z",
    "completed_at": "2026-02-14T09:15:37Z"
}
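
Writing it from the task is a few lines with boto3 (a sketch; the status/ prefix and the extra fields are illustrative):

import json
import time
import boto3

s3 = boto3.client("s3")

def write_status(bucket, key, status, **extra):
    # Overwrite a small status.json per input so anything can poll it.
    body = json.dumps({
        "file": key,
        "status": status,
        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        **extra,
    })
    s3.put_object(
        Bucket=bucket,
        Key=f"status/{key}.json",
        Body=body.encode(),
        ContentType="application/json",
    )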

You can build a lightweight dashboard that just polls those status files. A static page backed by a simple API that generates pre-signed S3 URLs, plus a bit of JavaScript to fetch status.json from S3, will work absolutely fine for most use cases.
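
The pre-signed URL side of that API is also only a few lines with boto3 (a sketch; pick an expiry that suits you):

import boto3

s3 = boto3.client("s3")

def status_url(bucket, status_key):
    # Short-lived, read-only URL that the dashboard's JavaScript can fetch directly.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": status_key},
        ExpiresIn=300,
    )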

When you actually need the big guns

Use Spark/Databricks/EMR when:

  • You need to join or shuffle across multiple large datasets that don’t fit in memory on a single machine (although even then, something like Polars in streaming mode can often cope; see the sketch after this list)
  • Your processing requires iterative algorithms (e.g. ML training on huge datasets)
  • You actually have petabytes
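
For the first point, here is roughly what the Polars escape hatch looks like before you commit to a cluster. A sketch only: the paths and column names are made up, and the streaming engine’s behaviour depends on your Polars version:

import polars as pl

# Lazily scan the Parquet files and aggregate without materialising them all in memory.
result = (
    pl.scan_parquet("data/transactions-*.parquet")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect(streaming=True)
)

result.write_parquet("out/aggregated.parquet")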

For everything else (one-at-a-time ETL, media transcoding, document processing, report generation, data validation and so on), a Fargate task is, in my opinion, simpler, cheaper, and easier to debug.

Wrapping up

Start boring. Chances are you don’t need that much, and you can always add more later if you do. Fargate is an excellent choice. Lambda also works, but you then have to live with the hard 15-minute timeout per execution, and for sustained compute it usually works out a bit more expensive. Most large public clouds have their own version of each of these pieces, just with different names.

The best infrastructure is the stuff you don’t have to think about. I’ve built the pipeline described in this post multiple times. These pipelines have processed billions of records and transcoded plenty of large media files, and the only time I ever need to look at them is when a Slack alert fires, which practically never happens.