There’s a pattern I see constantly in architecture content: “here’s how Netflix does it.” Kafka clusters, Kubernetes operators, service meshes, dedicated worker fleets. It’s awesome, it’s correct, but it’s also potentially over the top if you’re one person building a product.
But the opposite advice - “just use a simple queue, you’re not Netflix” - also misses the mark, because how you build the simple version matters. If your simple version is a mess of hacked-together polling loops with no resilience, you’re not buying time. You’re accumulating debt and drifting dangerously close to “proto-duction”.
This post is about the middle path. I built a production async processing pipeline - multi-step job processing, file generation, cloud storage, real-time progress, resume on failure - as a solo developer.
I’ll show you exactly how it worked initially, where the enterprise patterns appear in simplified form, and what the concrete upgrade path looks like when you need it.
First, a quick disclaimer. I put this together from notes I’ve accumulated building products over the years. There is no right or wrong answer to any of this, and I was a bit reluctant to post it as I wasn’t sure it would add value. If it does for you, fantastic! If not, I’m always open to discussion on this sort of thing.
I also added a quick price comparison for each pattern, and for these I got an AI to work it out, so it’s very rough.
Let’s go!
A dedicated message broker - think Azure Service Bus, AWS SQS, RabbitMQ.
Messages are pushed onto a queue, and consumers react when a message arrives. You get built-in dead lettering, message locks, retry policies, fan-out, the whole shebang!
A BackgroundJobs table in PostgreSQL:
CREATE TABLE "BackgroundJobs" (
    "Id" UUID PRIMARY KEY,
    "UserId" UUID NOT NULL,
    "JobType" TEXT NOT NULL,
    "Status" INT NOT NULL, -- 0=Queued, 1=Processing, 2=Completed, 3=Failed, 4=Cancelled
    "Payload" JSONB NOT NULL,
    "Result" JSONB,
    "ErrorMessage" TEXT,
    "ProgressPercent" INT,
    "StatusMessage" TEXT,
    "CreatedAt" TIMESTAMPTZ NOT NULL,
    "StartedAt" TIMESTAMPTZ,
    "CompletedAt" TIMESTAMPTZ
);
When the domain service kicks off a job, it POSTs to the processing service’s internal endpoint. That endpoint writes a row with Status=Queued and returns immediately. The caller doesn’t wait.
The PollingJobProcessor - a .NET BackgroundService, which is how the loop gets hosted as an IHostedService - runs inside the processing container. It queries for the oldest queued job, marks it as Processing, calls the appropriate handler, and updates the row to Completed or Failed.
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        var processed = await ProcessNextJobAsync(stoppingToken);
        // Short delay after real work, the current backoff delay after an empty poll.
        await Task.Delay(processed ? _minDelay : _currentDelay, stoppingToken);
        // Min is assumed to be a small helper returning the shorter of two delays.
        if (!processed) _currentDelay = Min(_currentDelay * 2, _maxDelay);
        else _currentDelay = _minDelay;
    }
}
The exponential backoff is important. When there’s nothing to do, the polling interval doubles on each empty poll up to a maximum, reducing pointless database round trips. When a job is found, the interval resets immediately. The database isn’t hammered for no reason.
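One detail worth sketching is the claim step itself. If more than one replica is polling, "find the oldest queued job, then mark it as Processing" should happen in a single statement so two workers can't grab the same row. The sketch below is an assumption about how that could look, not the original ProcessNextJobAsync: the JobClaimer class and its method are hypothetical, and it relies on PostgreSQL's FOR UPDATE SKIP LOCKED and an already-open connection (for example an NpgsqlConnection).

```csharp
using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: claims the oldest queued job in one atomic statement.
public static class JobClaimer
{
    // Status values follow the table comment: 0 = Queued, 1 = Processing.
    public const string ClaimSql = @"
UPDATE ""BackgroundJobs""
SET ""Status"" = 1, ""StartedAt"" = now()
WHERE ""Id"" = (
    SELECT ""Id"" FROM ""BackgroundJobs""
    WHERE ""Status"" = 0
    ORDER BY ""CreatedAt""
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING ""Id"";";

    // Returns the claimed job id, or null when there is nothing queued.
    // Assumes the caller passes an open connection.
    public static async Task<Guid?> TryClaimNextAsync(DbConnection connection, CancellationToken ct)
    {
        using var command = connection.CreateCommand();
        command.CommandText = ClaimSql;
        var result = await command.ExecuteScalarAsync(ct);
        return result is Guid id ? id : (Guid?)null;
    }
}
```

SKIP LOCKED means a second poller skips a row another transaction is already claiming instead of waiting on it, which is what keeps the database-as-queue approach safe once KEDA scales out past one replica.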
When you hit contention - high write volume, or jobs arriving faster than the poller can process them - replace the database write with a Service Bus / SQS message. The handler code doesn’t change. KEDA has native Service Bus and SQS scalers that swap in directly for the Postgres scaler.
Solo: £0/month extra - the queue is your existing database, no additional infrastructure. Enterprise: Azure Service Bus Standard ~£8–10/month base + ~£0.80 per million operations. AWS SQS is ~£0.40 per million requests (practically free at low volume, scales linearly).
KEDA running on Azure Kubernetes Service (AKS) or AWS Elastic Kubernetes Service (EKS), scaling dedicated worker deployments based on queue depth. Platform team owns the cluster, KEDA configuration, and scaling policies.
Azure Container Apps has KEDA built in and managed by Microsoft. You never see the Kubernetes cluster. The scaling rule lives in Terraform:
custom_scale_rule {
  name             = "active-jobs"
  custom_rule_type = "postgresql"
  metadata = {
    query            = "SELECT COUNT(*) FROM \"BackgroundJobs\" WHERE \"Status\" IN (0, 1)"
    targetQueryValue = "1"
  }
  authentication {
    secret_name       = "db-connection-string"
    trigger_parameter = "connection"
  }
}
KEDA runs that query on an interval. When the count exceeds the target value (targetQueryValue in the Postgres scaler’s metadata), it scales up a replica. When the queue empties, it scales back to zero. The container only exists when there’s work to do.
This means at zero load you pay nothing. The container cold-starts when a job arrives, processes it, and scales back down. For a solo dev, that’s the difference between a service costing £5/month and £50/month.
The scaling pattern is identical on AWS EKS with KEDA, or on AKS. Swap the Postgres scaler for an SQS or Service Bus scaler. The target value concept is the same - scale up when queue depth exceeds a threshold.
Solo: ~£5–15/month - Azure Container Apps consumption plan scales to zero, so you pay only for active compute. A lightly used processing service might run a few hours a day. Enterprise: ~£100–250/month minimum - AKS requires at least one always-on node pool. A basic 2-node cluster (Standard_D2s_v3) runs ~£120/month before you add monitoring, ingress, or node autoscaler overhead.
Idempotent message processing - each step of the pipeline publishes an event, downstream services consume it. If step 4 fails, you replay from step 4’s event without re-running steps 1–3. On Azure: Azure Event Grid or Azure Event Hubs. On AWS: Amazon EventBridge or SNS + SQS. Checkpoint state stored in Azure Cache for Redis or AWS ElastiCache for fast lookups.
Every completed unit gets uploaded to blob storage immediately as a checkpoint:
{userId}/jobs/{jobId}/units/unit_{unitId}.output
When a job starts (or restarts), it first checks for existing checkpoint blobs and downloads them. It then skips any unit that already has a checkpoint. You only process what hasn’t been done.
This matters because the per-unit processing is the expensive step - both in time and compute cost. A job with 200 units that fails at unit 180 restarts at unit 181, not unit 1.
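The skip logic can be sketched in a few lines. This is a minimal illustration, assuming integer unit IDs and that the blob listing has already been fetched as a list of names; CheckpointPlanner and its members are hypothetical names, not taken from the original service.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; names and the int unit id are illustrative assumptions.
public static class CheckpointPlanner
{
    // Mirrors the blob layout {userId}/jobs/{jobId}/units/unit_{unitId}.output
    public static string BlobName(Guid userId, Guid jobId, int unitId) =>
        $"{userId}/jobs/{jobId}/units/unit_{unitId}.output";

    // Returns the units that still need processing, in their original order.
    public static IReadOnlyList<int> RemainingUnits(
        Guid userId,
        Guid jobId,
        IEnumerable<int> allUnitIds,
        IEnumerable<string> existingBlobNames)
    {
        var existing = new HashSet<string>(existingBlobNames, StringComparer.Ordinal);
        return allUnitIds
            .Where(unitId => !existing.Contains(BlobName(userId, jobId, unitId)))
            .ToList();
    }
}
```

With 200 units and checkpoints for units 1 to 180 already in storage, RemainingUnits returns the 20 units from 181 onwards, which is the restart behaviour described above.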
The stale job cleanup service handles the crash case automatically. A background service runs on a schedule and finds any job stuck in Processing state beyond a timeout threshold - meaning the container died mid-job. It marks the job as Failed with the checkpoint count in the error payload.
On retry, the domain service can resume from where it left off rather than starting over.
The checkpoint pattern doesn’t change at any scale. It’s purely a resilience strategy for long-running jobs. At enterprise scale you might use a distributed cache (Redis) for faster checkpoint lookups instead of blob storage queries, but the concept is identical.
Solo: ~£1–3/month - blob storage at standard tier is ~£0.02/GB. A few thousand checkpoint files barely registers. Enterprise: Add Redis Cache (~£12–45/month for Basic C1 to Standard C1 on Azure) if checkpoint lookups become a bottleneck. Blob storage cost stays the same.
An abstraction layer over multiple vendors with runtime routing, circuit breakers, and fallback chains. Circuit breaking handled by Polly (.NET), Resilience4j (JVM), or AWS App Mesh / Azure API Management policies at the infrastructure level.
A provider factory with a clean interface:
public interface IProcessingProvider
{
    Task<Stream> ProcessAsync(ProcessingRequest request, CancellationToken ct);
    Task<HealthStatus> GetHealthAsync();
}
Multiple implementations behind that interface, plus a mock provider. The factory resolves the right one based on configuration:
return _settings.Provider switch {
    "ProviderA" => _serviceProvider.GetRequiredService<ProviderAImplementation>(),
    "ProviderB" => _serviceProvider.GetRequiredService<ProviderBImplementation>(),
    "Mock" => _serviceProvider.GetRequiredService<MockProvider>(),
    _ => throw new InvalidOperationException($"Unknown provider: {_settings.Provider}")
};
Switching providers is a config change. The job pipeline doesn’t know which provider it’s using. The mock provider lets the whole pipeline run in tests without any real external service.
Add a circuit breaker (Polly) around each provider call, and a fallback chain - if the primary provider is unavailable, fall back to secondary. At enterprise scale you might route different workloads to different providers based on latency or cost. The interface stays the same.
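To make the fallback chain concrete, here is a deliberately hand-rolled sketch. It is not Polly and it is not a circuit breaker: it only shows the "try the primary, then the secondary" part, using plain delegates so it stays self-contained. FallbackChain is a hypothetical name; in a real service the delegates would wrap IProcessingProvider calls and Polly's policies would likely replace this class entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hand-rolled illustration only; Polly's fallback and circuit breaker
// policies would normally do this job.
public sealed class FallbackChain<T>
{
    private readonly IReadOnlyList<Func<CancellationToken, Task<T>>> _providers;

    public FallbackChain(params Func<CancellationToken, Task<T>>[] providers)
    {
        if (providers.Length == 0)
            throw new ArgumentException("At least one provider is required.", nameof(providers));
        _providers = providers;
    }

    // Tries each provider in order and returns the first successful result.
    public async Task<T> ExecuteAsync(CancellationToken ct)
    {
        var failures = new List<Exception>();
        foreach (var provider in _providers)
        {
            try
            {
                return await provider(ct);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Remember the failure and move on to the next provider.
                failures.Add(ex);
            }
        }
        throw new AggregateException("All providers failed.", failures);
    }
}
```

Cancellation is deliberately not treated as a provider failure, so a shutdown request stops the chain rather than cascading through every fallback.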
Solo: £0 - this is pure application code. Polly is a free NuGet package. The providers themselves cost money; the pattern doesn’t. Enterprise: £0 - same. The abstraction layer adds no infrastructure cost at any scale.
A dedicated managed validation service. On Azure: Azure AI Content Safety. On AWS: Amazon Rekognition (image/video), Amazon Comprehend (text). Managed, scalable, compliance-certified.
A Python FastAPI microservice with zero external ML dependencies. Multiple analysis layers running locally, combined into a weighted validation score. No per-call API costs. No data leaving the platform. No vendor dependency for a core safety feature.
At enterprise scale you’d want the compliance certifications that come with managed services - especially in a regulated market. Managed validation services plug in as provider implementations behind the same interface. Any custom scoring logic that no managed service offers stays in-house regardless.
Solo: ~£5–15/month - a small always-on container running local ML models (Consumption plan, ~0.5 vCPU). No per-call API fees, no data egress. Enterprise: Managed validation APIs typically charge per transaction - roughly £1–1.50/1,000 requests depending on the modality. At moderate volume (50k requests/month) that’s ~£50–75/month.
A dedicated notification service with a message bus fan-out to push updates to connected clients. On Azure: Azure SignalR Service + Azure Service Bus for fan-out. On AWS: AWS API Gateway WebSockets + Amazon SNS for fan-out. WebSocket infrastructure managed separately from the application.
SignalR baked directly into the processing service. As the job handler updates progress - after each unit completes, at each major milestone - it calls:
await _notificationService.NotifyJobProgressAsync(userId, jobId, percent, message);
That pushes directly to any connected WebSocket clients subscribed to that user’s job.
One non-obvious detail: the browser WebSocket API can’t set custom HTTP headers, so the standard Bearer token auth doesn’t work. The JWT is passed as a query string parameter and extracted in the SignalR pipeline configuration:
options.Events = new JwtBearerEvents {
    OnMessageReceived = context => {
        var token = context.Request.Query["access_token"];
        if (!string.IsNullOrEmpty(token) && context.HttpContext.Request.Path.StartsWithSegments("/hubs"))
            context.Token = token;
        return Task.CompletedTask;
    }
};
At scale, replace self-hosted SignalR with Azure SignalR Service or AWS API Gateway WebSockets. The application code doesn’t change - just the backing transport.
Solo: £0 - SignalR runs inside your existing service container. No additional infrastructure. Enterprise: Azure SignalR Service Standard tier ~£40–50/month for 1 unit (1,000 concurrent connections). AWS API Gateway WebSockets ~£1/million messages + £0.25/million connection-minutes - near-zero at low volume, but adds up with always-on connections.
A service mesh handling mTLS between services automatically. On Azure: Istio on AKS (now GA as an AKS add-on) or Azure API Management with managed identities. On AWS: AWS App Mesh or Amazon ECS Service Connect. Zero application-layer auth between internal services.
A DelegatingHandler that automatically forwards the incoming request’s Bearer token to any outbound HTTP calls:
protected override async Task<HttpResponseMessage> SendAsync(
    HttpRequestMessage request, CancellationToken ct)
{
    var token = _httpContextAccessor.HttpContext?
        .Request.Headers["Authorization"]
        .ToString().Replace("Bearer ", "");
    if (!string.IsNullOrEmpty(token))
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
    return await base.SendAsync(request, ct);
}
Registered on typed HTTP clients at startup. When one service calls another on behalf of a user request, the user’s JWT flows through automatically. No manual token extraction, no service-specific credentials.
At enterprise scale, move to mutual TLS between services and a service account model - each service has its own identity rather than forwarding user tokens. But for internal services behind a gateway, token forwarding is secure and simple.
Solo: £0 - a DelegatingHandler is a few lines of code registered at startup.
Enterprise: £0 for the pattern itself, but a service mesh adds real cost. Istio on AKS or App Mesh on EKS adds ops complexity and some CPU overhead; budget ~£20–50/month in additional node capacity to absorb the sidecar proxies.
A Kubernetes operator that modifies Deployment replicas via the Kubernetes API, or a Horizontal Pod Autoscaler driven by custom metrics. On Azure: KEDA on AKS with Azure Monitor custom metrics. On AWS: ECS UpdateService API or KEDA on EKS with Amazon CloudWatch metrics.
One service can trigger scaling of another directly via the Azure Management API, authenticated with its managed identity:
PATCH /subscriptions/{id}/resourceGroups/{rg}/providers/Microsoft.App/containerApps/{app}
This means one service can pre-warm another the moment it knows work is coming, rather than waiting for KEDA to react to queue depth. The managed identity means no credentials in config - the container app’s Azure identity is the auth mechanism.
On AWS this is the ECS UpdateService API, callable with an IAM role attached to the task. The pattern is identical.
Solo: £0 - Azure Management API calls are free. Managed identity is included with Azure Container Apps. Enterprise: £0 for the API calls themselves. If you move to a Kubernetes HPA with custom metrics you’ll need a Prometheus stack, which adds ~£10–20/month in storage and compute.
A centralised exception management platform with ProblemDetails RFC 7807 compliance. Correlation IDs tied to distributed tracing. On Azure: Azure Application Insights + Azure Monitor. On AWS: AWS X-Ray + Amazon CloudWatch. Cross-cloud or self-hosted: Datadog, Jaeger, or Zipkin.
A shared exception hierarchy and a single ErrorHandlerMiddleware registered in every service:
{
    "traceId": "0HN4K2VG8T3CP:00000001",
    "success": false,
    "status": 404,
    "errors": [
        { "code": "ResourceNotFound", "message": "The requested resource does not exist" }
    ]
}
Every exception type maps to an HTTP status. Every response includes a traceId from the request context. Unexpected exceptions are logged with full context; expected exceptions (validation failures, not found) are returned cleanly without log noise. The client always gets a consistent shape regardless of which service produced the error.
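The mapping from exception type to response shape can be sketched roughly as follows. The type names here (AppException, ResourceNotFoundException, ErrorMapper) and the "InternalError" code are assumptions for illustration, since the real shared hierarchy isn't shown; only the JSON shape and the 404 example come from the response above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical base type: each expected exception carries its own code and status.
public abstract class AppException : Exception
{
    protected AppException(string code, int status, string message) : base(message)
    {
        Code = code;
        Status = status;
    }

    public string Code { get; }
    public int Status { get; }
}

public sealed class ResourceNotFoundException : AppException
{
    public ResourceNotFoundException()
        : base("ResourceNotFound", 404, "The requested resource does not exist") { }
}

public sealed record ApiError(string Code, string Message);

public sealed record ErrorResponse(
    string TraceId, bool Success, int Status, IReadOnlyList<ApiError> Errors);

public static class ErrorMapper
{
    // Expected exceptions keep their own status; anything else becomes a generic 500
    // so internal details never leak to the client.
    public static ErrorResponse ToResponse(Exception ex, string traceId) =>
        ex is AppException app
            ? new ErrorResponse(traceId, false, app.Status,
                new[] { new ApiError(app.Code, app.Message) })
            : new ErrorResponse(traceId, false, 500,
                new[] { new ApiError("InternalError", "An unexpected error occurred") });
}
```

A middleware would call something like ErrorMapper.ToResponse inside its catch block, log only the non-AppException branch, and serialise the result with camelCase property names to match the JSON shape.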
Add OpenTelemetry and wire the traceId into a distributed tracing backend - Jaeger, Zipkin, or AWS X-Ray. The traceId is already there; you’re just making it span service boundaries.
Solo: £0 - Application Insights free tier covers 5GB/month of ingestion, enough for most solo projects. The traceId pattern itself is free.
Enterprise: Azure Monitor / Application Insights ~£2.30/GB over the free tier. At 10GB+/month (common for multi-service prod traffic) that’s ~£15–20/month. AWS X-Ray is ~£4.50/million traces recorded - negligible unless you’re tracing every request at high volume.
What the solo approach doesn’t give you:
What the solo approach gives you that enterprise often loses:
SELECT * FROM "BackgroundJobs" WHERE "Status" = 3 tells you everything. Dead letter queues require tooling to inspect.
When you’re ready to move to enterprise infrastructure, here’s a rough order of operations:
traceId pattern. Add to the gateway so traces span service boundaries.
None of these steps require rewriting business logic. The patterns were right the first time - the infrastructure underneath them just changes.
The question isn’t “should I build it the enterprise way or the simple way?” The question is “are these the right patterns, implemented simply?”
The database queue, the checkpoint resume, the provider factory, the structured errors, the auth forwarding - these aren’t compromises. They’re the same patterns enterprise teams use, built with less infrastructure overhead because the scale doesn’t justify it yet.
Build the patterns right. Choose the infrastructure for the scale you’re at. Know exactly what to swap when you outgrow it.
The prototypes you build this way can scale with you accordingly, and hopefully this will make your life easier!