Large-scale infrastructure migrations are rarely simple, but moving a mission-critical system with no downtime, no data loss, and no changes required from dependent services represents a particularly high level of engineering complexity. Reddit’s migration of its Apache Kafka infrastructure—from Amazon EC2 to Kubernetes—is a strong example of how careful planning, constraint-driven design, and incremental execution can enable such transitions at scale.
This article explores how Reddit successfully migrated a Kafka deployment consisting of hundreds of brokers and over a petabyte of live data, all while maintaining uninterrupted service.
Kafka’s Role in Reddit’s Architecture

Apache Kafka is a distributed event streaming platform that serves as the backbone for many real-time systems. It enables applications to publish and consume streams of data reliably and at scale.
Within Reddit’s infrastructure, Kafka plays a central role:
- It processes tens of millions of messages per second
- It supports hundreds of internal services
- It ensures reliable communication between producers and consumers
Given this level of dependency, any disruption to Kafka would have widespread consequences across the platform. This constraint heavily influenced every decision in the migration process.
Why Move from EC2 to Kubernetes?
Initially, Reddit operated its Kafka clusters on EC2 instances using a combination of infrastructure-as-code tools and manual operational workflows. While functional, this setup introduced several limitations as the system scaled:
- Operational overhead: Managing hundreds of brokers manually became increasingly inefficient
- Error-prone workflows: Human intervention in upgrades and configuration changes introduced risk
- Limited scalability: Scaling infrastructure required significant coordination
- Cost inefficiencies: Resource utilization was not always optimal
Kubernetes, combined with a Kafka operator framework, offered a more declarative and automated approach to managing distributed systems. It promised improved reliability, easier scaling, and reduced operational burden.
Defining the Constraints
Before initiating the migration, Reddit established a set of non-negotiable constraints that shaped the entire strategy:
1. Zero Downtime
The Kafka system had to remain fully operational throughout the migration. Scheduled outages or maintenance windows were not acceptable.
2. No Metadata Rebuild
Kafka’s internal metadata—tracking brokers, partitions, and replicas—could not be recreated from scratch. New infrastructure had to integrate with the existing cluster.
3. No Client Changes
Client applications were tightly coupled to specific broker endpoints. Any requirement to update connection configurations across hundreds of services was impractical.
4. Full Reversibility
Every step in the migration needed to be reversible. At no point could the system enter an unrecoverable state.
These constraints eliminated many conventional migration strategies and forced a more incremental, integration-based approach.
Phase 1: Introducing a DNS Abstraction Layer
The first step in the migration did not involve Kafka directly. Instead, Reddit introduced a DNS-based abstraction layer between clients and brokers.
This layer served as a controllable indirection mechanism:
- New DNS entries were created to represent broker endpoints
- These entries initially pointed to the existing EC2 infrastructure
- Client configurations were gradually updated to use the new DNS names
Once this transition was complete, Reddit gained the ability to redirect traffic without modifying client applications. This decoupling was critical for enabling future steps in the migration.
Phase 2: Reorganizing Broker Identifiers

Kafka brokers are identified by unique numeric IDs. The new Kubernetes-based system required access to a range of IDs that were already in use by existing EC2 brokers.
To resolve this, Reddit:
- Added new EC2 brokers with higher ID values
- Migrated data from older brokers to these new ones
- Decommissioned the original low-ID brokers
This process effectively freed up the lower ID range for the Kubernetes-based brokers, ensuring compatibility with the new deployment model.
Phase 3: Running a Hybrid Cluster
The most complex phase involved running both EC2-based and Kubernetes-based brokers within the same Kafka cluster.
To achieve this, Reddit:
- Modified their Kafka operator to support hybrid deployment
- Ensured network connectivity between both environments
- Connected all brokers to the same metadata system
This allowed new brokers running in Kubernetes to join the existing cluster seamlessly.
A key tool in this phase was Kafka’s rebalancing system, which enabled controlled movement of data between brokers. By leveraging this capability, Reddit could gradually shift load without disrupting operations.

Phase 4: Incremental Data Migration
With both environments operating within a single cluster, Reddit began migrating data and traffic incrementally.
This process involved:
- Reassigning partition leadership from EC2 brokers to Kubernetes brokers
- Replicating data across the new infrastructure
- Monitoring system health and performance continuously
The migration was performed gradually over several days, allowing engineers to:
- Pause the process if issues arose
- Roll back changes if necessary
- Validate system behavior at each stage
Traffic patterns naturally followed the data movement, shifting toward the Kubernetes-based brokers as they assumed leadership roles.

Phase 5: Migrating the Control Plane
Throughout the broker migration, Kafka’s metadata layer remained unchanged. Only after the data plane was fully stable on Kubernetes did Reddit migrate the control plane.
This step involved transitioning from an external metadata system to Kafka’s newer internal metadata management mechanism.

By deferring this change until the final phase, Reddit minimized risk and avoided introducing multiple variables simultaneously.
Phase 6: Finalizing the Migration
Once all data and traffic were fully operating on Kubernetes:
- The legacy EC2 infrastructure was decommissioned
- Temporary modifications to the Kafka operator were removed
- The system was transitioned to a standard, fully supported configuration
At this point, the migration was complete, and Kafka was entirely managed within Kubernetes.
Key Engineering Principles
Reddit’s migration highlights several important principles for large-scale system transformations:
Decoupling Clients from Infrastructure
Introducing an abstraction layer (such as DNS) allows infrastructure changes without impacting dependent systems.
Preserving Logical State
The most critical asset in distributed systems is not the infrastructure, but the data and metadata. Protecting this state is essential.
Incremental Execution
Breaking the migration into small, manageable steps reduces risk and improves observability.
Reversibility
Designing every step to be reversible enables faster progress and greater confidence.
Separation of Concerns
Handling the data plane and control plane independently prevents compounded failures.
Trade-offs and Challenges
While successful, the migration required addressing several challenges:
- Operational complexity: Running a hybrid cluster increased system complexity temporarily
- Custom tooling: Modifying existing tools introduced maintenance overhead
- Extended timeline: Incremental migration took longer than a direct cutover
These trade-offs were necessary to meet the strict constraints of zero downtime and full reliability.
Conclusion
Reddit’s migration of a petabyte-scale Kafka system demonstrates that even the most complex infrastructure transitions can be executed safely with the right strategy. By prioritizing stability, maintaining strict constraints, and adopting an incremental approach, the engineering team avoided disruption while modernizing their platform.
The key takeaway is not the specific technologies used, but the methodology applied. Large-scale migrations are less about bold, sweeping changes and more about careful sequencing, risk management, and system understanding.
In environments where reliability is critical, a migration that appears gradual and complex—but never interrupts production—is far more valuable than a faster approach that introduces uncertainty.

