The legacy orchestration platform: why it had to go
The existing system at bpost was an orchestration platform built on AWS. At its center sat a relational database, the source of truth for all item states across the entire logistics network. Central services handled business logic and coordinated between domains.
The problems with this setup were predictable:
- The database had a physical growth ceiling. More parcels, more items, more data, and the central DB would eventually become the bottleneck for the entire operation.
- Functional coupling was high. Business logic was embedded in central services, which meant domain teams couldn't evolve independently. Any change risked cascading effects elsewhere.
- There was no clean accountability. Producers didn't own their events. Consumers depended on a central orchestrator rather than reacting to domain-published state changes. Visibility into what was actually happening, in real time and domain by domain, was limited.
The goal was to move to a model where each domain publishes business events that describe what happened in their system, and consumers maintain their own read models based on those events, scaled and structured for their specific needs.
Technology choices: Kafka, Confluent Cloud, and Avro
Kafka via Confluent Cloud. bpost chose not to self-manage the Kafka cluster. Running Kafka at production scale is operationally demanding, and the team made a deliberate call to use Confluent's managed offering. Kafka Streams applications run on AWS (ECS/EKS), with the rest of the platform cloud-native.
Avro with schema registry. Schema governance was treated as non-negotiable from the start. Every event payload is defined in Avro, validated against the Confluent Schema Registry at runtime. This gives producers and consumers a clear contract: what the event looks like, and the guarantee that it won't silently change. We'll cover the full compatibility strategy and breaking-change process in the schema evolution section below.
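For illustration only, an Avro schema for a hypothetical parcel scan event might look like the following. The record name, namespace, and fields are invented, not bpost's actual schemas:

```json
{
  "type": "record",
  "name": "ParcelScanned",
  "namespace": "be.bpost.sorting.events",
  "fields": [
    {"name": "item_id", "type": "string"},
    {"name": "scan_location", "type": "string"},
    {"name": "scanned_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```

Producers register this schema with the registry; consumers deserialize against it, so both sides share one explicit contract.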
Kafka Streams for stateful processing. Much of the interesting logic at bpost doesn't just consume events. It joins them. When a parcel moves through the network, multiple events from multiple domains need to be correlated to produce a meaningful state.
Kafka Streams applications handle this by maintaining state stores in memory, keeping latency low and throughput high. The architecture makes heavy use of KTables and stream joins to produce derived business events from raw domain events.
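The join pattern can be sketched in plain Python, stripped of all Kafka machinery. This is a conceptual illustration only, not Kafka Streams code: each dict plays the role of a KTable (latest state per key), and a derived event is emitted whenever both sides of the join are known. Domain names and payloads are invented.

```python
def join_streams(events):
    """Correlate events from two domains by item id (an inner-join sketch).

    events: iterable of (domain, item_id, payload) tuples in arrival order,
    where domain is "sorting" or "distribution".
    """
    tables = {"sorting": {}, "distribution": {}}
    derived = []
    for domain, item_id, payload in events:
        tables[domain][item_id] = payload  # update this domain's "KTable"
        other = "distribution" if domain == "sorting" else "sorting"
        if item_id in tables[other]:  # both sides known: emit a joined event
            derived.append({
                "item_id": item_id,
                "sorting": tables["sorting"][item_id],
                "distribution": tables["distribution"][item_id],
            })
    return derived
```

In the real system the tables are Kafka Streams state stores and the join is expressed in the Streams DSL, but the core idea is the same: keep the latest state per key on each side, and react when a correlation becomes possible.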
Event-driven architecture patterns: CQRS, event sourcing, and consumer-driven migration
CQRS. Read and write models are explicitly separated. When a consumer receives business events, it builds and maintains a read model tailored to its own needs, not a generic shared representation. The write path (event production) is the domain's responsibility; the read path is the consumer's.
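A minimal sketch of the consumer side of this split, assuming invented event types and field names (not bpost's actual events): the consumer folds business events into a read model shaped purely for its own query needs.

```python
class DeliveryRoundReadModel:
    """Consumer-owned read model: which items are still open per delivery round.

    Event shapes ("ItemAssignedToRound", "ItemDelivered") are hypothetical,
    chosen only to illustrate the CQRS read path.
    """

    def __init__(self):
        self.items_by_round = {}  # round id -> set of open item ids

    def apply(self, event):
        if event["type"] == "ItemAssignedToRound":
            self.items_by_round.setdefault(event["round"], set()).add(event["item_id"])
        elif event["type"] == "ItemDelivered":
            self.items_by_round.get(event["round"], set()).discard(event["item_id"])
        # unknown event types are ignored: this consumer only reads what it needs
```

Another consumer receiving the same events might keep a completely different structure, say a per-customer delivery history, without any coordination between the two.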
Event sourcing. Events are retained as the system of record. If a consumer needs to rebuild state from scratch, after a failure, a redeployment, or a schema migration, the full event history is available for replay. This was a design decision made at the outset, not retrofitted later.
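The replay mechanic itself is simple: state is a fold over the event history. A toy sketch, with invented event types, of rebuilding per-item status from scratch:

```python
def rebuild_status(events):
    """Replay a full event history to reconstruct current item state.

    events: iterable of {"item_id": ..., "type": ...} dicts in original order.
    Event types here are illustrative, not bpost's actual event names.
    """
    status = {}
    for e in events:  # order matters: the history must be replayed in sequence
        status[e["item_id"]] = e["type"]  # last event per item wins
    return status
```

Because the topic retains the events, a consumer that loses its state store (or changes its read model) simply reruns this fold from the beginning of the topic instead of restoring from a backup.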
Consumer-driven migration. The migration from old to new doesn't happen domain by domain on the producer side. It happens consumer by consumer. A primary consumer is identified, for example the sorting domain, and then the producers and services required to feed that consumer are onboarded. Old and new systems run in parallel (dual run), with outputs compared before the legacy path is decommissioned.
This approach gives clear intermediate milestones and makes it easier to validate that the new system is producing equivalent or better output before cutting over.
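The dual-run comparison step can be sketched as a per-item diff between the two pipelines' outputs. This is an illustrative sketch, not bpost's actual validation tooling; the output shape is assumed to be a per-item state map.

```python
def compare_dual_run(legacy_out, new_out):
    """Compare per-item outputs of the legacy and new pipelines during dual run.

    legacy_out, new_out: {item_id: final_state} maps produced by each system.
    Returns {item_id: (legacy_state, new_state)} for every item that differs,
    including items present in only one of the two outputs.
    """
    diffs = {}
    for item_id in legacy_out.keys() | new_out.keys():
        if legacy_out.get(item_id) != new_out.get(item_id):
            diffs[item_id] = (legacy_out.get(item_id), new_out.get(item_id))
    return diffs
```

An empty diff over a representative window is the signal that the legacy path for that consumer can be decommissioned.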
First production use cases on the Kafka platform
The first applications on the platform were deliberately scoped to validate the approach before touching core systems. A greenfield license plate processing flow, spanning retail, distribution, and DIV domains, proved that multi-domain event choreography works end-to-end in production. A real-time alerting system (MAS) that fires when expected events don't arrive within a defined window showed that Kafka Streams could support operational monitoring at bpost's scale. Neither was trivial, but both were contained enough to build confidence quickly.
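The core of the MAS-style alerting pattern, detecting the *absence* of an event within a window, can be sketched as follows. The function signature and data shapes are invented for illustration; the production version runs as a Kafka Streams application, not a batch check.

```python
def missing_event_alerts(expected, received, now, window_seconds):
    """Fire alerts for items whose expected event has not arrived in time.

    expected: {item_id: timestamp when the event became expected}
    received: set of item ids whose event has arrived
    now: current timestamp (same unit as the expected timestamps)
    window_seconds: how long to wait before alerting
    """
    return [
        item
        for item, expected_since in expected.items()
        if item not in received and now - expected_since > window_seconds
    ]
```

Detecting missing events is harder in streaming than it looks, since there is no incoming record to react to; in Kafka Streams this typically means punctuation or windowed state rather than a loop like the one above.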
That validation phase mattered. It gave the broader organization proof that the pattern holds, and gave bpost's internal teams the hands-on experience they needed to start owning it themselves. A third application, "WIMI" (where is my item), is currently in development.
The rollout strategy reflects the growing confidence. Rather than a slow domain-by-domain migration, the consumer-driven approach identifies high-value consumers, onboards the producers they need, and retires legacy components as confidence grows. The goal now is speed, using the foundation that's been built to deliver new business value faster than was ever possible before.
Real-world EDA challenges: item identification, Kafka Streams portability, and team alignment
Item identity across domains. Parcel and mail items don't always carry stable, globally unique identifiers across the full lifecycle. Building a service that could track an item through the entire network, from announcement through delivery, and resolve identity in a distributed, reliable way turned out to be one of the harder problems. Because identity resolution sits at the core of item tracking, designing and implementing a scalable, distributed mechanism for it took significant effort.
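One way to frame the identity problem, purely as a sketch and not bpost's actual design, is as alias linking: an item accumulates identifiers (announcement id, barcode, tracking number), and any two of them observed together must resolve to the same logical item. A union-find structure captures that:

```python
class ItemIdentityIndex:
    """Links the different identifiers an item picks up over its lifecycle
    to one canonical identity. A union-find sketch; identifier formats and
    the class itself are invented for illustration."""

    def __init__(self):
        self.parent = {}  # identifier -> parent identifier (roots point to self)

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def link(self, id_a, id_b):
        """Record that two identifiers refer to the same physical item."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_item(self, id_a, id_b):
        return self._find(id_a) == self._find(id_b)
```

The hard part at scale is not the data structure but doing this reliably across a distributed system, where link events arrive out of order and the index itself must be partitioned.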
Environment-specific behavior in Kafka Streams. Kafka Streams applications that rely on in-memory state stores behave differently depending on where they're running. The team found meaningful differences between EKS (Elastic Kubernetes Service) and ECS (Elastic Container Service) on AWS, specifically around state rebuild behavior when a pod goes down.
With ECS, there are constraints on how persistent storage is handled that can limit scalability or slow down state reconstruction after failure. This isn't obvious from Kafka's documentation and only became clear through production experience. The lesson: the platform hosting your Kafka Streams application is not neutral. Its characteristics affect your application's behavior in ways that need to be explicitly accounted for.
Mixed team dynamics. The team at bpost included people with deep domain knowledge but limited Kafka experience, alongside Cymo engineers with strong EDA expertise but limited knowledge of bpost's internal systems. Getting those two groups genuinely aligned, not just in kickoff sessions but in day-to-day decision-making, was the most consistent source of friction.
Lorenzo and Stefan both flagged this in retrospect: more frequent, more direct collaboration between the domain experts and the technical team would have accelerated decisions that ended up taking longer than necessary. Focus on specific problems, together, rather than working in parallel and aligning after the fact.
Kafka schema evolution: how bpost handles breaking changes
bpost enforces full transitive compatibility on the schema registry: every new schema version must remain compatible with all previous versions, readable by both older and newer consumers. This is stricter than backward-only compatibility, but it removes a class of problems: you don't have to reason about which version of a consumer is running or worry about whether a schema change will break a consumer that hasn't been redeployed.
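As a hedged example of a change that passes full transitive checks (record and field names invented): adding a field with a default value keeps both old and new readers working, because old readers ignore the extra field and new readers fall back to the default for old data.

```json
{
  "type": "record",
  "name": "ParcelScanned",
  "fields": [
    {"name": "item_id", "type": "string"},
    {"name": "scan_location", "type": "string"},
    {"name": "scan_type", "type": "string", "default": "UNKNOWN"}
  ]
}
```

By contrast, removing a field that has no default, or changing a field's type, would be rejected by the registry under this setting; those changes have to go through the breaking-change process.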
When a breaking change is unavoidable, the process is: create a new schema version, deploy a migration application that reads from the new topic and publishes to the old topic with the updated payload, and give consumers a defined migration window before the old topic is retired. That migration application exists as a reusable reference implementation.
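The migration application is essentially a translator between schema versions. A minimal sketch, with entirely hypothetical field names and a hypothetical breaking change (v2 split a single address string into structured fields):

```python
def downgrade_v2_to_v1(event_v2):
    """Translate a v2 payload back to the v1 shape during the migration window.

    The field mapping here is invented; the real mapping depends on what the
    breaking change actually was.
    """
    return {
        "item_id": event_v2["item_id"],
        # v2 split 'address' into structured fields; v1 consumers expect one string
        "address": f'{event_v2["street"]} {event_v2["number"]}, {event_v2["city"]}',
    }


def run_migration_bridge(new_topic_events, publish_to_old_topic):
    """Read events from the new topic and republish them, downgraded, to the old
    topic so not-yet-migrated consumers keep working during the window."""
    for event in new_topic_events:
        publish_to_old_topic(downgrade_v2_to_v1(event))
```

Once every consumer has moved to the new topic, the bridge and the old topic are retired together.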
What the team would like to build next in this space is a proper event catalog: an API-management-style developer portal where teams can search events, inspect schemas, and onboard as consumers without the current manual coordination overhead.
Building a center of excellence for event-driven architecture
The migration of the core orchestration platform is ongoing, working through the logistics domains one consumer at a time. The platform is stable in production. Teams across the organization now understand how to build producer and consumer applications. A center of excellence is being built to own the platform and drive EDA evangelization across bpost's other entities.
The goal bpost has been building toward from the start, an event-driven backbone that lets new business use cases click in without extensive integration work each time, is closer than it was a year ago. It's not finished. But the foundation is real.
Cymo specializes in event-driven architecture design and implementation using Apache Kafka. If you're working through a similar transition, get in touch.
