Problem:

Client requested a step‑by‑step procedure to add two new servers to an existing three‑node Kafka cluster running in KRaft mode, with the explicit requirement to avoid any data loss. The cluster’s existing configuration (server.properties) and current controller.quorum.voters format were provided for reference. The observable concern was that adding voters and brokers incorrectly could cause cluster metadata mismatch (e.g., “cluster ID not found” on new nodes), controller quorum instability, or uneven partition placement so that new brokers show little or no data until partitions are reassigned.

Process:

Step 1: Confirm request and review supplied configuration

Client supplied the running cluster’s server.properties and stated the cluster had three KRaft nodes and an existing cluster ID. The supplied files were inspected for controller.quorum.voters, process.roles, node.id, listeners/advertised.listeners and replication defaults. Discovery: controller.quorum.voters listed only three voters and node.ids were unique; why it mattered: extending the quorum requires synchronized configuration across all voters and consistent ports, which determined the exact field updates needed on every node and the risk surface for metadata mismatch.
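
The voter list from the supplied files can be cross-checked mechanically. The following is a minimal Python sketch, assuming the standard `id@host:port` layout of controller.quorum.voters; the function name, hostnames and port 9093 are illustrative placeholders, not values taken from the client's files.

```python
def parse_quorum_voters(value):
    """Parse a controller.quorum.voters value into {node_id: (host, port)}."""
    voters = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        node_id_text, at_sign, endpoint = entry.partition("@")
        host, colon, port_text = endpoint.rpartition(":")
        if not at_sign or not colon or not host:
            raise ValueError(f"malformed voter entry: {entry!r}")
        node_id = int(node_id_text)
        if node_id in voters:
            # Two voters sharing an ID would be a configuration error.
            raise ValueError(f"duplicate voter id: {node_id}")
        voters[node_id] = (host, int(port_text))
    return voters


# Hypothetical three-voter value; hostnames and port are placeholders.
current = parse_quorum_voters("1@kafka1:9093,2@kafka2:9093,3@kafka3:9093")
print(sorted(current))  # voter IDs found in the config
```

Running the same parse over every node's file and comparing the results is one way to confirm that all voters agree before any change is made.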

Step 2: Validate cluster health and capture cluster ID

Cluster health was checked by listing under‑replicated partitions, and the cluster ID was obtained by dumping cluster metadata from the local log segments. The under‑replicated list came back empty and the metadata dump produced the cluster ID. Discovery: cluster was healthy and the cluster ID was stable; why it mattered: the cluster ID must be used when formatting storage on new brokers to avoid “cluster ID not found” errors.
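
The report does not name the exact command that produced the cluster ID. One place it can also be read, as an assumption here, is the meta.properties file that a formatted KRaft node keeps in its log directory. A minimal Python sketch of pulling the cluster.id key out of that file's contents follows; the function name and the ID shown are made-up placeholders.

```python
def read_cluster_id(meta_properties_text):
    """Return the cluster.id value from the text of a meta.properties file."""
    for raw_line in meta_properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the comment header
        key, sep, value = line.partition("=")
        if sep and key.strip() == "cluster.id":
            return value.strip()
    raise KeyError("cluster.id not present in meta.properties")


# Placeholder contents; a real file is written when storage is formatted.
sample = """#
#Mon Jan 01 00:00:00 UTC 2024
cluster.id=EXAMPLE-CLUSTER-ID-0000
node.id=1
version=1
"""
cluster_id = read_cluster_id(sample)
print(cluster_id)
```

Reading the value on more than one existing node and comparing the results gives a quick confirmation that the ID is stable across the cluster.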

Step 3: Prepare new servers’ base configuration

New hosts were provisioned with identical directory layout and Kafka user privileges. The provided server.properties template was copied and modified for each new host to set unique node.id values and to add each new host’s listener/advertised.listener entries. Discovery: port and listener schemes matched the existing nodes and no node.id overlap was present; why it mattered: correct node.id and listener configuration prevents bind conflicts and ensures the new brokers register with the expected endpoints.
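
Deriving each new host's file from the template can be sketched as below. This is an illustration only: the helper names, the listener names PLAINTEXT and CONTROLLER, ports 9092/9093, the hostname kafka4 and node.id 4 are all assumptions, since the report does not state the actual values.

```python
def apply_overrides(template_text, overrides):
    """Return properties text with the given keys replaced, or appended if absent."""
    remaining = dict(overrides)
    out = []
    for raw_line in template_text.splitlines():
        key = raw_line.split("=", 1)[0].strip()
        is_setting = "=" in raw_line and not raw_line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(raw_line)
    # Keys that the template did not contain are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


def node_overrides(node_id, host, broker_port=9092, controller_port=9093):
    """Per-host settings; listener names and ports are assumed, not from the report."""
    return {
        "node.id": str(node_id),
        "listeners": f"PLAINTEXT://{host}:{broker_port},CONTROLLER://{host}:{controller_port}",
        "advertised.listeners": f"PLAINTEXT://{host}:{broker_port}",
    }


# Hypothetical template fragment copied from an existing node.
template = (
    "process.roles=broker,controller\n"
    "node.id=1\n"
    "listeners=PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093\n"
)
new_config = apply_overrides(template, node_overrides(4, "kafka4"))
print(new_config)
```

Generating the files from one template keeps every other setting identical to the existing nodes, which is the property this step relied on.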

Step 4: Format storage on new nodes using existing cluster ID

New nodes’ storage directories were formatted with the extracted cluster ID via the kafka-storage tool. The server.properties for each new node referenced the same process.roles (broker,controller) as the existing cluster. Discovery: successful format confirmed cluster ID acceptance by the new nodes; why it mattered: without formatting with the correct cluster ID, the brokers will refuse to join or report metadata errors.
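
The format step can be scripted so that the cluster ID is never retyped by hand. The sketch below only builds the argument list for the kafka-storage tool's `format` subcommand with its `-t` (cluster ID) and `-c` (config file) options; the script path, config path and cluster ID are placeholders, and the function name is hypothetical.

```python
import shlex


def storage_format_command(cluster_id, config_path, script="bin/kafka-storage.sh"):
    """Build the argv that formats a node's storage with an existing cluster ID."""
    if not cluster_id:
        # Refuse to build a command that would not carry the existing ID.
        raise ValueError("cluster_id must not be empty")
    return [script, "format", "-t", cluster_id, "-c", config_path]


# Placeholder ID and paths; substitute the values captured in Step 2.
cmd = storage_format_command("EXAMPLE-CLUSTER-ID-0000", "config/kraft/server.properties")
print(shlex.join(cmd))
```

On a new host the list could be handed to `subprocess.run(cmd, check=True)`, so that a failed format stops the procedure before the broker is started.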

Step 5: Start new brokers and verify registration

New broker processes were started and broker registration was verified by querying broker API versions and broker lists from an existing broker. Discovery: both new brokers appeared in broker listings but topic replicas remained on the original three nodes; why it mattered: arrival in the broker list confirmed KRaft registration, but data distribution remained skewed until partitions were reassigned.
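
The skew observed at this point can be made explicit by counting replicas per broker. The Python sketch below works on a hand-written assignment map; the topic name, the partition layout and broker IDs 4 and 5 are invented for illustration and do not come from the client's cluster.

```python
from collections import Counter


def replicas_per_broker(assignments, broker_ids):
    """Count replicas hosted by each broker, including brokers that host none."""
    counts = Counter({broker_id: 0 for broker_id in broker_ids})
    for replicas in assignments.values():
        counts.update(replicas)
    return dict(sorted(counts.items()))


# Made-up layout: two partitions of a hypothetical topic on brokers 1-3.
assignments = {("orders", 0): [1, 2, 3], ("orders", 1): [2, 3, 1]}
skew = replicas_per_broker(assignments, broker_ids=[1, 2, 3, 4, 5])
empty_brokers = [broker for broker, count in skew.items() if count == 0]
print(skew, empty_brokers)
```

A registered broker with a zero count is expected here: joining the cluster does not move existing replicas, which is why Step 7 is needed.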

Step 6: Update controller quorum on original nodes with rolling restarts

The controller.quorum.voters entry in each existing node’s server.properties was updated to include the two new voter entries. Rolling restarts were then performed one node at a time, with broker API and metadata quorum status checks after each restart. Discovery: incremental updates preserved quorum availability and avoided unnecessary controller elections that would destabilize metadata; why it mattered: all voters must share the same controller.quorum.voters configuration for a healthy KRaft quorum, and rolling restarts minimized interruption to controller leadership.
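
The bookkeeping behind this step can be sketched in a few lines of Python: building the extended voter string once, so every node receives an identical value, and computing the majority size that a one-at-a-time restart has to preserve. Hostnames, ports, the new IDs 4 and 5, and both function names are assumptions for illustration. Support for changing the voter set differs between Kafka releases, so the documentation for the version in use should be checked before relying on this procedure.

```python
def extend_quorum_voters(current_value, new_voters):
    """Append 'id@host:port' entries to a voters value, rejecting reused IDs."""
    entries = [entry.strip() for entry in current_value.split(",") if entry.strip()]
    seen_ids = {entry.split("@", 1)[0] for entry in entries}
    for node_id, host, port in new_voters:
        if str(node_id) in seen_ids:
            raise ValueError(f"voter id {node_id} already present")
        seen_ids.add(str(node_id))
        entries.append(f"{node_id}@{host}:{port}")
    return ",".join(entries)


def majority(voter_count):
    """Number of voters that must be reachable for a majority quorum."""
    return voter_count // 2 + 1


# Placeholder values for the original and the extended voter list.
old_value = "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
new_value = extend_quorum_voters(old_value, [(4, "kafka4", 9093), (5, "kafka5", 9093)])
print(new_value, majority(5))
```

With five voters the majority is three, so taking down a single node at a time leaves four available and keeps the quorum intact throughout the rolling restart.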

Step 7: Rebalance partition replicas onto the new brokers with throttling

A topic list was generated and a reassignment plan covering all five broker IDs was created. The reassignment was executed with a replication throttle to limit replication bandwidth, and progress was monitored until completion; the throttle was removed once verification showed all partitions reassigned. Discovery: partitions redistributed to include the new brokers and under‑replicated partitions resolved; why it mattered: explicit reassignment avoided leaving new brokers empty and prevented saturating network or I/O resources during data movement. This step completed the changes summarized in the Solution below.
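
The generate, execute and verify sequence can be sketched as follows. The Python below builds the topics-to-move JSON document and the three kafka-reassign-partitions invocations; the bootstrap address, file names, topic names, broker IDs and the 50 MB/s throttle are placeholders rather than the values used in this engagement, and the helper names are hypothetical.

```python
import json


def topics_to_move(topics):
    """Build the JSON document passed via --topics-to-move-json-file."""
    document = {"topics": [{"topic": name} for name in topics], "version": 1}
    return json.dumps(document, indent=2)


def reassign_command(bootstrap, *args, script="bin/kafka-reassign-partitions.sh"):
    """Build an argv for the reassignment tool against one bootstrap server."""
    return [script, "--bootstrap-server", bootstrap, *args]


bootstrap = "kafka1:9092"  # placeholder address of an existing broker

# 1. Propose a plan that spreads the listed topics over all five brokers.
generate = reassign_command(
    bootstrap, "--topics-to-move-json-file", "topics.json",
    "--broker-list", "1,2,3,4,5", "--generate",
)
# 2. Execute the saved plan with a replication throttle in bytes per second.
execute = reassign_command(
    bootstrap, "--reassignment-json-file", "plan.json",
    "--execute", "--throttle", str(50_000_000),
)
# 3. Check progress; rerun until every partition reports completion.
verify = reassign_command(bootstrap, "--reassignment-json-file", "plan.json", "--verify")

print(topics_to_move(["orders", "payments"]))
```

The proposed assignment printed by the generate step has to be saved (here as the hypothetical plan.json) before execution, and keeping the printed current assignment as well provides a rollback plan.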

Solution:

Two brokers were added to the KRaft cluster by formatting their storage with the existing cluster ID, configuring unique node.id and listener settings, starting them, and then updating controller.quorum.voters across the original three nodes with rolling restarts. Partition replicas were reassigned to include the two new brokers using kafka-reassign-partitions with replication throttling until verification showed completion. Debezium connectors running against the cluster were left unchanged; because Kafka durability and replication were maintained during the expansion, Debezium continued to produce change events without data loss. Architecturally, the fix works because KRaft requires a consistent voter configuration and storage formatted with the matching cluster ID for safe metadata growth, while controlled replica reassignment ensures that replica placement changes do not induce data loss or prolonged under‑replication.

Conclusion:

Outcome: the cluster expanded from three to five voters with no data loss, broker registration verified, and topic partitions redistributed across all brokers. Operational impact: improved capacity and more balanced I/O, reduced single‑broker load risk, and preserved Debezium’s change‑feed stability through controlled quorum updates and throttled reassignment.