Kafka in Practice — Topic Design and Message Flow

Whether to use Kafka at all is covered in another note. This one is the next step — how to name topics, how to size partitions and choose keys, how to write Producer and Consumer code. Concepts settle once you learn them, but a wrong topic design follows you through the whole operational life.

1. Topic naming conventions

A topic name is effectively an API. Once publishing starts, consumers depend on it, so it is hard to change. Set the rules from the start.

<domain>.<entity>.<event> pattern — for example, orders.order.created, billing.invoice.paid. With the three axes of domain, entity, and event, search, permissions, and cleanup all get easier.
Consistent separator convention — pick one of dot (.) or kebab (-) and stay consistent. Mixing them (orders.order-created) breaks sorting and filtering. A compromise of dots for hierarchy and kebab for words (orders.order-line.created) is also common.
Environment separation — add a prefix like dev.orders.created, or separate clusters entirely per environment. Mixing environments in one cluster invites accidents where a dev consumer reads a prod topic. Separate clusters are safer.
Inject topic names via config — do not hardcode "orders.order.created" in code; take it as a config value. Prefixes differ per environment, and a name change should not require grepping the whole codebase.

# application.yml — topic names as config
app:
  topics:
    order-created: ${ORDER_CREATED_TOPIC:orders.order.created}

2. Partition and key design

Partitions are the core knob of Kafka performance design.

Partition count = parallelism ceiling — the number of consumers working concurrently in one consumer group cannot exceed the partition count. With 8 partitions, only up to 8 consumers are meaningful; a 9th sits idle.
Same key → same partition → ordering guaranteed — the partition is decided by the hash of the message key. Use one order id as the key and all events for that order land in one partition, preserving publish order. A null key scatters round-robin and the ordering guarantee disappears.
Increasing partitions changes the mapping — it is hash(key) % partitionCount, so changing the partition count sends the same key to a different partition. The ordering assumption between messages piled in the old partition and the new partition breaks. This is why the first design matters.

Starting with roughly 2–3x the expected throughput leaves a safe margin. But too many partitions raise metadata and file-handle pressure, so growing without limit is not the answer either.

// Same key → same partition → per-order ordering preserved
producer.send(new ProducerRecord<>(
    topic,
    order.getId(),          // key — the unit of ordering
    toJson(event)
));

3. Producer error handling and retries

Publishing is reliable only when confirmed as "the broker received it," not as "I sent it."

acks — 0 (no acknowledgement, loss possible), 1 (leader only), all (full ISR). For topics that must not lose data, use all.
enable.idempotence=true — the broker prevents a producer retry from writing the same message twice. Turning it on together with acks=all is effectively the default.
Retries — set retries and delivery.timeout.ms to handle transient network errors. With idempotence on, retries are safe.
Avoid synchronous blocking — waiting on every message with send().get() collapses throughput. Take the result via a callback and handle failures inside the callback.

producer.send(record, (metadata, ex) -> {
    if (ex != null) {
        log.error("publish failed topic={}", record.topic(), ex);
        // Store failed messages separately (outbox / retry queue)
    }
});

No publishing before the DB commit — if you send the message first inside a DB transaction, the message is already out even if the transaction rolls back (an event for an order that does not exist). Outbox pattern — within the same transaction, INSERT into an outbox table, and a separate publisher reads only committed rows and sends them to Kafka. Or publish in an after-commit hook right after the transaction commits.

4. Consumer groups and offset management

A consumer manages "how far it has read" (the offset) on its own.

group.id — consumers in the same group split the partitions. Separate groups read the same topic independently.
Offset commit — auto vs manual — auto-commit periodically commits automatically, which is convenient, but if it commits after receiving a message and before processing it, failures cause loss. Manual commit after processing completes pairs with at-least-once.
auto.offset.reset — the behavior when no stored offset exists (a new group, etc.). earliest starts from the beginning of the topic, latest from now. For a new consumer not to miss past data, earliest is recommended. latest silently skips messages between deployments.
Consumers ≤ partitions — same as §2. Consumers beyond the partition count receive no work.
Rebalancing — when a consumer joins or leaves, partitions are reassigned. Processing pauses briefly during that. Blocking too long inside a consumer makes the group treat it as dead and triggers repeated rebalancing (exceeding max.poll.interval.ms).

// Manual commit after processing completes — at-least-once
records.forEach(record -> {
    handle(record);                 // process
});
consumer.commitSync();              // commit after the batch

5. At-least-once and idempotent consumers

Kafka's default delivery guarantee is at-least-once — a message is delivered at least once, and the same message may arrive more than once. Because of retries, rebalancing, and commit timing, duplicates are normal. So idempotency is the consumer's responsibility.

A unique id on each message — at publish time, put a uuid (or domain key) per message in the header or payload.
A processed record — before processing, the consumer checks a table like processed_messages(message_id) and skips if already present. A UNIQUE constraint on the id column blocks duplicate INSERTs and makes it more robust.
DLQ (Dead Letter Queue) — for a message that keeps failing after a few retries (bad format, permanent error), do not retry forever; send it to a separate DLQ topic. The main flow is not blocked, and failed messages are investigated separately later.
Retry + backoff — for transient errors (external API 5xx, etc.), do not retry immediately; widen the interval progressively (backoff) and retry. Beyond a set count, send to the DLQ.

void handle(ConsumerRecord<String, String> record) {
    String messageId = header(record, "message-id");
    if (alreadyProcessed(messageId)) return;   // idempotent — skip duplicate
    try {
        applyBusinessLogic(record);
        markProcessed(messageId);              // record as processed
    } catch (TransientException e) {
        throw e;                               // eligible for retry
    } catch (PermanentException e) {
        sendToDlq(record, e);                  // permanent failure → DLQ
    }
}

6. Schema evolution

A topic's message format changes over time. Fields are added, renamed, or shift meaning. Because Producer and Consumer deploy separately, format changes break easily.

Schema Registry — version-manages Avro, Protobuf, or JSON Schema definitions centrally. The Producer publishes with a schema id, and the Consumer deserializes with that schema. The format agreement becomes explicit.
Compatibility rules — backward compatibility means a new schema can read old messages (deploy consumers first), forward compatibility means an old schema can read new messages (deploy producers first). Backward is usually the default.
Adding a field — adding an optional field with a default is usually safe. Old consumers ignore fields they do not know.
Removing, renaming, or retyping a field — these are breaking changes. Remove in stages (first make it optional, remove much later), and work around a rename by adding a new field while keeping the old one.

You can use plain JSON without a Schema Registry, but then the discipline of format changes must be held by people through code review and documentation.

Common pitfalls

Synchronous blocking send — calling send().get() per message stalls on every network round trip and drops throughput into the single digits. Use callbacks.

Under-sizing partition count — starting with 4 partitions and then trying to grow to 8 consumers hits a wall. Increasing is possible, but the key-to-partition mapping changes and ordering breaks.

Long blocking inside the consumer — a Thread.sleep or a slow external call inside the consumer loop exceeds max.poll.interval.ms, so the group treats the consumer as dead and rebalancing repeats.

auto.offset.reset=latest — for a new consumer group or an expired offset, latest silently skips the messages piled up in the meantime. Since this leads to data loss, earliest is the safe default.

Infinite retry from propagated exceptions — throwing an exception as-is on a malformed message leaves the offset uncommitted, so the same message is received forever. Send permanent failures to the DLQ and advance the offset.

Publishing before the DB commit — sending the message first inside a transaction means the message goes out even on rollback. The consumer receives an event for data that does not exist. Separate it with outbox or after-commit.

Closing thoughts

Topic names, partition counts, and key strategies follow you through the whole operational life once decided. Code can be fixed by refactoring, but topic design needs a migration. Agree only on the domain naming convention and the key strategy at the start, and the rest (idempotent consumers, DLQ, schemas) can be added gradually.

kafka-when
data-pipeline

References: Apache Kafka official docs, Producer configs, Consumer configs, Confluent Schema Registry, schema compatibility rules.

Kafka in Practice — Topic Design and Message Flow

1. Topic naming conventions

A topic name is effectively an API. Once publishing starts, consumers depend on it, so it is hard to change. Set the rules from the start.

<domain>.<entity>.<event> pattern — for example, orders.order.created, billing.invoice.paid. With the three axes of domain, entity, and event, search, permissions, and cleanup all get easier.
Consistent separator convention — pick one of dot (.) or kebab (-) and stay consistent. Mixing them (orders.order-created) breaks sorting and filtering. A compromise of dots for hierarchy and kebab for words (orders.order-line.created) is also common.
Environment separation — add a prefix like dev.orders.created, or separate clusters entirely per environment. Mixing environments in one cluster invites accidents where a dev consumer reads a prod topic. Separate clusters are safer.
Inject topic names via config — do not hardcode "orders.order.created" in code; take it as a config value. Prefixes differ per environment, and a name change should not require grepping the whole codebase.

# application.yml — topic names as config
app:
  topics:
    order-created: ${ORDER_CREATED_TOPIC:orders.order.created}

2. Partition and key design

Partitions are the core knob of Kafka performance design.

Partition count = parallelism ceiling — the number of consumers working concurrently in one consumer group cannot exceed the partition count. With 8 partitions, only up to 8 consumers are meaningful; a 9th sits idle.
Same key → same partition → ordering guaranteed — the partition is decided by the hash of the message key. Use one order id as the key and all events for that order land in one partition, preserving publish order. A null key scatters round-robin and the ordering guarantee disappears.
Increasing partitions changes the mapping — it is hash(key) % partitionCount, so changing the partition count sends the same key to a different partition. The ordering assumption between messages piled in the old partition and the new partition breaks. This is why the first design matters.

Starting with roughly 2–3x the expected throughput leaves a safe margin. But too many partitions raise metadata and file-handle pressure, so growing without limit is not the answer either.

// Same key → same partition → per-order ordering preserved
producer.send(new ProducerRecord<>(
    topic,
    order.getId(),          // key — the unit of ordering
    toJson(event)
));

3. Producer error handling and retries

Publishing is reliable only when confirmed as "the broker received it," not as "I sent it."

acks — 0 (no acknowledgement, loss possible), 1 (leader only), all (full ISR). For topics that must not lose data, use all.
enable.idempotence=true — the broker prevents a producer retry from writing the same message twice. Turning it on together with acks=all is effectively the default.
Retries — set retries and delivery.timeout.ms to handle transient network errors. With idempotence on, retries are safe.
Avoid synchronous blocking — waiting on every message with send().get() collapses throughput. Take the result via a callback and handle failures inside the callback.

producer.send(record, (metadata, ex) -> {
    if (ex != null) {
        log.error("publish failed topic={}", record.topic(), ex);
        // Store failed messages separately (outbox / retry queue)
    }
});

No publishing before the DB commit — if you send the message first inside a DB transaction, the message is already out even if the transaction rolls back (an event for an order that does not exist). Outbox pattern — within the same transaction, INSERT into an outbox table, and a separate publisher reads only committed rows and sends them to Kafka. Or publish in an after-commit hook right after the transaction commits.

4. Consumer groups and offset management

A consumer manages "how far it has read" (the offset) on its own.

group.id — consumers in the same group split the partitions. Separate groups read the same topic independently.
Offset commit — auto vs manual — auto-commit periodically commits automatically, which is convenient, but if it commits after receiving a message and before processing it, failures cause loss. Manual commit after processing completes pairs with at-least-once.
auto.offset.reset — the behavior when no stored offset exists (a new group, etc.). earliest starts from the beginning of the topic, latest from now. For a new consumer not to miss past data, earliest is recommended. latest silently skips messages between deployments.
Consumers ≤ partitions — same as §2. Consumers beyond the partition count receive no work.
Rebalancing — when a consumer joins or leaves, partitions are reassigned. Processing pauses briefly during that. Blocking too long inside a consumer makes the group treat it as dead and triggers repeated rebalancing (exceeding max.poll.interval.ms).

// Manual commit after processing completes — at-least-once
records.forEach(record -> {
    handle(record);                 // process
});
consumer.commitSync();              // commit after the batch

5. At-least-once and idempotent consumers

A unique id on each message — at publish time, put a uuid (or domain key) per message in the header or payload.
A processed record — before processing, the consumer checks a table like processed_messages(message_id) and skips if already present. A UNIQUE constraint on the id column blocks duplicate INSERTs and makes it more robust.
DLQ (Dead Letter Queue) — for a message that keeps failing after a few retries (bad format, permanent error), do not retry forever; send it to a separate DLQ topic. The main flow is not blocked, and failed messages are investigated separately later.
Retry + backoff — for transient errors (external API 5xx, etc.), do not retry immediately; widen the interval progressively (backoff) and retry. Beyond a set count, send to the DLQ.

void handle(ConsumerRecord<String, String> record) {
    String messageId = header(record, "message-id");
    if (alreadyProcessed(messageId)) return;   // idempotent — skip duplicate
    try {
        applyBusinessLogic(record);
        markProcessed(messageId);              // record as processed
    } catch (TransientException e) {
        throw e;                               // eligible for retry
    } catch (PermanentException e) {
        sendToDlq(record, e);                  // permanent failure → DLQ
    }
}

6. Schema evolution

A topic's message format changes over time. Fields are added, renamed, or shift meaning. Because Producer and Consumer deploy separately, format changes break easily.

Schema Registry — version-manages Avro, Protobuf, or JSON Schema definitions centrally. The Producer publishes with a schema id, and the Consumer deserializes with that schema. The format agreement becomes explicit.
Compatibility rules — backward compatibility means a new schema can read old messages (deploy consumers first), forward compatibility means an old schema can read new messages (deploy producers first). Backward is usually the default.
Adding a field — adding an optional field with a default is usually safe. Old consumers ignore fields they do not know.
Removing, renaming, or retyping a field — these are breaking changes. Remove in stages (first make it optional, remove much later), and work around a rename by adding a new field while keeping the old one.

You can use plain JSON without a Schema Registry, but then the discipline of format changes must be held by people through code review and documentation.

Common pitfalls

Synchronous blocking send — calling send().get() per message stalls on every network round trip and drops throughput into the single digits. Use callbacks.

Closing thoughts

kafka-when
data-pipeline

References: Apache Kafka official docs, Producer configs, Consumer configs, Confluent Schema Registry, schema compatibility rules.

Kafka in Practice — Topic Design and Message Flow

Kafka in Practice — Topic Design and Message Flow

1. Topic naming conventions

2. Partition and key design

3. Producer error handling and retries

4. Consumer groups and offset management

5. At-least-once and idempotent consumers

6. Schema evolution

Common pitfalls

Closing thoughts

Next

More in data

Kafka in Practice — Topic Design and Message Flow

Kafka in Practice — Topic Design and Message Flow

1. Topic naming conventions

2. Partition and key design

3. Producer error handling and retries

4. Consumer groups and offset management

5. At-least-once and idempotent consumers

6. Schema evolution

Common pitfalls

Closing thoughts

Next

More in data