1) Quick definitions (the mental model)
- Topic: a named stream of messages (events). Producers write to a topic; consumers read from it.
- Partition: a topic is split into partitions, which are append-only logs with offsets 0, 1, 2, … Ordering is guaranteed only within a partition; there is no global order across partitions.
- Consumer group: a logical application (one name) that reads a topic. Kafka ensures each partition is read by at most one member of the group at a time, so the group processes the stream once.
- Consumer process (a.k.a. worker / pod / container / JVM): a running OS process that joins the consumer group.
- KafkaConsumer instance: the client object inside your process (e.g., new KafkaConsumer<>(props) in Java) that talks to brokers. Kafka assigns partitions to these instances. Important: KafkaConsumer is not thread-safe; only one thread should call its methods (poll, commit, seek, …).
- Thread: an execution lane inside a process. Common safe patterns are:
  - One KafkaConsumer per thread (multi-consumer, multi-thread).
  - One polling thread + worker pool (poll in one thread, hand work to others; the polling thread still commits). A minimal poll-loop sketch follows this list.
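A minimal single-threaded poll loop in Java to ground the terms above. The broker address, topic name, group id, and handler are placeholder assumptions for illustration; the point is the pattern of one thread owning poll, processing, and commit.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-service");          // the consumer group name
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // manual commits

        // One KafkaConsumer instance, used by exactly one thread: it is not thread-safe.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);
                }
                if (!records.isEmpty()) {
                    consumer.commitSync(); // commit only after the batch is processed
                }
            }
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d key=%s%n",
                record.partition(), record.offset(), record.key());
    }
}
```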
2) How messages land in partitions
- With a key: partition = hash(key) % numPartitions. Every event with the same key (e.g., orderId) goes to the same partition, which preserves per-key order (see the producer sketch below).
- Without a key: modern clients use a sticky strategy (send batches to one partition, then switch) for throughput. Over time the distribution is roughly even, but there’s no cross-partition order.
Rule of thumb: choose a key that spreads load evenly but keeps events for the same entity together.
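A sketch of keyed vs. unkeyed sends with the Java producer. The broker address, key, and payloads are placeholder values; strictly, the Java client’s default partitioner hashes keys with murmur2, which the hash(key) % numPartitions formula above simplifies.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-42"; // hypothetical key
            // Same key -> same partition, so these two events stay in order relative to each other.
            producer.send(new ProducerRecord<>("orders", orderId, "ORDER_CREATED"));
            producer.send(new ProducerRecord<>("orders", orderId, "ORDER_PAID"));
            // No key -> the sticky strategy fills a batch for one partition, then switches.
            producer.send(new ProducerRecord<>("orders", null, "AUDIT_PING"));
        } // close() flushes pending sends
    }
}
```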
3) How partitions are assigned to consumers
When your group runs, the group coordinator maps partitions → KafkaConsumer instances. Triggers for rebalance: a member joins/leaves/crashes, subscriptions change, or a consumer stops heartbeating or stops polling (e.g., a blocked poll loop that exceeds max.poll.interval.ms).
Assignor strategies:
- Range: contiguous ranges; can skew.
- Round robin: even spread; less stable across changes.
- Sticky / Cooperative sticky: keeps prior mapping to reduce churn; cooperative sticky does incremental hand-offs and is the recommended default.
Best practice: use the cooperative sticky assignor and static membership (group.instance.id) so quick restarts don’t cause full rebalances; a config sketch follows.
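A minimal sketch of those two settings on the Java consumer. The instance-id value is an assumption; it must be unique per process and stable across restarts, and these properties get merged into your normal consumer config.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class RebalanceFriendlyConfig {
    // Consumer properties enabling incremental (cooperative) rebalances and static membership.
    static Properties rebalanceFriendly(String instanceId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName()); // incremental hand-offs
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, instanceId); // e.g. "orders-service-worker-1"
        return props;
    }
}
```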
4) Concurrency inside a worker (safe blueprints)
- Per-partition serial lanes (simple & safe): for each assigned partition, process serially on its own single-thread executor. Preserves partition order automatically (see the sketch after this list).
- Per-key serialization atop a worker pool (higher throughput): poll on one thread, dispatch records to a pool, but keep per-key FIFO queues so each key is processed in sequence. Commit only when all records up to an offset are done.
Never have multiple threads call poll/commit on the same KafkaConsumer instance.
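A sketch of the per-partition-lanes pattern, assuming an already-subscribed consumer of string records. Offset tracking, commit-on-completion, and shutting down lanes for revoked partitions are elided to keep it short.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PerPartitionLanes {
    // One single-thread executor per assigned partition = serial processing per partition.
    private final Map<TopicPartition, ExecutorService> lanes = new ConcurrentHashMap<>();

    public void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            // Only this thread ever touches the consumer (poll/commit/seek): it is not thread-safe.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                TopicPartition tp = new TopicPartition(record.topic(), record.partition());
                lanes.computeIfAbsent(tp, ignored -> Executors.newSingleThreadExecutor())
                     .submit(() -> handle(record)); // FIFO within the lane preserves partition order
            }
        }
    }

    private void handle(ConsumerRecord<String, String> record) {
        System.out.printf("p%d@%d key=%s%n", record.partition(), record.offset(), record.key());
    }
}
```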
5) Sizing and throughput planning
Let:
- R = records/sec per consumer instance
- T = average processing time per record (seconds)
Required concurrency ≈ R × T (Little’s Law).
- CPU-bound: thread pool ≈ number of vCPUs; batching can help.
- I/O-bound: pool ≈ R × T × (1.5–3), with per-partition or per-key caps to preserve order. Use bounded queues and pause/resume partitions for backpressure.
Partition count (rough baseline):
- Estimate peak MB/s.
- Target ~1–3 MB/s per partition.
- Partitions ≈ (needed MB/s) / (target per-partition MB/s), then add 1.5–2× headroom. A worked example follows.
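A worked example with made-up numbers (400 records/sec per instance, 20 ms per record, 30 MB/s peak, 2 MB/s per partition), just to show the arithmetic:

```java
public class SizingExample {
    public static void main(String[] args) {
        double recordsPerSec = 400.0;    // R (assumed)
        double secondsPerRecord = 0.020; // T (assumed)
        double requiredConcurrency = recordsPerSec * secondsPerRecord; // Little's Law: R × T = 8
        double ioBoundPool = requiredConcurrency * 2;                  // I/O-bound headroom ≈ 16 threads

        double peakMbPerSec = 30.0;        // estimated peak throughput (assumed)
        double targetMbPerPartition = 2.0; // target per-partition throughput (assumed)
        double basePartitions = peakMbPerSec / targetMbPerPartition;   // 15
        double withHeadroom = basePartitions * 2;                      // ≈ 30 partitions

        System.out.printf("concurrency≈%.0f, pool≈%.0f, partitions≈%.0f–%.0f%n",
                requiredConcurrency, ioBoundPool, basePartitions, withHeadroom);
    }
}
```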
6) Best practices (checklist)
- Set enable.auto.commit=false; commit after processing contiguous offsets (sketch after this list).
- Keep the poll loop responsive; do I/O on worker threads; tune max.poll.records.
- Use the cooperative sticky assignor + static membership.
- Use keys for per-entity ordering; avoid hot keys.
- Bound everything: worker queues, per-key maps (LRU), in-flight per partition.
- Monitor lag per partition, processing latency, rebalances, and skew.
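A small sketch of committing only up to the highest contiguous processed offset for one partition. The partition and the bookkeeping that produces lastProcessedOffset are assumed to exist elsewhere; note that the committed offset is the next offset to read, hence the +1.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ContiguousCommit {
    // Called from the polling thread only (KafkaConsumer is not thread-safe).
    static void commitUpTo(KafkaConsumer<String, String> consumer,
                           TopicPartition partition, long lastProcessedOffset) {
        Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
        toCommit.put(partition, new OffsetAndMetadata(lastProcessedOffset + 1));
        consumer.commitSync(toCommit);
    }
}
```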
7) Limitations & trade-offs
- Max parallelism per group = number of partitions. Extra consumers beyond partitions sit idle.
- No global order across partitions.
- Hot partitions bottleneck throughput; fix keying or add partitions.
- Too many partitions increase broker metadata, open files, recovery time, and rebalance duration.
- Increasing partitions changes hash(key) % N, so future events for a key may land on a different partition; plan such migrations carefully (often via a new topic). A tiny illustration follows.
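A toy illustration of why a key can move when N changes. It uses String.hashCode purely for demonstration (Kafka’s default partitioner actually hashes keys with murmur2); the modulo effect is the same.

```java
public class RepartitionDemo {
    // Simplified stand-in for "hash(key) % numPartitions".
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        String key = "order-42"; // hypothetical key
        System.out.println(partitionFor(key, 6));  // partition when the topic has 6 partitions
        System.out.println(partitionFor(key, 12)); // may differ after growing to 12 partitions
    }
}
```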
8) Tiny example (end-to-end)
- Topic orders has 6 partitions.
- Consumer group orders-service runs 2 processes: process A has 2 threads (2 KafkaConsumer instances), process B has 1 thread (1 instance).
- Kafka assigns the 6 partitions across these 3 instances; each instance may own multiple partitions. A topic-creation sketch follows.
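To set this up, a minimal AdminClient sketch could create the 6-partition topic; the broker address and replication factor are assumptions.

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions as in the example; replication factor 3 is an assumption.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get(); // block until the topic exists
        }
    }
}
```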
One diagram to tie it all together
```mermaid
flowchart TB
  %% Topic and partitions
  subgraph Topic["Topic orders with 6 partitions"]
    direction LR
    P0[P0]
    P1[P1]
    P2[P2]
    P3[P3]
    P4[P4]
    P5[P5]
  end
  %% Consumer group and processes
  subgraph Group["Consumer group orders-service"]
    direction LR
    subgraph A["Consumer process A (worker)"]
      direction TB
      A1["Thread 1<br/>KafkaConsumer #1"]
      A2["Thread 2<br/>KafkaConsumer #2"]
    end
    subgraph B["Consumer process B (worker)"]
      direction TB
      B1["Thread 1<br/>KafkaConsumer #3"]
    end
  end
  %% Example assignment (one moment in time)
  P0 --> A1
  P3 --> A1
  P1 --> A2
  P4 --> A2
  P2 --> B1
  P5 --> B1
  %% Notes
  Note["Rules:<br/>- One partition to at most one consumer instance in a group<br/>- A consumer instance may own multiple partitions<br/>- KafkaConsumer is not thread safe<br/>- Ordering is per partition only"]
  Group --- Note
```
TL;DR
- Topic = stream; Partition = ordered shard; Group = app.
- Process/Worker = running app instance; KafkaConsumer instance = the client object that owns partitions; Thread = how you parallelize work safely.
- Scale by partitions → instances → threads, preserve order per partition or per key, and keep the poll loop fast with manual commits.