Kafka: Consumer Group vs Worker vs Thread vs Consumer Instance vs Topic vs Partitions

1) Quick definitions (the mental model)

  • Topic
    A named stream of messages (events). Producers write to a topic; consumers read from it.
  • Partition
    A topic is split into partitions—append-only logs with offsets 0,1,2,…
    Ordering is guaranteed only within a partition. There is no global order across partitions.
  • Consumer group
    A logical application (one name) that reads a topic. Kafka ensures each partition is read by at most one member of the group at a time, so the group processes the stream once.
  • Consumer process (a.k.a. worker / pod / container / JVM)
    A running OS process that joins the consumer group.
  • KafkaConsumer instance
    The client object inside your process (e.g., new KafkaConsumer<>(props) in Java) that talks to brokers. Kafka assigns partitions to these instances.
    Important: KafkaConsumer is not thread-safe—only one thread should call its methods (poll, commit, seek, …).
  • Thread
    An execution lane inside a process. Common safe patterns are:
    1. One KafkaConsumer per thread (multi-consumer, multi-thread).
    2. One polling thread + worker pool (poll in one thread, hand work to others; poller still commits).
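
To tie these terms together, here is a minimal Java sketch of one consumer process with a single thread that owns a single KafkaConsumer instance and does both the polling and the committing. The broker address, topic name (orders), and group id (orders-service) are illustrative placeholders, not prescribed values.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SingleThreadedWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-service");          // the consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually (see checklist)

        // One KafkaConsumer instance, used by exactly one thread.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                            // the topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process(record) -- application logic goes here
                }
                consumer.commitSync();                                        // same thread commits
            }
        }
    }
}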

2) How messages land in partitions

  • With a key: partition = hash(key) % numPartitions.
    Every event with the same key (e.g., orderId) goes to the same partition → preserves per-key order.
  • Without a key: modern clients use a sticky strategy (send batches to one partition, then switch) for throughput. Over time the distribution is roughly even, but there’s no cross-partition order.

Rule of thumb: choose a key that spreads load evenly but keeps events for the same entity together.
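
A small illustrative producer sketch of both cases; the broker address, topic name, keys, and payloads are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed: every event for "order-42" hashes to the same partition,
            // so these two events stay in order relative to each other.
            producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_CREATED"));
            producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_PAID"));

            // Unkeyed (no key): the sticky partitioner fills a batch on one partition,
            // then switches; good throughput, but no ordering across these events.
            producer.send(new ProducerRecord<>("orders", "HEARTBEAT"));
            producer.flush();
        }
    }
}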


3) How partitions are assigned to consumers

When your group runs, the group coordinator maps partitions → KafkaConsumer instances. Rebalances are triggered when a member joins, leaves, or crashes, when subscriptions change, or when a member is ejected because it missed heartbeats or exceeded max.poll.interval.ms (e.g., a blocked poll loop).

Assignor strategies:

  • Range: contiguous ranges; can skew.
  • Round robin: even spread; less stable across changes.
  • Sticky / Cooperative sticky: keeps the prior mapping to reduce churn; cooperative sticky adds incremental hand-offs and is the recommended choice.

Best practice: use cooperative sticky assignor and static membership (group.instance.id) so quick restarts don’t cause full rebalances.
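
A sketch of those two settings as Java consumer properties; the group id, instance id, and session timeout below are illustrative values, not prescriptions.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class GroupStabilityConfig {
    // Returns the group-stability settings discussed above.
    static Properties stableGroupProps(String instanceId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-service");
        // Cooperative sticky: incremental hand-offs instead of stop-the-world rebalances.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Static membership: a stable per-instance id (e.g., the pod name) so a quick
        // restart rejoins with the same identity instead of forcing a full rebalance.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, instanceId);
        // Give a restarting member time to come back before its partitions are reassigned.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");
        return props;
    }
}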


4) Concurrency inside a worker (safe blueprints)

  • Per-partition serial lanes (simple & safe)
    For each assigned partition, process serially on its own single-thread executor. Preserves partition order automatically.
  • Per-key serialization atop a worker pool (higher throughput)
    Poll on one thread, dispatch records to a pool but keep per-key FIFO queues so each key is processed in sequence. Commit only when all records up to an offset are done.

Never have multiple threads call poll/commit on the same KafkaConsumer instance.
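
As one possible shape for the first blueprint, the sketch below polls and commits on a single thread and gives each assigned partition its own single-thread executor ("lane"), so partition order is preserved. Connection details and the process(record) step are placeholders.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerPartitionLanes {
    // One single-thread executor per assigned partition = serial processing per partition.
    private static final Map<TopicPartition, ExecutorService> lanes = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                List<Future<?>> batch = new ArrayList<>();
                for (TopicPartition tp : records.partitions()) {
                    ExecutorService lane = lanes.computeIfAbsent(tp, p -> Executors.newSingleThreadExecutor());
                    batch.add(lane.submit(() -> {
                        for (ConsumerRecord<String, String> record : records.records(tp)) {
                            // process(record) -- application logic goes here
                        }
                    }));
                }
                // Wait for the whole polled batch before committing.
                for (Future<?> f : batch) {
                    try { f.get(); } catch (Exception e) { /* retry / dead-letter as appropriate */ }
                }
                // Only the polling thread ever touches the consumer.
                consumer.commitSync();
            }
        }
    }
}

Blocking on the whole batch before committing keeps the sketch simple; a production version would typically track per-partition offsets, shut down lanes for revoked partitions, and use pause/resume for backpressure instead.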


5) Sizing and throughput planning

Let:

  • R = records/sec per consumer instance
  • T = average processing time per record (seconds)

Required concurrency ≈ R × T (Little’s Law).

  • CPU-bound: thread pool ≈ number of vCPUs; batching can help.
  • I/O-bound: pool ≈ R × T × (1.5–3) with per-partition or per-key caps to preserve order. Use bounded queues and pause/resume partitions for backpressure.
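
Worked example (illustrative numbers): at R = 500 records/sec per instance and T = 20 ms (0.02 s) of mostly I/O-bound work per record, required concurrency ≈ 500 × 0.02 = 10 records in flight; applying the 1.5–3× I/O factor suggests a pool of roughly 15–30 workers, still capped per partition or per key so ordering holds.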

Partitions count (rough baseline):

  • Estimate peak MB/s.
  • Target ~1–3 MB/s per partition.
  • Partitions ≈ (needed MB/s) / (target per-partition MB/s), then add 1.5–2× headroom.
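
Worked example (illustrative numbers): a peak of 40 MB/s against a target of 2 MB/s per partition gives 40 / 2 = 20 partitions; with 1.5–2× headroom, provision roughly 30–40.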

6) Best practices (checklist)

  • enable.auto.commit=false; commit after processing contiguous offsets (see the sketch after this checklist).
  • Keep poll loop responsive; do I/O on worker threads; tune max.poll.records.
  • Use cooperative sticky assignor + static membership.
  • Use keys for per-entity ordering; avoid hot keys.
  • Bound everything: worker queues, per-key maps (LRU), in-flight per partition.
  • Monitor lag per partition, processing latency, rebalances, and skew.
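
A sketch of the first item, assuming the application tracks the highest fully processed offset per partition; Kafka stores the offset of the next record to read, hence the +1. The class and method names are hypothetical.

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommits {
    // lastProcessed holds, per partition, the highest offset whose record (and every
    // record before it) has finished processing.
    static void commitProcessed(KafkaConsumer<?, ?> consumer, Map<TopicPartition, Long> lastProcessed) {
        Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
        for (Map.Entry<TopicPartition, Long> e : lastProcessed.entrySet()) {
            // Commit the offset of the *next* record to read.
            toCommit.put(e.getKey(), new OffsetAndMetadata(e.getValue() + 1));
        }
        consumer.commitSync(toCommit); // called from the polling thread only
    }
}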

7) Limitations & trade-offs

  • Max parallelism per group = number of partitions. Extra consumers beyond partitions sit idle.
  • No global order across partitions.
  • Hot partitions bottleneck throughput; fix keying or add partitions.
  • Too many partitions increase broker metadata, open files, recovery time, and rebalance duration.
  • Increasing partitions remaps hash(key) % N → future events for a key may move; plan migrations carefully (often to a new topic).

8) Tiny example (end-to-end)

  • Topic orders has 6 partitions.
  • Consumer group orders-service runs 2 processes.
    Process A has 2 threads (2 KafkaConsumer instances). Process B has 1 thread (1 instance).
    Kafka assigns partitions across these 3 instances; each instance may own multiple partitions.

One diagram to tie it all together

flowchart TB
  %% Topic and partitions
  subgraph Topic[Topic orders with 6 partitions]
    direction LR
    P0[P0]
    P1[P1]
    P2[P2]
    P3[P3]
    P4[P4]
    P5[P5]
  end

  %% Consumer group and processes
  subgraph Group[Consumer group orders-service]
    direction LR

    subgraph A["Consumer process A (worker)"]
      direction TB
      A1["Thread 1<br/>KafkaConsumer #1"]
      A2["Thread 2<br/>KafkaConsumer #2"]
    end

    subgraph B["Consumer process B (worker)"]
      direction TB
      B1["Thread 1<br/>KafkaConsumer #3"]
    end
  end

  %% Example assignment (one moment in time)
  P0 --> A1
  P3 --> A1
  P1 --> A2
  P4 --> A2
  P2 --> B1
  P5 --> B1

  %% Notes
  Note["Rules:<br/>- One partition to at most one consumer instance in a group<br/>- A consumer instance may own multiple partitions<br/>- KafkaConsumer is not thread safe<br/>- Ordering is per partition only"]
  Group --- Note

TL;DR

  • Topic = stream; Partition = ordered shard; Group = app; Process/Worker = running app instance; KafkaConsumer instance = the client object that owns partitions; Thread = how you parallelize work safely.
  • Scale by partitions → instances → threads, preserve order per partition or per key, and keep the poll loop fast with manual commits.