Apache Kafka — A Complete Guide

2025-06-01

A bottom-up walkthrough of Kafka: what it is, how it works internally, and how Spring Boot connects to it. Written as study notes refined from real production experience.


Table of Contents

1. What is Kafka?
2. Core Concepts
3. Topics
4. Partitions
5. Brokers & Clusters
6. Partition Leader & Follower
7. Replication Factor
8. Producers
9. How Spring Boot Connects to Kafka
10. Consumers & Consumer Groups
11. How to Decide Partition Count
12. Running Kafka Locally (KRaft Mode)
13. Real-World Architecture Example
14. Interview Cheat Sheet

1. What is Kafka?

Apache Kafka is an Event Streaming Platform.

Event streaming means capturing data in real time from sources such as databases, applications, and sensors; storing those event streams durably; and processing, reacting to, and routing them as they happen.

Kafka's 3 Core Capabilities

#   Capability
--  ------------------------------------------------
1   Publish and subscribe to streams of events
2   Store streams durably for as long as needed
3   Process streams in real time or retrospectively

Kafka is distributed, fault-tolerant, and elastic. It runs on bare metal, VMs, or cloud.


2. Core Concepts

Events

An event is something that happened — a record or message. Every event has a key (optional), a value (the payload), a timestamp, and optional metadata headers.
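
A minimal sketch of that anatomy using the Java client's ProducerRecord — topic, key, and payload here are illustrative:

import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.internals.RecordHeader;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class EventAnatomy {
    public static void main(String[] args) {
        ProducerRecord<String, String> event = new ProducerRecord<>(
                "payments",                 // topic
                null,                       // partition — null lets Kafka decide
                System.currentTimeMillis(), // timestamp
                "txn-001",                  // key
                "{\"amount\":100}",         // value
                List.of(new RecordHeader("source",
                        "checkout".getBytes(StandardCharsets.UTF_8)))
        );
        System.out.println(event);
    }
}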

Producers

Producers publish (write) events to Kafka.

Consumers

Consumers read and process events from Kafka.


3. Topics

A Topic is the fundamental way to organize data in Kafka.

Think of a topic like a folder in a filesystem, and events as the files inside it.

Property                 Detail
---------------------    --------------------------------------------
Multi-producer           Many producers can write to the same topic
Multi-consumer           Many consumers can read from the same topic
No deletion on read      Consumers can re-read events at any time
Configurable retention   Keep events for 7 days, or indefinitely
Partitioned              Split for scalability and parallelism
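
If a service owns its topics, they can also be declared in code. A hypothetical sketch using spring-kafka's TopicBuilder (topic name and settings are illustrative) — Spring Boot creates the topic on startup if it doesn't already exist:

import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

@Configuration
public class TopicConfig {

    @Bean
    public NewTopic paymentsTopic() {
        // 3 partitions for parallelism, RF 1 for local dev (use 3 in prod)
        return TopicBuilder.name("payments")
                .partitions(3)
                .replicas(1)
                .build();
    }
}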

Naming Convention

Use descriptive, hyphen-separated names, e.g. payment-transactions, order-events, user-signups.


4. Partitions

A Partition is the physical subdivision of a topic. This is the most important concept for scalability.

A Topic is logical. A Partition is physical — an actual append-only log on disk.

Why Partitions Exist

Reason        Explanation
-----------   --------------------------------------------------------
Scalability   Data spread across brokers; parallel producers/consumers
Throughput    Parallel reads and writes
Ordering      Guaranteed within a partition, not across partitions

How Events Land in a Partition

Scenario            Behavior
-----------------   ---------------------------------------------------------------
Message has a Key   Kafka hashes the key → same key always hits same partition
No Key              Sticky partitioning — picks a partition per batch, then rotates
Manual override     Producer explicitly specifies partition number
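
A sketch of that key → partition mapping, mirroring the murmur2-based formula Kafka's default partitioner applies to keyed records (assumes kafka-clients on the classpath):

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class PartitionForKey {

    // Same formula the default partitioner uses for records with a key
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // A fixed key always maps to the same partition — until the count changes
        System.out.println(partitionFor("txn-001", 3));
        System.out.println(partitionFor("txn-001", 3)); // identical result
    }
}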

Partition as Append-Only Log

Topic: orders
-----------------------------
P0 → [event1] [event2] [event3]   ← appended in order
P1 → [event4] [event5]
P2 → [event6] [event7]

Events within a partition are ordered and immutable; they are only ever appended to the end.


5. Brokers & Clusters

A broker is a single Kafka server. It stores partitions on disk, serves producer and consumer requests, and replicates data from partition leaders on other brokers.

A Kafka Cluster is multiple brokers working together:

Kafka Cluster
┌───────────────────────────────────────┐
│  Broker 1    Broker 2    Broker 3     │
│  (Node 1)    (Node 2)    (Node 3)     │
└───────────────────────────────────────┘

Adding more brokers = horizontal scaling.

ZooKeeper vs KRaft

          ZooKeeper (old)                               KRaft (new, Kafka 3+)
------    --------------------------------------------  ------------------------------------------
Role      External service managing cluster metadata    Built-in consensus, no external dependency
Status    Deprecated                                    Current standard
Setup     Requires separate ZooKeeper cluster           Self-contained

Use KRaft mode for all new setups.


6. Partition Leader & Follower

Every partition has one leader (the broker handling all reads and writes for that partition) and zero or more followers (brokers that replicate the leader's log).

One leader means simpler consistency — no stale reads. The leader is the single source of truth.

What Happens When a Leader Fails?

The controller detects the failure and elects a new leader from the in-sync replicas. Clients refresh their metadata and transparently reroute to the new leader — no data is lost as long as an in-sync replica survives.


7. Replication Factor

Replication Factor (RF) = how many copies of a partition exist across the cluster.

RF = 3, Topic: payments, Partition P0

  Broker 1 → [P0 Leader]   ← handles all reads/writes
  Broker 2 → [P0 Replica]  ← stays in sync
  Broker 3 → [P0 Replica]  ← stays in sync

RF   Meaning
--   ----------------------------------------
1    No redundancy. Broker dies → data lost
2    One backup. Rarely used in prod
3    Standard for production

You need at least as many brokers as your replication factor.

ISR — In-Sync Replicas

Replicas that are caught up to the leader. If a replica falls behind, it's removed from ISR.

min.insync.replicas=2 with RF=3 means: at least 2 replicas must acknowledge a write before it's confirmed. With acks=all, a write is rejected if fewer than 2 replicas are in sync — trading availability for durability.


8. Producers

Producers write events to Kafka. The key configuration areas — acknowledgements, idempotence, and batching — are covered below.

Acknowledgement Modes (acks)

acks   Meaning                Risk
----   --------------------   ------------------------------------------
0      Fire and forget        Message can be lost
1      Leader ACKs only       Lost if leader fails before replication
all    All ISR replicas ACK   Safest — use for financial/critical data

Idempotent Producer

enable.idempotence=true

Guarantees that retries never write duplicates to a partition — effectively exactly-once delivery per producer session.

Batching

Producers batch messages before sending for throughput efficiency:

Config             Purpose
----------------   -------------------------------------------
linger.ms          Wait up to N ms to fill a batch
batch.size         Max bytes per batch
compression.type   snappy, lz4, gzip — reduces network usage
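
A sketch tying these settings together — a plain Java producer configured for safe delivery and efficient batching (servers, topic, and values are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // all ISR replicas must ACK
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // retries never duplicate
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to 10 ms per batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);    // 32 KB max per batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "txn-001", "{\"amount\":100}"));
        }
    }
}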

9. How Spring Boot Connects to Kafka

This is how the connection actually works — not just "add bootstrap servers to config".

Step 1: Bootstrap

spring:
  kafka:
    bootstrap-servers: broker1:9092,broker2:9092,broker3:9092

Bootstrap servers are just an initial contact point. Spring Boot connects to any one of them to fetch cluster metadata (all brokers, all topics, all partition leaders).

Step 2: Metadata Fetch

From that initial connection, the client gets a full map of the cluster:

Topic: payments
  Partition 0 → Leader: Broker 2
  Partition 1 → Leader: Broker 1
  Partition 2 → Leader: Broker 3

Step 3: Direct Routing

After metadata fetch, the producer routes messages directly to the partition leader — not through the bootstrap server.

Spring Boot Producer
        │
        ├──→ Broker 2 (Leader for P0) ← payment with key "txn-001"
        ├──→ Broker 1 (Leader for P1) ← payment with key "txn-002"
        └──→ Broker 3 (Leader for P2) ← payment with key "txn-003"

Key insight: You don't need all brokers in bootstrap-servers — just enough that at least one is reachable at startup.
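
In application code this routing is invisible. A hypothetical Spring Boot publisher (class and topic names are illustrative) — KafkaTemplate resolves the partition leader from cached metadata and sends directly to it:

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class PaymentPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public PaymentPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String txnId, String payload) {
        // Same key → same partition → per-transaction ordering preserved
        kafkaTemplate.send("payments", txnId, payload);
    }
}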


10. Consumers & Consumer Groups

Consumer Group

A Consumer Group is a set of consumers that together consume a topic. Each partition is assigned to exactly one consumer in the group, so the partition count sets the upper bound on parallelism.

What Happens If a Consumer Joins/Leaves?

Kafka triggers a rebalance — partitions are redistributed across the group.
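
What a rebalance looks like from the client side — a sketch using the plain Java consumer's callbacks (Spring handles this internally, but the hooks make the mechanics visible; servers and topic are placeholders):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

public class RebalanceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("payments"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before partitions move away — commit in-flight work here
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called after the rebalance with this consumer's new assignment
                System.out.println("Assigned: " + partitions);
            }
        });

        // Rebalance callbacks fire inside poll(); a real app would loop here
        consumer.poll(Duration.ofSeconds(1));
        consumer.close();
    }
}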

Offset Management

Offset = a consumer's position within a partition; the committed offset marks the next record to read.

Mode            Behaviour
-------------   -------------------------------------------------------------------
Auto commit     Kafka commits offset periodically (risk of re-processing on crash)
Manual commit   Consumer commits after processing (safer for financial systems)

Stored in internal topic: __consumer_offsets

enable.auto.commit=false

For payment systems, always use manual commit:

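// Requires spring.kafka.listener.ack-mode: manual (or manual_immediate)
// so the container injects the Acknowledgment parameter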
@KafkaListener(topics = "payments", groupId = "payment-service")
public void consume(ConsumerRecord<String, String> record,
                    Acknowledgment ack) {
    process(record);
    ack.acknowledge(); // commit only after successful processing
}

11. How to Decide Partition Count

More partitions = more parallelism, but also more overhead.

Factor                     Guidance
------------------------   ---------------------------------------------------------------
Max consumer parallelism   Partitions = max consumers you'll ever want in a group
Throughput target          Measure per-partition throughput, divide target by that
Broker count               Partitions should be a multiple of broker count for even spread
Ordering                   If strict per-entity ordering is needed, use key-based partitioning

Practical Formula

Desired partitions ≈ max(
    target throughput / throughput per partition,
    max consumer instances you'll scale to
)

Example: 600 MB/s target, 100 MB/s per partition, max 12 consumers → 12 partitions
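
The same arithmetic as a toy helper — the throughput figures are assumptions you'd measure yourself:

public class PartitionSizing {

    // Desired partitions ≈ max(throughput need, consumer parallelism need)
    static int desiredPartitions(double targetMBps, double perPartitionMBps, int maxConsumers) {
        int byThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(byThroughput, maxConsumers);
    }

    public static void main(String[] args) {
        // 600 MB/s target, 100 MB/s per partition, up to 12 consumers → 12
        System.out.println(desiredPartitions(600, 100, 12));
    }
}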

You can increase partitions later but cannot decrease them.
Increasing partitions can break key-based ordering for existing keys.
Over-partition slightly rather than under-partition.

Cluster Size         Recommendation
------------------   ---------------------------------
Small / Dev          3–6 partitions per topic
Medium               12–24 for high-throughput topics
Large / Enterprise   50–100+ based on SLA

12. Running Kafka Locally (KRaft Mode)

Start the Broker

# Generate Cluster ID
KAFKA_CLUSTER_ID="$(./bin/kafka-storage.sh random-uuid)"

# Format storage
./bin/kafka-storage.sh format --standalone -t $KAFKA_CLUSTER_ID -c config/server.properties

# Start the broker
./bin/kafka-server-start.sh config/server.properties

Topic Operations

# Create
./bin/kafka-topics.sh --create --topic payments \
  --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

# Describe
./bin/kafka-topics.sh --describe --topic payments \
  --bootstrap-server localhost:9092

# List
./bin/kafka-topics.sh --list --bootstrap-server localhost:9092

# Delete
./bin/kafka-topics.sh --delete --topic payments \
  --bootstrap-server localhost:9092

Test with CLI Producer/Consumer

# Produce
./bin/kafka-console-producer.sh --topic payments \
  --bootstrap-server localhost:9092

# Consume from beginning
./bin/kafka-console-consumer.sh --topic payments \
  --from-beginning --bootstrap-server localhost:9092

# Consume in a group
./bin/kafka-console-consumer.sh --topic payments \
  --group payment-service --bootstrap-server localhost:9092

KRaft server.properties

# Combined mode — good for local dev
process.roles=broker,controller

# Separate directories for data and metadata
log.dirs=/path/to/kafka-broker-logs
metadata.log.dir=/path/to/kafka-metadata-logs

In production, put these on separate disks — metadata writes need low latency and shouldn't compete with data writes.


13. Real-World Architecture Example

This is based on the payment processing platform I work on:

External        Config      Kafka          Middleware     Targets
Sources         Manager    Cluster         Service        (3rd Party)
─────────      ─────────  ─────────       ─────────      ─────────
                          ┌─────────┐
 Merchant  →  CM Service →│merchant │→ MW consumes  →   Redis Cache
 Data                     │ topic   │                →   Payment Processor A
                          └─────────┘               →   Payment Processor B

                          ┌─────────┐
 Transaction →  App      →│txn      │→ TLM consumes → DB (save txn)
 Events                   │ topic   │
                          └─────────┘

Why Kafka here?

One event stream fans out to multiple independent consumers (cache, payment processors, DB) without the producers knowing about any of them. The durable log buffers bursts so slow downstream targets never back-pressure the source, and retention lets a consumer replay events after a failure.


14. Interview Cheat Sheet

One-Line Answers

Question                        Answer
-----------------------------   ---------------------------------------------------------------
What is Kafka?                  Distributed, fault-tolerant event streaming platform for publishing, storing, and processing real-time data streams
What is a topic?                Logical category for organizing events, split into partitions for scalability
What is a partition?            Physical, ordered, append-only log — the actual unit of storage and parallelism
What is a broker?               Single Kafka server that stores partitions and serves producer/consumer requests
What is a partition leader?     The one broker responsible for all reads and writes for a given partition
What is replication factor?     Number of copies of each partition — ensures fault tolerance
How does Spring Boot connect?   Connects to bootstrap servers for metadata, then routes directly to partition leaders
What is a consumer group?       Set of consumers sharing a topic, with each partition assigned to exactly one consumer

Key Numbers

Setting                     Value
-------------------------   ------------------------------------------
Replication Factor (prod)   3
Min brokers for RF=3        3
Default retention           7 days
Max useful consumers        = number of partitions (per consumer group)

Phrases Worth Remembering

"A topic is logical; a partition is physical."
"Ordering is guaranteed within a partition, not across partitions."
"Bootstrap servers are just the initial contact point — clients route directly to partition leaders."
"You can add partitions later, but you can never remove them."


Tags: kafka, distributed-systems, backend, spring-boot
