Fixing Debezium Connectors when they break in production

Aman Garg
Published in Level Up Coding
Feb 5, 2021 · 6 min read


Debezium is a wonderful piece of technology. You can easily start a Debezium connector by following the basic guide and, lo and behold, you have events!

However, problems start to appear when the environment around the initial happy-path configuration changes, breaking your connectors and the downstream consumers.

Here we’ll look at the following problems.

  • The database binlog rolls over and you fall out of sync.
  • The entire database host changes along with its binlog.
  • You lose events for some reason.
  • You want to capture a new, huge table in an existing connector.
  • You introduce sudden schema changes but they are not synced.

Of course, creating a new connector altogether works, but the idea is to issue a minimally invasive update to the connector configuration in order to bring it back to life. A connector may be referenced by its name in scripts, metric registries, etc. You may not want to use a sword when a knife can do the job.

In this article, I walk through fixes for a few of these. We’ll also see how Debezium interacts with the internal Kafka Connect topics.

As always, refer to the official Debezium documentation for more information.

Problem I: Rolled-Over Binlogs

If the connector task is dead for long enough that the binlog position stored by Debezium in Kafka Connect’s internal offset.storage.topic no longer exists on the database server, the connector cannot resume.

You may try recreating the connector with a different name, restarting tasks, or changing server IDs, but if the binlog the stored offset points to is gone, it cannot resume.

Fix:

  • You need to tell the connector to take a fresh snapshot.
  • Use snapshot.mode=initial.
  • If you know how long your connector was down, you can issue snapshot.select.statement.overrides on heavy tables so that you do not snapshot all rows. Do this only if your tables have an indexed timestamp column (see Problem IV).
  • Start tailing the offset.storage.topic with a Kafka console consumer to watch the action.
  • Issue the connector update (a hedged example follows this list).
Connector that tracks T1, T2, T3 will restore all of T2 and T3 and only a small override of T1
  • In the offset.storage.topic you’ll see that the connector tries to obtain an initial snapshot and then commits regular updates.
We lost track of offsets since 131 < last offset stored in database = 143. All is restored now.
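Here is a rough sketch of what that update could look like. Every name below is a placeholder assumption (a connector called inventory-connector, a Connect worker on localhost:8083, an offset topic called connect-offsets, and a heavy inventory.orders table with an updated_at timestamp column), so adapt it to your setup.

# Tail the offset topic while the fix is applied (the topic name depends on your
# worker's offset.storage.topic setting; "connect-offsets" is a common default).
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic connect-offsets --from-beginning --property print.key=true

# Update the existing connector in place: force a fresh snapshot and restrict
# the heavy table to recent rows only. Note that PUT replaces the whole config,
# so every property must be re-sent, not just the changed ones.
curl -s -X PUT http://localhost:8083/connectors/inventory-connector/config \
  -H 'Content-Type: application/json' \
  -d '{
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-prod",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "******",
    "database.server.id": "184055",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "dbhistory.inventory",
    "table.include.list": "inventory.orders,inventory.customers",
    "snapshot.mode": "initial",
    "snapshot.select.statement.overrides": "inventory.orders",
    "snapshot.select.statement.overrides.inventory.orders": "SELECT * FROM inventory.orders WHERE updated_at > NOW() - INTERVAL 2 DAY"
  }'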

Problem II: Database Host Changes

If your database is migrated to a different host, you’ll need to update the connector. However, the binlog files on the new host will naturally be different.

As such, any existing connector that was reading from the old binlog will suddenly not find the file on the new server. It will die a horrible death.

Connector has died because the binlogs are no longer on the server

Fix:

  • Do what the error message says. Update the connector config by pointing it to the new host. Set a new database.server.id as well.
  • Set snapshot.mode=when_needed. Ensure that the replication user exists on the new host and has the necessary permissions.
  • A new snapshot will be created. Use snapshot.select.statement.overrides to avoid reprocessing a lot of events.
  • Tail the status.storage.topic (see the sketch after this list). From being UNASSIGNED, the connector should move to RUNNING.
Our connector has now moved to RUNNING state.
  • If you tail the offset.storage.topic, you’ll see that the binlog recorded against this connector has switched to the new host’s file.
Notice the host file change? Events should start streaming post snapshot.
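Two hedged ways to watch that transition, again with placeholder names (inventory-connector, a local worker, default internal topic names):

# Ask the Connect REST API for the connector and task state.
curl -s http://localhost:8083/connectors/inventory-connector/status

# Or tail the status topic directly (the name comes from status.storage.topic;
# "connect-status" is a common default) and watch for "state":"RUNNING".
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic connect-status --from-beginning --property print.key=true

If the connector stays FAILED, the status payload includes the task’s stack trace, which is usually quicker to read than digging through the worker logs.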

Problem III: Lost Events

This is less of a Debezium problem and more of an application problem. Although Debezium is battle-tested and rarely misses an event, events can sometimes be skipped for a few reasons (a code bug in your applications, edge cases in Debezium itself).

Depending on the criticality, a lost event could be catastrophic and unacceptable. However, if your consumers are idempotent, we can fix this.

A happy state for a Debezium connector. Metrics are a must.

Fix

  • If you lose events, check whether the binlog holding those offsets has rolled over as well. In such cases, you usually need to know the time at which you started losing events, or you risk taking a bigger, less granular snapshot.
Checking the offset for the oldest binlog.
  • If NO, then congrats, we’re in a good state, as we can replay events: pause the relevant Debezium connector, reset its offsets and restart it. If YES, we are back to Problem I, which we already solved.
  • Stop the connector. Note its configuration so that you can recreate the exact same connector later.
  • Manually reset the Kafka offset. Do this only if you know what you are doing. The command (see the sketch after this list) writes a NULL message for the given key, which Kafka Connect logically translates to removing the stored offsets for the given connector.
We’re telling our connector to go back to a previous state
  • That’s it. The offsets have been reset. To avoid risking a full snapshot, recreate the connector with a snapshot.select.statement.overrides whose timestamp is just before the time you started losing events.
  • Once the connector is recreated, watch the offset.storage.topic for the new offset written by Debezium.
  • All events, including the lost ones, will be replayed. Your consumers should reprocess them now.
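Here is one way to do the reset, sketched with kcat (kafkacat). The connector name, topic name and partition below are assumptions; inspect your own offset topic first.

# Inspect the offset topic to find the exact key (and partition) used by the
# connector. Keys typically look like ["inventory-connector",{"server":"dbserver1"}].
kcat -C -b localhost:9092 -t connect-offsets -f 'partition=%p key=%k value=%s\n'

# With the connector stopped, publish a tombstone (NULL value) for that key.
# -K '|' marks the key separator and -Z turns the empty value into a NULL;
# -p must be the partition that held the old offset record (seen above).
echo '["inventory-connector",{"server":"dbserver1"}]|' | \
  kcat -P -b localhost:9092 -t connect-offsets -K '|' -Z -p 0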

Problem IV: No Index on Override Table

All along we’ve been talking about restricting snapshots using snapshot.select.statement.overrides, but why?

If you do not, the default behaviour is a SELECT *. This can be too costly for huge tables, which is when you’d like to restrict the number of rows in the initial snapshot with a WHERE clause, most likely on a timestamp column.

However, it’s important to note how Debezium performs an initial snapshot when larger tables are involved: it acquires a read lock and blocks writes.

A non-optimized query would stall on a bigger dataset

The lock is unavoidable; after all, Debezium has to ensure a consistent state. But what does it mean for writes?

It means that your APIs and scheduled jobs that write to those tables will fail for the duration of the snapshot. This, again, may be unacceptable.

Fix

  • It is therefore essential to keep the snapshotting period short, or to snapshot at favourable times of the day (lower load).
  • If the WHERE clause in the override doesn’t hit an index, it is in fact worse, as it will lead to full scans. Ensure an index exists on the snapshot override column, then profile the query and make sure it is fast enough (see the sketch after this list).
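For example, assuming a hypothetical inventory.orders table whose override filters on an updated_at column:

mysql> CREATE INDEX idx_orders_updated_at ON inventory.orders (updated_at);
mysql> EXPLAIN SELECT * FROM inventory.orders WHERE updated_at > NOW() - INTERVAL 2 DAY;

The EXPLAIN plan should show the new index being used (key: idx_orders_updated_at) rather than a full table scan (type: ALL).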

If, by mistake, an initial snapshot starts on a huge table (> 1M rows) and stalls writes, quickly perform the following.

  • Delete the connector.
  • Deleting the connector won’t kill the running database query; kill the query yourself:
mysql> SHOW PROCESSLIST;
mysql> KILL <connection-doing-snapshotting>;

Problem V: Encountered Change Event Whose Schema Isn’t Known

This is a pretty unfortunate error to have. It happens when there’s a mismatch between the internal meta model and the database schema history. Most likely, a DDL didn’t make it to the database history topic, and now any new event for that table will fail.

A failing connector

When Debezium tries to restart the task, it reads the latest offset from the offset.storage.topic, keyed by the connector name.

It then tries to compare the schema of later (not latest) events with its internal model (which isn’t updated).

Whether this blows up depends on the inconsistent.schema.handling.mode property (the default behaviour is fail; the others are warn and skip).
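For reference, it is an ordinary connector property; a hedged fragment of the config (everything else omitted):

"inconsistent.schema.handling.mode": "warn"

warn and skip keep the connector alive, but silently dropping events puts you straight back into Problem III territory, so use them deliberately.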

Fix

  • On checking the relevant Debezium source code, it was clear that this was a dead end. A simple fix is to recreate the connector with a different name, because the offset topic messages are keyed by the connector name. This would trigger a new snapshot.
  • However, renaming the connector is not always feasible. In such cases, one can directly publish a tombstone message that tells Debezium no offset has been recorded so far and a snapshot needs to be performed (the same reset shown in Problem III).
We tell Debezium that we don’t have any offsets. Connector must be in stopped state.
  • The connector can now be recreated with the same name.

You might ask,
Why would a DDL not make it to the database history topic? After all, Debezium is streaming all events, right?

It’s a valid question. To answer it, think of what happens when the retention of the history topic is set to a fixed number of days or bytes and either limit is exceeded: older segments, DDL events included, simply get deleted.

Debezium needs to maintain all history under one topic.
Schema events cannot roll over.
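One way to check, and if needed fix, the retention on that topic; the topic name dbhistory.inventory is a placeholder for whatever database.history.kafka.topic points to:

# Inspect the current retention settings on the history topic.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name dbhistory.inventory --describe

# Remove the time and size limits so DDL events can never be deleted.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name dbhistory.inventory \
  --alter --add-config retention.ms=-1,retention.bytes=-1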

That’s a wrap for this article. I hope it helps readers tackle these Debezium issues in their production setups.
