6 Key Kubernetes v1.36 Updates for Controller Health and Observability

If you've ever debugged a Kubernetes controller that made unexpected decisions, you've likely encountered the subtle but dangerous problem of staleness. Outdated cache data can cause controllers to take incorrect actions—or miss them entirely—often without any clear warning. With the release of Kubernetes v1.36, long-awaited mitigations have arrived. This article breaks down six critical things you need to know about staleness and the new features designed to make controllers more reliable and observable. Whether you're a platform engineer or a controller developer, these changes will help you build more robust systems.

1. What Is Staleness in Controllers?

Controllers in Kubernetes rely on a local cache, a snapshot of cluster state built from an initial list and the watch events that the API server sends afterwards. This cache allows controllers to make fast decisions without hammering the API. But the cache can become stale, meaning it no longer reflects the true state of the cluster. Staleness occurs whenever the cache lags behind the API server, is updated in a non-atomic way, or receives events out of order. For example, if a controller restarts, it must rebuild its cache from scratch; until that rebuild completes, its view is incomplete. Similarly, if the API server is temporarily unreachable, the cache freezes at the last event it received. Understanding this fundamental problem is the first step in appreciating the v1.36 improvements.
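
To make the idea concrete, here is a minimal sketch of a cache that is fed by watch events and then stops receiving them. It is a toy model, not client-go code: the Event and Cache types are invented for this illustration, and resource versions are simplified to integers (real ones are strings).

```go
package main

import "fmt"

// Event is a simplified stand-in for a watch event. Real resource versions
// are strings; an int is used here only to keep the sketch short.
type Event struct {
	Key             string
	ResourceVersion int
	Deleted         bool
}

// Cache is a toy local cache: object key -> last seen resource version.
type Cache struct {
	items  map[string]int
	lastRV int
}

func NewCache() *Cache { return &Cache{items: map[string]int{}} }

// Apply folds one event into the cache.
func (c *Cache) Apply(e Event) {
	if e.Deleted {
		delete(c.items, e.Key)
	} else {
		c.items[e.Key] = e.ResourceVersion
	}
	if e.ResourceVersion > c.lastRV {
		c.lastRV = e.ResourceVersion
	}
}

// Has reports whether the cache believes the object exists.
func (c *Cache) Has(key string) bool {
	_, ok := c.items[key]
	return ok
}

// Lag reports how many revisions the cache is behind the server.
func (c *Cache) Lag(serverRV int) int { return serverRV - c.lastRV }

func main() {
	cache := NewCache()
	cache.Apply(Event{Key: "pod-a", ResourceVersion: 1})
	cache.Apply(Event{Key: "pod-b", ResourceVersion: 2})

	// The watch is interrupted here. On the server, pod-a is deleted at
	// revision 3, but that event never reaches the cache.
	serverRV := 3

	fmt.Println("cache still has pod-a:", cache.Has("pod-a"))
	fmt.Println("revisions behind:", cache.Lag(serverRV))
}
```

The cache keeps answering "pod-a exists" because nothing has told it otherwise, which is exactly the silent failure described above.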

2. How Staleness Affects Controller Behavior

A stale cache can cause three major types of failure: incorrect actions, missed actions, and delayed actions. For instance, a controller might try to delete a resource that was already deleted, fail to create a needed resource because it thinks one already exists, or react late because the triggering change has not reached its cache yet. These bugs are notoriously hard to catch because they only appear under specific conditions, often in production. The root cause usually ties back to assumptions the controller author made about event ordering and cache freshness. Prior to v1.36, the only defense was careful code; the release adds built-in tooling to detect and prevent these issues.
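
A small sketch shows how the same reconcile logic produces an incorrect action or a missed action depending on what the cache shows. The podsToCreate function and the pod names are hypothetical and exist only for this illustration; no real controller is this simple.

```go
package main

import "fmt"

// podsToCreate returns how many pods a toy controller would create, given
// the desired count and the pods it believes exist.
func podsToCreate(desired int, observed []string) int {
	if missing := desired - len(observed); missing > 0 {
		return missing
	}
	return 0
}

func main() {
	const desired = 3

	// Incorrect action: the cluster already has three pods, but the cache
	// has not seen web-2 yet, so the controller creates an extra one.
	staleMissingPod := []string{"web-0", "web-1"}
	fmt.Println("extra pods created:", podsToCreate(desired, staleMissingPod))

	// Missed action: web-2 was deleted in the cluster, but the cache still
	// lists it, so the controller does nothing.
	staleGhostPod := []string{"web-0", "web-1", "web-2"}
	fmt.Println("pods created while one is missing:", podsToCreate(desired, staleGhostPod))
}
```

The logic is correct in both cases; only the input is wrong, which is why these bugs rarely show up in unit tests that feed the controller a fresh view.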

3. Common Causes of Stale Caches

Controller restarts: The entire cache must be rebuilt, leaving a window of outdated information.
API server downtime: No watch events arrive, so the cache becomes a frozen snapshot.
Out-of-order event delivery: Network delays or buffering can cause events to arrive in a sequence that doesn’t reflect reality.
Bulk operations: When an informer performs its initial list, the objects are enqueued one at a time, so the queue can pass through inconsistent intermediate states.

These scenarios are surprisingly common and often silent. With v1.36, you gain atomic FIFO processing to tackle the bulk operation and out-of-order issues head-on.

4. Kubernetes v1.36: Overview of Improvements

The v1.36 release brings two layers of improvement: changes in client-go (the Go client library) and updated implementations in kube-controller-manager for highly contended controllers. The centerpiece is a new atomic FIFO mode (feature gate AtomicFIFO) that sits atop the existing FIFO queue. It ensures that batches of events, such as the initial list from an informer, are processed together as one consistent unit. This prevents the queue from ever reflecting a partial, inconsistent state. The kube-controller-manager now uses this for controllers that are most sensitive to staleness, such as those managing endpoints or deployments.

5. Atomic FIFO Processing: How It Works

Previously, each event was added to the informer’s FIFO queue (the queue that feeds the local cache) in the order it was received. If the API server returned a list of 100 objects, those objects were enqueued one by one. If a newer event was interleaved with that batch (e.g., an update to an object whose list entry had not been processed yet), the queue could hold a version of the object that doesn’t match the cache. Atomic FIFO solves this by grouping the events from a batch into a single atomic operation. The queue becomes a consistent snapshot: either it contains all events from the batch or none, so the cache moves from one consistent state to the next without exposing a half-applied batch. Clients can also introspect the cache to determine the latest resource version it has observed, making debugging easier.
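
The all-or-nothing idea can be sketched with a queue whose unit of work is a whole batch. This is an illustration of the concept only, not the client-go implementation: BatchFIFO, Store, and Delta are hypothetical types, and versions are simplified to integers.

```go
package main

import (
	"fmt"
	"sync"
)

// Delta is a simplified stand-in for one object change.
type Delta struct {
	Key     string
	Version int
}

// Store is a toy cache. Readers take the lock, so they only ever see the
// state between whole batches, never the middle of one.
type Store struct {
	mu    sync.RWMutex
	items map[string]int
}

func NewStore() *Store { return &Store{items: map[string]int{}} }

// ApplyBatch applies every delta in the batch while holding the write lock.
func (s *Store) ApplyBatch(batch []Delta) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, d := range batch {
		s.items[d.Key] = d.Version
	}
}

// Len returns the number of objects currently in the store.
func (s *Store) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}

// BatchFIFO queues whole batches; a single event is a batch of one.
type BatchFIFO struct {
	mu      sync.Mutex
	batches [][]Delta
}

// AddBatch enqueues the batch as one unit. The slice is copied so the
// caller cannot mutate it after enqueueing.
func (q *BatchFIFO) AddBatch(batch []Delta) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.batches = append(q.batches, append([]Delta(nil), batch...))
}

// Pop returns the oldest batch, or false if the queue is empty.
func (q *BatchFIFO) Pop() ([]Delta, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.batches) == 0 {
		return nil, false
	}
	b := q.batches[0]
	q.batches = q.batches[1:]
	return b, true
}

func main() {
	q := &BatchFIFO{}
	store := NewStore()

	// The initial list arrives as one batch; a later watch event follows.
	q.AddBatch([]Delta{
		{Key: "pod-a", Version: 10},
		{Key: "pod-b", Version: 10},
		{Key: "pod-c", Version: 10},
	})
	q.AddBatch([]Delta{{Key: "pod-a", Version: 11}})

	for {
		batch, ok := q.Pop()
		if !ok {
			break
		}
		store.ApplyBatch(batch)
		fmt.Printf("applied %d delta(s), store holds %d object(s)\n", len(batch), store.Len())
	}
}
```

Because the consumer pops and applies a batch as one step, a reader of the store sees either none of the initial list or all of it. An item-by-item queue would let a reader observe one or two of the three pods in between.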

6. How to Adopt the New Features in Your Controllers

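If you write controllers using client-go, enabling atomic FIFO is straightforward: turn on the AtomicFIFO feature gate for the process that runs your informers. client-go feature gates are normally read from environment variables named KUBE_FEATURE_<GateName>, so this typically means setting KUBE_FEATURE_AtomicFIFO=true in the controller’s environment rather than passing an option to the informer factory; check the v1.36 release notes for the gate’s default. For controllers that are part of kube-controller-manager, the update is automatic in v1.36 for the highly contended controllers. To verify, check the controller logs and metrics for staleness indicators. You can also monitor the queue’s consistency by exposing the resource version via the new introspection APIs. The result is a controller that is both safer (fewer incorrect actions) and more transparent, a big win for production reliability.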
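
One way a controller could use an introspected resource version is to hold off on reconciling until its cache has caught up with its own last write. The sketch below assumes a hypothetical versionedCache interface and integer versions; it does not show the actual v1.36 introspection API, whose names and types should be taken from the client-go documentation.

```go
package main

import "fmt"

// versionedCache is a hypothetical interface for this sketch. It is not a
// client-go type; it stands in for whatever the cache exposes to report
// the latest resource version it has observed.
type versionedCache interface {
	LatestResourceVersion() int
}

// fakeCache is a test double that reports a fixed version.
type fakeCache struct{ rv int }

func (f fakeCache) LatestResourceVersion() int { return f.rv }

// safeToReconcile reports whether the cache has observed at least the
// controller's own last write. Integer comparison is a simplification.
func safeToReconcile(c versionedCache, lastWriteRV int) bool {
	return c.LatestResourceVersion() >= lastWriteRV
}

func main() {
	const lastWriteRV = 42 // version returned by the controller's last write

	fmt.Println("cache at 40:", safeToReconcile(fakeCache{rv: 40}, lastWriteRV))
	fmt.Println("cache at 42:", safeToReconcile(fakeCache{rv: 42}, lastWriteRV))
}
```

A controller that skips or requeues work while this check is false avoids acting on a view that predates its own changes, at the cost of a short delay.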

Kubernetes v1.36 marks a significant step forward in controller robustness. By addressing staleness at the core cache level with atomic FIFO, the project gives developers and operators a powerful tool to prevent subtle, hard-to-find bugs. As controllers become more observable and predictable, the entire ecosystem benefits. Whether you’re upgrading an existing cluster or building new controllers, these improvements are worth understanding—and adopting.

Darhost