
2026-05-10 21:20:19

Enhancing Search Reliability in GitHub Enterprise Server: A High Availability Overhaul

GitHub Enterprise Server's search architecture was rebuilt to eliminate the fragility of its cross-node Elasticsearch cluster, using application-level mirroring for high availability and simpler maintenance.

Introduction

Search is a fundamental component of the GitHub experience, powering not only the obvious search bars and filtering on Issues pages but also underlying features like release pages, project boards, and counters for issues and pull requests. Recognizing its critical role, GitHub’s engineering team has dedicated the past year to making search more resilient in GitHub Enterprise Server. The goal: reduce administrative overhead and allow teams to focus on what matters most—their customers.

Enhancing Search Reliability in GitHub Enterprise Server: A High Availability Overhaul
Source: github.blog

Beyond the search bar, search indexes are the hidden engines that enable rapid retrieval of data across the platform. These specialized database tables are optimized for fast queries, but they also require careful management. In previous versions of GitHub Enterprise Server, administrators had to follow strict maintenance and upgrade sequences. Any misstep could lead to corrupted indexes requiring repairs, or locked indexes that caused upgrade failures. This fragility was especially problematic in High Availability (HA) environments.

How High Availability Works in GitHub Enterprise Server

High Availability setups are designed to keep the system running smoothly even when part of it fails. In a typical HA configuration, there is a primary node that handles all write operations and user traffic, and one or more replica nodes that stay synchronized and can take over if the primary goes down. This leader/follower pattern permeates every operation within GitHub Enterprise Server—including search.
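
The leader/follower pattern can be sketched as follows. This is a minimal, hypothetical Python model of the idea (writes land on the primary and are synchronized to replicas; a replica is promoted on failover), not GitHub Enterprise Server code; all class and method names are illustrative.

```python
class Node:
    """Stands in for one GHES node holding its own copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class HACluster:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, key, value):
        # All writes go through the primary, then are synchronized to
        # replicas (simplified here as synchronous replication).
        self.primary.data[key] = value
        for r in self.replicas:
            r.data[key] = value

    def failover(self):
        # Promote a replica to primary when the primary goes down.
        self.primary = self.replicas.pop(0)

cluster = HACluster(Node("primary"), [Node("replica-1")])
cluster.write("repo:1", "indexed")
cluster.failover()
print(cluster.primary.name, cluster.primary.data["repo:1"])
# → replica-1 indexed
```

Because the replica already holds a synchronized copy, the promoted node can serve the same data the primary did, with no rebuild step in the critical path.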

Challenges with the Previous Elasticsearch Integration

GitHub uses Elasticsearch as its search database. Unfortunately, earlier versions of Elasticsearch did not natively support the leader/follower architecture required by HA. To work around this, GitHub’s engineers created an Elasticsearch cluster that spanned both primary and replica nodes. This approach made data replication straightforward and even offered performance benefits—each node could handle search requests locally. However, the downsides eventually outweighed these advantages.

The Problem of Moving Shards

Elasticsearch manages data in units called shards. A primary shard is responsible for receiving and validating write operations. In a clustered setup, Elasticsearch could automatically move a primary shard from the primary node to a replica node. If that replica was later taken down for maintenance, the entire system could enter a locked state. The replica would wait for Elasticsearch to become healthy before starting up, but Elasticsearch couldn’t become healthy until the replica rejoined. This circular dependency created a critical failure point.
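
The circular dependency can be made concrete with a toy model. The two gate functions below are illustrative stand-ins for the real startup and health checks, not GHES or Elasticsearch APIs; they only show why, once the replica is down, neither condition can ever become true.

```python
def cluster_is_healthy(replica_running: bool) -> bool:
    # With a primary shard stranded on the downed replica, cluster
    # health requires that replica to rejoin first.
    return replica_running

def replica_can_start(cluster_healthy: bool) -> bool:
    # The replica's boot sequence waits for a healthy cluster
    # before bringing its services up.
    return cluster_healthy

# Starting from "replica down", each side waits on the other forever:
replica_running = False
cluster_healthy = cluster_is_healthy(replica_running)  # False
can_start = replica_can_start(cluster_healthy)         # False
assert not can_start  # deadlock: the locked state described above
```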

Attempts to Stabilize the Clustered Mode

For several releases, GitHub’s engineers worked to stabilize this fragile configuration. They implemented health checks to ensure Elasticsearch was in a valid state before allowing operations to proceed. They also built processes to correct drifting states that occurred when nodes fell out of sync. Despite these efforts, the underlying architectural tension remained. The team even attempted to build a “search mirroring” system that would decouple the primary and replica Elasticsearch instances, but database replication is notoriously difficult, and consistency requirements made those early attempts unsustainable.
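
A health check in this spirit might gate on Elasticsearch's standard `/_cluster/health` API, whose `status` field (`green`/`yellow`/`red`) and `unassigned_shards` count are documented behavior; the specific gating policy below is an assumption, not GitHub's actual check.

```python
import json
from urllib.request import urlopen

def parse_health(body: bytes) -> bool:
    """Decide whether a /_cluster/health response permits proceeding."""
    health = json.loads(body)
    # Proceed only if the cluster isn't red and no shards are unassigned.
    return (health["status"] in ("green", "yellow")
            and health.get("unassigned_shards", 0) == 0)

def cluster_ready(url="http://localhost:9200/_cluster/health") -> bool:
    # Would query a live Elasticsearch instance (not called here).
    with urlopen(url, timeout=5) as resp:
        return parse_health(resp.read())

# Abridged example of the JSON shape Elasticsearch returns:
ok = parse_health(b'{"status": "green", "unassigned_shards": 0}')
print(ok)  # → True
```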


A New Approach: Search Mirroring

After years of incremental fixes, the engineering team decided to fundamentally rethink the search architecture. The solution they developed is a search mirroring system that eliminates the need for a cross-node Elasticsearch cluster. In this new design, the primary node runs its own Elasticsearch instance, and each replica node runs a separate, independent Elasticsearch instance. Data replication is handled at the application layer rather than the database layer, ensuring that search indexes remain consistent without the risks of shard movement or locking.

Key aspects of the new architecture include:

  • Independent Elasticsearch instances on each node, avoiding cross-node clustering.
  • Application-level replication that synchronizes search data from the primary to replicas safely.
  • Graceful degradation—if a replica fails, the primary continues to serve search queries, and the replica can be rebuilt without affecting the rest of the deployment.
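
The mirroring idea can be sketched as follows: each node owns an independent index, and the application forwards index operations from the primary to each replica instead of relying on cross-node clustering. This is a hypothetical illustration under those assumptions; `SearchIndex` and `SearchMirror` are invented names, not GHES internals.

```python
class SearchIndex:
    """Stands in for one node's independent Elasticsearch instance."""
    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

class SearchMirror:
    def __init__(self, primary_index, replica_indexes):
        self.primary = primary_index
        self.replicas = replica_indexes

    def index(self, doc_id, doc):
        # Write to the primary's local instance first...
        self.primary.index(doc_id, doc)
        # ...then mirror the same operation to each replica's instance.
        for replica in self.replicas:
            try:
                replica.index(doc_id, doc)
            except Exception:
                # A failed replica degrades gracefully: it can be
                # re-synchronized later without blocking the primary.
                pass

primary, replica = SearchIndex(), SearchIndex()
mirror = SearchMirror(primary, [replica])
mirror.index("issue:42", {"title": "HA overhaul"})
print(replica.docs["issue:42"]["title"])  # → HA overhaul
```

Because no shard ever lives outside its own node's instance, there is nothing for Elasticsearch to relocate, and taking a replica offline cannot strand a primary shard.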

This design also simplifies maintenance. Administrators can now perform upgrades or take replicas offline without risking a system-wide lock. The new architecture aligns with GitHub Enterprise Server’s HA pattern, making the entire system more predictable and reliable.

Conclusion

The overhaul of GitHub Enterprise Server’s search architecture represents a significant step forward in reliability. By moving away from a fragile clustered Elasticsearch setup and adopting a mirroring approach, GitHub has eliminated a major source of downtime and administrative complexity. The result is a platform where search continues to work seamlessly, even during maintenance events, allowing teams to focus on their code and collaboration.