One of most important questions for any modern database administrator is, “How do different databases scale?” There are different approaches to database scaling, and not all of them are applicable in all contexts. An increasingly popular approach to database scaling is sharding, but what is database sharding, and how does it help databases scale?
What Is Sharding?
To understand database sharding, you must first understand the how and why of database scaling, especially in the cloud. Not all databases are equal. The brave new worlds of public cloud computing and containerization rely on your ability to grow your applications on demand. This means both databases and front-end processing applications like Apache must be able to scale up and down, which can be more than a bit complicated for databases.
Old-School Scaling
Traditionally, database scaling was accomplished through clustering. A typical cluster consists of multiple database servers, each with a complete copy of the database. Database requests are load balanced across the cluster, so no one server has to deal with the full impact of a workload’s database requirements.
Clustering, however, has its limitations. In the “every node in the cluster has a complete copy of the database” approach, only database reads can be effectively load-balanced. All nodes must (eventually) write all changes to disk, and the cluster itself ultimately cannot grow beyond the ability of an individual node to absorb writes.
Several workarounds exist. Some databases use an “eventually consistent” model: an incoming write is absorbed only by a subset of nodes, with the other nodes committing those writes when they have time.
There are several approaches to building eventually consistent database clusters.
In some cases, there are ingest nodes specifically designed to absorb high-volume performance peaks. These nodes don’t typically serve reads to workloads, instead only reading in from their databases when the read-only nodes are ready to catch up on the writes they’re missing.
Some databases offer the ability to absorb writes into RAM, typically with multiple nodes in the cluster receiving the writes simultaneously to protect against power failures. This is most common where nodes have Non-Volatile DIMMs (NVDIMMs), which can protect clusters using in-memory databases against data loss if there’s a power failure. This approach is most common when database writes are brief, but intense, as servers have limited RAM, and can only absorb so many writes before the entire database is reduced to the speed at which writes can be committed to SSD.
The traditional clustering approach of having a complete database copy for each node in the cluster is challenging even for bare metal servers with the latest, greatest technology. It’s prohibitively restrictive when talking about virtualized or public cloud instances and doesn’t fit well with the small footprint approach of containers.
Horizontal Scaling vs. Vertical Scaling
Sharding takes a different approach to spreading the load among database instances. Sharding literally breaks a database into little pieces, with each instance only responsible for part of the database. As with clustering, there are multiple approaches to sharding, not all of which are called sharding by database administrators.
There are two primary ways to break up a database: vertically and horizontally.
Vertical distribution of a database can be done by developers or database administrators without requiring any special support from the database application itself. Breaking up a database vertically usually means something along the lines of putting a table in a separate database or dedicating a node or cluster to a specific table.
Horizontal distribution—what almost everyone means when they talk about database sharding—requires the support of the underlying database application. Fortunately, this support is now common. Horizontal sharding is storing each row in each table independently, so they can be spread out evenly across nodes in a cluster.
There are also two primary approaches to database sharding: dedicated name nodes and distributed shard index. The shard index serves a purpose similar to the Master File Table (MFT) of a server’s file system, and how the shard index is handled plays a significant role in the performance and scalability of a sharded database.
The dedicated name node approach has one or more “name nodes,” which maintain the shard index. Workloads communicate with the shard index, and the shard index either redirects the request to the appropriate data node, or acts as a proxy for the data nodes, getting or putting data to the relevant nodes as required.
How Sharding Works
The distributed shard index approach usually requires each node to keep a copy of the node index. (Variations on this are possible, but I won’t explore those in this blog.) Here, workloads can interact with their nearest database shard directly; however, the shard containing the specific data they require may not be anywhere near the workload making the request.
Databases using a name-node approach can typically scale the number of name nodes as needed to meet performance or geographic distribution requirements. They may even separate the roles of “possessor of the shard index” from “data node proxy,” allowing both roles to be scaled independently.
Databases with distributed shard indexes tend to be better at broad geographic distribution. Every node has a copy of the shard index, so workloads can find the data they need as quickly as possible. The flip side of this is the larger the database, the larger the shard index, and thus the larger each individual database index must be.
Database sharding is an area of IT undergoing significant innovation. This is great for administrators because new capabilities in database management software are emerging all the time. On the other hand, with so many competitors in this space, it’s inevitable for vendors to adopt unique nomenclature as part of their differentiation strategies. This can make direct comparison of capabilities, technologies, and approaches somewhat difficult.
It’s important for database administrators to consider when thinking about using database sharding to meet their scaling requirements that not all approaches to database sharding are equal, but not all workload requirements are the same, either. Applications using sharding to cope with wide geographic distribution will have vastly different requirements than applications employing sharding to compensate for the fact that no individual server can meet the crushing performance demands of the application, even though the whole thing lives in a single data center.