Sunday, August 25, 2013

Database Sharding

Sharding is a type of database partitioning that separates very large databases the into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

There are numerous advantages to this partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database performance can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g., European customers v. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard

Disadvantages

Sharding requires developers to think about things like rollbacks, constraints, and referential integrity across tables within their applications when these types of concerns are best handled by the database. It also makes other common operations such as joins, searches, and memory management very difficult. In short, the very solution implemented to overcome throughput issues becomes a bottleneck in and of itself.

Examples

MongoDB supports sharding from version 1.6

Solr enterprise search server provides sharding capabilities

Microsoft supports sharding in SQL Azure through "federations"