In the last blog post I described the differences between the terms “Parallel Database” and “Sharding”. In this post I’d like to illustrate some of the complexity that sharding introduces to the application.
The parallel database architecture has several benefits over sharding. With a parallel database, application developers can focus on the application and let the database do the things a database should do. With sharding, the application developer must devote considerable time and energy to developing, maintaining, and constantly tweaking the sharding code. Even with a “sharding library” or a product that “automates” sharding, managing and maintaining that layer turns out to be a significant burden. Application developers must still be cognizant of the sharding strategy and recognize that queries may not produce the results they expect because of data locality issues.
Below is a simple illustration of running a query against a parallel database and against a sharded system. The illustration on the left is from a five-node ParElastic system running on MySQL databases. The illustration on the right is five MySQL databases behind a sharding application layer.
Behind the scenes, the parallel database counts the rows in the various “slices” of the node table and returns a single number to the application. On the right-hand side is the same query from the perspective of a sharding application, which must now add the five per-shard counts itself and determine that there are 15,692 rows in total.
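To make the scatter-gather work concrete, here is a minimal sketch of what the sharding layer has to do for even a simple COUNT(*). It uses in-memory SQLite databases as stand-ins for the five MySQL shards; the table name `node` and the per-shard row counts are illustrative assumptions (chosen so they sum to the 15,692 from the example), not details from an actual deployment.

```python
# Sketch of the scatter-gather aggregation a sharding application
# must implement itself. SQLite in-memory databases stand in for
# the five MySQL shards; the per-shard row counts are made up for
# illustration and sum to 15,692.
import sqlite3

PER_SHARD_ROWS = [3141, 2718, 3987, 2846, 3000]  # assumed slice sizes

# Build five "shards", each holding one slice of the node table.
shards = []
for rows in PER_SHARD_ROWS:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO node (id) VALUES (?)",
                     ((i,) for i in range(rows)))
    shards.append(conn)

def count_nodes(shards):
    """Fan the query out to every shard, then sum the partial counts.
    A parallel database performs this step internally; with sharding,
    the application code has to do it."""
    total = 0
    for conn in shards:
        (partial,) = conn.execute("SELECT COUNT(*) FROM node").fetchone()
        total += partial
    return total

print(count_nodes(shards))  # 15692
```

A COUNT is the easy case: the merge step is a plain sum. More complex aggregates (averages, DISTINCT counts, ORDER BY with LIMIT) need correspondingly more complex merge logic in the application.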
While it is a pain to have to do this, adding up the numbers to get the right result is at least easy. In the next blog post, I'll present a much more common problem that shows why sharding is a really bad idea.