System Design: Building Blocks - Database
I would say the database is the most important component in a system. If you design a system without giving the database enough thought, the design is unlikely to succeed.
A proper database is essential for every business or organization.
It is impossible to cover everything about databases in one article, so this one focuses on a few core topics.
What’s going to be covered
- Types of database
- Replication
- Partition/Sharding
What is a Database
A database is an organized collection of data that can be managed and accessed easily.
Types of Database
Based on the intended use case, the type of information they hold, and the storage method, we can divide databases into the following two types.
Relational Database (SQL)
The data stored in a relational database follows a predefined structure (a schema).
Each instance is stored as a row, and the attributes of each instance are stored as columns.
When we talk about relational databases, the first thing that should come to mind is ACID: atomicity, consistency, isolation and durability.
With ACID guarantees, we can rely on the committed data in the database being in a valid state.
Atomicity
A transaction is treated as a single atomic operation: it either succeeds completely or fails completely, and there is no grey area in between.
If any statement within the transaction fails, the whole transaction is rolled back.
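The rollback behaviour can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `make_db`, `transfer` and `balances` helpers, and the sample balances are assumptions made up for this example, not part of any particular system.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
    conn.execute("INSERT INTO accounts VALUES ('bob', 50)")
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        # `with conn` commits if the block succeeds and rolls back on error.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This statement violates the CHECK constraint on overdraft.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


db = make_db()
print(transfer(db, "alice", "bob", 500))  # False: the debit fails
print(balances(db))  # bob's credit was rolled back as well
```

In the failing transfer the credit to `bob` executes first and succeeds on its own, yet it is undone because the debit from `alice` fails within the same transaction.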
Consistency
A transaction moves the database from one valid state to another, so every defined rule, such as a constraint, still holds after it commits. As a result, multiple users reading the same committed record get the same result.
Isolation
When multiple transactions run at the same time, they should not affect each other; the final result should be the same as if the transactions had been executed sequentially.
Durability
Once a transaction has succeeded (committed), its changes are persisted, even if the system crashes afterwards.
Non-relational Database (NoSQL)
These databases are used in applications that handle a large volume of semi-structured and unstructured data.
Because we do not need to maintain a fixed schema for the stored data, the advantages of NoSQL databases include:
simple design
There is less code to write, debug and maintain.
scalability
NoSQL databases make it easier to scale out, since the data related to one entity (for example, a specific employee) is stored together in one document.
availability
NoSQL databases support data replication to ensure high availability.
Comparison
| | pros | cons |
|---|---|---|
| Relational Database | concurrency, integration | impedance mismatch |
| Non-relational Database | serialize/deserialize data, data volume is big | |
Data Replication
Replication refers to keeping multiple copies of the same data on different physical nodes, to improve availability and performance.
The main challenge of replication is keeping the copies in sync as the data changes over time.
Replication Model
In practice, two replication models are commonly adopted.
primary-secondary replication
In this model, only one node, referred to as the primary node, accepts write operations. It then syncs the changes to the other nodes, which are the secondary nodes.
This model is appropriate when the application is read-heavy, and the number of secondary nodes should not be too large, because the primary has to propagate every change to each of them.
One problem with this model is consistency.
If we adopt synchronous replication, write operations suffer, because the primary has to wait for the secondaries to confirm each change. If we adopt asynchronous replication, a secondary may serve stale data, so the consistency of the system becomes a concern.
Another, more severe, problem is the single point of failure: if the primary node fails, the system can only serve read requests and cannot accept any write operations until a new primary takes over.
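The trade-off between synchronous and asynchronous replication can be sketched with a small in-memory model. The `Primary` and `Secondary` classes and the `flush` method are hypothetical names for this illustration; a real database ships changes over the network, typically from a replication log.

```python
class Secondary:
    """A read-only copy that applies changes sent by the primary."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Primary:
    """The only node that accepts writes."""

    def __init__(self, secondaries, synchronous):
        self.data = {}
        self.secondaries = secondaries
        self.synchronous = synchronous
        self.pending = []  # changes not yet shipped (asynchronous mode)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # The write returns only after every secondary has applied it.
            for secondary in self.secondaries:
                secondary.apply(key, value)
        else:
            # The write returns at once; secondaries catch up later.
            self.pending.append((key, value))

    def flush(self):
        """Ship all pending changes (stands in for background replication)."""
        for key, value in self.pending:
            for secondary in self.secondaries:
                secondary.apply(key, value)
        self.pending.clear()


replica = Secondary()
primary = Primary([replica], synchronous=False)
primary.write("user:1", "alice")
print(replica.read("user:1"))  # None: the replica is still stale
primary.flush()
print(replica.read("user:1"))  # alice
```

With `synchronous=True` the stale read disappears, at the cost of every write waiting for all secondaries.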
leaderless replication
Leaderless replication addresses the single-point-of-failure problem of the primary-secondary model.
In this model, all nodes can accept write operations, so it inevitably leads to concurrent writes. An effective way to handle this is to use quorums.
So what is a quorum?
According to Wikipedia, a quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group.
Let's take an example: assume we have 3 nodes. Each write operation must be performed by more than half of the nodes, which in our case is 2. Then, if each read request is also served by more than half of the nodes, at least one of the nodes that answers holds the latest write, because any two majorities overlap, so the latest data can be returned.
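The 3-node example can be sketched as follows. The `Replica` class, the explicit version numbers, and the `reachable` parameter (the indices of the nodes that respond) are assumptions made for illustration; real systems derive versions from timestamps or vector clocks and also handle conflicts between concurrent writers.

```python
class Replica:
    """Stores the newest (version, value) pair it has seen for each key."""

    def __init__(self):
        self.store = {}

    def put(self, key, version, value):
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def get(self, key):
        return self.store.get(key)


def majority(replicas):
    return len(replicas) // 2 + 1


def quorum_write(replicas, key, version, value, reachable):
    """Succeed only if more than half of the replicas can take the write."""
    if len(reachable) < majority(replicas):
        return False
    for index in reachable:
        replicas[index].put(key, version, value)
    return True


def quorum_read(replicas, key, reachable):
    """Ask a majority and return the value with the highest version."""
    if len(reachable) < majority(replicas):
        raise RuntimeError("not enough replicas for a quorum read")
    answers = [replicas[index].get(key) for index in reachable]
    answers = [answer for answer in answers if answer is not None]
    if not answers:
        return None
    return max(answers, key=lambda answer: answer[0])[1]


nodes = [Replica(), Replica(), Replica()]
quorum_write(nodes, "cart", 1, "apple", reachable=[0, 1])
quorum_write(nodes, "cart", 2, "apple, pear", reachable=[1, 2])
# Node 0 is stale, but any majority includes a node holding version 2.
print(quorum_read(nodes, "cart", reachable=[0, 2]))  # apple, pear
```

More generally, with N nodes, W write acknowledgements and R read responses, the overlap is guaranteed whenever W + R > N; here N = 3 and W = R = 2.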
Data Partition / Sharding
When the data volume cannot fit on a single node, we need to partition the data across multiple nodes, while still providing single-node-like properties.
There are usually two ways to partition: vertical partitioning and horizontal partitioning.
vertical partition
In vertical partitioning, we divide one table into multiple tables, where each sub-table contains a subset of the columns of the original table.
A typical use case is to separate frequently accessed columns from those that are accessed less often, such as BLOB columns or very wide text fields.
The partitioning also needs care: columns that are strongly related should stay in one table, while columns with weaker connections can go into separate tables.
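As a rough sketch of vertical partitioning, the example below splits a hypothetical user table into a narrow, frequently read table and a wide, rarely read one that share the same primary key. The table names, column names and sample row are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hot columns: small and read on almost every request.
    CREATE TABLE user_core (
        id INTEGER PRIMARY KEY,
        name TEXT,
        email TEXT
    );
    -- Cold columns: large and rarely needed.
    CREATE TABLE user_profile (
        id INTEGER PRIMARY KEY REFERENCES user_core(id),
        avatar BLOB,
        bio TEXT
    );
    """
)
conn.execute("INSERT INTO user_core VALUES (1, 'alice', 'alice@example.com')")
conn.execute(
    "INSERT INTO user_profile VALUES (1, ?, ?)",
    (b"\x89PNG", "a very long biography"),
)
conn.commit()

# The common query touches only the narrow table.
hot = conn.execute(
    "SELECT name, email FROM user_core WHERE id = 1"
).fetchone()

# The full record is rebuilt with a join on the shared key when needed.
full = conn.execute(
    "SELECT c.name, p.bio FROM user_core c "
    "JOIN user_profile p ON p.id = c.id WHERE c.id = 1"
).fetchone()

print(hot)
print(full)
```

Both tables live in one database here to keep the sketch runnable; in a partitioned deployment they could be placed on different nodes, in which case the join would happen in the application.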
horizontal partition
In horizontal partitioning, we divide the table row-wise.
There are two common ways to perform row-wise partitioning: range-based or hash-based.
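The two row-wise strategies can be sketched as two small functions that map a key to a partition number. The boundaries in `RANGE_BOUNDS` and the choice of MD5 as the hash are assumptions for this example; any stable hash function would do.

```python
import bisect
import hashlib

# Hypothetical boundaries: ids below 1000 go to partition 0,
# ids from 1000 to 1999 to partition 1, everything else to partition 2.
RANGE_BOUNDS = [1000, 2000]


def range_partition(user_id):
    """Pick a partition by finding which key range the id falls into."""
    return bisect.bisect_right(RANGE_BOUNDS, user_id)


def hash_partition(key, num_partitions):
    """Pick a partition from a stable hash of the key."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


print(range_partition(999))   # 0
print(range_partition(1000))  # 1
print(range_partition(2500))  # 2
print(hash_partition("user-42", 4))
```

Range partitioning keeps neighbouring keys together, which suits range scans but can create hot partitions. Hashing spreads keys more evenly, but with a plain `hash % N` scheme most keys change partition whenever N changes.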
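There is an interesting hashing method called consistent hashing.
In consistent hashing, we assign each server and each item in a distributed hash table a place on an abstract circle, called a ring. An item belongs to the first server found moving clockwise from its position, so when a server is added or removed only the items next to it on the ring have to move. This lets servers and objects scale without compromising the system's overall performance.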
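A minimal sketch of a consistent hash ring is shown below. The `HashRing` class, the use of MD5, and the `vnodes` parameter (several virtual positions per server, a common refinement for smoother balance) are assumptions for this illustration rather than a reference implementation.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` virtual positions for the node on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key):
        """Return the first node at or clockwise after the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        # A 1-tuple sorts before any 2-tuple with the same first element,
        # so this finds the first entry whose position is >= the key's.
        index = bisect.bisect_left(self._ring, (ring_position(key),))
        if index == len(self._ring):
            index = 0  # wrap around the circle
        return self._ring[index][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {key: ring.lookup(key) for key in keys}
ring.add("node-d")
moved = [key for key in keys if ring.lookup(key) != before[key]]
print(len(moved), "of", len(keys), "keys moved")
```

Every key that moves is taken over by the new node; keys never shuffle between the existing nodes, which is the property that makes scaling cheap compared with `hash % N`.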
request routing
How does a client know which node to connect to when making a request?
This problem is known as service discovery.
The main challenge of service discovery is how to let the client, or the routing service, know about updates to the partitioning of the nodes.
ZooKeeper can address this issue: each node connects to ZooKeeper to register its information, and ZooKeeper then notifies the routing tier about any change.
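The notification flow can be sketched with an in-memory stand-in for the coordination service. `PartitionRegistry` and `Router` are hypothetical classes written for this example; they do not use or mirror the real ZooKeeper client API, and the node addresses are made up.

```python
class PartitionRegistry:
    """In-memory stand-in for a coordination service such as ZooKeeper."""

    def __init__(self):
        self._owners = {}  # partition id -> node address
        self._watchers = []

    def watch(self, callback):
        """Subscribe to changes; the current state is delivered right away."""
        self._watchers.append(callback)
        callback(dict(self._owners))

    def register(self, partition, node):
        """Called by a node when it takes ownership of a partition."""
        self._owners[partition] = node
        for callback in self._watchers:
            callback(dict(self._owners))


class Router:
    """Routing tier that keeps a local copy of the partition map."""

    def __init__(self, registry, num_partitions):
        self.num_partitions = num_partitions
        self.table = {}
        registry.watch(self._on_change)

    def _on_change(self, owners):
        self.table = owners

    def route(self, key_id):
        """Return the address of the node that owns this key's partition."""
        return self.table[key_id % self.num_partitions]


registry = PartitionRegistry()
registry.register(0, "node-a:5432")
registry.register(1, "node-b:5432")
router = Router(registry, num_partitions=2)
print(router.route(7))  # node-b:5432
registry.register(1, "node-c:5432")  # partition 1 moves to another node
print(router.route(7))  # node-c:5432
```

The router never polls the nodes: it only reacts to notifications, so clients keep a single stable entry point while partitions move behind it.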
Evaluation
With partitioning and replication, what we bring to the overall system is:
reliability
By replicating the data across different sites, we increase the availability of the system, because more than one site can serve a request.
performance
By partitioning the data across different nodes, we reduce the load on each node, so overall performance is better.