HDFS HA Solution
What is HDFS
Quoted from its official site:
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
This article will introduce how HDFS achieves fault tolerance.
When we look at the architecture of HDFS, we can see that the single point of failure is the NameNode: if the NameNode is down, the whole cluster is unavailable.
To make the overall HDFS fault tolerant, we need to make the NameNode as “strong” as possible.
HA Solution
To make the NameNode stronger and more fault tolerant, the intuitive way is to prepare a backup node: when the main node goes down, the backup is promoted to main and takes over the service.
So the real problem is how to sync the data between the main NameNode and the backup NameNode, so that the backup can provide exactly the same service as the previous main node.
Intuitively speaking, there are two ways to sync the data between two nodes:
- Synchronous Solution: when the main node generates a new record and stores it in the EditLog, it blocks the operation and sends the same record to the backup node; only after the backup node returns a confirmation does the main node notify the client that the operation is complete.
- Asynchronous Solution: when the main node generates a new record, it also sends the same record to the backup node, but it notifies the client that the operation is complete immediately, without waiting for the backup node's response.
Each solution has its pros and cons.
The synchronous solution guarantees consistency between the backup and the main node, which matters when the main node goes down and the backup node starts providing service; however, it hurts the performance and availability of the system, since every write waits on the backup.
The asynchronous solution has better availability, but it cannot guarantee consistency between the main node and the backup node: it is possible for users to see different data for the same query.
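The two solutions above can be sketched as follows. This is a toy illustration, not HDFS code: plain Python lists stand in for the EditLogs, a queue stands in for the background replication channel, and the function names are made up for this sketch.

```python
import queue

# Toy sketch (not HDFS code) contrasting the two replication modes.
# "main_log" and "backup_log" are in-memory lists standing in for EditLogs.

main_log, backup_log = [], []
replication_queue = queue.Queue()  # used only by the asynchronous mode

def sync_write(record):
    """Synchronous: block until the backup confirms before acking the client."""
    main_log.append(record)
    backup_log.append(record)      # stands in for a blocking RPC to the backup
    return "ack"                   # client is notified only after both stored it

def async_write(record):
    """Asynchronous: ack the client immediately, replicate in the background."""
    main_log.append(record)
    replication_queue.put(record)  # a background shipper would drain this later
    return "ack"                   # the backup may lag behind at this point
```

The inconsistency window of the asynchronous mode is visible right in the sketch: after `async_write` returns, the record sits in the queue but is not yet in `backup_log`.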
To combine the advantages of both approaches, there is a third solution: putting a middleware between the main node and the backup node.
This middleware must be able to store messages quickly, so that the main node can simply send each record to it and notify the client as soon as the middleware confirms the write.
On the other hand, the backup node connects to this middleware as well, so that it can consume the messages the main node stored there.
In HDFS, this middleware is called the JournalNode cluster. It is usually deployed with an odd number of nodes, at least three, so that a majority can always be formed.
When the main node writes a new record, it sends it to the JournalNodes, and each JournalNode saves the record on its local disk. As soon as the main node has received successful-save confirmations from more than half of the JournalNodes, the write is considered committed, and the main node can acknowledge the client and continue its operation.
The backup node periodically polls the JournalNodes and pulls the new records back to apply locally.
Automatic HA Solution
With the solution above, we now have a way to keep the data identical between the main node and the backup node: when the main node goes down, we can manually promote the backup node to main so that the cluster keeps providing service.
However, manual recovery does not scale; automating the failover is the better solution.
To automate the recovery process, we need to introduce two more roles: ZKFC and ZK.
ZKFC: ZKFailoverController
ZK: ZooKeeper, a service which enables highly reliable distributed coordination.
We will deploy a ZKFC on both the main NameNode machine and the backup NameNode machine.
ZKFC is the core of automatic recovery. I will use the metaphor of “hands” to introduce its functionality:
- The first hand holds its own NameNode: ZKFC is responsible for monitoring its local NameNode. Is it dead, or is it still alive?
- The second hand holds the ZK service: ZKFC creates an ephemeral node in the ZK cluster. The ZKFC that acquires the lock on that node is considered to manage the main NameNode; the other one manages the backup NameNode.
- The third hand holds the other NameNode: when the current ZKFC acquires the ZK lock, it usually means its own NameNode was the backup and the main node has gone down. The third hand is used to confirm whether the other NameNode is really dead; if so, it downgrades the other NameNode to backup and promotes its own NameNode to main.
- This design is important: the designers of HDFS consider “split brain” a more severe problem than unavailability. If the third hand cannot downgrade the other NameNode, the whole cluster stays unavailable, because two main nodes at once would put the cluster in a dangerous, inconsistent state.
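The three hands and the fencing rule can be sketched as follows. This is a toy model, not the real ZKFC implementation: a `threading.Lock` stands in for the ephemeral ZK node, fencing is modeled as simply flipping the peer's state, and all class names are invented for this sketch.

```python
import threading

# Toy sketch of lock-based election with fencing. A threading.Lock stands
# in for the ephemeral znode in ZooKeeper.

class NameNode:
    def __init__(self, name):
        self.name = name
        self.state = "standby"
        self.alive = True

class ZKFC:
    def __init__(self, local_nn, other_nn, zk_lock):
        self.local_nn = local_nn   # first hand: the NameNode it monitors
        self.other_nn = other_nn   # third hand: the peer it may fence
        self.zk_lock = zk_lock     # second hand: stands in for the ZK znode

    def try_become_active(self):
        if not self.zk_lock.acquire(blocking=False):
            return False           # someone else holds the lock: stay standby
        if self.other_nn.alive and self.other_nn.state == "active":
            # Cannot fence a live active peer: refuse to promote, because two
            # active NameNodes (split brain) is worse than staying unavailable.
            self.zk_lock.release()
            return False
        self.other_nn.state = "standby"   # fence the dead old active node
        self.local_nn.state = "active"
        return True
```

The second branch encodes the design choice from the last bullet: when fencing fails, the sketch prefers returning `False` (unavailability) over promoting a second active node.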