What is HDFS?

Quoted from the official site:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

To understand why HDFS has the advantages described above, it's important to understand how it performs its read and write processes.

Before we dig deeper into this topic, three roles in HDFS need to be clarified.

Client: the bridge between the real user and HDFS. The user sends requests to the client, and the client returns the results (a minimal client-side sketch follows this section).

NameNode: the brain of the HDFS cluster. Generally speaking, it's responsible for two things:

  1. Communicating with the client: when the client receives a request from the user, it first queries the NameNode, and the NameNode tells it what to do next.
  2. Gathering information from DataNodes: each DataNode periodically reports the status of the data stored on it.

DataNode: where the data really is.
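To make the client role concrete, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address and directory path are hypothetical; the point is that every call goes through the client object, which talks to the NameNode on our behalf:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // FileSystem is the client-side handle; it hides the NameNode/DataNode traffic.
        try (FileSystem fs = FileSystem.get(conf)) {
            // A metadata-only request like listing is served entirely by the NameNode.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
```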

Reading Process

  1. The client gets a request from the user, say the user runs the command hdfs dfs -get <file_name>.
  2. The client first asks the NameNode which DataNodes host the blocks of the requested file.
  3. Based on its built-in placement strategy, the NameNode returns, for each block of the file, a list of the closest DataNodes holding it.
  4. The client connects to those DataNodes to fetch the blocks of the file along with their checksums.
  5. The client verifies the content by computing a checksum over the received data and comparing it with the checksum returned from the DataNode.
  6. Once all blocks of the file are available on the client, the operation is marked as successful. (A sketch of how these steps map onto Hadoop's Java API follows this list.)
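Here is a minimal sketch of the read path with Hadoop's Java API; the file path is hypothetical. getFileBlockLocations surfaces the per-block DataNode list from steps 2-3, while the open/read call performs steps 4-5, with checksum verification handled transparently by the client library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/data.txt"); // hypothetical path

            // Steps 2-3: ask the NameNode which DataNodes host each block.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("block at offset " + block.getOffset()
                        + " hosted on " + String.join(", ", block.getHosts()));
            }

            // Steps 4-5: stream the blocks from the DataNodes; the client
            // library verifies each block's checksum under the hood.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```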

Writing Process

  1. The client gets a request from the user, say the user runs the command hdfs dfs -put <local_file> <hdfs_path>.
  2. The client first asks the NameNode where to store the file; the NameNode picks the closest DataNode to host it (sometimes the DataNode and the client are on the same machine).
  3. The client opens a connection to that DataNode.
  4. The client fills a buffer with the file data and, whenever the buffer fills up, ships its contents to the DataNode.
  5. If the user asked for multiple replicas, the DataNode connected to the client picks another DataNode according to HDFS's placement strategy and forwards the data to it. Note that the client cannot perceive this replication pipeline.
  6. When all the data has been written to the HDFS cluster, the client notifies the user that the write is complete. (A client-side sketch of these steps follows this list.)
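The write path looks similar from the client side. Below is a minimal sketch; the path and replication factor are illustrative. The create call corresponds to steps 2-3, the stream write to step 4, and the replication factor to step 5; note that the DataNode-to-DataNode pipeline never appears in client code:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/out.txt"); // hypothetical path

            // Steps 2-3: create() contacts the NameNode, which picks the first
            // DataNode; the returned stream is connected to that node.
            short replication = 3; // step 5: extra replicas are handled by the DataNode pipeline
            try (FSDataOutputStream out = fs.create(file, replication)) {
                // Step 4: writes accumulate in a client-side buffer and are
                // shipped to the DataNode as the buffer fills.
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            // Step 6: closing the stream completes the write.
        }
    }
}
```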