Node balancer and disk balancer in Hadoop are important tools that cluster admins use to keep all nodes, and the volumes (disks) within those nodes, in an equilibrium state. Node balancing is different from disk balancing: node balancing ensures equal storage utilization across the datanodes (an inter-node concept), whereas disk balancing ensures that the disks of a single datanode are used proportionately (intra-node). To understand this concept of node and disk balancing in Hadoop, refer to figure A below, which shows the current and ideal state of a cluster.
As depicted in the figure above, all disks in a node are almost equally utilized, and the same is true when we compare the datanodes with each other. But imagine a scenario where you do either of these two actions:
- Delete a huge file from HDFS.
- Add a new volume (disk) to a datanode.
Either action creates an imbalance in HDFS storage utilization within the node: the disks are no longer used proportionately; instead, a few disks are heavily loaded while others are almost empty. Refer to figure B below, which shows what happens when a new volume is added to a node and the problem it creates:
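For context, the second action typically means appending the new mount point to dfs.datanode.data.dir on that datanode. A minimal hdfs-site.xml sketch, with hypothetical paths:

```xml
<!-- hdfs-site.xml on the datanode: /data/05 is the newly added, empty volume,
     while the existing volumes already hold blocks -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/01/dfs/dn,/data/02/dfs/dn,/data/03/dfs/dn,/data/04/dfs/dn,/data/05/dfs/dn</value>
</property>
```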
Hadoop provides two options for placing incoming blocks across a datanode's disks:
- Fair or round-robin allocation, wherein, once the namenode decides to store the next set of blocks on a datanode, all of that datanode's disks get an equal share of the blocks. Do you see any drawback here? Yes: with this scheme, an existing utilization gap between the disks of a datanode will never close.
- Available-space-based allocation, wherein the datanode tries to minimize the intra-node disk utilization gap by directing all new blocks to the disk with the most available space. Do you see a problem here as well? Of course: this creates a hotspot, where all writes for that datanode land on a single disk, which means we have reduced parallelism (for both write and read requests).
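For reference, this choice corresponds to the datanode's volume-choosing policy. A hedged hdfs-site.xml sketch (property and class names as found in stock Hadoop; verify them against hdfs-default.xml for your version):

```xml
<!-- hdfs-site.xml: round-robin is the default policy; switching to the
     available-space policy trades write parallelism for a smaller
     utilization gap between disks -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
```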
Considering the pros and cons of both options, as well as Hadoop's main benefit of parallelism, the recommended approach is to keep round-robin allocation and, at the same time, proactively perform intra-node disk balancing. The solution is simple: periodically check and rebalance the disks within each datanode, using Hadoop's built-in disk balancer, which rebalances a datanode without taking it offline. The steps are:
- Enable disk balancing in your cluster by setting the dfs.disk.balancer.enabled property to true on all datanodes (hdfs-site.xml), as sketched below.
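  A minimal hdfs-site.xml snippet for this step (assuming a Hadoop version that ships the disk balancer, i.e. 3.0+ or a vendor backport):

  ```xml
  <!-- hdfs-site.xml on every datanode -->
  <property>
    <name>dfs.disk.balancer.enabled</name>
    <value>true</value>
  </property>
  ```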
- Check that it is enabled by issuing the hdfs command on the terminal; diskbalancer must appear in the list of subcommands in the output. If it doesn't, make sure you redeploy the latest configuration to all the datanodes.
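  For example (the getconf lookup is an extra check I'm assuming is available in your version; the bare hdfs usage listing is what the step above refers to):

  ```bash
  # The diskbalancer subcommand should show up in the usage output.
  hdfs | grep diskbalancer

  # Optionally, read the effective value of the property directly.
  hdfs getconf -confKey dfs.disk.balancer.enabled
  ```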
- Execute the below script from any host in your cluster to trigger the intra-node disk rebalancing:
```bash
#!/bin/bash
set -o pipefail

## Find the current live datanodes of the cluster. We rebalance disk space
## utilization on the nodes where data resides, i.e. the datanodes.
dataNodes=$(hdfs dfsadmin -report | grep Hostname | cut -d':' -f2 | sed 's/ //g')
echo "Datanodes of the current cluster are: $dataNodes"

## Create an internal scratch file.
touch tempFile

## Process each datanode for disk rebalancing, iteratively.
for dataNode in $dataNodes; do
    ## Use the hdfs diskbalancer functionality to identify whether rebalancing
    ## is needed and to generate the plan file for the datanode. A JSON file
    ## ending in plan.json is created in an HDFS directory if disk rebalancing
    ## is required on the datanode.
    hdfs diskbalancer -plan ${dataNode} &> tempFile

    ## Identify the directory on HDFS where the plan file was written.
    planDirOnHDFS=$(cat tempFile | grep 'Writing plan to' | rev | cut -d: -f1 | rev)

    ## Identify the plan file name on HDFS.
    planFileOnHDFS=$(hdfs dfs -ls $planDirOnHDFS | grep 'plan.json' | rev | cut -d' ' -f1 | rev)

    ## If no plan file was generated, disk balancing is not required. Otherwise
    ## we submit the plan to rebalance the disks, which runs in the background.
    if [ -z "$planFileOnHDFS" ]; then
        echo "Disks are already balanced for datanode: ${dataNode}"
    else
        echo "Submitting the disk balancing plan for server ${dataNode} generated at ${planDirOnHDFS} as file ${planFileOnHDFS}"
        hdfs diskbalancer -execute ${planFileOnHDFS}
        status=$?
        if [ "$status" == "0" ]; then
            echo "Plan submitted successfully for balancing the data disks in $dataNode"
        else
            echo "Error in submitting the disk balancing plan for $dataNode"
        fi
    fi
done

rm tempFile
```
Here is an explanation of how this script works:
- Using the hdfs dfsadmin -report command, it extracts the list of all live datanodes in the cluster.
- For each datanode in the cluster, it checks whether disk balancing is required and, if so, creates a plan. The plan is a JSON file that maps how many bytes have to be moved from an identified disk to another disk within the same datanode. A plan file looks like this:
{"volumeSetPlans":[{"@class":"org.apache.hadoop.hdfs.server.diskbalancer.planner.MoveStep","sourceVolume": {"path":"/data/01/dfs/dn","capacity":200506269696,"storageType":"DISK","used":58507729497,"reserved":0, "uuid":"xyz","failed":false,"volumeDataDensity":9.999999999998899E-5,"skip":false,"transient":false,"readOnly":false}, "destinationVolume":{"path":"/data/04/dfs/dn","capacity":1071386906624,"storageType":"DISK","used":305094489511, "reserved":0,"uuid":"abc","failed":false,"volumeDataDensity":0.007099999999999995,"skip":false,"transient":false, "readOnly":false},"idealStorage":0.2918,"bytesToMove":35696407975,"volumeSetID":"54f53ded-f8ae-4c05-a2e2-9c9b16184bea", "maxDiskErrors":0,"tolerancePercent":0,"bandwidth":0},{"@class":"org.apache.hadoop.hdfs.server.diskbalancer.planner.MoveStep", "sourceVolume":{"path":"/data/05/dfs/dn","capacity":200506269696,"storageType":"DISK","used":58507729497,"reserved":0, "uuid":"xyz1","failed":false,"volumeDataDensity":9.999999999998899E-5,"skip":false,"transient":false,"readOnly":false}, "destinationVolume":{"path":"/data/07/dfs/dn","capacity":1071386906624,"storageType":"DISK","used":303247836010,"reserved":0, "uuid":"abc1","failed":false,"volumeDataDensity":0.00880000000000003,"skip":false,"transient":false,"readOnly":false}, "idealStorage":0.2918,"bytesToMove":31502404744,"volumeSetID":"54f53ded-f8ae-4c05-a2e2-9c9b16184bea","maxDiskErrors":0, "tolerancePercent":0,"bandwidth":0}],"nodeName":"datanodeA","nodeUUID":"abc2","port":50020,"timeStamp":1512904679030}
- And finally, it submits the plan using hdfs diskbalancer -execute, which initiates the balancing job that runs in the background on the datanode.
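Once a plan is submitted, you can track or abort it from the same CLI. A minimal sketch (the datanode name and plan path below are placeholders, not output from the script above):

```bash
# Check the progress of a running disk-balancing job on a given datanode.
hdfs diskbalancer -query datanodeA

# If needed, cancel a submitted plan by pointing at the same plan file
# that was passed to -execute.
hdfs diskbalancer -cancel /system/diskbalancer/<plan-dir>/datanodeA.plan.json
```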
I hope this article has helped you understand the difference between inter-node and intra-node data balancing (usually referred to as the node balancer and disk balancer in Hadoop) and how to implement the latter generically across all the clusters you manage.