Providing Elasticity to the HDFS Cluster

Image credit: https://www.cbronline.com/wp-content/uploads/2017/02/hadoop-1.png

The Hadoop Distributed File System ( HDFS ) is a distributed file system designed to run on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. It is designed to store very large data sets reliably. Now, the problem is that we will provide storage space to the HDFS cluster through the data node. But, the data node will have limited space. So, to increase/decrease the space of the data node, we have to provide the elasticity to the data node and hence to the HDFS cluster. This article will briefly explain and show you how to provide elasticity to the HDFS cluster. In the end, I will automate this process using a Python script.

Pre-requisites:

HDFS cluster should be configured.

In my case, the HDFS cluster is configured on the AWS cloud. But, you can configure it anywhere to perform this experiment.

Step 1: Create a volume/get a hard disk and attach it to the Datanode

I will create an EBS volume named DataNode of size 5 GiB and attach it to the Datanode named hadoop_data_node2 at /dev/sdg or /dev/xvdg.

Create a volume named DataNode
Attach volume named DataNode to the Datanode at /dev/sdg or /dev/xvdg

Similarly, I will create another volume and mount it at /dev/sdf or /dev/xvdf.

Step 2: Format the /dev/xvdg1 and /dev/xvdf1

Run the following command to format the volumes.

$ fdisk /dev/xvdg1

Now, format the drive to the ext4 format which is the format used by Linux systems.

$ mkfs.ext4 /dev/xvdg1

Now, we have both storage devices ready. So, in total, I have 5.5GiB(512 MiB + 5 GiB) storage available.

But this storage is present in two different devices and I can’t use 6 GiB at a time to store a file consisting of 6 GiB. To use the complete 6 GiB to store a single file of size 6 GiB, I have to combine the space of both storage devices logically. This will lead us to the concept of LVM that is Logical Volume Management. Using LVM, we can combine storage from multiple storage devices and use it as a single drive, logically. As we are combining storage devices logically, it will give us the power to increase or decrease the space of the logical volumes on the go and that too without losing the data.
So, I will use the concept of LVM to provide elasticity to the HDFS cluster.

So, let’s continue and create the Logical Volume using LVM.

Step 3: Create Physical Volume

First of all, we will create a Physical Volume from the storage devices we have.

$ pvcreate /dev/xvdf1$ pvcreate /dev/xvdg1

Confirm that the Physical Volumes are created using the pvdisplay command.

$ pvdisplay /dev/xvdf$ pvdisplay /dev/xvdg

Step 4: Create a Volume Group

To get the space from separate Physical Volumes we have to combine their space. To achieve this, we have to create a Volume Group that has space from both /dev/xvdf1 and /dev/xvdg1 physical volumes.

$ vgcreate contribdata /dev/xvdf1 /dev/xvdg1

This will create a volume group of size 5.5 GiB (512 MiB + 5 GiB).

View the created Volume Group using the following command

$ vgdisplay contribdata

Step 5: Create a Logical Volume

Now, we have Volume Group ready. We have to create a logical volume from this volume group, so that, we can mount the logical volume to use it.

$ lvcreate --size 3G --name ankush contribdata

The above command will create a logical volume of size 3 GiB and named ankush. This logical volume will take the storage space from the volume group named contribdata.

We can display the logical volume using the command lvdisplay.

$ lvdisplay

The command above will also show the exact path/name of the logical volume. For this logical volume, the exact name of the logical volume will be /dev/contribdata/ankush.

So, finally, we have a logical volume ready. And as this volume is the Logical Volume, we can increase or decrease its space on the go without losing any of the stored data.

Step 6: Mount the Logical Volume

As we already had configured HDFS cluster and our datanode will share the storage space to the HDFS cluster from /data directory. So, we have to mount the logical volume to the logical volume at /data directory. But, before mounting, we will format the logical volume to ext4 partition type.

$ mkfs.ext4 /dev/contribdata/ankush

Now, the logical volume is ready to use. We will mount the logical volume named /dev/contribdata/ankush at /data directory.

$ mount /dev/contribdata/ankush /data

Step 7: Start the Datanode

Now, we are all set to start the Datanode. This Datanode will share the storage to the HDFS cluster from the logical volume.

Before starting the Datanode, I will go to the Namenode and check how much storage it currently has.

$ hadoop dfsadmin -report

You can see in the above screenshot that the HDFS cluster does not have any Datanode connected and hence the current storage space is 0 GiB.

Now, Go back to the Datanode and start the Datanode.

$ hadoop-daemon.sh start datanode

After that, go to the Namenode again and check the storage space.

You can see that, after starting the Datanode, the HDFS cluster now has 2.89 GiB(approx. 3 GiB) of storage space. All this space is coming from Datanode and hence from the logical volume.

Now, Let’s see the power of LVM in the real scenario.

Let us assume that we are about to run out of the space of 3 GiB and we have to increase the space of the Datanode instead of adding a new Datanode. This problem may in the real world and copying all of the data to the new harddisk of more size is not the optimized solution. So, here the role of the LVM comes into play.
Now, we just have to extend the space of the logical volume and that's it! We can increase the storage without losing any past data.

Let’s see this in action.

Increasing the storage space of the Datanode

For increasing the space of the Datanode storage, we have to increase the size of the logical volume. This size can only be possible to extend if the volume group of the logical volume has unallocated storage space. As we had used 3 GiB for the logical volume out of 5.5 GiB total size of the volume group, we can extend our logical volume by up to 2.5 GiB.

Here, we will extend the logical volume by 1 GiB.

So, let’s begin.

Step 1: Stop the Datanode and unmount the logical volume

As the logical volume is mounted to /data directory, we will unmount it in order to increase its space.

$ hadoop-daemon.sh stop datanode
$ umount /data

Step 2: Extend the size of the logical volume

To extend the size of the logical volume, run the following command.

$ lvextend --size +1G /dev/contribdata/ankush

This will increase the size of the logical volume by 1 GiB. But this size is increased logically. If we check the size of the logical volume, it will show it 4GiB but after mounting the volume, the actual size is 3 GiB. This is because, we had extended the size of the logical volume but inorder to get the extended storage, we have to format only that part of the volume that we had added newly.

Step 3: Format the newly added storage to the logical volume

If we try to format the volume by using the regular method, it will recreate the whole partition table. And thus, we will lose the data stored in the volume. But, we have to format only that part of the volume that we had added newly. To do this, we will use the command resize2fs.

$ e2fsck -f /dev/contribdata/ankush
$ resize2fs /dev/contribdata/ankush

The command above will format the un-formated part of the logical volume.

Now, we have the size of 4 GiB ready to use.

Step 4: Mount the logical volume and start the datanode

Mount the logical volume to the /data directory.

$ mount /dev/contribdata/ankush /data

Start the datanode again.

$ hadoop-daemon.sh start datanode

Now, go back to the Namenode and check the HDFS storage.

In the above screenshot, you can see that the available storage space is 3.87 GiB(approx. 4 GiB).

So, this will prove that, we had successfully increased the storage space of the datanode on the go.

In the same way, we can reduce the size of the Datanode storage using lvreduce command.

Reducing the storage space of the datanode

For, reducing the Datanode storage space, run the following commands.

Step 1: Unmount logical volume

$ unmount /data

Step 2: Resize the LVM to 2 GiB

$ e2fsck -f /dev/contribdata/ankush
$ resize2fs /dev/contribdata/ankush 2G

Step 3: Reduce the size using lvreduce

$ lvreduce -L 2G /dev/contribdata/ankush
$ e2fsck -f /dev/contribdata/ankush

Step 4: Mount the LVM to the /data

$ mount /dev/contribdata/ankush /data

Creating a Python script to automate the elasticity of the HDFS cluster

The script above will automate the process of providing elasticity to the HDFS cluster.

That's it for this article!

For any help or suggestions connect with me on Twitter at @TheNameIsAnkush or find me on LinkedIn.

Tech blogger, researcher and integrator