Hadoop Distributed File System (HDFS)

Why HDFS?

In this video, we will learn about the Hadoop Distributed File System (HDFS), which is one of the main components of the Hadoop ecosystem.

Before going into the depths of HDFS, let us discuss a problem statement.

If we have 100 TB of data, how will we design a system to store it? Let's take two minutes to think of possible solutions and then we will discuss them.

One possible solution is to build a network-attached storage (NAS) or a storage area network (SAN). We can buy a hundred 1 TB hard disks and mount them on a hundred subfolders, as shown in the image. What will be the challenges in this approach? Let us take two minutes to find out the challenges and then we will discuss them.

Let us discuss the challenges.

How will we handle failover and backups?

Failover means switching to a redundant or standby hard disk upon the failure of any hard disk. For backups, we can add extra hard disks or build a RAID (redundant array of independent disks) for every hard disk in the system, but this still does not solve the problem of failover, which is really important for real-time applications.

How will we distribute the data uniformly?

Distributing the data uniformly across the hard disks is really important so that no single disk will be overloaded at any point in time.

Is it the best use of available resources?

We may have other, smaller hard disks available, but we may not be able to add them to the NAS or SAN because huge files cannot be stored on these smaller disks. Therefore, we will need to buy new, bigger hard disks.

How will we handle frequent access to files? What if most of the users want to access files stored on one of the hard disks? File access will be really slow in that case, and eventually no user will be able to access the files due to congestion.

How will we scale out?

Scaling out means adding new hard disks when we need more storage. When we add more hard disks, the data will not be uniformly distributed, as the old hard disks will have more data and the newly added hard disks will have little or no data.

To solve the above problems, Hadoop comes with a distributed filesystem called HDFS. We may sometimes see it referred to as “DFS” informally or in older documentation and configurations.

HDFS - NameNode & DataNodes

An HDFS cluster has two types of nodes: one namenode, also known as the master, and multiple datanodes.

An HDFS cluster consists of many machines. One of these machines is designated as the namenode and the other machines act as datanodes. Please note that we can also run a datanode on the machine where the namenode service is running. By default, the namenode metadata service runs on port 8020.
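
To check these details on a running cluster, we can use the commands below. This is only a quick sketch; the exact hostname and port depend on the cluster configuration, and -report may require HDFS admin privileges.

hdfs getconf -confKey fs.defaultFS

hdfs dfsadmin -report

The first command prints the namenode address (for example, hdfs://<namenode-host>:8020), and the second lists the datanodes along with their capacities and usage.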

The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored in RAM and persisted on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located.

Datanodes are the workhorses of the filesystem. While the namenode keeps the index of which block is stored on which datanode, the datanodes store the actual data. In short, datanodes do not know the name of a file, and the namenode does not know what is inside a file.

As a rule of thumb, the namenode should be installed on a machine with more RAM, since all the metadata for files and directories is stored in RAM. If the RAM is small, we will not be able to store many files in HDFS, as there will not be enough space for the metadata to fit in RAM. Since datanodes store the actual data, they should run on machines with bigger disks.

HDFS - Replication

Let’s understand the concept of blocks in HDFS. When we store a file in HDFS, the file gets split into chunks of the 128 MB block size. Except for the last block, all blocks will be exactly 128 MB in size. The last block may be less than or equal to 128 MB, depending on the file size. This default block size is configurable.

Let’s say we want to store a 560 MB file in HDFS. This file will get split into four blocks of 128 MB each and one block of 48 MB.
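
To check the default block size, or to override it for a particular upload, we can use the commands below. This is a hedged sketch: dfs.blocksize is the standard property name and its value is in bytes, but bigfile.txt is just a placeholder file name.

hdfs getconf -confKey dfs.blocksize

hadoop fs -D dfs.blocksize=268435456 -put bigfile.txt bigfile.txt

The first command prints the configured default (134217728 bytes for 128 MB), and the second uploads bigfile.txt with a 256 MB block size for that file only.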

What are the advantages of splitting the file into blocks? It helps fit a big file onto smaller disks. It leaves less unused space on every datanode, as many 128 MB blocks can be stored on each datanode. It optimizes file transfer. Also, it distributes the load across multiple machines. Let’s say a file is stored on 10 datanodes; whenever a user accesses the file, the load gets distributed across 10 machines instead of one.

It is similar to downloading a movie using a torrent. The movie file gets broken down into multiple pieces, and these pieces get downloaded from multiple machines in parallel. This helps in downloading the file faster.

Let’s understand HDFS replication. Each block has multiple copies in HDFS. A big file gets split into multiple blocks, and each block gets stored on three different datanodes. The default replication factor is 3. Please note that no two copies will be on the same datanode. Generally, the first two copies will be on the same rack and the third copy will be on a different rack (a rack is a cabinet in which we stack machines on the same local area network). It is advised to set the replication factor to at least 3 so that even if something happens to a rack, one copy is always safe.

We can set the default replication factor of the filesystem, as well as a replication factor for each file and directory individually. For files which are not important, we can decrease the replication factor, while files which are very important should have a higher replication factor.
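
To check the replication factor of an individual file, the -stat command is one option. This is a small example where sample.txt is a placeholder file name; changing the replication factor with -setrep is covered in the hands-on section later.

hadoop fs -stat "%r %n" sample.txt

This prints the replication factor followed by the file name.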

Whenever a datanode goes down or fails, the namenode instructs the datanodes which have copies of the lost blocks to start replicating those blocks to other datanodes, so that each file and directory again reaches the replication factor assigned to it.

HDFS - Design & Limitations

HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s understand the design of HDFS.

It is designed for very large files. “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size.

It is designed for streaming data access. It is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from the source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

It is designed for commodity hardware. Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on the commonly available hardware that can be obtained from multiple vendors. HDFS is designed to carry on working without a noticeable interruption to the user in case of hardware failure.

It is also worth knowing the applications for which HDFS does not work so well.

HDFS does not work well for low-latency data access. Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. HDFS is optimized for delivering high throughput, and this may come at the expense of latency.

HDFS is not a good fit if we have a lot of small files. Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
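
As a rough illustration (a commonly cited rule of thumb, not an exact figure), each file, directory, and block takes on the order of 150 bytes of namenode memory. So 10 million small files, each occupying its own block, amount to roughly 20 million objects, i.e. about 20,000,000 × 150 bytes ≈ 3 GB of namenode heap just for the metadata, no matter how tiny the files are.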

HDFS is also not a good fit if we have multiple writers or need arbitrary file modifications. Files in HDFS are written by a single writer at any time.

Writes are always made at the end of the file, in an append-only fashion.

There is no support for modifications at arbitrary offsets in the file.
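
For instance, appending through the shell is possible (a small illustration, assuming appends are enabled on the cluster; localfile.txt and test.txt are placeholder names), but there is no command to overwrite a byte range in the middle of an existing file:

hadoop fs -appendToFile localfile.txt test.txt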

Questions & Answers

Q: Can you briefly explain low-latency data access with an example?

A: Low latency here means the ability to access data almost instantaneously. In the case of HDFS, since a request first goes to the namenode and then to the datanodes, there is a delay in getting the first byte of data. Therefore, there is relatively high latency when accessing data from HDFS.

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Source: https://gist.github.com/jboner/2841832

HDFS - File Reading & Writing

When a user wants to read a file, the client talks to the namenode, and the namenode returns the metadata of the file. The metadata has information about the blocks and their locations.

When the client receives the metadata of the file, it communicates with the datanodes and accesses the data either sequentially or in parallel. This way, the namenode does not become a bottleneck, as the client talks to the namenode only once to get the metadata of the file.

By design, HDFS makes sure that no two writers write to the same file at the same time by having a single namenode.

If there were multiple independent namenodes and clients made requests to these different namenodes, the entire filesystem could get corrupted, because multiple requests could write to the same file at the same time.

Let’s understand how files are written to HDFS. When a user uploads a file to HDFS, the client, on behalf of the user, tells the namenode that it wants to create the file. The namenode replies with the locations of the datanodes where the file can be written. The namenode also creates a temporary entry in the metadata.

The client then opens an output stream and writes the file to the first datanode. The first datanode is the one closest to the client machine. If the client is running on a machine which is also a datanode, the first copy will be written on that machine.

Once the file is stored on one datanode, the data gets copied to the other datanodes in parallel. Also, once the first copy is completely written, the datanode informs the client that the file has been created.

The client then confirms to the namenode that the file has been created. The namenode cross-checks this with the datanodes and finalizes the entry in the metadata.

Now, let's try to understand what happens while reading a file from HDFS.

When a user wants to read a file, the HDFS client, on behalf of the user, talks to the namenode.

The namenode provides the locations of the various blocks of this file and their replicas instead of returning the actual data.

Out of these locations, the client chooses the datanodes closest to it. The client talks to these datanodes directly and reads the data from these blocks.

The client can read the blocks of the file either sequentially or in parallel.


HDFS - Namenode Backup & Failover

The metadata is maintained in memory as well as on the disk. On the disk, it is kept in two parts: the namespace image and the edit logs.

The namespace image is created on demand, while edit logs are written whenever there is a change in the metadata. So, at any point, to get the current state of the metadata, the edit logs need to be applied to the image.

Since the metadata is huge, writing all of it to the disk on every change would be time consuming. Therefore, saving just the change makes it extremely fast.
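
To see where the namenode keeps these files, and to inspect them, the commands below are one approach. This is a sketch: dfs.namenode.name.dir is the standard property, but the fsimage and edits file names shown are placeholders, and you need read access to the namenode's directory to run the offline viewers.

hdfs getconf -confKey dfs.namenode.name.dir

hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml

hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml

The first command prints the directory where the namespace image and edit logs are persisted; hdfs oiv and hdfs oev are the offline image and edits viewers, which dump those files in a readable format such as XML.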

Without the namenode, HDFS cannot be used at all, because we would not know which blocks of which files are stored on which datanodes. Therefore, it is very important to make the namenode resilient to failures. Hadoop provides various approaches to safeguard the namenode.

The first approach is to maintain a copy of the metadata on NFS, i.e. a Network File System mount. Hadoop can be configured to do this. Modifications to the metadata then happen either both on NFS and locally, or not at all.

In the second approach to making the namenode resilient, we run a secondary namenode on a different machine.

The main role of the secondary namenode is to periodically merge the namespace image with edit logs to prevent the edit logs from becoming too large.
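
How often this merge (the checkpoint) happens is driven by configuration. For example, the properties below can be inspected with getconf; in most distributions they default to one hour and one million transactions, but your cluster may differ.

hdfs getconf -confKey dfs.namenode.checkpoint.period

hdfs getconf -confKey dfs.namenode.checkpoint.txns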

When the namenode fails, we first have to prepare the latest namespace image and then bring up the secondary namenode.

This approach is not good for production applications as there will be a downtime until the secondary namenode is brought online. With this method, the namenode is not highly available.

To make the namenode resilient, Hadoop 2.0 added support for high availability.

This is done using multiple namenodes and ZooKeeper. Of these namenodes, one is active and the rest are standby namenodes. The standby namenodes are exact replicas of the active namenode.

The datanodes send block reports to both the active and the standby namenodes to keep all the namenodes updated at any point in time.

If the active namenode fails, a standby can take over very quickly because it has the latest state of metadata.

ZooKeeper helps in switching between the active and the standby namenodes.
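
On an HA-enabled cluster, we can check which namenode is currently active using the commands below. This is a hedged example: nn1 and nn2 are the logical namenode IDs defined in the cluster's configuration, so the IDs on your cluster may differ.

hdfs haadmin -getServiceState nn1

hdfs haadmin -getServiceState nn2

Each command prints either "active" or "standby" for the given namenode ID.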

The namenode maintains a reference to every file and block in memory. If we have too many files or folders in HDFS, the metadata can become huge, and the memory of the namenode may become insufficient.

To solve this problem, HDFS federation was introduced in Hadoop 2.0.

In HDFS Federation, we have multiple namenodes containing parts of metadata.

The metadata, which is the information about files and folders, gets distributed manually across different namenodes. This distribution is done by maintaining a mapping between folders and namenodes.

This mapping is known as a mount table.

In this diagram, the mount table defines that the /mydata1 folder is on namenode1 and that /mydata2 and /mydata3 are on namenode2.

The mount table is not a service. It is a file kept alongside, and referred to from, the HDFS configuration file.

The client reads the mount table to find out which folders belong to which namenode.

It routes the request to read or write a file to the namenode corresponding to the file's parent folder.

The same pool of datanodes is used for storing data from all namenodes.
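
As a sketch of what such a mount table can look like: with ViewFS, the client configuration maps folders to namenodes with properties of the form fs.viewfs.mounttable.<table-name>.link.<folder>, and the client then addresses the filesystem through the viewfs:// scheme. The snippet below is illustrative only; "default" is an assumed mount-table name and namenode1/namenode2 are placeholder hostnames.

# client-side properties (set in core-site.xml):
#   fs.viewfs.mounttable.default.link./mydata1 = hdfs://namenode1:8020/mydata1
#   fs.viewfs.mounttable.default.link./mydata2 = hdfs://namenode2:8020/mydata2
#   fs.viewfs.mounttable.default.link./mydata3 = hdfs://namenode2:8020/mydata3

hadoop fs -ls viewfs://default/mydata1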

Let us discuss what the metadata contains. The following attributes are stored in the metadata:

- List of files
- List of blocks for each file
- List of datanodes for each block
- File attributes, e.g. access time, replication factor, file size, file name, directory name
- Transaction logs or edit logs, which store file creation and file deletion timestamps

HDFS - Hands-On with Ambari

Let's do some hands-on exercises with HDFS.

Log in to CloudxLab. Go to “My Lab” and click on “Lab Credentials”. Let's log in to Ambari.

In Ambari, we can see the HDFS summary, such as the hostnames of the namenode and the secondary namenode, and the HDFS configuration variables.

We can also see datanode details like location, disk usage, CPU cores, etc. in Ambari.

HDFS - Hands-On with Hue

To upload files from our local machine to HDFS, we can use Hue. Let’s upload a file from our local machine to HDFS using Hue. Log in to Hue and click on the File Browser. At the top left, you can see your home directory in HDFS. Please note that in HDFS you have permissions to create files and directories only in your home directory. Apart from your home directory, you have read permissions on the /data and /dataset directories, which contain datasets and code provided by CloudxLab.

Let’s upload a file. Click on “Upload” and select the type of file. In our case, we are going to upload a normal text file, so we will select “Files”. Click on “Select Files” and select the file from the local machine. We’ve successfully uploaded the file from our local machine to HDFS.

We can see the owner, group, and permissions of the uploaded file in the Hue File Browser. HDFS file permissions are the same as Unix file permissions.

Note: If you are facing problems with the updated version of Hue (Hue 4), you can easily switch to the older version. Simply go to the top right corner in Hue, where you will see your username. Click on it and you will see a drop-down; click on "Switch to Hue 3", and you will be in the older version matching the video. We recommend that you also get familiar with the newer version, as it will be beneficial to you.

HDFS - Hands-On with Console

We can also upload files from the CloudxLab Linux console to HDFS.

Let’s say we want to copy the test.txt file from the CloudxLab Linux console to HDFS. Log in to the CloudxLab Linux console and create a file test.txt using the nano or vi editor. To copy the file from the CloudxLab Linux console to CloudxLab HDFS, run:

hadoop fs -copyFromLocal test.txt

To verify that the file has been copied to HDFS, type:

hadoop fs -ls test.txt

To see the content of the file in HDFS, type:

hadoop fs -cat test.txt

HDFS - Hands-On - More Commands

To access the files in HDFS, we can type any of the commands displayed on the screen.

To see which datanodes have blocks of sample.txt, use the following command:

hdfs fsck -blocks -locations -racks -files sample.txt

In this example, the blocks are located on datanodes having the private IPs 172.31.53.48, 172.31.38.183, and 172.31.37.42.

By default, every file has a replication factor of 3 on CloudxLab. To change the replication factor of sample.txt to 1, we can run:

hadoop fs -setrep -w 1 /user/abhinav9884/sample.txt

Now if we check the blocks again, we will see that the average block replication is 1. If you want more usable space within your HDFS space quota, you can decrease the replication factor of the files in your home directory in HDFS.
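
To see your current quota and how much of it is used (a small example; replace the path with your own home directory), we can run:

hadoop fs -count -q -h /user/abhinav9884

The output includes the name quota, the space quota, and the remaining amounts for the given directory.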


Copy File to HDFS

The file created in the previous step is stored in your home folder on the Linux console.

We can copy files using either of the following commands. They will put mylocalfile.txt from the current directory of the Linux filesystem into your HDFS home directory:

hadoop fs -copyFromLocal mylocalfile.txt

or

hadoop fs -put mylocalfile.txt

INSTRUCTIONS

Please copy myfirstfile.txt to your home folder in HDFS.

Set Replication Factor

To set the replication factor, we use the command below:

hadoop fs -setrep 2 file_name

Here, 2 is the new replication factor which we want to set, and file_name is the file whose replication factor we want to change.

If you want to wait until the process finishes, use -w, and if you want to set the replication factor for an entire directory, use the -R option. For example, the following commands will set the replication factor of all the files in the directory 'mydir' to 2 and wait for completion:

hadoop fs -mkdir mydir

hadoop fs -put wordcountscript.sh mydir

hadoop fs -setrep -R -w 2 mydir

Here is the output:

Replication 2 set: mydir/wordcountscript.sh
Waiting for mydir/wordcountscript.sh ... done

INSTRUCTIONS

In the previous step, we copied myfirstfile.txt to HDFS. Now set its replication factor to 2.

