Hadoop's distributed file system, HDFS, is used to store bulk amounts of data, on the order of terabytes or even petabytes. HDFS supports a high-throughput mechanism for accessing this large amount of information. This tutorial provides HDFS PDFs at the end of this section.
In HDFS, files are stored in a redundant manner across multiple machines, which guarantees the following properties.
NFS (Network File System) is an example of a distributed file system. It gives remote access to a single logical volume stored on a single machine. An NFS server can make a portion of its local file system visible to external clients, and a client can mount this remote file system directly into its own Linux file system and interact with it as though it were part of the local drive. One advantage of NFS is its transparency: clients do not need to be particularly aware that they are working on files stored remotely. This is accomplished using the existing standard library calls such as open(), close(), and fread().
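This transparency can be illustrated with ordinary file I/O: the same calls work whether the path points at a local disk or an NFS mount. A minimal sketch in Python, using a local temporary file as a stand-in for an NFS-mounted path (the /mnt/nfs path in the comment is hypothetical):

```python
import os
import tempfile

# Create a stand-in file. On a real deployment this file could live under
# an NFS mount such as /mnt/nfs/shared (hypothetical path); the client
# code below would not change at all.
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "w") as f:   # the same open() used for a local file
    f.write("hello from a (possibly remote) file system\n")

with open(path, "r") as f:   # the same read path, local or NFS
    data = f.read()

print(data.strip())
os.remove(path)
```

The client never issues an NFS-specific call; the kernel's VFS layer routes the standard operations to the NFS client transparently.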
HDFS is a block-structured file system: each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are called DataNodes. A file is made up of several blocks, which are not necessarily stored on the same machine; the target machine for each block is chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of multiple machines, but HDFS supports file sizes far larger than a single-machine DFS can: individual files sometimes need more space than a single hard drive could hold. However, if several machines are involved in serving a file, that file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default).
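The block-by-block splitting and replicated placement described above can be sketched as follows. This is a toy illustration, not HDFS's actual placement policy (real HDFS placement is rack-aware rather than uniformly random), and the DataNode names are hypothetical:

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the HDFS default block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the fixed-size blocks a file is broken into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Choose, block by block, a random set of DataNodes for each replica."""
    return [random.sample(datanodes, replication) for _ in blocks]

# A hypothetical 5-node cluster storing a 200 MB file.
datanodes = ["datanode%d" % i for i in range(1, 6)]
blocks = split_into_blocks(200 * 1024 * 1024)   # three 64 MB blocks + one 8 MB block
placement = place_blocks(blocks, datanodes)
for i, (size, nodes) in enumerate(zip(blocks, placement)):
    print("block %d: %d bytes on %s" % (i, size, nodes))
```

Because each block lands on three distinct DataNodes, losing any single machine leaves every block still reachable from at least two replicas.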
In the above figure, the DataNodes hold the blocks of multiple files with a replication factor of 2, and the NameNode maps the filenames onto the block ids. Ordinary block-structured file systems commonly use a block size on the order of 4 or 8 KB. The default block size in HDFS is 64 MB. This much larger block size permits HDFS to decrease the amount of metadata storage required per file.
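The effect of the larger block size on metadata volume can be seen with a quick calculation of how many block entries the NameNode must track for a 1 GB file:

```python
GIB = 1024 ** 3

def num_blocks(file_size, block_size):
    """Number of fixed-size blocks needed to hold a file (ceiling division)."""
    return -(-file_size // block_size)

# A conventional 4 KB block size versus the 64 MB HDFS default:
small_blocks = num_blocks(GIB, 4 * 1024)            # 262,144 entries
large_blocks = num_blocks(GIB, 64 * 1024 * 1024)    # 16 entries
print(small_blocks, large_blocks)
```

With 64 MB blocks the NameNode tracks roughly sixteen thousand times fewer entries for the same file, which is what makes it feasible to keep all metadata in a single machine's main memory.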
In the HDFS block-structured file system, all metadata is handled by a single machine called the NameNode. The NameNode stores all the metadata for the file system: it tracks file names, permissions, and the locations of each block of each file. This information is kept in the main memory of the NameNode machine, which permits fast access to the metadata. To open a file, a client first contacts the NameNode and obtains a list of locations for the blocks that comprise the file; these locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is therefore not directly involved in the bulk data transfer, keeping its overhead to a minimum. NameNode information must be preserved even if the NameNode machine fails, because NameNode failure is more severe for the cluster than DataNode failure: while individual DataNodes may crash and the entire cluster will continue to operate, the loss of the NameNode renders the cluster inaccessible until it is manually restored.
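The read path described above, metadata lookup at the NameNode followed by direct block reads from DataNodes, can be modeled in a few lines. This is an in-memory toy (real clients speak the Hadoop RPC protocol, not Python objects), and the file path and block ids are hypothetical:

```python
class NameNode:
    """Holds only metadata: filename -> list of (block_id, datanode) pairs."""
    def __init__(self):
        self.block_map = {}

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self):
        self.blocks = {}

# A tiny cluster: one file whose two blocks live on different DataNodes.
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"first block "
dn2.blocks["blk_2"] = b"second block"

nn = NameNode()
nn.block_map["/user/demo/file.txt"] = [("blk_1", dn1), ("blk_2", dn2)]

def read_file(namenode, filename):
    # Step 1: ask the NameNode only for block locations (pure metadata).
    locations = namenode.get_block_locations(filename)
    # Step 2: fetch each block directly from the DataNode holding it,
    # so no file data ever flows through the NameNode.
    return b"".join(dn.blocks[block_id] for block_id, dn in locations)

print(read_file(nn, "/user/demo/file.txt"))
```

Note that the NameNode object never touches block contents, which is exactly why its overhead stays minimal even for very large transfers.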
As a distributed file system, however, NFS is limited in its power: the files in an NFS volume all reside on a single machine. This creates some problems.