Here we list the advantages of Hadoop first, followed by its disadvantages. Map-Reduce and HDFS are the two main parts of Hadoop.
1) Distributes data and computation. Keeping the computation local to the data prevents network overload.
2) Tasks are independent of one another.
3) Linear scaling in the ideal case. It is designed for cheap, commodity hardware.
4) Simple programming model. The end-user programmer only writes map-reduce tasks.
5) Flat scalability:-
One advantage of using Hadoop in contrast to other distributed systems is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly stellar performance, as the overhead involved in starting Hadoop programs is relatively high. Other parallel/distributed programming paradigms such as MPI (Message Passing Interface) may perform much better on two, four, or perhaps a dozen machines. Though the effort of coordinating work among a small number of machines may be better performed by such systems, the price paid in performance and engineering effort (when adding more hardware as a result of increasing data volumes) increases non-linearly.
A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. This may involve having the program be rewritten several times; fundamental elements of its design may also put an upper bound on the scale to which the application can grow.
Hadoop, however, is specifically designed to have a very flat scalability curve. After a Hadoop program is written and functioning on ten nodes, very little–if any–work is required for that same program to run on a much larger amount of hardware. Orders of magnitude of growth can be managed with little re-work required for your applications. The underlying Hadoop platform will manage the data and hardware resources and provide dependable performance growth proportionate to the number of machines available.
6) HDFS can store a large amount of information.
7) HDFS has a simple and robust coherency model.
8) HDFS should store data reliably.
9) HDFS provides scalable, fast access to this information, and it is also possible to serve a large number of clients by simply adding more machines to the cluster.
10) HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
11) HDFS provides streaming read performance.
12) Data will be written to the HDFS once and then read several times.
13) The overhead of caching is high enough that data should simply be re-read from the HDFS source.
14) Fault tolerance by detecting faults and applying quick, automatic recovery
15) Processing logic close to the data, rather than the data close to the processing logic
16) Portability across heterogeneous commodity hardware and operating systems
17) Economy by distributing data and processing across clusters of commodity personal computers
18) Efficiency by distributing data and logic to process it in parallel on nodes where data is located
19) Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures
20) HDFS is a block-structured file system:- Each file is broken into blocks of a fixed size, and these blocks are stored across a cluster of one or more machines with data storage capacity.
21) Ability to write MapReduce programs in Java, a language which even many non-computer scientists can learn with sufficient capability to meet powerful data-processing needs
22) Ability to rapidly process large amounts of data in parallel
23) Can be deployed on large clusters of cheap commodity hardware as opposed to expensive, specialized parallel-processing hardware
24) Can be offered as an on-demand service, for example as part of Amazon’s EC2 cluster computing service
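To make the simple programming model from point 4 concrete, here is a minimal sketch of a word count written as a map step and a reduce step. It is plain Python running in a single process, not the Hadoop API; the names `map_words`, `reduce_counts` and `run_job` are made up for this illustration, and `run_job` only stands in for the work the framework would do across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Toy single-process driver (hypothetical, stands in for the framework)."""
    grouped = defaultdict(list)
    # Map phase: each line is handled independently of every other line.
    for line in lines:
        for key, value in mapper(line):
            # Shuffle: collect all values that share the same key.
            grouped[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(run_job(sample, map_words, reduce_counts))
```

The programmer only supplies the two small functions. Because each line is mapped without reference to any other line, the same two functions could in principle be run on many machines at once, which is the idea behind points 2 and 4.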
Disadvantages of Hadoop:-
1) Rough manner:- Hadoop Map-Reduce and HDFS are rough around the edges because the software is still under active development.
2) Programming model is very restrictive:- The lack of central data can be a hindrance.
3) Joins of multiple datasets are tricky and slow:- No indices! Often the entire dataset gets copied in the process.
4) Cluster management is hard:- In the cluster, operations like debugging, distributing software, collecting logs, etc. are hard.
5) Still a single master, which requires care and may limit scaling
6) Managing job flow isn’t trivial when intermediate data should be kept
7) Optimal configuration of nodes is not obvious, e.g. the number of mappers, the number of reducers and the memory limits.
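Point 3 about joins can be illustrated with a rough sketch of a reduce-side join, one common way of joining two datasets in the map-reduce style. Again this is plain single-process Python and not Hadoop code; `map_tagged` and `reduce_side_join` are hypothetical names, and the join key is assumed to be the first field of every record.

```python
from collections import defaultdict


def map_tagged(records, tag, key_index=0):
    """Map step: emit (join_key, (tag, record)) for every record."""
    for record in records:
        yield record[key_index], (tag, record)


def reduce_side_join(left, right):
    """Inner-join two lists of tuples on their first field."""
    shuffled = defaultdict(list)
    # With no index to look up matching rows, every record of BOTH
    # datasets has to pass through the shuffle, matched or not.
    tagged = list(map_tagged(left, "L")) + list(map_tagged(right, "R"))
    for key, value in tagged:
        shuffled[key].append(value)

    joined = []
    # Reduce step: for each key, pair up the records from the two sides.
    for key in sorted(shuffled):
        lefts = [rec for tag, rec in shuffled[key] if tag == "L"]
        rights = [rec for tag, rec in shuffled[key] if tag == "R"]
        for l_rec in lefts:
            for r_rec in rights:
                joined.append(l_rec + r_rec[1:])
    return joined


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "cup")]
    print(reduce_side_join(users, orders))
```

Note that the records for user 2 and order 3 are still moved through the shuffle even though they match nothing, which is why such joins tend to be slow on large datasets.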