The Apache Hadoop project develops open source software. The properties of Apache Hadoop are:

  1. Reliable
  2. Scalable
  3. Distributed computing

The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer.

A Hadoop environment is built from the following three subprojects.

  1. Hadoop Common: – The common utilities that support the other Hadoop subprojects.
  2. MapReduce: – A software framework that processes large data sets on compute clusters.
  3. HDFS: – HDFS stands for Hadoop Distributed File System. This file system provides high-throughput access to application data.
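The MapReduce model named above can be illustrated with a minimal sketch in plain Python. This is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in one process, whereas a real cluster would run the map and reduce steps on many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a cluster, each line (or block of lines) could be mapped on a different machine.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be spread across machines without changing the result.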

Other Hadoop-related projects are:

1)      Avro: – A data serialization system.

Avro provides the following.

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file to store persistent data.
  • A remote procedure call (RPC) mechanism.
  • Simple integration with dynamic languages.
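To give a feel for why a binary format can be compact, here is a minimal sketch of the kind of encoding the Avro specification describes: integers written as variable-length zig-zag values, strings as a length followed by UTF-8 bytes, and a record as its fields concatenated in schema order. It does not use the Avro library; the function names and the sample record are hypothetical, and only two field types are handled.

```python
def encode_long(n):
    """Encode an integer as a zig-zag varint (assumes n fits in 64 bits)."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Encode a string as its byte length (as a long) followed by UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_record(record, field_order):
    """Concatenate field encodings in schema order; field names are not written."""
    encoders = {int: encode_long, str: encode_string}
    return b"".join(encoders[type(record[name])](record[name]) for name in field_order)

if __name__ == "__main__":
    # Hypothetical record with a string field and an integer field.
    print(encode_record({"name": "foo", "age": 64}, ["name", "age"]).hex())
```

The field names never appear in the output because the reader is expected to have the schema, which is one reason the encoded form stays small.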

2)      Cassandra: – A scalable multi-master database with no single point of failure. Apache Cassandra provides scalability and high availability without compromising performance. Its linear scalability and fault tolerance make it a strong platform for critical data. Cassandra’s ColumnFamily data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.

3)      Chukwa: – A data collection system for managing large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. It also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

4)      HBase: – A scalable, distributed database that supports structured data storage for large tables.

Properties of HBase:-

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
  • Easy-to-use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Query predicate push-down via server-side Filters.
  • Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.
  • Extensible JRuby-based (JIRB) shell.
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
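The Bloom filters mentioned in the list above help a store skip lookups for keys that are definitely absent. The following is a minimal sketch of the general technique in plain Python, not HBase's implementation; the class name, sizes, and hashing scheme are assumptions chosen for illustration.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True means it may have been.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("row-1")
    print(row_keys.might_contain("row-1"))
```

A store can consult such a filter before reading a file from disk: a `False` answer means the read can be skipped entirely, while a `True` answer still requires the real lookup.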

5)      Hive: – A data warehouse infrastructure that provides data summarization and ad hoc querying.

6)      Mahout: – A scalable machine learning and data mining library.

Features of Mahout:-

    • Scalable to large data sets.
    • Scalable to support your business case.
    • Scalable community.

7)      Pig: – A high-level data-flow language and execution framework for parallel computation.

Features of Pig:-

  • Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
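The idea of encoding a complex task as a sequence of data flow steps can be sketched in plain Python. This is not Pig Latin and does not run on a cluster; the step names and the sample rows are hypothetical, chosen only to show a load, filter, group, and aggregate sequence with a user-supplied parameter.

```python
from collections import defaultdict

def load(rows):
    """Load step: turn raw tuples into named records."""
    for user, url, seconds in rows:
        yield {"user": user, "url": url, "seconds": seconds}

def filter_long_visits(records, min_seconds):
    """Filter step: keep only visits that lasted at least min_seconds."""
    return (r for r in records if r["seconds"] >= min_seconds)

def group_by_user(records):
    """Group step: collect the records that belong to each user."""
    groups = defaultdict(list)
    for r in records:
        groups[r["user"]].append(r)
    return groups

def total_seconds(groups):
    """Aggregate step: sum the visit time per user."""
    return {user: sum(r["seconds"] for r in rs) for user, rs in groups.items()}

def run_pipeline(rows, min_seconds=10):
    # Each step consumes the output of the previous one, forming the data flow.
    return total_seconds(group_by_user(filter_long_visits(load(rows), min_seconds)))

if __name__ == "__main__":
    visits = [("ann", "/a", 30), ("bob", "/b", 5), ("ann", "/c", 12)]
    print(run_pipeline(visits))
```

Writing the task as named steps keeps each transformation easy to read on its own, and leaves a system free to decide how the steps are actually executed.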

8)      ZooKeeper: – A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Applications of Hadoop

  1. Organizations and companies use Hadoop for research.
  2. Organizations and companies use Hadoop in production.