The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The key properties of Apache Hadoop are as follows.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer.
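The "simple programming model" mentioned above is MapReduce. The following is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python; it illustrates the model only and is not Hadoop itself, which runs these phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle(pairs))
# result["the"] == 2
```

Because each map call touches only its own input and each reduce call only its own key group, the framework can distribute both phases freely across machines.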
To set up a working Hadoop environment, the installation needs the following three subprojects: Hadoop Common (the shared utilities that support the other modules), the Hadoop Distributed File System (HDFS), and Hadoop MapReduce.
Other Hadoop-related projects are:
1) Avro: – A data serialization system.
Avro provides rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and simple integration with dynamic languages.
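Avro's central idea is that data is always written and read together with a schema, so the payload needs no per-field tags. The sketch below models that idea in plain Python using JSON for readability; real Avro uses a compact binary encoding and its own library API, neither of which is shown here.

```python
import json

# A hypothetical record schema in Avro's style: named fields with types.
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int"}],
}

def serialize(record, schema):
    # Write field values in schema order, with the schema alongside,
    # so any reader can interpret the values without field tags.
    values = [record[f["name"]] for f in schema["fields"]]
    return json.dumps({"schema": schema, "data": values})

def deserialize(blob):
    # Reconstruct the record using the embedded schema.
    payload = json.loads(blob)
    names = [f["name"] for f in payload["schema"]["fields"]]
    return dict(zip(names, payload["data"]))

blob = serialize({"name": "alice", "age": 30}, schema)
record = deserialize(blob)
# record == {"name": "alice", "age": 30}
```

Shipping the schema with the data is also what lets Avro readers handle records written under an older schema version.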
2) Cassandra: – A scalable multi-master database with no single point of failure. Apache Cassandra provides scalability and high availability without compromising performance. Linear scalability and fault tolerance make it a strong platform for critical data. Cassandra's ColumnFamily data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.
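The ColumnFamily data model can be pictured as each row key mapping to a sparse set of named columns, so different rows may carry different columns. The toy in-memory class below illustrates only that shape; real Cassandra partitions rows across nodes and persists them in log-structured storage.

```python
class ColumnFamily:
    """Toy model of a Cassandra-style column family (illustrative only)."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def insert(self, row_key, columns):
        # Columns are sparse: each row may carry a different set.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, column):
        # Look up a single column of a single row.
        return self.rows.get(row_key, {}).get(column)

users = ColumnFamily()
users.insert("user1", {"name": "alice", "email": "a@example.com"})
users.insert("user2", {"name": "bob"})  # no email column at all
# users.get("user1", "email") == "a@example.com"
# users.get("user2", "email") is None
```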
3) Chukwa: – A data collection system for managing large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. It also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
4) HBase: – A scalable, distributed database that supports structured data storage for large tables.
5) Hive: – A data warehouse infrastructure that provides data summarization and ad hoc querying.
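The kind of ad hoc summarization Hive provides is expressed in an SQL-like language (HiveQL). The sketch below uses Python's built-in SQLite module only to show what such a summarization query looks like; Hive itself compiles comparable queries into jobs over data stored in HDFS, which is not what happens here.

```python
import sqlite3

# Build a small in-memory table standing in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 10), ("about", 3), ("home", 5)])

# Ad hoc summarization: total views per page.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
# rows == [("about", 3), ("home", 15)]
```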
6) Mahout: – A scalable machine learning and data mining library.
7) Pig: – A high-level data-flow language and execution framework for parallel computation. Pig has three key properties:
a) Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data-flow sequences, making them easy to write, understand, and maintain.
b) Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
c) Extensibility. Users can create their own functions to do special-purpose processing.
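A Pig script is a sequence of explicit data transformations (for example LOAD, FILTER, GROUP, and a per-group aggregate). The plain-Python pipeline below mimics that data-flow style on made-up records; Pig Latin would express the same steps declaratively and compile them into parallel MapReduce jobs.

```python
from itertools import groupby

# Stand-in for LOAD: a relation of hypothetical (name, age) records.
records = [("alice", 25), ("bob", 17), ("carol", 31), ("dave", 25)]

# FILTER: keep only records with age >= 18.
adults = [r for r in records if r[1] >= 18]

# GROUP BY age, then for each group GENERATE (age, count).
adults.sort(key=lambda r: r[1])  # groupby requires sorted input
counts = [(age, len(list(group)))
          for age, group in groupby(adults, key=lambda r: r[1])]
# counts == [(25, 2), (31, 1)]
```

Because every step is a pure transformation of the previous relation, an engine like Pig is free to reorder or parallelize the steps, which is exactly the "optimization opportunities" property above.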
8) ZooKeeper: – A high-performance coordination service for distributed applications. ZooKeeper provides a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
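ZooKeeper exposes a hierarchical namespace of data nodes ("znodes"), and clients can set one-shot watches that fire when a node changes. The toy class below models only that idea in a single process; real ZooKeeper replicates the tree across an ensemble of servers and delivers watch events over the network.

```python
class ZNodeTree:
    """Toy single-process model of ZooKeeper's znode tree (illustrative)."""

    def __init__(self):
        self.nodes = {}    # path -> data
        self.watches = {}  # path -> list of pending watch callbacks

    def set(self, path, data):
        # Update the node, then fire and clear its one-shot watches,
        # mirroring ZooKeeper's watch semantics.
        self.nodes[path] = data
        for callback in self.watches.pop(path, []):
            callback(path, data)

    def get(self, path, watch=None):
        # Optionally register a one-shot watch while reading.
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.nodes.get(path)

events = []
tree = ZNodeTree()
tree.set("/config/db_host", "db1")
tree.get("/config/db_host", watch=lambda p, d: events.append((p, d)))
tree.set("/config/db_host", "db2")  # the watcher is notified exactly once
tree.set("/config/db_host", "db3")  # no watch left, so no notification
# events == [("/config/db_host", "db2")]
```

This is how applications keep shared configuration consistent: every client watches the same path and reacts when the coordinator updates it.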