Hadoop Big Data quick summary

Hadoop Big Data quick summary

Hadoop – is a Java based programming framework that supports the processing of large data sets in a distributed computing environment
Hadoop – is based on Google File System (GFS)
Hadoop – uses thousands of nodes this is the key to improve performance.
Hadoop – is a Distributed File System or HDFS, which enables fast data transfer among the nodes.
Hadoop Configuration – has got the three modes of Hadoop configuration – Standalone, pseudo distributed, and fully distributed.
Hadoop MapReduce – Hadoop MapReduce is the core components of Hadoop and is a programming model and helps implementation for processing and generating large data sets, it uses parallel and distributed algorithms on a cluster. it can handle large scale data: petabytes, exabytes.
Mapreduce framework converts each record of input into a key/value pair.
Ubuntu Server – Ubuntu is a leading open-source platform. it helps in utilizing the infrastructure to users when they want to deploy a cloud, a web farm, or a Hadoop cluster.
HadoopDistributed File System (HDFS)- HadoopDistributed File System (HDFS) is a block-structured, distributed file system.
Distributed Cache – Distributed Cache is a Hadoop feature that helps cache files needed by applications.

Pig – is an Apache open-source project and one of the components of the Hadoop eco-system.
Pig – is a high-level data flow scripting language and runs on the Hadoopclusters.
Pig – uses HDFS for storing and retrieving data and Hadoop MapReduce for processing Big Data.

Hive – is a data warehouse system for Hadoop.
Hive – facilitates ad hoc queries and aids analysis of data sets stored in Hadoop.
Hive – provides an SQL like language called HiveQL(HQL)

Apache HBase – is a distributed, column oriented database.
Apache HBase – is built on top of HDFS.
Apache HBase – is an open-source, distributed, versioned, non relational database system.
Apache HBase – has two types of Nodes. 1. Master and 2. Region Server.

Cloudera – is a commercial vendor for deploying Hadoopin an enterprise.
Cloudera – offers ClouderaManager for system management, ClouderaNavigator for data management.

ZooKeeper – is an open source and high performance co ordination service for distributed applications.

Pivotal HD – is a commercially supported, enterprise capable distribution of Hadoop and it aims to accelerate data analytics projects.

Sqoop – Sqoop is an Apache Hadoop ecosystem project. Sqoop’s responsibility is to import or export operations across relational databases.

Apache Oozie – is a workflow scheduler system used to manage Apache Hadoop jobs/MapReduce jobs

Mahout – is library of machine learning algorithams, helps in clustering and Clustering allows the system to group various entities into separate clusters or groups based on certain characteristics or features.

Apache Cassandra – Apache Cassandra is an open source, freely distributed, high-performance, extremely scalable, and fault-tolerant post relational database.
Apache Spark – is a powerfull open source processing engine and general MapReduce like engine used for large-scale data processing.

Apache Ambari – Apache Ambari is a completely open operational tool or framework for provisioning, managing, and monitoring Apache Hadoop clusters.
Kerberos – is a third party authentication mechanism. It has a database of the users/services and their respective Kerberos passwords.

Java quick reference – Please click here