Components of Hadoop Ecosystem

At its initial release, Hadoop comprised mainly HDFS and MapReduce, but it soon became a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung up around Hadoop, each addressing a variety of problem spaces and meeting different needs in its own unique way. With an ever-increasing number of projects built around Hadoop, this post gives a view of the various components in the Hadoop ecosystem. The actual list is extensive, so some components might be missing here.

[Image: Hadoop Ecosystem diagram]

Components of Hadoop Ecosystem
SQL: SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS).
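
As a minimal illustration of SQL in action, the sketch below uses Python's built-in sqlite3 module; the users table and its columns are invented for the example.

    import sqlite3

    # Throwaway in-memory relational database for illustration.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # DDL and DML: define a table, then insert and query rows.
    cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    conn.commit()

    for row in cur.execute("SELECT id, name FROM users WHERE name = ?", ("alice",)):
        print(row)  # -> (1, 'alice')

    conn.close()
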
NoSQL: A NoSQL (often interpreted as "Not only SQL") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.
Log Data: In computing, a log file is a file that records either events that occur in an operating system or other software, or messages between different users of communication software.
Streaming Data: Data that is transmitted and processed in a continuous flow, such as digital audio and video, Twitter feeds, Facebook updates, web-click data, and web form data.
Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data. It has a simple and flexible architecture based on streaming data flows, and it is used to collect streaming data into a Hadoop cluster.
Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
NFS: Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. NFS interface support is one way for HDFS to offer such easy integration: with NFS enabled for Hadoop, files can be browsed, downloaded, and written to and from HDFS as if it were a local file system.
Kafka: Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers such as JMS and AMQP implementations because of its higher throughput, reliability, and replication.
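
A minimal publish-subscribe sketch using the third-party kafka-python client; the broker address and the clickstream topic name are assumptions, and a real deployment would also tune acknowledgements and offsets.

    from kafka import KafkaProducer, KafkaConsumer

    # Publish a message to a topic (broker address is an assumption).
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("clickstream", b"user=42 page=/home")
    producer.flush()

    # Subscribe and read messages from the beginning of the topic.
    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:  # blocks, polling for new messages
        print(message.value)
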
MapReduce: MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
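
To make the model concrete, here is a toy single-process word count in Python; a real MapReduce job would run the map and reduce phases in parallel across a cluster, with the framework shuffling and sorting between them.

    from collections import defaultdict

    def map_phase(document):
        # Map(): emit a (word, 1) pair for every word.
        for word in document.split():
            yield (word, 1)

    def reduce_phase(pairs):
        # Reduce(): sum the counts for each key.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    print(reduce_phase(map_phase("to be or not to be")))
    # -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
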
Spark: Apache Spark is a fast and general engine for large-scale data processing, built around speed, ease of use, and sophisticated analytics.
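
The same word count expressed in PySpark, as a sketch; it assumes a local Spark installation and an input.txt file, both of which are illustrative.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")  # run on local cores

    counts = (sc.textFile("input.txt")               # one record per line
                .flatMap(lambda line: line.split())  # split lines into words
                .map(lambda word: (word, 1))         # pair each word with 1
                .reduceByKey(lambda a, b: a + b))    # sum counts per word

    print(counts.take(10))  # first ten (word, count) pairs
    sc.stop()
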
Drill: Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON and Parquet) and HBase tables without needing to define and maintain schemas in a centralized store such as the Hive metastore. This means that users can explore live data on their own as it arrives, versus spending weeks or months on data preparation, modeling, ETL, and subsequent schema management.
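
A sketch of querying a raw JSON file through Drill's REST endpoint (port 8047 by default) from Python; the local Drill instance and the file path are assumptions.

    import requests

    # Drill queries the JSON file directly; no schema definition needed.
    resp = requests.post(
        "http://localhost:8047/query.json",
        json={"queryType": "SQL",
              "query": "SELECT * FROM dfs.`/tmp/employees.json` LIMIT 5"})

    print(resp.json().get("rows"))  # result rows as a list of dicts
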
Mahout: Apache Mahout is a project to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering, and classification.
Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
Hive: Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
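
A sketch of issuing HiveQL from Python through the third-party PyHive client; the HiveServer2 host and port, and the weblogs table, are assumptions.

    from pyhive import hive

    # Connect to HiveServer2 (host/port are assumptions).
    conn = hive.Connection(host="localhost", port=10000)
    cursor = conn.cursor()

    # HiveQL looks like SQL but is compiled into cluster jobs.
    cursor.execute("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
    for row in cursor.fetchall():
        print(row)

    conn.close()
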
Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
HBase: HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS (the Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
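
A sketch of writing and reading a sparse cell via the third-party HappyBase client, which talks to HBase's Thrift gateway; the table name and column family are assumptions.

    import happybase

    connection = happybase.Connection("localhost")  # Thrift server host
    table = connection.table("users")

    # Write one cell: row key, column family "info", qualifier "name".
    table.put(b"row-42", {b"info:name": b"alice"})

    print(table.row(b"row-42"))  # -> {b'info:name': b'alice'}
    connection.close()
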
Elasticsearch: Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
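
Because the interface is RESTful, plain HTTP is enough for a sketch; the articles index and its fields are invented here, and the exact URL layout varies slightly across Elasticsearch versions.

    import requests

    # Index a schema-free JSON document under id 1; refresh=true makes
    # it searchable immediately (normally indexing is near-real-time).
    requests.put("http://localhost:9200/articles/_doc/1?refresh=true",
                 json={"title": "Hadoop ecosystem", "views": 100})

    # Full-text search with the query DSL.
    resp = requests.get("http://localhost:9200/articles/_search",
                        json={"query": {"match": {"title": "hadoop"}}})
    print(resp.json()["hits"]["hits"])  # matching docs with relevance scores
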
Solr: Solr, an open source enterprise search platform, is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more.
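
A sketch of a query against Solr's HTTP select handler (port 8983 by default); the articles core is an assumption.

    import requests

    resp = requests.get("http://localhost:8983/solr/articles/select",
                        params={"q": "title:hadoop", "wt": "json"})
    print(resp.json()["response"]["docs"])  # matching documents
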
SiLK: SiLK is a feature-rich UI that runs on top of open source Lucene/Solr and the commercially licensed Lucidworks Fusion. SiLK gives users the power to perform ad-hoc search and analysis of massive amounts of multi-structured and time series data. Users can swiftly transform their findings into visualizations and dashboards, which can easily be shared across the organization.
Web Tier: Website delivery of big data analytics, e.g., a "you may also like" feature based on the product browsing pattern.
Banana: Banana works with all kinds of time series (and non-time series) data stored in Apache Solr. It is used to create a rich and flexible UI, enabling users to rapidly develop end-to-end applications that leverage the power of Apache Solr.
Kibana: Architected to work with Elasticsearch, Kibana gives shape to any kind of data, structured and unstructured, indexed into Elasticsearch.
Data Warehouse: Data warehouses (DWs) are central repositories of integrated data from one or more disparate sources.

Comments

  1. Good composing, Ravi. Keep it up.

  2. Hello,
    Why do you draw links between the components of the ecosystem? Do the links mean something?
    Thank you.

  3. The forward linking is a kind of piping from each component into the next phase of processing.

