At its initial release, Hadoop comprised mainly HDFS and MapReduce, but Hadoop soon became a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs in its own unique way. With an ever-increasing number of projects built around Hadoop, this post aims to give a view of the various components in the Hadoop ecosystem. The full list is too long to cover in one place, so some components might be missing here.
Components of the Hadoop Ecosystem
Component | Description |
--- | --- |
SQL | SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). A minimal query sketch appears after this table. |
NoSQL | NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. |
Log Data | In computing, a log file is a file that records either events that occur in an operating system or other software, or messages exchanged between users of communication software. |
Streaming Data | Data that is transmitted and processed in a continuous flow, such as digital audio and video. Examples include Twitter and Facebook feeds, web click data, and web form data. |
Flume | Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data. It has a simple and flexible architecture based on streaming data flows, and it is used to collect streaming data into a Hadoop cluster. |
Sqoop | Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. |
NFS | Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. NFS interface support is one way for HDFS to offer such easy integration: with NFS enabled for Hadoop, files can be browsed, downloaded, and written to and from HDFS as if it were a local file system. |
Kafka | Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers based on JMS or AMQP because of its higher throughput, reliability, and replication. A producer/consumer sketch appears after this table. |
Map Reduce | MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A word-count sketch (mapper and reducer) appears after this table. |
Spark | Apache Spark is a fast and general engine for large-scale data processing, built around speed, ease of use, and sophisticated analytics. A word-count sketch appears after this table. |
Drill | Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON and Parquet) and HBase tables without needing to define and maintain schemas in a centralized store such as the Hive metastore. This means that users can explore live data on their own as it arrives, instead of spending weeks or months on data preparation, modeling, ETL, and subsequent schema management. |
Mahout | Apache Mahout is a project to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering, and classification. |
Oozie | Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). |
Hive | Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. A query sketch appears after this table. |
Pig | Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. |
HBase | HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS (the Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection). A storage sketch appears after this table. |
Elasticsearch | Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. An indexing and query sketch appears after this table. |
Solr | Solr, an open source enterprise search platform, is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration, and more. |
SiLK | SiLK is a feature-rich UI that runs on top of open source Lucene/Solr and the commercially licensed Lucidworks Fusion. SiLK gives users the power to perform ad hoc search and analysis of massive amounts of multi-structured and time series data. Users can swiftly transform their findings into visualizations and dashboards, which can easily be shared across the organization. |
Web Tier | Web site delivery of big data analytics, e.g. a "you may also like" feature based on a user's product browsing pattern. |
Banana | Banana works with all kinds of time series (and non-time series) data stored in Apache Solr. It is used to create a rich and flexible UI, enabling users to rapidly develop end-to-end applications that leverage the power of Apache Solr. |
Kibana | Architected to work with Elasticsearch, Kibana gives shape to any kind of data — structured and unstructured — indexed into Elasticsearch. |
Data Warehouse | Data warehouses (DWs) are central repositories of integrated data from one or more disparate sources. |
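A few of the entries above benefit from a concrete example, so the sketches below are added as illustrations. They are all in Python, and every host name, table, topic, and data value in them is an invented placeholder, not something from the original post. First, the SQL entry: a minimal sketch using Python's built-in sqlite3 module, a small embedded RDBMS, to show a declarative query in action.

```python
import sqlite3

# In-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                [("alice", 34), ("bob", 29)])

# A typical declarative query: filter and sort without saying how to do it.
for row in cur.execute("SELECT name, age FROM users WHERE age > 30 ORDER BY name"):
    print(row)  # ('alice', 34)

conn.close()
```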
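For the Kafka entry, a publish-subscribe sketch. It assumes the third-party kafka-python package and a broker running at localhost:9092; the "clicks" topic and the message are invented.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to a (hypothetical) "clicks" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Subscribe and read it back from the beginning of the topic.
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break
```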
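For the Map Reduce entry, the classic word count written as a mapper/reducer pair for Hadoop Streaming, which lets plain executables that read stdin and write stdout serve as the Map() and Reduce() steps. The file names are my own.

```python
#!/usr/bin/env python3
# mapper.py -- Map(): emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reduce(): sum the counts per word.
# Hadoop Streaming delivers input lines grouped and sorted by key.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

The pair can be tested locally as `cat input.txt | ./mapper.py | sort | ./reducer.py`, and submitted to a cluster via the hadoop-streaming jar with its `-mapper`, `-reducer`, `-input`, and `-output` options.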
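For the Spark entry, the same word count in PySpark; notice how much of the MapReduce boilerplate disappears. This assumes a PySpark installation, and the HDFS path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")  # placeholder path
          .flatMap(lambda line: line.split())   # map: split lines into words
          .map(lambda word: (word, 1))          # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```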
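For the Hive entry, a HiveQL query issued from Python. This sketch assumes the third-party PyHive package and a HiveServer2 instance at localhost:10000 (connection details vary with your setup); the clicks table is hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# HiveQL reads like SQL but is compiled into distributed jobs over HDFS data.
cur.execute("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")
for row in cur.fetchall():
    print(row)
```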
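For the HBase entry, a sketch of sparse, column-family storage. It assumes the third-party happybase package and an HBase Thrift server on localhost; the table name, row key, and values are invented.

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("user_actions")  # hypothetical table

# Rows are keyed byte strings; columns live inside a column family ("cf" here).
# Only the cells you actually write exist, which makes sparse data cheap to store.
table.put(b"user1|2024-01-01", {b"cf:page": b"/home", b"cf:ms": b"123"})

row = table.row(b"user1|2024-01-01")
print(row[b"cf:page"])  # b'/home'
```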
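Finally, for the Elasticsearch entry, indexing a schema-free JSON document and running a full-text query. This assumes the official elasticsearch Python client and a node at localhost:9200; the exact keyword arguments differ slightly between client versions, and the index and document are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a schema-free JSON document, then make it visible to search.
es.index(index="articles", id=1, document={"title": "Hadoop ecosystem overview"})
es.indices.refresh(index="articles")

# Full-text match query against the indexed field.
hits = es.search(index="articles", query={"match": {"title": "hadoop"}})
print(hits["hits"]["hits"][0]["_source"])
```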