At its initial release, Hadoop comprised mainly HDFS and MapReduce, but Hadoop soon became a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs in its own unique way. With an ever-increasing number of projects built around Hadoop, this post aims to give a view of the various components in the Hadoop ecosystem. The full list is too long to cover in one place, so some components might be missing here.
Components of the Hadoop Ecosystem
Component | Description |
--- | --- |
SQL | SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). A minimal query sketch appears after this table. |
NoSQL | NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. |
Log Data | In computing, a log file is a file that records either events that occur in an operating system or other software, or messages exchanged between users of communication software. |
Streaming Data | Data that is transmitted and processed in a continuous flow, such as digital audio and video. Examples include Twitter and Facebook feeds, web click data, and web form data. |
Flume | Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data. It has a simple and flexible architecture based on streaming data flows, and it is used to collect streaming data into a Hadoop cluster. |
Sqoop | Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. |
NFS | Network File System (NFS) is a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. NFS interface support is one way for HDFS to offer such easy integration: with NFS enabled for Hadoop, files can be browsed, downloaded, and written to and from HDFS as if it were a local file system. |
Kafka | Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers based on JMS or AMQP because of its higher throughput, reliability, and replication. A producer/consumer sketch appears after this table. |
Map Reduce | MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A word-count sketch (mapper and reducer) appears after this table. |
Spark | Apache Spark is a fast and general engine for large-scale data processing, built around speed, ease of use, and sophisticated analytics. A word-count sketch appears after this table. |
Drill | Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON and Parquet) and HBase tables without needing to define and maintain schemas in a centralized store such as the Hive metastore. This means that users can explore live data on their own as it arrives, instead of spending weeks or months on data preparation, modeling, ETL, and subsequent schema management. |
Mahout | Apache Mahout is a project to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering, and classification. |
Oozie | Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). |
Hive | Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. A query sketch appears after this table. |
Pig | Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. |
HBase | HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS (the Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection). A storage sketch appears after this table. |
Elasticsearch | Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. An indexing and query sketch appears after this table. |
Solr | Solr, an open source enterprise search platform, is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration, and more. |
SiLK | SiLK is a feature-rich UI that runs on top of open source Lucene/Solr and the commercially licensed Lucidworks Fusion. SiLK gives users the power to perform ad hoc search and analysis of massive amounts of multi-structured and time series data. Users can swiftly transform their findings into visualizations and dashboards, which can easily be shared across the organization. |
Web Tier | Web site delivery of big data analytics, e.g. a "you may also like" feature based on a user's product browsing pattern. |
Banana | Banana works with all kinds of time series (and non-time series) data stored in Apache Solr. It is used to create a rich and flexible UI, enabling users to rapidly develop end-to-end applications that leverage the power of Apache Solr. |
Kibana | Architected to work with Elasticsearch, Kibana gives shape to any kind of data — structured and unstructured — indexed into Elasticsearch. |
Data Warehouse | Data warehouses (DWs) are central repositories of integrated data from one or more disparate sources. |
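A few of the entries above benefit from a concrete example, so the sketches below are added as illustrations. They are all in Python, and every host name, table, topic, and data value in them is an invented placeholder, not something from the original post. First, the SQL entry: a minimal sketch using Python's built-in sqlite3 module, a small embedded RDBMS, to show a declarative query in action.

```python
import sqlite3

# In-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
cur.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                [("alice", 34), ("bob", 29)])

# A typical declarative query: filter and sort without saying how to do it.
for row in cur.execute("SELECT name, age FROM users WHERE age > 30 ORDER BY name"):
    print(row)  # ('alice', 34)

conn.close()
```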
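For the Kafka entry, a publish-subscribe sketch. It assumes the third-party kafka-python package and a broker running at localhost:9092; the "clicks" topic and the message are invented.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to a (hypothetical) "clicks" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Subscribe and read it back from the beginning of the topic.
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break
```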
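For the Map Reduce entry, the classic word count written as a mapper/reducer pair for Hadoop Streaming, which lets plain executables that read stdin and write stdout serve as the Map() and Reduce() steps. The file names are my own.

```python
#!/usr/bin/env python3
# mapper.py -- Map(): emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reduce(): sum the counts per word.
# Hadoop Streaming delivers input lines grouped and sorted by key.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

The pair can be tested locally as `cat input.txt | ./mapper.py | sort | ./reducer.py`, and submitted to a cluster via the hadoop-streaming jar with its `-mapper`, `-reducer`, `-input`, and `-output` options.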
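For the Spark entry, the same word count in PySpark; notice how much of the MapReduce boilerplate disappears. This assumes a PySpark installation, and the HDFS path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")  # placeholder path
          .flatMap(lambda line: line.split())   # map: split lines into words
          .map(lambda word: (word, 1))          # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```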
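For the Hive entry, a HiveQL query issued from Python. This sketch assumes the third-party PyHive package and a HiveServer2 instance at localhost:10000 (connection details vary with your setup); the clicks table is hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# HiveQL reads like SQL but is compiled into distributed jobs over HDFS data.
cur.execute("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")
for row in cur.fetchall():
    print(row)
```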
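For the HBase entry, a sketch of sparse, column-family storage. It assumes the third-party happybase package and an HBase Thrift server on localhost; the table name, row key, and values are invented.

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("user_actions")  # hypothetical table

# Rows are keyed byte strings; columns live inside a column family ("cf" here).
# Only the cells you actually write exist, which makes sparse data cheap to store.
table.put(b"user1|2024-01-01", {b"cf:page": b"/home", b"cf:ms": b"123"})

row = table.row(b"user1|2024-01-01")
print(row[b"cf:page"])  # b'/home'
```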
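Finally, for the Elasticsearch entry, indexing a schema-free JSON document and running a full-text query. This assumes the official elasticsearch Python client and a node at localhost:9200; the exact keyword arguments differ slightly between client versions, and the index and document are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a schema-free JSON document, then make it visible to search.
es.index(index="articles", id=1, document={"title": "Hadoop ecosystem overview"})
es.indices.refresh(index="articles")

# Full-text match query against the indexed field.
hits = es.search(index="articles", query={"match": {"title": "hadoop"}})
print(hits["hits"]["hits"][0]["_source"])
```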