<meta charset='utf-8' />
<meta http-equiv="X-UA-Compatible" content="chrome=1" />
<meta name="description" content="Hadoopecosystemtable.github.io : This page is a summary to keep track of Hadoop-related projects, and relevant projects around the Big Data scene, focused on the open source, free software environment." />
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>The Hadoop Ecosystem Table</title>
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/hadoopecosystemtable/hadoopecosystemtable.github.io">Fork Me on GitHub</a>
<h1 id="project_title">The Hadoop Ecosystem Table</h1>
<h2 id="project_tagline">This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.</h2>
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<table class="example3">
<!-- Distributed Filesystem -->
<th colspan="3">Distributed Filesystem</th>
<td width="30%">Apache HDFS</td>
The Hadoop Distributed File System (HDFS) offers a way to store large files across
multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper.
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster.
With ZooKeeper, the HDFS High Availability feature addresses this problem by providing
the option of running two redundant NameNodes in the same cluster in an Active/Passive
configuration with a hot standby.
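As an illustration of the storage model described above, here is a toy sketch in plain Python of splitting a file into fixed-size blocks and replicating each block across DataNodes. This is not the real HDFS API; the round-robin placement and node names are invented for illustration (real HDFS placement is rack-aware).

```python
# Toy sketch of HDFS-style block placement (illustration only, not the HDFS API):
# a file is split into fixed-size blocks, and each block is assigned
# to `replication` distinct DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def place_blocks(file_size, datanodes, replication=3):
    """Return a mapping of block index -> list of DataNodes holding a replica."""
    n_blocks = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    placement = {}
    for b in range(n_blocks):
        # round-robin placement over the node list; real HDFS is rack-aware
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(file_size=300 * 1024 * 1024, datanodes=nodes)
print(layout)  # 3 blocks (300 MB / 128 MB rounded up), each with 3 replicas
```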
<td width="20%"><a href="http://hadoop.apache.org/">1. hadoop.apache.org</a>
<br> <a href="http://research.google.com/archive/gfs.html">2. Google FileSystem - GFS Paper</a>
<br> <a href="http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/">3. Cloudera Why HDFS</a>
<br> <a href="http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/">4. Hortonworks Why HDFS</a>
<td width="20%">Red Hat GlusterFS</td>
GlusterFS is a scale-out network-attached storage file system. GlusterFS was
originally developed by Gluster, Inc., and then by Red Hat, Inc., after its
purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was
announced as a commercially supported integration of GlusterFS with
Red Hat Enterprise Linux; the Gluster file system is now known as Red Hat Storage Server.
<td width="20%"><a href="http://www.gluster.org/">1. www.gluster.org</a>
<br><a href="http://www.redhat.com/about/news/archive/2013/10/red-hat-contributes-apache-hadoop-plug-in-to-the-gluster-community">2. Red Hat Hadoop Plugin</a>
<td width="20%">Quantcast File System QFS</td>
QFS is an open-source distributed file system software package for
large-scale MapReduce or other batch-processing workloads. It was
designed as an alternative to Apache Hadoop's HDFS, intended to deliver
better performance and cost-efficiency for large-scale processing clusters.
It is written in C++ and has fixed-footprint memory management. QFS uses
Reed-Solomon error correction as its method of ensuring reliable access to data.<br>
Reed-Solomon coding is very widely used in mass storage systems to correct the burst
errors associated with media defects. Rather than storing three full copies of
each file like HDFS, resulting in the need for three times more storage, QFS
only needs 1.5x the raw capacity, because it stripes data across nine different disk drives.
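The 3x-vs-1.5x claim above can be checked with simple arithmetic. A sketch, assuming a (6 data + 3 parity) Reed-Solomon layout striped across nine drives, which matches the nine-drive striping described above:

```python
# Storage overhead: raw bytes stored per logical byte of user data.
def storage_overhead(data_chunks, parity_chunks):
    """Overhead of striping data_chunks + parity_chunks over that many drives."""
    return (data_chunks + parity_chunks) / data_chunks

# HDFS-style 3-way replication stores 3 full copies of every block:
print(3 / 1)                   # 3.0x raw capacity
# Reed-Solomon with 6 data chunks + 3 parity chunks across 9 drives:
print(storage_overhead(6, 3))  # 1.5x raw capacity
```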
<td width="20%"><a href="https://www.quantcast.com/engineering/qfs/">1. QFS site</a>
<br><a href="https://github.com/quantcast/qfs">2. GitHub QFS</a>
<br><a href="https://issues.apache.org/jira/browse/HADOOP-8885">3. HADOOP-8885</a>
<td width="30%">Ceph Filesystem</td>
Ceph is a free software storage platform designed to present object, block,
and file storage from a single distributed computer cluster. Ceph's main
goals are to be completely distributed without a single point of failure,
scalable to the exabyte level, and freely available. The data is replicated,
making it fault tolerant.
<td width="20%"><a href="http://ceph.com/ceph-storage/file-system/">1. Ceph Filesystem site</a>
<br><a href="http://ceph.com/docs/next/cephfs/hadoop/">2. Ceph and Hadoop</a>
<br><a href="https://issues.apache.org/jira/browse/HADOOP-6253">3. HADOOP-6253</a>
<td width="30%">Lustre file system</td>
The Lustre filesystem is a high-performance distributed filesystem
intended for large network and high-availability environments.
Traditionally, Lustre is configured to manage remote data storage
disk devices within a Storage Area Network (SAN), which is two or
more remotely attached disk devices communicating via a Small Computer
System Interface (SCSI) protocol. This includes Fibre Channel, Fibre
Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.<br>
With Hadoop HDFS, the software needs a dedicated cluster of computers
on which to run. But people who run high-performance computing clusters
for other purposes often don't run HDFS, which leaves them with a lot
of computing power, tasks that could almost certainly benefit from a bit
of MapReduce, and no way to put that power to work running Hadoop. Intel
noticed this and, in version 2.5 of its Hadoop distribution, added support
for Lustre: the Intel® HPC Distribution for Apache Hadoop* Software, a new
product that combines the Intel Distribution for Apache Hadoop software
with the Intel® Enterprise Edition for Lustre software.
This is the only distribution of Apache Hadoop that is integrated with Lustre,
the parallel file system used by many of the world's fastest supercomputers.
<td width="20%"><a href="http://wiki.lustre.org/">1. wiki.lustre.org</a>
<br><a href="http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre">2. Hadoop with Lustre</a>
<br><a href="http://hadoop.intel.com/products/distribution">3. Intel HPC Hadoop</a>
<td width="30%">Alluxio</td>
Alluxio, the world's first memory-centric virtual distributed storage system, unifies data access
and bridges computation frameworks and underlying storage systems. Applications only need to connect
with Alluxio to access data stored in any underlying storage system. Additionally, Alluxio's
memory-centric architecture enables data access orders of magnitude faster than existing solutions.
In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark,
Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3,
OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvements
to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times.
Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems.
Users can run Alluxio in its standalone cluster mode, for example on Amazon EC2, or launch Alluxio
with Apache Mesos or Apache YARN.
Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of
Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at
multiple companies. It is one of the fastest-growing open source projects. With less than three years
of open source history, Alluxio has attracted more than 160 contributors from over 50 institutions,
including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo.
The project is the storage layer of the Berkeley Data Analytics Stack (BDAS).
<td width="20%"><a href="http://www.alluxio.org/">1. Alluxio site</a>
<td width="30%">GridGain</td>
GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the
In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and MapReduce by bringing both data
and computations into memory. This work is done with GGFS - a Hadoop-compliant in-memory file system.
For I/O-intensive jobs, GridGain GGFS offers performance close to 100x faster than standard HDFS.
Paraphrasing Dmitriy Setrakyan from GridGain Systems, talking about GGFS in relation to Tachyon:
<li>GGFS allows read-through and write-through to/from an underlying HDFS or any
other Hadoop-compliant file system with zero code change. Essentially, GGFS
entirely removes the ETL step from the integration.</li>
<li>GGFS has the ability to pick and choose which folders stay in memory, which
folders stay on disk, and which folders get synchronized with the underlying
(HD)FS either synchronously or asynchronously.</li>
<li>GridGain is working on adding a native MapReduce component which will
provide complete native Hadoop integration without changes in the API, unlike
what Spark currently forces you to do. Essentially, GridGain MR+GGFS will make
it possible to bring Hadoop completely or partially in-memory in a plug-and-play
fashion without any API changes.</li>
<td width="20%"><a href="http://www.gridgain.org/">1. GridGain site</a>
<td width="30%">XtreemFS</td>
XtreemFS is a general-purpose storage system that covers most storage needs in a single deployment.
It is open source, requires no special hardware or kernel modules, and can be mounted on Linux.
XtreemFS runs distributed and offers resilience through replication. XtreemFS volumes can be accessed
through a FUSE component that offers normal file interaction with POSIX-like semantics. Furthermore, an
implementation of Hadoop's FileSystem interface is included, which makes XtreemFS available for use with
Hadoop, Flink and Spark out of the box.
XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by the Zuse Institute Berlin.
The development of the project has been funded by the European Commission since 2006 under
Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid,
"First We Take Berlin", FFMK, GeoMultiSens, and BBDC.
<td width="20%"><a href="http://www.xtreemfs.org/">1. XtreemFS site</a>
<br><a href="https://github.com/xtreemfs/xtreemfs/wiki/Apache-Flink-with-XtreemFS">2. Flink on XtreemFS</a>
<br><a href="https://github.com/xtreemfs/xtreemfs/wiki/Apache-Spark-with-XtreemFS">3. Spark on XtreemFS</a>
<!-- Distributed Programming-->
<th colspan="3">Distributed Programming</th>
<td width="20%">Apache Ignite</td>
Apache Ignite In-Memory Data Fabric is a distributed in-memory platform
for computing and transacting on large-scale data sets in real time.
It includes a distributed key-value in-memory store, SQL capabilities,
map-reduce and other computations, distributed data structures,
continuous queries, messaging and events subsystems, and Hadoop and Spark integration.
Ignite is built in Java and provides .NET and C++ APIs.
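A minimal sketch, in Python, of the partitioned in-memory key-value store idea at the heart of such a data fabric. This is an illustration only, not the Ignite API (Ignite's actual APIs are Java, .NET and C++); the class and method names are invented.

```python
# Toy partitioned in-memory key-value store: keys hash to a partition,
# and in a real cluster each partition would live on a different node.
class KeyValueGrid:
    def __init__(self, n_partitions=4):
        self.partitions = [{} for _ in range(n_partitions)]

    def _partition(self, key):
        # hash-based routing decides which partition (node) owns the key
        return self.partitions[hash(key) % len(self.partitions)]

    def put(self, key, value):
        self._partition(key)[key] = value

    def get(self, key):
        return self._partition(key).get(key)

grid = KeyValueGrid()
grid.put("user:1", {"name": "Ada"})
grid.put("user:2", {"name": "Grace"})
print(grid.get("user:1"))  # {'name': 'Ada'}
```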
<td width="20%"><a href="http://ignite.apache.org/">1. Apache Ignite</a>
<br> <a href="https://apacheignite.readme.io/">2. Apache Ignite documentation</a>
<td width="20%">Apache MapReduce</td>
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster. Apache MapReduce was derived from Google's
paper "MapReduce: Simplified Data Processing on Large Clusters". The current
Apache MapReduce version is built on the Apache YARN framework. YARN stands
for "Yet Another Resource Negotiator". It is a newer framework that facilitates
writing arbitrary distributed processing frameworks and applications. YARN's
execution model is more generic than the earlier MapReduce implementation:
YARN can run applications that do not follow the MapReduce model, unlike the
original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt
to take Apache Hadoop beyond MapReduce for data processing.
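The MapReduce programming model itself can be sketched in a few lines of plain Python (no Hadoop involved): a map function emits key-value pairs, the pairs are shuffled by key, and a reduce function aggregates each key's values.

```python
# Minimal sketch of the MapReduce model: map -> shuffle by key -> reduce.
from collections import defaultdict

def map_fn(line):
    # map phase: emit (word, 1) for every word in the input line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce phase: aggregate all values emitted for one key
    return word, sum(counts)

def mapreduce(lines):
    shuffled = defaultdict(list)
    for line in lines:                       # map
        for key, value in map_fn(line):
            shuffled[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in shuffled.items())  # reduce

counts = mapreduce(["to be or not", "to be"])
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Hadoop, the same three phases run distributed: mappers and reducers execute on different nodes, and the shuffle moves data between them over the network.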
<td width="20%"><a href="http://wiki.apache.org/hadoop/MapReduce/">1. Apache MapReduce</a>
<br> <a href="http://research.google.com/archive/mapreduce.html">2. Google MapReduce paper</a>
<br> <a href="http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">3. Writing YARN applications</a>
<td width="20%">Apache Pig</td>
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language,
Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the
traditional data operations (join, sort, filter, etc.), as well as the ability for users
to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop.
It makes use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce.<br>
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts
that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks
different from many of the programming languages you may have seen: there are no if statements or for
loops in Pig Latin. This is because traditional procedural and object-oriented programming languages
describe control flow, with data flow a side effect of the program. Pig Latin instead focuses on data flow.
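To make the dataflow style concrete, here is a plain-Python rendering of what a short Pig Latin script (LOAD, FILTER, GROUP, COUNT) expresses; this is an illustration only, not Pig Latin, and the sample records are invented.

```python
# Dataflow-style pipeline: each step consumes the previous step's output,
# with no user-written if statements or explicit iteration logic.
from collections import Counter

records = [                       # LOAD: (user, event) tuples
    ("alice", "login"), ("bob", "login"),
    ("alice", "click"), ("alice", "login"),
]

logins = [user for user, event in records if event == "login"]  # FILTER
counts = Counter(logins)          # GROUP BY user + COUNT
print(counts)                     # Counter({'alice': 2, 'bob': 1})
```

Pig compiles an equivalent script into MapReduce jobs; here the whole pipeline simply runs in one process.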
<td width="20%"><a href="https://pig.apache.org/">1. pig.apache.org</a>
<br> <a href="https://github.com/alanfgates/programmingpig">2. Pig examples by Alan Gates</a>
<td width="20%">JAQL</td>
JAQL is a functional, declarative programming language designed especially for working with large
volumes of structured, semi-structured and unstructured data. As its name implies, a primary
use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data.
For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL"
capability lets programmers work with structured SQL data while employing a JSON data model that's less
restrictive than its Structured Query Language counterparts.<br>
Specifically, JAQL allows you to select, join, group, and filter data that is stored in HDFS, much
like a blend of Pig and Hive. JAQL's query language was inspired by many programming and query languages,
including Lisp, SQL, XQuery, and Pig. <br>
JAQL was created by researchers at IBM Research Labs in 2008 and released to open source. While it continues
to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license,
the major development activity around JAQL has remained centered at IBM. The company offers the query language
as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a
workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs.
It also provides links to external data and services, including relational databases and machine learning data.
<td width="20%"><a href="https://code.google.com/p/jaql/">1. JAQL in Google Code</a>
<br> <a href="http://www-01.ibm.com/software/data/infosphere/hadoop/jaql/">2. What is Jaql? by IBM</a>
<td width="20%">Apache Spark</td>
A data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley.
Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).
However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times
faster than previous-generation systems like Hadoop MapReduce for certain applications.<br>
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce
does, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with
Hadoop and its built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel),
and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets.<br>
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark
interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark,
a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
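The functional, chained API style described above can be sketched with a tiny stand-in "RDD" class in plain Python. This is an illustration of the programming style only, not the real Spark API (which is `pyspark.SparkContext` / `pyspark.RDD`), and `MiniRDD` is an invented name.

```python
# Toy stand-in for Spark's RDD: transformations (map/filter) build new
# datasets, and an action (reduce) produces a result.
from functools import reduce as _reduce

class MiniRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, f):
        return MiniRDD(x for x in self.data if f(x))

    def reduce(self, f):
        return _reduce(f, self.data)

total = (MiniRDD(range(10))
         .filter(lambda x: x % 2 == 0)   # keep evens: 0, 2, 4, 6, 8
         .map(lambda x: x * x)           # square them: 0, 4, 16, 36, 64
         .reduce(lambda a, b: a + b))    # sum the squares
print(total)  # 120
```

In real Spark, each transformation is lazy and the data is partitioned across a cluster; the chained style of the program is the same.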
<td width="20%"><a href="http://spark.apache.org/">1. Apache Spark</a>
<br> <a href="https://github.com/apache/spark">2. Mirror of Spark on Github</a>
<br> <a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">3. RDDs - Paper</a>
<br> <a href="https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf">4. Spark: Cluster Computing... - Paper</a>
<br> <a href="http://spark.apache.org/research.html">5. Spark Research</a>
<td width="20%">Apache Storm</td>
Storm is a complex event processor (CEP) and distributed computation
framework written predominantly in the Clojure programming language.
It is a distributed real-time computation system for processing fast,
large streams of data. Storm's architecture is based on the master-worker
paradigm, so a Storm cluster mainly consists of a master node and worker
nodes, with coordination done by ZooKeeper. <br>
Storm makes use of ZeroMQ (0MQ), an advanced, embeddable
networking library. It provides a message queue, but unlike
message-oriented middleware (MOM), a 0MQ system can run without
a dedicated message broker. The library is designed to have a
familiar socket-style API.<br>
Originally created by Nathan Marz and his team at BackType, the
project was open sourced after being acquired by Twitter. Storm
was initially developed and deployed at BackType in 2011. After
7 months of development, BackType was acquired by Twitter in July
2011. Storm was open sourced in September 2011. <br>
Hortonworks is developing a Storm-on-YARN version and plans to
finish the base-level integration in Q4 2013. Yahoo/Hortonworks
also plans to move the Storm-on-YARN code from
github.com/yahoo/storm-yarn to be a subproject of the
Apache Storm project in the near future.<br>
Twitter has recently released a Hadoop-Storm hybrid called
"Summingbird." Summingbird fuses the two frameworks into one,
allowing developers to use Storm for short-term processing
and Hadoop for deep data dives: a system that aims to mitigate
the tradeoffs between batch processing and stream processing by
combining them into a hybrid system.
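Storm structures computations as topologies of spouts (stream sources) and bolts (stream transformations). A toy sketch of that idea with Python generators; this is an illustration only, not the Storm API (which is JVM-based), and the sample sentences are invented.

```python
# Spout: emits a stream of tuples (here, a fixed list stands in for a live feed).
def sentence_spout():
    for sentence in ["the cow jumped", "over the moon"]:
        yield sentence

# Bolt: transforms one stream into another (split sentences into words).
def split_bolt(stream):
    for sentence in stream:
        for word in sentence.split():
            yield word

# Terminal bolt: aggregates the stream (running word counts).
def count_bolt(stream):
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

word_counts = count_bolt(split_bolt(sentence_spout()))
print(word_counts)  # {'the': 2, 'cow': 1, 'jumped': 1, 'over': 1, 'moon': 1}
```

In a real topology the spout never terminates, and each bolt runs as many parallel tasks across worker nodes, with tuples routed between them.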
<td width="20%"><a href="http://storm-project.net/">1. Storm Project</a>
<br> <a href="https://github.com/yahoo/storm-yarn">2. Storm-on-YARN</a>
<td width="20%">Apache Flink</td>
Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala,
a high-performance runtime, and automatic program optimization. It has native support for iterations,
incremental iterations, and programs consisting of large DAGs of operations.<br>
Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with
its own runtime rather than building on top of MapReduce. As such, it can work completely independently
of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read
and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources.
Since most Flink users store their data in Hadoop HDFS, Flink already ships the libraries required to access HDFS.
<td width="20%"><a href="http://flink.incubator.apache.org/">1. Apache Flink incubator page</a>
<br><a href="http://stratosphere.eu/">2. Stratosphere site</a>
<td width="20%">Apache Apex</td>
Apache Apex is an enterprise-grade, Apache YARN-based, big data-in-motion platform that
unifies stream processing as well as batch processing. It processes big data
in motion in a highly scalable, highly performant, fault-tolerant, stateful,
secure, distributed, and easily operable way. It provides a simple API that
enables users to write or re-use generic Java code, thereby lowering the expertise
needed to write big data applications. <p>
The Apache Apex platform is supplemented by Apache Apex-Malhar,
a library of operators that implement common business logic
functions needed by customers who want to quickly develop applications.
These operators provide access to HDFS, S3, NFS, FTP, and other file systems;
Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; and MySQL, Cassandra,
MongoDB, Redis, HBase, CouchDB and other databases along with JDBC connectors.
The library also includes a host of other common business logic patterns that
help users significantly reduce the time it takes to go into production.
Ease of integration with all other big data technologies is one of the primary
missions of Apache Apex-Malhar.<p>
Apex, available on GitHub, is the core technology upon which DataTorrent's
commercial offering, DataTorrent RTS 3, along with other technology such as
a data ingestion tool called dtIngest, is based.
<td width="20%"><a href="https://www.datatorrent.com/apex/">1. Apache Apex from DataTorrent</a>
<br><a href="http://apex.incubator.apache.org/">2. Apache Apex main page</a>
<br><a href="https://wiki.apache.org/incubator/ApexProposal">3. Apache Apex Proposal</a>
<td width="20%">Netflix PigPen</td>
PigPen is map-reduce for Clojure that compiles to Apache Pig. Clojure is a dialect of the Lisp programming
language created by Rich Hickey; it is a functional, general-purpose language that runs on the Java Virtual Machine,
the Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs):
define Clojure functions, anonymously or named, and use them as you would in any Clojure program. The tool
is open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.
<td width="20%"><a href="https://github.com/Netflix/PigPen">1. PigPen on GitHub</a>
<td width="20%">AMPLab SIMR</td>
Apache Spark was developed with Apache YARN in mind. However, up to now it has been relatively hard to run
Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically,
users would have to get permission to install Spark/Scala on some subset of the machines, a process that
could be time-consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out
of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights,
and without having Spark or Scala installed on any of the nodes.
<td width="20%"><a href="http://databricks.github.io/simr/">1. SIMR on GitHub</a>
<td width="20%">Facebook Corona</td>
"The next version of Map-Reduce" from Facebook, based on its own fork of Hadoop. The current Hadoop implementation
of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets.
The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook
engineers looked at but discounted because of the highly customised nature of the company's deployment of Hadoop and HDFS.
Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case).
<td width="20%"><a href="https://github.com/facebookarchive/hadoop-20/tree/master/src/contrib/corona">1. Corona on Github</a>
<td width="20%">Apache REEF</td>
Apache REEF™ (Retainable Evaluator Execution Framework) is a library for developing portable
applications for cluster resource managers such as Apache Hadoop™ YARN or Apache Mesos™.
Apache REEF drastically simplifies development on those resource managers through the following features:
Centralized control flow: Apache REEF turns the chaos of a distributed application into events in a
single machine, the Job Driver. Events include container allocation, Task launch, completion and
failure. For failures, Apache REEF makes every effort to make the actual `Exception` thrown by the
Task available to the Driver.
Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in
every container of a REEF application. Evaluators can keep data in memory in between Tasks, which
enables efficient pipelines on REEF.
Support for multiple resource managers: Apache REEF applications are portable to any supported resource
manager with minimal effort. Further, new resource managers are easy to support in REEF.
.NET and Java API: Apache REEF is the only API for writing YARN or Mesos applications in .NET. Further, a
single REEF application is free to mix and match Tasks written for .NET or Java.
Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding
bloat to the core. REEF includes many Services, such as name-based communication between Tasks,
MPI-inspired group communication (Broadcast, Reduce, Gather, ...), and data ingress.
<td width="20%"><a href="https://reef.apache.org">1. Apache REEF Website</a>
<td width="20%">Apache Twill</td>
Twill is an abstraction over Apache Hadoop® YARN that reduces the
complexity of developing distributed applications, allowing developers
to focus more on their business logic. Twill uses a simple thread-based model that Java
programmers will find familiar. YARN can be viewed as a compute
fabric of a cluster, which means YARN applications like Twill will
run on any Hadoop 2 cluster.<br>
YARN is an open source application that allows a Hadoop cluster
to turn into a collection of virtual machines. Weave, developed by
Continuuity and initially housed on GitHub, is a complementary open
source application that uses a programming model similar to Java
threads, making it easy to write distributed applications. In order to remove
a conflict with a similarly named project on Apache, called "Weaver,"
Weave's name changed to Twill when it moved to Apache incubation.<br>
Twill functions as a scaled-out proxy. It is a middleware layer
between YARN and any application on YARN. When you develop a
Twill app, Twill handles the YARN APIs through a model that resembles
a familiar multi-threaded Java application. It is very easy to build
multi-processed distributed applications in Twill.
<td width="20%"><a href="https://incubator.apache.org/projects/twill.html">1. Apache Twill Incubator</a>
<td width="20%">Damballa Parkour</td>
A library for developing MapReduce programs using the Lisp-like language Clojure. Parkour aims to provide deep Clojure
integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions
instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete
access to absolutely everything possible in raw Java Hadoop MapReduce.
<td width="20%"><a href="https://github.com/damballa/parkour">1. Parkour GitHub Project</a>
<td width="20%">Apache Hama</td>
An Apache top-level open source project that allows you to do advanced analytics beyond MapReduce. Many data
analysis techniques, such as machine learning and graph algorithms, require iterative computations;
this is where the Bulk Synchronous Parallel model can be more effective than "plain" MapReduce.
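The Bulk Synchronous Parallel (BSP) model proceeds in supersteps: every peer computes locally, exchanges messages, then waits at a global barrier before the next superstep. A toy single-process sketch (illustration only, not the Hama API; here every peer broadcasts its value and converges to the global average):

```python
# Toy BSP: each superstep = local compute + message exchange + barrier.
def bsp_average(values, supersteps=5):
    peers = list(values)
    for _ in range(supersteps):
        # communication phase: every peer sends its value to every peer
        inbox = [list(peers) for _ in peers]
        # barrier: all messages are delivered before anyone proceeds,
        # then each peer computes the average of what it received
        peers = [sum(msgs) / len(msgs) for msgs in inbox]
    return peers

print(bsp_average([1.0, 3.0, 8.0]))  # [4.0, 4.0, 4.0]
```

The barrier between supersteps is what makes iterative algorithms (PageRank, k-means, shortest paths) natural in BSP, where plain MapReduce would need one full job per iteration.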
<td width="20%"><a href="http://hama.apache.org/">1. Hama site</a>
<td width="20%">Datasalt Pangool</td>
A new MapReduce paradigm: an API for MR jobs at a higher level than plain Java MapReduce.
<td width="20%"><a href="http://pangool.net">1. Pangool</a>
<br> <a href="https://github.com/datasalt/pangool">2. GitHub Pangool</a>
<td width="20%">Apache Tez</td>
Tez is a proposal to develop a generic application framework which can be used to process complex
data-processing task DAGs, and which runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce
paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not
meant directly for end-users; rather, it enables developers to build end-user applications with much
better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large
amounts of data. However, there are many use cases for near-real-time query processing, as well as
several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps
Hadoop address these use cases. The Tez framework is part of the Stinger initiative (a low-latency,
SQL-style query interface for Hadoop based on Hive).
<td width="20%"><a href="http://incubator.apache.org/projects/tez.html">1. Apache Tez Incubator</a>
<br> <a href="http://hortonworks.com/hadoop/tez/">2. Hortonworks Apache Tez page</a>
<td width="20%">Apache DataFu</td>
DataFu provides a collection of Hadoop MapReduce jobs, and functions in higher-level languages built
on top of them, for performing data analysis. It provides functions for common statistics tasks (e.g. quantiles,
sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop
jobs for incremental data processing in MapReduce. DataFu is a collection of Pig UDFs (including PageRank,
sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn.
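One of the statistics mentioned above, quantiles, can be sketched in a few lines of plain Python. This is an illustration of the statistic only; DataFu itself ships quantile computation as Pig UDFs and MapReduce jobs, and the nearest-rank method below is just one of several quantile definitions.

```python
# Nearest-rank quantile of a pre-sorted list, for 0 <= q <= 1.
def quantile(sorted_values, q):
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

data = sorted([3, 1, 4, 1, 5, 9, 2, 6])   # -> [1, 1, 2, 3, 4, 5, 6, 9]
print(quantile(data, 0.5))   # 4 (median under nearest-rank)
print(quantile(data, 0.25))  # 2
```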
<td width="20%"><a href="http://incubator.apache.org/projects/datafu.html">1. DataFu Apache Incubator</a>
<td width="20%">Pydoop</td>
Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++
Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce
applications with HDFS access. Pydoop has several advantages over Hadoop's built-in
solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython
package, it allows you to access all standard library and third-party modules,
some of which may not be available otherwise.
<td width="20%"><a href="http://pydoop.sourceforge.net/docs/">1. SF Pydoop site</a>
<br> <a href="https://github.com/crs4/pydoop">2. Pydoop GitHub Project</a>
<td width="20%">Kangaroo</td>
An open-source project from Conductor for writing MapReduce jobs that consume data from Kafka.
The introductory post explains Conductor's use case: loading data from Kafka to HBase
by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions,
which are limited to a single InputSplit per Kafka partition, Kangaroo can launch
multiple consumers at different offsets in the stream of a single partition for
increased throughput and parallelism.
<td width="20%"><a href="http://www.conductor.com/nightlight/data-stream-processing-bulk-kafka-hadoop/">1. Kangaroo Introduction</a>
<br> <a href="https://github.com/Conductor/kangaroo">2. Kangaroo GitHub Project</a>
<td width="20%">TinkerPop</td>
A graph computing framework written in Java. It provides a core API that graph system vendors can implement.
There are various types of graph systems, including in-memory graph libraries, OLTP graph databases,
and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system
can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled
algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community.
606
<td width="20%"><a href="https://wiki.apache.org/incubator/TinkerPopProposal">1. Apache Tinkerpop Proposal</a>
607
<br> <a href="http://www.tinkerpop.com/">2. TinkerPop site</a>
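The Gremlin traversal style — start from a set of vertices and repeatedly walk labelled edges — can be mimicked over a plain in-memory adjacency list. A toy sketch (hypothetical graph data; real Gremlin is far richer):

```python
# Tiny property graph: vertex -> list of (edge_label, target_vertex).
graph = {
    "alice": [("knows", "bob"), ("knows", "carol")],
    "bob":   [("knows", "carol")],
    "carol": [],
}

def out(vertices, label):
    """Follow outgoing edges with the given label (like Gremlin's out())."""
    return [dst for v in vertices
                for (lbl, dst) in graph.get(v, []) if lbl == label]

# Roughly g.V("alice").out("knows").out("knows") in Gremlin terms:
friends_of_friends = out(out(["alice"], "knows"), "knows")
```

The point of TinkerPop is that the same traversal runs unchanged over any backend that implements the core interfaces.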
<td width="20%">Pachyderm MapReduce</td>
Pachyderm is a completely new MapReduce engine built on top of Docker and CoreOS.
In Pachyderm MapReduce (PMR) a job is an HTTP server inside a Docker container
(a microservice). You give Pachyderm a Docker image and it will automatically
distribute it throughout the cluster next to your data. Data is POSTed to
the container over HTTP and the results are stored back in the file system.
You can implement the web server in any language you want and pull in any library.
Pachyderm also creates a DAG for all the jobs in the system and their dependencies,
and it automatically schedules the pipeline such that each job isn’t run until its
dependencies have completed. Everything in Pachyderm “speaks in diffs”, so it knows
exactly which data has changed and which subsets of the pipeline need to be rerun.
CoreOS is an open-source lightweight operating system forked from Chrome OS that
provides only the minimal functionality required for deploying applications inside
software containers, together with built-in mechanisms for service discovery and
configuration sharing.
<td width="20%"><a href="http://www.pachyderm.io/">1. Pachyderm site</a>
<br> <a href="https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f">2. Pachyderm introduction article</a>
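The dependency-aware scheduling described above is, at its core, a topological ordering of the job DAG: a job becomes runnable only once everything it depends on has finished. A minimal stdlib sketch (hypothetical job names, not Pachyderm's API):

```python
def schedule(jobs):
    """Return an execution order where every job follows its dependencies.

    `jobs` maps a job name to the set of jobs it depends on (a DAG).
    """
    order, done = [], set()
    pending = dict(jobs)
    while pending:
        # A job is ready once all of its dependencies are done.
        ready = [j for j, deps in pending.items() if deps <= done]
        if not ready:
            raise ValueError("cycle detected in pipeline")
        for j in sorted(ready):  # sorted for a deterministic order
            order.append(j)
            done.add(j)
            del pending[j]
    return order

pipeline = {
    "ingest": set(),
    "clean":  {"ingest"},
    "train":  {"clean"},
    "report": {"clean", "train"},
}
```

Pachyderm layers the "speaks in diffs" idea on top of this: only the subgraph downstream of changed data needs rescheduling.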
<td width="20%">Apache Beam</td>
Apache Beam is an open source, unified model for defining and executing
data-parallel processing pipelines, as well as a set of language-specific
SDKs for constructing pipelines and runtime-specific Runners for executing them.<p>
The model behind Beam evolved from a number of internal Google
data processing projects, including MapReduce, FlumeJava, and
MillWheel. This model was originally known as the “Dataflow Model”
and first implemented as Google Cloud Dataflow, including a Java SDK
on GitHub for writing pipelines and a fully managed service for
executing them on Google Cloud Platform.<p>
In January 2016, Google and a number of partners submitted the Dataflow
Programming Model and SDKs portion as an Apache Incubator Proposal,
under the name Apache Beam (unified Batch + strEAM processing).
<td width="20%"><a href="https://wiki.apache.org/incubator/BeamProposal">1. Apache Beam Proposal</a>
<br><a href="https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison">2. Dataflow, Beam and Spark Comparison</a>
<!-- NoSQL ecosystem -->
<th colspan="3">NoSQL Databases</th>
<th colspan="3" style="background-color:#0099FF;">Column Data Model</th>
<td width="20%">Apache HBase</td>
Google BigTable inspired. Non-relational distributed database.
Random, real-time r/w operations in column-oriented very large
tables (BDDB: Big Data Data Base). It’s the Hadoop database,
commonly used to back the output of Hadoop MapReduce jobs.
<td width="20%"><a href="https://hbase.apache.org/">1. Apache HBase Home</a>
<br> <a href="https://github.com/apache/hbase">2. Mirror of HBase on Github</a>
<td width="20%">Apache Cassandra</td>
Distributed NoSQL DBMS, it’s a BDDB. MR can retrieve data from Cassandra.
This BDDB can run without HDFS, or on top of HDFS (DataStax fork of Cassandra).
HBase and its required supporting systems are derived from what is known of
the original Google BigTable and Google File System designs (as known from the
Google File System paper Google published in 2003, and the BigTable paper published
in 2006). Cassandra, on the other hand, is an open source fork of a standalone
database system initially coded by Facebook, which, while implementing the BigTable
data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact
much of the initial development work on Cassandra was performed by two Dynamo
engineers recruited to Facebook from Amazon).
<a href="http://cassandra.apache.org" target="_blank">1. Apache Cassandra Home</a> <br>
<a href="https://github.com/apache/cassandra" target="_blank">2. Cassandra on GitHub</a> <br>
<a href="https://academy.datastax.com" target="_blank">3. Training Resources</a> <br>
<a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf" target="_blank">4. Cassandra - Paper</a>
<td width="20%">Hypertable</td>
Database system inspired by publications on the design of Google's
BigTable. The project is based on the experience of engineers who were
solving large-scale data-intensive tasks for many years. Hypertable
runs on top of a distributed file system such as the Apache Hadoop DFS,
GlusterFS, or the Kosmos File System (KFS). It is written almost entirely
in C++. Sponsored by Baidu, the Chinese search engine.
<td width="20%">TODO</td>
<td width="20%">Apache Accumulo</td>
A distributed key/value store providing robust, scalable, high-performance
data storage and retrieval. Apache Accumulo is based on Google's
BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift.
Accumulo was created by the NSA and adds security features such as cell-level access control.
<td width="20%"><a href="https://accumulo.apache.org/">1. Apache Accumulo Home</a>
<td width="20%">Apache Kudu</td>
Distributed, columnar, relational data store optimized for analytical use cases requiring
very fast reads with competitive write speeds.
<li>Relational data model (tables) with strongly-typed columns and a fast, online alter table operation.</li>
<li>Scale-out and sharded with support for partitioning based on key ranges and/or hashing.</li>
<li>Fault-tolerant and consistent due to its implementation of Raft consensus.</li>
<li>Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.</li>
<li>Integrates with MapReduce and Spark.</li>
<li>Additionally provides "NoSQL" APIs in Java, Python, and C++.</li>
<td width="20%"><a href="http://getkudu.io/">1. Apache Kudu Home</a><br>
<a href="http://github.com/cloudera/kudu">2. Kudu on Github</a><br>
<a href="http://getkudu.io/kudu.pdf">3. Kudu technical whitepaper (pdf)</a>
<td width="20%">Apache Parquet</td>
Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of
data processing framework, data model or programming language.
<td width="20%"><a href="https://parquet.apache.org">1. Apache Parquet Home</a><br>
<a href="https://github.com/apache/parquet-mr">2. Apache Parquet on Github</a>
<th colspan="3" style="background-color:#0099FF;">Document Data Model</th>
<td width="20%">MongoDB</td>
Document-oriented database system. It is part of the NoSQL family of
database systems. Instead of storing data in tables as is done in a "classical"
relational database, MongoDB stores structured data as JSON-like documents.
<td width="20%"><a href="http://www.mongodb.org/">1. Mongodb site</a>
<td width="20%">RethinkDB</td>
RethinkDB is built to store JSON documents, and scale to multiple
machines with very little effort. It has a pleasant query language
that supports really useful queries like table joins and group by,
and is easy to set up and learn.
<td width="20%"><a href="http://www.rethinkdb.com/">1. RethinkDB site</a>
<td width="20%">ArangoDB</td>
An open-source database with a flexible data model for documents, graphs,
and key-values. Build high performance applications using a convenient
SQL-like query language or JavaScript extensions.
<td width="20%"><a href="https://www.arangodb.org/">1. ArangoDB site</a>
<th colspan="3" style="background-color:#0099FF;">Stream Data Model</th>
<td width="20%">EventStore</td>
An open-source, functional database with support for Complex Event Processing.
It provides a persistence engine for applications using event sourcing, or for
storing time-series data. The Event Store server is written in C# and C++ and
runs on Mono or the .NET CLR, on Linux or Windows.
Applications using Event Store can be written in JavaScript. Event sourcing (ES)
is a way of persisting your application's state by storing the history that determines
the current state of your application.
<td width="20%"><a href="http://geteventstore.com/">1. EventStore site</a>
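Event sourcing, as described above, derives the current state by folding over the stored event history rather than mutating state in place. A minimal sketch (a toy account ledger, not Event Store's client API):

```python
def apply(balance, event):
    """Apply one event to the current state (here, an account balance)."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event kind: {kind}")

def replay(events, initial=0):
    """Rebuild state by replaying the full event history, oldest first."""
    state = initial
    for event in events:
        state = apply(state, event)
    return state

# The stored history IS the source of truth; the balance is derived.
history = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
```

Because the history is append-only, the same replay also yields the state at any past point in time by truncating the event list.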
<th colspan="3" style="background-color:#0099FF;">Key-value Data Model</th>
<td width="20%">Redis DataBase</td>
Redis is an open-source, networked, in-memory data structure
store with optional durability. It is written in ANSI C.
In its outer layer, the Redis data model is a dictionary which
maps keys to values. One of the main differences between Redis
and other structured storage systems is that Redis supports not
only strings, but also abstract data types. Sponsored by Redis Labs.
<td width="20%"><a href="http://redis.io/">1. Redis site</a>
<br> <a href="http://redislabs.com/">2. Redis Labs site</a>
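The outer dictionary model with typed values can be sketched with standard containers. This only illustrates the data model — the function names mimic Redis commands, but nothing here speaks the Redis protocol:

```python
# Redis-style store: every key maps to one typed value.
store = {}

def set_(key, value):
    """SET: the string type — key maps to a plain value."""
    store[key] = value

def rpush(key, *values):
    """RPUSH: the list type — append to the list stored at key."""
    store.setdefault(key, []).extend(values)

def sadd(key, *members):
    """SADD: the set type — add members, duplicates ignored."""
    store.setdefault(key, set()).update(members)

set_("greeting", "hello")
rpush("queue", "a", "b")
sadd("tags", "x", "y", "x")
```

The point of the abstract types is that operations like push or set-add run server-side on the structure, instead of forcing clients to read-modify-write a blob.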
<td width="20%">LinkedIn Voldemort</td>
Distributed data store that is designed as a key-value store used
by LinkedIn for high-scalability storage.
<td width="20%"><a href="http://www.project-voldemort.com/voldemort/">1. Voldemort site</a>
<td width="20%">RocksDB</td>
RocksDB is an embeddable persistent key-value store for fast storage.
RocksDB can also be the foundation for a client-server database, but its
current focus is on embedded workloads.
<td width="20%"><a href="http://rocksdb.org/">1. RocksDB site</a>
<td width="20%">OpenTSDB</td>
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale,
and make this data easily accessible and graphable.
<td width="20%"><a href="http://opentsdb.net/">1. OpenTSDB site</a>
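The core store/index/serve idea is keeping each metric's points sorted by timestamp so that range queries are cheap. A toy stdlib sketch, ignoring tags, distribution, and the HBase layer:

```python
import bisect

class TimeSeriesStore:
    """Toy time-series store: one sorted (timestamp, value) list per metric."""
    def __init__(self):
        self.series = {}

    def put(self, metric, timestamp, value):
        # Keep the per-metric list ordered by timestamp on insert.
        bisect.insort(self.series.setdefault(metric, []), (timestamp, value))

    def query(self, metric, start, end):
        """Return points with start <= timestamp <= end."""
        points = self.series.get(metric, [])
        lo = bisect.bisect_left(points, (start,))
        hi = bisect.bisect_right(points, (end, float("inf")))
        return points[lo:hi]

tsdb = TimeSeriesStore()
tsdb.put("sys.cpu", 100, 0.5)
tsdb.put("sys.cpu", 300, 0.9)
tsdb.put("sys.cpu", 200, 0.7)
```

OpenTSDB achieves the same locality by encoding metric, tags, and a time bucket into the HBase row key, so a scan covers exactly one series over one time range.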
<!-- NoSQL: Graph Data Model -->
<th colspan="3" style="background-color:#0099FF;">Graph Data Model</th>
<td width="20%">ArangoDB</td>
An open-source database with a flexible data model for documents,
graphs, and key-values. Build high performance applications using
a convenient SQL-like query language or JavaScript extensions.
<td width="20%"><a href="https://www.arangodb.org/">1. ArangoDB site</a>
<td width="20%">Neo4j</td>
An open-source graph database written entirely in Java. It is an
embedded, disk-based, fully transactional Java persistence engine
that stores data structured in graphs rather than in tables.
<td width="20%"><a href="http://www.neo4j.org/">1. Neo4j site</a>
<td width="20%">TitanDB</td>
TitanDB is a highly scalable graph database optimized for storing
and querying large graphs with billions of vertices and edges
distributed across a multi-machine cluster. Titan is a transactional
database that can support thousands of concurrent users.
<td width="20%"><a href="http://thinkaurelius.github.io/titan/">1. Titan site</a>
<!-- NewSQL ecosystem -->
<th colspan="3">NewSQL Databases</th>
<td width="20%">TokuDB</td>
TokuDB is a storage engine for MySQL and MariaDB that is specifically
designed for high performance on write-intensive workloads. It achieves
this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC
compliant storage engine. TokuDB is one of the technologies that enable
big data in MySQL.
<td width="20%">TODO</td>
<td width="20%">HandlerSocket</td>
HandlerSocket is a NoSQL plugin for MySQL/MariaDB that talks directly
to the storage engine. It works as a daemon inside the mysqld process, accepting TCP
connections, and executing requests from clients. HandlerSocket does not
support SQL queries. Instead, it supports simple CRUD operations on tables.
HandlerSocket can be much faster than mysqld/libmysql in some cases because
it has lower CPU, disk, and network overhead.
<td width="20%">TODO</td>
<td width="20%">Akiban Server</td>
Akiban Server is an open source database that brings document stores and
relational databases together. Developers get powerful document access
alongside surprisingly powerful SQL.
<td width="20%">TODO</td>
<td width="20%">Drizzle</td>
Drizzle is a re-designed version of the MySQL v6.0 codebase and
is designed around a central concept of having a microkernel
architecture. Features such as the query cache and authentication
system are now plugins to the database, which follow the general
theme of "pluggable storage engines" that were introduced in MySQL 5.1.
It supports PAM, LDAP, and HTTP AUTH for authentication via the plugins
it ships with. Via its plugin system it currently supports logging to files,
syslog, and remote services such as RabbitMQ and Gearman. Drizzle
is an ACID-compliant relational database that supports
transactions via an MVCC design.
<td width="20%">TODO</td>
<td width="20%">Haeinsa</td>
Haeinsa is a linearly scalable multi-row, multi-table transaction
library for HBase. Use Haeinsa if you need strong ACID semantics
on your HBase cluster. It is based on Google's Percolator concept.
<td width="20%">TODO</td>
<td width="20%">SenseiDB</td>
Open-source, distributed, realtime, semi-structured database.
Some features: full-text search, fast realtime updates, structured
and faceted search, BQL (an SQL-like query language), fast key-value
lookup, high performance under concurrent heavy update and query
volumes, and Hadoop integration.
<td width="20%"><a href="http://senseidb.com/">1. SenseiDB site</a>
<td width="20%">Sky</td>
Sky is an open source database used for flexible, high performance
analysis of behavioral data. For certain kinds of data such as
clickstream data and log data, it can be several orders of magnitude
faster than traditional approaches such as SQL databases or Hadoop.
<td width="20%"><a href="http://skydb.io/">1. SkyDB site</a>
<td width="20%">BayesDB</td>
BayesDB, a Bayesian database table, lets users query the probable
implications of their tabular data as easily as an SQL database
lets them query the data itself. Using the built-in Bayesian Query
Language (BQL), users with no statistics training can solve basic
data science problems, such as detecting predictive relationships
between variables, inferring missing values, simulating probable
observations, and identifying statistically similar database entries.
<td width="20%"><a href="http://probcomp.csail.mit.edu/bayesdb/index.html">1. BayesDB site</a>
<td width="20%">InfluxDB</td>
InfluxDB is an open source distributed time series database with
no external dependencies. It's useful for recording metrics, events,
and performing analytics. It has a built-in HTTP API so you don't
have to write any server side code to get up and running. InfluxDB
is designed to be scalable, simple to install and manage, and fast
to get data in and out. It aims to answer queries in real-time.
That means every data point is indexed as it comes in and is immediately
available in queries, which should return in under 100 ms.
<td width="20%"><a href="http://influxdb.org/">1. InfluxDB site</a>
<th colspan="3">SQL-on-Hadoop</th>
<td width="20%">Apache Hive</td>
Data Warehouse infrastructure developed by Facebook. Data
summarization, query, and analysis. It provides an SQL-like
language (not SQL92 compliant): HiveQL.
<td width="20%"><a href="http://hive.apache.org/">1. Apache HIVE site</a>
<br> <a href="https://github.com/apache/hive">2. Apache HIVE GitHub Project</a>
<td width="20%">Apache HCatalog</td>
HCatalog’s table abstraction presents users with a relational view
of data in the Hadoop Distributed File System (HDFS) and ensures
that users need not worry about where or in what format their data
is stored. HCatalog is now part of Hive; only old versions are available for separate download.
<td width="20%">TODO</td>
<td width="20%">Apache Trafodion</td>
Apache Trafodion is a webscale SQL-on-Hadoop solution enabling
enterprise-class transactional and operational workloads on
HBase. Trafodion is a native MPP ANSI SQL database engine that
builds on the scalability, elasticity and flexibility of HDFS and
HBase, extending these to provide guaranteed transactional
integrity for all workloads including multi-column, multi-row,
multi-table, and multi-server updates.
<td width="20%"><a href="http://trafodion.incubator.apache.org">1. Apache Trafodion website</a>
<br> <a href="https://cwiki.apache.org/confluence/display/TRAFODION/Apache+Trafodion+Home">2. Apache Trafodion wiki</a>
<br> <a href="https://github.com/apache/incubator-trafodion">3. Apache Trafodion GitHub Project</a>
<td width="20%">Apache HAWQ</td>
Apache HAWQ is a Hadoop-native SQL query engine that combines
the key technological advantages of an MPP database evolved from Greenplum Database
with the scalability and convenience of Hadoop.
<td width="20%"><a href="http://hawq.incubator.apache.org/">1. Apache HAWQ site</a>
<br> <a href="https://github.com/apache/incubator-hawq">2. HAWQ GitHub Project</a>
<td width="20%">Apache Drill</td>
Drill is the open source version of Google's Dremel system, which
is available as an infrastructure service called Google BigQuery.
In recent years open source systems have emerged to address the
need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). Apache Hadoop, originally inspired
by Google's internal MapReduce system, is used by thousands of
organizations processing large-scale datasets. Apache Hadoop is
designed to achieve very high throughput, but is not designed to
achieve the sub-second latency needed for interactive data analysis
and exploration. Drill, inspired by Dremel, is intended to address this need.
<td width="20%"><a href="http://incubator.apache.org/drill/">1. Apache Incubator Drill</a>
<td width="20%">Cloudera Impala</td>
The Apache-licensed Impala project brings scalable parallel database
technology to Hadoop, enabling users to issue low-latency SQL queries
to data stored in HDFS and Apache HBase without requiring data movement
or transformation. It is inspired by Google's Dremel (the engine behind Google BigQuery).
<td width="20%"><a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html">1. Cloudera Impala site</a>
<br> <a href="https://github.com/cloudera/impala">2. Impala GitHub Project</a>
<td width="20%">Facebook Presto</td>
Facebook has open sourced Presto, a SQL engine it says is on
average 10 times faster than Hive for running queries across
large data sets stored in Hadoop and elsewhere.
<td width="20%"><a href="http://prestodb.io/">1. Presto site</a>
<td width="20%">Datasalt Splout SQL</td>
Splout allows serving an arbitrarily big dataset with high QPS
rates and at the same time provides full SQL query syntax.
<td width="20%">TODO</td>
<td width="20%">Apache Tajo</td>
Apache Tajo is a robust big data relational and distributed data
warehouse system for Apache Hadoop. Tajo is designed for low-latency
and scalable ad-hoc queries, online aggregation, and ETL
(extract-transform-load process) on large data sets stored on
HDFS (Hadoop Distributed File System) and other data sources.
By supporting SQL standards and leveraging advanced database
techniques, Tajo allows direct control of distributed execution
and data flow across a variety of query evaluation strategies
and optimization opportunities. For reference, the Apache
Software Foundation announced Tajo as a Top-Level Project in April 2014.
<td width="20%"><a href="http://tajo.apache.org/">1. Apache Tajo site</a>
<td width="20%">Apache Phoenix</td>
Apache Phoenix is a SQL skin over HBase delivered as a
client-embedded JDBC driver targeting low latency queries over
HBase data. Apache Phoenix takes your SQL query, compiles it into
a series of HBase scans, and orchestrates the running of those
scans to produce regular JDBC result sets. The table metadata is
stored in an HBase table and versioned, such that snapshot queries
over prior versions will automatically use the correct schema.
Direct use of the HBase API, along with coprocessors and custom
filters, results in performance on the order of milliseconds for
small queries, or seconds for tens of millions of rows.
<td width="20%"><a href="http://phoenix.incubator.apache.org/index.html">1. Apache Phoenix site</a>
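The essence of the approach — turning a SQL predicate into a filtered scan over a key-ordered table — can be sketched as follows. This is a hypothetical in-memory table for illustration, not Phoenix internals:

```python
# HBase-like table: row key -> dict of column values.
table = {
    "row1": {"city": "NY", "pop": 8},
    "row2": {"city": "SF", "pop": 1},
    "row3": {"city": "NY", "pop": 2},
}

def scan(table, predicate):
    """Scan in row-key order, applying a server-side-style filter."""
    return [(key, cols) for key, cols in sorted(table.items())
            if predicate(cols)]

# Roughly what SELECT * FROM t WHERE city = 'NY' compiles down to:
ny_rows = scan(table, lambda cols: cols["city"] == "NY")
```

Phoenix goes further: when the WHERE clause constrains the leading part of the row key, it narrows the scan's start/stop keys instead of filtering a full scan, which is where most of the speedup comes from.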
<td width="20%">Apache MRQL</td>
MRQL (pronounced miracle) is a query processing and optimization
system for large-scale, distributed data analysis. MRQL (the MapReduce
Query Language) is an SQL-like query language for large-scale data analysis
on a cluster of computers. The MRQL query processing system can evaluate MRQL
queries in four modes:
<li>in MapReduce mode using Apache Hadoop,</li>
<li>in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,</li>
<li>in Spark mode using Apache Spark, and</li>
<li>in Flink mode using Apache Flink.</li>
<td width="20%"><a href="http://mrql.incubator.apache.org/">1. Apache Incubator MRQL site</a>
<td width="20%">Kylin</td>
Kylin is an open source Distributed Analytics Engine from eBay
Inc. that provides a SQL interface and multi-dimensional analysis
(OLAP) on Hadoop, supporting extremely large datasets.
<td width="20%"><a href="http://www.kylin.io/">1. Kylin project site</a>
<!-- Data Ingestion Tools -->
<th colspan="3">Data Ingestion</th>
<td width="20%">Apache Flume</td>
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts
of log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms.
It uses a simple extensible data model that allows for online analytic applications.
<td width="20%"><a href="http://flume.apache.org/">1. Apache Flume project site</a>
<td width="20%">Apache Sqoop</td>
System for bulk data transfer between HDFS and structured
datastores such as relational databases. Like Flume, but for moving data between HDFS and an RDBMS.
<td width="20%"><a href="http://sqoop.apache.org/">1. Apache Sqoop project site</a>
<td width="20%">Facebook Scribe</td>
Real-time log aggregator, implemented as an Apache Thrift service.
<td width="20%">TODO</td>
<td width="20%">Apache Chukwa</td>
Large-scale log aggregation and analytics.
<td width="20%">TODO</td>
<td width="20%">Apache Kafka</td>
Distributed publish-subscribe system for processing large amounts
of streaming data. Kafka is a message queue developed by LinkedIn
that persists messages to disk in a very performant manner.
Because messages are persisted, clients have the interesting ability
to rewind a stream and consume the messages again.
Another upside of the disk persistence is that bulk importing
the data into HDFS for offline analysis can be done very quickly
and efficiently. Storm, developed by BackType (later acquired
by Twitter), is more about transforming a stream of
messages into new streams.
<td width="20%"><a href="http://kafka.apache.org/">1. Apache Kafka</a>
<br/><a href="https://github.com/apache/kafka/">2. GitHub source code</a>
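The rewind ability mentioned above falls out of the log/offset design: the broker keeps an append-only log, and each consumer owns its own read position. A toy sketch of that contract (one partition; not the real Kafka client API):

```python
class Log:
    """Append-only message log, like a single Kafka partition."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # the message's offset

class Consumer:
    """Reads from a Log at its own offset; rewinding is just resetting it."""
    def __init__(self, log):
        self.log, self.offset = log, 0

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)
        return batch

    def seek(self, offset):
        self.offset = offset  # rewind (or skip ahead)

log = Log()
for m in ("m0", "m1", "m2"):
    log.append(m)

consumer = Consumer(log)
first = consumer.poll()     # reads all three messages
consumer.seek(1)
replayed = consumer.poll()  # re-reads from offset 1
```

Because consuming never deletes anything, many independent consumers (including a bulk HDFS importer) can read the same log at their own pace.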
<td width="20%">Netflix Suro</td>
Suro has its roots in Apache Chukwa, which was initially adopted
by Netflix. It is a log aggregator, in the same space as Storm and Samza.
<td width="20%">TODO</td>
<td width="20%">Apache Samza</td>
Apache Samza is a distributed stream processing framework.
It uses Apache Kafka for messaging, and Apache Hadoop YARN to
provide fault tolerance, processor isolation, security, and
resource management.
Developed at LinkedIn.
<td width="20%">TODO</td>
<td width="20%">Cloudera Morphline</td>
Cloudera Morphlines is an open source framework that reduces
the time and skills necessary to integrate, build, and change
Hadoop processing applications that extract, transform,
and load data into Apache Solr, Apache HBase, HDFS, enterprise
data warehouses, or analytic online dashboards.
<td width="20%">TODO</td>
<td width="20%">HIHO</td>
This project is a framework for connecting disparate data sources
with the Apache Hadoop system, making them interoperable. HIHO
connects Hadoop with multiple RDBMS and file systems, so that
data can be loaded to Hadoop and unloaded from Hadoop.
<td width="20%">TODO</td>
<td width="20%">Apache NiFi</td>
Apache NiFi is a dataflow system that is currently under
incubation at the Apache Software Foundation. NiFi is based on
the concepts of flow-based programming and is highly configurable.
NiFi uses a component based extension model to rapidly add
capabilities to complex dataflows. Out of the box NiFi has
several extensions for dealing with file-based dataflows such
as FTP, SFTP, and HTTP integration as well as integration with
HDFS. One of NiFi’s unique features is a rich, web-based
interface for designing, controlling, and monitoring a dataflow.
<td width="20%"><a href="http://nifi.apache.org/index.html">1. Apache NiFi</a>
<td width="20%">Apache ManifoldCF</td>
Apache ManifoldCF provides a framework for connecting source content
repositories like file systems, DB, CMIS, SharePoint, FileNet ...
to target repositories or indexes, such as Apache Solr or ElasticSearch.
It's a kind of crawler for multi-content repositories, supporting many
sources and multi-format conversion for indexing by means of the Apache
Tika Content Extractor transformation filter.
<td width="20%"><a href="http://manifoldcf.apache.org/">1. Apache ManifoldCF</a>
<th colspan="3">Service Programming</th>
<td width="20%">Apache Thrift</td>
A cross-language RPC framework for service creation. It’s the
service base for Facebook technologies (the original Thrift
contributor). Thrift provides a framework for developing and
accessing remote services. It allows developers to create
services that can be consumed by any application written
in a language that has Thrift bindings. Thrift
manages serialization of data to and from a service, as well as
the protocol that describes a method invocation, response, etc.
Instead of writing all the RPC code, you can get straight
to your service logic. Thrift uses TCP, so a given service is
bound to a particular port.
<td width="20%"><a href="http://thrift.apache.org//">1. Apache Thrift</a>
<td width="20%">Apache Zookeeper</td>
It’s a coordination service that gives you the tools you need to
write correct distributed applications. ZooKeeper was developed
at Yahoo! Research. Several Hadoop projects are already using
ZooKeeper to coordinate the cluster and provide highly-available
distributed services. Perhaps the most famous of those are Apache
HBase, Storm, and Kafka. ZooKeeper is an application library with
two principal implementations of the APIs—Java and C—and a service
component implemented in Java that runs on an ensemble of dedicated
servers. ZooKeeper simplifies the development of distributed
systems, making it more agile and enabling more
robust implementations. Back in 2006, Google published a paper
on "Chubby", a distributed lock service which gained wide adoption
within their data centers. ZooKeeper, not surprisingly, is a close
clone of Chubby designed to fulfill many of the same roles for
HDFS and other Hadoop infrastructure.
<td width="20%"><a href="http://zookeeper.apache.org/">1. Apache Zookeeper</a>
<br><a href="http://research.google.com/archive/chubby.html">2. Google Chubby paper</a>
<td width="20%">Apache Avro</td>
Apache Avro is a framework for modeling, serializing and making
Remote Procedure Calls (RPC). Avro data is described by a schema,
and one interesting feature is that the schema is stored in the
same file as the data it describes, so files are self-describing.
Avro does not require code generation. This framework competes
with similar tools such as Apache Thrift, Google Protocol Buffers, and ZeroC ICE.
<td width="20%"><a href="http://avro.apache.org/">1. Apache Avro</a>
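The self-describing-file idea — the schema travels with the data — can be sketched with JSON. Avro's real container format is binary and block-oriented; this only illustrates the concept:

```python
import json

def write_container(schema, records):
    """Bundle the schema and the records in one payload, as an
    Avro data file bundles its schema in the file header."""
    return json.dumps({"schema": schema, "records": records})

def read_container(payload):
    """A reader needs no external schema: it is embedded in the file."""
    doc = json.loads(payload)
    return doc["schema"], doc["records"]

schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "int"}]}
payload = write_container(schema, [{"id": 1}, {"id": 2}])
```

Because every file carries its writer's schema, a reader compiled against a newer schema can still resolve old files, which is the basis of Avro's schema evolution rules.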
<td width="20%">Apache Curator</td>
Curator is a set of Java libraries that make using Apache
ZooKeeper much easier.
<td width="20%">TODO</td>
<td width="20%">Apache Karaf</td>
Apache Karaf is an OSGi runtime that runs on top of any OSGi
framework and provides you a set of services, a powerful
provisioning concept, an extensible shell and more.
<td width="20%">TODO</td>
<td width="20%">Twitter Elephant Bird</td>
Elephant Bird is a project that provides utilities (libraries)
for working with LZOP-compressed data. It also provides a
container format that supports working with Protocol Buffers and
Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDe, and
HBase miscellanea. This open source library is massively used at Twitter.
<td width="20%"><a href="https://github.com/kevinweil/elephant-bird">1. Elephant Bird GitHub</a>
<td width="20%">Linkedin Norbert</td>
Norbert is a library that provides easy cluster management and
workload distribution. With Norbert, you can quickly distribute
a simple client/server architecture to create a highly scalable
architecture capable of handling heavy traffic. Implemented in
Scala, Norbert wraps ZooKeeper and Netty and uses Protocol Buffers
for transport to make it easy to build a cluster-aware application.
A Java API is provided, and pluggable load balancing strategies
are supported, with round-robin and consistent-hash strategies
provided out of the box.
<td width="20%"><a href="http://data.linkedin.com/opensource/norbert">1. LinkedIn Project</a>
<br><a href="https://github.com/rhavyn/norbert">2. GitHub source code</a>
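A consistent-hash load-balancing strategy like the one Norbert ships can be sketched in a few lines (stdlib only; a real deployment would add ZooKeeper-managed membership, as Norbert does):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to nodes on a hash ring; adding or removing a node
    only remaps the keys near its points, not the whole keyspace."""
    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (ring_point, node)
        for node in nodes:
            for i in range(replicas):  # virtual nodes smooth the balance
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
```

The same routing computed on every client stays consistent as long as all clients agree on the membership list, which is exactly what the ZooKeeper layer provides.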
<th colspan="3">Scheduling</th>
<td width="20%">Apache Oozie</td>
Workflow scheduler system for MR jobs using DAGs
(Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs
by time (frequency) and data availability.
<td width="20%"><a href="http://oozie.apache.org/">1. Apache Oozie</a>
<br/><a href="https://github.com/apache/oozie">2. GitHub source code</a>
<td width="20%">Linkedin Azkaban</td>
1485
Hadoop workflow management. A batch job scheduler can be seen as
1486
a combination of the cron and make Unix utilities combined with
1489
<td width="20%">TODO</td>
<td width="20%">Apache Falcon</td>
Apache Falcon is a data management framework for simplifying
data lifecycle management and processing pipelines on Apache
Hadoop. It enables users to configure, manage and orchestrate
data motion, pipeline processing, disaster recovery, and data
retention workflows. Instead of hard-coding complex data lifecycle
capabilities, Hadoop applications can rely on the well-tested
Apache Falcon framework for these functions. Falcon's simplification
of data management is useful to anyone building apps on
Hadoop. Data management on Hadoop encompasses data motion, process
orchestration, lifecycle management, data discovery, and
other concerns beyond ETL. Falcon is a data processing
and management platform for Hadoop that solves these problems and
creates additional opportunities by building on existing components
within the Hadoop ecosystem (e.g. Apache Oozie, Apache Hadoop
DistCp) without reinventing the wheel.
<td width="20%">TODO</td>
<td width="20%">Schedoscope</td>
Schedoscope is an open-source project providing a scheduling
framework for pain-free agile development, testing, (re)loading,
and monitoring of your datahub, lake, or whatever you choose to
call your Hadoop data warehouse these days. Datasets (including
dependencies) are defined using a Scala DSL, which can embed
MapReduce jobs, Pig scripts, Hive queries or Oozie workflows to
build the dataset. The tool includes a test framework to verify
logic and a command-line utility to load and reload data.
<td width="20%"><a href="https://github.com/ottogroup/schedoscope">GitHub source code</a>
<!-- Machine Learning tools -->
<th colspan="3">Machine Learning</th>
<td width="20%">Apache Mahout</td>
A machine learning and math library on top of MapReduce.
<td width="20%">TODO</td>
<td width="20%">WEKA</td>
Weka (Waikato Environment for Knowledge Analysis) is a popular suite
of machine learning software written in Java, developed at the
University of Waikato, New Zealand. Weka is free software available
under the GNU General Public License.
<td width="20%">TODO</td>
<td width="20%">Cloudera Oryx</td>
The Oryx open source project provides simple, real-time, large-scale
machine learning / predictive analytics infrastructure. It implements
a few classes of algorithm commonly used in business applications:
collaborative filtering / recommendation, classification / regression,
and clustering.
<td width="20%"><a href="https://github.com/cloudera/oryx">1. Oryx at GitHub</a>
<br> <a href="https://community.cloudera.com/t5/Data-Science-and-Machine/bd-p/Mahout">2. Cloudera forum for Machine Learning</a>
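The collaborative-filtering class of algorithm that Oryx implements can be illustrated with a toy user-based recommender. This is a generic sketch, unrelated to Oryx's actual implementation (which uses matrix factorization at scale): unseen items are scored by the similarity-weighted ratings of other users.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse rating dicts {item: rating}."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(target, others, k=1):
    """Rank items the target user has not rated, weighting each other
    user's ratings by their similarity to the target."""
    scores = {}
    for user in others:
        sim = cosine(target, user)
        for item, rating in user.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]
```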
<td width="20%">Deeplearning4j</td>
The Deeplearning4j open-source project is the most widely used deep-learning framework for the JVM. DL4J includes deep neural nets such as recurrent neural networks, Long Short-Term Memory networks (LSTMs), convolutional neural networks, various autoencoders, and feedforward neural networks such as restricted Boltzmann machines and deep-belief networks. It also has natural-language-processing algorithms such as word2vec, doc2vec, GloVe and TF-IDF. All Deeplearning4j networks run distributed on multiple CPUs and GPUs. They work as Hadoop jobs, and integrate with Spark for host-thread orchestration. Deeplearning4j's neural networks are applied to use cases such as fraud and anomaly detection, recommender systems, and predictive maintenance.
<td width="20%"><a href="http://deeplearning4j.org/">1. Deeplearning4j Website</a>
<br> <a href="https://gitter.im/deeplearning4j/deeplearning4j">2. Gitter Community for Deeplearning4j</a>
<td width="20%">MADlib</td>
The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data.
The aim of the project is the integration of statistical data analysis into databases.
MADlib describes itself as "Big Data Machine Learning in SQL for Data Scientists".
The software project began as a collaboration between researchers
at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).
<td width="20%"><a href="http://madlib.net/community/">1. MADlib Community</a>
<td width="20%">H2O</td>
<p>H2O is a statistical, machine learning and math runtime for big data analysis.
Developed by the predictive analytics company H2O.ai, H2O has established itself
in the ML scene together with R and Databricks' Spark. According to the team,
H2O is the world's fastest in-memory platform for machine learning and predictive analytics
on big data. It is designed to help users scale machine learning, math, and statistics over large datasets.</p>
<p>In addition to H2O's point-and-click Web UI, its REST API allows easy integration into various
clients. This means explorative analysis of data can be done in a typical fashion in R, Python, and Scala,
and entire workflows can be written up as automated scripts.</p>
<td width="20%"><a href="https://github.com/h2oai/h2o-dev">1. H2O at GitHub</a>
<br/><a href="http://h2o.ai/blog">2. H2O Blog</a>
<td width="20%">Sparkling Water</td>
<p>Sparkling Water combines two open source technologies: Apache Spark and the H2O machine learning engine.
It makes H2O's library of advanced algorithms, including Deep Learning, GLM, GBM, K-Means, PCA, and Random Forest,
accessible from Spark workflows.
Spark users can select the best features from either platform to meet their machine learning needs.
Users can combine Spark's RDD API and Spark MLlib with H2O's machine learning algorithms,
or use H2O independently of Spark in the model building process and post-process the results in Spark.</p>
<p>Sparkling Water provides a transparent integration of H2O's framework and data structures into Spark's
RDD-based environment by sharing the same execution space as well as providing an RDD-like API for H2O data structures.</p>
<td width="20%"><a href="https://github.com/h2oai/sparkling-water">1. Sparkling Water at GitHub</a>
<br/><a href="https://github.com/h2oai/sparkling-water/tree/master/examples">2. Sparkling Water Examples</a>
<td width="20%">Apache SystemML</td>
<p>Apache SystemML was open sourced by IBM and is closely
related to Apache Spark. Think of Apache Spark as
the analytics operating system for any application that taps
into huge volumes of streaming data. MLlib, the machine
learning library for Spark, provides developers with a rich set
of machine learning algorithms, and SystemML enables developers
to translate those algorithms so they can easily digest different
kinds of data and run on different kinds of computers.</p>
SystemML allows a developer to write a single machine learning
algorithm and automatically scale it up using Spark or Hadoop.
SystemML scales for big data analytics with high-performance
optimizer technology, and empowers users to write customized
machine learning algorithms using a simple domain-specific
language (DSL), without learning complicated distributed
programming. It is an extensible complement to Spark.
<td width="20%"><a href="http://systemml.apache.org">1. Apache SystemML</a>
<br/><a href="https://wiki.apache.org/incubator/SystemML">2. Apache Proposal</a>
<!-- Benchmarking and QA tools -->
<th colspan="3">Benchmarking and QA Tools</th>
<td width="20%">Apache Hadoop Benchmarking</td>
There are two main JAR files in Apache Hadoop for benchmarking.
These JARs contain micro-benchmarks for testing particular parts of the
infrastructure; for instance, TestDFSIO analyzes the disk system,
TeraSort evaluates MapReduce tasks, WordCount measures cluster
performance, and so on. The micro-benchmarks are packaged in the tests and
examples JAR files, and you can get a list of them, with descriptions,
by invoking a JAR file with no arguments. As of the Apache
Hadoop 2.2.0 stable release, the micro-benchmarks
are bundled in these JAR files: hadoop-mapreduce-examples-2.2.0.jar and
hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.
<td width="20%"><a href="https://issues.apache.org/jira/browse/MAPREDUCE-3561">1. MAPREDUCE-3561 umbrella ticket to track all the issues related to performance</a>
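The WordCount micro-benchmark mentioned above follows the classic map/shuffle/reduce pattern. That pattern can be sketched in plain Python (illustrative only; the real benchmark is a Java MapReduce job that runs on the cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # mapper: emit a (word, 1) pair for every token in the input line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # shuffle groups pairs by key; the reducer sums the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(lines):
    # run the mapper over every line, then reduce the combined output
    return reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```

On a cluster the mappers run in parallel over HDFS blocks and the shuffle moves pairs across the network, which is why WordCount exercises the whole cluster rather than a single machine.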
<td width="20%">Yahoo Gridmix3</td>
Hadoop cluster benchmarking from the Yahoo engineering team.
<td width="20%">TODO</td>
<td width="20%">PUMA Benchmarking</td>
A benchmark suite which represents a broad range of MapReduce
applications exhibiting application characteristics with
high/low computation and high/low shuffle volumes. There are a
total of 13 benchmarks, of which Tera-Sort, Word-Count,
and Grep are from the Hadoop distribution. The rest of the benchmarks
were developed in-house and are currently not part of the Hadoop
distribution. The three benchmarks from the Hadoop distribution are
also slightly modified to take the number of reduce tasks as input
from the user and to generate final completion-time statistics for jobs.
<td width="20%"><a href="https://issues.apache.org/jira/browse/MAPREDUCE-5116">1. MAPREDUCE-5116</a>
<br> <a href="https://sites.google.com/site/farazahmad/">2. Faraz Ahmad researcher</a>
<br> <a href="https://sites.google.com/site/farazahmad/pumabenchmarks">3. PUMA Docs</a>
<td width="20%">Berkeley SWIM Benchmark</td>
The SWIM benchmark (Statistical Workload Injector for MapReduce)
is a benchmark representing a real-world big data workload, developed
by the University of California at Berkeley in close cooperation with
Facebook. This test provides rigorous measurements of the performance
of MapReduce systems comprised of real industry workloads.
<td width="20%"><a href="https://github.com/SWIMProjectUCB/SWIM/wiki">1. GitHub SWIM</a>
<td width="20%">Intel HiBench</td>
HiBench is a Hadoop benchmark suite.
<td width="20%">TODO</td>
<td width="20%">Apache Yetus</td>
To help maintain consistency over a large and disconnected set
of committers, automated patch testing was added to Hadoop's development process.
This automated patch testing (now included as part of Apache Yetus)
works as follows: when a patch is uploaded to the bug tracking
system, an automated process downloads the patch, performs some
static analysis, and runs the unit tests. These results are posted
back to the bug tracker, and alerts notify interested parties about
the state of the patch.<p>
The Apache Yetus project, however, addresses much more than traditional
patch testing; it includes a massive rewrite of
the patch testing facility used in Hadoop.
<td width="20%"><a href="https://www.altiscale.com/blog/apache-yetus-faster-more-reliable-software-development/">1. Altiscale Blog Entry</a>
<br> <a href="https://wiki.apache.org/incubator/YetusProposal">2. Apache Yetus Proposal</a>
<br> <a href="https://yetus.apache.org/">3. Apache Yetus Project site</a>
<th colspan="3">Security</th>
<td width="20%">Apache Sentry</td>
Sentry is the next step in enterprise-grade big data security,
delivering fine-grained authorization to data stored in Apache
Hadoop. An independent security module that integrates with the open
source SQL query engines Apache Hive and Cloudera Impala, Sentry
delivers advanced authorization controls to enable multi-user
applications and cross-functional processes for enterprise data
sets. Sentry was originally developed by Cloudera.
<td width="20%">TODO</td>
<td width="20%">Apache Knox Gateway</td>
A system that provides a single point of secure access for Apache
Hadoop clusters. The goal is to simplify Hadoop security for both
users (i.e. those who access the cluster data and execute jobs) and
operators (i.e. those who control access and manage the cluster). The
Gateway runs as a server (or cluster of servers) serving one
or more Hadoop clusters.
<td width="20%"><a href="http://knox.apache.org/">1. Apache Knox</a>
<br><a href="http://hortonworks.com/hadoop/knox-gateway/">2. Apache Knox Gateway Hortonworks web</a>
<td width="20%">Apache Ranger</td>
Apache Ranger (formerly called Apache Argus or HDP Advanced
Security) delivers a comprehensive approach to central security policy
administration across the core enterprise security requirements
of authentication, authorization, accounting and data protection.
It extends baseline features for coordinated enforcement across
Hadoop workloads, from batch to interactive SQL to real-time, and
leverages an extensible architecture to apply policies consistently
against additional Hadoop ecosystem components (beyond HDFS, Hive,
and HBase), including Storm, Solr, Spark, and more.
<td width="20%"><a href="http://ranger.apache.org/">1. Apache Ranger</a>
<br><a href="http://hortonworks.com/hadoop/ranger/">2. Apache Ranger Hortonworks web</a>
<th colspan="3">Metadata Management</th>
<td width="20%">Metascope</td>
Metascope is a metadata management and data discovery tool which
serves as an add-on to Schedoscope. Metascope is able to collect technical,
operational and business metadata from your Hadoop datahub and makes
it easy to search and navigate via a portal.
<td width="20%"><a href="https://github.com/ottogroup/metascope">GitHub source code</a>
<th colspan="3">System Deployment</th>
<td width="20%">Apache Ambari</td>
An intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Apache Ambari was donated by the Hortonworks team to the ASF. It is a powerful
and pleasant interface for Hadoop and other typical applications from the Hadoop
ecosystem. Apache Ambari is under heavy development and will incorporate
new features in the near future. For example, Ambari is able to deploy a complete
Hadoop system from scratch; however, it is not possible to use the GUI on a Hadoop
system that is already running. The ability to provision the operating
system could be a good addition, though it is probably not on the roadmap.
<td width="20%"><a href="http://ambari.apache.org/">1. Apache Ambari</a>
<td width="20%">Cloudera HUE</td>
A web application for interacting with Apache Hadoop. It is not a deployment tool;
it is an open-source web interface that supports Apache Hadoop and its ecosystem,
licensed under the Apache v2 license. HUE is used for Hadoop and its ecosystem
user operations. For example, HUE offers editors for Hive, Impala, Oozie and Pig,
notebooks for Spark, Solr Search dashboards, and HDFS, YARN and HBase browsers.
<td width="20%"><a href="http://gethue.com/">1. HUE home page</a>
<td width="20%">Apache Mesos</td>
Mesos is a cluster manager that provides resource sharing and isolation across
cluster applications, much as HTCondor, SGE or Torque do. Mesos, however,
has a Hadoop-centred design.
<td width="20%">TODO</td>
<td width="20%">Myriad</td>
Myriad is a Mesos framework designed for scaling YARN clusters on Mesos. Myriad
can expand or shrink one or more YARN clusters in response to events, as per
configured rules and policies.
<td width="20%"><a href="https://github.com/mesos/myriad">1. Myriad Github</a>
<td width="20%">Marathon</td>
Marathon is a Mesos framework for long-running services. Given that you have
Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.
<td width="20%">TODO</td>
<td width="20%">Brooklyn</td>
Brooklyn is a library that simplifies application deployment and management.
For deployment, it is designed to tie in with other tools, giving single-click
deploy and adding the concepts of manageable clusters and fabrics:
many common software entities are available out of the box;
it integrates with Apache Whirr -- and thereby Chef and Puppet -- to deploy well-known
services such as Hadoop and elasticsearch (or use POBS, plain-old-bash-scripts);
and it can use PaaSes such as OpenShift, alongside self-built clusters, for maximum flexibility.
<td width="20%">TODO</td>
<td width="20%">Hortonworks HOYA</td>
HOYA is defined as “running HBase On YARN”. The Hoya tool is a Java tool,
and is currently CLI driven. It takes in a cluster specification – in terms
of the number of regionservers, the location of HBASE_HOME, the ZooKeeper
quorum hosts, and the configuration that the new HBase cluster instance should use.
So HOYA is for HBase deployment using a tool developed on top of YARN. Once the
cluster has been started, it can be made to grow or shrink using the
Hoya commands. The cluster can also be stopped and later resumed. Hoya implements
the functionality through YARN APIs and HBase's shell scripts. The goal of
the prototype was to have minimal code changes and, as of this writing, it has
required zero code changes in HBase.
<td width="20%"><a href="http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/">1. Hortonworks Blog</a>
<td width="20%">Apache Helix</td>
Apache Helix is a generic cluster management framework used for the automatic
management of partitioned, replicated and distributed resources hosted on a
cluster of nodes. Originally developed by LinkedIn, it is now an Apache incubator
project. Helix is developed on top of ZooKeeper for coordination tasks.
<td width="20%"><a href="http://helix.apache.org/">1. Apache Helix</a>
<td width="20%">Apache Bigtop</td>
Bigtop was originally developed and released as an open source packaging
infrastructure by Cloudera. Bigtop is used by some vendors to build their
own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel's distribution);
however, Apache Bigtop does many more tasks, like continuous integration testing
(with Jenkins, Maven, ...), and is useful for packaging (RPM and DEB), deployment
with Puppet, and so on. Bigtop also features Vagrant recipes for spinning up "n-node"
Hadoop clusters, and the BigPetStore blueprint application, which demonstrates
construction of a full-stack Hadoop app with ETL, machine learning,
and dataset generation. Apache Bigtop can be considered a community effort
with one main focus: packaging the Hadoop ecosystem as a whole, rather
than as individual projects.
<td width="20%"><a href="http://bigtop.apache.org//">1. Apache Bigtop.</a>
<td width="20%">Buildoop</td>
Buildoop is an open source project licensed under Apache License 2.0, based on the Apache Bigtop idea.
Buildoop is a collaboration project that provides templates and tools to help you create custom
Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language,
and is not based on a mixture of tools like Bigtop is (Makefile, Gradle, Groovy, Maven); it is probably
easier to program than Bigtop, and its design is focused on the basic ideas behind Buildroot and the
Yocto Project. The project is in the early stages of development right now.
<td width="20%"><a href="http://buildoop.github.io/">1. Hadoop Ecosystem Builder.</a>
<td width="20%">Deploop</td>
Deploop is a tool for provisioning, managing and monitoring Apache Hadoop
clusters focused on the Lambda Architecture. LA is a generic design based on
the concepts of Twitter engineer Nathan Marz. This generic architecture was
designed to address common requirements for big data. The Deploop system is
in ongoing development, in alpha phases of maturity. The system is set up
on top of highly scalable technologies like Puppet and MCollective.
<td width="20%"><a href="http://deploop.github.io/">1. The Hadoop Deploy System.</a>
<td width="20%">SequenceIQ Cloudbreak</td>
Cloudbreak is an effective way to start and run multiple instances and
versions of Hadoop clusters in the cloud, in Docker containers or on bare metal.
It is a cloud- and infrastructure-agnostic and cost-effective Hadoop-as-a-Service
platform API. It provides automatic scaling, secure multi-tenancy and full cloud lifecycle management.
<p>Cloudbreak leverages cloud infrastructure platforms to create host instances,
uses Docker technology to deploy the requisite containers cloud-agnostically,
and uses Apache Ambari (via Ambari Blueprints) to install and manage a Hortonworks cluster.
It is a tool within the HDP ecosystem.
<td width="20%"><a href="https://github.com/sequenceiq/cloudbreak">1. GitHub project.</a>
<br><a href="http://sequenceiq.com/cloudbreak-docs/latest/#introduction">2. Cloudbreak introduction.</a>
<br><a href="http://hortonworks.com/hadoop/cloudbreak/">3. Cloudbreak in Hortonworks.</a>
<td width="20%">Apache Eagle</td>
Apache Eagle is an open source analytics solution for identifying security and performance issues instantly on big data platforms, e.g. Hadoop, Spark, etc. It analyzes data activities, YARN applications, JMX metrics, and daemon logs, and provides a state-of-the-art alert engine to identify security breaches and performance issues and to surface insights.
Big data platforms normally generate huge amounts of operational logs and metrics in real time. Apache Eagle was founded to solve the hard problems of securing and tuning performance for big data platforms, by keeping metrics and logs always available and alerting immediately, even under huge traffic.
<td width="20%"><a href="https://github.com/apache/incubator-eagle">1. Apache Eagle Github Project.</a>
<br><a href="http://eagle.incubator.apache.org/">2. Apache Eagle Web Site.</a>
<th colspan="3">Applications</th>
<td width="20%">Apache Nutch</td>
A highly extensible and scalable open source web crawler software
project: a search engine based on Lucene. A web crawler is an
Internet bot that systematically browses the World Wide Web,
typically for the purpose of web indexing. Web crawlers can copy
all the pages they visit for later processing by a search engine
that indexes the downloaded pages so that users can search them.
<td width="20%">TODO</td>
<td width="20%">Sphinx Search Server</td>
Sphinx lets you either batch index and search data stored in an
SQL database, NoSQL storage, or just files quickly and easily —
or index and search data on the fly, working with Sphinx pretty
much as with a database server.
<td width="20%">TODO</td>
<td width="20%">Apache OODT</td>
OODT was originally developed at NASA Jet Propulsion Laboratory
to support capturing, processing and sharing of data for NASA's
scientific archives.
<td width="20%">TODO</td>
<td width="20%">HIPI Library</td>
HIPI is a library for Hadoop's MapReduce framework that provides
an API for performing image processing tasks in a distributed
computing environment.
<td width="20%">TODO</td>
<td width="20%">PivotalR</td>
PivotalR is a package that enables users of R, the most popular open source statistical
programming language and environment, to interact with the Pivotal (Greenplum) Database,
as well as Pivotal HD / HAWQ and the open-source database PostgreSQL, for big data analytics.
R is a programming language and data analysis software: you do data analysis in R by writing
scripts and functions in the R programming language. R is a complete, interactive,
object-oriented language designed by statisticians, for statisticians. The language
provides objects, operators and functions that make the process of exploring, modeling,
and visualizing data a natural one.
<td width="20%"><a href="https://github.com/gopivotal/PivotalR">1. PivotalR on GitHub</a>
<!-- Development Framework -->
<th colspan="3">Development Frameworks</th>
<td width="20%">Jumbune</td>
Jumbune is an open source product that sits on top of any Hadoop
distribution and assists in the development and administration of
MapReduce solutions. The objective of the product is to assist
analytical solution providers to port fault-free applications onto
production Hadoop environments.<br> Jumbune supports all active
major branches of Apache Hadoop, namely 1.x, 2.x, 0.23.x, and the commercial
MapR, HDP 2.x and CDH 5.x distributions of Hadoop. It has the
ability to work well with both YARN and non-YARN versions of Hadoop.<br>
It has four major modules: MapReduce Debugger, HDFS Data Validator,
On-demand Cluster Monitor and MapReduce Job Profiler. Jumbune can
be deployed on any remote user machine and uses a lightweight
agent on the NameNode of the cluster to relay relevant information to and fro.<br>
<td width="20%"><a href="https://jumbune.org">1. Jumbune</a>
<br><a href="https://github.com/impetus-opensource/jumbune">2. Jumbune GitHub Project</a>
<br><a href="http://jumbune.org/jira/secure/Dashboard.jspa">3. Jumbune JIRA page</a>
<td width="20%">Spring XD</td>
Spring XD (Xtreme Data) is an evolution of the Spring Java application
development framework to support big data applications, by Pivotal.
SpringSource was the company created by the founders of the
Spring Framework. SpringSource was purchased by VMware, where it was
maintained for some time as a separate division within VMware.
Later VMware, and its parent company EMC Corporation, formally created
a joint venture called Pivotal. Spring XD is more than a development
framework library; it is a distributed and extensible system for
data ingestion, real-time analytics, batch processing, and data
export. It could be considered an alternative to Apache
Flume/Sqoop/Oozie in some scenarios. Spring XD is part of the Pivotal
Spring for Apache Hadoop (SHDP) offering. SHDP, integrated with Spring,
Spring Batch and Spring Data, is part of the Spring IO Platform
as foundational libraries. Building on top of, and extending, this
foundation, the Spring IO Platform provides Spring XD as a big data
runtime. Spring for Apache Hadoop (SHDP) aims to help simplify the
development of Hadoop-based applications by providing a consistent
configuration and API across a wide range of Hadoop ecosystem
projects such as Pig, Hive, and Cascading, in addition to providing
extensions to Spring Batch for orchestrating Hadoop-based workflows.
<td width="20%"><a href="https://github.com/spring-projects/spring-xd">1. Spring XD on GitHub</a>
<td width="20%">Cask Data Application Platform</td>
Cask Data Application Platform is an open source application
development platform for the Hadoop ecosystem that provides
developers with data and application virtualization to accelerate
application development, address a range of real-time and batch
use cases, and deploy applications into production. Deployment
is handled by Cask Coopr, an open source template-based cluster
management solution that provisions, manages, and scales clusters
for multi-tiered application stacks on public and private clouds.
Another component is Tigon, a distributed framework built on Apache
Hadoop and Apache HBase for real-time, high-throughput, low-latency
data processing and analytics applications.
<td width="20%"><a href="http://cask.co/">1. Cask Site</a>
<th colspan="3">Categorize Pending ... </th>
<td width="20%">Twitter Summingbird</td>
A system that aims to mitigate the tradeoffs between batch
processing and stream processing by combining them into a
hybrid system. In the case of Twitter, Hadoop handles batch
processing, Storm handles stream processing, and the hybrid
system is called Summingbird.
<td width="20%">TODO</td>
<td width="20%">Apache Kiji</td>
Build real-time big data applications on Apache HBase.
<td width="20%">TODO</td>
<td width="20%">S4 Yahoo</td>
S4 is a general-purpose, distributed, scalable, fault-tolerant,
pluggable platform that allows programmers to easily develop
applications for processing continuous unbounded streams of data.
<td width="20%">TODO</td>
<td width="20%">Metamarkets Druid</td>
A real-time analytical data store.
<td width="20%">TODO</td>
<td width="20%">Concurrent Cascading</td>
An application framework for Java developers to simply develop
robust data analytics and data management applications on Apache Hadoop.
<td width="20%">TODO</td>
<td width="20%">Concurrent Lingual</td>
An open source project enabling fast and simple big data application
development on Apache Hadoop; it delivers ANSI-standard
SQL technology to easily build new applications, and integrate existing
ones, onto Hadoop.
<td width="20%">TODO</td>
<td width="20%">Concurrent Pattern</td>
Machine learning for Cascading on Apache Hadoop through an API
and standards-based PMML.
<td width="20%">TODO</td>
<td width="20%">Apache Giraph</td>
Apache Giraph is an iterative graph processing system built for
high scalability. For example, it is currently used at Facebook
to analyze the social graph formed by users and their connections.
Giraph originated as the open-source counterpart to Pregel, the
graph processing architecture developed at Google.
<td width="20%">TODO</td>
<td width="20%">Talend</td>
Talend is an open source software vendor that provides data
integration, data management, enterprise application integration
and big data software and solutions.
<td width="20%">TODO</td>
<td width="20%">Akka Toolkit</td>
Akka is an open-source toolkit and runtime simplifying the
construction of concurrent applications on the Java platform.
<td width="20%">TODO</td>
<td width="20%">Eclipse BIRT</td>
BIRT is an open source Eclipse-based reporting system that
integrates with your Java/Java EE application to produce
reports.
<td width="20%">TODO</td>
<td width="20%">SpagoBI</td>
SpagoBI is an open source business intelligence suite,
belonging to the free/open source SpagoWorld initiative,
founded and supported by Engineering Group. It offers a large
range of analytical functions, a highly functional semantic layer
often absent in other open source platforms and projects, and a
respectable set of advanced data visualization features, including
geospatial analytics.
<td width="20%">TODO</td>
<td width="20%">Jedox Palo</td>
Palo Suite combines all core applications — OLAP Server, Palo
Web, Palo ETL Server and Palo for Excel — into one comprehensive
and customisable business intelligence platform. The platform is
completely based on open source products, representing a high-end
business intelligence solution available entirely free
of any license fees.
<td width="20%">TODO</td>
<td width="20%">Twitter Finagle</td>
Finagle is an asynchronous network stack for the JVM that you
can use to build asynchronous Remote Procedure Call (RPC)
clients and servers in Java, Scala, or any JVM-hosted language.
<td width="20%">TODO</td>
<td width="20%">Intel GraphBuilder</td>
A library which provides tools to construct large-scale graphs on
top of Apache Hadoop.
<td width="20%">TODO</td>
<td width="20%">Apache Tika</td>
A toolkit that detects and extracts metadata and structured text content
from various documents using existing parser libraries.
<td width="20%">TODO</td>
<td width="20%">Apache Zeppelin</td>
Zeppelin is a modern web-based tool for data scientists to
collaborate on large-scale data exploration and visualization
projects. It is a notebook-style interpreter that enables
collaborative analysis sessions to be shared between users. Zeppelin
is independent of the execution framework itself. The current version
runs on top of Apache Spark, but it has pluggable interpreter APIs
to support other data processing systems. More execution frameworks
could be added at a later date, e.g. Apache Flink and Crunch, as well
as SQL-like backends such as Hive, Tajo and MRQL.
<td width="20%"><a href="https://zeppelin.incubator.apache.org/">1. Apache Zeppelin site</a>
<div id="footer_wrap" class="outer">
<footer class="inner">
<p>Published with <a href="http://pages.github.com">GitHub Pages</a>
by <a href="http://es.linkedin.com/in/javiroman/">Javi Roman</a>, and
<a href="https://github.com/hadoopecosystemtable/hadoopecosystemtable.github.io/graphs/contributors">contributors</a>