/docs/MyDocs : revision 28

To get this branch, use:

bzr branch http://darksoft.org/webbzr/docs/MyDocs


Viewing changes to Administration/Server/Orchestration/apache/hadoopecosystemtable.github.io.html

Committer: Suren A. Chilingaryan
Date: 2017-04-03 02:45:17 UTC
Revision ID: csa@suren.me-20170403024517-dwzj0z0k1cmhxm7u

Restructuring, OpenShift, Ansible, Git

files added:
Administration/Platforms/gentoo/ebuild.txt

Administration/Server/Containers

Administration/Server/Containers/docker

Administration/Server/Containers/docker/best_practices

Administration/Server/Containers/docker/best_practices/git_strategies_for_docker.html

Administration/Server/Containers/docker/compose.txt

Administration/Server/Containers/docker/docker.txt

Administration/Server/Containers/docker/kiwi.txt

Administration/Server/Containers/docker/security.txt

Administration/Server/Containers/docker/setup.txt

Administration/Server/Containers/docker/terms.txt

Administration/Server/Containers/list.txt

Administration/Server/Containers/openvz

Administration/Server/Containers/openvz/openvz.txt

Administration/Server/Network

Administration/Server/Network/dhcp.txt

Administration/Server/Network/firewall

Administration/Server/Network/firewall/firewalld.txt

Administration/Server/Network/firewall/iptables.txt

Administration/Server/Network/hints.1

Administration/Server/Network/infiniband

Administration/Server/Network/infiniband/bonding.txt

Administration/Server/Network/infiniband/infiniband.txt

Administration/Server/Network/infiniband/openmpi.txt

Administration/Server/Network/interfaces.txt

Administration/Server/Network/routing

Administration/Server/Network/routing/FW-IDS-iptables-Flowchart-2014-09-25.png

Administration/Server/Network/routing/Preventing brute force attacks using iptables recent matching.html

Administration/Server/Network/ssh.txt

Administration/Server/Network/sshtunnel.txt

Administration/Server/Orchestration

Administration/Server/Orchestration/apache

Administration/Server/Orchestration/apache/apache.txt

Administration/Server/Orchestration/apache/hadoopecosystemtable.github.io.html

Administration/Server/Orchestration/ha.txt

Administration/Server/Orchestration/openshift

Administration/Server/Orchestration/openshift/concepts.txt

Administration/Server/Orchestration/openshift/discussion.txt

Administration/Server/Orchestration/openshift/ha.txt

Administration/Server/Orchestration/openshift/heketi.txt

Administration/Server/Orchestration/openshift/kubernetes.txt

Administration/Server/Orchestration/openshift/links.txt

Administration/Server/Orchestration/openshift/network.txt

Administration/Server/Orchestration/openshift/openshift.txt

Administration/Server/Orchestration/openshift/recipes.txt

Administration/Server/Orchestration/openshift/resources.txt

Administration/Server/Orchestration/openshift/sample.txt

Administration/Server/Orchestration/openshift/security.txt

Administration/Server/Orchestration/openshift/setup.txt

Administration/Server/Orchestration/orchestration.txt

Administration/Server/Orchestration/pacemaker

Administration/Server/Orchestration/pacemaker/docs

Administration/Server/Orchestration/pacemaker/docs/BuildingHACluster.mht

Administration/Server/Orchestration/pacemaker/docs/Clusters from Scratch.mht

Administration/Server/Orchestration/pacemaker/pacemaker.txt

Administration/Server/Provisioning

Administration/Server/Provisioning/ansible

Administration/Server/Provisioning/ansible/ansible.txt

Administration/Server/Provisioning/ansible/format.txt

Administration/Server/Provisioning/ansible/inventory.txt

Administration/Server/Provisioning/ansible/modules.txt

Administration/Server/Provisioning/ansible/roles.txt

Administration/Server/Provisioning/ansible/secrets.txt

Administration/Server/Provisioning/ansible/structures.txt

Administration/Server/Provisioning/ansible/templates.txt

Administration/Server/Provisioning/ansible/variables.txt

Administration/Server/Provisioning/tools.txt

Administration/Server/Security/pam.txt

Administration/Server/Security/selinux.txt

Administration/Server/Security/sudo.txt

Administration/Server/Storage

Administration/Server/Storage/clusterfs

Administration/Server/Storage/clusterfs/ceph.txt

Administration/Server/Storage/clusterfs/fhgfs.txt

Administration/Server/Storage/clusterfs/gluster

Administration/Server/Storage/clusterfs/gluster/docs

Administration/Server/Storage/clusterfs/gluster/docs/Performance_in_a_Gluster_Systemv6F.pdf

Administration/Server/Storage/clusterfs/gluster/gluster-nfs.txt

Administration/Server/Storage/clusterfs/gluster/gluster-samba.txt

Administration/Server/Storage/clusterfs/gluster/gluster.txt

Administration/Server/Storage/clusterfs/gluster/heketi.txt

Administration/Server/Storage/clusterfs/ocfs2.txt

Administration/Server/Storage/iscsi.txt

Administration/Server/Storage/storage.txt

Administration/Server/Storage/swap

Administration/Server/Storage/swap/ramster

Administration/Server/Storage/swap/ramster.txt

Administration/Server/Storage/swap/ramster/HOWTO-v5-120214

Administration/Server/Virtualization/kvm

Administration/Server/Virtualization/kvm/kvm.txt

Administration/Server/Virtualization/kvm/libvirt.txt

Administration/Server/Virtualization/kvm/qemu.txt

Administration/Server/Virtualization/parallels

Administration/Server/Virtualization/parallels/parallels.txt

Administration/Server/Virtualization/vagrant

Administration/Server/Virtualization/vagrant/vagrant.txt

Administration/Server/Virtualization/virtualbox

Administration/Server/Virtualization/virtualbox/virtualbox.txt

SCM/compilers

SCM/compilers/gcc.txt

SCM/compilers/llvm

SCM/compilers/llvm/os-createcompilerllvm1-pdf.pdf

SCM/compilers/llvm/os-createcompilerllvm2-pdf.pdf

SCM/compilers/windows

SCM/compilers/windows/vs2005.txt

SCM/makefile

SCM/makefile/functions

SCM/makefile/macros

SCM/makefile/math

SCM/makefile/rules

SCM/makefile/variables

SCM/packaging

SCM/packaging/deb

SCM/packaging/deb/ubuntu.txt

SCM/packaging/general

SCM/packaging/general/gcc

SCM/packaging/general/gcc/gcc23compat.txt

SCM/packaging/general/kernel.txt

SCM/packaging/general/linking

SCM/packaging/general/linking/ld_assume_kernel.html

SCM/packaging/general/linking/libtool-missing_so.html

SCM/packaging/general/linking/linking.txt

SCM/packaging/osc

SCM/packaging/osc/info.txt

SCM/packaging/osc/python.txt

SCM/packaging/osc/stripping.txt

SCM/packaging/portage

SCM/packaging/rpm

SCM/packaging/rpm/distributions

SCM/packaging/rpm/distributions/mandrake.txt

SCM/packaging/rpm/rpm

SCM/packaging/rpm/rpm.addon

SCM/packaging/rpm/rpm.commands

SCM/packaging/rpm/rpm.pgp_signatures

SCM/packaging/rpm/rpm.problems

SCM/packaging/rpm/rpm.scrypt

SCM/packaging/rpm/rpm.signature

SCM/packaging/rpm/rpm.spec

SCM/packaging/rpm/rpm.unpackaged

SCM/packaging/rpm/rpm.virables

SCM/rcs

SCM/rcs/bzr

SCM/rcs/bzr/bzr.txt

SCM/rcs/bzr/bzr_plugins.txt

SCM/rcs/diff.txt

SCM/rcs/git

SCM/rcs/git/git.txt

SCM/rcs/git/git_arch.txt

SCM/rcs/git/git_config.txt

SCM/rcs/git/git_modules.txt

SCM/rcs/git/git_public.txt

SCM/rcs/git/git_recipes.txt

SCM/rcs/git/git_remote.txt

SCM/rcs/git/git_security.txt

SCM/rcs/mergers.txt

SCM/rcs/other

SCM/rcs/other/svn.txt

SCM/rcs/other/tla.txt

SCM/trac

SCM/trac/edit.txt

SCM/trac/fixes

SCM/trac/fixes/0.10

SCM/trac/fixes/0.10/recaptcharegisterplugin_trac10.tar.bz2

SCM/trac/fixes/0.10/tags

SCM/trac/fixes/0.10/tags/tags-0.4.1-upgrade.patch

SCM/trac/fixes/0.10/tags/tags-0.6-upgrade.patch

SCM/trac/fixes/0.10/trac-bazaar-ds-curtimestamp.patch

SCM/trac/fixes/0.10/trac-bazaar-ds-escape.patch

SCM/trac/fixes/0.10/tracnav-ds-noedit.patch

SCM/trac/fixes/0.11

SCM/trac/fixes/0.11/TracIncludeMacro-ds-FineGrainedPermissions.patch

SCM/trac/fixes/0.11/jsgnatt

SCM/trac/fixes/0.11/jsgnatt/jsgnatt-ds.patch

SCM/trac/fixes/0.11/jsgnatt/jsgnatt.ini

SCM/trac/fixes/0.11/jsgnatt/tracjsgantt.py

SCM/trac/fixes/0.11/tracdownloader-ds-web_ui.diff

SCM/trac/gmaps.txt

SCM/trac/problems.txt

SCM/trac/setup.txt

SCM/web

SCM/web/sourceforge-cvs

files removed:
Administration/Linux/system/network

Administration/Linux/system/network/dhcp.txt

Administration/Linux/system/network/hints.1

Administration/Linux/system/network/interfaces.txt

Administration/Linux/system/network/routing

Administration/Linux/system/network/routing/FW-IDS-iptables-Flowchart-2014-09-25.png

Administration/Linux/system/network/routing/Preventing brute force attacks using iptables recent matching.html

Administration/Linux/system/network/routing/iptables.txt

Administration/Linux/system/network/ssh.txt

Administration/Linux/system/network/sshtunnel.txt

Administration/Linux/system/security

Administration/Linux/system/security/pam.txt

Administration/Linux/system/security/sudo.txt

Administration/Server/Cluster

Administration/Server/Cluster/apache

Administration/Server/Cluster/apache.txt

Administration/Server/Cluster/apache/hadoopecosystemtable.github.io.html

Administration/Server/Cluster/bonding.txt

Administration/Server/Cluster/deployment

Administration/Server/Cluster/deployment/ansible.txt

Administration/Server/Cluster/deployment/tools.txt

Administration/Server/Cluster/ha

Administration/Server/Cluster/ha/docs

Administration/Server/Cluster/ha/docs/BuildingHACluster.mht

Administration/Server/Cluster/ha/docs/Clusters from Scratch.mht

Administration/Server/Cluster/ha/list.txt

Administration/Server/Cluster/ha/rh-ha.txt

Administration/Server/Cluster/infiniband.txt

Administration/Server/Cluster/mpi

Administration/Server/Cluster/mpi/openmpi.txt

Administration/Server/Cluster/orchestration

Administration/Server/Cluster/orchestration.txt

Administration/Server/Cluster/orchestration/kubernetes.txt

Administration/Server/Cluster/storage

Administration/Server/Cluster/storage/clusterfs

Administration/Server/Cluster/storage/clusterfs/ceph.txt

Administration/Server/Cluster/storage/clusterfs/fhgfs.txt

Administration/Server/Cluster/storage/clusterfs/gluster

Administration/Server/Cluster/storage/clusterfs/gluster/docs

Administration/Server/Cluster/storage/clusterfs/gluster/docs/Performance_in_a_Gluster_Systemv6F.pdf

Administration/Server/Cluster/storage/clusterfs/gluster/gluster-nfs.txt

Administration/Server/Cluster/storage/clusterfs/gluster/gluster-samba.txt

Administration/Server/Cluster/storage/clusterfs/gluster/gluster.txt

Administration/Server/Cluster/storage/clusterfs/ocfs2.txt

Administration/Server/Cluster/storage/iscsi.txt

Administration/Server/Cluster/storage/storage.txt

Administration/Server/Cluster/swap

Administration/Server/Cluster/swap/ramster

Administration/Server/Cluster/swap/ramster.txt

Administration/Server/Cluster/swap/ramster/HOWTO-v5-120214

Administration/Server/Virtualization/backup

Administration/Server/Virtualization/backup/openvz.txt

Administration/Server/Virtualization/containers

Administration/Server/Virtualization/containers/docker

Administration/Server/Virtualization/containers/docker/compose.txt

Administration/Server/Virtualization/containers/docker/docker.txt

Administration/Server/Virtualization/containers/docker/kiwi.txt

Administration/Server/Virtualization/containers/docker/security.txt

Administration/Server/Virtualization/containers/docker/setup.txt

Administration/Server/Virtualization/containers/docker/terms.txt

Administration/Server/Virtualization/containers/list.txt

Administration/Server/Virtualization/kvm.txt

Administration/Server/Virtualization/libvirt.txt

Administration/Server/Virtualization/parallels.txt

Administration/Server/Virtualization/qemu.txt

Administration/Server/Virtualization/virtualbox.txt

Development/autotools

Development/autotools/cvs

Development/autotools/cvs/bzr.txt

Development/autotools/cvs/bzr_plugins.txt

Development/autotools/cvs/git.txt

Development/autotools/cvs/git_config.txt

Development/autotools/cvs/mergers.txt

Development/autotools/cvs/svn.txt

Development/autotools/cvs/tla.txt

Development/autotools/diff.txt

Development/autotools/makefile

Development/autotools/makefile/functions

Development/autotools/makefile/macros

Development/autotools/makefile/math

Development/autotools/makefile/rules

Development/autotools/makefile/variables

Development/autotools/sourceforge

Development/autotools/sourceforge/sourceforge-cvs

Development/autotools/trac

Development/autotools/trac/edit.txt

Development/autotools/trac/fixes

Development/autotools/trac/fixes/0.10

Development/autotools/trac/fixes/0.10/recaptcharegisterplugin_trac10.tar.bz2

Development/autotools/trac/fixes/0.10/tags

Development/autotools/trac/fixes/0.10/tags/tags-0.4.1-upgrade.patch

Development/autotools/trac/fixes/0.10/tags/tags-0.6-upgrade.patch

Development/autotools/trac/fixes/0.10/trac-bazaar-ds-curtimestamp.patch

Development/autotools/trac/fixes/0.10/trac-bazaar-ds-escape.patch

Development/autotools/trac/fixes/0.10/tracnav-ds-noedit.patch

Development/autotools/trac/fixes/0.11

Development/autotools/trac/fixes/0.11/TracIncludeMacro-ds-FineGrainedPermissions.patch

Development/autotools/trac/fixes/0.11/jsgnatt

Development/autotools/trac/fixes/0.11/jsgnatt/jsgnatt-ds.patch

Development/autotools/trac/fixes/0.11/jsgnatt/jsgnatt.ini

Development/autotools/trac/fixes/0.11/jsgnatt/tracjsgantt.py

Development/autotools/trac/fixes/0.11/tracdownloader-ds-web_ui.diff

Development/autotools/trac/gmaps.txt

Development/autotools/trac/problems.txt

Development/autotools/trac/setup.txt

Development/compilers

Development/compilers/gcc.txt

Development/compilers/llvm

Development/compilers/llvm/os-createcompilerllvm1-pdf.pdf

Development/compilers/llvm/os-createcompilerllvm2-pdf.pdf

Development/compilers/windows

Development/compilers/windows/vs2005.txt

Development/packaging

Development/packaging/deb

Development/packaging/deb/ubuntu.txt

Development/packaging/general

Development/packaging/general/gcc

Development/packaging/general/gcc/gcc23compat.txt

Development/packaging/general/kernel.txt

Development/packaging/general/linking

Development/packaging/general/linking/ld_assume_kernel.html

Development/packaging/general/linking/libtool-missing_so.html

Development/packaging/general/linking/linking.txt

Development/packaging/osc

Development/packaging/osc/info.txt

Development/packaging/osc/python.txt

Development/packaging/osc/stripping.txt

Development/packaging/portage

Development/packaging/rpm

Development/packaging/rpm/distributions

Development/packaging/rpm/distributions/mandrake.txt

Development/packaging/rpm/rpm

Development/packaging/rpm/rpm.addon

Development/packaging/rpm/rpm.commands

Development/packaging/rpm/rpm.pgp_signatures

Development/packaging/rpm/rpm.problems

Development/packaging/rpm/rpm.scrypt

Development/packaging/rpm/rpm.signature

Development/packaging/rpm/rpm.spec

Development/packaging/rpm/rpm.unpackaged

Development/packaging/rpm/rpm.virables

files renamed:
Administration/Platforms/portage/ => Administration/Platforms/gentoo/

files modified:
Administration/Platforms/gentoo/gentoo.txt

Administration/Server/Access/desktop/vnc.txt

Administration/Software/web/mozilla.txt

Administration/problems.txt


Administration/Server/Orchestration/apache/hadoopecosystemtable.github.io.html

<!DOCTYPE html>

<html>

<head>

<title>The Hadoop Ecosystem Table</title>

</head>

<body>

<a id="forkme_banner" href="https://github.com/hadoopecosystemtable/hadoopecosystemtable.github.io">Fork Me on GitHub</a>

<h1 id="project_title">The Hadoop Ecosystem Table</h1>

<h2 id="project_tagline">This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.</h2>

</header>

</div>

<tr>

<th colspan="3">Distributed Filesystem</th>

</tr>

<tr>

<td width="30%">Apache HDFS</td>

<td>

The Hadoop Distributed File System (HDFS) offers a way to store large files across

multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper.

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster.

With ZooKeeper, the HDFS High Availability feature addresses this problem by providing

the option of running two redundant NameNodes in the same cluster in an Active/Passive

configuration with a hot standby.

</td>

<td width="20%"><a href="http://hadoop.apache.org/">1. hadoop.apache.org</a>

<a href="http://research.google.com/archive/gfs.html">2. Google FileSystem - GFS Paper</a>

<a href="http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/">3. Cloudera Why HDFS</a>

<a href="http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/">4. Hortonworks Why HDFS</a>

</td>

</tr>

<tr>

<td width="20%">Red Hat GlusterFS</td>

<td>

GlusterFS is a scale-out network-attached storage file system. GlusterFS was

developed originally by Gluster, Inc., then by Red Hat, Inc., after their

purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was

announced as a commercially-supported integration of GlusterFS with

Red Hat Enterprise Linux. The Gluster File System is now known as Red Hat Storage Server.

</td>

<td width="20%"><a href="http://www.gluster.org/">1. www.gluster.org</a>

<a href="http://www.redhat.com/about/news/archive/2013/10/red-hat-contributes-apache-hadoop-plug-in-to-the-gluster-community">2. Red Hat Hadoop Plugin</a>

</td>

</tr>

<tr>

<td width="20%">Quantcast File System QFS</td>

<td>

QFS is an open-source distributed file system software package for

large-scale MapReduce or other batch-processing workloads. It was

designed as an alternative to Apache Hadoop’s HDFS, intended to deliver

better performance and cost-efficiency for large-scale processing clusters.

It is written in C++ and has fixed-footprint memory management. QFS uses

Reed-Solomon error correction as a method for assuring reliable access to data.

Reed–Solomon coding is very widely used in mass storage systems to correct the burst

errors associated with media defects. Rather than storing three full versions of

each file like HDFS, resulting in the need for three times more storage, QFS

only needs 1.5x the raw capacity because it stripes data across nine different disk drives.

</td>

<a href="https://github.com/quantcast/qfs">2. GitHub QFS</a>

<a href="https://issues.apache.org/jira/browse/HADOOP-8885">3. HADOOP-8885</a>

</td>

</tr>
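The storage comparison in the QFS entry can be checked with a little arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the split of the nine stripes into six data and three parity stripes is inferred from the 1.5x figure rather than stated in the entry.

```python
def raw_capacity_factor(data_units: int, redundancy_units: int) -> float:
    """Return raw bytes stored per byte of user data."""
    if data_units <= 0 or redundancy_units < 0:
        raise ValueError("data_units must be positive, redundancy_units non-negative")
    return (data_units + redundancy_units) / data_units


# Three full copies of each file: one original plus two extra replicas.
replication_factor = raw_capacity_factor(data_units=1, redundancy_units=2)

# Nine stripes at 1.5x overhead implies 6 data + 3 parity stripes (an inference).
striped_factor = raw_capacity_factor(data_units=6, redundancy_units=3)

print(f"replication: {replication_factor}x, striped: {striped_factor}x")
```

Under these assumptions, the same user data needs half the raw capacity with striping and parity as with three full copies.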

<tr>

<td width="30%">Ceph Filesystem</td>

<td>

Ceph is a free software storage platform designed to present object, block,

and file storage from a single distributed computer cluster. Ceph's main

goals are to be completely distributed without a single point of failure,

scalable to the exabyte level, and freely-available. The data is replicated,

making it fault tolerant.

</td>

<td width="20%"><a href="http://ceph.com/ceph-storage/file-system/">1. Ceph Filesystem site</a>

<a href="http://ceph.com/docs/next/cephfs/hadoop/">2. Ceph and Hadoop</a>

<a href="https://issues.apache.org/jira/browse/HADOOP-6253">3. HADOOP-6253</a>

</td>

</tr>

<tr>

<td width="30%">Lustre file system</td>

<td>

The Lustre filesystem is a high-performance distributed filesystem

intended for larger network and high-availability environments.

Traditionally, Lustre is configured to manage remote data storage

disk devices within a Storage Area Network (SAN), which is two or

more remotely attached disk devices communicating via a Small Computer

System Interface (SCSI) protocol. This includes Fibre Channel, Fibre

Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.

With Hadoop HDFS, the software needs a dedicated cluster of computers

on which to run. But folks who run high performance computing clusters

for other purposes often don't run HDFS, which leaves them with a bunch

of computing power, tasks that could almost certainly benefit from a bit

of map reduce and no way to put that power to work running Hadoop. Intel has

noticed this and, in version 2.5 of its Hadoop distribution that it quietly

released last week, has added support for Lustre: the Intel® HPC Distribution

for Apache Hadoop* Software, a new product that combines Intel Distribution

for Apache Hadoop software with Intel® Enterprise Edition for Lustre software.

This is the only distribution of Apache Hadoop that is integrated with Lustre,

the parallel file system used by many of the world's fastest supercomputers.

</td>

<td width="20%"><a href="http://wiki.lustre.org/">1. wiki.lustre.org/</a>

<a href="http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre">2. Hadoop with Lustre</a>

<a href="http://hadoop.intel.com/products/distribution">3. Intel HPC Hadoop</a>

</td>

</tr>

<tr>

<td width="30%">Alluxio</td>

<td>

Alluxio, the world’s first memory-centric virtual distributed storage system, unifies data access

and bridges computation frameworks and underlying storage systems. Applications only need to connect

with Alluxio to access data stored in any underlying storage system. Additionally, Alluxio’s

memory-centric architecture enables data access orders of magnitude faster than existing solutions.

In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark,

Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3,

OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvement

to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times.

Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems.

Users can run Alluxio using its standalone cluster mode, for example on Amazon EC2, or launch Alluxio

with Apache Mesos or Apache YARN.

Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of

Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at

multiple companies. It is one of the fastest growing open source projects. With less than three years

of open source history, Alluxio has attracted more than 160 contributors from over 50 institutions,

including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo.

The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the

Fedora distribution.

</td>

<td width="20%"><a href="http://www.alluxio.org/">1. Alluxio site</a>

</td>

</tr>

<tr>

<td width="30%">GridGain</td>

<td>

GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the

In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and Map/Reduce by bringing both data

and computations into memory. This work is done with GGFS, a Hadoop-compliant in-memory file system.

For I/O intensive jobs GridGain GGFS offers performance close to 100x faster than standard HDFS.

Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS regarding Tachyon:

<ul>

<li>GGFS allows read-through and write-through to/from underlying HDFS or any

other Hadoop-compliant file system with zero code change. Essentially GGFS

entirely removes the ETL step from integration.</li>

<li>GGFS has the ability to pick and choose what folders stay in memory, what

folders stay on disk, and what folders get synchronized with the underlying

(HD)FS either synchronously or asynchronously.</li>

<li>GridGain is working on adding a native MapReduce component which will

provide native complete Hadoop integration without changes in API, like

Spark currently forces you to do. Essentially GridGain MR+GGFS will allow

bringing Hadoop completely or partially in-memory in Plug-n-Play fashion

without any API changes.</li>

</ul>

</td>

<td width="20%"><a href="http://www.gridgain.org/">1. GridGain site</a>

</td>

</tr>

<tr>

<td width="30%">XtreemFS</td>

<td>

XtreemFS is a general purpose storage system and covers most storage needs in a single deployment.

It is open-source, requires no special hardware or kernel modules, and can be mounted on Linux,

Windows and OS X.

XtreemFS runs distributed and offers resilience through replication. XtreemFS Volumes can be accessed

through a FUSE component that offers normal file interaction with POSIX-like semantics. Furthermore, an

implementation of Hadoop's FileSystem interface is included which makes XtreemFS available for use with

Hadoop, Flink and Spark out of the box.

XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by Zuse Institute Berlin.

The development of the project has been funded by the European Commission since 2006 under

Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid,

"First We Take Berlin", FFMK, GeoMultiSens, and BBDC.

</td>

<td width="20%"><a href="http://www.xtreemfs.org/">1. XtreemFS site</a>

<a href="https://github.com/xtreemfs/xtreemfs/wiki/Apache-Flink-with-XtreemFS">2. Flink on XtreemFS</a>

<a href="https://github.com/xtreemfs/xtreemfs/wiki/Apache-Spark-with-XtreemFS">3. Spark XtreemFS</a>

</td>

</tr>

<tr>

<th colspan="3">Distributed Programming</th>

</tr>

<tr>

<td width="20%">Apache Ignite</td>

<td>

Apache Ignite In-Memory Data Fabric is a distributed in-memory platform

for computing and transacting on large-scale data sets in real-time.

It includes a distributed key-value in-memory store, SQL capabilities,

map-reduce and other computations, distributed data structures,

continuous queries, messaging and events subsystems, Hadoop and Spark integration.

Ignite is built in Java and provides .NET and C++ APIs.

</td>

<td width="20%"><a href="http://ignite.apache.org/">1. Apache Ignite</a>

<a href="https://apacheignite.readme.io/">2. Apache Ignite documentation</a>

</td>

</tr>

<tr>

<td width="20%">Apache MapReduce</td>

<td>

MapReduce is a programming model for processing large data sets with a parallel,

distributed algorithm on a cluster. Apache MapReduce was derived from the Google paper

“MapReduce: Simplified Data Processing on Large Clusters”. The current

Apache MapReduce version is built on the Apache YARN framework. YARN stands

for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates

writing arbitrary distributed processing frameworks and applications. YARN’s

execution model is more generic than the earlier MapReduce implementation.

YARN can run applications that do not follow the MapReduce model, unlike the

original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt

to take Apache Hadoop beyond MapReduce for data processing.

</td>

<td width="20%"><a href="http://wiki.apache.org/hadoop/MapReduce/">1. Apache MapReduce</a>

<a href="http://research.google.com/archive/mapreduce.html">2. Google MapReduce paper</a>

<a href="http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">3. Writing YARN applications</a>

</td>

</tr>
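The MapReduce programming model described in this entry can be illustrated with a minimal single-process sketch. It assumes nothing about the Hadoop API: the names `mapper`, `reducer`, and `run_mapreduce` are hypothetical, and a real cluster would spread the map, shuffle, and reduce steps across many machines.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_mapreduce(["the quick fox", "the lazy dog"], mapper, reducer)
print(counts)
```

Because each mapper call sees only one record and each reducer call sees only one key, both steps can in principle run in parallel, which is the property the model relies on.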

<tr>

<td width="20%">Apache Pig</td>

<td>

Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language,

Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the

traditional data operations (join, sort, filter, etc.), as well as the ability for users

to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop.

It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce.

Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts

that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks

different from many of the programming languages you have seen. There are no if statements or for

loops in Pig Latin. This is because traditional procedural and object-oriented programming languages

describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.

</td>

<td width="20%"><a href="https://pig.apache.org/">1. pig.apache.org/</a>

<a href="https://github.com/alanfgates/programmingpig">2. Pig examples by Alan Gates</a>

</td>

</tr>
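The data-flow style this entry contrasts with control flow can be approximated in Python as a chain of filter, group, and sort steps. This is only an analogy, not Pig Latin syntax: the helper names and the sample `visits` relation are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input relation: (user, page, minutes spent).
visits = [
    ("alice", "/home", 3),
    ("bob", "/home", 1),
    ("alice", "/docs", 4),
    ("carol", "/docs", 5),
    ("bob", "/docs", 2),
]


def filter_by(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return filter(predicate, rows)


def group_sum(rows, key_index, value_index):
    """Group rows by one field and sum another field within each group."""
    ordered = sorted(rows, key=itemgetter(key_index))
    return [
        (key, sum(row[value_index] for row in group))
        for key, group in groupby(ordered, key=itemgetter(key_index))
    ]


def order_by(rows, index, descending=False):
    """Sort rows by one field."""
    return sorted(rows, key=itemgetter(index), reverse=descending)


def top_users(rows):
    """Each step consumes the previous relation and produces a new one."""
    long_visits = filter_by(rows, lambda row: row[2] >= 2)
    totals = group_sum(long_visits, key_index=0, value_index=2)
    return order_by(totals, index=1, descending=True)


print(top_users(visits))
```

The body of `top_users` names relations and the transformations between them, which is roughly how a Pig Latin script is organized, while the looping stays hidden inside the helpers.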

<tr>

<td>

JAQL is a functional, declarative programming language designed especially for working with large

volumes of structured, semi-structured and unstructured data. As its name implies, a primary

use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data.

For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL"

capability lets programmers work with structured SQL data while employing a JSON data model that's less

restrictive than its Structured Query Language counterparts.

Specifically, JAQL allows you to select, join, group, and filter data that is stored in HDFS, much

like a blend of Pig and Hive. JAQL’s query language was inspired by many programming and query languages,

including Lisp, SQL, XQuery, and Pig.

JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While it continues

to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license,

the major development activity around JAQL has remained centered at IBM. The company offers the query language

as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a

workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs.

It also provides links to external data and services, including relational databases and machine learning data.

</td>

<td width="20%"><a href="https://code.google.com/p/jaql/">1. JAQL in Google Code</a>

</td>

</tr>

294

295

<tr>

296

<td width="20%">Apache Spark</td>

297

<td>

298

Data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley.

299

Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).

300

However, Spark provides an easier to use alternative to Hadoop MapReduce and offers performance up to 10 times

301

faster than previous generation systems like Hadoop MapReduce for certain applications.

302

Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce

303

does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with

304

Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel),

305

and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.

306

To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark

307

interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark,

308

a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.

309

</td>

310

<td width="20%"><a href="http://spark.apache.org/">1. Apache Spark</a>

311

<a href="https://github.com/apache/spark">2. Mirror of Spark on Github</a>

312

<a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">3. RDDs - Paper</a>

313

<a href="https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf">4. Spark: Cluster Computing... - Paper</a>

314

<a href="http://spark.apache.org/research.html">5. Spark Research</a>

315

</td>

316

</tr>

317

318

<tr>

319

<td width="20%">Apache Storm</td>

320

<td>

321

Storm is a complex event processor (CEP) and distributed computation

322

framework written predominantly in the Clojure programming language.

323

It is a distributed real-time computation system for processing fast,
large streams of data. Storm's architecture is based on the master-worker
paradigm, so a Storm cluster mainly consists of a master node and worker
nodes, with coordination done by ZooKeeper.

327

Storm makes use of ZeroMQ (also written 0MQ), an advanced, embeddable

328

networking library. It provides a message queue, but unlike

329

message-oriented middleware (MOM), a 0MQ system can run without

330

a dedicated message broker. The library is designed to have a

331

familiar socket-style API.

332

Originally created by Nathan Marz and his team at BackType, the
project was open sourced after BackType was acquired by Twitter. Storm
was initially developed and deployed at BackType in 2011. After
seven months of development, BackType was acquired by Twitter in July
2011, and Storm was open sourced in September 2011.

337

Hortonworks is developing a Storm-on-YARN version and plans
to finish the base-level integration in 2013 Q4.
Yahoo/Hortonworks also plans to move the Storm-on-YARN
code from github.com/yahoo/storm-yarn to be a subproject of
the Apache Storm project in the near future.

342

Twitter has recently released a Hadoop-Storm hybrid called
“Summingbird.” Summingbird fuses the two frameworks into one,
allowing developers to use Storm for short-term processing
and Hadoop for deep data dives. The system aims to mitigate
the tradeoffs between batch processing and stream processing by
combining them into a hybrid system.

348

</td>

349

<td width="20%"><a href="http://storm-project.net/">1. Storm Project</a>

350

<a href="https://github.com/yahoo/storm-yarn">2. Storm-on-YARN</a>

351

</td>

352

</tr>

353

354

<tr>

355

<td width="20%">Apache Flink</td>

356

<td>

357

Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala,

358

a high-performance runtime, and automatic program optimization. It has native support for iterations,

359

incremental iterations, and programs consisting of large DAGs of operations.

360

Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with

361

its own runtime, rather than building on top of MapReduce. As such, it can work completely independently

362

of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read

363

and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources.

364

Since most Flink users store their data in Hadoop HDFS, Flink already ships with the libraries required to access HDFS.

365

</td>

366

<td width="20%"><a href="http://flink.incubator.apache.org/">1. Apache Flink incubator page</a>

367

<a href="http://stratosphere.eu/">2. Stratosphere site</a>

368

</td>

369

</tr>

370

371

<tr>

372

<td width="20%">Apache Apex</td>

373

<td>

374

Apache Apex is an enterprise grade Apache YARN based big data-in-motion platform that

375

unifies stream processing as well as batch processing. It processes big data

376

in-motion in a highly scalable, highly performant, fault tolerant, stateful,

377

secure, distributed, and easily operable way. It provides a simple API that

378

enables users to write or re-use generic Java code, thereby lowering the expertise

379

needed to write big data applications.

380

The Apache Apex platform is supplemented by Apache Apex-Malhar,

381

which is a library of operators that implement common business logic

382

functions needed by customers who want to quickly develop applications.

383

These operators provide access to HDFS, S3, NFS, FTP, and other file systems;

384

Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySQL, Cassandra,

385

MongoDB, Redis, HBase, CouchDB and other databases along with JDBC connectors.

386

The library also includes a host of other common business logic patterns that

387

help users to significantly reduce the time it takes to go into production.

388

Ease of integration with all other big data technologies is one of the primary

389

missions of Apache Apex-Malhar.

390

Apex, available on GitHub, is the core technology underlying DataTorrent's
commercial offering, DataTorrent RTS 3, along with other technology such as
a data ingestion tool called dtIngest.

393

</td>

394

<td width="20%"><a href="https://www.datatorrent.com/apex/">1. Apache Apex from DataTorrent</a>

395

<a href="http://apex.incubator.apache.org/">2. Apache Apex main page</a>

396

<a href="https://wiki.apache.org/incubator/ApexProposal">3. Apache Apex Proposal</a>

397

</td>

398

</tr>

399

400

<tr>

401

<td width="20%">Netflix PigPen</td>

402

<td>

403

PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming
language created by Rich Hickey; it is a functional, general-purpose language that runs on the Java Virtual Machine,
Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs):
define Clojure functions, anonymously or named, and use them as you would in any Clojure program. This tool
was open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.

408

</td>

409

<td width="20%"><a href="https://github.com/Netflix/PigPen">1. PigPen on GitHub</a>

410

</td>

411

</tr>

412

413

<tr>

414

<td width="20%">AMPLab SIMR</td>

415

<td>

416

Apache Spark was developed with Apache YARN in mind. However, up to now, it has been relatively hard to run

417

Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically,

418

users would have to get permission to install Spark/Scala on some subset of the machines, a process that

419

could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out

420

of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights,

421

and without having Spark or Scala installed on any of the nodes.

422

</td>

423

<td width="20%"><a href="http://databricks.github.io/simr/">1. SIMR on GitHub</a>

424

</td>

425

</tr>

426

427

<tr>

428

<td width="20%">Facebook Corona</td>

429

<td>

430

“The next version of Map-Reduce” from Facebook, based on its own fork of Hadoop. The current Hadoop implementation

431

of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets.

432

The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook

433

engineers looked at but discounted because of the highly-customised nature of the company's deployment of Hadoop and HDFS.

434

Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case).

435

</td>

436

<td width="20%"><a href="https://github.com/facebookarchive/hadoop-20/tree/master/src/contrib/corona">1. Corona on Github</a>

437

</td>

438

</tr>

439

440

<tr>

441

<td width="20%">Apache REEF</td>

442

<td>

443

Apache REEF™ (Retainable Evaluator Execution Framework) is a library for developing portable

444

applications for cluster resource managers such as Apache Hadoop™ YARN or Apache Mesos™.

445

Apache REEF drastically simplifies development for those resource managers through the following features:

446

447

<ul>

448

<li>

449

Centralized Control Flow: Apache REEF turns the chaos of a distributed application into events in a

450

single machine, the Job Driver. Events include container allocation, Task launch, completion and

451

failure. For failures, Apache REEF makes every effort to make the actual `Exception` thrown by the

452

Task available to the Driver.

453

</li>

454

<li>

455

Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in

456

every container of a REEF application. Evaluators can keep data in memory in between Tasks, which

457

enables efficient pipelines on REEF.

458

</li>

459

<li>

460

Support for multiple resource managers: Apache REEF applications are portable to any supported resource

461

manager with minimal effort. Further, new resource managers are easy to support in REEF.

462

</li>

463

<li>

464

.NET and Java API: Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a

465

single REEF application is free to mix and match Tasks written for .NET or Java.

466

</li>

467

<li>

468

Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding

469

bloat to the core. REEF includes many Services, such as name-based communications between Tasks,

470

MPI-inspired group communications (Broadcast, Reduce, Gather, ...) and data ingress.

471

</li>

472

</ul>

473

</td>

474

<td width="20%"><a href="https://reef.apache.org">1. Apache REEF Website</a>

475

</td>

476

</tr>

477

478

<tr>

479

<td width="20%">Apache Twill</td>

480

<td>

481

Twill is an abstraction over Apache Hadoop® YARN that reduces the

482

complexity of developing distributed applications, allowing developers

483

to focus more on their business logic. Twill uses a simple thread-based model that Java

484

programmers will find familiar. YARN can be viewed as a compute

485

fabric of a cluster, which means YARN applications like Twill will

486

run on any Hadoop 2 cluster.

487

YARN is an open source application that allows the Hadoop cluster

488

to turn into a collection of virtual machines. Weave, developed by

489

Continuuity and initially housed on Github, is a complementary open

490

source application that uses a programming model similar to Java

491

threads, making it easy to write distributed applications. In order to remove

492

a conflict with a similarly named project on Apache, called "Weaver,"

493

Weave's name changed to Twill when it moved to Apache incubation.

494

Twill functions as a scaled-out proxy. Twill is a middleware layer

495

in between YARN and any application on YARN. When you develop a

496

Twill app, Twill handles the YARN APIs behind a model that resembles a multi-threaded Java application.

497

It is very easy to build multi-processed distributed applications in Twill.

498

</td>

499

<td width="20%"><a href="https://incubator.apache.org/projects/twill.html">1. Apache Twill Incubator</a>

500

</td>

501

</tr>

502

503

<tr>

504

<td width="20%">Damballa Parkour</td>

505

<td>

506

Library for developing MapReduce programs in Clojure, a Lisp-like language. Parkour aims to provide deep Clojure

507

integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions

508

instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete

509

access to absolutely everything possible in raw Java Hadoop MapReduce.

510

</td>

511

<td width="20%"><a href="https://github.com/damballa/parkour">1. Parkour GitHub Project</a>

512

</td>

513

</tr>

514

515

<tr>

516

<td width="20%">Apache Hama</td>

517

<td>

518

Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce. Many data

519

analysis techniques such as machine learning and graph algorithms require iterative computations,

520

which is where the Bulk Synchronous Parallel (BSP) model can be more effective than "plain" MapReduce.
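To illustrate why a Bulk Synchronous Parallel model suits iterative computations, here is a minimal single-process sketch in Python. It is not Hama's API: the function name `bsp_max` and the example graph are hypothetical. Each superstep has a communication phase, a barrier, and a local computation phase, and the loop ends when no vertex changes.

```python
def bsp_max(graph, values):
    """Propagate the maximum value to every vertex using BSP supersteps.

    graph maps each vertex to its neighbours; values maps each vertex to a number.
    Returns the final values and the number of supersteps that were run.
    """
    state = dict(values)
    active = set(graph)          # vertices that still have something to send
    supersteps = 0
    while active:
        # Communication phase: active vertices send their value to neighbours.
        outbox = {v: [] for v in graph}
        for v in active:
            for n in graph[v]:
                outbox[n].append(state[v])
        # Barrier: every message is delivered before any vertex computes again.
        inbox = outbox
        # Local computation phase: each vertex reads only its own inbox.
        active = set()
        for v, messages in inbox.items():
            best = max(messages, default=state[v])
            if best > state[v]:
                state[v] = best
                active.add(v)
        supersteps += 1
    return state, supersteps


# Hypothetical four-vertex chain: a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 1, "c": 4, "d": 2}
state, supersteps = bsp_max(graph, values)
print(state, supersteps)
```

With plain MapReduce, each of these iterations would typically be a separate job that rereads its input, which is the overhead the BSP model avoids.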

521

</td>

522
<td width="20%">

523

</td>

524

</tr>

525

526

<tr>

527

<td width="20%">Datasalt Pangool</td>

528

<td>

529

A new MapReduce paradigm. A new API for MR jobs, at a higher level than the plain Java API.

530

</td>

531

<td width="20%"><a href="http://pangool.net">1. Pangool</a>
<a href="https://github.com/datasalt/pangool">2. GitHub Pangool</a>

533

</td>

534

</tr>

535

536

<tr>

537

<td width="20%">Apache Tez</td>

538

<td>

539

Tez is a proposal to develop a generic application framework that processes complex DAGs of data-processing
tasks and runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce paradigm to a more

541

powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for

542

end-users – in fact it enables developers to build end-user applications with much better performance

543

and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data.

544

However, there are a lot of use cases for near-real-time performance of query processing. There are also

545

several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps

546

Hadoop address these use cases. The Tez framework is part of the Stinger initiative (a low-latency,
SQL-type query interface for Hadoop based on Hive).

548

</td>

549

<td width="20%"><a href="http://incubator.apache.org/projects/tez.html">1. Apache Tez Incubator</a>

550

<a href="http://hortonworks.com/hadoop/tez/">2. Hortonworks Apache Tez page</a>

551

</td>

552

</tr>

553

554

<tr>

555

<td width="20%">Apache DataFu</td>

556

<td>

557

DataFu provides a collection of Hadoop MapReduce jobs and functions in higher level languages based

558

on it to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles,

559

sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop

560

jobs for incremental data processing in MapReduce. DataFu is a collection of Pig UDFs (including PageRank,

561

sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn.

562

</td>

563

<td width="20%"><a href="http://incubator.apache.org/projects/datafu.html">1. DataFu Apache Incubator</a>

564

</td>

565

</tr>

566

567

<tr>

568

<td width="20%">Pydoop</td>

569

<td>

570

Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++

571

Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce

572

applications with HDFS access. Pydoop has several advantages over Hadoop’s built-in

573

solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython

574

package, it allows you to access all standard library and third party modules,

575

some of which may not be available under Jython.

576

</td>

577

<td width="20%"><a href="http://pydoop.sourceforge.net/docs/">1. SF Pydoop site</a>

578

<a href="https://github.com/crs4/pydoop">2. Pydoop GitHub Project</a>

579

</td>

580

</tr>

581

582

<tr>

583

<td width="20%">Kangaroo</td>

584

<td>

585

Open-source project from Conductor for writing MapReduce jobs consuming data from Kafka.

586

The introductory post explains Conductor’s use case—loading data from Kafka to HBase

587

by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions

588

which are limited to a single InputSplit per Kafka partition, Kangaroo can launch

589

multiple consumers at different offsets in the stream of a single partition for

590

increased throughput and parallelism.

591

</td>

592

<td width="20%"><a href="http://www.conductor.com/nightlight/data-stream-processing-bulk-kafka-hadoop/">1. Kangaroo Introduction</a>

593

<a href="https://github.com/Conductor/kangaroo">2. Kangaroo GitHub Project</a>

594

</td>

595

</tr>

596

597

<tr>

598

<td width="20%">TinkerPop</td>

599

<td>

600

Graph computing framework written in Java. Provides a core API that graph system vendors can implement.

601

There are various types of graph systems including in-memory graph libraries, OLTP graph databases,

602

and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system

603

can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled

604

algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community.

605

</td>

606

<td width="20%"><a href="https://wiki.apache.org/incubator/TinkerPopProposal">1. Apache Tinkerpop Proposal</a>

607

<a href="http://www.tinkerpop.com/">2. TinkerPop site</a>

608

</td>

609

</tr>

610

611

<tr>

612

<td width="20%">Pachyderm MapReduce</td>

613

<td>

614

Pachyderm is a completely new MapReduce engine built on top of Docker and CoreOS.

615

In Pachyderm MapReduce (PMR) a job is an HTTP server inside a Docker container

616

(a microservice). You give Pachyderm a Docker image and it will automatically

617

distribute it throughout the cluster next to your data. Data is POSTed to

618

the container over HTTP and the results are stored back in the file system.

619

You can implement the web server in any language you want and pull in any library.

620

Pachyderm also creates a DAG for all the jobs in the system and their dependencies

621

and it automatically schedules the pipeline such that each job isn’t run until its

622

dependencies have completed. Everything in Pachyderm “speaks in diffs” so it knows

623

exactly which data has changed and which subsets of the pipeline need to be rerun.
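The dependency-aware scheduling described here can be sketched with a topological sort. This is a minimal Python illustration, not Pachyderm's implementation: the job names and the helpers `run_order` and `jobs_to_rerun` are hypothetical. It shows an order in which no job runs before its dependencies, and which part of the pipeline must be rerun after a change.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "stats": {"clean"},
    "report": {"stats", "clean"},
}


def run_order(deps):
    """Return an order in which every job runs after all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def jobs_to_rerun(deps, changed):
    """Return the changed jobs plus every job downstream of them."""
    dirty = set(changed)
    # Walking in dependency order guarantees upstream jobs are marked first.
    for job in run_order(deps):
        if not deps.get(job, set()).isdisjoint(dirty):
            dirty.add(job)
    return dirty


order = run_order(pipeline)
print(order)
print(sorted(jobs_to_rerun(pipeline, {"stats"})))  # ['report', 'stats']
```

`graphlib` is part of the Python standard library from version 3.9 onward.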

624

CoreOS is an open source lightweight operating system that started as
a fork of Chrome OS. CoreOS provides only the minimal functionality
required for deploying applications inside software containers, together with
built-in mechanisms for service discovery and configuration sharing.

628

</td>

629

<td width="20%"><a href="http://www.pachyderm.io/">1. Pachyderm site</a>

630

<a href="https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f">2. Pachyderm introduction article</a>

631

</td>

632

</tr>

633

634

<tr>

635

<td width="20%">Apache Beam</td>

636

<td>

637

Apache Beam is an open source, unified model for defining and executing

638

data-parallel processing pipelines, as well as a set of language-specific

639

SDKs for constructing pipelines and runtime-specific Runners for executing them.

640

The model behind Beam evolved from a number of internal Google

641

data processing projects, including MapReduce, FlumeJava, and

642

Millwheel. This model was originally known as the “Dataflow Model”

643

and first implemented as Google Cloud Dataflow, including a Java SDK

644

on GitHub for writing pipelines and fully managed service for

645

executing them on Google Cloud Platform.

646

In January 2016, Google and a number of partners submitted the Dataflow

647

Programming Model and SDKs portion as an Apache Incubator Proposal,

648

under the name Apache Beam (unified Batch + strEAM processing).

649

</td>

650

<td width="20%"><a href="https://wiki.apache.org/incubator/BeamProposal">1. Apache Beam Proposal</a>

651

<a href="https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison">2. Dataflow Beam and Spark Comparison</a>

652

</td>

653

</tr>

654

655

656

657

658

<tr>

659

<th colspan="3">NoSQL Databases</th>

660

</tr>

661

662

<tr>

663

<th colspan="3" style="background-color:#0099FF;">Column Data Model</th>

664

</tr>

665

666

<tr>

667

<td width="20%">Apache HBase</td>

668

<td>

669

Inspired by Google BigTable. Non-relational distributed database.
Random, real-time read/write operations on very large column-oriented
tables (BDDB: Big Data Data Base). It is the Hadoop database, and it serves
as the backing store for the output of Hadoop MapReduce jobs.

674

</td>

675

<td width="20%"><a href="https://hbase.apache.org/">1. Apache HBase Home</a>

676

<a href="https://github.com/apache/hbase">2. Mirror of HBase on Github</a>

677

</td>

678

</tr>

679

680

<tr>

681

<td width="20%">Apache Cassandra</td>

682

<td>

683

Distributed Non-SQL DBMS, it’s a BDDB. MR can retrieve data from Cassandra.

684

This BDDB can run without HDFS, or on-top of HDFS (DataStax fork of Cassandra).

685

HBase and its required supporting systems are derived from what is known of

686

the original Google BigTable and Google File System designs (as known from the

687

Google File System paper Google published in 2003, and the BigTable paper published

688

in 2006). Cassandra on the other hand is a recent open source fork of a standalone

689

database system initially coded by Facebook, which while implementing the BigTable

690

data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact

691

much of the initial development work on Cassandra was performed by two Dynamo

692

engineers recruited to Facebook from Amazon).

693

</td>

694
<td width="20%">

695

<a href="http://cassandra.apache.org" target="_blank">1. Apache Cassandra Home</a>

696

<a href="https://github.com/apache/cassandra" target="_blank">2. Cassandra on GitHub</a>

697

<a href="https://academy.datastax.com" target="_blank">3. Training Resources</a>

698

<a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf" target="_blank">4. Cassandra - Paper</a>

699

</td>

700

</tr>

701

702

<tr>

703

<td width="20%">Hypertable</td>

704

<td>

705

Database system inspired by publications on the design of Google's

706

BigTable. The project is based on experience of engineers who were

707

solving large-scale data-intensive tasks for many years. Hypertable

708

runs on top of a distributed file system such as the Apache Hadoop DFS,

709

GlusterFS, or the Kosmos File System (KFS). It is written almost entirely

710

in C++. Sponsored by Baidu, the Chinese search engine.

711

</td>

712

713

</tr>

714

715

<tr>

716

<td width="20%">Apache Accumulo</td>

717

<td>

718

Apache Accumulo is a distributed key/value store: a robust, scalable, high performance

719

data storage and retrieval system. Apache Accumulo is based on Google's

720

BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift.

721

Accumulo is software created by the NSA with security features.

722

</td>

723

<td width="20%"><a href="https://accumulo.apache.org/">1. Apache Accumulo Home</a>

724

</td>

725

</tr>

726

727

<tr>

728

<td width="20%">Apache Kudu</td>

729

<td>

730

Distributed, columnar, relational data store optimized for analytical use cases requiring

731

very fast reads with competitive write speeds.

732

<ul>

733

<li>Relational data model (tables) with strongly-typed columns and a fast, online alter table operation.</li>

734

<li>Scale-out and sharded with support for partitioning based on key ranges and/or hashing.</li>

735

<li>Fault-tolerant and consistent due to its implementation of Raft consensus.</li>

736

<li>Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.</li>

737

<li>Integrates with MapReduce and Spark.</li>

738

<li>Additionally provides "NoSQL" APIs in Java, Python, and C++.</li>

739

</ul>
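The partitioning item in the list above (key ranges and/or hashing) can be illustrated with a small Python sketch. This is not Kudu's API: the bucket count, the split points, and the function names are assumptions chosen for the example. A row is placed by hashing one key column and by locating another key column within a set of ranges.

```python
import bisect
import zlib

HASH_BUCKETS = 4                      # assumed number of hash buckets
RANGE_BOUNDS = [2016, 2017, 2018]     # assumed split points on the range key


def hash_bucket(key, buckets=HASH_BUCKETS):
    """Map a string key to a bucket with a stable (non-randomized) hash."""
    return zlib.crc32(key.encode("utf-8")) % buckets


def range_partition(value, bounds=RANGE_BOUNDS):
    """Return the index of the key range that contains value."""
    return bisect.bisect_right(bounds, value)


def partition_for(host, year):
    """Combine both schemes: (hash bucket of host, range index of year)."""
    return hash_bucket(host), range_partition(year)


print(partition_for("web-01", 2017))
```

Hashing spreads writes evenly across buckets, while ranges keep rows with neighbouring keys together so that scans over a key interval touch fewer partitions.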

740

</td>

741

<td width="20%"><a href="http://getkudu.io/">1. Apache Kudu Home</a>

742

<a href="http://github.com/cloudera/kudu">2. Kudu on Github</a>

743

<a href="http://getkudu.io/kudu.pdf">3. Kudu technical whitepaper (pdf)</a>

744

</td>

745

</tr>

746

747

<tr>

748

<td width="20%">Apache Parquet</td>

749

<td>

750

Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of

751

data processing framework, data model or programming language.

752

</td>

753

<td width="20%"><a href="https://parquet.apache.org">1. Apache Parquet Home</a>

754

<a href="https://github.com/apache/parquet-mr">2. Apache Parquet on Github</a>

755

</td>

756

</tr>

757

758

<tr>

759

<th colspan="3" style="background-color:#0099FF;">Document Data Model</th>

760

</tr>

761

762

<tr>

763

<td width="20%">MongoDB</td>

764

<td>

765

Document-oriented database system. It is part of the NoSQL family of

766

database systems. Instead of storing data in tables as is done in a "classical"

767

relational database, MongoDB stores structured data as JSON-like documents.

768

</td>

769

<td width="20%"><a href="http://www.mongodb.org/">1. MongoDB site</a>

770

</td>

771

</tr>

772

773

<tr>

774

<td width="20%">RethinkDB</td>

775

<td>

776

RethinkDB is built to store JSON documents, and scale to multiple

777

machines with very little effort. It has a pleasant query language

778

that supports really useful queries like table joins and group by,

779

and is easy to set up and learn.

780

</td>

781

<td width="20%"><a href="http://www.rethinkdb.com/">1. RethinkDB site</a>

782

</td>

783

</tr>

784

785

<tr>

786

<td width="20%">ArangoDB</td>

787

<td>

788

An open-source database with a flexible data model for documents, graphs,

789

and key-values. Build high performance applications using a convenient

790

SQL-like query language or JavaScript extensions.

791

</td>

792

<td width="20%"><a href="https://www.arangodb.org/">1. ArangoDB site</a>

793

</td>

794

</tr>

795

796

<tr>

797

<th colspan="3" style="background-color:#0099FF;">Stream Data Model</th>

798

</tr>

799

800

<tr>

801

<td width="20%">EventStore</td>

802

<td>

803

An open-source, functional database with support for Complex Event Processing.

804

It provides a persistence engine for applications using event-sourcing, or for

805

storing time-series data. The Event Store server is written in C# and C++ and
runs on Mono or the .NET CLR, on Linux or Windows.

807

Applications using Event Store can be written in JavaScript. Event sourcing (ES)

808

is a way of persisting your application's state by storing the history that determines

809

the current state of your application.
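As an illustration of event sourcing, here is a minimal Python sketch. It is not Event Store's API: the names `Event`, `apply`, and `replay` and the account example are hypothetical. The current state is never stored directly; it is rebuilt by folding over the recorded history.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    """One immutable entry in the history (hypothetical shape)."""
    kind: str      # "deposited" or "withdrawn"
    amount: int


def apply(balance, event):
    """Return the new state after applying a single event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(history, initial=0):
    """Rebuild the current state by folding over the full history."""
    return reduce(apply, history, initial)


history = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(history))  # 75
```

Replaying only a prefix of the history yields the state at an earlier point in time, which is one reason to store events rather than the latest state.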

810

</td>

811

<td width="20%"><a href="http://geteventstore.com/">1. EventStore site</a>

812

</td>

813

</tr>

814

815

<tr>

816

<th colspan="3" style="background-color:#0099FF;">Key-value Data Model</th>

817

</tr>

818

819

<tr>

820

<td width="20%">Redis DataBase</td>

821

<td>

822

Redis is an open-source, networked, in-memory, data structures

823

store with optional durability. It is written in ANSI C.

824

In its outer layer, the Redis data model is a dictionary which

825

maps keys to values. One of the main differences between Redis

826

and other structured storage systems is that Redis supports not

827

only strings, but also abstract data types. Sponsored by Redis Labs.

828

It’s BSD licensed.

829

</td>

830

<td width="20%"><a href="http://redis.io/">1. Redis site</a>

831

<a href="http://redislabs.com/">2. Redis Labs site</a>

832

</td>

833

</tr>

834

835

<tr>

836

<td width="20%">LinkedIn Voldemort</td>

837

<td>

838

Distributed data store that is designed as a key-value store used

839

by LinkedIn for high-scalability storage.

840

</td>

841

<td width="20%"><a href="http://www.project-voldemort.com/voldemort/">1. Voldemort site</a>

842

</td>

843

</tr>

844

845

<tr>

846

<td width="20%">RocksDB</td>

847

<td>

848

RocksDB is an embeddable persistent key-value store for fast storage.

849

RocksDB can also be the foundation for a client-server database, but the project's

850

current focus is on embedded workloads.

851

</td>

852

<td width="20%"><a href="http://rocksdb.org/">1. RocksDB site</a>

853

</td>

854

</tr>

855

856

<tr>

857

<td width="20%">OpenTSDB</td>

858

<td>

859

OpenTSDB is a distributed, scalable Time Series Database (TSDB)

860

written on top of HBase. OpenTSDB was written to address a common

861

need: store, index and serve metrics collected from computer systems

862

(network gear, operating systems, applications) at a large scale,

863

and make this data easily accessible and graphable.

864

</td>

865

<td width="20%"><a href="http://opentsdb.net/">1. OpenTSDB site</a>

866

</td>

867

</tr>

868

869

870

871

872

<tr>

873

<th colspan="3" style="background-color:#0099FF;">Graph Data Model</th>

874

</tr>

875

876

<tr>

877

<td width="20%">ArangoDB</td>

878

<td>

879

An open-source database with a flexible data model for documents,

880

graphs, and key-values. Build high performance applications using

881

a convenient SQL-like query language or JavaScript extensions.

882

</td>

883

<td width="20%"><a href="https://www.arangodb.org/">1. ArangoDB site</a>

884

</td>

885

</tr>

886

887

<tr>

888

889

<td>

890

An open-source graph database written entirely in Java. It is an

891

embedded, disk-based, fully transactional Java persistence engine

892

that stores data structured in graphs rather than in tables.

893

</td>

894
<td width="20%">

895

</td>

896

</tr>

897

898

<tr>

899

<td width="20%">TitanDB</td>

900

<td>

901

TitanDB is a highly scalable graph database optimized for storing

902

and querying large graphs with billions of vertices and edges

903

distributed across a multi-machine cluster. Titan is a transactional

904

database that can support thousands of concurrent users.

905

</td>

906

<td width="20%"><a href="http://thinkaurelius.github.io/titan/">1. Titan site</a>

907

</td>

908

</tr>

909

910

911

912

913

<tr>

914

<th colspan="3">NewSQL Databases</th>

915

</tr>

916

917

<tr>

918

<td width="20%">TokuDB</td>

919

<td>

920

TokuDB is a storage engine for MySQL and MariaDB that is specifically

921

designed for high performance on write-intensive workloads. It achieves

922

this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC

923

compliant storage engine. TokuDB is one of the technologies that enable

924

Big Data in MySQL.

925

</td>

926

927

</tr>

928

929

<tr>

930

<td width="20%">HandlerSocket</td>

931

<td>

932

HandlerSocket is a NoSQL plugin for MySQL/MariaDB (MariaDB being a fork
of MySQL). It works as a daemon inside the mysqld process, accepting TCP

934

connections, and executing requests from clients. HandlerSocket does not

935

support SQL queries. Instead, it supports simple CRUD operations on tables.

936

HandlerSocket can be much faster than mysqld/libmysql in some cases because

937

it has lower CPU, disk, and network overhead.

938

</td>

939

940

</tr>

941

942

<tr>

943

<td width="20%">Akiban Server</td>

944

<td>

945

Akiban Server is an open source database that brings document stores and

946

relational databases together. Developers get powerful document access

947

alongside surprisingly powerful SQL.

948

</td>

949

950

</tr>

951

952

<tr>

953

<td width="20%">Drizzle</td>

954

<td>

955

Drizzle is a re-designed version of the MySQL v6.0 codebase and

956

is designed around a central concept of having a microkernel

957

architecture. Features such as the query cache and authentication

958

system are now plugins to the database, which follow the general

959

theme of "pluggable storage engines" that were introduced in MySQL 5.1.

960

It supports PAM, LDAP, and HTTP AUTH for authentication via plugins

961

it ships. Via its plugin system it currently supports logging to files,

962

syslog, and remote services such as RabbitMQ and Gearman. Drizzle

963

is an ACID-compliant relational database that supports

964

transactions via an MVCC design.

965

</td>

966

967

</tr>

<tr>
<td width="20%">Haeinsa</td>
<td>
Haeinsa is a linearly scalable multi-row, multi-table transaction
library for HBase. Use Haeinsa if you need strong ACID semantics
on your HBase cluster. It is based on Google's Percolator concept.
</td>
</tr>

<tr>
<td width="20%">SenseiDB</td>
<td>
Open-source, distributed, realtime, semi-structured database.
Some features: full-text search, fast realtime updates, structured
and faceted search, BQL (a SQL-like query language), fast key-value
lookup, high performance under concurrent heavy update and query
volumes, and Hadoop integration.
</td>
<td width="20%"><a href="http://senseidb.com/">1. SenseiDB site</a>
</td>
</tr>

<tr>
<td width="20%">Sky</td>
<td>
Sky is an open source database used for flexible, high performance
analysis of behavioral data. For certain kinds of data such as
clickstream data and log data, it can be several orders of magnitude
faster than traditional approaches such as SQL databases or Hadoop.
</td>
<td width="20%"><a href="http://skydb.io/">1. SkyDB site</a>
</td>
</tr>

<tr>
<td width="20%">BayesDB</td>
<td>
BayesDB, a Bayesian database table, lets users query the probable
implications of their tabular data as easily as an SQL database
lets them query the data itself. Using the built-in Bayesian Query
Language (BQL), users with no statistics training can solve basic
data science problems, such as detecting predictive relationships
between variables, inferring missing values, simulating probable
observations, and identifying statistically similar database entries.
</td>
<td width="20%"><a href="http://probcomp.csail.mit.edu/bayesdb/index.html">1. BayesDB site</a>
</td>
</tr>

<tr>
<td width="20%">InfluxDB</td>
<td>
InfluxDB is an open source distributed time series database with
no external dependencies. It's useful for recording metrics, events,
and performing analytics. It has a built-in HTTP API so you don't
have to write any server side code to get up and running. InfluxDB
is designed to be scalable, simple to install and manage, and fast
to get data in and out. It aims to answer queries in real-time.
That means every data point is indexed as it comes in and is immediately
available in queries that should return under 100ms.
</td>
<td width="20%"><a href="http://influxdb.org/">1. InfluxDB site</a>
</td>
</tr>

<tr>
<th colspan="3">SQL-on-Hadoop</th>
</tr>

<tr>
<td width="20%">Apache Hive</td>
<td>
Data warehouse infrastructure developed by Facebook, used for data
summarization, query, and analysis. It provides a SQL-like
language (not SQL-92 compliant): HiveQL.
</td>
<td width="20%"><a href="http://hive.apache.org/">1. Apache HIVE site</a>
<a href="https://github.com/apache/hive">2. Apache HIVE GitHub Project</a>
</td>
</tr>

<tr>
<td width="20%">Apache HCatalog</td>
<td>
HCatalog’s table abstraction presents users with a relational view
of data in the Hadoop Distributed File System (HDFS) and ensures
that users need not worry about where or in what format their data
is stored. HCatalog is now part of Hive; only old versions are
available as separate downloads.
</td>
</tr>

<tr>
<td width="20%">Apache Trafodion</td>
<td>
Apache Trafodion is a webscale SQL-on-Hadoop solution enabling
enterprise-class transactional and operational workloads on
HBase. Trafodion is a native MPP ANSI SQL database engine that
builds on the scalability, elasticity and flexibility of HDFS and
HBase, extending these to provide guaranteed transactional
integrity for all workloads including multi-column, multi-row,
multi-table, and multi-server updates.
</td>
<td width="20%"><a href="http://trafodion.incubator.apache.org">1. Apache Trafodion website</a>
<a href="https://cwiki.apache.org/confluence/display/TRAFODION/Apache+Trafodion+Home">2. Apache Trafodion wiki</a>
<a href="https://github.com/apache/incubator-trafodion">3. Apache Trafodion GitHub Project</a>
</td>
</tr>

<tr>
<td width="20%">Apache HAWQ</td>
<td>
Apache HAWQ is a Hadoop-native SQL query engine that combines the
key technological advantages of an MPP database, evolved from Greenplum
Database, with the scalability and convenience of Hadoop.
</td>
<td width="20%"><a href="http://hawq.incubator.apache.org/">1. Apache HAWQ site</a>
<a href="https://github.com/apache/incubator-hawq">2. HAWQ GitHub Project</a>
</td>
</tr>

<tr>
<td width="20%">Apache Drill</td>
<td>
Drill is the open source version of Google's Dremel system, which
is available as an infrastructure service called Google BigQuery.
In recent years open source systems have emerged to address the
need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). Apache Hadoop, originally inspired
by Google's internal MapReduce system, is used by thousands of
organizations processing large-scale datasets. Apache Hadoop is
designed to achieve very high throughput, but is not designed to
achieve the sub-second latency needed for interactive data analysis
and exploration. Drill, inspired by Google's internal Dremel system,
is intended to address this need.
</td>
<td width="20%"><a href="http://incubator.apache.org/drill/">1. Apache Incubator Drill</a>
</td>
</tr>

<tr>
<td width="20%">Cloudera Impala</td>
<td>
The Apache-licensed Impala project brings scalable parallel database
technology to Hadoop, enabling users to issue low-latency SQL queries
to data stored in HDFS and Apache HBase without requiring data movement
or transformation. It is a clone of Google Dremel (the system behind
Google BigQuery).
</td>
<td width="20%"><a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html">1. Cloudera Impala site</a>
<a href="https://github.com/cloudera/impala">2. Impala GitHub Project</a>
</td>
</tr>

<tr>
<td width="20%">Facebook Presto</td>
<td>
Facebook has open sourced Presto, a SQL engine it says is on
average 10 times faster than Hive for running queries across
large data sets stored in Hadoop and elsewhere.
</td>
<td width="20%"><a href="http://prestodb.io/">1. Presto site</a>
</td>
</tr>

<tr>
<td width="20%">Datasalt Splout SQL</td>
<td>
Splout allows serving an arbitrarily big dataset with high QPS
rates and at the same time provides full SQL query syntax.
</td>
</tr>

<tr>
<td width="20%">Apache Tajo</td>
<td>
Apache Tajo is a robust big data relational and distributed data
warehouse system for Apache Hadoop. Tajo is designed for low-latency
and scalable ad-hoc queries, online aggregation, and ETL
(extract-transform-load process) on large data sets stored on
HDFS (Hadoop Distributed File System) and other data sources.
By supporting SQL standards and leveraging advanced database
techniques, Tajo allows direct control of distributed execution
and data flow across a variety of query evaluation strategies
and optimization opportunities. For reference, the Apache
Software Foundation announced Tajo as a Top-Level Project in April 2014.
</td>
<td width="20%"><a href="http://tajo.apache.org/">1. Apache Tajo site</a>
</td>
</tr>

<tr>
<td width="20%">Apache Phoenix</td>
<td>
Apache Phoenix is a SQL skin over HBase delivered as a
client-embedded JDBC driver targeting low latency queries over
HBase data. Apache Phoenix takes your SQL query, compiles it into
a series of HBase scans, and orchestrates the running of those
scans to produce regular JDBC result sets. The table metadata is
stored in an HBase table and versioned, such that snapshot queries
over prior versions will automatically use the correct schema.
Direct use of the HBase API, along with coprocessors and custom
filters, results in performance on the order of milliseconds for
small queries, or seconds for tens of millions of rows.
</td>
<td width="20%"><a href="http://phoenix.incubator.apache.org/index.html">1. Apache Phoenix site</a>
</td>
</tr>

<tr>
<td width="20%">Apache MRQL</td>
<td>
MRQL (pronounced "miracle") is a query processing and optimization
system for large-scale, distributed data analysis, built on top of
Apache Hadoop, Hama, Spark, and Flink. MRQL (the MapReduce
Query Language) is an SQL-like query language for large-scale data analysis
on a cluster of computers. The MRQL query processing system can evaluate MRQL
queries in four modes:
<ul>
<li>in Map-Reduce mode using Apache Hadoop,</li>
<li>in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,</li>
<li>in Spark mode using Apache Spark, and</li>
<li>in Flink mode using Apache Flink.</li>
</ul>
</td>
<td width="20%"><a href="http://mrql.incubator.apache.org/">1. Apache Incubator MRQL site</a>
</td>
</tr>

<tr>
<td width="20%">Kylin</td>
<td>
Kylin is an open source Distributed Analytics Engine from eBay
Inc. that provides a SQL interface and multi-dimensional analysis
(OLAP) on Hadoop, supporting extremely large datasets.
</td>
<td width="20%"><a href="http://www.kylin.io/">1. Kylin project site</a>
</td>
</tr>

<tr>
<th colspan="3">Data Ingestion</th>
</tr>

<tr>
<td width="20%">Apache Flume</td>
<td>
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts
of log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms.
It uses a simple extensible data model that allows for online analytic applications.
</td>
<td width="20%"><a href="http://flume.apache.org/">1. Apache Flume project site</a>
</td>
</tr>

<tr>
<td width="20%">Apache Sqoop</td>
<td>
System for bulk data transfer between HDFS and structured
datastores such as RDBMSs. Like Flume, but moving data between HDFS and an RDBMS.
</td>
<td width="20%"><a href="http://sqoop.apache.org/">1. Apache Sqoop project site</a>
</td>
</tr>

<tr>
<td width="20%">Facebook Scribe</td>
<td>
Real-time log aggregator. It’s an Apache Thrift service.
</td>
</tr>

<tr>
<td width="20%">Apache Chukwa</td>
<td>
Large-scale log aggregator and analytics.
</td>
</tr>

<tr>
<td width="20%">Apache Kafka</td>
<td>
Distributed publish-subscribe system for processing large amounts
of streaming data. Kafka is a message queue developed by LinkedIn
that persists messages to disk in a very performant manner.
Because messages are persisted, clients have the interesting ability
to rewind a stream and consume the messages again.
Another upside of the disk persistence is that bulk importing
the data into HDFS for offline analysis can be done very quickly
and efficiently. Storm, developed by BackType (which was acquired
by Twitter in 2011), is more about transforming a stream of
messages into new streams.
</td>
<td width="20%"><a href="http://kafka.apache.org/">1. Apache Kafka</a>
<a href="https://github.com/apache/kafka/">2. GitHub source code</a>
</td>
</tr>

<tr>
<td width="20%">Netflix Suro</td>
<td>
Suro has its roots in Apache Chukwa, which was initially adopted
by Netflix. It is a log aggregator, comparable to Storm and Samza.
</td>
</tr>

<tr>
<td width="20%">Apache Samza</td>
<td>
Apache Samza is a distributed stream processing framework.
It uses Apache Kafka for messaging, and Apache Hadoop YARN to
provide fault tolerance, processor isolation, security, and
resource management.
Developed at LinkedIn (http://www.linkedin.com/in/jaykreps).
</td>
</tr>

<tr>
<td width="20%">Cloudera Morphline</td>
<td>
Cloudera Morphlines is a new open source framework that reduces
the time and skills necessary to integrate, build, and change
Hadoop processing applications that extract, transform,
and load data into Apache Solr, Apache HBase, HDFS, enterprise
data warehouses, or analytic online dashboards.
</td>
</tr>

<tr>
<td width="20%">HIHO</td>
<td>
This project is a framework for connecting disparate data sources
with the Apache Hadoop system, making them interoperable. HIHO
connects Hadoop with multiple RDBMS and file systems, so that
data can be loaded to Hadoop and unloaded from Hadoop.
</td>
</tr>

<tr>
<td width="20%">Apache NiFi</td>
<td>
Apache NiFi is a dataflow system that is currently under
incubation at the Apache Software Foundation. NiFi is based on
the concepts of flow-based programming and is highly configurable.
NiFi uses a component-based extension model to rapidly add
capabilities to complex dataflows. Out of the box NiFi has
several extensions for dealing with file-based dataflows such
as FTP, SFTP, and HTTP integration as well as integration with
HDFS. One of NiFi’s unique features is a rich, web-based
interface for designing, controlling, and monitoring a dataflow.
</td>
<td width="20%"><a href="http://nifi.apache.org/index.html">1. Apache NiFi</a>
</td>
</tr>

<tr>
<td width="20%">Apache ManifoldCF</td>
<td>
Apache ManifoldCF provides a framework for connecting source content
repositories (file systems, databases, CMIS, SharePoint, FileNet ...)
to target repositories or indexes, such as Apache Solr or ElasticSearch.
It's a kind of crawler for multi-content repositories, supporting many
sources and multi-format conversion for indexing by means of the Apache
Tika Content Extractor transformation filter.
</td>
<td width="20%"><a href="http://manifoldcf.apache.org/">1. Apache ManifoldCF</a>
</td>
</tr>

<tr>
<th colspan="3">Service Programming</th>
</tr>

<tr>
<td width="20%">Apache Thrift</td>
<td>
A cross-language RPC framework for service creation. It’s the
service base for Facebook technologies (Facebook is the original
Thrift contributor). Thrift provides a framework for developing and
accessing remote services. It allows developers to create
services that can be consumed by any application written
in a language for which Thrift bindings exist. Thrift
manages serialization of data to and from a service, as well as
the protocol that describes a method invocation, response, etc.
Instead of writing all the RPC code, you can get straight
to your service logic. Thrift uses TCP and so a given service is
bound to a particular port.
</td>
<td width="20%"><a href="http://thrift.apache.org//">1. Apache Thrift</a>
</td>
</tr>

<tr>
<td width="20%">Apache ZooKeeper</td>
<td>
It’s a coordination service that gives you the tools you need to
write correct distributed applications. ZooKeeper was developed
at Yahoo! Research. Several Hadoop projects are already using
ZooKeeper to coordinate the cluster and provide highly-available
distributed services. Perhaps the most famous of those are Apache
HBase, Storm, and Kafka. ZooKeeper is an application library with
two principal implementations of the APIs (Java and C) and a service
component implemented in Java that runs on an ensemble of dedicated
servers. ZooKeeper simplifies the development of distributed systems,
making it more agile and enabling more
robust implementations. Back in 2006, Google published a paper
on "Chubby", a distributed lock service which gained wide adoption
within their data centers. ZooKeeper, not surprisingly, is a close
clone of Chubby designed to fulfill many of the same roles for
HDFS and other Hadoop infrastructure.
</td>
<td width="20%"><a href="http://zookeeper.apache.org/">1. Apache Zookeeper</a>
<a href="http://research.google.com/archive/chubby.html">2. Google Chubby paper</a>
</td>
</tr>

<tr>
<td width="20%">Apache Avro</td>
<td>
Apache Avro is a framework for modeling, serializing and making
Remote Procedure Calls (RPC). Avro data is described by a schema,
and one interesting feature is that the schema is stored in the
same file as the data it describes, so files are self-describing.
Avro does not require code generation. This framework competes
with similar tools such as Apache Thrift, Google Protocol Buffers, and ZeroC ICE.
</td>
<td width="20%"><a href="http://avro.apache.org/">1. Apache Avro</a>
</td>
</tr>

<tr>
<td width="20%">Apache Curator</td>
<td>
Curator is a set of Java libraries that make using Apache
ZooKeeper much easier.
</td>
</tr>

<tr>
<td width="20%">Apache Karaf</td>
<td>
Apache Karaf is an OSGi runtime that runs on top of any OSGi
framework and provides you a set of services, a powerful
provisioning concept, an extensible shell and more.
</td>
</tr>

<tr>
<td width="20%">Twitter Elephant Bird</td>
<td>
Elephant Bird is a project that provides utilities (libraries)
for working with LZOP-compressed data. It also provides a
container format that supports working with Protocol Buffers,
Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDe, and
HBase miscellanea. This open source library is used
extensively at Twitter.
</td>
<td width="20%"><a href="https://github.com/kevinweil/elephant-bird">1. Elephant Bird GitHub</a>
</td>
</tr>

<tr>
<td width="20%">LinkedIn Norbert</td>
<td>
Norbert is a library that provides easy cluster management and
workload distribution. With Norbert, you can quickly distribute
a simple client/server architecture to create a highly scalable
architecture capable of handling heavy traffic. Implemented in
Scala, Norbert wraps ZooKeeper and Netty, and uses Protocol Buffers
for transport to make it easy to build a cluster-aware application.
A Java API is provided, and pluggable load balancing strategies
are supported, with round robin and consistent hash strategies
provided out of the box.
</td>
<td width="20%"><a href="http://data.linkedin.com/opensource/norbert">1. LinkedIn Project</a>
<a href="https://github.com/rhavyn/norbert">2. GitHub source code</a>
</td>
</tr>

<tr>
<th colspan="3">Scheduling</th>
</tr>

<tr>
<td width="20%">Apache Oozie</td>
<td>
Workflow scheduler system for MR jobs using DAGs
(Directed Acyclic Graphs). Oozie Coordinator can trigger jobs
by time (frequency) and data availability.
</td>
<td width="20%"><a href="http://oozie.apache.org/">1. Apache Oozie</a>
<a href="https://github.com/apache/oozie">2. GitHub source code</a>
</td>
</tr>

<tr>
<td width="20%">LinkedIn Azkaban</td>
<td>
Hadoop workflow management. This batch job scheduler can be seen as
a combination of the cron and make Unix utilities with
a friendly UI.
</td>
</tr>

<tr>
<td width="20%">Apache Falcon</td>
<td>
Apache Falcon is a data management framework for simplifying
data lifecycle management and processing pipelines on Apache
Hadoop. It enables users to configure, manage and orchestrate
data motion, pipeline processing, disaster recovery, and data
retention workflows. Instead of hard-coding complex data lifecycle
capabilities, Hadoop applications can now rely on the well-tested
Apache Falcon framework for these functions. Falcon’s simplification
of data management is quite useful to anyone building apps on
Hadoop. Data management on Hadoop encompasses data motion, process
orchestration, lifecycle management, and data discovery, among
other concerns that are beyond ETL. Falcon is a new data processing
and management platform for Hadoop that solves this problem and
creates additional opportunities by building on existing components
within the Hadoop ecosystem (e.g. Apache Oozie, Apache Hadoop
DistCp, etc.) without reinventing the wheel.
</td>
</tr>

<tr>
<td width="20%">Schedoscope</td>
<td>
Schedoscope is a new open-source project providing a scheduling
framework for pain-free agile development, testing, (re)loading,
and monitoring of your datahub, lake, or whatever you choose to
call your Hadoop data warehouse these days. Datasets (including
dependencies) are defined using a Scala DSL, which can embed
MapReduce jobs, Pig scripts, Hive queries or Oozie workflows to
build the dataset. The tool includes a test framework to verify
logic and a command line utility to load and reload data.
</td>
<td width="20%"><a href="https://github.com/ottogroup/schedoscope">GitHub source code</a>
</td>
</tr>

<tr>
<th colspan="3">Machine Learning</th>
</tr>

<tr>
<td width="20%">Apache Mahout</td>
<td>
Machine learning library and math library, on top of MapReduce.
</td>
</tr>

<tr>
<td width="20%">Weka</td>
<td>
Weka (Waikato Environment for Knowledge Analysis) is a popular suite
of machine learning software written in Java, developed at the
University of Waikato, New Zealand. Weka is free software available
under the GNU General Public License.
</td>
</tr>

<tr>
<td width="20%">Cloudera Oryx</td>
<td>
The Oryx open source project provides simple, real-time large-scale
machine learning / predictive analytics infrastructure. It implements
a few classes of algorithm commonly used in business applications:
collaborative filtering / recommendation, classification / regression,
and clustering.
</td>
<td width="20%"><a href="https://github.com/cloudera/oryx">1. Oryx at GitHub</a>
<a href="https://community.cloudera.com/t5/Data-Science-and-Machine/bd-p/Mahout">2. Cloudera forum for Machine Learning</a>
</td>
</tr>

<tr>
<td width="20%">Deeplearning4j</td>
<td>
The Deeplearning4j open-source project is the most widely used
deep-learning framework for the JVM. DL4J includes deep neural nets
such as recurrent neural networks, Long Short Term Memory Networks (LSTMs),
convolutional neural networks, various autoencoders and feedforward
neural networks such as restricted Boltzmann machines and deep-belief
networks. It also has natural language-processing algorithms such as
word2vec, doc2vec, GloVe and TF-IDF. All Deeplearning4j networks run
distributed on multiple CPUs and GPUs. They work as Hadoop jobs, and
integrate with Spark on the slave level for host-thread orchestration.
Deeplearning4j's neural networks are applied to use cases such as fraud
and anomaly detection, recommender systems, and predictive maintenance.
</td>
<td width="20%"><a href="http://deeplearning4j.org/">1. Deeplearning4j Website</a>
<a href="https://gitter.im/deeplearning4j/deeplearning4j">2. Gitter Community for Deeplearning4j</a>
</td>
</tr>

<tr>
<td width="20%">MADlib</td>
<td>
The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data.
The aim of this project is the integration of statistical data analysis into databases.
The MADlib project describes itself as Big Data Machine Learning in SQL for Data Scientists.
The MADlib software project began as a collaboration between researchers
at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).
</td>
<td width="20%"><a href="http://madlib.net/community/">1. MADlib Community</a>
</td>
</tr>

<tr>
<td width="20%">H2O</td>
<td>
H2O is a statistical, machine learning and math runtime tool for big data analysis.
Developed by the predictive analytics company H2O.ai, H2O has established a leading position
in the ML scene together with R and Databricks’ Spark. According to the team,
H2O is the world’s fastest in-memory platform for machine learning and predictive analytics
on big data. It is designed to help users scale machine learning, math, and statistics over large datasets.
In addition to H2O’s point-and-click web UI, its REST API allows easy integration into various
clients. This means explorative analysis of data can be done in a typical fashion in R, Python, and Scala,
and entire workflows can be written up as automated scripts.
</td>
<td width="20%"><a href="https://github.com/h2oai/h2o-dev">1. H2O at GitHub</a>
</td>
</tr>

<tr>
<td width="20%">Sparkling Water</td>
<td>
Sparkling Water combines two open source technologies: Apache Spark and H2O, a machine learning engine.
It makes H2O’s library of advanced algorithms, including Deep Learning, GLM, GBM, KMeans, PCA, and Random Forest,
accessible from Spark workflows.
Spark users can select the best features from either platform to meet their machine learning needs.
Users can combine Spark’s RDD API and Spark MLlib with H2O’s machine learning algorithms,
or use H2O independently of Spark in the model building process and post-process the results in Spark.
Sparkling Water provides a transparent integration of H2O’s framework and data structures into Spark’s
RDD-based environment by sharing the same execution space as well as providing an RDD-like API for H2O data structures.
</td>
<td width="20%"><a href="https://github.com/h2oai/sparkling-water">1. Sparkling Water at GitHub</a>
<a href="https://github.com/h2oai/sparkling-water/tree/master/examples">2. Sparkling Water Examples</a>
</td>
</tr>

<tr>
<td width="20%">Apache SystemML</td>
<td>
Apache SystemML was open sourced by IBM and is closely
related to Apache Spark. Think of Apache Spark as
the analytics operating system for any application that taps
into huge volumes of streaming data. MLlib, the machine
learning library for Spark, provides developers with a rich set
of machine learning algorithms, and SystemML enables developers
to translate those algorithms so they can easily digest different
kinds of data and run on different kinds of computers.
SystemML allows a developer to write a single machine learning
algorithm and automatically scale it up using Spark or Hadoop.

SystemML scales for big data analytics with high performance
optimizer technology, and empowers users to write customized
machine learning algorithms using a simple domain-specific
language (DSL) without learning complicated distributed
programming. It is an extensible framework that complements Spark
MLlib.
</td>
<td width="20%"><a href="http://systemml.apache.org">1. Apache SystemML</a>
<a href="https://wiki.apache.org/incubator/SystemML">2. Apache Proposal</a>
</td>
</tr>

<tr>
<th colspan="3">Benchmarking and QA Tools</th>
</tr>

<tr>
<td width="20%">Apache Hadoop Benchmarking</td>
<td>
There are two main JAR files in Apache Hadoop for benchmarking.
These JARs contain micro-benchmarks for testing particular parts of the
infrastructure; for instance, TestDFSIO analyzes the disk system,
TeraSort evaluates MapReduce tasks, WordCount measures cluster
performance, etc. The micro-benchmarks are packaged in the tests and
examples JAR files, and you can get a list of them, with descriptions,
by invoking the JAR file with no arguments. In the Apache
Hadoop 2.2.0 stable version, the micro-benchmarks are bundled in
these JAR files: hadoop-mapreduce-examples-2.2.0.jar and
hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.
</td>
<td width="20%"><a href="https://issues.apache.org/jira/browse/MAPREDUCE-3561">1. MAPREDUCE-3561 umbrella ticket to track all the issues related to performance</a>
</td>
</tr>

<tr>
<td width="20%">Yahoo Gridmix3</td>
<td>
Hadoop cluster benchmarking from the Yahoo engineering team.
</td>
</tr>

<tr>
<td width="20%">PUMA Benchmarking</td>
<td>
Benchmark suite which represents a broad range of MapReduce
applications exhibiting application characteristics with
high/low computation and high/low shuffle volumes. There are a
total of 13 benchmarks, out of which Tera-Sort, Word-Count,
and Grep are from the Hadoop distribution. The rest of the benchmarks
were developed in-house and are currently not part of the Hadoop
distribution. The three benchmarks from the Hadoop distribution are
also slightly modified to take the number of reduce tasks as input
from the user and generate final time completion statistics of jobs.
</td>
<td width="20%"><a href="https://issues.apache.org/jira/browse/MAPREDUCE-5116">1. MAPREDUCE-5116</a>
<a href="https://sites.google.com/site/farazahmad/">2. Faraz Ahmad researcher</a>
</td>
</tr>

1706

1707

<tr>

1708

<td width="20%">Berkeley SWIM Benchmark</td>

1709

<td>

1710

The SWIM benchmark (Statistical Workload Injector for MapReduce)
is a benchmark representing a real-world big data workload, developed
by the University of California at Berkeley in close cooperation with
Facebook. This test provides rigorous measurements of the performance
of MapReduce systems based on real industry workloads.

1715

</td>

1716

<td width="20%"><a href="https://github.com/SWIMProjectUCB/SWIM/wiki">1. GitHub SWIM</a>

1717

</td>

1718

</tr>

1719

1720

<tr>

1721

<td width="20%">Intel HiBench</td>

1722

<td>

1723

HiBench is a Hadoop benchmark suite.

1724

</td>

1725

1726

</tr>

1727

1728

<tr>

1729

<td width="20%">Apache Yetus</td>

1730

<td>

1731

To help maintain consistency over a large and disconnected set

1732

of committers, automated patch testing was added to Hadoop’s development process.

1733

This automated patch testing (now included as part of Apache Yetus)

1734

works as follows: when a patch is uploaded to the bug tracking

1735

system, an automated process downloads the patch, performs some

1736

static analysis, and runs the unit tests. These results are posted

1737

back to the bug tracker and alerts notify interested parties about

1738

the state of the patch.

1739

However, the Apache Yetus project addresses much more than traditional
patch testing: it is a better approach that includes a massive rewrite of
the patch testing facility used in Hadoop.

1742

</td>

1743

<td width="20%"><a href="https://www.altiscale.com/blog/apache-yetus-faster-more-reliable-software-development/">1. Altiscale Blog Entry</a>

1744

<a href="https://wiki.apache.org/incubator/YetusProposal">2. Apache Yetus Proposal</a>

1745

<a href="https://yetus.apache.org/">3. Apache Yetus Project site</a>

1746

</td>

1747

</tr>

1748

1749

1750

<tr>

1751

<th colspan="3">Security</th>

1752

</tr>

1753

1754

<tr>

1755

<td width="20%">Apache Sentry</td>

1756

<td>

1757

Sentry is the next step in enterprise-grade big data security

1758

and delivers fine-grained authorization to data stored in Apache

1759

Hadoop. An independent security module that integrates with open

1760

source SQL query engines Apache Hive and Cloudera Impala, Sentry

1761

delivers advanced authorization controls to enable multi-user

1762

applications and cross-functional processes for enterprise data

1763

sets. Sentry was originally developed by Cloudera.

1764

</td>

1765

1766

</tr>

1767

1768

<tr>

1769

<td width="20%">Apache Knox Gateway</td>

1770

<td>

1771

System that provides a single point of secure access for Apache

1772

Hadoop clusters. The goal is to simplify Hadoop security for both

1773

users (i.e. who access the cluster data and execute jobs) and

1774

operators (i.e. who control access and manage the cluster). The

1775

Gateway runs as a server (or cluster of servers) that serves one

1776

or more Hadoop clusters.

1777

</td>

1778

<td width="20%"><a href="http://knox.apache.org/">1. Apache Knox</a>

1779

<a href="http://hortonworks.com/hadoop/knox-gateway/">2. Apache Knox Gateway Hortonworks web</a>

1780

</td>

1781

</tr>

1782

1783

<tr>

1784

<td width="20%">Apache Ranger</td>

1785

<td>

1786

Apache Ranger (formerly called Apache Argus or HDP Advanced
Security) delivers a comprehensive approach to central security policy

1788

administration across the core enterprise security requirements

1789

of authentication, authorization, accounting and data protection.

1790

It extends baseline features for coordinated enforcement across

1791

Hadoop workloads (batch, interactive SQL, and real-time) and

1792

leverages the extensible architecture to apply policies consistently

1793

against additional Hadoop ecosystem components (beyond HDFS, Hive,

1794

and HBase) including Storm, Solr, Spark, and more.

1795

</td>

1796

<td width="20%"><a href="http://ranger.apache.org/">1. Apache Ranger</a>

1797

<a href="http://hortonworks.com/hadoop/ranger/">2. Apache Ranger Hortonworks web</a>

1798

</td>

1799

</tr>

1800

1801

<tr>

1802

<th colspan="3">Metadata Management</th>

1803

</tr>

1804

1805

<tr>

1806

<td width="20%">Metascope</td>

1807

<td>

1808

Metascope is a metadata management and data discovery tool which

1809

serves as an add-on to Schedoscope. Metascope is able to collect technical,

1810

operational and business metadata from your Hadoop Datahub and makes
it easy to search and navigate via a portal.

1812

</td>

1813

<td width="20%"><a href="https://github.com/ottogroup/metascope">GitHub source code</a>

1814

</td>

1815

</tr>

1816

1817

<tr>

1818

<th colspan="3">System Deployment</th>

1819

</tr>

1820

1821

<tr>

1822

<td width="20%">Apache Ambari</td>

1823

<td>

1824

Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

1825

Apache Ambari was donated to the ASF by the Hortonworks team. It is a powerful and
pleasant interface for Hadoop and other typical applications from the Hadoop
ecosystem. Apache Ambari is under heavy development and will incorporate
new features in the near future. For example, Ambari can deploy a complete
Hadoop system from scratch; however, it is not possible to use this GUI on a Hadoop
system that is already running. The ability to provision the operating
system would be a good addition, but it is probably not on the roadmap.

1832

</td>

1833

<td width="20%"><a href="http://ambari.apache.org/">1. Apache Ambari</a>

1834

</td>

1835

</tr>

1836

1837

<tr>

1838

<td width="20%">Cloudera HUE</td>

1839

<td>

1840

Web application for interacting with Apache Hadoop. It is not a deployment tool;
it is an open-source web interface that supports Apache Hadoop and its ecosystem,
licensed under the Apache v2 license. HUE is used for user operations on Hadoop
and its ecosystem. For example, HUE offers editors for Hive, Impala, Oozie, and Pig,
notebooks for Spark, Solr Search dashboards, and browsers for HDFS, YARN, and HBase.

1845

</td>

1846

1847

</td>

1848

</tr>

1849

1850

<tr>

1851

<td width="20%">Apache Mesos</td>

1852

<td>

1853

Mesos is a cluster manager that provides resource sharing and isolation across

1854

cluster applications, as HTCondor, SGE, or Torque can also do. However, Mesos
has a Hadoop-centred design.

1856

</td>

1857

1858

</tr>

1859

1860

<tr>

1861

<td width="20%">Myriad</td>

1862

<td>

1863

Myriad is a Mesos framework designed for scaling YARN clusters on Mesos. Myriad

1864

can expand or shrink one or more YARN clusters in response to events as per

1865

configured rules and policies.

1866

</td>

1867

<td width="20%"><a href="https://github.com/mesos/myriad">1. Myriad Github</a>

1868

</td>

1869

</tr>

1870

1871

<tr>

1872

<td width="20%">Marathon</td>

1873

<td>

1874

Marathon is a Mesos framework for long-running services. Given that you have

1875

Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.

1876

</td>

1877

1878

</tr>

1879

1880

<tr>

1881

<td width="20%">Brooklyn</td>

1882

<td>

1883

Brooklyn is a library that simplifies application deployment and management.

1884

For deployment, it is designed to tie in with other tools, giving single-click

1885

deploy and adding the concepts of manageable clusters and fabrics:

1886

Many common software entities are available out-of-the-box.
It integrates with Apache Whirr (and thereby Chef and Puppet) to deploy well-known
services such as Hadoop and Elasticsearch (or you can use POBS, plain-old-bash-scripts).
It can use PaaSes such as OpenShift, alongside self-built clusters, for maximum flexibility.

1890

</td>

1891

1892

</tr>

1893

1894

<tr>

1895

<td width="20%">Hortonworks HOYA</td>

1896

<td>

1897

HOYA is defined as “running HBase On YARN”. The Hoya tool is a Java tool,

1898

and is currently CLI driven. It takes in a cluster specification – in terms

1899

of the number of regionservers, the location of HBASE_HOME, the ZooKeeper

1900

quorum hosts, the configuration that the new HBase cluster instance should

1901

use and so on.

1902

So HOYA is for HBase deployment using a tool developed on top of YARN. Once the

1903

cluster has been started, the cluster can be made to grow or shrink using the

1904

Hoya commands. The cluster can also be stopped and later resumed. Hoya implements

1905

the functionality through YARN APIs and HBase’s shell scripts. The goal of

1906

the prototype was to have minimal code changes and as of this writing, it has

1907

required zero code changes in HBase.

1908

</td>

1909

<td width="20%"><a href="http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/">1. Hortonworks Blog</a>

1910

</td>

1911

</tr>

1912

1913

<tr>

1914

<td width="20%">Apache Helix</td>

1915

<td>

1916

Apache Helix is a generic cluster management framework used for the automatic

1917

management of partitioned, replicated and distributed resources hosted on a

1918

cluster of nodes. Originally developed by LinkedIn, it is now an Apache
project. Helix is developed on top of ZooKeeper for coordination tasks.

1920

</td>

1921

<td width="20%"><a href="http://helix.apache.org/">1. Apache Helix</a>

1922

</td>

1923

</tr>

1924

1925

<tr>

1926

<td width="20%">Apache Bigtop</td>

1927

<td>

1928

Bigtop was originally developed and released as an open source packaging

1929

infrastructure by Cloudera. Bigtop is used by some vendors to build their
own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel's distribution);
however, Apache Bigtop does many more tasks, like continuous integration testing
(with Jenkins, Maven, ...) and is useful for packaging (RPM and DEB), deployment
with Puppet, and so on. Bigtop also features Vagrant recipes for spinning up "n-node"
Hadoop clusters, and the bigpetstore blueprint application, which demonstrates
construction of a full-stack Hadoop app with ETL, machine learning,
and dataset generation. Apache Bigtop can be considered a community effort
with one main focus: treating all the bits of the Hadoop ecosystem as a whole, rather
than as individual projects.

1939

</td>

1940

<td width="20%"><a href="http://bigtop.apache.org//">1. Apache Bigtop.</a>

1941

</td>

1942

</tr>

1943

1944

<tr>

1945

<td width="20%">Buildoop</td>

1946

<td>

1947

Buildoop is an open source project licensed under the Apache License 2.0, based on the Apache Bigtop idea.
Buildoop is a collaboration project that provides templates and tools to help you create custom
Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language
and, unlike Bigtop, is not based on a mixture of tools (Makefile, Gradle, Groovy, Maven). It is probably
easier to program than Bigtop, and the design is focused on the basic ideas behind Buildroot and the
Yocto Project. The project is in the early stages of development right now.

1953

</td>

1954

<td width="20%"><a href="http://buildoop.github.io/">1. Hadoop Ecosystem Builder.</a>

1955

</td>

1956

</tr>

1957

1958

<tr>

1959

<td width="20%">Deploop</td>

1960

<td>

1961

Deploop is a tool for provisioning, managing and monitoring Apache Hadoop

1962

clusters focused on the Lambda Architecture (LA). LA is a generic design based on
the concepts of Twitter engineer Nathan Marz. This generic architecture was
designed to address common requirements for big data. The Deploop system is
in ongoing development, at an alpha stage of maturity. The system is set up
on top of highly scalable technologies like Puppet and MCollective.

1967

</td>

1968

<td width="20%"><a href="http://deploop.github.io/">1. The Hadoop Deploy System.</a>

1969

</td>

1970

</tr>

1971

1972

<tr>

1973

<td width="20%">SequenceIQ Cloudbreak</td>

1974

<td>

1975

Cloudbreak is an effective way to start and run multiple instances and

1976

versions of Hadoop clusters in the cloud, Docker containers or bare metal.

1977

It is a cloud- and infrastructure-agnostic, cost-effective Hadoop-as-a-Service
platform API that provides automatic scaling, secure multi-tenancy, and full cloud lifecycle management.

1979

Cloudbreak leverages the cloud infrastructure platforms to create host instances,

1980

uses Docker technology to deploy the requisite containers cloud-agnostically,

1981

and uses Apache Ambari (via Ambari Blueprints) to install and manage a Hortonworks cluster.

1982

This is a tool within the HDP ecosystem.

1983

</td>

1984

<td width="20%"><a href="https://github.com/sequenceiq/cloudbreak">1. GitHub project.</a>

1985

<a href="http://sequenceiq.com/cloudbreak-docs/latest/#introduction">2. Cloudbreak introduction.</a>

1986

<a href="http://hortonworks.com/hadoop/cloudbreak/">3. Cloudbreak in Hortonworks.</a>

1987

</td>

1988

</tr>

1989

<tr>

1990

<td width="20%">Apache Eagle</td>

1991

<td>

1992

Apache Eagle is an open source analytics solution for identifying security and performance issues instantly on big data platforms such as Hadoop and Spark. It analyzes data activities, YARN applications, JMX metrics, and daemon logs, and provides a state-of-the-art alert engine to identify security breaches and performance issues and to show insights.
Big data platforms normally generate huge amounts of operational logs and metrics in real time. Apache Eagle was founded to solve hard problems in securing and tuning performance for big data platforms by ensuring that metrics and logs are always available and by alerting immediately, even under heavy traffic.

1994

</td>

1995

<td width="20%"><a href="https://github.com/apache/incubator-eagle">1. Apache Eagle Github Project.</a>

1996

<a href="http://eagle.incubator.apache.org/">2. Apache Eagle Web Site.</a>

1997

</td>

1998

</tr>

1999

2000

<tr>

2001

<th colspan="3">Applications</th>

2002

</tr>

2003

<tr>

2004

2005

<td width="20%">Apache Nutch</td>

2006

<td>

2007

Highly extensible and scalable open source web crawler software

2008

project, based on Lucene. A web crawler is an

2009

Internet bot that systematically browses the World Wide Web,

2010

typically for the purpose of Web indexing. Web crawlers can copy

2011

all the pages they visit for later processing by a search engine

2012

that indexes the downloaded pages so that users can search them

2013

much more quickly.

2014

</td>

2015

2016

</tr>

2017

2018

<tr>

2019

<td width="20%">Sphinx Search Server</td>

2020

<td>

2021

Sphinx lets you either batch index and search data stored in an

2022

SQL database, NoSQL storage, or just files quickly and easily —

2023

or index and search data on the fly, working with Sphinx pretty

2024

much as with a database server.

2025

</td>

2026

2027

</tr>

2028

2029

<tr>

2030

<td width="20%">Apache OODT</td>

2031

<td>

2032

OODT was originally developed at NASA Jet Propulsion Laboratory

2033

to support capturing, processing and sharing of data for NASA's

2034

scientific archives.

2035

</td>

2036

2037

</tr>

2038

2039

<tr>

2040

<td width="20%">HIPI Library</td>

2041

<td>

2042

HIPI is a library for Hadoop's MapReduce framework that provides

2043

an API for performing image processing tasks in a distributed

2044

computing environment.

2045

</td>

2046

2047

</tr>

2048

2049

<tr>

2050

<td width="20%">PivotalR</td>

2051

<td>

2052

PivotalR is a package that enables users of R, the most popular open source statistical

2053

programming language and environment, to interact with the Pivotal (Greenplum) Database

2054

as well as Pivotal HD / HAWQ and the open-source database PostgreSQL for Big Data analytics.

2055

R is a programming language and data analysis software: you do data analysis in R by writing

2056

scripts and functions in the R programming language. R is a complete, interactive,

2057

object-oriented language: designed by statisticians, for statisticians. The language

2058

provides objects, operators and functions that make the process of exploring, modeling,

2059

and visualizing data a natural one.

2060

</td>

2061

<td width="20%"><a href="https://github.com/gopivotal/PivotalR">1. PivotalR on GitHub</a>

2062

</td>

2063

</tr>

2064

2065

2066

2067

2068

<tr>

2069

<th colspan="3">Development Frameworks</th>

2070

</tr>

2071

2072

<tr>

2073

<td width="20%">Jumbune</td>

2074

<td>

2075

Jumbune is an open source product that sits on top of any Hadoop

2076

distribution and assists in development and administration of

2077

MapReduce solutions. The objective of the product is to assist

2078

analytical solution providers to port fault-free applications to

2079

production Hadoop environments. Jumbune supports all active

2080

major branches of Apache Hadoop, namely 1.x, 2.x, and 0.23.x, and the commercial

2081

MapR, HDP 2.x and CDH 5.x distributions of Hadoop. It has the

2082

ability to work well with both Yarn and non-Yarn versions of Hadoop.

2083

It has four major modules: MapReduce Debugger, HDFS Data Validator,

2084

On-demand cluster monitor and MapReduce job profiler. Jumbune can

2085

be deployed on any remote user machine and uses a lightweight

2086

agent on the NameNode of the cluster to relay relevant information to and fro.

2087

</td>

2088

<td width="20%"><a href="https://jumbune.org">1. Jumbune</a>

2089

<a href="https://github.com/impetus-opensource/jumbune">2. Jumbune GitHub Project</a>

2090

<a href="http://jumbune.org/jira/secure/Dashboard.jspa">3. Jumbune JIRA page</a>

2091

</td>

2092

</tr>

2093

2094

<tr>

2095

<td width="20%">Spring XD</td>

2096

<td>

2097

Spring XD (Xtreme Data) is an evolution of the Spring Java application
development framework, by Pivotal, to help build Big Data applications.

2099

SpringSource was the company created by the founders of the

2100

Spring Framework. SpringSource was purchased by VMware where it was

2101

maintained for some time as a separate division within VMware.

2102

Later VMware, and its parent company EMC Corporation, formally created

2103

a joint venture called Pivotal. Spring XD is more than a development
framework library: it is a distributed and extensible system for
data ingestion, real-time analytics, batch processing, and data
export. It can be considered an alternative to Apache
Flume/Sqoop/Oozie in some scenarios. Spring XD is part of Pivotal
Spring for Apache Hadoop (SHDP). SHDP, together with Spring,
Spring Batch and Spring Data, is part of the Spring IO Platform
as one of its foundational libraries. Building on top of, and extending this
foundation, the Spring IO platform provides Spring XD as a big data

2112

runtime. Spring for Apache Hadoop (SHDP) aims to help simplify the

2113

development of Hadoop based applications by providing a consistent

2114

configuration and API across a wide range of Hadoop ecosystem

2115

projects such as Pig, Hive, and Cascading in addition to providing

2116

extensions to Spring Batch for orchestrating Hadoop based workflows.

2117

</td>

2118

<td width="20%"><a href="https://github.com/spring-projects/spring-xd">1. Spring XD on GitHub</a>

2119

</td>

2120

</tr>

2121

2122

<tr>

2123

<td width="20%">Cask Data Application Platform</td>

2124

<td>

2125

Cask Data Application Platform is an open source application

2126

development platform for the Hadoop ecosystem that provides

2127

developers with data and application virtualization to accelerate

2128

application development, address a range of real-time and batch

2129

use cases, and deploy applications into production. Deployment
is handled by Cask Coopr, an open source template-based cluster

2131

management solution that provisions, manages, and scales clusters

2132

for multi-tiered application stacks on public and private clouds.

2133

Another component is Tigon, a distributed framework built on Apache

2134

Hadoop and Apache HBase for real-time, high-throughput, low-latency

2135

data processing and analytics applications.

2136

</td>

2137

2138

</td>

2139

</tr>

2140

2141

<tr>

2142

<th colspan="3">Categorize Pending ... </th>

2143

</tr>

2144

2145

<tr>

2146

<td width="20%">Twitter Summingbird</td>

2147

<td>

2148

A system that aims to mitigate the tradeoffs between batch

2149

processing and stream processing by combining them into a

2150

hybrid system. In the case of Twitter, Hadoop handles batch

2151

processing, Storm handles stream processing, and the hybrid

2152

system is called Summingbird.

2153

</td>

2154

2155

</tr>

2156

2157

<tr>

2158

<td width="20%">Apache Kiji</td>

2159

<td>

2160

Build Real-time Big Data Applications on Apache HBase.

2161

</td>

2162

2163

</tr>

2164

2165

<tr>

2166

<td width="20%">S4 Yahoo</td>

2167

<td>

2168

S4 is a general-purpose, distributed, scalable, fault-tolerant,

2169

pluggable platform that allows programmers to easily develop

2170

applications for processing continuous unbounded streams of data.

2171

</td>

2172

2173

</tr>

2174

2175

<tr>

2176

<td width="20%">Metamarkets Druid</td>

2177

<td>

2178

Realtime analytical data store.

2179

</td>

2180

2181

</tr>

2182

2183

<tr>

2184

<td width="20%">Concurrent Cascading</td>

2185

<td>

2186

Application framework for Java developers to simply develop

2187

robust Data Analytics and Data Management applications on Apache Hadoop.

2188

</td>

2189

2190

</tr>

2191

2192

<tr>

2193

<td width="20%">Concurrent Lingual</td>

2194

<td>

2195

Open source project enabling fast and simple Big Data application

2196

development on Apache Hadoop. It delivers ANSI-standard
SQL technology to easily build new applications and integrate existing
ones onto Hadoop.

2199

</td>

2200

2201

</tr>

2202

2203

<tr>

2204

<td width="20%">Concurrent Pattern</td>

2205

<td>

2206

Machine Learning for Cascading on Apache Hadoop through an API,

2207

and standards-based PMML.

2208

</td>

2209

2210

</tr>

2211

2212

<tr>

2213

<td width="20%">Apache Giraph</td>

2214

<td>

2215

Apache Giraph is an iterative graph processing system built for

2216

high scalability. For example, it is currently used at Facebook

2217

to analyze the social graph formed by users and their connections.

2218

Giraph originated as the open-source counterpart to Pregel, the

2219

graph processing architecture developed at Google

2220

</td>

2221

2222

</tr>

2223

2224

<tr>

2225

<td width="20%">Talend</td>

2226

<td>

2227

Talend is an open source software vendor that provides data

2228

integration, data management, enterprise application integration

2229

and big data software and solutions.

2230

</td>

2231

2232

</tr>

2233

2234

<tr>

2235

<td width="20%">Akka Toolkit</td>

2236

<td>

2237

Akka is an open-source toolkit and runtime simplifying the

2238

construction of concurrent applications on the Java platform.

2239

</td>

2240

2241

</tr>

2242

2243

<tr>

2244

<td width="20%">Eclipse BIRT</td>

2245

<td>

2246

BIRT is an open source Eclipse-based reporting system that

2247

integrates with your Java/Java EE application to produce

2248

compelling reports.

2249

</td>

2250

2251

</tr>

2252

2253

<tr>

2254

<td width="20%">SpagoBI</td>

2255

<td>

2256

SpagoBI is an Open Source Business Intelligence suite,

2257

belonging to the free/open source SpagoWorld initiative,

2258

founded and supported by Engineering Group. It offers a large

2259

range of analytical functions, a highly functional semantic layer

2260

often absent in other open source platforms and projects, and a

2261

respectable set of advanced data visualization features including

2262

geospatial analytics

2263

</td>

2264

2265

</tr>

2266

2267

<tr>

2268

<td width="20%">Jedox Palo</td>

2269

<td>

2270

Palo Suite combines all core applications — OLAP Server, Palo

2271

Web, Palo ETL Server and Palo for Excel — into one comprehensive

2272

and customisable Business Intelligence platform. The platform is

2273

completely based on Open Source products representing a high-end

2274

Business Intelligence solution which is available entirely free

2275

of any license fees.

2276

</td>

2277

2278

</tr>

2279

2280

<tr>

2281

<td width="20%">Twitter Finagle</td>

2282

<td>

2283

Finagle is an asynchronous network stack for the JVM that you

2284

can use to build asynchronous Remote Procedure Call (RPC)

2285

clients and servers in Java, Scala, or any JVM-hosted language.

2286

</td>

2287

2288

</tr>

2289

2290

<tr>

2291

<td width="20%">Intel GraphBuilder</td>

2292

<td>

2293

Library which provides tools to construct large-scale graphs on

2294

top of Apache Hadoop

2295

</td>

2296

2297

</tr>

2298

2299

<tr>

2300

<td width="20%">Apache Tika</td>

2301

<td>

2302

Toolkit that detects and extracts metadata and structured text content

2303

from various documents using existing parser libraries.

2304

</td>

2305

2306

</tr>

2307

2308

<tr>

2309

<td width="20%">Apache Zeppelin</td>

2310

<td>

2311

Zeppelin is a modern web-based tool for data scientists to
collaborate on large-scale data exploration and visualization
projects. It is a notebook-style interpreter that enables
collaborative analysis sessions to be shared between users. Zeppelin
is independent of the execution framework itself. The current version
runs on top of Apache Spark, but it has pluggable interpreter APIs
to support other data processing systems. More execution frameworks
could be added at a later date, e.g. Apache Flink and Crunch, as well
as SQL-like backends such as Hive, Tajo, and MRQL.

2320

</td>

2321

<td width="20%"><a href="https://zeppelin.incubator.apache.org/">1. Apache Zeppelin site</a>

2322

</td>

2323

</tr>

2324

2325

</table>

2326

2327

</section>

2328

2329

</div>

2330

2331

2332

2333

Published with <a href="http://pages.github.com">GitHub Pages</a>

2334

by <a href="http://es.linkedin.com/in/javiroman/">Javi Roman</a>, and

2335

<a href="https://github.com/hadoopecosystemtable/hadoopecosystemtable.github.io/graphs/contributors">contributors</a>

2336

2337

</footer>

2338

</div>

2339

2340

</body>

2341

</html>
