<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8' />
<meta http-equiv="X-UA-Compatible" content="chrome=1" />
<meta name="description" content="Hadoopecosystemtable.github.io : This page is a summary to keep track of Hadoop-related projects and relevant projects around the Big Data scene, focused on the open source, free software environment." />
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>The Hadoop Ecosystem Table</title>
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/hadoopecosystemtable/hadoopecosystemtable.github.io">Fork Me on GitHub</a>
<h1 id="project_title">The Hadoop Ecosystem Table</h1>
<h2 id="project_tagline">This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.</h2>
</header>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<!-- THE TABLE -->
<table class="example3">
<!-- -->
<!-- Distributed Filesystem -->
<!-- -->
<tr>
<th colspan="3">Distributed Filesystem</th>
</tr>
<tr>
<td width="30%">Apache HDFS</td>
<td>
The Hadoop Distributed File System (HDFS) offers a way to store large files across
multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper.
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster.
With ZooKeeper, the HDFS High Availability feature addresses this problem by providing
the option of running two redundant NameNodes in the same cluster in an Active/Passive
configuration with a hot standby.
</td>
<td width="20%"><a href="http://hadoop.apache.org/">1. hadoop.apache.org</a>
<br> <a href="http://research.google.com/archive/gfs.html">2. Google FileSystem - GFS Paper</a>
<br> <a href="http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/">3. Cloudera Why HDFS</a>
<br> <a href="http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/">4. Hortonworks Why HDFS</a>
</td>
</tr>
<tr>
<td width="20%">Red Hat GlusterFS</td>
<td>
GlusterFS is a scale-out network-attached storage file system. GlusterFS was
originally developed by Gluster, Inc., and then by Red Hat, Inc., after its
purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was
announced as a commercially supported integration of GlusterFS with
Red Hat Enterprise Linux. The Gluster File System is now known as Red Hat Storage Server.
</td>
<td width="20%"><a href="http://www.gluster.org/">1. www.gluster.org</a>
<br><a href="http://www.redhat.com/about/news/archive/2013/10/red-hat-contributes-apache-hadoop-plug-in-to-the-gluster-community">2. Red Hat Hadoop Plugin</a>
</td>
</tr>
<tr>
<td width="20%">Quantcast File System QFS</td>
<td>
QFS is an open-source distributed file system software package for
large-scale MapReduce or other batch-processing workloads. It was
designed as an alternative to Apache Hadoop’s HDFS, intended to deliver
better performance and cost-efficiency for large-scale processing clusters.
It is written in C++ and has fixed-footprint memory management. QFS uses
Reed-Solomon error correction as its method for assuring reliable access to data.<br>
Reed–Solomon coding is very widely used in mass storage systems to correct the burst
errors associated with media defects. Rather than storing three full copies of
each file like HDFS, which requires three times the raw storage, QFS
needs only 1.5x the raw capacity because it stripes data across nine different disk drives
(six data stripes plus three parity stripes, so 9 / 6 = 1.5).
</td>
<td width="20%"><a href="https://www.quantcast.com/engineering/qfs/">1. QFS site</a>
<br><a href="https://github.com/quantcast/qfs">2. GitHub QFS</a>
<br><a href="https://issues.apache.org/jira/browse/HADOOP-8885">3. HADOOP-8885</a>
</td>
</tr>
<tr>
<td width="30%">Ceph Filesystem</td>
<td>
Ceph is a free software storage platform designed to present object, block,
and file storage from a single distributed computer cluster. Ceph's main
goals are to be completely distributed without a single point of failure,
scalable to the exabyte level, and freely available. The data is replicated,
making it fault tolerant.
</td>
<td width="20%"><a href="http://ceph.com/ceph-storage/file-system/">1. Ceph Filesystem site</a>
<br><a href="http://ceph.com/docs/next/cephfs/hadoop/">2. Ceph and Hadoop</a>
<br><a href="https://issues.apache.org/jira/browse/HADOOP-6253">3. HADOOP-6253</a>
</td>
</tr>
<tr>
<td width="30%">Lustre file system</td>
<td>
The Lustre filesystem is a high-performance distributed filesystem
intended for larger networks and high-availability environments.
Traditionally, Lustre is configured to manage remote data storage
disk devices within a Storage Area Network (SAN), that is, two or
more remotely attached disk devices communicating via a Small Computer
System Interface (SCSI) protocol. This includes Fibre Channel, Fibre
Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.<br>
Hadoop with HDFS needs a dedicated cluster of computers
on which to run. But people who run high performance computing clusters
for other purposes often don't run HDFS, which leaves them with plenty
of computing power, tasks that could almost certainly benefit from
MapReduce, and no way to put that power to work running Hadoop. Intel
noticed this and, in version 2.5 of its Hadoop distribution,
added support for Lustre: the Intel® HPC Distribution
for Apache Hadoop* Software is a product that combines Intel Distribution
for Apache Hadoop software with Intel® Enterprise Edition for Lustre software.
This is the only distribution of Apache Hadoop that is integrated with Lustre,
the parallel file system used by many of the world's fastest supercomputers.
</td>
<td width="20%"><a href="http://wiki.lustre.org/">1. wiki.lustre.org/</a>
<br><a href="http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre">2. Hadoop with Lustre</a>
<br><a href="http://hadoop.intel.com/products/distribution">3. Intel HPC Hadoop</a>
</td>
</tr>
<tr>
<td width="30%">Alluxio</td>
<td>
Alluxio, the world’s first memory-centric virtual distributed storage system, unifies data access
and bridges computation frameworks and underlying storage systems. Applications only need to connect
with Alluxio to access data stored in any underlying storage systems. Additionally, Alluxio’s
memory-centric architecture enables data access orders of magnitude faster than existing solutions.
<br>
In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark,
Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3,
OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvement
to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times.
Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems.
Users can run Alluxio using its standalone cluster mode, for example on Amazon EC2, or launch Alluxio
with Apache Mesos or Apache YARN.
<br>
Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of
Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at
multiple companies. It is one of the fastest growing open source projects. With less than three years
of open source history, Alluxio has attracted more than 160 contributors from over 50 institutions,
including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo.
The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the
Fedora distribution.
</td>
<td width="20%"><a href="http://www.alluxio.org/">1. Alluxio site</a>
</td>
</tr>
<tr>
<td width="30%">GridGain</td>
<td>
GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the
In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and MapReduce by bringing both data
and computations into memory. This work is done with GGFS, a Hadoop-compliant in-memory file system.
For I/O-intensive jobs, GridGain GGFS offers performance close to 100x faster than standard HDFS.
Paraphrasing Dmitriy Setrakyan from GridGain Systems, comparing GGFS with Tachyon:
<ul>
<li>GGFS allows read-through and write-through to/from underlying HDFS or any
other Hadoop-compliant file system with zero code change. Essentially, GGFS
entirely removes the ETL step from integration.</li>
<li>GGFS has the ability to pick and choose which folders stay in memory, which
folders stay on disk, and which folders get synchronized with the underlying
(HD)FS, either synchronously or asynchronously.</li>
<li>GridGain is working on adding a native MapReduce component which will
provide complete native Hadoop integration without the API changes that
Spark currently forces you to make. Essentially, GridGain MR+GGFS will allow
bringing Hadoop completely or partially in-memory in a plug-and-play fashion
without any API changes.</li>
</ul>
</td>
<td width="20%"><a href="http://www.gridgain.org/">1. GridGain site</a>
</td>
</tr>
<tr>
<td width="30%">XtreemFS</td>
<td>
XtreemFS is a general purpose storage system and covers most storage needs in a single deployment.
It is open-source, requires no special hardware or kernel modules, and can be mounted on Linux,
Windows and OS X.
XtreemFS runs distributed and offers resilience through replication. XtreemFS volumes can be accessed
through a FUSE component that offers normal file interaction with POSIX-like semantics. Furthermore, an
implementation of Hadoop's FileSystem interface is included, which makes XtreemFS available for use with
Hadoop, Flink and Spark out of the box.
XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by Zuse Institute Berlin.
The development of the project has been funded by the European Commission since 2006 under
Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid,
"First We Take Berlin", FFMK, GeoMultiSens, and BBDC.
</td>
<td width="20%"><a href="http://www.xtreemfs.org/">1. XtreemFS site</a>
<br><a href="https://github.com/xtreemfs/xtreemfs/wiki/Apache-Flink-with-XtreemFS">2. Flink on XtreemFS</a>
<br><a href="https://github.com/xtreemfs/xtreemfs/wiki/Apache-Spark-with-XtreemFS">3. Spark on XtreemFS</a>
</td>
</tr>
<!-- -->
<!-- Distributed Programming-->
<!-- -->
<tr>
<th colspan="3">Distributed Programming</th>
</tr>
<tr>
<td width="20%">Apache Ignite</td>
<td>
Apache Ignite In-Memory Data Fabric is a distributed in-memory platform
for computing and transacting on large-scale data sets in real-time.
It includes a distributed key-value in-memory store, SQL capabilities,
map-reduce and other computations, distributed data structures,
continuous queries, messaging and events subsystems, Hadoop and Spark integration.
Ignite is built in Java and provides .NET and C++ APIs.
</td>
<td width="20%"><a href="http://ignite.apache.org/">1. Apache Ignite</a>
<br> <a href="https://apacheignite.readme.io/">2. Apache Ignite documentation</a>
</td>
</tr>
<tr>
<td width="20%">Apache MapReduce</td>
<td>
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster. Apache MapReduce was derived from the Google
paper “MapReduce: Simplified Data Processing on Large Clusters”. The current
Apache MapReduce version is built on the Apache YARN framework. YARN stands
for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates
writing arbitrary distributed processing frameworks and applications. YARN’s
execution model is more generic than the earlier MapReduce implementation.
YARN can run applications that do not follow the MapReduce model, unlike the
original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt
to take Apache Hadoop beyond MapReduce for data-processing.
</td>
<td width="20%"><a href="http://wiki.apache.org/hadoop/MapReduce/">1. Apache MapReduce</a>
<br> <a href="http://research.google.com/archive/mapreduce.html">2. Google MapReduce paper</a>
<br> <a href="http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">3. Writing YARN applications</a>
</td>
</tr>
<tr>
<td width="20%">Apache Pig</td>
<td>
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language,
Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the
traditional data operations (join, sort, filter, etc.), as well as the ability for users
to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop.
It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce.<br>
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts
that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks
different from many of the programming languages you have seen. There are no if statements or for
loops in Pig Latin. This is because traditional procedural and object-oriented programming languages
describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
</td>
<td width="20%"><a href="https://pig.apache.org/">1. pig.apache.org/</a>
<br> <a href="https://github.com/alanfgates/programmingpig">2. Pig examples by Alan Gates</a>
</td>
</tr>
<tr>
<td width="20%">JAQL</td>
<td>
JAQL is a functional, declarative programming language designed especially for working with large
volumes of structured, semi-structured and unstructured data. As its name implies, a primary
use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data.
For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL"
capability lets programmers work with structured SQL data while employing a JSON data model that's less
restrictive than its Structured Query Language counterparts.<br>
Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much
like a blend of Pig and Hive. Jaql’s query language was inspired by many programming and query languages,
including Lisp, SQL, XQuery, and Pig. <br>
JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While it continues
to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license,
the major development activity around JAQL has remained centered at IBM. The company offers the query language
as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a
workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs.
It also provides links to external data and services, including relational databases and machine learning data.
</td>
<td width="20%"><a href="https://code.google.com/p/jaql/">1. JAQL in Google Code</a>
<br> <a href="http://www-01.ibm.com/software/data/infosphere/hadoop/jaql/">2. What is Jaql? by IBM</a>
</td>
</tr>
<tr>
<td width="20%">Apache Spark</td>
<td>
Data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley.
Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).
However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times
faster than previous-generation systems like Hadoop MapReduce for certain applications.<br>
Spark is a framework for writing fast, distributed programs. Spark solves problems similar to those Hadoop MapReduce
addresses, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with
Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel),
and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.<br>
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark
interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark,
a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
</td>
<td width="20%"><a href="http://spark.apache.org/">1. Apache Spark</a>
<br> <a href="https://github.com/apache/spark">2. Mirror of Spark on Github</a>
<br> <a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">3. RDDs - Paper</a>
<br> <a href="https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf">4. Spark: Cluster Computing... - Paper</a>
<br> <a href="http://spark.apache.org/research.html">5. Spark Research</a>
</td>
</tr>
<tr>
<td width="20%">Apache Storm</td>
<td>
Storm is a complex event processor (CEP) and distributed computation
framework written predominantly in the Clojure programming language.
It is a distributed real-time computation system for processing fast,
large streams of data. Storm's architecture is based on the master-worker
paradigm, so a Storm cluster mainly consists of a master node and worker
nodes, with coordination done by ZooKeeper. <br>
Storm makes use of ZeroMQ (0MQ), an advanced, embeddable
networking library. It provides a message queue, but unlike
message-oriented middleware (MOM), a 0MQ system can run without
a dedicated message broker. The library is designed to have a
familiar socket-style API.<br>
Originally created by Nathan Marz and his team at BackType, the
project was open sourced after BackType was acquired by Twitter. Storm
was initially developed and deployed at BackType in 2011. After
7 months of development, BackType was acquired by Twitter in July
2011. Storm was open sourced in September 2011. <br>
Hortonworks is developing a Storm-on-YARN version and plans to
finish the base-level integration in 2013 Q4.
Yahoo/Hortonworks also plans to move the Storm-on-YARN
code from github.com/yahoo/storm-yarn to be a subproject of the
Apache Storm project in the near future.<br>
Twitter has recently released a Hadoop-Storm hybrid called
“Summingbird.” Summingbird fuses the two frameworks into one,
allowing developers to use Storm for short-term processing
and Hadoop for deep data dives. It is a system that aims to mitigate
the tradeoffs between batch processing and stream processing by
combining them into a hybrid system.
</td>
<td width="20%"><a href="http://storm-project.net/">1. Storm Project</a>
<br> <a href="https://github.com/yahoo/storm-yarn">2. Storm-on-YARN</a>
</td>
</tr>
<tr>
<td width="20%">Apache Flink</td>
<td>
Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala,
a high-performance runtime, and automatic program optimization. It has native support for iterations,
incremental iterations, and programs consisting of large DAGs of operations.<br>
Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with
its own runtime, rather than building on top of MapReduce. As such, it can work completely independently
of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read
and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources.
Since most Flink users store their data in Hadoop HDFS, Flink already ships with the libraries required to access HDFS.
</td>
<td width="20%"><a href="http://flink.incubator.apache.org/">1. Apache Flink incubator page</a>
<br><a href="http://stratosphere.eu/">2. Stratosphere site</a>
</td>
</tr>
<tr>
<td width="20%">Apache Apex</td>
<td>
Apache Apex is an enterprise-grade, Apache YARN-based big data-in-motion platform that
unifies stream processing and batch processing. It processes big data
in motion in a highly scalable, highly performant, fault-tolerant, stateful,
secure, distributed, and easily operable way. It provides a simple API that
enables users to write or re-use generic Java code, thereby lowering the expertise
needed to write big data applications. <p>
The Apache Apex platform is supplemented by Apache Apex-Malhar,
which is a library of operators that implement common business logic
functions needed by customers who want to quickly develop applications.
These operators provide access to HDFS, S3, NFS, FTP, and other file systems;
Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySQL, Cassandra,
MongoDB, Redis, HBase, CouchDB and other databases along with JDBC connectors.
The library also includes a host of other common business logic patterns that
help users to significantly reduce the time it takes to go into production.
Ease of integration with all other big data technologies is one of the primary
missions of Apache Apex-Malhar.<p>
Apex, available on GitHub, is the core technology upon which DataTorrent's
commercial offering, DataTorrent RTS 3, along with other technology such as
a data ingestion tool called dtIngest, are based.
</td>
<td width="20%"><a href="https://www.datatorrent.com/apex/">1. Apache Apex from DataTorrent</a>
<br><a href="http://apex.incubator.apache.org/">2. Apache Apex main page</a>
<br><a href="https://wiki.apache.org/incubator/ApexProposal">3. Apache Apex Proposal</a>
</td>
</tr>
<tr>
<td width="20%">Netflix PigPen</td>
<td>
PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming
language created by Rich Hickey; it is a functional, general-purpose language that runs on the Java Virtual Machine,
Common Language Runtime, and JavaScript engines. In PigPen there are no special user defined functions (UDFs).
Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool
was open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.
</td>
<td width="20%"><a href="https://github.com/Netflix/PigPen">1. PigPen on GitHub</a>
</td>
</tr>
<tr>
<td width="20%">AMPLab SIMR</td>
<td>
Apache Spark was developed with Apache YARN in mind. However, up to now, it has been relatively hard to run
Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically,
users would have to get permission to install Spark/Scala on some subset of the machines, a process that
could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out
of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights,
and without having Spark or Scala installed on any of the nodes.
</td>
<td width="20%"><a href="http://databricks.github.io/simr/">1. SIMR on GitHub</a>
</td>
</tr>
<tr>
<td width="20%">Facebook Corona</td>
<td>
“The next version of Map-Reduce” from Facebook, based on its own fork of Hadoop. The current Hadoop implementation
of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets.
The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook
engineers looked at but discounted because of the highly-customised nature of the company's deployment of Hadoop and HDFS.
Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case).
</td>
<td width="20%"><a href="https://github.com/facebookarchive/hadoop-20/tree/master/src/contrib/corona">1. Corona on Github</a>
</td>
</tr>
<tr>
<td width="20%">Apache REEF</td>
<td>
Apache REEF™ (Retainable Evaluator Execution Framework) is a library for developing portable
applications for cluster resource managers such as Apache Hadoop™ YARN or Apache Mesos™.
Apache REEF drastically simplifies development of applications for those resource managers through the following features:
<ul>
<li>
Centralized Control Flow: Apache REEF turns the chaos of a distributed application into events in a
single machine, the Job Driver. Events include container allocation, Task launch, completion and
failure. For failures, Apache REEF makes every effort to make the actual <code>Exception</code> thrown by the
Task available to the Driver.
</li>
<li>
Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in
every container of a REEF application. Evaluators can keep data in memory in between Tasks, which
enables efficient pipelines on REEF.
</li>
<li>
Support for multiple resource managers: Apache REEF applications are portable to any supported resource
manager with minimal effort. Further, new resource managers are easy to support in REEF.
</li>
<li>
.NET and Java API: Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a
single REEF application is free to mix and match Tasks written for .NET or Java.
</li>
<li>
Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding
bloat to the core. REEF includes many Services, such as name-based communications between Tasks,
MPI-inspired group communications (Broadcast, Reduce, Gather, ...) and data ingress.
</li>
</ul>
</td>
<td width="20%"><a href="https://reef.apache.org">1. Apache REEF Website</a>
</td>
</tr>
<tr>
<td width="20%">Apache Twill</td>
<td>
Twill is an abstraction over Apache Hadoop® YARN that reduces the
complexity of developing distributed applications, allowing developers
to focus more on their business logic. Twill uses a simple thread-based model that Java
programmers will find familiar. YARN can be viewed as a compute
fabric of a cluster, which means YARN applications like Twill will
run on any Hadoop 2 cluster.<br>
YARN is an open source application that allows the Hadoop cluster
to turn into a collection of virtual machines. Weave, developed by
Continuuity and initially housed on Github, is a complementary open
source application that uses a programming model similar to Java
threads, making it easy to write distributed applications. In order to remove
a conflict with a similarly named project on Apache, called "Weaver,"
Weave's name changed to Twill when it moved to Apache incubation.<br>
Twill functions as a scaled-out proxy. Twill is a middleware layer
in between YARN and any application on YARN. When you develop a
Twill app, Twill handles the YARN APIs and presents a model that resembles multi-threaded Java programming.
It is very easy to build multi-process distributed applications in Twill.
</td>
<td width="20%"><a href="https://incubator.apache.org/projects/twill.html">1. Apache Twill Incubator</a>
</td>
</tr>
<tr>
<td width="20%">Damballa Parkour</td>
<td>
A library for developing MapReduce programs in Clojure, a Lisp-like language. Parkour aims to provide deep Clojure
integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions
instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete
access to absolutely everything possible in raw Java Hadoop MapReduce.
</td>
<td width="20%"><a href="https://github.com/damballa/parkour">1. Parkour GitHub Project</a>
</td>
</tr>
<tr>
<td width="20%">Apache Hama</td>
<td>
An Apache Top-Level open source project that allows you to do advanced analytics beyond MapReduce. Many data
analysis techniques, such as machine learning and graph algorithms, require iterative computations;
this is where the Bulk Synchronous Parallel model can be more effective than "plain" MapReduce.
</td>
<td width="20%"><a href="http://hama.apache.org/">1. Hama site</a>
</td>
</tr>
<tr>
<td width="20%">Datasalt Pangool</td>
<td>
A new MapReduce paradigm: a new API for MR jobs, at a higher level than the Java API.
</td>
<td width="20%"><a href="http://pangool.net">1. Pangool</a>
<br> <a href="https://github.com/datasalt/pangool">2. GitHub Pangool</a>
</td>
</tr>
<tr>
<td width="20%">Apache Tez</td>
<td>
Tez is a proposal to develop a generic application framework that can process complex DAGs of
data-processing tasks and runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce paradigm to a more
powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for
end users; rather, it enables developers to build end-user applications with much better performance
and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data.
However, there are many use cases that need near-real-time query processing performance. There are also
several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps
Hadoop address these use cases. The Tez framework is part of the Stinger initiative (a low-latency,
SQL-like query interface for Hadoop based on Hive).
</td>
<td width="20%"><a href="http://incubator.apache.org/projects/tez.html">1. Apache Tez Incubator</a>
<br> <a href="http://hortonworks.com/hadoop/tez/">2. Hortonworks Apache Tez page</a>
</td>
</tr>
<tr>
<td width="20%">Apache DataFu</td>
<td>
DataFu provides a collection of Hadoop MapReduce jobs and functions in higher-level languages based
on it to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles,
sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop
jobs for incremental data processing in MapReduce. The functions are packaged as a collection of Pig UDFs
that were originally developed at LinkedIn.
</td>
<td width="20%"><a href="http://incubator.apache.org/projects/datafu.html">1. DataFu Apache Incubator</a>
</td>
</tr>
<tr>
<td width="20%">Pydoop</td>
<td>
Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++
Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce
applications with HDFS access. Pydoop has several advantages over Hadoop’s built-in
solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython
package, it allows you to access all standard library and third-party modules,
some of which may not be available under Jython.
</td>
<td width="20%"><a href="http://pydoop.sourceforge.net/docs/">1. SF Pydoop site</a>
<br> <a href="https://github.com/crs4/pydoop">2. Pydoop GitHub Project</a>
</td>
</tr>
<tr>
<td width="20%">Kangaroo</td>
<td>
Open-source project from Conductor for writing MapReduce jobs consuming data from Kafka.
The introductory post explains Conductor’s use case—loading data from Kafka to HBase
by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions
which are limited to a single InputSplit per Kafka partition, Kangaroo can launch
multiple consumers at different offsets in the stream of a single partition for
increased throughput and parallelism.
</td>
<td width="20%"><a href="http://www.conductor.com/nightlight/data-stream-processing-bulk-kafka-hadoop/">1. Kangaroo Introduction</a>
<br> <a href="https://github.com/Conductor/kangaroo">2. Kangaroo GitHub Project</a>
</td>
</tr>
<tr>
<td width="20%">TinkerPop</td>
<td>
Graph computing framework written in Java. Provides a core API that graph system vendors can implement.
There are various types of graph systems including in-memory graph libraries, OLTP graph databases,
and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system
can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled
algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community.
</td>
<td width="20%"><a href="https://wiki.apache.org/incubator/TinkerPopProposal">1. Apache TinkerPop Proposal</a>
<br> <a href="http://www.tinkerpop.com/">2. TinkerPop site</a>
</td>
</tr>
<tr>
<td width="20%">Pachyderm MapReduce</td>
<td>
Pachyderm is a completely new MapReduce engine built on top of Docker and CoreOS.
In Pachyderm MapReduce (PMR) a job is an HTTP server inside a Docker container
(a microservice). You give Pachyderm a Docker image and it will automatically
distribute it throughout the cluster next to your data. Data is POSTed to
the container over HTTP and the results are stored back in the file system.
You can implement the web server in any language you want and pull in any library.
Pachyderm also creates a DAG of all the jobs in the system and their dependencies,
and it automatically schedules the pipeline so that each job isn’t run until its
dependencies have completed. Everything in Pachyderm “speaks in diffs” so it knows
exactly which data has changed and which subsets of the pipeline need to be rerun.
CoreOS is an open source, lightweight operating system based on Chrome OS; in fact,
CoreOS is a fork of Chrome OS. CoreOS provides only the minimal functionality
required for deploying applications inside software containers, together with
built-in mechanisms for service discovery and configuration sharing.
</td>
<td width="20%"><a href="http://www.pachyderm.io/">1. Pachyderm site</a>
<br> <a href="https://medium.com/pachyderm-data/lets-build-a-modern-hadoop-4fc160f8d74f">2. Pachyderm introduction article</a>
</td>
</tr>
<tr>
<td width="20%">Apache Beam</td>
<td>
Apache Beam is an open source, unified model for defining and executing
data-parallel processing pipelines, as well as a set of language-specific
SDKs for constructing pipelines and runtime-specific Runners for executing them.<p>
The model behind Beam evolved from a number of internal Google
data processing projects, including MapReduce, FlumeJava, and
MillWheel. This model was originally known as the “Dataflow Model”
and was first implemented as Google Cloud Dataflow, including a Java SDK
on GitHub for writing pipelines and a fully managed service for
executing them on Google Cloud Platform.<p>
In January 2016, Google and a number of partners submitted the Dataflow
Programming Model and SDKs portion as an Apache Incubator Proposal,
under the name Apache Beam (unified Batch + strEAM processing).
</td>
<td width="20%"><a href="https://wiki.apache.org/incubator/BeamProposal">1. Apache Beam Proposal</a>
<br><a href="https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison">2. Dataflow Beam and Spark Comparison</a>
</td>
</tr>
<!-- -->
<!-- NoSQL ecosystem -->
<!-- -->
<tr>
<th colspan="3">NoSQL Databases</th>
</tr>
<tr>
<th colspan="3" style="background-color:#0099FF;">Column Data Model</th>
</tr>
<tr>
<td width="20%">Apache HBase</td>
<td>
Inspired by Google BigTable. Non-relational distributed database
providing random, real-time read/write operations on very large
column-oriented tables (BDDB: Big Data Data Base). It’s the Hadoop
database: it serves as the backing store for MR job outputs and lets
Hadoop MapReduce jobs be backed by Apache HBase tables.
</td>
<td width="20%"><a href="https://hbase.apache.org/">1. Apache HBase Home</a>
<br> <a href="https://github.com/apache/hbase">2. Mirror of HBase on GitHub</a>
</td>
</tr>
<tr>
<td width="20%">Apache Cassandra</td>
<td>
Distributed NoSQL DBMS; it’s a BDDB. MR can retrieve data from Cassandra.
This BDDB can run without HDFS, or on top of HDFS (DataStax fork of Cassandra).
HBase and its required supporting systems are derived from what is known of
the original Google BigTable and Google File System designs (as known from the
Google File System paper Google published in 2003, and the BigTable paper published
in 2006). Cassandra, on the other hand, is a recent open source fork of a standalone
database system initially coded by Facebook, which, while implementing the BigTable
data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact,
much of the initial development work on Cassandra was performed by two Dynamo
engineers recruited to Facebook from Amazon).
</td>
<td width="20%">
<a href="http://cassandra.apache.org" target="_blank">1. Apache Cassandra Home</a> <br>
<a href="https://github.com/apache/cassandra" target="_blank">2. Cassandra on GitHub</a> <br>
<a href="https://academy.datastax.com" target="_blank">3. Training Resources</a> <br>
<a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf" target="_blank">4. Cassandra - Paper</a>
</td>
</tr>
<tr>
<td width="20%">Hypertable</td>
<td>
Database system inspired by publications on the design of Google's
BigTable. The project is based on experience of engineers who were
solving large-scale data-intensive tasks for many years. Hypertable
runs on top of a distributed file system such as the Apache Hadoop DFS,
GlusterFS, or the Kosmos File System (KFS). It is written almost entirely
in C++. Sponsored by Baidu, the Chinese search engine.
</td>
<td width="20%">TODO</td>
</tr>
<tr>
<td width="20%">Apache Accumulo</td>
<td>
A distributed key/value store that provides robust, scalable, high-performance
data storage and retrieval. Apache Accumulo is based on Google's
BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift.
Accumulo was originally created by the NSA and includes security features.
</td>
<td width="20%"><a href="https://accumulo.apache.org/">1. Apache Accumulo Home</a>
</td>
</tr>
<tr>
<td width="20%">Apache Kudu</td>
<td>
Distributed, columnar, relational data store optimized for analytical use cases requiring
very fast reads with competitive write speeds.
<ul>
<li>Relational data model (tables) with strongly-typed columns and a fast, online alter table operation.</li>
<li>Scale-out and sharded with support for partitioning based on key ranges and/or hashing.</li>
<li>Fault-tolerant and consistent due to its implementation of Raft consensus.</li>
<li>Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.</li>
<li>Integrates with MapReduce and Spark.</li>
<li>Additionally provides "NoSQL" APIs in Java, Python, and C++.</li>
</ul>
</td>
<td width="20%"><a href="http://getkudu.io/">1. Apache Kudu Home</a><br>
<a href="http://github.com/cloudera/kudu">2. Kudu on GitHub</a><br>
<a href="http://getkudu.io/kudu.pdf">3. Kudu technical whitepaper (pdf)</a>
</td>
</tr>
<tr>
<th colspan="3" style="background-color:#0099FF;">Document Data Model</th>
</tr>
<tr>
<td width="20%">MongoDB</td>
<td>
Document-oriented database system. It is part of the NoSQL family of
database systems. Instead of storing data in tables, as is done in a "classical"
relational database, MongoDB stores structured data as JSON-like documents.
</td>
<td width="20%"><a href="http://www.mongodb.org/">1. MongoDB site</a>
</td>
</tr>
<tr>
<td width="20%">RethinkDB</td>
<td>
RethinkDB is built to store JSON documents and scale to multiple
machines with very little effort. It has a pleasant query language
that supports useful queries such as table joins and group by,
and it is easy to set up and learn.
</td>
<td width="20%"><a href="http://www.rethinkdb.com/">1. RethinkDB site</a>
</td>
</tr>
<tr>
<td width="20%">ArangoDB</td>
<td>
An open-source database with a flexible data model for documents, graphs,
and key-values. Build high-performance applications using a convenient
SQL-like query language or JavaScript extensions.
</td>
<td width="20%"><a href="https://www.arangodb.org/">1. ArangoDB site</a>
</td>
</tr>
<tr>
<th colspan="3" style="background-color:#0099FF;">Stream Data Model</th>
</tr>
<tr>
<td width="20%">EventStore</td>
<td>
An open-source, functional database with support for Complex Event Processing.
It provides a persistence engine for applications using event-sourcing, or for
storing time-series data. The Event Store server is written in C# and C++ and
runs on Mono or the .NET CLR, on Linux or Windows.
Applications using Event Store can be written in JavaScript. Event sourcing (ES)
is a way of persisting your application's state by storing the history that determines
the current state of your application.
</td>
<td width="20%"><a href="http://geteventstore.com/">1. EventStore site</a>
</td>
</tr>
<tr>
<th colspan="3" style="background-color:#0099FF;">Key-value Data Model</th>
</tr>
<tr>
<td width="20%">Redis DataBase</td>
<td>
Redis is an open-source, networked, in-memory data structure
store with optional durability. It is written in ANSI C.
In its outer layer, the Redis data model is a dictionary which
maps keys to values. One of the main differences between Redis
and other structured storage systems is that Redis supports not
only strings, but also abstract data types. Sponsored by Redis Labs.
It’s BSD licensed.
</td>
<td width="20%"><a href="http://redis.io/">1. Redis site</a>
<br> <a href="http://redislabs.com/">2. Redis Labs site</a>
</td>
</tr>
<tr>
<td width="20%">LinkedIn Voldemort</td>
<td>
Distributed data store that is designed as a key-value store used
by LinkedIn for high-scalability storage.
</td>
<td width="20%"><a href="http://www.project-voldemort.com/voldemort/">1. Voldemort site</a>
</td>
</tr>
<tr>
<td width="20%">RocksDB</td>
<td>
RocksDB is an embeddable persistent key-value store for fast storage.
RocksDB can also be the foundation for a client-server database, but the
project's current focus is on embedded workloads.
</td>
<td width="20%"><a href="http://rocksdb.org/">1. RocksDB site</a>
</td>
</tr>
<tr>
<td width="20%">OpenTSDB</td>
<td>
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale,
and make this data easily accessible and graphable.
</td>
<td width="20%"><a href="http://opentsdb.net/">1. OpenTSDB site</a>
</td>
</tr>
<!-- -->
<!-- NoSQL: Graph Data Model -->
<!-- -->
<tr>
<th colspan="3" style="background-color:#0099FF;">Graph Data Model</th>
</tr>
<tr>
<td width="20%">ArangoDB</td>
<td>
An open-source database with a flexible data model for documents,
graphs, and key-values. Build high-performance applications using
a convenient SQL-like query language or JavaScript extensions.
</td>
<td width="20%"><a href="https://www.arangodb.org/">1. ArangoDB site</a>
</td>
</tr>
<tr>
<td width="20%">Neo4j</td>
<td>
An open-source graph database written entirely in Java. It is an
embedded, disk-based, fully transactional Java persistence engine
that stores data structured in graphs rather than in tables.
</td>
<td width="20%"><a href="http://www.neo4j.org/">1. Neo4j site</a>
</td>
</tr>
<tr>
<td width="20%">TitanDB</td>
<td>
TitanDB is a highly scalable graph database optimized for storing
and querying large graphs with billions of vertices and edges
distributed across a multi-machine cluster. Titan is a transactional
database that can support thousands of concurrent users.
</td>
<td width="20%"><a href="http://thinkaurelius.github.io/titan/">1. Titan site</a>
</td>
</tr>
<!-- -->
<!-- NewSQL ecosystem -->
<!-- -->
<tr>
<th colspan="3">NewSQL Databases</th>
</tr>
<tr>
<td width="20%">TokuDB</td>
<td>
TokuDB is a storage engine for MySQL and MariaDB that is specifically
designed for high performance on write-intensive workloads. It achieves
this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC
compliant storage engine. TokuDB is one of the technologies that enable
Big Data in MySQL.
</td>
<td width="20%">TODO</td>
</tr>
<tr>
<td width="20%">HandlerSocket</td>
<td>
HandlerSocket is a NoSQL plugin for MySQL and MariaDB (a fork
of MySQL). It works as a daemon inside the mysqld process, accepting TCP
connections, and executing requests from clients. HandlerSocket does not
support SQL queries. Instead, it supports simple CRUD operations on tables.
HandlerSocket can be much faster than mysqld/libmysql in some cases because
it has lower CPU, disk, and network overhead.
</td>
<td width="20%">TODO</td>
</tr>
<tr>
<td width="20%">Akiban Server</td>
<td>
Akiban Server is an open source database that brings document stores and
relational databases together. Developers get powerful document access
alongside surprisingly powerful SQL.
</td>
<td width="20%">TODO</td>
</tr>
<tr>
<td width="20%">Drizzle</td>
<td>
Drizzle is a re-designed version of the MySQL v6.0 codebase and
is designed around a central concept of having a microkernel
architecture. Features such as the query cache and authentication
system are now plugins to the database, which follow the general
theme of "pluggable storage engines" that were introduced in MySQL 5.1.
It supports PAM, LDAP, and HTTP AUTH for authentication via plugins
that ship with it. Via its plugin system it currently supports logging to files,
syslog, and remote services such as RabbitMQ and Gearman. Drizzle
is an ACID-compliant relational database that supports
transactions via an MVCC design.
</td>
<td width="20%">TODO</td>
</tr>
<tr>
<td width="20%">Haeinsa</td>
<td>
Haeinsa is a linearly scalable multi-row, multi-table transaction
library for HBase. Use Haeinsa if you need strong ACID semantics
on your HBase cluster. It is based on Google's Percolator concept.
</td>
<td width="20%">TODO</td>
</tr>
<tr>
<td width="20%">SenseiDB</td>
<td>
Open-source, distributed, realtime, semi-structured database.
Some features: full-text search, fast realtime updates, structured
and faceted search, BQL (a SQL-like query language), fast key-value
lookup, high performance under concurrent heavy update and query
volumes, and Hadoop integration.
</td>
<td width="20%"><a href="http://senseidb.com/">1. SenseiDB site</a>
</td>
</tr>
<tr>
<td width="20%">Sky</td>
<td>
Sky is an open source database used for flexible, high performance
analysis of behavioral data. For certain kinds of data such as
clickstream data and log data, it can be several orders of magnitude
faster than traditional approaches such as SQL databases or Hadoop.
</td>
<td width="20%"><a href="http://skydb.io/">1. SkyDB site</a>
</td>
</tr>
<tr>
<td width="20%">BayesDB</td>
<td>
BayesDB, a Bayesian database table, lets users query the probable
implications of their tabular data as easily as an SQL database
lets them query the data itself. Using the built-in Bayesian Query
Language (BQL), users with no statistics training can solve basic
data science problems, such as detecting predictive relationships