Databases

8 structures that power modern databases
Source

Skiplist: probabilistic data structure used to implement a sorted set, e.g. in Redis
B-tree: self-balancing tree in which each node can have many children; in the common B+ tree variant, data is stored in the leaf nodes; used in MySQL, PostgreSQL, and Oracle
Hash index: hash table of key-value pairs that allows fast lookup by exact key; used in Redis, MySQL, and PostgreSQL
Inverted index: maps words to the documents that contain them, so documents can be stored and searched efficiently; used in Elasticsearch
SSTable: sorted string table, file-based data structure, highly compressed and efficient format
Suffix tree: finds all instances of a search term in a large collection of documents
LSM tree: log-structured merge tree, the write-optimized structure behind many NoSQL stores; used in Cassandra, RocksDB, and LevelDB
R-tree: for spatial data; examples: PostGIS, MongoDB, Elasticsearch
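
As a rough illustration of the inverted index entry above, here is a minimal Python sketch that maps words to document ids and answers multi-word queries. The function names (`build_inverted_index`, `search`), the tokenizer, and the sample documents are assumptions made for this example; this is not how Elasticsearch implements its index.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (an assumption): runs of ASCII letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        # Intersect posting lists to get AND semantics.
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Redis uses a skiplist for sorted sets",
    2: "Elasticsearch uses an inverted index",
    3: "An inverted index maps words to documents",
}
index = build_inverted_index(docs)
print(search(index, "inverted index"))
```

A lookup touches only the posting lists for the query words, not every document, which is why the structure suits full-text search.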

Big Data Machine Learning Tools

See also the ML Pipelines & Stacks page.

Postgres

Postgres default settings for Mac OS X

Everything a Data Scientist Should Know About Data Management

Apache Spark

Apache Spark Home: A fast and general engine for large-scale data processing
Spark SQL: Apache Spark's module for working with structured data
Spark Streaming: Build scalable fault-tolerant streaming applications
MLlib: Apache Spark's scalable machine learning library
GraphX: Apache Spark's API for graphs and graph-parallel computation
Spark RDD: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark
ClouderaSpark2016

Spark Tutorials

Spark Python Tutorial Notebooks: A collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the Python language.
Spark & Python: SQL & DataFrames Tutorial
Code Project Apache Spark/Cassandra Tutorial Series:
Part 1 : Introduction to Apache Spark (general purpose grid computation engine)
Part 2 : Installing Apache Cassandra
Part 3 : Saving/Retrieving data to/from Cassandra into/from Apache Spark
Docker Spark Notebook Demo: Spark notebook demo by Andy Petrella

Spark Related Apps & Tools

Spark-config-gen: Small web app for generating Spark configuration parameters.
Salt library: A library for creating interactive visualizations of massive datasets. Useful for geographic data visualization. Generate with Salt and visualize with Leaflet.
sparkpipe: A modular, non-linear data pipeline framework for Apache Spark.

Elasticsearch

Elasticsearch is an open source project built in Java that acts as a database and search engine for JSON documents. It is an alternative to traditional document stores. Along with Logstash and Kibana, it forms the ELK stack. Logstash is a document ingestion and transformation pipeline and Kibana is a visual front end service.
Elasticsearch at Yelp: "We decided to use Elasticsearch for indexing our logs for fast retrieval, powerful search tools and great visualizations. With ELK, we are able to parse and ingest logs, store them, create dashboards for them, and perform full text search on them."
Scaling Elasticsearch to Hundreds of Developers: Yelp uses Elasticsearch to rapidly prototype and launch new search applications.
Beginner's Guide to ElasticSearch JavaScript Tutorial

SQL

Databases and SQL: Tutorial by Software Carpentry (~4 hour lesson plan)
Programming with SQL Databases: Tutorial on using Python with SQLite3 (~45 minutes)
Data Management with SQL: Data Carpentry tutorial with 4 lessons
Accessing SQLite Databases with Python & Pandas: Tutorial by Data Carpentry. Storing your data in an SQLite database can provide substantial performance improvements when reading/writing compared to CSV.
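
For a quick feel of what the Python and SQLite tutorials above cover, here is a minimal sketch using only the standard-library `sqlite3` module. The `surveys` table, its columns, the sample rows, and the helper `mean_weight_by_species` are all hypothetical, chosen for illustration rather than taken from the tutorials.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE surveys (record_id INTEGER PRIMARY KEY, species TEXT, weight REAL)"
)

# Hypothetical sample rows: (record_id, species code, weight).
rows = [(1, "DM", 40.0), (2, "DM", 48.0), (3, "PF", 7.0)]
# Parameterized inserts avoid building SQL strings by hand.
conn.executemany("INSERT INTO surveys VALUES (?, ?, ?)", rows)
conn.commit()


def mean_weight_by_species(connection):
    """Return (species, mean weight) tuples, ordered by species."""
    cursor = connection.execute(
        "SELECT species, AVG(weight) FROM surveys GROUP BY species ORDER BY species"
    )
    return cursor.fetchall()


result = mean_weight_by_species(conn)
print(result)
```

If pandas is installed, the same query can be loaded into a DataFrame with `pandas.read_sql_query(sql, conn)`; that step is left out here so the sketch has no third-party dependencies.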
