Skip to content

Latest commit

 

History

History
72 lines (53 loc) · 4.2 KB

README.md

File metadata and controls

72 lines (53 loc) · 4.2 KB

NetworkXum

NetworkXum is NetworkX-like interface for large persistent graphs stored inside DBMS. This lets you upscale from Megabyte-Gigabyte graphs to Terabyte-Petabyte graphs (that won't fit into RAM), without changing your code. We provide wrappers for following DBs:

  • MongoDB - modern (yet mature) distributed document DB with good performance in most workloads
  • Neo4J - disturbingly slow and unstable DBMS positioned as the go-to Graph database,
  • SQLite3 - ubiquitous compact relational DBMS with basic SQL support,
  • PostgreSQL - most feature-rich open-source relational DB,
  • MySQL - the most commonly-used relational DB.

Project Structure

  • networkxum - Python wrappers for Graph (Network) datastructures backed by persistent DBs.
  • benchmarks - benchmarking tools and performance results.
  • assets - tiny datasets for testing purposes.
  • regexum - Python wrappers for search-able containers backed by persistent DBs.
  • benchmarks - benchmarking tools and performance results.
  • assets - tiny datasets for testing purposes.

Implementation Details & Included DBs

Some common databases have licences that prohibit sharing of benchmark results, so they were excluded from comparisons.

Name Purpose Implementation Language Lines of Code (in /src/)
MongoDB Documents C++ 3'900'000
Postgre Tables C 1'300'000
Neo4J Graphs Java 800'000
ElasticSearch Text Java 730'000
Unum Graphs, Table, Text C++ 80'000

MongoDB

  • A distributed ACID document store.
  • Internally uses the BSON binary format.
  • Very popular open-source project backed by the $MDB publicly traded company.
  • Provides bindings for most programming languages (including PyMongo for Python).

ElasticSearch

  • Java-based document store built on top of Lucene text index.
  • Widely considered high-performance solutions due to the lack of competition.
  • Lucene was ported to multiple languages including projects like: CLucene and LucenePlusPlus.
  • Very popular open-source project backed by the $ESTC publicly traded company.

SQLite3

  • Embedded tabular single-file SQL database with an extreme level of adoption.
  • Provides a more direct C API in addition to the SQL interface.
  • We use SQLAlchemy for Object-Relational-Mapping, which is by far the most common Python ORM tool.
  • We overwrite the page size, Write-Ahead-Log format and concurrency settings for higher performance.

Postgre, MySQL and other SQLs

  • Most common open-source SQL databases.
  • Work well in single-node environment, but scale poorly out of the box.
  • Mostly store search indexes in a form of a B-Tree. They generally provide good read performance, but are slow to update.

Neo4J

  • The best known graph database with over 10 year of history.
  • Instead of SQL provides Cyper DSL for queries, which are transmitted using Bolt protocol.
  • Some of the essential indexing capabilities are not availiable in the free version.
  • There are some compatiability issues between API versions 3.5 and 4.
  • In our experience, Neo4J is extremely unstable and doesn't scale beyond tiny datasets. Generally crashes due to Java VM heap management issues.

TODO

  • Benchmark on small & mid graphs.
  • Session management in SQL and Neo4J.
  • Duration constraints for benchmarks.
  • Mixed Multithreaded Read/Write benchmarks.