GH-3416 LMDB-based SAIL. #3413

kenwenzel · 2021-11-10T12:48:19Z

- based on NativeStore
  - uses similar encodings for triples and values
  - uses (varint encoded) long IDs everywhere
- uses LWJGL interfaces to LMDB
  - fast implementation that is available for multiple platforms
- uses one db for indexes and one db for values

GitHub issue resolved: #3416

Briefly describe the changes proposed in this PR:

This change adds a full LMDB-based SAIL that stores values and triples in two LMDB databases.
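To make the two-database split concrete, here is a minimal sketch, assuming in-memory maps as stand-ins for the two LMDB databases. The class `TwoStoreSketch` and its methods are hypothetical and not taken from this PR. It only illustrates the idea that values are mapped to long IDs in one store while the triple index holds tuples of those IDs in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: in-memory collections stand in for the two LMDB databases.
final class TwoStoreSketch {
    // "Value store": maps each RDF term (a plain string here) to a long ID and back.
    private final Map<String, Long> valueToId = new HashMap<>();
    private final Map<Long, String> idToValue = new HashMap<>();
    private long nextId = 1;

    // "Triple store": a sorted index over (subject, predicate, object) ID tuples.
    private static final Comparator<long[]> SPO_ORDER = (a, b) -> Arrays.compare(a, b);
    private final TreeSet<long[]> spoIndex = new TreeSet<>(SPO_ORDER);

    // Returns the existing ID for a value or assigns the next free one.
    long idFor(String value) {
        Long id = valueToId.get(value);
        if (id == null) {
            id = nextId++;
            valueToId.put(value, id);
            idToValue.put(id, value);
        }
        return id;
    }

    // Stores a triple as a tuple of IDs; duplicates collapse in the sorted set.
    void add(String s, String p, String o) {
        spoIndex.add(new long[] { idFor(s), idFor(p), idFor(o) });
    }

    // Range scan over the index for a fixed subject and predicate.
    List<String> objectsOf(String s, String p) {
        List<String> result = new ArrayList<>();
        Long sId = valueToId.get(s);
        Long pId = valueToId.get(p);
        if (sId == null || pId == null) {
            return result;
        }
        long[] from = { sId, pId, 0 };
        long[] to = { sId, pId, Long.MAX_VALUE };
        for (long[] triple : spoIndex.subSet(from, true, to, true)) {
            result.add(idToValue.get(triple[2]));
        }
        return result;
    }
}
```

A pattern lookup only touches the ID index; the value store is consulted once per result to turn IDs back into terms.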

PR Author Checklist (see the contributor guidelines for more details):

my pull request is self-contained
I've added tests for the changes I made
I've applied code formatting (you can use mvn process-resources to format from the command line)
I've squashed my commits where necessary
every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

kenwenzel · 2021-11-10T12:51:27Z

# LMDB (with Varints and improved memory management)
Benchmark                                                       Mode  Cnt      Score      Error  Units
TransactionsPerSecondBenchmark.largerTransaction               thrpt    5     24.096 ±    1.903  ops/s
TransactionsPerSecondBenchmark.largerTransactionLevelNone      thrpt    5     23.398 ±   33.358  ops/s
TransactionsPerSecondBenchmark.mediumTransactionsLevelNone     thrpt    5  13477.604 ± 8788.563  ops/s
TransactionsPerSecondBenchmark.transactions                    thrpt    5  15810.539 ± 3486.354  ops/s
TransactionsPerSecondBenchmark.transactionsLevelNone           thrpt    5  28134.420 ± 5945.131  ops/s
TransactionsPerSecondBenchmark.veryLargerTransactionLevelNone  thrpt    5      0.276 ±    0.052  ops/s

Benchmark                                               Mode  Cnt     Score     Error  Units
QueryBenchmark.complexQuery                             avgt    5    30.823 ±   2.769  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5   815.920 ±  80.725  ms/op
QueryBenchmark.groupByQuery                             avgt    5    13.679 ±   0.898  ms/op
QueryBenchmark.removeByQuery                            avgt    5   240.115 ±  25.557  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   641.684 ±  43.026  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   497.786 ±  92.203  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5  1042.506 ± 148.653  ms/op

# Native Store
Benchmark                                                       Mode  Cnt    Score     Error  Units
TransactionsPerSecondBenchmark.largerTransaction               thrpt    5    8.552 ±   3.788  ops/s
TransactionsPerSecondBenchmark.largerTransactionLevelNone      thrpt    5   10.301 ±   0.960  ops/s
TransactionsPerSecondBenchmark.mediumTransactionsLevelNone     thrpt    5  324.496 ±  72.688  ops/s
TransactionsPerSecondBenchmark.transactions                    thrpt    5  331.587 ±  42.337  ops/s
TransactionsPerSecondBenchmark.transactionsLevelNone           thrpt    5  298.569 ± 167.852  ops/s
TransactionsPerSecondBenchmark.veryLargerTransactionLevelNone  thrpt    5    0.080 ±   0.014  ops/s

Benchmark                                               Mode  Cnt     Score     Error  Units
QueryBenchmark.complexQuery                             avgt    5    35.331 ±  10.004  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5  1721.104 ± 578.482  ms/op
QueryBenchmark.groupByQuery                             avgt    5    26.727 ±   3.426  ms/op
QueryBenchmark.removeByQuery                            avgt    5  1507.656 ± 805.813  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5  5348.434 ± 650.709  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5  5422.728 ± 894.426  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5  5360.098 ± 750.562  ms/op

hmottestad · 2021-11-10T15:18:14Z

That query performance is pretty impressive!

Edit: Any chance it's too good to be true? It's miles ahead of the MemoryStore.

kenwenzel · 2021-11-10T15:49:30Z

> That query performance is pretty impressive!
>
> Edit: Any chance it's too good to be true? It's miles ahead of the MemoryStore.

I'm also not sure if the numbers are correct. I've re-run the benchmarks several times with the same results.
To get more confidence I've also added corresponding tests in compliance/sparql. Those tests pass - at least on my computer.

BTW: I've used MDB_NOSYNC and MDB_NOMETASYNC for the benchmarks. Maybe most of the data is cached in (off-heap) memory.

I will dig into it.

JervenBolleman · 2021-11-11T15:47:22Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ContextStore.java

+ * byte 12 - A : the UTF-8 encoded the encoded context identifer
+ * </pre>
+ *
+ * @author Jeen Broekstra


I think this Author field should be different.

Some files like ContextStore.java are just (almost) verbatim copies from NativeStore. Therefore I didn't touch the author fields.
Should I change or extend the field even for small changes?

JervenBolleman · 2021-11-11T15:57:32Z

I think this is fantastic. I would love to have it with longs as identifiers. Making it usable for large datasets like UniProt or Wikidata.

I put some small issues in already from a brief first look. But to me it looks like good quality code.

kenwenzel · 2021-11-11T16:57:13Z

> I think this is fantastic. I would love to have it with longs as identifiers. Making it usable for large datasets like UniProt or Wikidata.
>
> I put some small issues in already from a brief first look. But to me it looks like good quality code.

I also think that longs are the way to go. To keep storage space-efficient, varints can also be used.

abrokenjester

This looks very impressive @kenwenzel, thanks! I'll try to find time for a more comprehensive review as soon as possible.

I've noticed you have copy-pasted a lot of existing code into this new module to be able to reuse it (can tell by the use of old copyright headers and author tags). Two things:

1. I'd prefer treating these as "new" (so with a new copyright header and author) even if they are derived from existing code.
2. I haven't looked in detail, but do we need code duplication in this fashion, or is there some way we can organize this to make code reuse a little easier?

abrokenjester · 2021-11-13T01:25:36Z

core/sail/lmdb/pom.xml

+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl-lmdb</artifactId>
+ <version>${lwjgl.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl-lmdb</artifactId>
+ <classifier>natives-linux</classifier>
+ <version>${lwjgl.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl-lmdb</artifactId>
+ <classifier>natives-macos</classifier>
+ <version>${lwjgl.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl-lmdb</artifactId>
+ <classifier>natives-windows</classifier>
+ <version>${lwjgl.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl</artifactId>
+ <classifier>natives-linux</classifier>
+ <version>${lwjgl.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl</artifactId>
+ <classifier>natives-macos</classifier>
+ <version>${lwjgl.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.lwjgl</groupId>
+ <artifactId>lwjgl</artifactId>
+ <classifier>natives-windows</classifier>
+ <version>${lwjgl.version}</version>
+ </dependency>


We will need to check the IP status of these and, if necessary, file CQs for them.

We'll need to file CQs against this: they are not sufficiently documented in ClearlyDefined, and not previously registered for an IP check with Eclipse Foundation either. I'll start filing some CQs.

I've gotten CQs for the two main libraries (lwjgl and lwjgl-lmdb) approved. If we include the native extensions as optional runtimes only, we won't need to file additional CQs for those (especially since they're just different compilations of essentially the same code).

abrokenjester · 2021-11-13T01:26:54Z

compliance/sparql/src/test/java/org/eclipse/rdf4j/sail/lmdb/LmdbComplexSPARQLQueryTest.java

@@ -0,0 +1,41 @@
+/*******************************************************************************
+ * Copyright (c) 2015 Eclipse RDF4J contributors, Aduna, and others.


This copyright header needs to be changed - even if you copied it from an existing file, it's essentially a new contribution I'd say.

abrokenjester · 2021-11-14T03:07:47Z

@kenwenzel can you please create an issue in our issue tracker that describes the proposed improvement, and link this PR to it? And if you have them, I'd appreciate some pointers to the source of the LMDB libraries that you're using, and any licensing information for them - I'll need that info to log them for IP review.

hmottestad · 2021-11-14T11:19:13Z

I'm on OS X with the new ARM based M1 processor. Seems that support is coming in the next release (3.3.0).

LWJGL/lwjgl3#601

I'm currently not able to run the tests or benchmarks. We could consider trying out the 3.3.0 snapshot version.

kenwenzel · 2021-11-14T14:22:51Z

Short update from my side: I'm going to switch to varint encoded long ids and I'm planning to push this change within the next few days.
I'm also working on a zero-copy design for the record iterators on another branch.

kenwenzel · 2021-11-15T08:39:52Z

> @kenwenzel can you please create an issue in our issue tracker that describes the proposed improvement, and link this PR to it? And if you have them, I'd appreciate some pointers to the source of the LMDB libraries that you're using, and any licensing information for them - I'll need that info to log them for IP review.

@jeenbroekstra You mean besides #3416? So I should create an LMDB-specific issue with some explanation?

abrokenjester · 2021-11-15T10:58:59Z

Ah, no, you can use that issue, but it was not clear to me that was what you were using. Can you link them up by mentioning the issue number in your commit messages and the PR description please?

kenwenzel · 2021-11-18T07:34:07Z

> I'm on OS X with the new ARM based M1 processor. Seems that support is coming in the next release (3.3.0).
>
> LWJGL/lwjgl3#601
>
> I'm currently not able to run the tests or benchmarks. We could consider trying out the 3.3.0 snapshot version.

@hmottestad I've pushed a commit with an update to LWJGL 3.3.0.

kenwenzel · 2021-11-18T07:47:57Z

A short update: I have some open issues with this.

- drop custom comparator and use natural sort order for indexes (callbacks from native to Java are slow with LWJGL)
- change group varint encoding to be compatible with natural sort order of the numbers (prerequisite for the above)
- use varints everywhere in value store
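To illustrate why the encoding matters once the custom comparator is dropped: LMDB's default key comparison is byte-wise, so an encoded ID has to sort the same way as the number it represents. The sketch below is one simple length-prefixed scheme that has this property for non-negative longs. It is an assumption for illustration only and is not the encoding used in this PR; `OrderedVarint` and its methods are hypothetical names.

```java
import java.util.Arrays;

// Illustrative only: a length-prefixed, big-endian varint whose unsigned
// byte-wise order matches the numeric order of the encoded values.
final class OrderedVarint {

    // Encodes a non-negative long as [payload length][big-endian payload bytes].
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("only non-negative values are supported");
        }
        // Number of bytes needed for the significant bits (0 for the value 0).
        int n = (64 - Long.numberOfLeadingZeros(value) + 7) / 8;
        byte[] out = new byte[1 + n];
        out[0] = (byte) n;
        for (int i = 0; i < n; i++) {
            out[1 + i] = (byte) (value >>> (8 * (n - 1 - i)));
        }
        return out;
    }

    // Reads the length byte, then reassembles the big-endian payload.
    static long decode(byte[] bytes) {
        int n = bytes[0];
        long value = 0;
        for (int i = 1; i <= n; i++) {
            value = (value << 8) | (bytes[i] & 0xFF);
        }
        return value;
    }

    // Unsigned lexicographic comparison, the kind of order a byte-wise
    // key comparison produces (requires Java 9+).
    static int compareKeys(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        long[] ids = { 0, 1, 255, 256, 70_000, Long.MAX_VALUE };
        for (long id : ids) {
            System.out.println(id + " -> " + Arrays.toString(encode(id)));
        }
    }
}
```

A larger number never needs fewer payload bytes than a smaller one, so the length byte decides the order between values of different size and the big-endian payload decides it otherwise. Small IDs still take only one or two bytes.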

hmottestad · 2021-11-19T11:55:02Z

I got it running on my laptop :) Thanks!

I took a look at the benchmarks and changed the complex query one to this:

	@Benchmark
	public long complexQuery() {

		try (SailRepositoryConnection connection = repository.getConnection()) {
			long count = connection
				.prepareTupleQuery(query4)
				.evaluate()
				.stream()
				.count();
			System.out.println(count);
			return count;
		}
	}

It prints 0 which is wrong. You can change the one in the MemoryStore and see what it's supposed to be.

kenwenzel · 2021-11-19T13:25:09Z

> I got it running on my laptop :) Thanks!
>
> I took a look at the benchmarks and changed the complex query one to this:
>
> 	@Benchmark
> 	public long complexQuery() {
>
> 		try (SailRepositoryConnection connection = repository.getConnection()) {
> 			long count = connection
> 				.prepareTupleQuery(query4)
> 				.evaluate()
> 				.stream()
> 				.count();
> 			System.out.println(count);
> 			return count;
> 		}
> 	}
>
> It prints 0 which is wrong. You can change the one in the MemoryStore and see what it's supposed to be.

Thank you! I will dig into this.

abrokenjester · 2021-11-20T01:03:30Z

Minor remark - you're using issue number 3415 in several commits, but the actual related issue is #3416.

kenwenzel · 2021-11-25T12:41:17Z

@jeenbroekstra The SPARQL compliance tests for LMDB fail now due to the optional dependencies to LWJGL native libraries.
Should I add those dependencies to the compliance tests or just remove the compliance tests for LMDB?

abrokenjester · 2021-11-25T20:27:42Z

@kenwenzel Ah, yes, adding them as runtime dependencies in the compliance test module should fix that.

kenwenzel · 2021-11-28T15:52:09Z

# LMDB
Benchmark                                                       Mode  Cnt      Score      Error  Units
TransactionsPerSecondBenchmark.largerTransaction               thrpt    5     31.845 ±    0.861  ops/s
TransactionsPerSecondBenchmark.largerTransactionLevelNone      thrpt    5     44.121 ±    2.161  ops/s
TransactionsPerSecondBenchmark.mediumTransactionsLevelNone     thrpt    5  19986.561 ±  774.509  ops/s
TransactionsPerSecondBenchmark.transactions                    thrpt    5  21565.831 ±  815.203  ops/s
TransactionsPerSecondBenchmark.transactionsLevelNone           thrpt    5  38754.785 ± 1597.893  ops/s
TransactionsPerSecondBenchmark.veryLargerTransactionLevelNone  thrpt    5      0.268 ±    0.016  ops/s

Benchmark                                               Mode  Cnt     Score     Error  Units
QueryBenchmark.complexQuery                             avgt    5    33.494 ±   2.694  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5  1231.341 ± 120.133  ms/op
QueryBenchmark.groupByQuery                             avgt    5    18.464 ±   1.735  ms/op
QueryBenchmark.removeByQuery                            avgt    5   231.470 ±  14.651  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   683.210 ±  25.924  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   514.449 ±  20.417  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5  1095.385 ± 216.371  ms/op

Findings:

- insertion speed also improved for non-hashed values (benchmarks only generate very short literals)
- degraded query performance because of 2 hops for lookups of hashed values (id -> hash and hash+id -> value)
- TODO: rework storing of hashed values again
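The cost of the two hops can be pictured with a small sketch, assuming plain in-memory maps in place of LMDB lookups. `HashedValueLookupSketch`, its fields and the use of `String.hashCode()` are hypothetical and only mirror the two layouts named above (id -> hash plus hash+id -> value, versus a direct id -> value mapping); the actual on-disk layout in this PR may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: counts simulated reads for the two layouts.
final class HashedValueLookupSketch {
    int reads; // number of simulated database reads so far

    // Layout A: id -> hash, then (hash, id) -> value.
    private final Map<Long, Long> idToHash = new HashMap<>();
    private final Map<String, String> hashAndIdToValue = new HashMap<>();

    // Layout B: id -> value.
    private final Map<Long, String> idToValue = new HashMap<>();

    // Stores the value under both layouts so they can be compared side by side.
    void put(long id, String value) {
        long hash = value.hashCode(); // stand-in for whatever hash the store uses
        idToHash.put(id, hash);
        hashAndIdToValue.put(hash + ":" + id, value);
        idToValue.put(id, value);
    }

    // Two reads: first resolve the hash, then fetch the value by hash and ID.
    String getTwoHop(long id) {
        reads++;
        Long hash = idToHash.get(id);
        if (hash == null) {
            return null;
        }
        reads++;
        return hashAndIdToValue.get(hash + ":" + id);
    }

    // One read: the value is keyed directly by its ID.
    String getDirect(long id) {
        reads++;
        return idToValue.get(id);
    }
}
```

Every ID-to-value resolution during query evaluation pays the extra read in the first layout, which would be consistent with the slower query numbers reported here.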

kenwenzel · 2021-12-06T14:20:54Z

This is now feature-complete in comparison to the native store. Two open points are:

1. add a performant GC algorithm for values (which was the initial motivation for this PR)
2. maybe drop the complex logic for transaction isolation and rely solely on LMDB's transactions (problem: only one writer at a time is allowed)

Point 1 is really necessary while point 2 would be nice to have.

How should we proceed? Do you want to merge this PR first and then we add GC and maybe other features?

hmottestad · 2021-12-06T16:31:26Z

Previously we've marked features as experimental or for internal use only. That way we can merge early while still allowing for large changes later on.

JervenBolleman · 2021-12-06T20:58:43Z

As the work is stand alone as a new sail, I agree with @hmottestad to merge it with an experimental tag.

abrokenjester · 2021-12-11T02:52:53Z

Great work @kenwenzel , thanks!

kenwenzel force-pushed the lmdb branch 4 times, most recently from 6fcb342 to ae041b4 Compare November 10, 2021 15:03

kenwenzel force-pushed the lmdb branch from ae041b4 to 5db7944 Compare November 10, 2021 15:43

kenwenzel force-pushed the lmdb branch from 5db7944 to cc9d395 Compare November 10, 2021 16:25

JervenBolleman reviewed Nov 11, 2021

kenwenzel force-pushed the lmdb branch from cc9d395 to 052c89f Compare November 11, 2021 16:58

abrokenjester reviewed Nov 13, 2021

abrokenjester linked an issue Nov 13, 2021 that may be closed by this pull request

Implement alternative embedded persistent backend #3416

Closed

abrokenjester added the ✋ CQ-Pending requires a CQ to be approved label Nov 14, 2021

kenwenzel force-pushed the lmdb branch from 052c89f to a7171cd Compare November 16, 2021 07:09

kenwenzel changed the title ~~Initial LMDB-based SAIL.~~ GH-3416 Initial LMDB-based SAIL. Nov 16, 2021

kenwenzel force-pushed the lmdb branch from a7171cd to ce83445 Compare November 16, 2021 08:23

hmottestad reviewed Nov 19, 2021

kenwenzel force-pushed the lmdb branch from b736e11 to 8d5999e Compare November 25, 2021 12:19

GH-3416 Remove specific authors and update copyright headers.

4ebda36

kenwenzel force-pushed the lmdb branch from 8d5999e to 4ebda36 Compare November 25, 2021 12:27

GH-3416 Fixed hashing of values and use max key size of 16.

b4da0c4

kenwenzel force-pushed the lmdb branch from eaa9d35 to b4da0c4 Compare November 25, 2021 16:00

kenwenzel added 2 commits November 25, 2021 17:19

GH-3416 Use varints for namespace and datatype IDs.

a175daa

GH-3416 Use varints also for hash values.

7b18de1

kenwenzel added 4 commits November 26, 2021 18:40

GH-3416 Improve storage of hashed values.

6cfd2a7

GH-3416 Include LWJGL natives for SPARQL compliance tests.

33075fe

GH-3416 Correctly handle forceSync option.

218d241

GH-3416 Fix Varint.firstToLength and speed up storage of hashed values.

f64ce93

GH-3416 Directly store data of hashed values keyed by ID.

535af54

kenwenzel changed the title ~~GH-3416 Initial LMDB-based SAIL.~~ GH-3416 LMDB-based SAIL. Nov 29, 2021

kenwenzel marked this pull request as ready for review November 29, 2021 11:51

kenwenzel added 6 commits November 29, 2021 13:22

GH-3416 Use MDB_RESERVE to store larger values.

64d7023

GH-3416 Correctly use subfolders "triples" and "values".

9b4bbb1

GH-3416 Remove superfluous upgrade logic.

8f67d96

GH-3416 Replace RepositoryUtil.difference to vastly improve runtime of test.

9155045

GH-3416 Streamline config and include db size for value and triple store.

8e4e457

GH-3416 Adapt SPARQL compliance tests to changed LMDB Store API.

0540994

GH-3416 Update remaining copyright headers and mark LmdbStore as experimental.

4b8023d

abrokenjester approved these changes Dec 11, 2021

abrokenjester merged commit bd2f136 into eclipse-rdf4j:develop Dec 11, 2021
