Group CQL keys into a single query

Fixes #3863 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
JanusGraph · Jul 8, 2023 · 397d65e · 397d65e
1 parent 16846ad
commit 397d65e
Show file tree

Hide file tree

Showing 22 changed files with 1,055 additions and 184 deletions.
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -328,6 +328,29 @@ and optimize this method execution to execute all the slice queries for all the
 (for example, by using asynchronous programming, slice queries grouping, multi-threaded execution, or any other technique which 
 is efficient for the respective storage adapter).  
 
+##### CQL storage backend now groups multiple keys and multiple slice queries together for the same token ranges
+
+Starting from JanusGraph 1.0.0 CQL storage implementation now groups different keys (partition keys) and slice queries together to 
+minimize the amount of CQL queries sent to the storage backend. 
+This often comes with better performance characteristics due to less network overhead. 
+However, in some cases it might degrade performance due to imbalanced queries execution. 
+Moreover, keys grouping requires fetching an additional `key` column (i.e. vertex id) per each fetched data column (i.e. per each fetched property or edge). 
+An additional `key` column retrieval is often smaller than the separate query network overhead (unless very large `String` vertex ids are used), 
+but for some Serverless storage deployments this additional `key` bytes might be charged by the storage providers.
+In many cases proper grouping of queries and / or keys together bring benefits for CQL queries execution, but in 
+case the previous behaviour is desired then the grouping can be disabled by setting `storage.cql.grouping.slice-allowed = false` and
+`storage.cql.grouping.keys-allowed = false`.
+
+It is recommended to use the below new configurations to configure grouping limits which brings the most benefit for your deployment:
+-   `storage.cql.grouping.slice-limit` - maximum amount of slice queries grouped together. As for now only queries which are fetching multiple properties with cardinality `SINGLE` can be grouped. Any other query will be executed separately.
+-   `storage.cql.grouping.keys-limit` - maximum amount of keys (vertex ids) grouped together. Only keys which are placed on the same tokens range can be grouped together (i.e. only keys which touch the same nodes).
+
+By default `storage.cql.grouping.keys-limit` is set to `100`, but it's recommended to increase this value depending on the use case. 
+Notice, ScyllaDB storage backend has a limit on distinct clustering key restrictions per query which is set to `100` by default. 
+To be able to use more than `100` for `storage.cql.grouping.keys-limit` or `storage.cql.grouping.slice-limit` it is required first to ensure 
+that a bigger amount of distinct clustering key restrictions is allowed on the storage backend. On ScyllaDB it's possible 
+to configure this restriction using [max-partition-key-restrictions-per-query](https://enterprise.docs.scylladb.com/branch-2022.2/faq.html#how-can-i-change-the-maximum-number-of-in-restrictions) configuration option.
+
 ##### Removal of deprecated classes/methods/functionalities
 
 ###### Methods

diff --git a/docs/configs/janusgraph-cfg.md b/docs/configs/janusgraph-cfg.md
@@ -461,10 +461,6 @@ CQL storage backend options
 | storage.cql.request-timeout | Timeout for CQL requests in milliseconds. See DataStax Java Driver option `basic.request.timeout` for more information. | Long | 12000 | MASKABLE |
 | storage.cql.session-leak-threshold | The maximum number of live sessions that are allowed to coexist in a given VM until the warning starts to log for every new session. If the value is less than or equal to 0, the feature is disabled: no warning will be issued. See DataStax Java Driver option `advanced.session-leak.threshold` for more information. | Integer | (no default value) | MASKABLE |
 | storage.cql.session-name | Default name for the Cassandra session | String | JanusGraph Session | MASKABLE |
-| storage.cql.slice-grouping-allowed | If `true` this allows multiple Slice queries which are allowed to be performed as non-range queries (i.e. direct equality operation) to be grouped together into a single CQL query via `IN` operator. Notice, currently only operations to fetch properties with Cardinality.SINGLE are allowed to be performed as non-range queries (edges fetching or properties with Cardinality SET or LIST won't be grouped together).
-If this option if `false` then each Slice query will be executed in a separate asynchronous CQL query even when grouping is allowed. | Boolean | true | MASKABLE |
-| storage.cql.slice-grouping-limit | Maximum amount of grouped together slice queries into a single CQL query.
-This option is used only when `storage.cql.slice-grouping-allowed` is `true`. | Integer | 100 | MASKABLE |
 | storage.cql.speculative-retry | The speculative retry policy. One of: NONE, ALWAYS, <X>percentile, <N>ms. | String | (no default value) | FIXED |
 | storage.cql.ttl-enabled | Whether TTL should be enabled or not. Must be turned off if the storage does not support TTL. Amazon Keyspace, for example, does not support TTL by default unless otherwise enabled. | Boolean | true | LOCAL |
 | storage.cql.use-external-locking | True to prevent JanusGraph from using its own locking mechanism. Setting this to true eliminates redundant checks when using an external locking mechanism outside of JanusGraph. Be aware that when use-external-locking is set to true, that failure to employ a locking algorithm which locks all columns that participate in a transaction upfront and unlocks them when the transaction ends, will result in a 'read uncommitted' transaction isolation level guarantee. If set to true without an appropriate external locking mechanism in place side effects such as dirty/non-repeatable/phantom reads should be expected. | Boolean | false | MASKABLE |
@@ -482,6 +478,27 @@ Configuration options for CQL executor service which is used to process deserial
 | storage.cql.executor-service.max-pool-size | Maximum pool size for executor service. Ignored for `fixed` and `cached` executor services. May be ignored if custom executor service is used (depending on the implementation of the executor service). | Integer | 2147483647 | LOCAL |
 | storage.cql.executor-service.max-shutdown-wait-time | Max shutdown wait time in milliseconds for executor service threads to be finished during shutdown. After this time threads will be interrupted (signalled with interrupt) without any additional wait time. | Long | 60000 | LOCAL |
 
+### storage.cql.grouping
+Configuration options for controlling CQL queries grouping
+
+
+| Name | Description | Datatype | Default Value | Mutability |
+| ---- | ---- | ---- | ---- | ---- |
+| storage.cql.grouping.keys-allowed | If `true` this allows multiple partition keys which belong to the same token range to be grouped together into a single CQL query via `IN` operator.
+Notice, that any CQL query grouped with more than 1 key will require to return a row key for any column fetched.
+If this option is `false` then each partition key will be executed in a separate asynchronous CQL query even when multiple keys from the same token range are queried. | Boolean | true | MASKABLE |
+| storage.cql.grouping.keys-limit | Maximum amount of the keys which belong to the same token range to be grouped together into a single CQL query. If more keys for the same token range are queried, they are going to be grouped into separate CQL queries.
+Notice, for ScyllaDB this option should not exceed the maximum number of distinct clustering key restrictions per query which can be changed by ScyllaDB configuration option `max-partition-key-restrictions-per-query` (https://enterprise.docs.scylladb.com/branch-2022.2/faq.html#how-can-i-change-the-maximum-number-of-in-restrictions).
+This option takes effect only when `storage.cql.grouping.keys-allowed` is `true`. | Integer | 100 | MASKABLE |
+| storage.cql.grouping.keys-min | Minimum amount of keys to consider for grouping. Grouping will be skipped for any multi-key query which has less than this amount of keys (i.e. a separate CQL query will be executed for each key in such case).
+Usually this configuration should always be set to `2`. It is useful to increase the value only in cases when queries with more keys should not be grouped, but be performed separately to increase parallelism in expense of the network overhead.
+This option takes effect only when `storage.cql.grouping.keys-allowed` is `true`. | Integer | 2 | MASKABLE |
+| storage.cql.grouping.slice-allowed | If `true` this allows multiple Slice queries which are allowed to be performed as non-range queries (i.e. direct equality operation) to be grouped together into a single CQL query via `IN` operator. Notice, currently only operations to fetch properties with Cardinality.SINGLE are allowed to be performed as non-range queries (edges fetching or properties with Cardinality SET or LIST won't be grouped together).
+If this option if `false` then each Slice query will be executed in a separate asynchronous CQL query even when grouping is allowed. | Boolean | true | MASKABLE |
+| storage.cql.grouping.slice-limit | Maximum amount of grouped together slice queries into a single CQL query.
+Notice, for ScyllaDB this option should not exceed the maximum number of distinct clustering key restrictions per query which can be changed by ScyllaDB configuration option `max-partition-key-restrictions-per-query` (https://enterprise.docs.scylladb.com/branch-2022.2/faq.html#how-can-i-change-the-maximum-number-of-in-restrictions).
+This option is used only when `storage.cql.grouping.slice-allowed` is `true`. | Integer | 100 | MASKABLE |
+
 ### storage.cql.internal
 Advanced configuration of internal DataStax driver. Notice, all available configurations will be composed in the order. Non specified configurations will be skipped. By default only base configuration is enabled (which has the smallest priority. It means that you can overwrite any configuration used in base programmatic configuration by using any other configuration type). The configurations are composed in the next order (sorted by priority in descending order): `file-configuration`, `resource-configuration`, `string-configuration`, `url-configuration`, `base-programmatic-configuration` (which is controlled by `base-programmatic-configuration-enabled` property). Configurations with higher priority always overwrite configurations with lower priority. I.e. if the same configuration parameter is used in both `file-configuration` and `string-configuration` the configuration parameter from `file-configuration` will be used and configuration parameter from `string-configuration` will be ignored. See available configuration options and configurations structure here: https://docs.datastax.com/en/developer/java-driver/4.13/manual/core/configuration/reference/
 

diff --git a/janusgraph-core/src/main/java/org/janusgraph/diskstorage/EntryMetaData.java b/janusgraph-core/src/main/java/org/janusgraph/diskstorage/EntryMetaData.java
@@ -29,7 +29,8 @@ public enum EntryMetaData {
 
     TTL(Integer.class, false, data -> data instanceof Integer && ((Integer) data) >= 0L),
     VISIBILITY(String.class, true, data -> data instanceof String && StringEncoding.isAsciiString((String) data)),
-    TIMESTAMP(Long.class, false, data -> data instanceof Long);
+    TIMESTAMP(Long.class, false, data -> data instanceof Long),
+    ROW_KEY(StaticBuffer.class, false, data -> data instanceof StaticBuffer);
 
     public static final java.util.Map<EntryMetaData,Object> EMPTY_METADATA = Collections.emptyMap();
 

diff --git a/janusgraph-core/src/main/java/org/janusgraph/diskstorage/util/CompletableFutureUtil.java b/janusgraph-core/src/main/java/org/janusgraph/diskstorage/util/CompletableFutureUtil.java
@@ -18,7 +18,9 @@
 
 import java.util.Collection;
 import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Map;
+import java.util.Set;
 import java.util.concurrent.CompletableFuture;
 import java.util.concurrent.CompletionException;
 import java.util.concurrent.ExecutionException;
@@ -63,14 +65,17 @@ public static <T> T get(CompletableFuture<T> future){
     public static <K,V> Map<K,V> unwrap(Map<K,CompletableFuture<V>> futureMap) throws Throwable{
         Map<K, V> resultMap = new HashMap<>(futureMap.size());
         Throwable firstException = null;
+        Set<Throwable> uniqueExceptions = null;
         for(Map.Entry<K, CompletableFuture<V>> entry : futureMap.entrySet()){
             try{
                 resultMap.put(entry.getKey(), entry.getValue().get());
             } catch (Throwable throwable){
+                Throwable rootException = unwrapExecutionException(throwable);
                 if(firstException == null){
-                    firstException = throwable;
-                } else {
-                    firstException.addSuppressed(throwable);
+                    firstException = rootException;
+                    uniqueExceptions = new HashSet<>(1);
+                } else if(firstException != rootException && uniqueExceptions.add(rootException)){
+                    firstException.addSuppressed(rootException);
                 }
             }
         }
@@ -84,14 +89,17 @@ public static <K,V> Map<K,V> unwrap(Map<K,CompletableFuture<V>> futureMap) throw
     public static <K,V,R> Map<K,Map<V, R>> unwrapMapOfMaps(Map<K, Map<V, CompletableFuture<R>>> futureMap) throws Throwable{
         Map<K, Map<V, R>> resultMap = new HashMap<>(futureMap.size());
         Throwable firstException = null;
+        Set<Throwable> uniqueExceptions = null;
         for(Map.Entry<K, Map<V, CompletableFuture<R>>> entry : futureMap.entrySet()){
             try{
                 resultMap.put(entry.getKey(), unwrap(entry.getValue()));
             } catch (Throwable throwable){
+                Throwable rootException = unwrapExecutionException(throwable);
                 if(firstException == null){
-                    firstException = throwable;
-                } else {
-                    firstException.addSuppressed(throwable);
+                    firstException = rootException;
+                    uniqueExceptions = new HashSet<>(1);
+                } else if(firstException != rootException && uniqueExceptions.add(rootException)){
+                    firstException.addSuppressed(rootException);
                 }
             }
         }
@@ -104,14 +112,17 @@ public static <K,V,R> Map<K,Map<V, R>> unwrapMapOfMaps(Map<K, Map<V, Completable
 
     public static <V> void awaitAll(Collection<CompletableFuture<V>> futureCollection) throws Throwable{
         Throwable firstException = null;
+        Set<Throwable> uniqueExceptions = null;
         for(CompletableFuture<V> future : futureCollection){
             try{
                 future.get();
             } catch (Throwable throwable){
+                Throwable rootException = unwrapExecutionException(throwable);
                 if(firstException == null){
-                    firstException = throwable;
-                } else {
-                    firstException.addSuppressed(throwable);
+                    firstException = rootException;
+                    uniqueExceptions = new HashSet<>(1);
+                } else if(firstException != rootException && uniqueExceptions.add(rootException)){
+                    firstException.addSuppressed(rootException);
                 }
             }
         }

diff --git a/janusgraph-core/src/main/java/org/janusgraph/diskstorage/util/StaticArrayEntryList.java b/janusgraph-core/src/main/java/org/janusgraph/diskstorage/util/StaticArrayEntryList.java
@@ -654,6 +654,8 @@ public static MetaDataSerializer getSerializer(EntryMetaData meta) {
                 return LongSerializer.INSTANCE;
             case VISIBILITY:
                 return ASCIIStringSerializer.INSTANCE;
+            case ROW_KEY:
+                return StaticBufferSerializer.INSTANCE;
             default: throw new AssertionError("Unexpected meta data: " + meta);
         }
     }
@@ -736,4 +738,38 @@ public String read(byte[] data, int startPos) {
         }
     }
 
+    private enum StaticBufferSerializer implements MetaDataSerializer<StaticBuffer> {
+
+        INSTANCE;
+
+        private static final StaticBuffer EMPTY_STATIC_BUFFER = StaticArrayBuffer.of(new byte[0]);
+
+        @Override
+        public int getByteLength(StaticBuffer value) {
+            return value.length() + 4;
+        }
+
+        @Override
+        public void write(byte[] data, int startPos, StaticBuffer value) {
+            int length = value.length();
+            StaticArrayBuffer.putInt(data, startPos, length);
+            if(length > 0){
+                startPos+=4;
+                for(int i=0; i<length; i++){
+                    data[startPos++] = value.getByte(i);
+                }
+            }
+        }
+
+        @Override
+        public StaticBuffer read(byte[] data, int startPos) {
+            int bufSize = StaticArrayBuffer.getInt(data, startPos);
+            if(bufSize == 0){
+                return EMPTY_STATIC_BUFFER;
+            }
+            startPos+=4;
+            return new StaticArrayBuffer(data, startPos, startPos+bufSize);
+        }
+    }
+
 }
diff --git a/janusgraph-cql/src/main/java/org/janusgraph/diskstorage/cql/CQLColValGetter.java b/janusgraph-cql/src/main/java/org/janusgraph/diskstorage/cql/CQLColValGetter.java
@@ -18,10 +18,15 @@
 import io.vavr.Tuple3;
 import org.janusgraph.diskstorage.EntryMetaData;
 import org.janusgraph.diskstorage.StaticBuffer;
+import org.janusgraph.diskstorage.util.StaticArrayBuffer;
 import org.janusgraph.diskstorage.util.StaticArrayEntry.GetColVal;
 
+import java.nio.ByteBuffer;
+
 public class CQLColValGetter implements GetColVal<Tuple3<StaticBuffer, StaticBuffer, Row>, StaticBuffer> {
 
+    private static final StaticArrayBuffer EMPTY_KEY = StaticArrayBuffer.of(new byte[0]);
+
     private final EntryMetaData[] schema;
 
     CQLColValGetter(final EntryMetaData[] schema) {
@@ -50,6 +55,9 @@ public Object getMetaData(final Tuple3<StaticBuffer, StaticBuffer, Row> tuple, f
                 return tuple._3.getLong(CQLKeyColumnValueStore.WRITETIME_COLUMN_NAME);
             case TTL:
                 return tuple._3.getInt(CQLKeyColumnValueStore.TTL_COLUMN_NAME);
+            case ROW_KEY:
+                ByteBuffer rawKey = tuple._3.getByteBuffer(CQLKeyColumnValueStore.KEY_COLUMN_NAME);
+                return rawKey == null ? EMPTY_KEY : StaticArrayBuffer.of(rawKey);
             default:
                 throw new UnsupportedOperationException("Unsupported meta data: " + metaData);
         }