Replies: 3 comments 11 replies
-
Nice writup, you sum up the problem in a comprehensive way which makes it easy to wrap your head around it. I just wanted to let you know that I am currently working on an approach to avoid traversing supernodes by applying the concept of materialized views to graph databases. However, this research is still in the early stages and of course, it will only cover a subset of use cases. |
Beta Was this translation helpful? Give feedback.
-
As mentioned in the opening comment, to address the limitation of the existing partitioning feature, we could explicitly let users do the POC is below: It now works like this (complete code is available in the commit above): mgmt.makePropertyKey("proxies").dataType(Long.class).cardinality(Cardinality.LIST).make();
mgmt.makeVertexLabel("proxy").setProxy().make();
mgmt.commit();
newTx();
Vertex v0 = graph.addVertex("vertexId", "v0");
Vertex v1 = graph.addVertex("vertexId", "v1");
Vertex v2 = graph.addVertex("vertexId", "v2");
Vertex v3a = graph.addVertex("vertexId", "v3a");
Vertex v3b = graph.addVertex("vertexId", "v3b");
Vertex v4 = graph.addVertex("vertexId", "v4");
v0.addEdge("normal-edge", v1);
v2.addEdge("normal-edge", v3a);
v1.addEdge("labelX", v4);
// assume now v1 becomes a super node and we decide to introduce proxy node(s). The previous edges are retained.
// assume the application logic adds an edge from v1 to v2 with labelX, another edge from v1 to v3a with labelY,
// and another edge from v1 to v3b with labelY.
// Users need to connect v1 to v2/v3a/v3b via vProxy explicitly.
Vertex vProxy = graph.addVertex(T.label, "proxy", "canonicalId", v1.id());
vProxy.addEdge("labelX", v2, "runDate", "01");
vProxy.addEdge("labelY", v3a, "runDate", "02");
vProxy.addEdge("labelY", v3b, "runDate", "02");
v1.property("proxies", vProxy.id());
graph.tx().commit();
// Now we can do normal traversals and JanusGraph will handle the proxy!
assertEquals(4, graph.traversal().V(v1).out().count().next()); As you can see, this approach gives users the flexibility to decide when to create proxies and when an edge should be added to the proxy node. |
Beta Was this translation helpful? Give feedback.
-
Okay I’ll create a PR and we can collaborate on it, including review/feedback/code.
Get Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: dxtr-1-0 ***@***.***>
Sent: Wednesday, April 17, 2024 1:47:50 AM
To: JanusGraph/janusgraph ***@***.***>
Cc: ***@***.*** ***@***.***>; Mention ***@***.***>
Subject: Re: [JanusGraph/janusgraph] Proxy node for super node (Discussion #2717)
Yes, I would also be interested in taking it forward since i feel for bigger graphs/use cases, super nodes become an inevitable problem. Although, I am not that familiar with JanusGraph code yet so I will have to go through it. It would be awesome if you could raise a PR whenever your schedule allows so that its easier to track and contribute.
—
Reply to this email directly, view it on GitHub<#2717 (reply in thread)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AGENUWUYXQMT6CPUD7MXST3Y5YZLNAVCNFSM5A3DG2O2U5DIOJSWCZC7NNSXTOKENFZWG5LTONUW63SDN5WW2ZLOOQ5TSMJTHE3DIMQ>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
Supernode is a famous problem in graph databases. A supernode is a node with too many neighbors (and thus edges), typically in hundreds of thousands and more. This causes two problems:
Super slow traversal when a supernode is reached. Due to the high amount of possible paths, traversal involving super nodes usually becomes very slow. This is known as "fanning out problem" and well discussed. In the gremlin language, one can explicitly add a
limit
step (and of course, one can implement a custom traversal strategy to insert limit steps, depending on their use case). Some other graph databases also have their own solutions/workarounds. For example, Nebula Graph provides an optionmax_edge_returned_per_vertex
which essentially does the same thing as thelimit
step.High memory usage. Depending on how the data is stored, this can be a problem for some graph databases, including JanusGraph. Essentially, any graph database that stores adjacency lists of a node in the same partition, suffers this problem. In JanusGraph, because all properties and edges of an arbitrary vertex are stored as adjacency lists in the same partition (see data model here), a supernode can take up a lot of memory. This does not only slow down the performance but also increases the chance of getting OOM (out of memory) when you attempt to load the whole node. Even worse, when your supernode grows to an extent that it cannot be fitted into a single machine, you have to migrate data to a more capable machine.
To solve the 1st problem, JanusGraph does not only allow you to add
limit
steps to restrict traversal scopes but also provides vertex-centric indexes. These approaches, however, cannot solve the 2nd problem. In fact, usage of vertex-centric indexes makes the 2nd problem worse because those indexes themselves are stored in the same partition with edges and properties of that particular vertex.What is the best solution? Neo4j among many other graph providers suggests re-modeling by creating proxy nodes. However, it is not clear how the proxy nodes can be "recognized" in queries. Usually, users have to resort to application-level logic.
JanusGraph has a feature called Vertex Cut which allows you to create "partitioned" vertices. A vertex that is partitioned will be stored in different partitions (may or may not be on different machines), and when traversal touches a partitioned vertex, JanusGraph is able to automatically merge results across partitions. This solves the memory pressure problem! Unfortunately, this approach itself has a few limitations:
cluster.max-partitions
. This value is FIXED and cannot be changed in the entire lifecycle of your graph. The default value is 32. This means a partitioned vertex will always have 32 representatives. This can be too large for some nodes and too small for others. In reality, a supernode can have 100k neighbors or 10 million neighbors. Making the number of partitions a constant is not a good solution in many scenarios.So far, data partitioning seems a viable solution, but the existing option in JanusGraph has many limitations. But what if we still "partition" the data, but do it in another way that allows users to decide how the data is partitioned? See this for my POC.
Beta Was this translation helpful? Give feedback.
All reactions