Proxy node for super node #2717

li-boxuan · 2021-07-23T01:39:12Z

li-boxuan
Jul 23, 2021
Maintainer

Supernodes are a well-known problem in graph databases. A supernode is a node with a very large number of neighbors (and thus edges), typically hundreds of thousands or more. This causes two problems:

1. Very slow traversal when a supernode is reached. Because of the large number of possible paths, traversals involving supernodes usually become very slow. This is known as the "fan-out" problem and is widely discussed. In the Gremlin language, one can explicitly add a limit step (and, of course, implement a custom traversal strategy that inserts limit steps, depending on the use case). Some other graph databases have their own solutions or workarounds. For example, Nebula Graph provides an option, max_edge_returned_per_vertex, which essentially does the same thing as the limit step.
2. High memory usage. Depending on how the data is stored, this can be a problem for some graph databases, including JanusGraph. Essentially, any graph database that stores the adjacency list of a node in a single partition suffers from this problem. In JanusGraph, because all properties and edges of a vertex are stored as an adjacency list in the same partition (see data model here), a supernode can take up a lot of memory. This not only degrades performance but also increases the chance of running out of memory (OOM) when you attempt to load the whole node. Even worse, when your supernode grows so large that it no longer fits on a single machine, you have to migrate the data to a more capable machine.
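To make the effect of a limit step concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not JanusGraph, TinkerPop, or Nebula Graph code; the class name `FanOutLimitSketch` and its methods are made up for illustration. It only shows how capping the edges returned per vertex bounds the fan-out when a traversal hits a supernode.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical names): adjacency lists in a map, expanded one hop at a time.
class FanOutLimitSketch {

    // Adds a directed edge from -> to.
    static void addEdge(Map<Integer, List<Integer>> adj, int from, int to) {
        adj.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Expands every frontier vertex by one hop, returning at most
    // maxEdgesPerVertex out-neighbors for each of them.
    static List<Integer> expand(Map<Integer, List<Integer>> adj,
                                List<Integer> frontier,
                                int maxEdgesPerVertex) {
        List<Integer> next = new ArrayList<>();
        for (int v : frontier) {
            List<Integer> neighbors = adj.getOrDefault(v, List.of());
            int n = Math.min(neighbors.size(), maxEdgesPerVertex);
            next.addAll(neighbors.subList(0, n));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        for (int i = 1; i <= 100_000; i++) {
            addEdge(adj, 0, i);          // vertex 0 is the supernode
        }
        addEdge(adj, 100_001, 0);        // an ordinary vertex pointing at it

        List<Integer> hop1 = expand(adj, List.of(100_001), 1_000);
        List<Integer> limited = expand(adj, hop1, 1_000);
        List<Integer> unlimited = expand(adj, hop1, Integer.MAX_VALUE);

        System.out.println(limited.size());    // 1000
        System.out.println(unlimited.size());  // 100000
    }
}
```

Note that the limit only bounds how much work the traversal does; in this model the supernode's full adjacency list still sits in one place, which is the second problem.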

To address the first problem, JanusGraph not only lets you add limit steps to restrict the traversal scope but also provides vertex-centric indexes. These approaches, however, cannot solve the second problem. In fact, using vertex-centric indexes makes the second problem worse, because the indexes themselves are stored in the same partition as the edges and properties of that vertex.
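The trade-off can be illustrated with a small plain-Java model. This is only a sketch under simplifying assumptions, not how JanusGraph lays out data: the class `VertexCentricIndexSketch`, the `time` sort key, and the "entries" count are hypothetical stand-ins. The sorted map plays the role of a vertex-centric index that lives next to the vertex's edges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of ONE vertex whose edges carry a "time" property (hypothetical names).
class VertexCentricIndexSketch {
    // Plain adjacency list: each entry is {time, neighborId}.
    private final List<long[]> edges = new ArrayList<>();
    // Stand-in for a vertex-centric index: neighbors sorted by time,
    // kept alongside the edges of the same vertex.
    private final TreeMap<Long, List<Long>> byTime = new TreeMap<>();

    void addEdge(long time, long neighborId) {
        edges.add(new long[] {time, neighborId});
        byTime.computeIfAbsent(time, k -> new ArrayList<>()).add(neighborId);
    }

    // Range lookup through the index; it never scans the full edge list.
    List<Long> neighborsBetween(long fromInclusive, long toExclusive) {
        List<Long> result = new ArrayList<>();
        for (List<Long> ids : byTime.subMap(fromInclusive, toExclusive).values()) {
            result.addAll(ids);
        }
        return result;
    }

    // Number of entries this vertex has to hold: edges plus index entries.
    int storedEntries() {
        int indexEntries = 0;
        for (List<Long> ids : byTime.values()) {
            indexEntries += ids.size();
        }
        return edges.size() + indexEntries;
    }

    public static void main(String[] args) {
        VertexCentricIndexSketch v = new VertexCentricIndexSketch();
        for (long i = 0; i < 100_000; i++) {
            v.addEdge(i % 1_000, i);
        }
        System.out.println(v.neighborsBetween(10, 12).size()); // 200
        System.out.println(v.storedEntries());                 // 200000
    }
}
```

In this model the lookup gets cheaper while the amount of data attached to the single vertex doubles, which mirrors the point that the index helps traversal speed but adds to the memory problem.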

What is the best solution? Neo4j, among many other graph database vendors, suggests re-modeling by creating proxy nodes. However, it is not clear how the proxy nodes can be "recognized" in queries. Usually, users have to resort to application-level logic.

JanusGraph has a feature called Vertex Cut which allows you to create "partitioned" vertices. A partitioned vertex is stored across different partitions (which may or may not be on different machines), and when a traversal touches a partitioned vertex, JanusGraph automatically merges the results across partitions. This solves the memory pressure problem! Unfortunately, this approach has a few limitations of its own:

Partitioning applies to vertex labels. This means that all vertices with a given label will be partitioned. In reality, it is uncommon for every vertex of a label to need partitioning. For example, in a social network, you probably only want to partition "users" who are celebrities, but you might not know beforehand who will become a celebrity. If you partition all users, it wastes a great deal of resources and slows down your application. If you make "celebrity" a vertex label separate from "user", then you have to decide whether a new node should be classified as "user" or "celebrity", and this cannot be changed afterwards. Of course, you could make all new joiners ordinary "users" and migrate those who become popular to the "celebrity" label in your application logic, but this is very cumbersome.
The number of partitions is a constant defined by cluster.max-partitions. This value is FIXED and cannot be changed during the entire lifecycle of your graph. The default value is 32, which means a partitioned vertex will always have 32 representatives. This can be too many for some nodes and too few for others: in reality, a supernode may have 100k neighbors or 10 million neighbors. Making the number of partitions a constant is not a good solution in many scenarios.
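A quick back-of-the-envelope calculation shows why a fixed partition count fits few vertices well. The sketch below is plain Java arithmetic, not JanusGraph code; it assumes, for illustration only, that a partitioned vertex's edges are spread evenly across all partitions, and the class and method names are hypothetical.

```java
// Arithmetic sketch (hypothetical names): edges per partition for a fixed partition count.
class FixedPartitionSketch {

    // Edges each partition holds if edgeCount edges are spread evenly
    // (an assumption) over the given number of partitions, rounded up.
    static long edgesPerPartition(long edgeCount, int partitions) {
        return (edgeCount + partitions - 1) / partitions;
    }

    public static void main(String[] args) {
        int partitions = 32; // the default mentioned above
        long[] edgeCounts = {50, 100_000, 10_000_000};
        for (long edges : edgeCounts) {
            System.out.println(edges + " edges -> about "
                    + edgesPerPartition(edges, partitions) + " per partition");
        }
        // 50 edges -> about 2 per partition
        // 100000 edges -> about 3125 per partition
        // 10000000 edges -> about 312500 per partition
    }
}
```

Under that assumption, an ordinary vertex with 50 edges is split into 32 tiny pieces that must be merged on every read, while a vertex with 10 million edges still leaves hundreds of thousands of edges in each partition.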

So far, data partitioning seems a viable solution, but the existing option in JanusGraph has many limitations. What if we still "partition" the data, but in a way that lets users decide how the data is partitioned? See this for my POC.

rngcntr · 2021-08-11T08:48:17Z

rngcntr
Aug 11, 2021
Collaborator

Nice write-up; you sum up the problem in a comprehensive way that makes it easy to wrap your head around. I just wanted to let you know that I am currently working on an approach to avoid traversing supernodes by applying the concept of materialized views to graph databases. However, this research is still in the early stages and, of course, it will only cover a subset of use cases.

2 replies

li-boxuan Aug 11, 2021
Maintainer Author

That's great! Without your comment, I would have forgotten about this thread. I'll complete the write-up of my approach in a new comment below.

rngcntr Aug 11, 2021
Collaborator

I only noticed this thread when I was catching up with three weeks of dismissed GitHub notifications. I would have commented much earlier if I had noticed it right away.

li-boxuan · 2021-08-11T09:49:16Z

li-boxuan
Aug 11, 2021
Maintainer Author

As mentioned in the opening comment, to address the limitations of the existing partitioning feature, we could let users make the "vertex cut" decisions explicitly.

POC is below:

li-boxuan@4bc46e3

It now works like this (complete code is available in the commit above):

mgmt.makePropertyKey("proxies").dataType(Long.class).cardinality(Cardinality.LIST).make();
mgmt.makeVertexLabel("proxy").setProxy().make();
mgmt.commit();

newTx();

Vertex v0 = graph.addVertex("vertexId", "v0");
Vertex v1 = graph.addVertex("vertexId", "v1");
Vertex v2 = graph.addVertex("vertexId", "v2");
Vertex v3a = graph.addVertex("vertexId", "v3a");
Vertex v3b = graph.addVertex("vertexId", "v3b");
Vertex v4 = graph.addVertex("vertexId", "v4");

v0.addEdge("normal-edge", v1);
v2.addEdge("normal-edge", v3a);
v1.addEdge("labelX", v4);

// Assume v1 now becomes a supernode and we decide to introduce proxy node(s). The previous edges are retained.
// Assume the application logic adds an edge from v1 to v2 with labelX, another edge from v1 to v3a with labelY,
// and another edge from v1 to v3b with labelY.
// Users need to connect v1 to v2/v3a/v3b via vProxy explicitly.
Vertex vProxy = graph.addVertex(T.label, "proxy", "canonicalId", v1.id());
vProxy.addEdge("labelX", v2, "runDate", "01");
vProxy.addEdge("labelY", v3a, "runDate", "02");
vProxy.addEdge("labelY", v3b, "runDate", "02");
v1.property("proxies", vProxy.id());
graph.tx().commit();

// Now we can do normal traversals and JanusGraph will handle the proxy!
assertEquals(4, graph.traversal().V(v1).out().count().next());

As you can see, this approach gives users the flexibility to decide when to create proxies and when an edge should be added to the proxy node.
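The behavior that the snippet's final assertion expects can be modeled in a few lines of plain Java. This is only a sketch of the semantics (out-neighbors of a vertex are its own edges plus the edges held by its proxies); it is not the POC's implementation, and `ProxyNodeSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical names) of proxy-aware neighbor resolution.
class ProxyNodeSketch {
    private final Map<String, List<String>> outEdges = new HashMap<>();
    private final Map<String, List<String>> proxiesOf = new HashMap<>();

    void addEdge(String from, String to) {
        outEdges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Registers proxy as a proxy node of the canonical vertex.
    void addProxy(String canonical, String proxy) {
        proxiesOf.computeIfAbsent(canonical, k -> new ArrayList<>()).add(proxy);
    }

    // Out-neighbors of v: its own edges followed by the edges of each proxy.
    List<String> out(String v) {
        List<String> result = new ArrayList<>(outEdges.getOrDefault(v, List.of()));
        for (String proxy : proxiesOf.getOrDefault(v, List.of())) {
            result.addAll(outEdges.getOrDefault(proxy, List.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        ProxyNodeSketch g = new ProxyNodeSketch();
        // Same shape as the example above.
        g.addEdge("v0", "v1");
        g.addEdge("v2", "v3a");
        g.addEdge("v1", "v4");
        g.addProxy("v1", "vProxy");
        g.addEdge("vProxy", "v2");
        g.addEdge("vProxy", "v3a");
        g.addEdge("vProxy", "v3b");

        System.out.println(g.out("v1").size()); // 4
    }
}
```

Because each proxy keeps its own edge list, the edges of a supernode no longer have to live in one place, and the caller decides how many proxies to create and which edges go to which proxy.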

9 replies

li-boxuan Apr 11, 2024
Maintainer Author

@dxtr-1-0 I never invested more time on this since I never heard any interest from users :(

Could you share why you are interested in this?

dxtr-1-0 Apr 16, 2024

Hi @li-boxuan, I am using JanusGraph to build a graph platform for my org, and we have a use case where there will be supernodes and vertex cuts won't be very efficient for the reasons you mentioned (only 10-15% of the nodes of a label are supernodes). I really like this approach, where storage is distributed selectively for supernodes without wasting resources on the entire vertex set for the label.

dxtr-1-0 Apr 16, 2024

I am curious how people are effectively solving the supernode problem in their respective projects/orgs. Given the obvious resource wastage of the vertex cut approach, I was hoping the community would already have tried and tested efficient solutions for this problem. :p
Moreover, what I could make out of the discussions is that there are traversal strategies that efficiently answer queries like <find the edge(s) between V1 and V2>. But what if V2 is not known and the traversal relies on some filter criteria? Vertex Centric indexes work fine for me but again for too many edges on a supernode, the row for the vertex (in this case V1) becomes way too large.

li-boxuan Apr 17, 2024
Maintainer Author

My feeling is:

Most people don't face super node problems.
Those who do hit this problem solve it in their application layer. In other words, they "remodel" the data.

Vertex Centric indexes work fine for me but again for too many edges on a supernode

Yeah, vertex-centric indexes aim to solve the problem from the performance side, but they add more pressure on memory/storage at the same time, ironically making the supernode problem more evident.

What I have drafted in my POC is an attempt to address the memory/storage problem. If you are interested, you are more than welcome to take over the POC and drive it forward. Alternatively, I can revisit it and create a PR (as long as there's no breaking change) when my schedule allows me to do so - but I would need your feedback.

dxtr-1-0 Apr 17, 2024

Yes, I would also be interested in taking it forward, since I feel that for bigger graphs/use cases, supernodes become an inevitable problem. However, I am not that familiar with the JanusGraph code yet, so I will have to go through it. It would be awesome if you could raise a PR whenever your schedule allows, so that it's easier to track and contribute.

li-boxuan · 2024-04-17T17:17:18Z

li-boxuan
Apr 17, 2024
Maintainer Author

Okay I’ll create a PR and we can collaborate on it, including review/feedback/code.

0 replies
