Is anyone using the Hbase Store? #2367

Closed
GCHQDev404 opened this issue Dec 18, 2020 · 5 comments
Labels
question Specific query about part of the codebase

Comments

@GCHQDev404
Contributor

We are looking to do a number of version upgrades in Gaffer 2.0, including moving to Hadoop 3. HBase does not support Hadoop 3 out of the box (it needs to be built from source using a profile). If someone is using the HBase Store, we can take steps to include it in version 2. However, if no one is using it, we would benefit from removing support for the HBase Store until HBase releases a compatible version.

GCHQDev404 added the question label Dec 18, 2020
@rwer81

rwer81 commented Jan 4, 2021

We use HGraphDB for our graph, but we have some problems with it (performance, no community, etc.). It uses HBase as its graph store, so we have gained experience with HBase, and we use HBase 2.1.
We are currently testing Gaffer to decide whether it is the right tool for us. If we decide to use it in production and migrate from HGraphDB to Gaffer, we may change the graph store to Accumulo. But we are not sure it can handle our critical issues, which depend on our business logic and data.

@d47853
Member

d47853 commented Jan 4, 2021

Thanks @rwer81

Let us know how you get on with your experiment. We have found Accumulo to be a highly performant store, especially when leveraging the iterators for business logic. We've got a Docker project which might be of use to you when trying it out.

@rwer81

rwer81 commented Jan 4, 2021

Thanks for your reply. The points below may be off topic, but I think they are useful given the scope of our conversation.

Our special cases:
1- We have 1PB of linked data and it grows by about 1TB daily. The store has to handle this traffic.
2- There are some super-vertices that may have more than 1 million edges, so we limit edges when traversing.
3- There are a lot of DML operations on the data. These operations happen in real time.
4- 90% of our data consists of the same edge, so edge sharding/distribution is important.
5- Generally, we create the IDs of edges manually. We use these IDs in upsert operations. When inserting data via bulk loading, HBase overwrites keys, so we don't have to check whether the key exists or not. This method improves loading performance in some specific cases.
6- We don't store vertex properties in the graph (for now, they are stored in Elasticsearch). For that reason, we are focused more on traversals in the graph. Some cases require us to execute traversals of depth 4.
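
To make points 2, 5 and 6 concrete, here is a minimal sketch in plain Python (not Gaffer, HBase, or HGraphDB code). It assumes the graph is an in-memory adjacency dictionary and that the store behaves like a key-value map; the function names, the `max_edges_per_vertex` parameter, and the `source|type|destination` ID scheme are all hypothetical and only illustrate the ideas of a capped traversal and an overwrite-style upsert.

```python
from collections import deque


def limited_traversal(adjacency, start, max_depth=4, max_edges_per_vertex=1000):
    """Breadth-first traversal that caps both depth and per-vertex fan-out."""
    visited = {start: 0}  # vertex -> depth at which it was first reached
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        depth = visited[vertex]
        if depth == max_depth:
            continue  # do not expand vertices sitting at the depth limit
        # Follow only the first max_edges_per_vertex edges of a super-vertex.
        for neighbour in adjacency.get(vertex, [])[:max_edges_per_vertex]:
            if neighbour not in visited:
                visited[neighbour] = depth + 1
                queue.append(neighbour)
    return visited


def upsert_edge(store, source, destination, edge_type):
    """Write an edge under a deterministic ID so a repeated insert overwrites it."""
    edge_id = f"{source}|{edge_type}|{destination}"  # hypothetical ID scheme
    store[edge_id] = {"src": source, "dst": destination, "type": edge_type}
    return edge_id
```

Because the edge ID is derived from the edge itself, writing the same edge twice leaves a single entry, so no existence check is needed before the write.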

It would be greatly appreciated if we could use some of your insight and help during our experiments.

Thanks.

@d47853
Member

d47853 commented Jan 4, 2021

Your scale shouldn't be a problem, as Accumulo is capable of scaling to that kind of size. You could use something like Spark or AddElementsFromHDFS to deal with your data ingest. To limit your edge traversal, we'd recommend using HyperLogLogSketches as a property on an entity, as it can be easily generated and aggregated. Gaffer offers its sketches library to help with this. The bulk import capability in Accumulo should work similarly to HBase, avoiding the query-update-put sequence.
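
As an illustration of why a HyperLogLog sketch suits this use, the following is a small standalone Python sketch of the idea. It is not Gaffer's sketches library or its API: the `HyperLogLog` class, the `should_expand` helper, the integer neighbour IDs, and the splitmix64-style hash are all assumptions made for the example. It shows the two properties mentioned above: the sketch is cheap to generate per vertex, and two sketches aggregate by taking the register-wise maximum.

```python
import math

MASK = (1 << 64) - 1


class HyperLogLog:
    """Approximate distinct counter; precision p gives m = 2**p registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value):
        # splitmix64-style mixing of an integer ID into 64 well-spread bits.
        x = (value + 0x9E3779B97F4A7C15) & MASK
        x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK
        x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
        return x ^ (x >> 31)

    def add(self, value):
        h = self._hash64(value)
        index = h >> (64 - self.p)             # top p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits give the rank
        # Rank is the 1-based position of the leftmost 1-bit in the remainder.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

    def merge(self, other):
        """Aggregate two sketches of equal precision by register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


def should_expand(degree_sketch, max_degree):
    """Hypothetical traversal guard: skip vertices whose estimated degree is too high."""
    return degree_sketch.estimate() <= max_degree
```

With a sketch like this stored per vertex, a traversal could consult the estimated degree before fetching any edges and skip or sample super-vertices. The merge step is what makes the property aggregatable: sketches built from separate batches of the same vertex's edges combine into the sketch of the full edge set.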

@rwer81

rwer81 commented Jan 4, 2021

Thanks for your explanation.
You have encouraged me to use Gaffer. I will continue testing.
In conclusion, if we decide to use Gaffer, we will presumably use Accumulo instead of HBase.
