Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Relationship multiplicity not working #4334

Open
ubmarco opened this issue Oct 2, 2024 · 6 comments
Open

Bug: Relationship multiplicity not working #4334

ubmarco opened this issue Oct 2, 2024 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@ubmarco
Copy link

ubmarco commented Oct 2, 2024

Kùzu version

v0.6.0

What operating system are you using?

Arch Linux x64

What happened?

I feel the relationship constraints are not working as documented in https://docs.kuzudb.com/cypher/data-definition/create-table/#relationship-multiplicities.

This Cypher executes successfully on an empty in-memory DB:

CREATE NODE TABLE User(id STRING, PRIMARY KEY (id));
CREATE NODE TABLE City(id STRING, PRIMARY KEY (id));
CREATE REL TABLE LivesIn(FROM User TO City, ONE_ONE);
CREATE (u:User { id: 'user1' });
CREATE (u:User { id: 'user2' });
CREATE (c:City { id: 'city1' });
CREATE (c:City { id: 'city2' });
MATCH (u:User), (c:City) WHERE u.id = 'user1' AND c.id = 'city1' CREATE (u)-[e:LivesIn]->(c);
MATCH (u:User), (c:City) WHERE u.id = 'user2' AND c.id = 'city1' CREATE (u)-[e:LivesIn]->(c);
MATCH (u:User), (c:City) WHERE u.id = 'user1' AND c.id = 'city2' CREATE (u)-[e:LivesIn]->(c);
MATCH (u)-[e]->(c) RETURN *;

This is the result of the last MATCH statement in the Explorer:
image

1 User Lives in 2 Cities while 1 City has 2 LivesIn Users. The point of stating ONE_ONE means that the relation must be 1 in both directions.
Is this a bug or am I doing something wrong?

Are there known steps to reproduce?

Start Explorer with an empty DB

docker run -p 8000:8000 --rm -e KUZU_IN_MEMORY=true kuzudb/explorer:latest

Run the queries

```cypher
CREATE NODE TABLE User(id STRING, PRIMARY KEY (id));
CREATE NODE TABLE City(id STRING, PRIMARY KEY (id));
CREATE REL TABLE LivesIn(FROM User TO City, ONE_ONE);
CREATE (u:User { id: 'user1' });
CREATE (u:User { id: 'user2' });
CREATE (c:City { id: 'city1' });
CREATE (c:City { id: 'city2' });
MATCH (u:User), (c:City) WHERE u.id = 'user1' AND c.id = 'city1' CREATE (u)-[e:LivesIn]->(c);
MATCH (u:User), (c:City) WHERE u.id = 'user2' AND c.id = 'city1' CREATE (u)-[e:LivesIn]->(c);
MATCH (u:User), (c:City) WHERE u.id = 'user1' AND c.id = 'city2' CREATE (u)-[e:LivesIn]->(c);
MATCH (u)-[e]->(c) RETURN *;

Kuzu should complain about the relation user2 to city1 and user1 to city2 as it violates the constraint.

@ubmarco ubmarco added the bug Something isn't working label Oct 2, 2024
@ray6080
Copy link
Contributor

ray6080 commented Oct 3, 2024

Hi @ubmarco yeah this is indeed a known issue we haven't fixed yet. thanks for pointing it out! Will get this fixed as soon as we can.

Meanwhile let me try to gather some feedback on this feature here: is your use case relying on the multiplicity constraint? If so, is limiting the constraint as either one or many good enough? Thanks!

@ubmarco
Copy link
Author

ubmarco commented Oct 4, 2024

Hi @ray6080, thanks for the feedback.

For my use case, the existing constraints

ONE_ONE
ONE_MANY
MANY_ONE
MANY_MANY

are enough for now.

But I can certainly think about situations in which more complex constraints make total sense.
E.g. a user can only be registered in max. 2 cities. Or a user might only want to live in a city where the population is less than 100k. This goes to application business logic and I'm not sure whether it even makes sense to define it in the DB. Basically it would mean, before altering a REL TABLE make sure certain Cypher queries are still valid. Having this on DB side means transactional safety as the query is only committed if the constraint queries after inserting return no elements.

Another point that I have in mind are REL TABLE GROUPs. My experiments show that the relationship can only be given once for the whole group. In the following example

CREATE NODE TABLE User(id STRING, PRIMARY KEY (id));
CREATE NODE TABLE Dog(id STRING, PRIMARY KEY (id));
CREATE NODE TABLE City(id STRING, PRIMARY KEY (id));
CREATE REL TABLE GROUP Knows (FROM User To City, FROM Dog to City, MANY_MANY);

it means I cannot reuse a relationship label in case the multiplicity differs.

It would be good if I can specify the multiplicity per relation, e.g.

CREATE NODE TABLE User(id STRING, PRIMARY KEY (id));
CREATE NODE TABLE Dog(id STRING, PRIMARY KEY (id));
CREATE NODE TABLE City(id STRING, PRIMARY KEY (id));
CREATE REL TABLE GROUP Knows (FROM User To City, MANY_MANY, FROM Dog to City, MANY_ONE);

So a User may know many cities but a Dog can only know one.
Not sure whether my syntax proposal makes sense to you :)

@ubmarco
Copy link
Author

ubmarco commented Oct 4, 2024

A bit unrelated, but here is a question towards the usage of Cypher for DDL. The Kuzu DDL seems not to be not part of the OpenCypher9 specification. Is this a custom language addition to Cypher?
I just find the DROP statement in the section Reserved for future use.

Just wondering whether there are standardized ideas around this topic. I see there is also the CONSTRAINT in the section Reserved for future use, so maybe a constraint definition concept outside of REL TABLE [GROUP] would make sense?

@ray6080 ray6080 self-assigned this Oct 7, 2024
@ray6080
Copy link
Contributor

ray6080 commented Oct 7, 2024

Hi @ubmarco , sorry for the late reply. It's great we can gather your feedback on this and thanks for your suggestions!

it means I cannot reuse a relationship label in case the multiplicity differs.
It would be good if I can specify the multiplicity per relation, e.g.

Yeah, this is indeed one limitation of rel table group now. What you suggest definitely can be an option to support. But I feel it's hard to keep the syntax succinct while allowing the semantic expression to be more flexible. As pointed out in your other comment, maybe defining the CONSTRAINT outside of the CREATE REL TABLE GROUP is a better option.

A bit unrelated, but here is a question towards the usage of Cypher for DDL. The Kuzu DDL seems not to be not part of the OpenCypher9 specification. Is this a custom language addition to Cypher?

Yes, DDL statements are not part of the openCypher standard, instead customized in Kuzu. Here is some brief explanation behind the motivation if you're interested.

I see there is also the CONSTRAINT in the section Reserved for future use, so maybe a constraint definition concept outside of REL TABLE [GROUP] would make sense?

This is a good idea. Indeed we had discussions within the team to rework the multiplicity check as a part of the more generic (CHECK) constraint in Cypher and SQL, but we were not confident if there are many users really care about this kind of constraints, which leads to lower priority on this feature.

@ubmarco
Copy link
Author

ubmarco commented Oct 8, 2024

I would certainly use constraints and multiplicity checks.

For me constraint can be split into 2 categories:

  1. Check at data ingest. This is for basic things like person must have an age and it must be positive.
  2. Check after ingest. This is for quality control like "a child links to a parent, but the parent's age is smaller than the child's". These checks should be running on user's request or maybe automatic when data changes, if it's fast.

I work in an environment in which data is basically schemaless as I don't know the structure of it. A node just links to itself. Then on the node I have a MAP(STRING, STRING) to define properties for the nodes.

I could make very good use of a flexible constraint checker on fields and links on already imported data.
If those are running on database level, it would be super fast. I would want to type check on (nested) node properties, check existence of fields and write a list of Cypher statements to check whether the "network checks" are met. They must return a list of primary node keys that violate the constraint.

In the RDF world, SHACL exists to do that and @prrao87 and Paco Nathan wrote a blog post about how to use it with Kuzu that I eagerly consumed.

Now I personally must say that RDF tooling is not good as it should be and with our amount of data, pyshacl runs for multiple minutes which is bad user experience. It also does not run on database level and puts load to the user machine.
To me RDF feels academic and users first have to understand all the concepts and terms around it. Cypher on the other hand is an easy starter.

@ray6080
Copy link
Contributor

ray6080 commented Oct 8, 2024

Hi @ubmarco thanks for providing more details on your use cases and thoughts. Yeah supporting more flexible constraint checks are very appealing to me, and constraints inside databases should be more performant and graceful compared to application level checks.

  1. Check after ingest. This is for quality control like "a child links to a parent, but the parent's age is smaller than the child's". These checks should be running on user's request or maybe automatic when data changes, if it's fast.

This is an interesting scenario, not sure how the syntax would be, or maybe it should be combined with UDF 🤔. But it actually looks like a quite fancy feature to me, which requires some flexible and efficient CHECK constraint support if it's done automatically on every data change.

Let me bring this up in the team, meanwhile we'll open an issue on this as a feature to gather more community feedback. We can start with basic ones, like non null, positive value, in/out degree checks, etc, though it's hard to promise a timeline for this, as the team is small and everyone is current busy with some important features right now. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants