-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(Query): Add random keyword in DQL #7693
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahsanbarkati -- can you work on the comment I have.
Thanks @Samyak2 for the PR. This is a good and clean PR, nicely written. @ahsanbarkati would help fix up the filter case I'm talking about.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion (waiting on @Samyak2)
query/query.go, line 2422 at r1 (raw file):
// TODO: error handling here func (sg *SubGraph) applyRandom(ctx context.Context) error { sg.updateUidMatrix()
This should be done in the same way as we do first: N. So, if we have a filter, then we need to apply the filter first, and then choose N random nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome @Samyak2. I have a few comments. Also, can we please add some tests? Do let me know if you need some help.
query/query.go
Outdated
@@ -2395,6 +2414,34 @@ func ProcessGraph(ctx context.Context, sg, parent *SubGraph, rch chan error) { | |||
rch <- childErr | |||
} | |||
|
|||
// applies "random" to lists inside uidMatrix | |||
// Params.Random number of nodes are selected | |||
// duplicates are not removed i.e., random selection by replacement is done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not have duplicate results. I'd suggest having some map to keep track of what all we have already picked to randomly pick n
unique results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll do this, along with the filters application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to do anything for filters. You are already doing the random selection after filter application and pagination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay. That explains why I couldn't figure out how to do it by looking at pagination xD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed a commit fixing these.
To remove duplicates, I had to:
- introduce a new struct
UidKey
to store the index of a UID in the matrix - calculate total number of nodes and check if required number of nodes is less than that. If it's not, return all the nodes.
Because of the condition if sg.Params.Random >= totalNodes
at line 2440, when the required number of nodes (sg.Params.Random
) is equal to the total number of nodes, all of the nodes are returned without any processing. I did this thinking of it as an optimization (because there will be many overlaps in random numbers when only a few nodes are left to be selected). But I just realized that this will make it so that the order is not randomized when sg.Params.Random == totalNodes
. Should I fix this to do the random selection even when all the nodes have to be selected?
…e random_ To remove duplicates, I had to: - introduce a new struct `UidKey` to store the index of a UID in the matrix - calculate total number of nodes and check if required number of nodes is less than that. If it's not, return all the nodes. - one caveat is that the order is not randomized when the requested number is equal to total number of nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 4 unresolved discussions (waiting on @ahsanbarkati and @Samyak2)
query/query.go, line 2432 at r2 (raw file):
// calculate total number of nodes totalNodes := 0
Had an initial look and I have some comments.
- I don't think we have to worry about duplicates here because a uid list within a uid matrix doesn't have any duplicates. Yes there could be duplicate uid within two different uids lists but we dont have to worry about that.
- @ahsanbarkati we should add some tests for this feature and how it works at root and at a child level.
- I reckon we can simplify this to just truncate all the uid lists with random elements in place instead of creating a new list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 4 unresolved discussions (waiting on @ahsanbarkati, @pawanrawal, and @Samyak2)
query/query.go, line 2432 at r2 (raw file):
Previously, pawanrawal (Pawan Rawal) wrote…
Had an initial look and I have some comments.
- I don't think we have to worry about duplicates here because a uid list within a uid matrix doesn't have any duplicates. Yes there could be duplicate uid within two different uids lists but we dont have to worry about that.
- @ahsanbarkati we should add some tests for this feature and how it works at root and at a child level.
- I reckon we can simplify this to just truncate all the uid lists with random elements in place instead of creating a new list.
@pawanrawal In the current approach, the way random nodes are selected is by repeatedly choosing a random index. So, even if there is just one uid list in the uidMatrix and it has all unique values, we'll end up with duplicates. Is that something acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there is one issue with your approach. Suppose the query is for k random nodes, then you are selecting k nodes out of the whole UidMatrix. This is not the right thing to do.
You need to pick k random uids for each UidList in the UidMatrix. To make it clear, let me give you an example:
Suppose you run this query:
{
q2(func: has(name), random:5) {
name
uid
friend(random: 2) {
name
}
}
}
In the first go, we fetch all the Uids having the name predicate. In this case the UidMatrix will contain only one UidList. In the next iteration, when we'll fetch stuff for the children, for each uid in the UidList of uids containing name predicate, we'll get another uidList (list of Uids containing friends for the given UID).
So, you'll need to apply this random selection on each list rather than on complete UIdMatrix.
Secondly, as suggested by @pawanrawal, within a UidList, in order to select k random Uids. We can select a window of length k with random start index. We know that this will reduce the randomness but since we are doing this selection on Uids, it should be fine.
We'd love to merge your changes once this issue is addressed.
Also, we'd need some tests. It'd be great if you add some tests otherwise I'll all tests after merging your PR.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @ahsanbarkati, @pawanrawal, and @Samyak2)
Are you still working on this PR @Samyak2? If not, then I can merge your changes and then I can do the fixes in next PR. |
Yes, I'm still working on it. Sorry, I couldn't do it in the last week because of the end sem workload from college 😄. |
Awesome! Thanks @Samyak2. |
the random uids are now selected from each uid list instead of the uid matrix. This simplifies the process.
@ahsanbarkati thank you for the explanation, it makes a lot more sense now. I have implemented your suggestions in the latest commit:
I will look at adding tests for these now. |
I had a few issues while running the tests:
Output of `./t`
Output of `./t -s -p "query"`
I tried rebasing on master, the error persists. |
Awesome @Samyak2. Thanks for the nice work. Your changes looks good to me. I am just concerned if there will be performance issues by using Regarding the test, it seems like your docker is unable to spin up the containers. Can you try running Other thing to look at is by running By the way, I see that the CI has run fine. |
I am merging this PR, so that we can do further development/testing of the feature. Feel free to raise another PR with the tests. :) |
Okay, I will try debugging this.
Thank you! 😄 |
@ahsanbarkati I see you have added the tests in #7758! Are there any other tests to add here? If not, I could try fixing the performance issues by not using Also, is there a timeline on when this feature will be released? |
This change implements the `random` function argument. If an argument `random: k` is provided to a dql function then random k results will be returned. For example: `func(has(name), random:5)` will choose 5 random nodes having the name predicate.
Fixes: https://discuss.dgraph.io/t/pick-random-node/10815
Changes
random: <number of nodes>
Issues/future work
applyRandom
functionThis is my first PR here (and I'm new to Go), there will be a lot more issues I'm not aware of.
Example
I have only tested this on a basic example from Ratel. This will need a lot more testing.
Mutation
Query
Result
This change is