-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Triplet sampling in NeighborLoader
#6004
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6004 +/- ##
==========================================
- Coverage 84.73% 84.66% -0.07%
==========================================
Files 361 361
Lines 20215 20185 -30
==========================================
- Hits 17129 17090 -39
- Misses 3086 3095 +9
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some initial comments.
Incorporated your review. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some final comments.
assert int(batch['paper'].batch.max()) + 1 == 32 + 16 | ||
# Check if each seed edge has a different source and dstination node: | ||
assert batch['paper'].num_nodes >= 2 * (32 + 16) | ||
assert int(batch['paper'].batch.max()) + 1 == 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this 64? Won't the negative samples also have a disjoint batch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it makes more sense to group the negatives and positives to its own batch, in particular because they now share time information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. But it seems weird that we now have more than 1 seed edge with the same batch number. Even though their computation graphs are in fact disjoint. But we can debate this later.
@wsad1 Updated based on your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks.
assert int(batch['paper'].batch.max()) + 1 == 32 + 16 | ||
# Check if each seed edge has a different source and dstination node: | ||
assert batch['paper'].num_nodes >= 2 * (32 + 16) | ||
assert int(batch['paper'].batch.max()) + 1 == 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. But it seems weird that we now have more than 1 seed edge with the same batch number. Even though their computation graphs are in fact disjoint. But we can debate this later.
This PR implements triplet sampling in link-level tasks, where we want only want to sample a negative destination node. This PR got quite large, especially because I stumbled upon some weird things in the current link-level pipeline on temporal graphs, e.g., we don't want to sample nodes that do not yet exist for a given edge label timestamp. When `neg_sampling="triplet"`, then the returned `Data` object holds: * a `src_index` which denotes the index from left-hand-side embeddings * a `dst_pos_index` which denotes the index from positive right-hand-side embeddings * a `dst_neg_index` which denotes the index from negative right-hand-side embeddings Overall, it introduces * `NegativeSamplingConfig` with `strategy=binary/triplet` * Deprecates `neg_sampling_ratio` in favor of `neg_sampling` * Introduces a new method called `neg_sampling` that respects the timestamp of nodes while sampling * Refactors and extends `sample_from_edges` => Overall, this function got quite large, and I will consider separating them into multiple functions in the future.
This PR implements triplet sampling in link-level tasks, where we want only want to sample a negative destination node.
This PR got quite large, especially because I stumbled upon some weird things in the current link-level pipeline on temporal graphs, e.g., we don't want to sample nodes that do not yet exist for a given edge label timestamp.
When
neg_sampling="triplet"
, then the returnedData
object holds:src_index
which denotes the index from left-hand-side embeddingsdst_pos_index
which denotes the index from positive right-hand-side embeddingsdst_neg_index
which denotes the index from negative right-hand-side embeddingsOverall, it introduces
NegativeSamplingConfig
withstrategy=binary/triplet
neg_sampling_ratio
in favor ofneg_sampling
neg_sampling
that respects the timestamp of nodes while samplingsample_from_edges
=> Overall, this function got quite large, and I will consider separating them into multiple functions in the future.