Aggregations (write side) #5082

Merged: 28 commits from lutter/agg into master, Jan 23, 2024
Conversation

@lutter (Collaborator) commented Dec 15, 2023

This PR introduces aggregations (see docs/aggregations.md for details). It only implements the write side; that is, with this pull request we populate aggregation tables, but there is not yet a way to query aggregations.

Until we have a native Timestamp type, timestamps are represented as an Int8 holding seconds since the Unix epoch.
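As a sketch of that interim representation (the type and method names here are illustrative, not the PR's actual code), a timestamp stored in an Int8 column is just an `i64` of whole seconds since the Unix epoch:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Illustrative stand-in for the interim timestamp representation:
/// an Int8 (i64) holding whole seconds since the Unix epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockTime(i64);

impl BlockTime {
    /// Build from a SystemTime, truncating to whole seconds.
    fn from_system_time(t: SystemTime) -> Self {
        let secs = t
            .duration_since(UNIX_EPOCH)
            .expect("time before the Unix epoch")
            .as_secs() as i64;
        BlockTime(secs)
    }

    fn as_secs(&self) -> i64 {
        self.0
    }
}

fn main() {
    // 2024-01-23T00:00:00Z is 1_705_968_000 seconds after the epoch.
    let t = BlockTime(1_705_968_000);
    assert_eq!(t.as_secs(), 1_705_968_000);
    let now = BlockTime::from_system_time(SystemTime::now());
    assert!(now.as_secs() > 1_705_968_000);
}
```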

An example subgraph that aggregates Ethereum block numbers can be found here. To use it, graph-node must be run with `GRAPH_MAX_SPEC_VERSION="0.1.1"`. To deploy it, graph-cli needs to be patched:

  • Check out the graph-tooling repo, using the lutter/agg-minimal branch (this PR)
  • Run `pnpm install && pnpm build` in the graph-tooling checkout
  • Run `yarn install && pnpm link <path to graph-tooling/packages/cli>` in the checkout of the subgraph
  • Deploy the subgraph against a local graph-node

```diff
@@ -57,6 +57,10 @@ impl blockchain::Block for Block {
     fn parent_ptr(&self) -> Option<BlockPtr> {
         None
     }

     fn timestamp(&self) -> BlockTime {
         BlockTime::NONE
```
A contributor commented:

Should this type be called something like MaybeBlockTime, just to indicate it may be none even though it's not an Option?

I guess the substreams notion of blocks doesn't fit the aggregations model anyway.

@lutter (Collaborator, author) replied:

The NONE thing should really go away (maybe except for tests). I just put it there for the few cases where I wasn't sure what's happening (like substreams).
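A minimal sketch of the Option-based alternative being discussed (names are illustrative, not graph-node's actual trait): modeling "no timestamp" explicitly forces callers to handle the missing case instead of relying on a `BlockTime::NONE` sentinel.

```rust
/// Illustrative sketch: represent "no timestamp" with Option instead
/// of a BlockTime::NONE sentinel, so the absence is visible in types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockTime(i64); // seconds since the Unix epoch

trait Block {
    /// None for chains (e.g. substreams) that don't carry a usable
    /// block timestamp, rather than a magic sentinel value.
    fn timestamp(&self) -> Option<BlockTime>;
}

struct SubstreamsBlock;

impl Block for SubstreamsBlock {
    fn timestamp(&self) -> Option<BlockTime> {
        None
    }
}

fn main() {
    let b = SubstreamsBlock;
    // Callers must now decide what a missing timestamp means.
    assert_eq!(b.timestamp(), None);
    assert_eq!(b.timestamp().map(|t| t.0).unwrap_or(0), 0);
}
```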


**TODO** It might be necessary to allow `@aggregate` fields that are only
used for some intervals. We could allow that with syntax like
`@aggregate(fn: .., arg: .., interval: "day")`
A contributor commented:

Feels like more complicated aggregations could benefit from a mapping to do them? Essentially providing an `(x, x) -> x` function to generalise.

```rust
    Sum,
    Max,
    Min,
    Cnt,
```
@mangas (Contributor) commented Jan 12, 2024:

I think we don't lose much by calling this Count; this also reads funny for immature people like me :)

@lutter (Collaborator, author) replied:

Heh, yeah, not sure why I wrote that. Changed it.

```rust
pub fn aggregation<'a>(&self, schema: &'a InputSchema) -> &'a Aggregation {
    schema.inner.type_infos[self.aggregation]
        .aggregation()
        .expect("the aggregation source is an object types")
```
A contributor commented:

Suggested change:

```diff
-        .expect("the aggregation source is an object types")
+        .expect("the aggregation source is an object type")
```

```rust
        .expect("the aggregation source is an object types")
}

pub fn agg_type(&self, schema: &InputSchema) -> EntityType {
```
@mangas (Contributor) commented Jan 12, 2024:

I think `aggregation_type` is more readable.


```rust
/// The field needed for the finalised aggregation for hourly/daily
/// values
fn as_agg_field(&self) -> Field {
```
A contributor commented:

same here

```rust
impl Aggregation {
    pub fn new(schema: &Schema, pool: &AtomPool, agg_type: &s::ObjectType) -> Self {
        let name = pool.lookup(&agg_type.name).unwrap();
        let id_type = IdType::try_from(agg_type).expect("validation caught any issues here");
```
A contributor commented:

Should this return an error instead?

@lutter (Collaborator, author) replied:

No, we should never get here, since validations reject a subgraph where the type of id would fail this conversion. The problem is that we first do validations and then extract things from the validated AST. It would be much better to do the extraction at the same time as validation, but that's a pretty big change that I didn't want to also shoehorn into this PR.

A big part of this and a couple of previous PRs is to isolate these kinds of conversions by only doing them while we construct the InputSchema; previously, a lot of this happened everywhere the AST was used, since there were no internal data structures to represent our idea of schema information.


```rust
impl Aggregation {
    pub fn new(schema: &Schema, pool: &AtomPool, agg_type: &s::ObjectType) -> Self {
        let name = pool.lookup(&agg_type.name).unwrap();
```
A contributor commented:

I think these cases could benefit from a `lookup_unchecked` or something that does the unwrap. There's no scenario where this lookup can fail? I guess the ObjectType parsing probably adds it here, but it would probably be worth adding a comment.

There are a couple of similar cases around.

@lutter (Collaborator, author) replied:

Any failure here is a programming error, since `agg_type.name` should have been interned already. All a `lookup_unchecked` could do is `lookup(..).unwrap()` anyway, but the knowledge that this lookup can't fail (except for genuine bugs) resides here, in the usage of the pool.
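A minimal sketch of the interning invariant being described (hypothetical types, not graph-node's real `AtomPool`): once a name has been interned, a later lookup of that name cannot fail, so an unwrap at the use site only guards against a genuine bug.

```rust
use std::collections::HashMap;

/// Illustrative string-interning pool: intern returns a stable atom id,
/// and lookup of a previously interned string always succeeds.
#[derive(Default)]
struct AtomPool {
    atoms: HashMap<String, usize>,
}

impl AtomPool {
    fn intern(&mut self, s: &str) -> usize {
        let next = self.atoms.len();
        *self.atoms.entry(s.to_string()).or_insert(next)
    }

    fn lookup(&self, s: &str) -> Option<usize> {
        self.atoms.get(s).copied()
    }
}

fn main() {
    let mut pool = AtomPool::default();
    let atom = pool.intern("TokenStats");
    // Once interned, a lookup failure can only be a programming error,
    // which is why an unwrap at the use site is acceptable.
    assert_eq!(pool.lookup("TokenStats"), Some(atom));
    assert_eq!(pool.lookup("NeverInterned"), None);
}
```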

```rust
// This is not a great idea: we really always need a timestamp; if
// substreams doesn't give us one, we use a fixed one which will
// lead to all kinds of strange behavior
let timestamp = timestamp
```
A contributor commented:

IIRC the clock can be assumed to always be there, so it might be worth adding a log or a metric in case we ever need to do this.

@lutter (Collaborator, author) replied:

Since it should always be present, I just made a missing timestamp an error (which gets rid of one of the iffy uses of BlockTime::NONE).
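A sketch of the change described (hypothetical names, not the PR's actual code): turn a missing substreams clock into an error instead of substituting a fixed sentinel value.

```rust
/// Illustrative sketch of treating a missing substreams clock as an
/// error rather than falling back to a fixed BlockTime::NONE.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockTime(i64); // seconds since the Unix epoch

fn block_time(timestamp: Option<i64>) -> Result<BlockTime, String> {
    timestamp
        .map(BlockTime)
        .ok_or_else(|| "substreams block is missing its clock/timestamp".to_string())
}

fn main() {
    // Present clock: pass the value through.
    assert_eq!(block_time(Some(1_705_968_000)), Ok(BlockTime(1_705_968_000)));
    // Missing clock: surface an error instead of a sentinel.
    assert!(block_time(None).is_err());
}
```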

For timeseries, users cannot set the timestamp. Instead, the time for the
current block is used.
@lutter merged commit c3f4ec4 into master on Jan 23, 2024; 7 checks passed.
@lutter deleted the lutter/agg branch on January 23, 2024.