Grouping and Aggregation #70

ejoebstl · 2017-01-19T15:48:31Z

It would be awesome to support native grouping and aggregation within graphcool.

This foldable block contains an outdated proposal. Pleas see below for the new one

Given the following schema:

Student {
  year:int
  name:string
  exams {
    name: string
    grade: int
  }
}

I would like to propose to add the following pseudo-schema, like you did for the filter API.

Student {
  year:int
  name:string
  groupby_year {
     exams {
       // Aggregation of grade
       grade_max:int
       grade_min:int
       grade_avg:int
       // ...
     }
  }
  groupby_name {
     // same as above
  }
  exams {
    // ...
  }
}

Note that the above solution only groups by one field. Technically, it is possible to group by a list of fields. I don't see how this could be done nicely in GraphQL, except maybe passing a list of fields to the groupby function.

This issue might overlap with #56 for some use cases. Grouping is also mentioned in #40.

The text was updated successfully, but these errors were encountered:

Noitidart · 2017-02-13T15:07:56Z

May we get sum aggregate functionality. I need to use count and sum to get average of nodes please.

ejoebstl · 2017-02-19T01:54:53Z

The schema I proposed in my original post does not make that much sense. Main issue: We will have to issue the command that indicates aggregation above the model that is aggregated in the AST.

Let's consider again the model mentioned above:

Student {
  year:int
  major:string
  name:string
  exams {
    name: string
    grade: int
  }
}

A possible solution could look like this:

Aggregating by one field at top level

query getAvererageGradeByYear {
    aggregateStudents(byYear: true) {
       year
       aggregateExams {
          grade_avg
       }
    }
}

This would return the average grade over all students and all exams for each year.

Basic concept:

The parameters of aggregateStudent are optional and correspond to the fields in the Student model. In this case, the parameters would be byYear, byName, byMajor and byExam. If we set a parameter to true, we consider that field for grouping.
All nodes that are children of aggregated nodes have to be aggregations as well.

The result might look like this:

[{
  year: 2011, 
  exams: { grade_avg: 3 }
}, {
  year: 2012, 
  exams: { grade_avg: 1 }
}]

Aggregating by two fields at top level

query getAvererageGradeByYearAndSubject {
    aggregateStudents(byYear: true, bySubject: true) {
      year
      subject
       aggregateExams {
          grade_avg
       }
    }
}

This would return the average grade over all students and all exams for each year and each subject.

The result might look like this:

[{
  year: 2011, 
  major: "Computer Science",
  exams: { grade_avg: 3 }
}, {
  year: 2011, 
  major: "Biology",
  exams: { grade_avg: 2.5 }
}, {
  year: 2012, 
  major: "Computer Science",
  exams: { grade_avg: 1 }
}, {
  year: 2012, 
  major: "Biology",
  exams: { grade_avg: 4 }
}]

Aggregating by one field at lower level

query getAvererageGradeForStudent {
    allStudents {
      name
      aggregateExams {
          grade_avg
      }
   }
}

This would calculate the average grade for each student. We don't group the exams explicitly, but they are implicitly grouped by students.
Therefore, we could also write a similar query like this:

query getAvererageGradeForStudent {
    aggregateExams(byStudent: true) {
      grade_avg
      Student {
          name
      }
   }
}

Nested aggregation

query getAvererageGradeForExamPerYear {
    aggregateStudents(byYear: true) {
      year
      aggregateExams(byName: true) {
          grade_avg
      }
   }
}

This would return the average grade over all students for each exam for each year.

Further considerations

Something like aggregateStudents(by: [year, name, ...]) would be way nicer, but does not work with the current GraphQL spec. We could use a list of strings, but this will sacrifice type-safety. Therefore I preferred the solution above.
It might be possible to remove aggregated child nodes from the result and instead embed the aggregated field directly into the topmost aggregated node. This design decision is up to you.
Aggregation functions should, of course, support filter and ordering expressions.
Theoretically, it would be possible to add byX parameters to the existing query functions. However, I suggest an approach where the schema enforces at least some validity of the query, as above.
It is impossible to query a field which is neither grouped, nor aggregated. I don't see a way how we can enforce this using the schema though, since we cannot make the schema dependent on the parameters. An option would be to generate a aggregate call for each field combination, e.g. aggregateStudentByName, aggregateStudentByYear, aggreagateStudentByNameAndYear and so on. The downside is that this would add a lot of overhead to the schema.
Care has to be taken to select good aggregation functions for data types. sum, average, count are the bare minimum.

ejoebstl · 2017-02-27T20:36:31Z

Since we're using a nice tree-based data format, we are not forced to aggregate over relations if we don't want to.

query getAvererageGradeByYear {
    aggregateStudents(byYear: true) {
       year
       Exams {
          name
       }
    }
}

Could return something like

[{
  year: 2011, 
  exams: [
    { name: "Test 2" },
    { name: "Test 3" },
    { name: "Test 4" }
  ]
}, {
  year: 2012, 
  exams: [
    { name: "Test A" },
    { name: "Test B" },
    { name: "Test C" }
  ]
}]

Formally, that would be the same as aggregating the exams by joining them in a list.

kbrandwijk · 2017-06-30T22:17:11Z

I read the suggestion on aggregations, I think a standard 'aggregations' field on every Type is a lot easier to work with that all those type-name-specific fields mentioned above:

query {
    allMovies(filter: { ...}, groupBy: {...}) {
        aggregations {
             max { 
                  fieldname1
                  fieldname2
             }
             min { .... }
             avg { .... }
             count { optionalFields }
             sum { .... }
        }
    }
}

The aggregations can not be mixed with 'normal fields', but could be nested:

query {
    allMovies(group: { by: { releaseYear } }) {
        title
        actors {
            aggregations {
                count
            }
        }
        reviews {
            aggregations {
                 min { lowestRating: rating }
                 max { highestRating: rating }
            }
        }
    }
}

Also note the use of groupby, that is indeed a powerful combination with aggregation.

nephix · 2017-07-26T16:20:27Z

I'm also looking for this feature

jcarva · 2017-08-04T20:27:09Z

Looking forward to this feature

birkir · 2017-08-18T20:25:34Z

Pretty big blocker for me now, as this task is pretty heavy for the client.

query showtimesTodayByCinemas {
  allMovies(
    groupBy: {
      showtimes { cinema { id } }
    }
  ) {
    title
    showtimes {
      playingAt
    }
  }
}

I dont know about this

{
    "allMovies": [{
        "cinema": {
            "id": "..."
        },
        "groupBy": [{        // <-- would this be correct name from aggregation?
            "title": "The Hitman's Bodyguard",
            "showtimes": [{ "..." }]
        }]
    }]
}

ragnorc · 2017-08-30T14:02:59Z

+1

MJones180 · 2017-09-12T12:39:17Z

Highly needed, +1

gustavofulton · 2017-10-06T16:43:19Z

+1

antho1404 · 2017-11-06T15:22:16Z

Is there any plan to do this feature anytime soon ?

Also all those propositions of API are great but I just have one concern about the returned data. Do the clients like apollo handle this kind of data and know how to inspect the results or they just rely on object and array and in this case the group_by will just be a property of the object. If they don't handle that maybe one solution could be to do as many filter as needed and group everything into a batch query (this may seems really too much but I'm not not sure the group_by is something that graphql can manage easily)

Also, why not user the _allXXXMeta to get the aggregation data, that would make more sense to me because we don't actually want real data but just some meta informations.

If we don't take the group_by but use multiple query we would have something like that

query {
  _allMoviesMeta(filter: {
    releasedDate_gte: "2017-11-06",
    releasedDate_lt: "2017-11-07"
  }) {
    rating_avg
    rating_min
    rating_max
    count
  }
}

query {
  _allMoviesMeta(filter: {
    releasedDate_gte: "2017-11-07",
    releasedDate_lt: "2017-11-08"
  }) {
    rating_avg
    rating_min
    rating_max
    count
  }
}

...

This can easily be done by a script to generate all the queries and like that we just get some normal data and not nested that maybe (and it's a big maybe because I don't know) apollo and other client don't support. If so we could add the group by but I still think we should use the Meta instead of the actual model for the aggregated data

Would be happy to hear any feedback.

This feature is quite important for me, at least the aggregation, there is still the multi query workaround for the groupby so if this feature is not in the roadmap, I will be happy to try to implement it on the backend but it could be good to have a good API first

prestontighe · 2017-11-09T00:46:48Z

Is there a hack solution to this? I'm assuming the only way to do this is to create a custom function?

sorenbs · 2017-11-10T21:20:23Z

@antho1404 This item is fairly high on our roadmap. I would encourage everybody to experiment with the API design so we can have a broader discussion before implementation starts. I like the general approach in this proposal, but it has limitations as highlighted in several comments, so we might end up with a final design that is significantly different. Feel free to contribute more ideas and feature requests :-)

kbrandwijk · 2017-11-10T21:59:24Z

@sorenbs Is it possible to update the opening post with the lessons learned from the comments and the current proposal?

sorenbs · 2017-11-21T11:30:15Z

Thank you all for the discussion here. It has been very enlightening. GraphQL turns out to be a very nice language to retrieve aggregated data. I have created a final proposal taking into account all points raised here: #1312

marktani added area/api kind/feature labels Feb 15, 2017

marktani mentioned this issue Mar 2, 2017

Filter nodes based on related aggregations #110

Closed

marktani mentioned this issue Apr 13, 2017

OrderBy and Filter by _QueryMeta! #167

Closed

marktani added the stage/discussion label Apr 17, 2017

marktani added area/api area/extra and removed area/schema labels Sep 17, 2017

marktani removed stage/discussing labels Sep 26, 2017

sorenbs added component/framework/api and removed component/core labels Oct 8, 2017

kbrandwijk mentioned this issue Nov 14, 2017

Support filtering on aggregated values #1279

Closed

1 task

nikolasburk mentioned this issue Nov 15, 2017

[Prisma 1.0] Specifications #353

Closed

32 tasks

sorenbs mentioned this issue Nov 21, 2017

Group By and Aggregated Values #1312

Closed

sorenbs closed this as completed Nov 21, 2017

pantharshit00 added the kind/feature label Jan 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grouping and Aggregation #70

Grouping and Aggregation #70

ejoebstl commented Jan 19, 2017 •

edited

Loading

Noitidart commented Feb 13, 2017

ejoebstl commented Feb 19, 2017 •

edited

Loading

ejoebstl commented Feb 27, 2017

kbrandwijk commented Jun 30, 2017 •

edited

Loading

nephix commented Jul 26, 2017

jcarva commented Aug 4, 2017

birkir commented Aug 18, 2017

ragnorc commented Aug 30, 2017

MJones180 commented Sep 12, 2017

gustavofulton commented Oct 6, 2017

antho1404 commented Nov 6, 2017

prestontighe commented Nov 9, 2017 •

edited

Loading

sorenbs commented Nov 10, 2017

kbrandwijk commented Nov 10, 2017

sorenbs commented Nov 21, 2017

Grouping and Aggregation #70

Grouping and Aggregation #70

Comments

ejoebstl commented Jan 19, 2017 • edited Loading

Noitidart commented Feb 13, 2017

ejoebstl commented Feb 19, 2017 • edited Loading

ejoebstl commented Feb 27, 2017

kbrandwijk commented Jun 30, 2017 • edited Loading

nephix commented Jul 26, 2017

jcarva commented Aug 4, 2017

birkir commented Aug 18, 2017

ragnorc commented Aug 30, 2017

MJones180 commented Sep 12, 2017

gustavofulton commented Oct 6, 2017

antho1404 commented Nov 6, 2017

prestontighe commented Nov 9, 2017 • edited Loading

sorenbs commented Nov 10, 2017

kbrandwijk commented Nov 10, 2017

sorenbs commented Nov 21, 2017

ejoebstl commented Jan 19, 2017 •

edited

Loading

ejoebstl commented Feb 19, 2017 •

edited

Loading

kbrandwijk commented Jun 30, 2017 •

edited

Loading

prestontighe commented Nov 9, 2017 •

edited

Loading