Skip to content
This repository has been archived by the owner on Sep 2, 2022. It is now read-only.

Grouping and Aggregation #70

Closed
ejoebstl opened this issue Jan 19, 2017 · 15 comments
Closed

Grouping and Aggregation #70

ejoebstl opened this issue Jan 19, 2017 · 15 comments

Comments

@ejoebstl
Copy link
Contributor

ejoebstl commented Jan 19, 2017

It would be awesome to support native grouping and aggregation within graphcool.

This foldable block contains an outdated proposal. Pleas see below for the new one

Given the following schema:

Student {
  year:int
  name:string
  exams {
    name: string
    grade: int
  }
}

I would like to propose to add the following pseudo-schema, like you did for the filter API.

Student {
  year:int
  name:string
  groupby_year {
     exams {
       // Aggregation of grade
       grade_max:int
       grade_min:int
       grade_avg:int
       // ...
     }
  }
  groupby_name {
     // same as above
  }
  exams {
    // ...
  }
}

Note that the above solution only groups by one field. Technically, it is possible to group by a list of fields. I don't see how this could be done nicely in GraphQL, except maybe passing a list of fields to the groupby function.

This issue might overlap with #56 for some use cases. Grouping is also mentioned in #40.

@Noitidart
Copy link

May we get sum aggregate functionality. I need to use count and sum to get average of nodes please.

@ejoebstl
Copy link
Contributor Author

ejoebstl commented Feb 19, 2017

The schema I proposed in my original post does not make that much sense. Main issue: We will have to issue the command that indicates aggregation above the model that is aggregated in the AST.

Let's consider again the model mentioned above:

Student {
  year:int
  major:string
  name:string
  exams {
    name: string
    grade: int
  }
}

A possible solution could look like this:

Aggregating by one field at top level

query getAvererageGradeByYear {
    aggregateStudents(byYear: true) {
       year
       aggregateExams {
          grade_avg
       }
    }
}
  • This would return the average grade over all students and all exams for each year.

Basic concept:

  • The parameters of aggregateStudent are optional and correspond to the fields in the Student model. In this case, the parameters would be byYear, byName, byMajor and byExam. If we set a parameter to true, we consider that field for grouping.
  • All nodes that are children of aggregated nodes have to be aggregations as well.

The result might look like this:

[{
  year: 2011, 
  exams: { grade_avg: 3 }
}, {
  year: 2012, 
  exams: { grade_avg: 1 }
}]

Aggregating by two fields at top level

query getAvererageGradeByYearAndSubject {
    aggregateStudents(byYear: true, bySubject: true) {
      year
      subject
       aggregateExams {
          grade_avg
       }
    }
}
  • This would return the average grade over all students and all exams for each year and each subject.

The result might look like this:

[{
  year: 2011, 
  major: "Computer Science",
  exams: { grade_avg: 3 }
}, {
  year: 2011, 
  major: "Biology",
  exams: { grade_avg: 2.5 }
}, {
  year: 2012, 
  major: "Computer Science",
  exams: { grade_avg: 1 }
}, {
  year: 2012, 
  major: "Biology",
  exams: { grade_avg: 4 }
}]

Aggregating by one field at lower level

query getAvererageGradeForStudent {
    allStudents {
      name
      aggregateExams {
          grade_avg
      }
   }
}
  • This would calculate the average grade for each student. We don't group the exams explicitly, but they are implicitly grouped by students.
  • Therefore, we could also write a similar query like this:
query getAvererageGradeForStudent {
    aggregateExams(byStudent: true) {
      grade_avg
      Student {
          name
      }
   }
}

Nested aggregation

query getAvererageGradeForExamPerYear {
    aggregateStudents(byYear: true) {
      year
      aggregateExams(byName: true) {
          grade_avg
      }
   }
}
  • This would return the average grade over all students for each exam for each year.

Further considerations

  • Something like aggregateStudents(by: [year, name, ...]) would be way nicer, but does not work with the current GraphQL spec. We could use a list of strings, but this will sacrifice type-safety. Therefore I preferred the solution above.
  • It might be possible to remove aggregated child nodes from the result and instead embed the aggregated field directly into the topmost aggregated node. This design decision is up to you.
  • Aggregation functions should, of course, support filter and ordering expressions.
  • Theoretically, it would be possible to add byX parameters to the existing query functions. However, I suggest an approach where the schema enforces at least some validity of the query, as above.
  • It is impossible to query a field which is neither grouped, nor aggregated. I don't see a way how we can enforce this using the schema though, since we cannot make the schema dependent on the parameters. An option would be to generate a aggregate call for each field combination, e.g. aggregateStudentByName, aggregateStudentByYear, aggreagateStudentByNameAndYear and so on. The downside is that this would add a lot of overhead to the schema.
  • Care has to be taken to select good aggregation functions for data types. sum, average, count are the bare minimum.

@ejoebstl
Copy link
Contributor Author

Since we're using a nice tree-based data format, we are not forced to aggregate over relations if we don't want to.

query getAvererageGradeByYear {
    aggregateStudents(byYear: true) {
       year
       Exams {
          name
       }
    }
}

Could return something like

[{
  year: 2011, 
  exams: [
    { name: "Test 2" },
    { name: "Test 3" },
    { name: "Test 4" }
  ]
}, {
  year: 2012, 
  exams: [
    { name: "Test A" },
    { name: "Test B" },
    { name: "Test C" }
  ]
}]

Formally, that would be the same as aggregating the exams by joining them in a list.

@kbrandwijk
Copy link
Contributor

kbrandwijk commented Jun 30, 2017

I read the suggestion on aggregations, I think a standard 'aggregations' field on every Type is a lot easier to work with that all those type-name-specific fields mentioned above:

query {
    allMovies(filter: { ...}, groupBy: {...}) {
        aggregations {
             max { 
                  fieldname1
                  fieldname2
             }
             min { .... }
             avg { .... }
             count { optionalFields }
             sum { .... }
        }
    }
}

The aggregations can not be mixed with 'normal fields', but could be nested:

query {
    allMovies(group: { by: { releaseYear } }) {
        title
        actors {
            aggregations {
                count
            }
        }
        reviews {
            aggregations {
                 min { lowestRating: rating }
                 max { highestRating: rating }
            }
        }
    }
}

Also note the use of groupby, that is indeed a powerful combination with aggregation.

@nephix
Copy link

nephix commented Jul 26, 2017

I'm also looking for this feature

@jcarva
Copy link

jcarva commented Aug 4, 2017

Looking forward to this feature

@birkir
Copy link

birkir commented Aug 18, 2017

Pretty big blocker for me now, as this task is pretty heavy for the client.

query showtimesTodayByCinemas {
  allMovies(
    groupBy: {
      showtimes { cinema { id } }
    }
  ) {
    title
    showtimes {
      playingAt
    }
  }
}

I dont know about this

{
    "allMovies": [{
        "cinema": {
            "id": "..."
        },
        "groupBy": [{        // <-- would this be correct name from aggregation?
            "title": "The Hitman's Bodyguard",
            "showtimes": [{ "..." }]
        }]
    }]
}

@ragnorc
Copy link

ragnorc commented Aug 30, 2017

+1

@MJones180
Copy link

Highly needed, +1

@gustavofulton
Copy link

+1

@antho1404
Copy link
Contributor

Is there any plan to do this feature anytime soon ?

Also all those propositions of API are great but I just have one concern about the returned data. Do the clients like apollo handle this kind of data and know how to inspect the results or they just rely on object and array and in this case the group_by will just be a property of the object. If they don't handle that maybe one solution could be to do as many filter as needed and group everything into a batch query (this may seems really too much but I'm not not sure the group_by is something that graphql can manage easily)

Also, why not user the _allXXXMeta to get the aggregation data, that would make more sense to me because we don't actually want real data but just some meta informations.

If we don't take the group_by but use multiple query we would have something like that

query {
  _allMoviesMeta(filter: {
    releasedDate_gte: "2017-11-06",
    releasedDate_lt: "2017-11-07"
  }) {
    rating_avg
    rating_min
    rating_max
    count
  }
}

query {
  _allMoviesMeta(filter: {
    releasedDate_gte: "2017-11-07",
    releasedDate_lt: "2017-11-08"
  }) {
    rating_avg
    rating_min
    rating_max
    count
  }
}

...

This can easily be done by a script to generate all the queries and like that we just get some normal data and not nested that maybe (and it's a big maybe because I don't know) apollo and other client don't support. If so we could add the group by but I still think we should use the Meta instead of the actual model for the aggregated data

Would be happy to hear any feedback.

This feature is quite important for me, at least the aggregation, there is still the multi query workaround for the groupby so if this feature is not in the roadmap, I will be happy to try to implement it on the backend but it could be good to have a good API first

@prestontighe
Copy link
Contributor

prestontighe commented Nov 9, 2017

Is there a hack solution to this? I'm assuming the only way to do this is to create a custom function?

@sorenbs
Copy link
Member

sorenbs commented Nov 10, 2017

@antho1404 This item is fairly high on our roadmap. I would encourage everybody to experiment with the API design so we can have a broader discussion before implementation starts. I like the general approach in this proposal, but it has limitations as highlighted in several comments, so we might end up with a final design that is significantly different. Feel free to contribute more ideas and feature requests :-)

@kbrandwijk
Copy link
Contributor

@sorenbs Is it possible to update the opening post with the lessons learned from the comments and the current proposal?

@sorenbs
Copy link
Member

sorenbs commented Nov 21, 2017

Thank you all for the discussion here. It has been very enlightening. GraphQL turns out to be a very nice language to retrieve aggregated data. I have created a final proposal taking into account all points raised here: #1312

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests