-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow static file data storage for Provider /status_changes and /trips #357
Conversation
Thanks for this PR, it looks really simple and clean to me! One tiny addition I'd include for trips is something like "month's worth of trips (filtered by end_time)", just to make it clear that it's the exact same filtering as in min-max end time, not start time. |
Thanks for writing this up! My three highest bits of feedback surround:
For (1):
In short, I believe pagination is in conflict with the desire to request larger time periods / lower the number of requests because the end result will always be subject to how different providers paginate their data, not the duration of the time period being requested. That leads me to believe that either:
Would love to hear what folks think and whether I missed different use cases for having different time periods! For (2):
So I guess it boils down to how confident we all feel whether historical data can/should ever change, and if our collective confidence is not 100% how do we want providers and consumers to handle that scenario? How important is it to ensure we get this right? For (3):
I understand this is a larger (and likely more controversial!) change, but if we're going to make a breaking change I'd like to entertain the idea of just how large of a breaking change we can make so that provider continues to scale for both consumers and implementors alike :) |
What should be behavior if none of the date params are specified? I would expect it to return an error and not have any implicit logic. Can the spec describe the default behavior for this condition? |
Overall this looks really good and I think is a huge step in the right direction, thanks for putting this together @hunterowens! To expand on what @kheraankit said, I think we should take the opportunity for the next release to specify whether the time-bounding query parameters for the endpoints ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deferring to other folks on what the exact correct approach is, just some feedback on the diff itself, hope this helps!
Based on my experience with APIs and how providers implement them, I think it'd be a good idea:
About time buckets, I also believe it'd be better to ask only one time range, not 3. With 3, I believe it'll be very common that they'll be out-of-sync, something which will just introduce bugs again. Daily splits do not really work, as they are split by UTC days, whereas every processing happens in a region's local time, so it's not really useful. Monthly buckets will probably be too big to return them in a single request. As a conclusion => I recommend only implementing hourly static endpoints, named |
Hi all: I've pushed a new draft, that has some pretty major changes.
This allows us to handle a lot of the GBFS like-use cases that @HeidiMG has been (correctly, IMO) talking about moving away from GBFS to MDS. Thoughts? @johnpena @babldev @rf- @hyperknot @thekaveman @barbeau / anybody else I forgot to tag (apologies!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
Potentially silly question... is there any guidance or preference as to what providers should return when end_time
is not part of the request?
Off the top of my head it seems like it'd fall back to the existing paginated view of all historical trips. Are there use cases better served by keeping this path around? Is it worth mandating end_time
and returning an error if not provided?
Not sure how much of an issue the above would be in practice but wanted to flag for the group!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally agree with @ascherkus comments that we should be more clear that the time parameters here mean we are asking for trips
that ended within that hour or status_changes
that occurred within that hour, vs. before that hour / within the prior hour.
I also wonder if we can maintain the existing timestamp querying format (integer epoch milliseconds) vs. introducing ISO 8601 - this might make the change less impactful for clients using the existing query format. We could even go so far as saying if a query comes in that is not cutoff at the hour mark (e.g. end_time=1571250098290
) then Providers should interpret this as a query for that hour (e.g. end_time=1571248800000
)
I am also in favor of requiring end_time
or whatever the time parameter is to avoid having to deal with ambiguously defined windows.
I went with requiring @ascherkus, I added language to describe what should be in the response payload that should clear up those comments / fixed the various typos you caught. LMK if that clears things up. @thekaveman: I think I'd rather keep the time definition explicit, as introducing a snap to hour via any timestamps means there is the possibility of something going wrong on the server side as we make an implicit conversion, I think this keeps it cleaner for implementation and debugging between teams. |
Epoch seconds / milliseconds is a very highly bug prone part of the current specs. Each provider have their own implementation about which one to use, which is especially confusing as they need to support 0.2 and 0.3 clients concurrently. Some providers do an auto-detection and accept both formats (it can be done by saying comparing to year 2100's timestamp in seconds). Also, unix seconds / milliseconds are not fit for describing a time "block", like an UTC hour, as it can be both placed in the beginning of the hour, somewhere in the middle or at the very end, all raising possible bug cases. 13 byte ISO 8601 strings ("2019-10-21T19") solve these issues and define an hourly block explicitly. |
So to conclude, this PR is aiming to replace the current DB based trips and status changes endpoints with the static ones, instead of adding 2 more new endpoints? If so, I agree with the design, it's a clean choice! +1 new endpoint for real-time-ish events (/events), whose future we can discuss in other PRs? |
@hyperknot Yes, the idea is that |
I understand, that's a great change. |
One more important point I'd like to raise is the removal of pagination from static-backed endpoints. 1. Pagination currently is the most "broken" part of the MDS specs. I work on MDS acquisition at Populus, we work with 10+ operators' MDS API, and pagination code is by far the most complex part of our MDS acquisition infrastructure. Most operators have slightly different behavior, and errors are not uncommon. Some operators do not support pagination at all but require us to query data in hourly blocks (exactly like the new static endpoints), some do not add links to the last page, some require custom query parameters and the list goes on. 2. Since on the new endpoints, the time-period is fixed, I went ahead and analyzed all our historical data. => The average size for an hourly block of trips data, gzipped is 100 kB. Since we are talking about servers in datacenters communicating with each other, even sending 7 MB in a single HTTPS request should be a non-issue today. Status changes are tiny in comparison, there is no point in analyzing them. => As a conclusion, I recommend removing all pagination and "links" related parts of the specs from the static-backed endpoints. |
Clarity on hour interval text = LGTM! Thanks! Still find use of Generally speaking I agree with @hyperknot ... removing pagination would make implementation more consistent across providers and the response sizes seem reasonable enough to warrant removal. On the flip side, I recall @bhandzo (maybe?) saying they'd still prefer to have pagination within the hour intervals so would be good to get some input as to what the use cases are there. |
3052757
to
b8f6bbb
Compare
Rebased this PR on the latest |
Explain pull request
In #268, @johnpena discusses the need to make MDS queries more predictable. @babldev attempted to address the issue in #354, but during the discussion on 9-4, it sounded like folks wanted a different type of solution.
This builds on #354 and specifically @hyperknot's suggestion that we split the possible response with different query parameters.
It introduces
static_end_time
to both/trips
and/status_changes
. This would be an ISO month, day, or hour that could be cached on S3 or a similar static file store that could be returned as a single page or set of pages.TBH, adding this complexity makes me want to also finish the openAPI work so that this can be made more clearly and testable by providers and clients alike. I'll take a stab at that shortly, but would love folks thoughts on this.
Is this a breaking change
A breaking change would require consumers or implementors of the API to modify their code for it to continue to function (ex: renaming of a required field or the change in data type of an existing field). A non-breaking change would allow existing code to continue to function (ex: addition of an optional field or the creation of a new optional endpoint).
Provider
oragency
Which API(s) will this pull request impact:
provider
Additional context
Fixes #268.
Fixes #350.
Fixes #385.