Allow static file data storage for Provider /status_changes and /trips #357

hunterowens · 2019-09-17T23:19:24Z

Explain pull request

In #268, @johnpena discusses the need to make MDS queries more predictable. @babldev attempted to address the issue in #354, but during the discussion on 9-4, it sounded like folks wanted a different type of solution.

This builds on #354 and specifically @hyperknot's suggestion that we split the possible response with different query parameters.

It introduces static_end_time to both /trips and /status_changes. This would be an ISO month, day, or hour that could be cached on S3 or a similar static file store that could be returned as a single page or set of pages.

TBH, adding this complexity makes me want to also finish the openAPI work so that this can be made more clearly and testable by providers and clients alike. I'll take a stab at that shortly, but would love folks thoughts on this.

Is this a breaking change

A breaking change would require consumers or implementors of the API to modify their code for it to continue to function (ex: renaming of a required field or the change in data type of an existing field). A non-breaking change would allow existing code to continue to function (ex: addition of an optional field or the creation of a new optional endpoint).

Yes, breaking

`Provider` or `agency`

Which API(s) will this pull request impact:

provider

Additional context

Fixes #268.
Fixes #350.
Fixes #385.

hyperknot · 2019-09-17T23:37:03Z

Thanks for this PR, it looks really simple and clean to me! One tiny addition I'd include for trips is something like "month's worth of trips (filtered by end_time)", just to make it clear that it's the exact same filtering as in min-max end time, not start time.

ascherkus · 2019-09-18T16:41:14Z

Thanks for writing this up!

My three highest bits of feedback surround:

Whether it's possible to reduce the different time periods (month/day/hour) into a single time period for requesting static data to lower specification complexity and implementation burden
Whether the static_end_time response requires / assumes fully complete sets of immutable data (vs. how should we handle mutable data)
Whether it's possible to fold {min,max}_end_time into static_end_time by combining ideas from (1) and (2)

For (1):

The argument (IIRC) for allowing different time periods is to simplify backfill consumption by requesting larger periods of time and thus reducing number of requests
That being said... as long as response pagination is allowed there's no guarantee the provider implementation will honor a single request/response for a particular time period
For example, if a provider implements the "month" response by paginating by day (which they're permitted to do), we haven't reduced the number of requests compared to the "day" method and thus added more complexity to the specification and each implementation with zero added benefit

In short, I believe pagination is in conflict with the desire to request larger time periods / lower the number of requests because the end result will always be subject to how different providers paginate their data, not the duration of the time period being requested.

That leads me to believe that either:

We pick a single sensible time period (since responses will get arbitrarily paginated anyway)
We disallow pagination on these responses (to respect the use case for minimizing # of requests)

Would love to hear what folks think and whether I missed different use cases for having different time periods!

For (2):

The argument for requiring complete and immutable sets of data would clarify expectations for returning errors (e.g., an error is guaranteed to be returned for the month "2019-09" until 2019-10-01 00:00:00 UTC at the earliest)
Conceptually, once the data "goes live" a consumer could feel more confident that the data set is complete
That being said... I'm highly skeptical that providers can guarantee historical data would never change once being published (e.g., a bug gets fixed, erroneous data gets corrected)
That means consumers need to check for changes in static data when backfilling to ensure they have the latest copy
Which leads me to believe we'd want to have requests and responses leverage If-Match / ETag / If-Modified-Since / Last-Modified to efficiently determine if changes have been made when backfilling

So I guess it boils down to how confident we all feel whether historical data can/should ever change, and if our collective confidence is not 100% how do we want providers and consumers to handle that scenario? How important is it to ensure we get this right?

For (3):

If we're in agreement on (1) and (2) that we only need to specify a single time period and that static data can change over time... then I see an opportunity to support "real-ish time use cases" by replacing {min,max}_end_time by having clients simply poll the most current static_end_time, using ETag / Last-Modified to determine when they've been updated
This would imply allowing incomplete data sets for the latest time period for static_end_time (e.g., requesting "2019-09" before 2019-10-01 00:00:00 UTC is permissible but the data / ETag / Last-Modified may keep changing until 2019-10-01 00:00:00 UTC at the earliest)
For example, a provider could re-generate the latest static_end_time period of data every N minutes, with real-time consumers checking every M minutes and fetching the latest data when needed
This would ease specification complexity for consumers and providers by folding the three different request/response methods (static_end_time, {min,max}_end_time < 1 month, {min,max}_end_time >= 1 month) into a single method

I understand this is a larger (and likely more controversial!) change, but if we're going to make a breaking change I'd like to entertain the idea of just how large of a breaking change we can make so that provider continues to scale for both consumers and implementors alike :)

kheraankit · 2019-09-18T17:45:57Z

What should be behavior if none of the date params are specified? I would expect it to return an error and not have any implicit logic.

Can the spec describe the default behavior for this condition?

johnpena · 2019-09-18T17:52:49Z

Overall this looks really good and I think is a huge step in the right direction, thanks for putting this together @hunterowens!

To expand on what @kheraankit said, I think we should take the opportunity for the next release to specify whether the time-bounding query parameters for the endpoints (min_end_time/max_end_time and start_time/end_time) are optional/required and what happens if they're not provided. Lime's feeds work such that when none of the parameters are provided, we return all of the data we have available. I think changing this behavior to return an error would be better, but we should indicate in the documentation that this is the case.

jrheard

Deferring to other folks on what the exact correct approach is, just some feedback on the diff itself, hope this helps!

provider/README.md

hyperknot · 2019-10-01T18:49:02Z

Based on my experience with APIs and how providers implement them, I think it'd be a good idea:

To name these two endpoints differently, for example static_trips or processed_trips. Architecturally these will be very different implementations, the normal endpoints will do a query in a database while these ones will proxy a static file from an S3 bucket for example.
Explicitly state that these endpoints contain "final" or processed data. Providers usually take 1-2 days till they finalize their data even today and this is a very buggy area currently: there is simply no way to know if a provider's data is "final" in an endpoint or not.
Also explicitly state an error message, like {message: "not_yet_processed"} or similar to tell clients when a data has not yet been processed.

About time buckets, I also believe it'd be better to ask only one time range, not 3. With 3, I believe it'll be very common that they'll be out-of-sync, something which will just introduce bugs again. Daily splits do not really work, as they are split by UTC days, whereas every processing happens in a region's local time, so it's not really useful. Monthly buckets will probably be too big to return them in a single request.

As a conclusion => I recommend only implementing hourly static endpoints, named /static_trips and /static_status_changes.

hunterowens · 2019-10-11T17:01:42Z

Hi all:

I've pushed a new draft, that has some pretty major changes.

the current trips and status_changes become static file only backed, on hourly buckets only, per @hyperknot and others suggests.
there is a new API endpoint, per @thekaveman others suggestion, called events, that represents an as real time as possible endpoint, and only contains the last 2 weeks worth of data. Currently, it inherits the status_changes schema, but could change in future versions.

This allows us to handle a lot of the GBFS like-use cases that @HeidiMG has been (correctly, IMO) talking about moving away from GBFS to MDS.

Thoughts? @johnpena @babldev @rf- @hyperknot @thekaveman @barbeau / anybody else I forgot to tag (apologies!)

ascherkus

Looking good!

Potentially silly question... is there any guidance or preference as to what providers should return when end_time is not part of the request?

Off the top of my head it seems like it'd fall back to the existing paginated view of all historical trips. Are there use cases better served by keeping this path around? Is it worth mandating end_time and returning an error if not provided?

Not sure how much of an issue the above would be in practice but wanted to flag for the group!

provider/README.md

thekaveman

Generally agree with @ascherkus comments that we should be more clear that the time parameters here mean we are asking for trips that ended within that hour or status_changes that occurred within that hour, vs. before that hour / within the prior hour.

I also wonder if we can maintain the existing timestamp querying format (integer epoch milliseconds) vs. introducing ISO 8601 - this might make the change less impactful for clients using the existing query format. We could even go so far as saying if a query comes in that is not cutoff at the hour mark (e.g. end_time=1571250098290) then Providers should interpret this as a query for that hour (e.g. end_time=1571248800000)

I am also in favor of requiring end_time or whatever the time parameter is to avoid having to deal with ambiguously defined windows.

hunterowens · 2019-10-21T17:21:46Z

@thekaveman @ascherkus

I went with requiring end_time, otherwise error, since if we are making breaking changes, we might as well make it the ones that are the clearest to implement.

@ascherkus, I added language to describe what should be in the response payload that should clear up those comments / fixed the various typos you caught. LMK if that clears things up.

@thekaveman: I think I'd rather keep the time definition explicit, as introducing a snap to hour via any timestamps means there is the possibility of something going wrong on the server side as we make an implicit conversion, I think this keeps it cleaner for implementation and debugging between teams.

provider/README.md

hyperknot · 2019-10-21T17:53:18Z

Epoch seconds / milliseconds is a very highly bug prone part of the current specs. Each provider have their own implementation about which one to use, which is especially confusing as they need to support 0.2 and 0.3 clients concurrently. Some providers do an auto-detection and accept both formats (it can be done by saying comparing to year 2100's timestamp in seconds).

Also, unix seconds / milliseconds are not fit for describing a time "block", like an UTC hour, as it can be both placed in the beginning of the hour, somewhere in the middle or at the very end, all raising possible bug cases.

13 byte ISO 8601 strings ("2019-10-21T19") solve these issues and define an hourly block explicitly.

hyperknot · 2019-10-21T18:05:21Z

So to conclude, this PR is aiming to replace the current DB based trips and status changes endpoints with the static ones, instead of adding 2 more new endpoints? If so, I agree with the design, it's a clean choice!

+1 new endpoint for real-time-ish events (/events), whose future we can discuss in other PRs?

hunterowens · 2019-10-21T18:16:20Z

@hyperknot Yes, the idea is that /trips, /status_changes become backed by static files, and /events is what status_changes is today.

hyperknot · 2019-10-21T18:19:55Z

@hyperknot Yes, the idea is that /trips, /status_changes become backed by static files, and /events is what status_changes is today.

I understand, that's a great change.

hyperknot · 2019-10-21T18:46:33Z

One more important point I'd like to raise is the removal of pagination from static-backed endpoints.

1. Pagination currently is the most "broken" part of the MDS specs. I work on MDS acquisition at Populus, we work with 10+ operators' MDS API, and pagination code is by far the most complex part of our MDS acquisition infrastructure. Most operators have slightly different behavior, and errors are not uncommon.

Some operators do not support pagination at all but require us to query data in hourly blocks (exactly like the new static endpoints), some do not add links to the last page, some require custom query parameters and the list goes on.

2. Since on the new endpoints, the time-period is fixed, I went ahead and analyzed all our historical data.

=> The average size for an hourly block of trips data, gzipped is 100 kB.
=> The maximum, extreme value for an hourly block of trips data is 7 MB, but this is very rare.

Since we are talking about servers in datacenters communicating with each other, even sending 7 MB in a single HTTPS request should be a non-issue today. Status changes are tiny in comparison, there is no point in analyzing them.

=> As a conclusion, I recommend removing all pagination and "links" related parts of the specs from the static-backed endpoints.

ascherkus · 2019-10-21T21:12:14Z

Clarity on hour interval text = LGTM! Thanks! Still find use of end_time for status_changes to be slightly confusing but can live with it given the clarity improvements.

Generally speaking I agree with @hyperknot ... removing pagination would make implementation more consistent across providers and the response sizes seem reasonable enough to warrant removal. On the flip side, I recall @bhandzo (maybe?) saying they'd still prefer to have pagination within the hour intervals so would be good to get some input as to what the use cases are there.

thekaveman · 2019-10-25T19:46:32Z

Rebased this PR on the latest dev changes, and made a few language cleanups around query parameters. These should address comments from @ascherkus.

hunterowens requested review from thekaveman and a team as code owners September 17, 2019 23:19

jrheard reviewed Sep 18, 2019

View reviewed changes

provider/README.md Outdated Show resolved Hide resolved

provider/README.md Outdated Show resolved Hide resolved

bhandzo mentioned this pull request Oct 11, 2019

Adding a /vehicles endpoint to show the current status of the fleet #376

Merged

ascherkus reviewed Oct 15, 2019

View reviewed changes

provider/README.md Outdated Show resolved Hide resolved

provider/README.md Outdated Show resolved Hide resolved

provider/README.md Outdated Show resolved Hide resolved

provider/README.md Outdated Show resolved Hide resolved

thekaveman reviewed Oct 16, 2019

View reviewed changes

This was referenced Oct 16, 2019

Provider API: status changes order #276

Closed

Timeliness of status change event data #315

Closed

Enable providers to bucket data by fixed time windows #354

Closed

hyperknot reviewed Oct 21, 2019

View reviewed changes

provider/README.md Outdated Show resolved Hide resolved

hunterowens added 7 commits October 25, 2019 12:06

remove query params. Fixes #350

63675e1

attempt at solving #268

fc6db32

turn /trips into a static file backed only endpoint.

eb6eee3

cleanup status_changes

297e3b2

make events API

882e1ff

when error, use an HTTP status code.

24b20c5

add "error with query param note" for clarity

cfab67d

hunterowens and others added 7 commits October 25, 2019 12:06

resolve typos

f1239c8

spell out exact data response

97b0a38

correctly replace : with T.

84811d5

move vehicle_types and propulsion_types under General Info

b0afc54

clean up /trips query param language

da85079

clean up /status_changes query param language

61b4219

clean up some /events language

b8f6bbb

thekaveman force-pushed the provider-historical-endpoint branch from 3052757 to b8f6bbb Compare October 25, 2019 19:45

fix copy/paste typo

83dacb8

thekaveman added the Provider Specific to the Provider API label Oct 25, 2019

thekaveman added this to the 0.4.0 milestone Oct 25, 2019

require both sides of time query range

cc6e945

thekaveman mentioned this pull request Oct 25, 2019

Clearer language for the Provider's time range query params #385

Closed

thekaveman approved these changes Oct 25, 2019

View reviewed changes

thekaveman changed the title ~~Allow for a Static File Store Backed Query Parameter~~ Allow for static file backed data store for Provider /status_changes and /trips Oct 25, 2019

thekaveman changed the title ~~Allow for static file backed data store for Provider /status_changes and /trips~~ Allow static file data storage for Provider /status_changes and /trips Oct 25, 2019

thekaveman merged commit e60b003 into dev Oct 25, 2019

thekaveman mentioned this pull request Oct 26, 2019

Provider language cleanup #395

Merged

thekaveman mentioned this pull request Nov 4, 2019

Changes for MDS 0.4.x CityofSantaMonica/mds-provider#79

Closed

9 tasks

hyperknot mentioned this pull request Jan 9, 2020

Clarify no pagination on static provider endpoints /trips and /status_changes #424

Merged

thekaveman mentioned this pull request Jan 21, 2020

Provide Optional Static Request Format #425

Closed

jiffyclub mentioned this pull request Jan 30, 2020

Clarify "data does not exist" vs empty data for trips and status changes #437

Merged

thekaveman mentioned this pull request Feb 13, 2020

Indicate when status events are removed from the provider API #341

Closed

thekaveman deleted the provider-historical-endpoint branch February 13, 2020 17:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow static file data storage for Provider /status_changes and /trips #357

Allow static file data storage for Provider /status_changes and /trips #357

hunterowens commented Sep 17, 2019 •

edited by thekaveman

Loading

hyperknot commented Sep 17, 2019

ascherkus commented Sep 18, 2019

kheraankit commented Sep 18, 2019

johnpena commented Sep 18, 2019

jrheard left a comment

hyperknot commented Oct 1, 2019 •

edited

Loading

hunterowens commented Oct 11, 2019

ascherkus left a comment

thekaveman left a comment •

edited

Loading

hunterowens commented Oct 21, 2019

hyperknot commented Oct 21, 2019

hyperknot commented Oct 21, 2019 •

edited

Loading

hunterowens commented Oct 21, 2019

hyperknot commented Oct 21, 2019

hyperknot commented Oct 21, 2019 •

edited

Loading

ascherkus commented Oct 21, 2019

thekaveman commented Oct 25, 2019 •

edited

Loading

Allow static file data storage for Provider /status_changes and /trips #357

Allow static file data storage for Provider /status_changes and /trips #357

Conversation

hunterowens commented Sep 17, 2019 • edited by thekaveman Loading

Explain pull request

Is this a breaking change

Provider or agency

Additional context

hyperknot commented Sep 17, 2019

ascherkus commented Sep 18, 2019

kheraankit commented Sep 18, 2019

johnpena commented Sep 18, 2019

jrheard left a comment

Choose a reason for hiding this comment

hyperknot commented Oct 1, 2019 • edited Loading

hunterowens commented Oct 11, 2019

ascherkus left a comment

Choose a reason for hiding this comment

thekaveman left a comment • edited Loading

Choose a reason for hiding this comment

hunterowens commented Oct 21, 2019

hyperknot commented Oct 21, 2019

hyperknot commented Oct 21, 2019 • edited Loading

hunterowens commented Oct 21, 2019

hyperknot commented Oct 21, 2019

hyperknot commented Oct 21, 2019 • edited Loading

ascherkus commented Oct 21, 2019

thekaveman commented Oct 25, 2019 • edited Loading

hunterowens commented Sep 17, 2019 •

edited by thekaveman

Loading

`Provider` or `agency`

hyperknot commented Oct 1, 2019 •

edited

Loading

thekaveman left a comment •

edited

Loading

hyperknot commented Oct 21, 2019 •

edited

Loading

hyperknot commented Oct 21, 2019 •

edited

Loading

thekaveman commented Oct 25, 2019 •

edited

Loading