Figure out how to wire up Data Set Creation for Massive Data Sets #340
I'd argue it's less that it's a really big dataset (though I'm sure that's the underlying cause) and more that it's a poorly configured, or at least insufficiently configured, source server. Ideally we'd just say that if your server can't be bothered to return at least a response header within a reasonable amount of time, that's the server's problem. To pick on the Divvy Trips and the City a little bit here for the sake of example:
It's smart enough to use a chunked (streaming) transfer, but it still renders the entire content on the server side before sending the header. It should send a response header (and start streaming chunks) as soon as it starts generating the content. Unfortunately, I think the best we can do (and I have no idea how to implement it) is check whether the SSL handshake completes successfully, so we know we've at least connected to the target server, and then wait some arbitrarily long amount of time for it to get around to providing a header response. We have the slight benefit that such servers should be relatively quick about rejecting bad requests, so we can somewhat safely assume that a present-but-silent server is busy formulating some huge response body.
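For what it's worth, a minimal sketch of that idea, assuming HTTPoison/hackney as the client (the URL and timeout values below are placeholders, not anything we'd actually ship): hackney already separates the connect timeout from the receive timeout, so an unreachable host fails fast while a reachable-but-slow one gets a long window to produce a header.

```elixir
# Sketch only -- placeholder URL and timeouts, assuming HTTPoison as the HTTP client.
url = "https://data.example.org/divvy-trips/rows.csv"

HTTPoison.get(url, [],
  # establishing the TCP/SSL connection must succeed quickly,
  # which tells us the target server is at least reachable
  timeout: 5_000,
  # then tolerate an arbitrarily long wait for the response header/body
  recv_timeout: :timer.minutes(15)
)
```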
To be clear, we can verify that the resource exists relatively quickly by sending OPTIONS and HEAD requests, which we already do.
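Roughly what that check looks like in my head (a sketch, not the project's actual code, and assuming HTTPoison): a HEAD request, with OPTIONS as a fallback for servers that reject HEAD, should come back fast even from sources that take forever to render a full export.

```elixir
defmodule SourceCheck do
  # Sketch of a quick existence check (assumes HTTPoison); the module name and
  # status-code handling are illustrative only.
  def resource_exists?(url) do
    case HTTPoison.head(url, [], timeout: 5_000, recv_timeout: 10_000) do
      {:ok, %HTTPoison.Response{status_code: code}} when code in 200..299 ->
        true

      {:ok, %HTTPoison.Response{status_code: 405}} ->
        # some servers refuse HEAD entirely; fall back to OPTIONS
        match?({:ok, %HTTPoison.Response{status_code: c}} when c in 200..299,
               HTTPoison.options(url))

      _ ->
        false
    end
  end
end
```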
Could you clarify where the workflow would be different? Data set ingest, small and large, is always carried out in the background, from what I understand. Our issues with those have been due to…
The field guesser, specifically. I think I just need to tinker around with that.
If we know the resource exists, then I don't see any problem with just putting a stupidly long timeout on it. If we want to be fancy, we could set it automatically based on, say, 125% of the delay for the response the last time we accessed the resource.
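Something like this, maybe (a sketch only; `last_response_ms` is a hypothetical value we'd have to start recording on each successful fetch, and the floor and multiplier are made up):

```elixir
defmodule AdaptiveTimeout do
  # Hypothetical helper: never go below a default floor, otherwise use
  # 125% of the last observed response delay for this resource.
  @default_ms 30_000

  def recv_timeout(nil), do: @default_ms
  def recv_timeout(last_response_ms), do: max(@default_ms, round(last_response_ms * 1.25))
end

# e.g. HTTPoison.get(url, [], recv_timeout: AdaptiveTimeout.recv_timeout(meta.last_response_ms))
```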
Hmm, streaming the 1,000 or so rows off the top doesn't work for those?
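For reference, roughly how I picture that top-of-file sample (a sketch only, assuming HTTPoison's async streaming API; the module name and 1,000-row cap are placeholders):

```elixir
defmodule TopSample do
  @moduledoc """
  Sketch only: stream a remote CSV and keep just the first ~1,000 rows for the
  field guesser. Assumes HTTPoison's async API; names and limits are placeholders.
  """

  def first_rows(url, max_rows \\ 1_000) do
    {:ok, resp} =
      HTTPoison.get(url, [],
        stream_to: self(),
        async: :once,
        timeout: 5_000,
        recv_timeout: :infinity   # slow servers may sit silent for minutes
      )

    collect(resp, "", max_rows)
  end

  defp collect(resp, acc, max_rows) do
    receive do
      %HTTPoison.AsyncStatus{} ->
        HTTPoison.stream_next(resp)
        collect(resp, acc, max_rows)

      %HTTPoison.AsyncHeaders{} ->
        HTTPoison.stream_next(resp)
        collect(resp, acc, max_rows)

      %HTTPoison.AsyncChunk{chunk: chunk} ->
        acc = acc <> chunk
        rows = String.split(acc, "\n")

        if length(rows) > max_rows do
          # we have enough rows; stop pulling the rest of the (possibly huge) body
          :hackney.stop_async(resp.id)
          Enum.take(rows, max_rows)
        else
          HTTPoison.stream_next(resp)
          collect(resp, acc, max_rows)
        end

      %HTTPoison.AsyncEnd{} ->
        String.split(acc, "\n")
    end
  end
end
```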
It just kind of dies. I'm doing some verbose logging locally.
And then that's it...
Ok, so Divvy trips, in particular, takes waaaaaaaaay too long to render server-side before it even starts to send a response. I'm going to work with the people at the City to work out a solution. This may be the lead-up to building out an ingest process for top-N data sets, where the body is too large to process in a realistic time frame.
Blocked until we address #377
The internal API was getting really nasty -- we had a bunch of one-off functions that clashed in arity (positional arguments, matches, guards, options ...). The web application was also a disaster -- originally I thought it would make it easier to keep the web, admin, and API sections separate in subapps, but that ended up making things that much more difficult. Then that leaves the elephant in the room: Socrata. We've always relied on them and all of their awful decisions. The changes here remove some of the terrible things about the Socrata integration and make ingesting their data sets a little cleaner.

Breaking Changes:

- total revision of the migrations
- entirely removed the `UserAdminMessage` schema
- entirely removes all the outstanding ETL job stuff
- entirely removes charts -- that was a really stupid idea
- entirely removes exports -- again, just a stupid idea
- totally new ingest pipeline
- slimmed down the API (still needs some work)

Closes #235
Closes #340
Closes #360
Closes #361
Related to #235
For things like Crime Data and Divvy Rides, where the source takes minutes to even respond to requests, we need some sort of mechanism to handle this.
One possible solution: add an extra field to the meta that flags it as a really big data set. When that flag is true, we can either increase the timeout value for the request or turn the ingest into some sort of background job (and still bump up the request timeout). I'm not so sure about the background job, as it would mean a significantly different workflow for the users.
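A rough sketch of what that flag could look like (hypothetical module and field names for illustration, not actual code from this repo):

```elixir
# Sketch only: `Meta`, `Fetch`, and the `huge` field are hypothetical names.
defmodule Meta do
  use Ecto.Schema

  schema "metas" do
    field :source_url, :string
    field :huge, :boolean, default: false   # curator marks known-slow/huge sources
    # ...whatever else the meta already carries...
  end
end

defmodule Fetch do
  # bump the receive timeout way up when the meta is flagged as huge
  def recv_timeout(%Meta{huge: true}), do: :timer.minutes(30)
  def recv_timeout(%Meta{}), do: :timer.seconds(30)

  def download(%Meta{source_url: url} = meta) do
    HTTPoison.get(url, [], timeout: 5_000, recv_timeout: recv_timeout(meta))
  end
end
```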
@HeyZoos, @brlodi, @sanilio please comment.