Faster method for object ID enumeration for sources that do not support pagination #33

Merged
merged 2 commits into from
Feb 23, 2017

Conversation

albarrentine
Contributor

When trying to get the Medellín data set to finish, I noticed that for older servers that don't support pagination, pyesridump requests object IDs in batches of 100 (regardless of maxRecordCount).

At least for the Medellín server in question, querying by OID enumeration seems to be quite slow (it was taking 6+ hours to get through 300k records and often failed halfway through). After some brief experimentation, I found that the same result could be achieved by sorting the IDs and constructing one range query per batch (WHERE ObjectID >= MIN_ID_IN_BATCH AND ObjectID <= MAX_ID_IN_BATCH), which in my tests was at least an order of magnitude faster and allowed for larger batch sizes.
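For illustration, the range-per-batch idea described above could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the function name `range_where_clauses`, the default batch size of 1000, and the `ObjectID` field name are assumptions made for the example.

```python
def range_where_clauses(object_ids, batch_size=1000, oid_field="ObjectID"):
    """Yield one range WHERE clause per batch of sorted object IDs.

    Hypothetical helper: it assumes object_ids is the complete list of IDs
    reported by the server, so every ID between a batch's minimum and
    maximum belongs to that batch.
    """
    ids = sorted(object_ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # Only the first and last ID of the sorted batch appear in the clause,
        # so its length does not grow with the batch size.
        yield "{0} >= {1} AND {0} <= {2}".format(oid_field, batch[0], batch[-1])


if __name__ == "__main__":
    for clause in range_where_clauses([5, 1, 3, 2, 9], batch_size=2):
        print(clause)
```

Each clause would then be passed as the `where` parameter of a layer query in place of a list of individual IDs.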

I'm not sure whether this would break other sources, or whether some versions of ArcGIS don't support range queries. If neither is the case, this could significantly speed up queries to servers running older versions of ArcGIS.

@iandees
Member

iandees commented Feb 23, 2017

I used a batch of 100 because some servers balked at enumeration via a POST request, so I used GET query parameters instead, and 100 seemed like a decent size to avoid overrunning the URL length limit. It's probably worth going back to POST with a bigger batch size.
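To make the URL length concern concrete, here is a rough sketch of how the GET query string grows with the number of enumerated IDs. The helper name `query_string_length` and the particular set of parameters are assumptions for illustration, and the actual limit depends on the server and any proxies in front of it.

```python
from urllib.parse import urlencode


def query_string_length(object_ids):
    """Return the length of an encoded query string that enumerates object IDs.

    Hypothetical helper: the parameter set is a plausible minimal layer query,
    not necessarily what pyesridump sends.
    """
    params = {
        "objectIds": ",".join(str(i) for i in object_ids),
        "outFields": "*",
        "f": "json",
    }
    # urlencode percent-encodes each comma, so every ID costs its digits
    # plus three characters for the separator.
    return len(urlencode(params))


if __name__ == "__main__":
    for size in (100, 1000, 10000):
        print(size, query_string_length(range(1, size + 1)))
```

The length grows linearly with the batch size, which is why a GET request caps the usable batch while a POST body does not.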

@albarrentine
Contributor Author

Got it. So AFAICT the issue had less to do with batch size than with passing the objectIDs parameter instead of expressing the same filter as a WHERE clause. This thread noted the same performance issue and seemed to indicate that the generated query was doing a table scan. I'm not sure whether that's specific to Oracle, but in any case the WHERE clause method works on these systems as well. A range query has the added benefit of shortening the URL, but if that's not kosher, "WHERE ObjectID IN (...)" would also do the trick.
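As a rough comparison of the two WHERE forms mentioned above, the sketch below builds both for the same batch. The helper names `in_where_clause` and `range_where_clause` and the `ObjectID` field name are assumptions for the example, not code from this PR.

```python
def in_where_clause(batch, oid_field="ObjectID"):
    """Build an 'IN (...)' clause listing every ID in the batch (hypothetical helper)."""
    # int() keeps anything non-numeric out of the generated SQL fragment.
    id_list = ", ".join(str(int(i)) for i in batch)
    return "{} IN ({})".format(oid_field, id_list)


def range_where_clause(batch, oid_field="ObjectID"):
    """Build a range clause from the smallest and largest ID in the batch.

    Assumes every ID between the two bounds is wanted, which holds when the
    batch is a contiguous slice of the server's sorted ID list.
    """
    return "{0} >= {1} AND {0} <= {2}".format(oid_field, min(batch), max(batch))


if __name__ == "__main__":
    batch = list(range(1, 1001))
    print(len(in_where_clause(batch)), len(range_where_clause(batch)))
```

The IN form is longer but selects exactly the listed IDs; the range form stays short regardless of batch size.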

@iandees
Member

iandees commented Feb 23, 2017

Ah yup, your change makes sense. I made my comment before looking at your changes (a consequence of having just landed after a flight 😄 ). I think the table scanning might be related to the backend database the Esri server uses, but this where clause change can only help.

2 participants