Querying large data sets not working #54
@nbahnson Thanks for the update and for troubleshooting a bit yourself. It's interesting that you found things work well with that change; for reference, see Lines 130 to 139 in 2954c4b
Regarding the Bulk API, I would highly recommend using it vs. the REST API for queries that return anything above 10K records or so. I'm surprised you're not getting things to work, so two things:
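For reference, the recommended Bulk API call looks roughly like this (a sketch: the object, SOQL, and the object_name argument are assumed from the package documentation, not taken from this thread):

```r
library(salesforcer)

# Illustrative only: object and SOQL are placeholders; object_name is
# assumed to be required for bulk queries per the package docs.
sf_auth()
soql <- "SELECT Id, Name FROM Account"
records <- sf_query(soql, object_name = "Account", api_type = "Bulk 1.0")
```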
@StevenMMortimer thanks for the response! As for the REST API: it seems the hang happens because when response_parsed$done is TRUE, next_records_url is never reset to the final NULL value, so the loop cannot be closed. When I trace through the logic, it seems the loop can never exit if more than 2000 rows are queried with the REST API. Am I missing something? Can this be corrected in the package?
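For concreteness, here is a minimal sketch of the exit condition I mean (this is not the package's actual code; done and nextRecordsUrl are the standard Salesforce REST query response fields, and instance_url / access_token / the API version path are placeholders):

```r
library(httr)

# Simplified sketch of the paging fix; instance_url and access_token
# stand in for whatever sf_auth() has already established.
query_all_pages <- function(soql, instance_url, access_token) {
  url <- paste0(instance_url, "/services/data/v46.0/query/?q=",
                URLencode(soql, reserved = TRUE))
  records <- list()
  repeat {
    resp <- GET(url, add_headers(Authorization = paste("Bearer", access_token)))
    parsed <- content(resp, as = "parsed")
    records <- c(records, parsed$records)
    # Exit as soon as Salesforce reports done = TRUE; don't wait for
    # the next-records URL to become NULL, because it never does on
    # the last page.
    if (isTRUE(parsed$done)) break
    url <- paste0(instance_url, parsed$nextRecordsUrl)
  }
  records
}
```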
As for the Bulk API: is there some other method in this package that I could use to correctly download a large query? I would try Bulk 2.0, but as I mentioned earlier, it is not available as an option when I actually call the sf_query function (I get an error saying it's not a valid option). I am pulling my data from another source for now, but I would love to be able to query Salesforce directly with R.
@nbahnson Thanks for the update! Sorry to hear that things aren't working :( I found that the potential issue could be related to the size of the data returned by the SOQL query, which isn't necessarily measured in number of records. The limits are noted HERE. Also, someone previously mentioned that the types of fields being pulled affect the results as well: #35 (comment). I never really noticed or appreciated that, but maybe that explains the hang at that specific number? As far as the Bulk API, I'm trying to think through the best way to debug. Possibly running the individual internal functions and checking the output; specifically, this type of routine, checking that every single batch is tracked and its results only pulled once it reports "Complete":
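Something along these lines (a rough sketch only: the helper names follow the package's exported bulk functions, but treat the exact signatures, field names, and the "Completed" state string as approximations):

```r
library(salesforcer)

# Rough sketch: submit a bulk query, poll the batch, and only pull
# results once the batch reports a finished state. Signatures are
# approximate; the real internals are authoritative.
job_info <- sf_create_job_bulk(operation = "query", object = "Account")
batch_info <- sf_submit_query_bulk(job_id = job_info$id,
                                   soql = "SELECT Id, Name FROM Account")

status <- sf_batch_status_bulk(job_id = batch_info$jobId,
                               batch_id = batch_info$id)
while (!(status$state %in% c("Completed", "Failed"))) {
  Sys.sleep(5)
  status <- sf_batch_status_bulk(job_id = batch_info$jobId,
                                 batch_id = batch_info$id)
}
stopifnot(status$state == "Completed")  # every batch must finish cleanly

details <- sf_batch_details_bulk(job_id = batch_info$jobId,
                                 batch_id = batch_info$id)
close_info <- sf_close_job_bulk(job_info$id)
```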
Really appreciate the help debugging!
In your comment about the specific number, is that in regards to the REST API? Are you talking about the number 2000? Apologies for my loose language; I mentioned that number because 2000 is the number of rows returned per URL. When you use the REST API for anything over 2000 rows, you have to make subsequent calls and stitch the results together, and that recursion is where the function is breaking. That's why the recursive call exists in the first place: to get around the 2000-row limit of the REST API, consistent with the documentation you linked (apologies if you already know this and I'm misunderstanding your response).

Again, I was able to fix the recursion issue just by changing the conditional logic, so I think that's the only problem here. I have a local working version of the function, which makes me believe it's unrelated to any nuanced SOQL quirks. It's simply a logic error that sends the function into an infinite loop whenever it has to make a recursive call, which only happens when more than 2000 rows are queried with the REST API. The REST API issue seems pretty clear; I'll definitely work on debugging the Bulk API. I appreciate the help! I'd love to get these fixed in the package so I don't have to keep my own versions floating around my projects.
Hey @StevenMMortimer, I was able to debug the Bulk API as well! The issue is in the function that parses the batch results. The bulk jobs complete fine, but that function only returns part of the data. My data is returned as content type "text/csv", so it follows that logic fork at line 18. I deconstructed the function and found that the parsing line is what returns only the 1/6 of the data I described. I thought this was odd, since the httr_response actually had the full data set in it. From my testing, it looks like reading the response body as text and parsing the CSV from that returns the complete result set.
This would also simplify the code, since the conditional fork based on content_type (starting at line 15) would no longer be necessary.
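In other words, something like this (a sketch of the approach, not the exact patch; the function name is a placeholder):

```r
library(httr)
library(readr)

# Sketch: take the raw response body as text and parse the whole CSV
# in one pass, rather than relying on content-type-specific parsing
# that appeared to truncate the result.
parse_bulk_result <- function(httr_response) {
  raw_text <- content(httr_response, as = "text", encoding = "UTF-8")
  read_csv(raw_text)
}
```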
So I now have a working version of the Bulk API code as well. Let me know if any of my comments need clarifying! Thanks for taking a look at this.
@nbahnson Wow! That's awesome. Much appreciated. I'm going to release the next version on CRAN within a couple days, so this will be shipped with it! Thanks again!
@StevenMMortimer Glad I could help, and thank you for jumping on this so quickly! I'm trying to get my coworkers to use R so I'm happy to help where I can :) Let me know if I can help test anything!
The changes made today helped a bit; at least the recursion gets stopped. The problem is that the sub call returns parsed data (with column types already guessed), so the pages can fail to bind when the guessed types disagree.
Thanks, @mjvtnm! I noticed this when the build failed last night and figured I would tackle it in the morning. You can see the new flow here that casts to character: Lines 123 to 131 in 00f4a5a
In this case, I'm converting all columns to character before binding, then converting back if requested by the user. This ensures that the bind operation will not fail because of incompatible types. And in the last iteration, I guess the types if requested by the user: Lines 147 to 150 in 00f4a5a
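In sketch form, the flow looks like this (placeholders only; the permalinked lines above are the authoritative code, and `pages` stands in for the list of per-request data frames):

```r
library(dplyr)
library(readr)

# Sketch: coerce every page to character so bind_rows() cannot hit
# incompatible types, then re-guess the column types once at the end.
resultset <- lapply(pages, function(df) mutate(df, across(everything(), as.character)))
resultset <- bind_rows(resultset)
resultset <- type_convert(resultset)  # only if the user asked for guessed types
```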
NOTE: Unfortunately, …
@nbahnson @mjvtnm This issue should be fixed in the latest release of the package (0.1.4) that is now on CRAN. Please let me know if there are still problems and we can work through them. If there are no lingering issues or I don't hear back in a couple weeks, then I'll close this issue. Thanks again for your assistance and your patience! 👍
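For anyone landing here, picking up the fix is just a standard CRAN update:

```r
# Update from CRAN and confirm the fixed version is installed
install.packages("salesforcer")
packageVersion("salesforcer")  # should report 0.1.4 or later
```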
I've been able to use it a bit over the last week - it seems to work well! Thanks for jumping on this and getting a fix out so fast :)
Hello,
Thanks for working on this package! It's really great to be able to analyze Salesforce data in R, and this package makes that process a lot smoother. I am currently trying to use salesforcer to query ~1 million rows.
When I use sf_query with the REST API, any query above 2000 rows hangs indefinitely. I was able to update the sf_query_rest function and fix the hang by changing this part of the code:
to be
so it now works with queries over 2000 rows. I was able to use this to pull my full data set correctly (though it certainly takes a while). Does this fix make sense, or might it cause unforeseen issues?
When I try to use the Bulk API ("Bulk 1.0") via sf_query_bulk, it only returns about 1/6 of the total data; I'm not sure what's happening there. I also get errors if I try to use "Bulk 2.0": it is mentioned in the documentation but does not seem to work when I pass it as a function argument, as I get an error telling me it's not a valid argument. Apologies if I am simply using the function incorrectly.
Do you know why Bulk 2.0 isn't working, or why the normal bulk query is not returning the full data set?