Querying large data sets not working #54

nbahnson · 2020-05-27T23:53:20Z

Hello,

Thanks for working on this package! It's really great to be able to analyze salesforce data in R and this makes that process a lot smoother. I currently am trying to use salesforcer to query ~1mil rows.

When I use sf_query with the rest API, any query above 2000 rows hangs indefinitely. I was able to update the sf_query_rest function and fix the hanging by changing this part of the code:

if (!is.null(next_records_url)) {
    next_records <- sf_query_rest(next_records_url = next_records_url, 
      control = control, verbose = verbose, ...)
    resultset <- bind_rows(resultset, next_records)
  }

to be

if (!response_parsed$done) {
    next_records <- sf_query_rest(next_records_url = next_records_url, 
                                  control = control, verbose = verbose, ...)
    resultset <- bind_rows(resultset, next_records)
  }

so it now works with over 2000 row queries. I was able to use this to pull my full data set correctly (though it certainly takes a while). Does this fix make sense, or might this cause unforeseen issues?

When I try to use the bulk api "Bulk 1.0", using sf_query_bulk, it only returns about 1/6 of the total data - not sure what's happening there. I also get errors if I try to use "Bulk 2.0" - it is mentioned in the documentation but does not seem to work when try to use it as a function argument as I get an error telling me it's not a valid argument. Apologies if I am simply using the function incorrectly.

Do you know why Bulk 2.0 isn't working, or why the normal bulk query is not returning the full data set?

The text was updated successfully, but these errors were encountered:

StevenMMortimer · 2020-06-05T19:05:39Z

@nbahnson Thanks for the update and troubleshooting yourself a bit. It's interesting that you found things work well if you check !response_parsed$done. Have you checked what is the value of next_records_url in that case? I'm wondering what URL is being used if !is.null(next_records_url) does not work, but !response_parsed$done works. I would think that it implies the URL is NULL which should restart pulling query data from the original query and start returning duplicate rows.

salesforcer/R/query.R

Lines 130 to 139 in 2954c4b

    
           if(!response_parsed$done){ 
        
             next_records_url <- response_parsed$nextRecordsUrl 
        
           } 
        
           # check whether it has next record 
        
           if(!is.null(next_records_url)){ 
        
             next_records <- sf_query_rest(next_records_url = next_records_url, control = control,  
        
                                           verbose = verbose, ...) 
        
             resultset <- bind_rows(resultset, next_records) 
        
           }

Regarding the Bulk API, I would highly recommend using it vs. the REST API for queries that return anything above 10K records or so. I'm surprised you're not getting things to work, so two things:

I'd recommend using sf_query(....api_type="Bulk 1.0") instead of sf_query_bulk() directly just in case there are any less user-friendly elements or bugs with using that function directly which is more of a helper utility that gets called by sf_query().
I'd recommend running your query from R and then monitoring it in Salesforce's Bulk API dashboard just to see what it's saying there as well. Your Bulk API dashboard is available at the URL: https://{{your.instance}}.salesforce.com/750. Feel free to let me know how it goes.

nbahnson · 2020-06-05T22:05:52Z

@StevenMMortimer thanks for the response!

As for the REST API:
When I change the code pack to the package code the next_records_url just keeps repeating the last url. For example, if I am querying 4500 rows with the rest api, these are the next_records_url values:
"/services/data/v46.0/query/%leavingblankincasethisisidentifying%-2000"
and then infinite repeats of
"/services/data/v46.0/query/%leavingblankincasethisisidentifying%-4000"
and it never exits the loop, thereby returning duplicate rows as you said.

It seems this is because when response_parsed$done = True, the next_records_url isn't reset with what should be the final NULL value, and so the loop cannot be closed. When I trace through the logic like this, it seems that this loop can never exit if over 2000 rows are queried with the REST API. Am I missing something? Can this be corrected in the package?

As for the Bulk API:

I get the same results using sf_query(....api_type="Bulk 1.0"), which again only returns ~115k out of ~715k rows (which I get by looking at salesforce directly or using my modified REST API call).
I looked at the Bulk API dashboard (thanks for that tip! didn't know about that). It says that my query processed ~715k records, and when I look at the results downloaded through the dashboard I see the ~715k results I was expecting. This seems to imply something going wrong with the R bulk API not downloading the full results. Do you have any idea why that might be?

Is there some other method in this package that I could use to correctly download a large query? I would try Bulk 2.0 but as I mentioned earlier, that is not available as an option when I actually call the sf_query function (I get an error saying it's not a valid option). I am pulling my data from another source for my purposes for now but would love to be able to query Salesforce directly with R.

StevenMMortimer · 2020-06-05T23:27:59Z

@nbahnson Thanks for the update! Sorry that hear that things aren't working :(

I was able to find that the potential issue could be related to the size of data returned by the SOQL which may or may not be measured by the number of records. The limits are noted HERE. Also, someone previously mentioned that the types of fields being pulled will affect the results as well: #35 (comment). I never really noticed or appreciated that, but maybe that is the issue with hanging at that specific number?

As far as the Bulk API, I'm trying to think through the best way to debug. Possibly running the individual internal functions and checking the output. Specifically, this type of routine and looking that every single batch is being tracked and only pulled if "Complete":

my_query <- "SELECT .... FROM ...."

job_info <- sf_create_job_bulk(operation = 'query', object = '..........')

query_info <- sf_submit_query_bulk(job_id = job_info$id, soql = my_query)

result <- sf_batch_details_bulk(job_id = query_info$jobId,
                                batch_id = query_info$id)

recordset <- sf_query_result_bulk(job_id = query_info$jobId,
                                  batch_id = query_info$id,
                                  result_id = result$result)

sf_close_job_bulk(job_info$id)

Really appreciate the help to debug!

nbahnson · 2020-06-05T23:57:09Z

@StevenMMortimer

In your first comment about the specific number, is that in regards to the REST API? Are you talking about the specific number of 2000? Apologies for my loose language, that number is mentioned because that is the number of rows returned per url - when you use the rest API for anything over 2000 you have to make subsequent calls and then stitch the results together, and that recursion is where the function is breaking.

That's why there's the recursive call in the first place, to get around the 2000 row limit with the REST api consistent with the documentation you linked (apologies if you already know this and I am misunderstanding your response). Again, I was able to fix the recursion issue by just changing the conditional logic so I think that's the only issue here. I have a local working version of the function which makes me believe that it's unrelated to any nuanced SOQL quirks. It's just a logic error that will lead to the function entering an infinite loop if it ever has to make a recursive call, which again only happens after >2000 rows are queried with the REST API.

The REST API issue seems pretty clear - I'll definitely work to debug the Bulk API. I appreciate the help! I'd love to get these fixed in the package so I don't have to have my own versions floating around my projects

nbahnson · 2020-06-09T00:48:08Z

Hey @StevenMMortimer

I was able to debug the Bulk API as well!

The issue is in the function
sf_query_result_bulk
Any line # references will be for this function

The bulk jobs are completed fine, but that function is only returning part of the data. My data is returned as content type "text/csv" so it follows that logic fork at line 18. I deconstructed this function, and found that this line in the function:
response_text <- content(httr_response, as = "text", encoding = "UTF-8")
at line 14

is what only returns the 1/6 of the data like I described. I thought this was odd since the httr_response actually had the full data set in it, but the content function was only returning part of it. If I instead parse the httr_response using content(recordset, type = "text/csv") then I get the full data set back.

From my testing, it looks like httr_response$headers$content-type can be passed directly into the content function as follows:

content_type <- httr_response$headers$`content-type`  (line 14)
res <- content(httr_response, type = content_type)

This would simplify the code by making the conditional fork based on content_type not necessary (starting line 15).

nbahnson · 2020-06-09T00:52:18Z

So I now have a working version of the bulk api code as well. Let me know if any of my comments need clarifying! Thanks for taking a look at this

StevenMMortimer · 2020-06-09T00:52:42Z

@nbahnson Wow! That's awesome. Much appreciated. I'm going to release the next version on CRAN within a couple days, so this will be shipped with it! Thanks again!

nbahnson · 2020-06-09T00:54:49Z

@StevenMMortimer Glad I could help, and thank you for jumping on this so quickly!

I'm trying to get my coworkers to use R so I'm happy to help where I can :)

Let me know if I can help test anything!

mjvtnm · 2020-06-11T10:45:57Z

The changes made today helped a bit, at least the recursion gets stopped. The problem is that the sub call returns parsed data (after type.convert) which then is combined with the raw data. So if there's say a date variable the bind_rows fails. Quick fix seems to be to set guess_types=FALSE. Maybe this should be the default for the recursive calls (line 227 etc)? The first call would do the type conversions anyway eventually.

StevenMMortimer · 2020-06-11T12:27:45Z

Thanks, @mjvtnm! I noticed this when the build failed last night and figured I would tackle it in the morning. You can see the new flow here that casts as character:

salesforcer/R/query.R

Lines 123 to 131 in 00f4a5a

    
           response_parsed <- content(httr_response, "text", encoding="UTF-8") 
        
           response_parsed <- fromJSON(response_parsed, flatten=TRUE) 
        
           if(length(response_parsed$records) > 0){ 
        
             resultset <- response_parsed$records %>%  
        
               select(-matches("^attributes\\.")) %>% 
        
               select(-matches("\\.attributes\\.")) %>% 
        
               as_tibble() %>%  
        
               mutate_all(as.character)

In this case, I'm converting all columns to character before binding, then converting if requested by the user. This ensures that the bind operation will not fail because of incompatible types.

And in the last iteration, I am guessing the types if requested by the user:

salesforcer/R/query.R

Lines 147 to 150 in 00f4a5a

    
           # cast the data in the final iteration if requested 
        
           if(is.null(next_records_url) & (nrow(resultset) > 0) & (guess_types)){ 
        
             resultset <- resultset %>%  
        
               type_convert(col_types = cols(.default = col_guess()))

NOTE: Unfortunately, fromJSON is already doing some type conversion internally. I cannot think of a fix right now to actually ensure that the raw text returned from the API call is not formatted by fromJSON. This is only a problem when a user needs something like a really long integer that needs to be kept a string as noted in #40. But it seems like the current strategy has addressed that specific case for now, so I'll just continue to monitor the issues while I think about a better solution.

StevenMMortimer · 2020-06-13T19:13:32Z

@nbahnson @mjvtnm This issue should be fixed in the latest release of the package (0.1.4) that is now on CRAN. Please let me know if there are still problems and we can work through them. If there are no lingering issues or I don't hear back in a couple weeks, then I'll close this issue.

Thanks again for your assistance and your patience! 👍

nbahnson · 2020-06-22T06:01:36Z

I've been able to use it a bit over the last week - it seems to work well! Thanks for jumping on this and getting a fix out so fast :)

StevenMMortimer self-assigned this Jun 8, 2020

StevenMMortimer added this to the 0.2.0 milestone Jun 8, 2020

StevenMMortimer added the bug Unintended behavior that should be corrected label Jun 8, 2020

StevenMMortimer modified the milestones: 0.2.0, 0.1.4 Jun 11, 2020

StevenMMortimer closed this as completed Jun 21, 2020

This was referenced Jul 20, 2020

sf_query() timeout on queries with more than 260 results #19

Closed

Nested Parent-to-Child Relationship Query Hangs #35

Closed

StevenMMortimer mentioned this issue Sep 11, 2022

Oauth related example #123

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Querying large data sets not working #54

Querying large data sets not working #54

nbahnson commented May 27, 2020 •

edited

Loading

StevenMMortimer commented Jun 5, 2020

nbahnson commented Jun 5, 2020 •

edited

Loading

StevenMMortimer commented Jun 5, 2020

nbahnson commented Jun 5, 2020 •

edited

Loading

nbahnson commented Jun 9, 2020 •

edited

Loading

nbahnson commented Jun 9, 2020

StevenMMortimer commented Jun 9, 2020

nbahnson commented Jun 9, 2020 •

edited

Loading

mjvtnm commented Jun 11, 2020

StevenMMortimer commented Jun 11, 2020 •

edited

Loading

StevenMMortimer commented Jun 13, 2020

nbahnson commented Jun 22, 2020

Querying large data sets not working #54

Querying large data sets not working #54

Comments

nbahnson commented May 27, 2020 • edited Loading

StevenMMortimer commented Jun 5, 2020

nbahnson commented Jun 5, 2020 • edited Loading

StevenMMortimer commented Jun 5, 2020

nbahnson commented Jun 5, 2020 • edited Loading

nbahnson commented Jun 9, 2020 • edited Loading

nbahnson commented Jun 9, 2020

StevenMMortimer commented Jun 9, 2020

nbahnson commented Jun 9, 2020 • edited Loading

mjvtnm commented Jun 11, 2020

StevenMMortimer commented Jun 11, 2020 • edited Loading

StevenMMortimer commented Jun 13, 2020

nbahnson commented Jun 22, 2020

nbahnson commented May 27, 2020 •

edited

Loading

nbahnson commented Jun 5, 2020 •

edited

Loading

nbahnson commented Jun 5, 2020 •

edited

Loading

nbahnson commented Jun 9, 2020 •

edited

Loading

nbahnson commented Jun 9, 2020 •

edited

Loading

StevenMMortimer commented Jun 11, 2020 •

edited

Loading