Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable PKChunking for sf_query_bulk #46

Closed
JordanLewis312 opened this issue Jan 23, 2020 · 2 comments
Closed

Enable PKChunking for sf_query_bulk #46

JordanLewis312 opened this issue Jan 23, 2020 · 2 comments
Assignees
Labels
bug Unintended behavior that should be corrected question or help Clarification or help may suffice to resolve
Milestone

Comments

@JordanLewis312
Copy link

I am attempting to query a large table (~30M records) from Salesforce to R using the SalesforceR package, and hope to do so without splitting up into many queries.

The following query succeeds, if I limit the records:

OrderItemsSF <- sf_query_bulk("SELECT Id,Order,Product,Quantity FROM OrderItems LIMIT 1000000",object_name = "OrderItems", api_type = "Bulk 1.0", verbose = TRUE, max_attempts = 1000)

This query fails ("Error: column name 'result' must not be duplicated") whenever it takes longer than 10 minutes, which fits with SF's documentation saying bulk queries retry after 10 minutes. I believe the answer is to enable PKChunking to automatically separate my query into smaller batches, but I am having trouble finding a working syntax for this.

I have tried the following:

OrderItemsSF <- sf_query_bulk("SELECT Id,Order,Product,Quantity FROM OrderItems LIMIT 1000000",object_name = "OrderItems", api_type = "Bulk 1.0", verbose = TRUE, max_attempts = 1000, control = sf_control(PKChunkingHeader = list(Sforce-Enable-PKChunking= TRUE)))

This results in the error "Error in catch_errors(httr_response) : ClientInputError: Sforce-Enable-PkChunking doesn't have a valid value. The same error results from using FALSE, and according to the documentation (see link 2 at bottom), FALSE is the default for this parameter, so that doesn't make sense!

Other syntax attempts, like the one below, succeed on queries under 10 minutes and fail over 10 minutes, suggesting that R is ignoring this text and not actually PKChunking:

OrderItemsSF <- sf_query_bulk("SELECT Id,Order,Product,Quantity FROM OrderItems LIMIT 1000000",object_name = "OrderItems", api_type = "Bulk 1.0", verbose = TRUE, max_attempts = 1000, control = list(Sforce-Enable-PKChunking= TRUE))

I'm at a loss for what seems like a simple syntax issue, and can't find any specific examples online of PKChunking in a SalesforceR query, despite the documentation saying this can be done. I'd greatly appreciate some guidance here.

@StevenMMortimer StevenMMortimer self-assigned this Jan 26, 2020
@StevenMMortimer StevenMMortimer added the question or help Clarification or help may suffice to resolve label Jan 26, 2020
@StevenMMortimer
Copy link
Owner

First, the issue may not be related to PKChunking. If you are seeing "Error: column name 'result' must not be duplicated" that is a dplyr error when there are multiple columns with the same name in the resulting CSV. It is possible that your query is completing, but the results are not being parsed correctly back into R when it tries to load the recordset. If you go to https://na1.salesforce.com/750, are you able to see that the query finished successfully? (replace "na.1" in the URL with your Salesforce domain)

Next, there are two ways to pass control arguments:

# Method 1
result <- sf_query("SELECT Id, Order, Product, Quantity FROM OrderItems LIMIT 1000000",
                   object_name = "OrderItems",
                   PKChunkingHeader = list(`Sforce-Enable-PKChunking` = TRUE), 
                   api_type = "Bulk 1.0")

# Method 2
result <- sf_query("SELECT Id, Order, Product, Quantity FROM OrderItems LIMIT 1000000",
                   object_name = "OrderItems",
                   control = sf_control(PKChunkingHeader = list(`Sforce-Enable-PKChunking` = TRUE)), 
                   api_type = "Bulk 1.0")

This appears to match the first style you gave. However, I have not tested each header parameter individually so it is possible that it is being passed, just not be in the way Salesforce wants it (e.g. Sforce-Enable-PKChunking: chunkSize=50000; startRow=00130000000xEftMGH). If you turn on the verbose parameter, does the printed header look like the example above? Have you tried specifying the chunk size? Like this:

result <- sf_query("SELECT Id, Order, Product, Quantity FROM OrderItems LIMIT 1000000",
                   object_name = "OrderItems",
                   control = sf_control(PKChunkingHeader = list(chunkSize = 50000)), 
                   api_type = "Bulk 1.0")

That example above should show something like Sforce-Enable-PKChunking: chunkSize=50000 in the printed verbose message.

Lastly, the second style you tried out to pass the argument is not the correct syntax because it doesn't tell the control argument that the list you are passing is for setting the PKChunkingHeader header, so I would expect that to be either ignored or fail.

@StevenMMortimer StevenMMortimer added the bug Unintended behavior that should be corrected label Jul 18, 2020
@StevenMMortimer StevenMMortimer added this to the 0.2.0 milestone Jul 18, 2020
StevenMMortimer added a commit that referenced this issue Jul 19, 2020
Fix so that a character version of the logical TRUE/FALSE will be properly formatted. This arose from having to handle TRUE/FALSE as valid values but also the elements in the list themselves that need to be formatted and imply TRUE.

Fixes #46
StevenMMortimer added a commit that referenced this issue Jul 19, 2020
Update the process by which the Bulk 1.0 query functions pull down results because it is different when the data is PKChunked #46
@StevenMMortimer
Copy link
Owner

@JordanLewis312 Apologize for the extremely long delay, but finally got to the heart of the issue. It had two distinct parts:

  1. First, the value to turn on PKChunking can either be TRUE/FALSE or details like chunkSize that imply TRUE, which wasn't being handled well by the code.
  2. Second, when users invoke PKChunking Salesforce splits up the query results into batches. This is not typical for Bulk 1.0 queries which usually return all their results in a single CSV, so I had to rewrite some functions to account for queries that appear to have been PKChunked.

This fix is already on GitHub and will be included in the new release of the package (v0.2.0) coming in the next few days. Upgrade your version of {salesforcer} package to (>= 0.1.4.9999), then you can invoke like this:

library(salesforcer)
sf_auth()
contacts <- sf_query("SELECT Id, Name FROM Contact", 
                     object_name = "Contact", 
                     api_type = "Bulk 1.0",
                     PKChunkingHeader = list(`Sforce-Enable-PKChunking` = TRUE),
                     # alternatively you can turn on using specific chunking options like this ...
                     # PKChunkingHeader = list(`chunkSize` = 500),
                     interval_seconds = 10,
                     max_attempts = 200)
contacts
#> # A tibble: 383 x 2
#>    Id                 Name                           
#>    <chr>              <chr>                                     
#> 1 0033s000012NdFhAAK John Doe
#> 2 0033s000012NkzwAAC Jane Doe
#> 3 0033s000012NkzxAAC Jane Doe
#> 4 0033s000012NkzyAAC Jane Doe
#> 5 0033s000012NkzzAAC Jane Doe
#> # … with 378 more rows

As a heads up, I'll also answer your post over on StackOverflow in case others stumble across it there (https://stackoverflow.com/questions/59863547/salesforcer-enable-pkchunking-for-salesforce-bulk-query-in-r)

StevenMMortimer added a commit that referenced this issue Jul 21, 2020
Fix so that a character version of the logical TRUE/FALSE will be properly formatted. This arose from having to handle TRUE/FALSE as valid values but also the elements in the list themselves that need to be formatted and imply TRUE.

Fixes #46
StevenMMortimer added a commit that referenced this issue Jul 21, 2020
Update the process by which the Bulk 1.0 query functions pull down results because it is different when the data is PKChunked #46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended behavior that should be corrected question or help Clarification or help may suffice to resolve
Projects
None yet
Development

No branches or pull requests

2 participants