Possible Issue: Parallel putObject can cause RequestTimeTooSkewed #221

Closed

my8bird opened this issue Jan 28, 2014 · 18 comments

@my8bird

my8bird commented Jan 28, 2014

Description of the Issue

I wrote a quick script to upload my family photos to S3. It was supposed to upload multiple photos in parallel, but I kept hitting an issue that identified itself as a "Recursive process.nextTick detected". This message would print until the max stack size was exceeded, and it only happened after the script had been running for a while, which made it quite annoying. However, the pain of restarting the script was lower than the pain of debugging it.

Later on I sat down to debug this issue and noticed that prior to the nextTick message, after scrolling up a lot, was a message about RequestTimeTooSkewed. Google led me to multiple people saying my machine's time was incorrect. I knew this to be false and kept digging. After a bit I found a boto bug, https://code.google.com/p/boto/issues/detail?id=473, which was very similar in that the issue did not happen on every request, and the request it did happen on was not special in any way.

Possible Issue

Based on the boto bug, I believe the issue is that the time sent to Amazon is determined too early in the pipeline for putObject. I tried to dive into the code and figure out the lifecycle of an S3 request but could not figure out when the request is signed, or created for that matter. I did find the SequentialExecutor, which could be the starting point for the issue. My theory is that when I queue multiple putObject calls, each request's time is set at invocation; but because the requests run in sequence, S3 will not receive a request until the prior ones finish. If the preceding requests take longer than 15 minutes, the next request will fail because its "time" will be rejected by the S3 servers as too skewed.
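
To illustrate the theory (a conceptual sketch only, not actual SDK or S3 code): the signature carries a timestamp from when the request was built, and S3 compares it against its own clock on arrival.

var FIFTEEN_MINUTES = 15 * 60 * 1000;

var signedAt = Date.now(); // baked into the signature when the request is built
// ... the request then sits behind earlier uploads for a long time ...

// On arrival, S3 effectively does something like:
if (Math.abs(Date.now() - signedAt) > FIFTEEN_MINUTES) {
  throw new Error('RequestTimeTooSkewed');
}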

If that makes sense, awesome (I managed to confuse myself reading my first couple of attempts at writing this). If not, feel free to ask questions about the scenario I am describing.

Secondly, I have to say I was surprised to find that the library's requests cannot run in parallel. Is there a way to use the library with multiple in-flight requests?

@lsegal
Contributor

lsegal commented Jan 28, 2014

The SDK supports parallel requests. Everything in the SDK, and in fact in Node.js, is asynchronous, so you can dispatch multiple requests in parallel. If you can show some code reproducing your issue, I can help you debug why the requests are getting sent serially.

Note that if requests are not being dispatched in parallel, that is possibly why you are also getting the "Recursive nextTick" warning. It would also explain why you are getting time-skew errors. The SDK must send requests within 15 minutes of their being built. If the SDK is building all the requests up front but sending them serially, it's very likely the last of the requests will not send correctly.
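
For example, a minimal sketch of dispatching several uploads in parallel (bucket and key names are placeholders):

var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Each putObject call returns immediately; all three uploads are in flight at once.
['a.txt', 'b.txt', 'c.txt'].forEach(function (key) {
  s3.putObject({Bucket: 'my-bucket', Key: key, Body: 'data'}, function (err) {
    console.log(key, err || 'uploaded');
  });
});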

@my8bird
Author

my8bird commented Jan 30, 2014

I have run some tests by adding logging to Request to record when each event fires per request, and determined that the requests do appear to happen in parallel, as I had initially expected before sending myself off on a tangent.

This leaves me back at square one, trying to figure out why, after N putObjects in batches of M, I get the time-skew error.

My actual code is probably too large to post, but the following is analogous; perhaps it is a logic flaw rather than a code flaw. If you say this is correct, I will dig through my code some more and see where it does not conform to it. With this code, the only thing that seems to change while the uploads are in progress is the size of the files. If a large file simply takes too long to upload, would that cause the subsequent uploads to be skewed (given the procedure below)?

// Please forgive any syntax errors as I free-formed this code
var Q   = require('q');
var aws = require('aws-sdk');
var fs  = require('fs');
var _   = require('underscore');

function uploadFiles(files, maxUploads) {
   var done_d   = Q.defer(),
       results  = [],
       next_idx = 0;

   // Upload the next file in the list
   function add_work() {
      var file = files[next_idx];
      next_idx = next_idx + 1;

      var client = new aws.S3();
      return Q.ninvoke(client, 'putObject', {
            Bucket: 'bucket',
            Body:   file,
            Key:    'some random'
         })
         // When this upload finishes (or fails) schedule the next one
         .then(work_done, work_done);
   }

   function work_done(res) {
      // Store results for the end
      results.push(res);
      // If there is more work to do then add it
      if (next_idx < files.length) {
         return add_work();
      }
      // If all work is done then end this session
      if (results.length === files.length) { done_d.resolve(results); }
      return res;
   }

   // Start up to maxUploads file uploads
   _.first(files, maxUploads).forEach(add_work);

   return done_d.promise;
}

// Heterogeneous set of files (essentially files from 1 to 20 megabytes)
var files = ['1.txt','2.txt','3.txt','4.txt','5.txt','6.mpg','7.mpg','8.mpg','9.png','10.txt','11.txt','12.txt'];
var file_buffers = _.map(files, function (f) { return fs.readFileSync(f); });

uploadFiles(file_buffers, 5);

@mhart
Contributor

mhart commented Jan 30, 2014

Out of interest, do you get the error if you create the s3_client fresh for each putObject invocation?

That would at least clear up whether there's some sort of state being held in it across requests.
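
Something like this (just a sketch; the params are placeholders):

// Construct a throwaway client per upload instead of sharing one.
function putFile(file, cb) {
  new aws.S3().putObject({Bucket: 'bucket', Body: file, Key: 'some random'}, cb);
}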

@my8bird
Author

my8bird commented Jan 30, 2014

I guess I glossed over that in the sample. Yes, I am currently making a new client for each upload (it's just a convenient spot to scope the variable). Does that tell you something?

@mhart
Contributor

mhart commented Jan 30, 2014

You are making a new client? It looks from the code like you're reusing it.

@lsegal
Contributor

lsegal commented Jan 30, 2014

Clients do not share any state between requests. Or should not, anyway. It would be very hard for them to do so, given that all request state is in the Request object.

@my8bird
Author

my8bird commented Jan 30, 2014

One other thing that seems very relevant to this discussion: this same code (at least the real code) works correctly for hours if I set maxUploads to 1.

I updated the example to show the client correctly.

@mhart
Contributor

mhart commented Jan 30, 2014

I'm not sure I fully understand the code - I don't really know the Q library that well - but it sounds like it has to do with the requests being created a while before they're sent (or at least, received by S3) - so I'm not sure if you're queueing them up too early, or if your bandwidth is really slow or something?

@my8bird
Author

my8bird commented Jan 30, 2014

The code should have N requests in flight at a time and then create a new one after each one finishes. At least that was the intent, and the debug logging I added showed what I expected.

My bandwidth could be very slow for these uploads. If a large file that simply takes longer than 15 minutes to upload can cause this issue, then the problem could be completely explained by my oversaturating the network and causing the uploads to take too long. That would also explain why things work as expected when uploading one at a time.

@mhart
Contributor

mhart commented Jan 30, 2014

You could always test with another async library to see whether it's your usage of Q that might be getting in the way of things. Limiting concurrent requests is pretty easy with the async module:

var async = require('async')
var aws = require('aws-sdk')
var s3 = new aws.S3()
var fileBuffers = // ...
var maxUploads = // ...

function putFile(file, cb) {
  s3.putObject({Bucket: 'bucket', Body: file, Key: 'some random'}, cb)
}

async.eachLimit(fileBuffers, maxUploads, putFile, console.error)

@lsegal
Contributor

lsegal commented Jan 30, 2014

I am also having trouble understanding, and executing, the code. When I run it, it only uploads the first 5 files and then stops.

I rewrote your example using async.js, which is my preferred parallelization lib:

var AWS = require('aws-sdk');
var async = require('async');

AWS.config.logger = process.stdout;
var s3_client = new AWS.S3({params: {Bucket: 'bucket'}});

function uploadFiles(files, maxUploads, done) {
  async.eachLimit(files, maxUploads, function(file, next) {
    s3_client.putObject({Key: 'key', Body: file}, next);
  }, done);
}

var files = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'];
uploadFiles(files, 5, function() { console.log("Done"); });

If you could reproduce the issue with the above code, that would certainly take out one variable.

As far as your code goes-- my original hunch stands: if Q is building the Request objects and only sending them afterwards, that would cause the older objects to be built too early.

You might also want to try increasing the maxSockets on your HTTP agent (default is 5). My guess is this is how the SDK could end up waiting on other files to be sent. You can do that like so:

var AWS = require('aws-sdk');
var async = require('async');
var https = require('https');

AWS.config.logger = process.stdout;
var agent = new https.Agent();
agent.maxSockets = 20;
var s3_client = new AWS.S3({params: {Bucket: 'bucket'}, httpOptions: {agent: agent}});


function uploadFiles(files, maxUploads, done) {
  async.eachLimit(files, maxUploads, function(file, next) {
    s3_client.putObject({Key: 'key', Body: file}, next);
  }, done);
}

var files = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'];
uploadFiles(files, 5, function() { console.log("Done"); });

@lsegal
Contributor

lsegal commented Jan 30, 2014

@mhart and I are a hive-mind.

@mhart
Contributor

mhart commented Jan 30, 2014

Haha 🔗

@my8bird
Author

my8bird commented Jan 30, 2014

95% certain you just nailed it with the agent.maxSockets thing, since I kept trying to upload 10 at a time. With this setting at 5, I was forcing the later uploads to be "paused" until one of the 5 sockets was available, which on a slow network could take longer than 15 minutes.

I will have to set up a new repo (as I don't want to mess with the family photos) but I will report back.

As an aside, I will try async again, since a helper like that is amazingly useful. I tried it a long time ago and convinced myself I did not need a lib because callbacks were "not that bad". I also ported the Twisted deferred system to JS long ago (https://github.com/my8bird/nodejs-twisted-deferreds) but never actually used it since others had "established" solutions, although every so often I hear from people using it.

@lsegal
Contributor

lsegal commented Jan 30, 2014

I would suggest that if you are going to send in parallel, you should make sure that maxSockets is set to at least the same size as the parallelization amount you are using. In this case, it should be at least 10 if you are uploading 10 at a time.

I can look into other solutions in the SDK, but the only way we could fix this would be to wait until the connection opens to sign the request. While this is technically a logical approach, it may be a backwards-incompatible change. I will also talk to the maintainers of our other SDKs to see how this issue might be handled there.

@my8bird
Author

my8bird commented Feb 4, 2014

I have done some more testing and my issue was definitely the maxSockets setting, so I was right that something was cutting into the parallelism of the code, but wrong about where.

Perhaps this setting could be documented somewhere?

At any rate, I think this issue is resolved.

@my8bird my8bird closed this as completed Feb 4, 2014
lsegal added a commit that referenced this issue Mar 25, 2014
Fixes the root cause of #221. When maxSockets is set to a value lower than the
number of parallel requests, some of those requests can be queued for
"a long time". Because signing happens before those requests are queued, their
15 minute signing window can time out before they ever hit the 'send' event.

This change adds an extra check to the 'send' event to force a re-sign of
the request if more than 10 minutes have passed since the sign event was triggered.
This should allow multiple requests to be queued for an indefinite period of
time.
@lsegal
Contributor

lsegal commented Mar 25, 2014

@my8bird just an FYI-- I just pushed a change that should allow the SDK to re-sign requests that were queued up for a long period of time. This should allow you to lower your maxSockets setting below your level of parallelism, if you wanted, and should generally provide a much more graceful experience with parallel requests.
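
Roughly, the idea looks like this (an illustrative sketch only -- signedAt and signRequest are hypothetical names, not the SDK's actual internals):

var TEN_MINUTES = 10 * 60 * 1000;

request.on('send', function () {
  // If the request was signed more than 10 minutes ago while waiting in the
  // socket queue, sign it again so the 15-minute window is still open.
  if (Date.now() - request.signedAt > TEN_MINUTES) {
    signRequest(request);            // hypothetical helper
    request.signedAt = Date.now();
  }
});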

@lock

lock bot commented Sep 30, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 30, 2019