File Upload: Intermittent Failure to upload files on dataset create using direct to s3 #6829
Not sure I can debug this unless I can reproduce it. A couple of thoughts to start: on the server side, the 'Retrying after:' message itself is not fatal. The code retries up to 20 times, every three seconds, to see if S3 is reporting that the file exists yet (there can be a delay in us-east-1 between when the file is created on S3 and when calls to get it/its metadata will succeed; if that delay is greater than one minute, it's surprising, but the log should show 20 retries in that case). The 'Cannot get S3 object ...' line could be caused by all 20 retries failing, but the next line, 'Could not find storage driver for: s3://iqssqa:16ea3cbba0c-638d1e50ec5e', might suggest another cause. That happens when the code strips the leading 's3' off and can't find the storage type, which it looks up via the dataverse.files.s3.type=s3 JVM option. Could that option not be set on this instance?
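The polling behavior described above can be sketched as a small retry loop. This is a hypothetical illustration, not the actual Dataverse code: the method name `waitForObject` and the `BooleanSupplier` stand-in for the real S3 metadata check are assumptions made for the example.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of the retry pattern described above: poll an
 * existence check (in the real code, an S3 object/metadata lookup)
 * up to maxRetries times, sleeping between attempts.
 */
public class S3ExistenceRetry {

    static boolean waitForObject(BooleanSupplier exists, int maxRetries, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (exists.getAsBoolean()) {
                return true;  // S3 now reports the object as present
            }
            // Mirrors the non-fatal 'Retrying after:' log line from the issue
            System.out.println("Retrying after: " + sleepMillis + " ms (attempt " + attempt + ")");
            Thread.sleep(sleepMillis);
        }
        return false;  // gave up after maxRetries * sleepMillis total wait
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate an object that only becomes visible on the third check,
        // as can happen with delayed read-after-write in us-east-1.
        final int[] calls = {0};
        boolean found = waitForObject(() -> ++calls[0] >= 3, 20, 1);
        System.out.println("found=" + found + " after " + calls[0] + " checks");
    }
}
```

With 20 retries at three seconds each, the real loop waits up to a minute before giving up, which matches the behavior reported later in this thread.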
@qqmyers I can retest with your suggestions. This is the same box and config I used to do the original multistore testing.
Correction: I needed to wait longer than expected (there was no indication it was doing anything), but it was doing the 20 retries, and then 2 of the 3 files eventually uploaded.
The server log looked like this:
Second test: it worked as expected.
Console output for the test 3 failure above:
@qqmyers I've repeated the tests and tried to capture the console and server log output. Jim, please note the three files I'm testing with are tiny.
Given the 'Upload complete for ' messages, which indicate a successful upload to S3, and the 20 retries, it looks like the root cause is that Amazon has just gotten really slow in making the new object available. (I'm not sure what the exception in the log means for the last trial - is it the view timing out?) One thing to fix would be to avoid having the dataset creation fail. Beyond that, we could look at using longer timeouts, but if we're already seeing a view issue, that won't work (and how long should we wait?). One other test that could be run: other, non-us-east-1 regions guarantee you won't get a 404 after creating a new object, so using a bucket in one of those regions would show whether there's some other cause here or if it is just Amazon slowing down lately.
@qqmyers Would it help to make a debug build with lots of extra logging like we did during testing? |
OK - found a real error as well: a race condition. In create mode, when the UI requests a signed direct-upload URL, we need to know the path in the bucket, which means the dataset (new in create mode) needs a global identifier. I had been checking for a null identifier in the URL request method and setting it then. However, it looks like, with the UI processing files in parallel, it is possible for two calls to get URLs to both see a dataset with a null globalId, resulting in it being set twice, with files being sent to two different paths, one of which can't later be found (since the rest of the code assumes the globalId is constant). I saw this happen on EC2 with a new S3 bucket (where it was easy to see two paths after one 'new dataset' call). To fix this, I've switched to assigning the globalId during the DatasetPage init call (if direct upload is enabled) so it is always set before any URL request. That change is now in the PR.
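The race above is a classic unsynchronized check-then-act. The sketch below illustrates both the hazard and the fix described (assign the identifier eagerly at init, before any URL request runs). All names here (`Dataset`, `mintIdentifier`, `requestUploadUrlRacy`, the DOI pattern) are hypothetical stand-ins, not the actual Dataverse classes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Illustration (not the actual Dataverse code) of the race described
 * above: two parallel URL requests can both observe a null identifier
 * and each assign a fresh one, sending uploads to two different paths.
 */
public class GlobalIdRace {

    static class Dataset {
        volatile String globalId;  // null until assigned
    }

    static int counter = 0;

    static synchronized String mintIdentifier() {
        return "doi:10.5072/FK2/" + (++counter);  // hypothetical PID pattern
    }

    // Racy version: unsynchronized check-then-act. Two threads can both
    // pass the null check, both mint an identifier, and the last write wins.
    static String requestUploadUrlRacy(Dataset d) {
        if (d.globalId == null) {
            d.globalId = mintIdentifier();
        }
        return "s3://bucket/" + d.globalId + "/file";
    }

    public static void main(String[] args) throws Exception {
        // The fix: assign the globalId once, eagerly, during page init,
        // so every later URL request sees the same value.
        Dataset d = new Dataset();
        d.globalId = mintIdentifier();

        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<String> a = pool.submit(() -> requestUploadUrlRacy(d));
        Future<String> b = pool.submit(() -> requestUploadUrlRacy(d));
        System.out.println("same path: " + a.get().equals(b.get()));
        pool.shutdown();
    }
}
```

Eager assignment sidesteps the race entirely; the alternative of synchronizing the null check would also work but adds locking to every URL request.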
This had been working but is now failing consistently, with different symptoms: uploading files direct to S3 works if the dataset is already created but fails when files are uploaded while the dataset is being created.
There are two types of failures observed:
Other, possibly related server log errors: