Dataset: Loading dataset page fails with exception if original file size is null, eg. some legacy Murray #7205

Closed
kcondon opened this issue Aug 18, 2020 · 10 comments · Fixed by #7581

Comments

@kcondon (Contributor) commented Aug 18, 2020

[Leonid]
Updating the issue, since we had some incorrect assumptions back when it was opened (L.A.):

  1. Yes, the dataset page is bombing if the original size is null in the db for one of the tab. files (when it tries to calculate the total size for "download all").
  2. This has nothing to do with whether it can access S3, or whether the actual saved original exists: there is NO attempt to check the size of the stored physical file at this point. A null in the db simply results in a null pointer exception. (The code that looks up the physical file when the db value is null, and then saves the size in the db, does exist; but only in the Access API.)
  3. For most installations seeing this issue, this would be resolved by running /api/admin/datafiles/integrity/fixmissingoriginalsizes. We recommended (but did not require!) running it as part of the 4.10 release; some installations may have missed it.
  4. The code that throws that null pointer should of course be modified so as not to kill the page. (It should simply treat null as 0; see the sketch after this list.)
  5. Some ancient installations (like ours) may have legacy tab. files with no saved originals. In addition to the fix in 4., we should probably also modify the fixmissingoriginalsizes API so that it does not leave any nulls in the db, but replaces them with -1 where the originals are missing. That's what I did manually in our prod. (And then maybe we could remind people with pre-4.10 legacy installations to run that API.)

All of the above should be very straightforward.
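
A minimal sketch of the null-handling proposed in point 4 above (illustrative only; the helper class and the treatment of the proposed -1 sentinel are assumptions, not Dataverse's actual code):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch, not Dataverse's code: sum per-file sizes for a "download all"
// total, treating a null size (never recorded) and the proposed -1 sentinel
// (original known to be missing) as 0, so the page never hits a null pointer.
public class DownloadSizeSketch {

    static long totalDownloadSize(List<Long> storedSizes) {
        long total = 0L;
        for (Long size : storedSizes) {
            if (size == null || size < 0) {
                continue; // unknown or missing original contributes nothing to the total
            }
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        // one normal file, one legacy file with a null original size, one -1 sentinel
        System.out.println(totalDownloadSize(Arrays.asList(1024L, null, -1L))); // prints 1024
    }
}
```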

[Kevin] Original comment
It was discovered that loading the dataset page fails with an exception when the original file size is null and cannot be calculated. This happened with a Murray dataset on a test instance. It can happen if: 1. a legacy file does not have a saved original; 2. the file system is temporarily inaccessible, so the get-size call fails.

It would seem that this should not be a fatal error in loading the dataset. After consulting with Dev, it seems that this behavior was added along with the download-all-files-in-a-dataset feature, because that feature wants to alert you to the size of the download.


From @qqmyers in #6343:

When Amazon had a recent S3 outage (DDoS attack), I noted that Dataverse can no longer display file pages when S3 is down. Previously, the only issues with S3 being out were related to file upload/download operations, and the Dataverse site itself worked. This issue is fairly trivial - it looks like the file size to be displayed on the page is being pulled from S3 metadata and hence, if S3 is down, the page breaks/shows a 500 error. I'm not sure it's worth fixing, but avoiding 500 errors could be as simple as doing a null check on the file size. (In any case, this issue might help someone debug - 500 errors related to file sizes could be due to S3 being misconfigured/down.)

@landreev (Contributor)

The reason this is happening on the test box where the above was observed is that it does not have the credentials to access the production S3 bucket where the files are stored. So it can't look up the size of the stored original (which hasn't been cached in the db yet for some files in this dataset). The failure is a bit too fatal/catastrophic - but this would not be an issue on a properly configured system either.

@kcondon (Contributor, Author) commented Aug 18, 2020

@landreev so the legacy aspect is definitely not an issue then? I can add the credentials to confirm.

@landreev (Contributor)

Should be easier (and definitely safer!) to simulate the "legacy" case with a QA S3 bucket (by deleting the saved original). I would be a little wary of configuring any test boxes with the prod. credentials.
I can run that test too.

@qqmyers (Member) commented Aug 18, 2020

duplicates #6343

@landreev (Contributor)

Just making a reminder note to re-test the file page as well, since #6343 (now closed) seems to suggest that it was the file.xhtml page that was throwing a 500 with no S3 access; or at least it says "file pages". Worth double-checking.

@qqmyers (Member) commented Aug 18, 2020

My guess on that: the file page tries to show the 'friendlysize'. If the size is null and gets sent to FileSizeChecker.bytesToHumanReadable(), that may trigger a 500 even though the file.xhtml page is trying to do an empty check (jsf:rendered="#{!(empty FilePage.file.friendlySize)}"). If it checked the raw size for the rendering test, it could work (and a similar test could fix the dataset page, where friendly sizes are displayed in the filesFragment).
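
The kind of guard being suggested could look roughly like this (a hedged sketch; the method below only mimics what a helper like FileSizeChecker.bytesToHumanReadable() might do and is not its actual signature or implementation):

```java
// Hedged sketch of a null-safe "friendly size" formatter; names and formatting rules
// are illustrative, not the actual FileSizeChecker implementation.
public class FriendlySizeSketch {

    static String bytesToHumanReadable(Long bytes) {
        if (bytes == null || bytes < 0) {
            return ""; // an empty string lets an "empty"-based rendered check hide the field instead of 500-ing
        }
        final String[] units = {"B", "KB", "MB", "GB", "TB"};
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format("%.1f %s", value, units[unit]);
    }

    public static void main(String[] args) {
        System.out.println(bytesToHumanReadable(123456789L)); // 117.7 MB
        System.out.println(bytesToHumanReadable(null));       // "" (no exception)
    }
}
```

Alternatively, as suggested in the comment above, the jsf:rendered test could check the raw filesize rather than the derived friendly string.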

@landreev (Contributor)

I am admittedly tired and stupid r/n, but I don't think that value, from which the "friendlysize" is calculated, should ever be coming from the S3 metadata... It should always be the value of the "filesize" column from the DataFile db table - no?

(There may well be something else on that page that can be bombing w/ no storage access; not disputing that possibility... some direct compute URL silliness? Anyway, we'll retest and figure it out.)

@qqmyers (Member) commented Aug 18, 2020

Yes - there are two parts. The pages all have problems if the datafile table has a null size.

One way to get that is/was with an S3 direct upload, if S3 is slow/flaky such that the upload succeeds but the subsequent call to S3 to get the metadata for the new object fails (us-east-1 is only eventually consistent, so under load, or a DDoS, you may not get the metadata right away, even though the upload itself succeeded and you have a successful response from S3 to prove it). I think since we saw this problem (and with Kevin's testing of very small files) we've added some delays and retries to make this less likely - see the sketch after this comment.

If there are any other current/historic ways to get a null filesize, they would have the same symptoms. (As with S3, temp files and file stores populate the datafile table once, at DataFile creation, so if there is a file system problem at that point I guess you could get a null. In any case, if it isn't/wasn't set at creation, a null filesize persists until corrected.)
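
A rough sketch of the delay-and-retry idea mentioned above, using the AWS SDK's getObjectMetadata() call (the retry count, delay, and surrounding method are assumptions, not the code actually added to Dataverse):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Rough sketch only: after a direct upload, the metadata lookup for the new object
// may briefly fail under eventual consistency, so retry a few times with a pause
// before giving up and leaving the size null (which the pages must then tolerate).
public class S3SizeLookupSketch {

    static Long lookUpSizeWithRetries(AmazonS3 s3, String bucket, String key) {
        final int maxAttempts = 5;
        final long delayMillis = 2000L;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
                return metadata.getContentLength();
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    return null; // caller has to cope with a null size, which is the point of this issue
                }
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        return null;
    }
}
```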

@landreev (Contributor) commented Aug 18, 2020

So we are talking about a situation where a (temporary) loss of access to S3 during the creation/upload of the file left the files in a messed-up state (with null sizes in the datafile table) - correct?
I.e., we are not talking about the page failing because it actually needs something from S3 in real time. I think I've got it now.

It's a bit of a different situation from what this issue was opened for.
But we could address it the same way it's already addressed for the "saved original" sizes: if the value in the table is NULL, we could open the StorageIO, look up the size, and save it in the db, all in real time.
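
A minimal sketch of that lookup-and-backfill pattern (the StorageLookup and SizeStore interfaces are hypothetical stand-ins for illustration, not Dataverse's StorageIO or JPA code):

```java
// Hypothetical sketch of the "if null in the db, look it up from storage and save it" idea.
public class SizeBackfillSketch {

    interface StorageLookup { long sizeOnStorage(String storageId) throws Exception; }

    interface SizeStore {
        Long getStoredSize(String storageId);
        void saveSize(String storageId, long size);
    }

    static long resolveSize(String storageId, SizeStore db, StorageLookup storage) {
        Long cached = db.getStoredSize(storageId);
        if (cached != null) {
            return cached; // fast path: the size is already recorded in the datafile table
        }
        try {
            long size = storage.sizeOnStorage(storageId); // e.g. an S3 HEAD request or a filesystem stat
            db.saveSize(storageId, size);                 // persist it so the lookup only happens once
            return size;
        } catch (Exception e) {
            return 0L; // storage unreachable or original missing: degrade gracefully instead of a 500
        }
    }
}
```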

landreev changed the title from "Dataset: Loading dataset page fails with exception if original file size is null for some reason and can't be calculated, eg. Murray" to "Dataset: Loading dataset page fails with exception if original file size is null, eg. some legacy Murray" on Jan 27, 2021
@landreev (Contributor)

(I updated the main description of the issue, with the info we learned when helping UVA with this same issue recently)

sekmiller self-assigned this Feb 3, 2021
sekmiller added a commit that referenced this issue Feb 5, 2021
sekmiller removed their assignment Feb 5, 2021
sekmiller added a commit that referenced this issue Feb 5, 2021
kcondon added a commit that referenced this issue Feb 5, 2021