Skip to content

Add "download all" buttons (including size of dataset) to dataset page #7047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 20, 2020

Conversation

pdurbin
Copy link
Member

@pdurbin pdurbin commented Jul 1, 2020

What this PR does / why we need it:

This pull request adds "download all" buttons at the dataset level and shows the size of the dataset.

Which issue(s) this PR closes:

Special notes for your reviewer:

(this should probably be removed, since the issue has been resolved? - At least I'm going to mark it as resolved. -- L.A.)
Riddle me this... why do I get a different "size of dataset" number that what I see for the file? Here's a screenshot:

Screen Shot 2020-07-01 at 3 47 34 PM

Also, there has been discussion of adding a "download all files" API but @djbrooke and I have decided to split this off into a separate chunk, represented by #4529.

Suggestions on how to test this:

  • Test both tabular and non tabular files.
  • Test datasets with and without terms of use.
  • Test datasets with and without a guestbook.
  • Test restricted files.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

I don't see any mockups linked from #6118 but I believe they are our there somewhere. In addition to the screenshot above, here's one for a dataset with no tabular files:

Screen Shot 2020-07-01 at 3 54 03 PM

Is there a release notes update needed for this change?:

I think there's already a release note about general dataset redesign improvements? I can add another note if we want.

Additional documentation:

I didn't touch the User Guide because my understanding is that we are going to update it all at once after the UI changes.

@coveralls
Copy link

coveralls commented Jul 1, 2020

Coverage Status

Coverage decreased (-0.04%) to 19.61% when pulling 174ec43 on 6118-download into 4eda5a1 on develop.

}

public void validateAllFilesForDownloadArchival() {
selectAllFiles();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this selectAllFiles() is what's responsible for the odd UI behavior... when you select "Download" from the Access Dataset dropdown, you get the ZIP download, but now all the files are selected in the files table... that shouldn't be necessary... there has to be a way to get the ZIP without the unexpected UI behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding clearSelection() to the end like this...

     public void validateAllFilesForDownloadArchival() {
         selectAllFiles();
         boolean guestbookRequired = isDownloadPopupRequired();
         boolean downloadOriginal = false;
         validateFilesForDownload(guestbookRequired, downloadOriginal);
+        clearSelection();
     }

... but now it thinks no files are selected:

Screen Shot 2020-07-02 at 1 08 07 PM

So yes, I agree a fix would be nice but I'm not sure what that fix is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the fix be to use a different method than what the one that cares about selected files? i.e

change current method to get selected files and get a list, path that to new private method that takes core of current method but uses this new list.

Create a new method that generates a list and calls new that new private method.

You'd of course have to make sure all the checks for permissions are happening correctly, but I'd expect that to already be in the backend side (i.e. we already shouldn't be relying in the front end to only pass in the permissible values)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scolapasta @mheppler please see 427c883. This pull request is now much bigger but it no longer uses selectedFiles which is what the checkboxes are connected to.

GetDatasetStorageSizeCommand cmd = new GetDatasetStorageSizeCommand(dvRequestService.getDataverseRequest(), dataset, false, GetDatasetStorageSizeCommand.Mode.DOWNLOAD, workingVersion);
try {
long bytes = commandEngine.submit(cmd);
return FileSizeChecker.bytesToHumanReadable(bytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug in the download size displayed. My locally tested datasets are saying the download will be "300 B", but the resulting ZIP is 11 KB, or 11,000 B.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this what I noted when saying "riddle me this" in the description of this pull request (with a screenshot), that there's a mismatch in size. GetDatasetStorageSizeCommand needs discussion in tech hours or #dv-tech (heads up @scolapasta ) or a smaller group. I haven't looked into the implementation but the thing I was wondering about is how we estimate the size of a zip. Text files compress really well. Video files don't. I doubt the number will ever be 100% accurate. But yes, it seems pretty far off right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss next tech hours, sure. But isn't the inaccuracy in the wrong direction?

i.e. the ip should be smaller than the size from GetDatasetStorageSizeCommand? I could imagine a scenario where it would be large (due to zip headers), but in that case, I would expect it to be close in size.

But maybe this is "close" in this scenario.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in tech hours, the problem seems to stem from double counting tabular files (counting both the size of the original file and the archival version). I just pushed a fix in 1957776.

@mheppler
Copy link
Contributor

mheppler commented Jul 2, 2020

Kicked the tires on the analytics a lil more. Built the branch, added style classes for btn-download, and confirmed that worked. Also realized that the explore like was missing the btn-explore style class, so the latest from develop needs to be merged as that style class was added in the recently merged 6938 analytics new buttons #7008 PR.

One additional detail that will need to be tweaked in the analytics code is to resolve how the new Access Dataset > Download clicks are recorded in the Google Analytics reports. The way the code works, it pulls the link text to report what button was clicked, so now that the file format and size was added to the new download link text, it makes into the report, which is going to mean you can not group those events by category (see screenshot).

Screen Shot 2020-07-01 at 11 49 27 PM

@djbrooke djbrooke added this to the Dataverse 5 milestone Jul 7, 2020
@pdurbin
Copy link
Member Author

pdurbin commented Jul 7, 2020

I'm ready for more code review. I addresses the checkbox issue and am now differentiating between sizes for original files vs archival. Here's a screenshot:

Screen Shot 2020-07-07 at 5 18 29 PM

@pdurbin pdurbin removed their assignment Jul 7, 2020
@landreev landreev self-requested a review July 8, 2020 17:58
@landreev landreev self-assigned this Jul 8, 2020
@djbrooke
Copy link
Contributor

djbrooke commented Jul 13, 2020

We discussed and decided to not make the change below. While it would provide a path w/ one less click to a very common use case, it would take away from the scalability of the "Access Dataset" umbrella and would require a potential revisitation in the future.


One more change to make here (it was added to another issue by mistake).

Screen Shot 2020-07-13 at 11 58 14 AM

@kcondon kcondon self-assigned this Jul 13, 2020
@kcondon
Copy link
Contributor

kcondon commented Jul 13, 2020

@pdurbin Any resolution to the file size download all mismatch? I'm seeing the zip download is larger than the individual files.

@pdurbin
Copy link
Member Author

pdurbin commented Jul 13, 2020

@kcondon I thought the size estimate was better as of 1957776 (before that, tabular files were being counted twice). Can you please post a screenshot of the problem you're seeing?

@kcondon
Copy link
Contributor

kcondon commented Jul 13, 2020

After rereading and discussing a bit with Leonid, the download size is the uncompressed size (size of the dataset), so issue 2 is a non issue.

  1. the download size for both tab and original are off but the tab is off by a lot.
    2. the actual size of the zip file is considerably smaller than what's presented in the download, so I am guessing it refers to the uncompressed size?
  2. did find that for displaying original/archive options, it appears to check the dataset level for tabular rather than the version you are on since it will also display the option when the current version does not contain tabular file but was deleted and therefore exists in a prior version.

Otherwise download all appears to work. Leonid made a suggestion below on calling out the uncompressed size.

@kcondon
Copy link
Contributor

kcondon commented Jul 13, 2020

image

@kcondon
Copy link
Contributor

kcondon commented Jul 13, 2020

This is a screen shot using FF, saving the zip to windows, downloading the original format:
image

This is a screen shot using FF, saving zip to windows, downloading the archival format:
image

@kcondon kcondon assigned pdurbin and unassigned kcondon Jul 13, 2020
@landreev
Copy link
Contributor

So there are at least two issues:

1. the download size for both tab and original are off but the tab is off by a lot.

2. the actual size of the zip file is considerably smaller than what's presented in the download, so I am guessing it refers to the uncompressed size?

Otherwise download all appears to work

Yes on 2. - that's not really an issue; the size the user is seeing in the download dialog is the compressed size of the zip; we want to report the sizes of the actual files.

As for 1., looking at the sizes in the pull down menu in the example above:
the size for the "original" format appears to be correct (62.3KB); but for the "tabular" it does appear to be off - actually, it looks like it's off by exactly the size of the original; i.e., it keeps counting both format sizes, instead of counting just the tabular size.

@djbrooke djbrooke assigned TaniaSchlatter and djbrooke and unassigned djbrooke Jul 14, 2020
@mheppler mheppler removed their assignment Jul 16, 2020
@pdurbin pdurbin self-assigned this Jul 16, 2020
@pdurbin
Copy link
Member Author

pdurbin commented Jul 16, 2020

In 7e79f1b I addressed the problem tabular download options showing for versions of datasets that no longer have tabular files (because they were deleted and then the dataset was republished).

In 115353c I took another swing at getting GetDatasetStorageSizeCommand to behave properly and I thought I had it working but while testing I found further problems and weirdness so I ultimately decided to write a small command that does only the job I want. It seems to work fine so I switched over to it. See c13a5cc. I'm not doing any error checking within that small method (getDownloadSize in DatasetUtil) but I'm happy to add some if we're worried about nulls in the database for the size of files.

I'm moving this to code review.

@djbrooke and @TaniaSchlatter I noticed you're still assigned. I deployed the code here if you'd like to check it out but there are no UI changes: http://ec2-54-160-66-157.compute-1.amazonaws.com

@pdurbin pdurbin removed their assignment Jul 16, 2020
@mheppler
Copy link
Contributor

mheppler commented Jul 16, 2020

Told Phil that I trusted his work, but I still went and confirmed the download sizes that I had poked at early. Looking much, much better. Didn't test anything larger than the 2.1 MB download here, but it might be worth testing if a 210 MB download is off by 20 MB and determining if that is worth trying to fix.

Screen Shot 2020-07-16 at 5 00 27 PM

@djbrooke djbrooke removed their assignment Jul 20, 2020
@landreev
Copy link
Contributor

@mheppler @pdurbin The numbers above look right to me. What looks like a discrepancy - the fact that you're seeing "2,252,470 bytes (2.3 MB on disk)" in the Apple info box, but "2.1 MB" on the Dataverse side appears to be due to the fact that Apple is interpreting "1 MB" as "1,000,000 bytes"; and Dataverse is (more correctly!) using the "1MB = 1024 * 1024 bytes" definition. What we are doing is more correct. But what Apple is doing is somewhat OK too - both definitions are widely used (unfortunately). And real life users are used to seeing discrepancies between the 2 definitions by now.

Can I move it to QA?

@kcondon kcondon self-assigned this Jul 20, 2020
@kcondon
Copy link
Contributor

kcondon commented Jul 20, 2020

@landreev One thing I learned from my networking class is the 1,000,000 bytes value is used to count network traffic and the 1024 style is for disk storage. I think. Will need to reference my class notes ;)

@mheppler
Copy link
Contributor

@landreev @kcondon if that difference between the two math formulas is true, then I suspect we want to use the storage version of the number to display here in the UI for the end user.

@kcondon
Copy link
Contributor

kcondon commented Jul 20, 2020

@pdurbin @landreev @mheppler So, it looks pretty good -I'm fine with file sizes. However, I did notice one issue:
the old style download all button still presents orig and archival formats when tab file is removed in the current version. While testing shows this exists in 4.20, what is different is when you choose download original, it throws an error on download:
{"status":"ERROR","code":404,"message":"'/api/v1/access/datafile/53792' datafile access error: requested optional service (image scaling, format conversion, etc.) is not supported on this datafile."}

OK, need to do a little more testing on the above since I had also done a replace and restrict file in same dataset and not seeing behavior on just removal of tab.

So, kind of a weird issue where part of it preexists but is actually worse now. Maybe the same fix that was applied to the access download all would work here too?

Update: This issue exists in 4.20 but is very narrow: requires a tabular file removed, download all original format, as restricted file with terms of use (not just terms of access) enabled.

I think there may be a use case here where what we really need to do is confirm md5's or sha's of downloaded files are the same as what is shown on server to confirm downloads are intact. However, since that would need to be run on client side, exercise is up to user but maybe some messaging to suggest that at some point? Out of scope now, I'd imagine and maybe obvious to people who work in this space.

@kcondon
Copy link
Contributor

kcondon commented Jul 20, 2020

@pdurbin @landreev @mheppler I've found a weird, preexisting bug with old download all that is technically out of scope for this issue but I'm wondering whether the same fix for hiding tab download options when tab deleted in current version might be easy to add to it?
The bug is when you have deleted tab file in existing version, have a restricted file with terms of use, not just terms of access enabled, do not yet have access to restricted file, and choose download all but original version only. Kind of narrow but throws ugly white error screen and nothing is downloaded: {"status":"ERROR","code":404,"message":"'/api/v1/access/datafile/53792' datafile access error: requested optional service (image scaling, format conversion, etc.) is not supported on this datafile."}

@kcondon kcondon removed their assignment Jul 20, 2020
@landreev
Copy link
Contributor

@kcondon What's "old download all"? Do you mean selecting all the checkboxes and clicking "download" in the upper right corner of the files table?
I'm quite confused by the condition you are describing. If it's something pre-existing, I'm surprised nobody has noticed before.

@landreev
Copy link
Contributor

@pdurbin @kcondon @mheppler
I'm assuming this is the test dataset in question: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FSJSEZX&version=3.1

Yeah, I can see the "original" and "tabular" options under the old style multi-download button. I assumed earlier, that the way it used to work, there was never a real "download all", i.e. it was always a download of the specified list of files from the list of the selected checkboxes... and that way the above condition would never happen, in a version that has no tabular files. But we must have added an optimization at some point - I now have some vague memories of it - for that "select all" checkbox, that actually bypasses the checkboxes and goes to the complete list of files when "all" is selected... So yes, it is probably the same logic error - searching for tabular files in the entire dataset; instead of the current version. And could probably be fixed just as easily. 
Whether it should be fixed as part of this PR, seeing how it is pre-existing and in a different code section - I don't have a strong opinion.

@landreev
Copy link
Contributor

@pdurbin @kcondon @mheppler

Oh, and here's the reason you get that weird API error: because one of the files is restricted, and there are 2 files total, in the end there is only 1 file in the list - and Dataverse is smart enough to then determine that this is NOT a multiple file download case; so it redirects it to the single file download API. apparently it is not smart enough to then drop the "format=original" parameter - and the single download API treats that as a fatal error, when a format that's not applicable to the current file is requested.
This is annoying - but is indeed a fairly edge case....

@kcondon
Copy link
Contributor

kcondon commented Jul 20, 2020

@landreev @pdurbin @mheppler Leonid, thanks for confirming! I don't believe it is technically in scope but the question I was asking is whether it would be easy to adopt the solution for not showing download original/archive on versions that do not have tabular files, like Phil did already. If not, I'll just open a new issue.

@landreev
Copy link
Contributor

As I just mentioned to @kcondon and @djbrooke , I'm totally fine with merging this PR; and then using #6972 to fix the pre-existing conditions described above. Any objections?

@kcondon
Copy link
Contributor

kcondon commented Jul 20, 2020

Going once, twice...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Display the total size of a dataset on the dataset page Make download all files feature more prominent
8 participants