Faster downloads #798

BryceStevenWilley · 2023-10-18T14:32:46Z

Speeds up the downloads by directly downloading the file instead of trying to rely on response callbacks, which almost always timeout.

I made things a bit resilient; I think things work fine without all of the failed_to_download stuff (see https://github.com/SuffolkLITLab/ALKiln/actions/runs/6556325525/job/17806135932, an earlier test run without that code), but figured I'd keep the existing code in, in case there are situations we couldn't foresee. Happy to rip out the old code if we want to though.

I tried a lot of different options, none of them seemed to work particularly well:

https.get doesn't work with http://localhost (might be a me problem, but IMO, it's frustrating), the workaround is hacky and uses both https.get and http.get.
- I also had issues getting the file to download still? Just kept getting File Not Found. I think it's because DA by default checks the cookies / logged in status of who's downloading a file (that's why you need to use url_for(private=False)). So in general, trying to access the file from outside of the browser session is going to be tricky, if not impossible.
You can't pass back a blob of buffer object directly from the browser (Cannot return blob or arrayBuffer from page.evaluate puppeteer/puppeteer#3722). I took the suggested code from that issue and used it here.

In this PR, I have:

Ensured issues that this PR closes will be automatically closed

Links to any solved or related issues

#492

Any manual testing I have done to ensure my PR is working

Running a test that downloads a PDF (@o17) runs in 3.4 seconds now, as opposed to 33 seconds on v5.

plocket

Speeding this up will be a huge win. I think we may need to check on some stuff first as this may be a bit more complex than it would first appear, but maybe I'm off - I feel like I'm missing some of this picture. Anyway, some notes here and some in the review comments. At least some of them are relevant.

Numbered for easier referencing:

Do we really need to update shrinkwrap?
Added some comments about adding messages to let the author know what's going on.
Should we add something in the report when the download has succeeded? Before we didn't know if it had succeeded, but now we should.
I think this would be especially good to add to the changelog as an internal change.
Does this change handle the timeout running out? Previously, we'd go till the timeout and just continue the tests regardless of what happened (since we couldn't tell what happened). If the files didn't download, there wasn't anything to do about it. Now, though, we may need to handle the timeout if our await goes too long. Not sure if we'd want to error or warn at that point. We may want to add a test for whatever behavior we choose - make a huge image file or something to download.
When comparing downloads, did we make a provision for if/when the download doesn't exist (since it may not complete in time)? ~~I'll go check when I have a mo.~~ It looks like we didn't, so we might need to take care of that in another PR fairly soon.

lib/scope.js

BryceStevenWilley · 2023-10-18T21:17:10Z

Do we really need to update shrinkwrap?

That happens when I run npm install, which I needed to do after #755 merged.

Should we add something in the report when the download has succeeded? Before we didn't know if it had succeeded, but now we should.
I think this would be especially good to add to the changelog as an internal change.

👍 (I've been waiting for changelog stuff when I put out multiple PRs b/c I don't know which will go out first)

Does this change handle the timeout running out? Previously, we'd go till the timeout and just continue the tests regardless of what happened (since we couldn't tell what happened). If the files didn't download, there wasn't anything to do about it. Now, though, we may need to handle the timeout if our await goes too long. Not sure if we'd want to error or warn at that point. We may want to add a test for whatever behavior we choose - make a huge image file or something to download.

Good points. IMO I'd rather error, since if we tried to download something and didn't download it, something's likely wrong with the server (testing might be hard; even really big stuff will download quickly on the same machine).

When comparing downloads, did we make a provision for if/when the download doesn't exist (since it may not complete in time)?

This will exist though? It completes in this step (unless there's a timeout, and I that case I think we should error, as mentioned above).

plocket · 2023-10-19T01:59:07Z

shrinkwrap

I think we could start using npm ci for this. It's meant for pipelines, but I think it would serve the purpose we're looking for - installing a consistent version based on the lockfile each time. When we need to update for security or for new packages we would, of course, just npm install.

Another option is to use npm install --no-save in future to avoid updating shrinkwrap, but that unfortunately doesn't really make things stable - the shrinkwrap file doesn't change, but your local environment is still different than what the shrinkwrap uses. It looks like we all end up using different shrinkwraps.

I wonder if item 4 here indicates that what happened to your shrinkwrap is a bug and may be worth reporting.

IMO I'd rather error

Sounds good to me 👍

testing might be hard

Yeah, we probably won't be able to write tests that will consistently pass (by failing) until we do some research about implementing throttling.

[The download] completes in this step [unless it errors]

Good point. I can still see a situation where someone would forget to include a download step before comparing. We might handle that in a different issue, though.

plocket · 2023-10-19T02:03:23Z

Not sure if you saw that note about a success message, so just repeating that here.

BryceStevenWilley · 2023-10-19T02:32:03Z

I've been a bit lenient about shrinkwrap changes. I'll avoid them more in the future (though this time it's because the version in package.json on main is still 5.1.7-sources-7, like I pointed out in #755 (comment) 👀 , not a bug).

lib/scope.js

plocket · 2023-10-19T14:02:30Z

testing might be hard; even really big stuff will download quickly on the same machine

It occurred to me that we can decrease the max timeout for the Step to a fraction of a second and thus trigger a timeout. Would that work for our purposes? It's only one type of error, but that may still be worth testing.

BryceStevenWilley · 2023-10-19T15:04:55Z

It occurred to me that we can decrease the max timeout for the Step to a fraction of a second and thus trigger a timeout.

Good point, I can try that! IMO, we can try to manually and see how it does. I don't think we need to add it to the test suite though.

lib/scope.js

plocket

This is looking pretty good to me! I think once you feel comfortable with it, we can send it out into the wild!

BryceStevenWilley · 2023-10-20T19:50:54Z

It occurred to me that we can decrease the max timeout for the Step to a fraction of a second and thus trigger a timeout.

I did this, and got the following printout in the cucumber report:

1) Scenario: I compare the same PDFs (attempt 2) # docassemble/ALKilnTests/data/sources/observation_steps.feature:220
   ✔ Before # lib/steps.js:104
   ✔ Given I start the interview at "test_pdf" # lib/steps.js:165
   ✔ Then the question id should be "proxy vars" # lib/steps.js:453
   ✔ When I set the var "x[i].name.first" to "Proxyname1" # lib/steps.js:939
   ✔ And I tap to continue # lib/steps.js:740
   ✔ Then I sign # lib/steps.js:984
   ✔ And I tap to continue # lib/steps.js:740
   ✔ Then the question id should be "simple doc" # lib/steps.js:453
   ✔ Given the max seconds for each step in this scenario is 0.1 # lib/steps.js:279
   ✖ When I download "simple-doc.pdf" # lib/steps.js:999
       Error: Action did not complete within 0 milliseconds
           at Timeout.<anonymous> (/home/brycew/Developer/LITLab/code/ALKiln/lib/cucumber_8_shim.js:18:14)
           at listOnTimeout (node:internal/timers:569:17)
           at process.processTimers (node:internal/timers:512:7)
   - And I expect the baseline PDF "simple-doc-Baseline.pdf" and the new PDF "simple-doc.pdf" to be the same # lib/steps.js:708
   ✔ After # lib/steps.js:1113

The PDF file did actually download fine, because I guess the download had enough time to start. The conflicting states isn't great, but I'm assuming that the presence of a clear "Action did not complete within " message is good enough for people. I tried uploading a larger PDF to do this with, but ran into issues (DA doesn't like too big of files, and it doesn't like PDFs without fields, all of the big PDFs I have don't have fields and it'd be difficult to add them to it).

I'll add to the changelog now and merge!

The existing `content-disposition` doesn't work because there are no content disposition headers. From https://thom4.net/2018/puppeteer-download-file/

puppeteer/puppeteer#3722

* revert shrinkwrap changes * get back more detailed error info from browser download * write both success and failure to download with new method to report

Co-authored-by: plocket <52798256+plocket@users.noreply.github.com>

BryceStevenWilley requested a review from plocket as a code owner October 18, 2023 14:32

plocket reviewed Oct 18, 2023

View reviewed changes

lib/scope.js Outdated Show resolved Hide resolved

lib/scope.js Outdated Show resolved Hide resolved

lib/scope.js Show resolved Hide resolved

lib/scope.js Show resolved Hide resolved

plocket mentioned this pull request Oct 19, 2023

In the download Step, allow the author to give the downloaded file a custom name #800

Closed

plocket reviewed Oct 19, 2023

View reviewed changes

lib/scope.js Outdated Show resolved Hide resolved

plocket reviewed Oct 19, 2023

View reviewed changes

lib/scope.js Outdated Show resolved Hide resolved

plocket approved these changes Oct 19, 2023

View reviewed changes

BryceStevenWilley and others added 8 commits October 20, 2023 15:58

Faster downloads using in-page fetch

2527e05

The existing `content-disposition` doesn't work because there are no content disposition headers. From https://thom4.net/2018/puppeteer-download-file/

Actually got working again

365475e

puppeteer/puppeteer#3722

Be more resillient when downloading PDF

c48d2a7

Remove old console logs

4ad3476

Address PR comments

fd9dadd

* revert shrinkwrap changes * get back more detailed error info from browser download * write both success and failure to download with new method to report

Clarify "Using" in report message

f480b22

Co-authored-by: plocket <52798256+plocket@users.noreply.github.com>

correct typos

f6a3b7e

Co-authored-by: plocket <52798256+plocket@users.noreply.github.com>

Add to changelog

35788ad

BryceStevenWilley force-pushed the faster_downloads branch from 06f2bd3 to 35788ad Compare October 20, 2023 19:59

BryceStevenWilley merged commit 7075674 into v5 Oct 20, 2023
2 checks passed

BryceStevenWilley deleted the faster_downloads branch October 20, 2023 20:23

plocket mentioned this pull request Nov 21, 2024

Bug: Downloads take the max time, they don't end the step when they're done downloading (I think) #492

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster downloads #798

Faster downloads #798

BryceStevenWilley commented Oct 18, 2023

plocket left a comment •

edited

Loading

BryceStevenWilley commented Oct 18, 2023

plocket commented Oct 19, 2023

plocket commented Oct 19, 2023

BryceStevenWilley commented Oct 19, 2023

plocket commented Oct 19, 2023

BryceStevenWilley commented Oct 19, 2023

plocket left a comment

BryceStevenWilley commented Oct 20, 2023

Faster downloads #798

Faster downloads #798

Conversation

BryceStevenWilley commented Oct 18, 2023

In this PR, I have:

Links to any solved or related issues

Any manual testing I have done to ensure my PR is working

plocket left a comment • edited Loading

Choose a reason for hiding this comment

BryceStevenWilley commented Oct 18, 2023

plocket commented Oct 19, 2023

plocket commented Oct 19, 2023

BryceStevenWilley commented Oct 19, 2023

plocket commented Oct 19, 2023

BryceStevenWilley commented Oct 19, 2023

plocket left a comment

Choose a reason for hiding this comment

BryceStevenWilley commented Oct 20, 2023

plocket left a comment •

edited

Loading