Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate API key to inform dev of errors caused by server reload #392

Closed
plocket opened this issue Dec 3, 2021 · 27 comments · Fixed by #570
Closed

Integrate API key to inform dev of errors caused by server reload #392

plocket opened this issue Dec 3, 2021 · 27 comments · Fixed by #570
Assignees
Labels
priority A combination of urgency and impact

Comments

@plocket
Copy link
Collaborator

plocket commented Dec 3, 2021

Problem: When someone on the server edits a module, pulls in a repo with a module, or does various other things, the server restarts. That takes a while and causes test timeout errors - false failures [for other tests - they can timeout]. A test running, in fact, can cause a server reload timeout error.

Solution: There is a way to ping the API to see if the server is reloading/busy. Use that to detect when retrying is more appropriate than failing. This needs to be done wherever the test is going to have to wait for a page to load.

This will bump us to v4. We could take extra care to avoid breaking v3, but the complication introduced by that doesn't seem worth it.

We expect this change to take a while because of all the associated tasks and limitations. For example, currently, we don't know how to use the API at all, so this is going to involve that learning as well.

Things to implement before this will be ready for release:

  1. Adjust the setup interview to take people through getting an API key.
  2. Adjust the setup code to create the correct GitHub secret.
  3. Code the API for the setup [and takedown] steps.
  4. Code the API for running the tests.
  5. [Remove use of username and password (not required, but nice). e.g. Use the API to set up tests as well? e.g. getting environment variables that we'd otherwise need to ask the user for? Though now it's seeming more and more likely that we won't need the ones that we can get through the API at runtime anyway.]

Things to do after implementation:

  1. Find all the interviews using this testing framework.
  2. Give them the appropriate GitHub secret.
  3. Update their versions.

I think, though, that folks using this framework all use organizations (it's probably just us) and we will only need to adjust the organization secrets to make this work, so we can do it all in one go. [The packages that have repo secrets that override the GitHub secrets may have a problem. There are some of those for suffolk that are using the dev server instead of the test server. That said, when we update their script versions, admins can also check their secrets.]

See #389 (comment), "Option 2".

@plocket plocket added the priority A combination of urgency and impact label Dec 3, 2021
@plocket
Copy link
Collaborator Author

plocket commented Dec 6, 2021

Server restart

Detecting that there isn't a problem with the server:

...the API for installing a package returns a code that can be passed to the package_update_status endpoint, which will return completed when the server has finished restarting. See https://docassemble.org/docs/api.html#package_update_status

Also, when polling that endpoint, I recommend using a short timeout and retry the API call in a loop. Depending on how busy the server is, a call to the package_update_status API may time out, so it is best to poll it in a loop, with the assumption that the server may not be able to respond to the status request while it is in the middle of restarting.

Setup

Create a project: https://docassemble.org/docs/api.html#playground_post_project

Install code: https://docassemble.org/docs/api.html#playground_install (do allow default restart). Doesn't currently seem to allow pulling from GitHub, so we'll have to think about that.

Delete a project: https://docassemble.org/docs/api.html#playground_delete_project

Making an API key instructions: https://docassemble.org/docs/api.html#manage_api

@plocket
Copy link
Collaborator Author

plocket commented Dec 6, 2021

Create a project

Generated by insomnia

const http = require("https");

const options = {
  "method": "POST",
  "hostname": "apps-dev.suffolklitlab.org",
  "port": null,
  "path": "/api/playground/project?key=myKey&=",
  "headers": {
    "cookie": "session=aSessionID",
    "Content-Type": "multipart/form-data; boundary=---011000010111000001101001",
    "Content-Length": "0"
  }
};

const req = http.request(options, function (res) {
  const chunks = [];
  res.on("data", function (chunk) { chunks.push(chunk); });

  res.on("end", function () {
    const body = Buffer.concat(chunks);
    console.log(body.toString());
  });
});

req.write("-----011000010111000001101001\r\nContent-Disposition: form-data; name=\"project\"\r\n\r\nAPIProject7\r\n-----011000010111000001101001--\r\n");
req.end();

My untested version

Maybe needs await added in there somewhere. I'm not sure.

const http = require("https");

const options = {
  "method": "POST",
  "hostname": "apps-dev.suffolklitlab.org",
  "path": "/api/playground/project?key=myKey&=projectName",
};

const req = http.request(options, function (res) {
  const chunks = [];
  res.on("data", function (chunk) { chunks.push(chunk); });

  res.on("end", function () {
    const body = Buffer.concat(chunks);
    console.log(body.toString());
  });
});

req.end();

Delete a project

Insomnia didn't work for some reason. The curl it generated looked weird. Why does it want to keep making cookies?

My assumption

The same except with DELETE instead of POST

Detect server status

It doesn't actually detect server status. It detects the package update status. And I'm not even sure if that means a project package update. It might be enough that it detects whether the server responds at all, but that ping requires a package update code and I'm not sure how to get that code if it's not a project package update.

@plocket plocket added the blocked There's a reason this work can't move forward right now label Dec 8, 2021
@plocket
Copy link
Collaborator Author

plocket commented Dec 11, 2021

You can get the user id this way, so we won't need that info either: https://docassemble.org/docs/api.html#user_retrieve

Path: /api/user
Method: GET
Parameters:
key: the API key (optional if the API key is passed in an X-API-Key cookie or header).

Get the returned JSON's id value.

@plocket
Copy link
Collaborator Author

plocket commented Dec 12, 2021

The API is now available!

https://docassemble.org/docs/changelog.html 1.3.9 - 2021-12-11

Added
The /api/playground_pull endpoint for pulling a package into a Playground.
The /api/restart_status endpoint for monitoring the status of a server restart.

We can start working on v4 API stuff.

@plocket plocket removed the blocked There's a reason this work can't move forward right now label Dec 12, 2021
@plocket plocket self-assigned this Dec 12, 2021
@plocket plocket added this to the v4 milestone Dec 12, 2021
@plocket
Copy link
Collaborator Author

plocket commented Dec 12, 2021

Actually, we can't do this until our server updates to v1.3.9, which really shouldn't happen till the work week starts.

@plocket plocket added the blocked There's a reason this work can't move forward right now label Dec 12, 2021
@plocket
Copy link
Collaborator Author

plocket commented Dec 12, 2021

I already said this :P API has been updated! Pulling from github: https://docassemble.org/docs/api.html

@plocket plocket added blocked There's a reason this work can't move forward right now and removed blocked There's a reason this work can't move forward right now labels Dec 12, 2021
@plocket
Copy link
Collaborator Author

plocket commented Dec 17, 2021

No longer blocked, now working on integration.

There's been another adjustment where the requester (us) no longer needs to guess at the right address to send depending on the repo being private or not or GitHub being integrated for the da account or not. I'll start moving forward where I can, but further testing will be needed when we implement that.

@plocket
Copy link
Collaborator Author

plocket commented Dec 17, 2021

Pull a repo

Note: as of 1.3.10, url format shouldn't matter. It can be ssh or non-ssh.

https://docassemble.org/docs/api.html#playground_pull

public/owner/no ssh/no integration

curl --insecure --request POST --url https://apps-dev.suffolklitlab.org/api/playground_pull -d key=someKey -d project=APIPull1 -d github_url=https://github.com/plocket/docassemble-ALAutomatedTestingTests -d branch=157_feedback

Right result: { "task_id": "cynAqWnbeJedvwckhJSewESK" }, repo and branch correctly pulled

public/owner/try ssh, but have no integration

curl --insecure --request POST --url https://apps-dev.suffolklitlab.org/api/playground_pull -d key=someKey -d project=APIPull1 -d github_url=git@github.com:plocket/docassemble-ALAutomatedTestingTests.git -d branch=main

Right result: "Pull process encountered an error: error running git clone. Cloning into 'docassemble-ALAutomatedTestingTests'...\nHost key verification failed.\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

public/owner/has integration/da email DOES match github email

??

public/owner/integrated/da email doesn't match github email

curl --insecure --request POST --url https://apps-dev.suffolklitlab.org/api/playground_pull -d key=someKey -d project=APIPull1 -d github_url=git@github.com:plocket/docassemble-ALAutomatedTestingTests.git -d branch=main

Right result: { "task_id": "WlbCYHcSmYNUMzSdLOmtnIIR" } and code was pulled in

private/owner/non-ssh url/no integration

??

private/owner/ssh url/no integration

?? url format shouldn't matter a of 1.3.10

private/owner/has github integration/ssh url/da email DOES match github email

??

private/owner/has github integration/ssh url/no match between da email and github email

curl --insecure --request POST --url https://apps-dev.suffolklitlab.org/api/playground_pull -d key=someKey -d project=APIPull1 -d github_url=git@github.com:plocket/docassemble-TestPrivateRepo.git -d branch=main

Right result: nothing returned, but project pulled in (it doesn't have a module, so no restart was required I guess)

private/collaborator/ssh url/no match between da email and github email

curl --insecure --request POST --url https://apps-dev.suffolklitlab.org/api/playground_pull -d key=someKey -d project=APIPullColab1 -d github_url=git@github.com:plocket/docassemble-TestPrivateRepo.git -d branch=main

Right result: code pulled in, no id because no module, so no restart

private/no permissions/ssh url/da email doesn't match github email

curl --insecure --request POST --url https://apps-dev.suffolklitlab.org/api/playground_pull -d key=someKey -d project=APIPullColab1 -d github_url=git@github.com:plocket/docassemble-TestPrivateRepo.git -d branch=main

Right result: "Pull process encountered an error: error running git clone. Doing GIT_SSH=/tmp/someid.sh git clone -b "main" git@github.com:plocket/docassemble-TestPrivateRepo.git\nCloning into 'docassemble-TestPrivateRepo'...\nWarning: Permanently added 'github.com,140.82.112.4' (ECDSA) to the list of known hosts.\r\nERROR: Repository not found.\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

@plocket
Copy link
Collaborator Author

plocket commented Dec 17, 2021

HTTP libraries

Edit: I don't want to spend a ton of time searching, so I'll start with axios, but if others have thoughts, we can probably switch without too much pain.

request:
Deprecated: request/request#3142

node-fetch:
In 2017, it looked less good than axios: https://blog.bitsrc.io/performing-http-requests-fetch-vs-axios-b62b44fed10d. Also in 2020: https://masteringjs.io/tutorials/axios/vs-fetch. I think fetch can return JSON now with result.json() now, but I may be wrong.

axios:
See above comparison with fetch.


Here's a list with no reviews whatsoever from Aug 2019: request/request#3143

@plocket
Copy link
Collaborator Author

plocket commented Dec 17, 2021

All the examples in axios use qs (https://www.npmjs.com/package/qs) for their data. It might be overkill, but it'll mean we don't have to think about query strings at all, so I think we'll use that too.

@plocket
Copy link
Collaborator Author

plocket commented Dec 21, 2021

I'd appreciate a hand with a current API handling behavior dilemma.

I've got three functions: dai.server_is_busy() waits until the server responds or it times out (at which point it throws, a behavior I'm also not sure of). That can be done with any arbitrary request to the server (more later). ​dai.get_dev_id()​​ waits till the server is free (using the previous function), then gets the developer's id. ​dai.test_api_key()​ throws if the API key is not valid.

The first two use ​da.get_dev_id()​ (note ​da​, not ​dai​§) which is the actual request to the server for the developer id. The api key test uses ​dai.get_dev_id()​ to test that an api key is valid and throw an error with a detailed description if they key is invalid. The thing is, whenever ​da.get_dev_id()​ is used, it might get an 'Access Denied' response error, [which is only caused in this case by an invalid API key], so should I really wait to throw the error all the way back at ​dai.test_api_key()​, or should I throw it as soon as possible in ​dai.server_is_busy()​?

There are lots of things that call ​dai.server_is_busy()​ and all of them would probably want to indicate if the api key is suddenly invalid. It also seems weird to push back the detailed error description so it only comes up when ​dai.test_api_key()​ is called. On the other hand,

  1. the stack trace for that 'Access Denied' error might be more useful if it points to ​dai.test_api_key()​ as the place where the error happened. You could make that argument for any of the other functions too, I suppose, and that would end up being more boilerplate to put everywhere, even if I abstract it.

  2. ​dai.server_is_busy()​'s name doesn't indicate that its purpose is to, under some circumstances, deal with/throw an 'Access Denied' error. That said, I'm already throwing an error there under the assumption that anything asking whether the server is busy would not want to throw if the server is not responding in time.

Which of those seem most compelling? Are there any other reasons to lean one way vs. another?

§ da is 'docassemble api', dai is 'docassemble api interface'.

For reference, see a sample of the code

[This question might become more relevant as we use the API to test the interview itself and detect the difference between page load issues and server response issues. Should all the api interface functions just rely on their own timeouts? Is there another reason for them to timeout other than a busy server?]

@BryceStevenWilley
Copy link
Collaborator

§ da is 'docassemble api', dai is 'docassemble api interface'.

This doesn't make it clear what the difference between these two is. Both are interfaces (an already overloaded word IMO). It looks like dai just does await da.get_dev_id() before every call? But it's really unclear from the names.

The core question here that isn't explained is why da.get_dev_id() is the thing you use to check to see if the server is responding? It seems very implementation dependent, and like you said, not really related to what server_is_busy() is supposed to do. There's no reason that DA couldn't release an different endpoint that says "server ok" without a key. IMO it shouldn't throw.

@plocket
Copy link
Collaborator Author

plocket commented Dec 21, 2021

This doesn't make it clear what the difference between these two is.

I can understand that. Do you have thoughts on how I can make that more clear? "docassemble api" is trying to stick as close to axios as possible. It's not doing anything other than making the requests. "docassemble api interface" is supposed to manage various request operations in ways that help get more complex things done, like waiting for a busy server, creating unique Project names, etc.

Will post another response to the other question in a bit.

@plocket
Copy link
Collaborator Author

plocket commented Dec 21, 2021

IMO it shouldn't throw.

This is making me think that maybe there shouldn't be an .is_server_busy() function - maybe that's not useful (why would it be?). Maybe I need a .throw_an_error_if_server_is_not_responding(). Would that make more sense?

There's no reason that DA couldn't release an different endpoint

I used getting the id because it doesn't cause a restart and there isn't a specific api endpoint for just that. Shall we add an endpoint? The API docs have a section on polling the server that only talks about doing it for tasks that restart the server and therefore have task_ids, not just for polling at any time.

@plocket
Copy link
Collaborator Author

plocket commented Dec 21, 2021

[Separate from the discussion above:]

Should all the api interface functions just rely on their own timeouts?

[Answer, potentially: To me, the answer based on the below discussion below is that we can do without polling and just use a timeout. If a user finds they have trouble with it, though, I hope we can find this discussion again...]

[Prior discussion:]
This is its own question and the question is really "Do we need to use a loop to poll"?

The docassemble API recommends that:

Canceling a request after a few seconds and trying again can be more efficient than waiting 60 seconds for the server to cancel the request.

The following conversation aimed for clarifying polling suggestion:
Q:

In the API docs for polling the server it says, “Canceling a request after a few seconds and trying again can be more efficient than waiting 60 seconds for the server to cancel the request.” Is that specific to the request library you’re using? We're using Axios and I don’t seem to get less efficient behavior. Is it possible I just haven’t run into the circumstances under which it would be less efficient?

A:

I don't think it would be specific to the request library but it might be specific to the server. I don't know under what circumstances a request times out. That has to do with how NGINX and uwsgi work.

Q:

Is that separate from/outside of what docassemble controls?

A:

I don't know the details about how NGINX and uwsgi work. Those are separate software packages developed by other people. They are part of docassemble's default Docker container setup.

@plocket
Copy link
Collaborator Author

plocket commented Dec 21, 2021

Another separate question:

Could we modify the api so we can start an interview and get its url so we can then go to it: https://docassemble.org/docs/api.html#session_new. The url is currently not returned.

That way we won't need to get the developer id.

We'll also then have access to the interview through the API. I'm not sure how that could be helpful currently, but there may come a time...

plocket added a commit that referenced this issue Dec 21, 2021
functions instead of going through the server check. See
#392 (comment)
@plocket
Copy link
Collaborator Author

plocket commented Dec 21, 2021

Why do we want to poll the server during tests?

This is to differentiate between a long page load and a busy server. I'll explain the complications later.

First, it'll only be partially effective, and there's not a way to avoid that. The server could become unresponsive just after we test it or become responsive just after we test it. We can test before trying to load a page, during the loading of the page, and after the page times out on loading, but all of them will have this problem.

That said, we can take a shot at making these false failures more rare. Do we want to?

How about just testing a page load multiple times? That can result in more false failures and longer test times.

False failure scenario:

The developer sets a custom timeout for 1 second per step. They continue to the next page. Our code gives the page 3 tries, but the server is going to take 5 seconds to finish restarting. The server needed five more seconds, though. The test will fail because the server didn't have enough time to restart.

Tests can take longer:

I know some people don't find this to be a problem, but I definitely would find it frustrating.

Scenario 1: Supposing the developer knows some big files will take a long time to load. They give the test 4 min. to load a page. The server is busy, but finishes 5 seconds after when we start timing the page load. The page fails to load in those first 4 min because it got stuck (which I've experienced before). That's 8 min. If we had been polling the server, it would have taken 4 min. and 5 seconds.

Scenario 2: If they set the 4 min. timeout for the whole scenario, not just one step, then this situation could happen: The page would have taken 1 sec to load. The server is busy, but finishes 5 seconds after the page tries to load. The page doesn't load in that first 4 min. window. It then tries again and loads in 1 second. That's 4 min and 1 second instead of 6 seconds.

@plocket
Copy link
Collaborator Author

plocket commented Dec 23, 2021

Closing this issue will let us close (nope) SuffolkLITLab/docassemble-ALKilnSetup#171 of the setup interview because the user will only need to give us the API key.

[I'm rethinking this. We could probably make a page to let the repeat user give us all the fields on one page.]

@plocket
Copy link
Collaborator Author

plocket commented Dec 24, 2021

Random event. Unintentional server restart muddled things up a bit. Do docassemble API keys expire? Some of mine [disappeared without me deleting them]. I'll ask some questions. Unless this is something we can change, we may have to rethink how we do this...

rpigneri-vol added a commit that referenced this issue Feb 7, 2022
There's a lot here, both as notes in the issue and as changes in the files. Some of it is just changing a folder name, but this might take a real-time discussion depending on how much folks know about these API requests.

This addresses #392. Specifically the setup and takedown steps. The next part, helping manage the tests themselves, will probably be a lot smaller.

I hope I haven't missed anything important, but this is a big move. I can try to redo this in a more step-by-step fashion, making a PR for each step. I have some ideas how, so do let me know if that would help. I just needed to see how it would all fit together.

This also addresses a couple of other issues, and I'll try to track them all down and add them here. (Addresses #438, session_vars). (Addresses #432, better Project name, for v4, I think - title of docassemble Project. See [this line of code](https://github.com/SuffolkLITLab/ALKiln/blob/5782821f6de94335b6cf715e91986cccce4639a1/lib/utils/session_vars.js#L97).) It made sense considering the other changes I was making. I'm very unsure about how I'm handling `session_vars` (which are meant to eventually replace `env_vars`). ~It's a lot of getting things for stuff that's basically static, but there are some in there that need to be gotten, and I thought it would be more consistent.~ See comment below about at least part of why `session_vars` is all `.get...()`. Maybe it needs to be broken up further somehow, though. Open to suggestions.

In the future, `da_i.get_dev_id()` will be needed to get the interview address. Right now its use is a bit questionable. That discussion has started in #392.

`da_i.throw_an_error_if_server_is_not_responding()` is still more problematic, but it's not currently being used by anything. It was before and it will be again, but it'll have time to get worked out later I think.

-------

* Start developing API calls

- Add user_vars file to eventyally remove env_vars file
- Other than pull and checking task status, calls appear working
- Install axios and qs

* Fix api path for server status update, increase pull default

timeout, add some questions

* Change user_vars to session

* Finish changing user_vars to session

* Implement Project creation through API

* Add pulling through the API

* Implements wait_for_server_to_restart()

Adds:
- log.js for logging ALKiln-specific messages
- time.js to wait for a timeout

A lot of re-arranging. Still not sure how to organize logs vs.
throws, etc.

* Implemented delete project

* Delete puppeteer-utils.js, remove spurious `while` for creating

a project (it should probably be handled in project creation) and
add some comments.

* Clean up a bit

* Add developer API key to workflow

* Replace env_var with session in env_vars.js where possible

* Switch `session` to `session_vars` because session has too much

baggage

* Move request complexity into its own file

* Get more info about interview url (for errors)

* Restore scope.js, error resolved

* Comment out unused funcs, move creation loop into da interface

and simplify it a big. Add various comments.

* Add pull attempts to run again if the server is busy

* Implement pull retries if server is busy

* Move wait for pull to complete into da interface. Also

- Rename some functions
- Trust the API allowing simplification of if statements
- Adjusted log messages

* Rearrange locations of files

* Add test for correct API key, adjusted logs

* Fix behavior for an invalid API key

* Trust server check timeout instead of loop in most cases, ensure

dai.delete() too waits for server to be free. Logs and comments.

* Adjust names, move long timeouts directly into the relevant

functions instead of going through the server check. See
#392 (comment)

* Add and correct comments/descriptions

* Add API key to action.yml

* Fix invalid project name allowed

* Remove dev console log

* Update messages, improve log.js, improve some names, as noted and

also with some of my own changes. Also, had to move dev __delete
function into REST to handle deleting with axios because I
apparently changed the actual `delete` functions to use only the
name of the current project. No longer able to hand one in.

* Concat log string instead of using loop

Co-authored-by: rpigneri-vol <78454056+rpigneri-vol@users.noreply.github.com>

* Make SERVER_URL safe and other review fixes

Co-authored-by: rpigneri-vol <78454056+rpigneri-vol@users.noreply.github.com>
plocket added a commit that referenced this issue Feb 8, 2022
Close #40 

* add conditional for name with more than 4 parts

* create test for more that 4 part name warning

* fix typo in tests

* fix repeated code

* fix spacing and typo errors

* fix grammar in error

* fix grammar in testing

* log changes in CHANGELOG

* Address #392, use API to help with server functionality (#442)

There's a lot here, both as notes in the issue and as changes in the files. Some of it is just changing a folder name, but this might take a real-time discussion depending on how much folks know about these API requests.

This addresses #392. Specifically the setup and takedown steps. The next part, helping manage the tests themselves, will probably be a lot smaller.

I hope I haven't missed anything important, but this is a big move. I can try to redo this in a more step-by-step fashion, making a PR for each step. I have some ideas how, so do let me know if that would help. I just needed to see how it would all fit together.

This also addresses a couple of other issues, and I'll try to track them all down and add them here. (Addresses #438, session_vars). (Addresses #432, better Project name, for v4, I think - title of docassemble Project. See [this line of code](https://github.com/SuffolkLITLab/ALKiln/blob/5782821f6de94335b6cf715e91986cccce4639a1/lib/utils/session_vars.js#L97).) It made sense considering the other changes I was making. I'm very unsure about how I'm handling `session_vars` (which are meant to eventually replace `env_vars`). ~It's a lot of getting things for stuff that's basically static, but there are some in there that need to be gotten, and I thought it would be more consistent.~ See comment below about at least part of why `session_vars` is all `.get...()`. Maybe it needs to be broken up further somehow, though. Open to suggestions.

In the future, `da_i.get_dev_id()` will be needed to get the interview address. Right now its use is a bit questionable. That discussion has started in #392.

`da_i.throw_an_error_if_server_is_not_responding()` is still more problematic, but it's not currently being used by anything. It was before and it will be again, but it'll have time to get worked out later I think.

-------

* Start developing API calls

- Add user_vars file to eventyally remove env_vars file
- Other than pull and checking task status, calls appear working
- Install axios and qs

* Fix api path for server status update, increase pull default

timeout, add some questions

* Change user_vars to session

* Finish changing user_vars to session

* Implement Project creation through API

* Add pulling through the API

* Implements wait_for_server_to_restart()

Adds:
- log.js for logging ALKiln-specific messages
- time.js to wait for a timeout

A lot of re-arranging. Still not sure how to organize logs vs.
throws, etc.

* Implemented delete project

* Delete puppeteer-utils.js, remove spurious `while` for creating

a project (it should probably be handled in project creation) and
add some comments.

* Clean up a bit

* Add developer API key to workflow

* Replace env_var with session in env_vars.js where possible

* Switch `session` to `session_vars` because session has too much

baggage

* Move request complexity into its own file

* Get more info about interview url (for errors)

* Restore scope.js, error resolved

* Comment out unused funcs, move creation loop into da interface

and simplify it a big. Add various comments.

* Add pull attempts to run again if the server is busy

* Implement pull retries if server is busy

* Move wait for pull to complete into da interface. Also

- Rename some functions
- Trust the API allowing simplification of if statements
- Adjusted log messages

* Rearrange locations of files

* Add test for correct API key, adjusted logs

* Fix behavior for an invalid API key

* Trust server check timeout instead of loop in most cases, ensure

dai.delete() too waits for server to be free. Logs and comments.

* Adjust names, move long timeouts directly into the relevant

functions instead of going through the server check. See
#392 (comment)

* Add and correct comments/descriptions

* Add API key to action.yml

* Fix invalid project name allowed

* Remove dev console log

* Update messages, improve log.js, improve some names, as noted and

also with some of my own changes. Also, had to move dev __delete
function into REST to handle deleting with axios because I
apparently changed the actual `delete` functions to use only the
name of the current project. No longer able to hand one in.

* Concat log string instead of using loop

Co-authored-by: rpigneri-vol <78454056+rpigneri-vol@users.noreply.github.com>

* Make SERVER_URL safe and other review fixes

Co-authored-by: rpigneri-vol <78454056+rpigneri-vol@users.noreply.github.com>

Co-authored-by: plocket <52798256+plocket@users.noreply.github.com>
Co-authored-by: rpigneri-vol <78454056+rpigneri-vol@users.noreply.github.com>
@plocket plocket linked a pull request Feb 14, 2022 that will close this issue
plocket added a commit that referenced this issue Mar 16, 2022
Addresses #392, closes #438. [As part of this change, env_vars is no longer needed, so it's deleted, and checks for `session_vars` props needed to be moved around as this gets the ID in a different way.]

* Use api to get Playground ID instead of ID env var. #392.

* Update tests and fix code for session var validation

* Add clarifying comment about env.BRANCH_NAME

Co-authored-by: Bryce Willey <Bryce.Steven.Willey@gmail.com>

* Remove outdated comment

* Update lib/utils/session_vars.js

Co-authored-by: Bryce Willey <Bryce.Steven.Willey@gmail.com>
@plocket
Copy link
Collaborator Author

plocket commented Mar 16, 2022

For errors caused by another package reloading the server and thus causing test failure: What if we do some constant async polling of the server and keep that status in scope? That way we're not checking after failure, before failure, etc, and possibly running into async issues regarding those. We'll just be polling regularly (every 5 seconds? [Maybe some fraction of the timeout?]) and would probably catch a server restart and flag it? Not sure that this idea is much better than just checking at point of failure, but can't think it through completely right now.

@plocket
Copy link
Collaborator Author

plocket commented Mar 26, 2022

Consider removing this as an item for milestone "v4". I now think the goal of this issue doesn't match the milestone. The real breaking change for users was converting to using the API. This particular server reload issue can be icing on the cake that might be more possible now that we can use the API.

@plocket
Copy link
Collaborator Author

plocket commented Mar 27, 2022

The discussion continues in #523 as an actual "discussion" item. When we have a course of action, we'll bring it in here as an action item.

Proposal to remove from milestone "v4" going on in #533

@plocket plocket removed this from the v4 milestone Mar 30, 2022
@plocket plocket removed their assignment Apr 4, 2022
@plocket
Copy link
Collaborator Author

plocket commented May 18, 2022

Webhook idea moved to discussion #523

@plocket
Copy link
Collaborator Author

plocket commented Jun 23, 2022

Our decision from #523:

Ok, I have maybe actually for real got an idea for how to solve our server reload detection issue. I swear, there have to be solutions to this out there, I just don’t know how to find them. Here goes:

  1. Poll the server every X ms to see if it’s up. Keep a list of statuses for the last Y seconds (let’s say 30)
  2. Run functions in the test. If a page times out (let’s say after 30 secs), check the list of statuses from the present into the past for a bit over 30 secs.
  3. If the server was unavailable anytime during that time, wait till the server status shows that it’s up again, with some kind of maximum timeout there too.
  4. Once the server seems to be up, try re-doing the operation that timed out.
  5. Only do that a limited number of times and with a reasonable pause in between each poll.

This only works because we happen to know that the servers will be down for more than 5 seconds or so (though usually it's much more than that). That lets us poll at a reasonable rate - maybe every 500 ms or even 100 ms.

Possible issue someone else brought up: A server’s latency period that is longer than its downtime period.
Latency: What is the amount of time that a request can expect a response? Can depend on Circumstances.
Downtime: What is the min amount of time that it takes a server to reload?

[Btw, we can handle that by overlapping our polling to the server. We wouldn’t wait for a response before polling again.]

@plocket plocket self-assigned this Jun 23, 2022
@plocket
Copy link
Collaborator Author

plocket commented Jun 30, 2022

Summary:

  1. If the server causes an error by reloading, we can't just redo the previous Steps to finish the test. For example, non-story table Steps would be unclear about how far back to go to re-do just that single page, especially if the dev didn't use the "continue" Step. Also, I'm not sure cucumber allows you to purposely redo individual steps.
  2. It's possible we could carry on if there's not a 504 timeout error or something, but I don't know when that would appear on the page/how long after reload is done it would take for anything useful to appear on the page. Maybe we can take screenshots for practice, but I'm not sure that it'll happen the same way every time.
  3. Maybe our best/only way of dealing with this is adding --retry 1 or 2 to the node test run command, then always failing on reload so that the test will be re-run. We can wait till the server has finished reloading before rerunning or moving onto the next Step. This, of course, will cause all failed tests to retry, so it's not ideal, but I haven't yet found a simple way (i.e. one we wouldn't have to maintain) to rerun only specific kinds of errors.

This is from brainstorms in scope.tapElement() currently being used as a foundation for making these decisions (and others):

      /* ===== Brainstorms =====
      Store the promise in state and then await that promise in here
      Who should retry? If this is a story table, retrying shouldn't happen in here
      Otherwise, this in here should retry.
      Except! If we reload the page when the dev is using individual steps to set
      values, the page's values will be blank. There's no way to look back over the
      previous steps on this page.
      Would it be crazy to just restart the test?
        Depends on the test, but probably not
      Is it even possible to restart the test?
        I don't know at all [later: yes, it is, with --retry n]
      Raising a re-tryable error and then detect that in cucumber to somehow
      trigger restarting the same scenario?
      How about just failing the test and telling them why and letting them rerun manually?
        That can be a useful temporary improvement, but I'd rather not teach devs to
        ignore test failures. It's better than opaque failures, though.
      TODO: Also have to do this when first loading an interview or going to a
        url with a link check, though the latter isn't yet implemented, so we can
        skip that for now.
      MVP possibly: Add the server reload error message to the report, but also
        add --retry 1 to the node script _and_ wait for the server to reload.
        Sadly, --retry 1 will rerun actually failing tests too.
        https://github.com/cucumber/cucumber-js/blob/main/docs/retry.md
      ==========================
      */

@plocket
Copy link
Collaborator Author

plocket commented Jun 30, 2022

So. The result we get when the test fails because of a reload [and then passes on the second try] is that the report gets both a failed scenario and a passed scenario. In addition, the test passes, but the cucumber error still shows up.

I see this as a problem because it can be confusing. Suppose two tests fail, one from the server reloading and one fails legitimately. A person might look at the report and think that both tests are failing legitimately and try to solve both problems. I do add a report message to the reload failure Scenario, so maybe that's enough? I'm not sure.

plocket added a commit that referenced this issue Aug 9, 2022
…efully (#570)

* Start tracking server status
* Creates untested way to wait for server to reload
* Confirm server reload check is triggering and working by
implementing roughly in story table step. A bit of a mess as I
need to keep confirming its working state as I go forward.
* Broken. Add waiting for server response in some places. See more...

Currently, the story table page id loop rejection doesn't error.
Waiting for server response also seems to not error on rejection.
Need to throw errors instead? This is probably why the try/catch
block wasn't working for triggering the check (earlier on)

Also, the puppeteer will timeout before the page id story table
loop times out, leaving the process unexited until the server
waiting and page id loop finish. How do we exit those immediately
when necessary? Or will it not be necessary when everything else
is lined up?

* Add throw in a couple places, make functions async some places
* Abstract throwing server reload error, add brainstorms
* Solve multiple promises being created, prep for custom value
* Clean up and add to comments
* Solve browser closing too early
* Handle reload error for initial page load, abstract throwing server reload errors.

Initial page load handling is untested because that's hard to do
locally. We may have to figure out some other way to test. Remote
testing for this probably won't work either as the timing would
be impossible - we'd have to start running setup on both tests
and hope that the server timeout didn't interfere with the other
setup, just with the page load once the tests had started.
Maybe make two local repos and set one up and then time the test
and new setup so they overlap? I'm not sure.

* Clarify message about server reloading
* Change server tracker from recursive to `while` loop
* Discover and mend the immediate reload oversight which was
the "continue" Step using `wrapPromiseWithTimeout()` to ensure
preventing an infinite loop. It was causing a timeout too early
independently from the reload timeout. The problem now is where
else needs to deal with it, where to put the reoload complexity,
and how to make it clear where that's happening since it's so
far removed from a lot of the action. Also, is it only a problem
when elements are tapped, or other times as well? For example,
when downloading documents. Maybe that counts as tapping an
element.
* Simplify timeout catching
* Fix server sign in not erroring properly on server reload
* Handle tapping element when reloading. I think...
Async is always hard to test. Basically, cleaned up where the
page was being closed and where the browser was being created.
Not sure if `sign_in` needs to create a browser now, or a new
page, etc, but I think it's enough that the cleanup is handled
consistently.
* Clean up old page closing code
* Pass tests for tapping var button and sign out at end of step
* Test sign out, continuing two ways, reload on both attempts,
and reload completes during first test.
* Comment out local-only tests
* Remove unnecessary test
* Simplify (I hope) creation of clickWait
* Clean up some cosmetics found in review
* Fix wrong var name used

Co-authored-by: Bryce Willey <Bryce.Steven.Willey@gmail.com>

* Ensure full time is waited between server reload checks

Co-authored-by: Bryce Willey <Bryce.Steven.Willey@gmail.com>
@plocket
Copy link
Collaborator Author

plocket commented Aug 9, 2022

Closed by #570

@plocket plocket closed this as completed Aug 9, 2022
@plocket plocket changed the title Integrate API key to remove/reduce errors caused by server reload Integrate API key to inform dev of errors caused by server reload Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority A combination of urgency and impact
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants