Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate new probe interface into GCP framework #259

Merged
merged 22 commits into from
Aug 9, 2024

Conversation

eth1030
Copy link
Contributor

@eth1030 eth1030 commented Jul 24, 2024

What does this PR do? / Related Issues / Jira

Part of OSD-24065, this PR integrates new probe interface into GCP framework.

  • added a startup-script analogous to the AWS userdata-script
  • copy the curl json probe as gcp curl json probe, to make changes but not effect probe used by AWS
  • go embed the startup-script.sh instead of userdata-template.yaml
  • escape os.expand() by adding extra userDataVariables in GetExpandedUserData function because os.expand deletes unknown variables
  • run new gcp_curl_json_probe instead of dummy probe
  • added functionality to fetch url list from github

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have tested the functionality against gcp / aws, it doesn't cause any regression
  • I have added execution results to the PR's readme

Reviewer's Checklist

  • (This needs to be done after technical review) I've run the branch on my local, verified that the functionality is ok

How to test this PR locally / Special Instructions

set environment variable GCP_PROJECT_ID to the project ID of the VPC

export GCP_PROJECT_ID=<your-VPC-project-ID>

run verifier optionally with --debug flag

./osd-network-verifier egress --platform=gcp-classic --subnet-id <your-subnet-id> --vpc-name <your-vpc-name>

Logs

Expected output: verifier is able to log any blocked egresses

Using Project ID <project-ID>
Created instance with ID: verifier-<rand-num>
Applying labels
Successfully applied labels 
ComputeService Instance: verifier-<rand-num> RUNNING
Gathering and parsing console log output...
Summary:
printing out failures:
 - egressURL error: https://events.pagerduty.com:443 (Empty reply from server)
 - egressURL error: https://events.amazonaws.com:443 (Could not resolve host: events.amazonaws.com)

printing out exceptions preventing the verifier from running the specific test:
printing out errors faced during the execution:
Failure!

If client breaks or loses connection with instance, instance should self-delete within default or specified time

@openshift-ci openshift-ci bot requested review from AlexVulaj and reedcort July 24, 2024 22:07
@codecov-commenter
Copy link

codecov-commenter commented Jul 24, 2024

Codecov Report

Attention: Patch coverage is 23.37662% with 59 lines in your changes missing coverage. Please review.

Project coverage is 25.08%. Comparing base (87b6710) to head (32fcdf0).
Report is 1 commits behind head on main.

Files Patch % Lines
pkg/verifier/gcp/entry_point.go 0.00% 51 Missing ⚠️
pkg/verifier/gcp/gcp_verifier.go 0.00% 4 Missing ⚠️
pkg/probes/curl/curl_json.go 0.00% 1 Missing and 1 partial ⚠️
pkg/verifier/gcp/gcp_verifier_functions.go 90.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #259      +/-   ##
==========================================
+ Coverage   24.82%   25.08%   +0.25%     
==========================================
  Files          23       24       +1     
  Lines        1740     1802      +62     
==========================================
+ Hits          432      452      +20     
- Misses       1286     1327      +41     
- Partials       22       23       +1     
Files Coverage Δ
pkg/probes/curl/curl_json.go 78.57% <0.00%> (+0.52%) ⬆️
pkg/verifier/gcp/gcp_verifier_functions.go 90.00% <90.00%> (ø)
pkg/verifier/gcp/gcp_verifier.go 0.00% <0.00%> (ø)
pkg/verifier/gcp/entry_point.go 0.00% <0.00%> (ø)

@eth1030 eth1030 changed the title Osd 24065 bash Integrate new probe interface into GCP framework Jul 25, 2024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this entire file is unnecessary, as it's the same as the other machine_images.go.

@AlexVulaj
Copy link
Contributor

I'd really like to find a way to combine more of the common functionality here. For example, pkg/probes/gcp_curl/curl_json_result.go, pkg/probes/gcp_curl/gcp_curl_json.go, and pkg/probes/gcp_curl/machine_images.go are identical to their AWS counterparts.

Ideally the same probe should be used independently of AWS or GCP, perhaps with some configuration that can be passed in via parameters from the entrypoint files. (i.e. userDataTemplate and userDataVariables).

@AlexVulaj
Copy link
Contributor

/approve

Holding off on an lgtm as I don't have time to test this at the moment, but all looks good.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2024
@eth1030
Copy link
Contributor Author

eth1030 commented Jul 31, 2024

After meeting with @AlexVulaj I propose a new PR where we groom the GCP egress URL list because events.amazonaws.com does not appear to be an actual endpoint. When running curl on my local device I get:

curl events.amazonaws.com
curl: (6) Could not resolve host: events.amazonaws.com

running the network verifier results in:

./osd-network-verifier egress --platform gcp-classic --subnet-id default --vpc-name default
Using Project ID emhammon
Created instance with ID: verifier-134
Applying labels
Successfully applied labels 
ComputeService Instance: verifier-134 RUNNING
Gathering and parsing console log output...
Summary:
printing out failures:
 - egressURL error: https://events.amazonaws.com:443 (Could not resolve host: events.amazonaws.com)

printing out exceptions preventing the verifier from running the specific test:
printing out errors faced during the execution:
Failure!

I think the network verifier is working correctly in identifying blocked egress, and the egress URL list for GCP should be groomed. Addressed in OSD-24918

@@ -9,6 +9,6 @@ type Probe interface {
GetMachineImageID(platformType string, cpuArch cpu.Architecture, region string) (string, error)
GetStartingToken() string
GetEndingToken() string
GetExpandedUserData(map[string]string) (string, error)
GetExpandedUserData(map[string]string, string) (string, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - this will help us remember what these values are supposed to be in the future if we ever implement a new probe.

Suggested change
GetExpandedUserData(map[string]string, string) (string, error)
GetExpandedUserData(userDataVariables map[string]string, userDataTemplate string) (string, error)

Comment on lines 42 to 46
if vei.Timeout == 0 {
vei.Timeout = DEFAULT_TIMEOUT
}

vei.Timeout = DEFAULT_TIMEOUT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section looks a little confusing - it looks like vei.Timeout will always get set to DEFAULT_TIMEOUT no matter what. What's the intention here for setting that timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to if vei.Timeout <= 0 and only setting timeout there based on those conditions.

eth1030 and others added 13 commits August 2, 2024 11:22
service to print userdata begin runs multiple times despite
succeeding. service to run curl fails. printing end token requires
curl to succeed so it does not start running. crash recovery kernel
arming runs everytime vm is created.
shell script prints starting and ending token and runs curl. shell
script checks if command was successful. also added external IP so
curl has an output
used a systemd script and shell script that gets GCP NAME and ZONE
so the instance self deletes. this ensures that if something
happens to the client, the resource is not left on customer
accounts.
Egress URLs are fetched from GitHub as done in the AWS verifier
target after multi-user unneeded because other outputs are
completely silenced and wont interfere with probe output
systemd service that runs curl should only fail if bash script
returns an error code, this happens if curl error code is
1-4, 27, 41-43, 45 which means curl failed, not a network failure
added userDataTemplate to the probe interface so
getExpandedUserData can take in different templates for GCP and AWS
Co-authored-by: Alex Vulaj <ajvulaj@gmail.com>
Co-authored-by: Alex Vulaj <ajvulaj@gmail.com>
made minor changes to curl probe to make more readible and added
functionality to startup-script to add cacerts and export proxy
environment variables if specified
Copy link
Contributor

@abyrne55 abyrne55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome work @eth1030! I suggested some changes inline. In general, you'll want to explain some of the less-obvious workarounds you (very adeptly) implemented to get this working

@@ -79,7 +79,7 @@ func (lgp Probe) GetMachineImageID(platformType string, cpuArch cpu.Architecture
// variables listed in the template's "network-verifier-required-variables" directive, or if
// values *are* provided for variables that must be set to a certain value for the probe to
// function correctly (presetUserDataVariables) -- this function will fill-in those values for you.
func (lgp Probe) GetExpandedUserData(userDataVariables map[string]string) (string, error) {
func (lgp Probe) GetExpandedUserData(userDataVariables map[string]string, _ string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment to the doc string above noting why this probe doesn't need that last parameter

@@ -81,7 +77,7 @@ func (clp Probe) GetMachineImageID(platformType string, cpuArch cpu.Architecture
// variables listed in the template's "network-verifier-required-variables" directive, or if
// values *are* provided for variables that must be set to a certain value for the probe to
// function correctly (presetUserDataVariables) -- this function will fill-in those values for you.
func (clp Probe) GetExpandedUserData(userDataVariables map[string]string) (string, error) {
func (clp Probe) GetExpandedUserData(userDataVariables map[string]string, userDataTemplate string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update docstring with new param

@@ -213,7 +213,7 @@ func TestLegacyProbe_GetExpandedUserData(t *testing.T) {

prb := Probe{}
// First check if function is returning an error
got, err := prb.GetExpandedUserData(tt.userDataVariables)
got, err := prb.GetExpandedUserData(tt.userDataVariables, userDataTemplate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider passing an empty string here instead of userdataTemplate to reduce potential confusion

const (
cloudImageIDDefault = "rhel-9-v20240703"
DEFAULT_CLOUDIMAGEID = "rhel-9-v20240703"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already defined inside the probe (machine_images.go)

Comment on lines 96 to 99
"ret": "${ret}",
"?": "$?",
"array[@]": "${array[@]}",
"value": "$value",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment explaining why this is necessary

g.Logger.Debug(vei.Ctx, "Generated userdata script:\n---\n%s\n---", userData)

if vei.CloudImageID == "" {
vei.CloudImageID = cloudImageIDDefault
vei.CloudImageID = DEFAULT_CLOUDIMAGEID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use probe.GetMachineImageID() here instead of a global constant. See also my above comment on that constant

Comment on lines 62 to 81
func Test_get_unreachable_endpoints(t *testing.T) {
type args struct {
consoleOutput string
probe probes.Probe
}
tests := []struct {
name string
args args
wantErr bool
}{
// TODO: Add test cases.
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if err := get_unreachable_endpoints(tt.args.consoleOutput, tt.args.probe); (err != nil) != tt.wantErr {
t.Errorf("get_unreachable_endpoints() error = %v, wantErr %v", err, tt.wantErr)
}
})
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func Test_get_unreachable_endpoints(t *testing.T) {
type args struct {
consoleOutput string
probe probes.Probe
}
tests := []struct {
name string
args args
wantErr bool
}{
// TODO: Add test cases.
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if err := get_unreachable_endpoints(tt.args.consoleOutput, tt.args.probe); (err != nil) != tt.wantErr {
t.Errorf("get_unreachable_endpoints() error = %v, wantErr %v", err, tt.wantErr)
}
})
}
}

It's understandable to skip tests here for now, but no need to leave the boilerplate code in the meantime

pkg/verifier/gcp/gcp_verifier_functions_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would appear that two separate copies of this file now exist, which poses a maintainability challenge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately go embed can only handle embedding files from the same directory, a solution is to take in a platformType parameter in getExpandedUserData preemptively embed both userdata-template and startup-script in the curl probe and then use one of them depending on the platformType. Any thoughts about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the userdata-template in the curl directory is only used for testing.

@@ -177,7 +181,7 @@ func (a *AwsVerifier) ValidateEgress(vei verifier.ValidateEgressInput) *output.O
userDataVariables["DELAY"] = "60"
}

unencodedUserData, err := vei.Probe.GetExpandedUserData(userDataVariables)
unencodedUserData, err := vei.Probe.GetExpandedUserData(userDataVariables, userDataTemplate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this has us providing the same value of userDataTemplate to every probe, yeah? My first thought is that different probes need to use different user-data, but then I saw that your LegacyProbe just ignores the userDataTemplate param. So it works, but it's not immediately clear to somebody scanning through the code why this would work. IOW, this needs comments at the very least, but let's have a larger discussion about this first

a workaround of not changing the probe interface and using a
different template for GetExpandedUserData is to pass a
userDataVariable through the probe interface. In the
GetExpandedUserData, based off if a GCP platform exists, the
function chooses userdata-template or startup-script.
also changed gcp verifier to use GetMachineImageID which sets
default machine image.
@eth1030
Copy link
Contributor Author

eth1030 commented Aug 7, 2024

/retest

@eth1030
Copy link
Contributor Author

eth1030 commented Aug 7, 2024

/retest

Copy link
Contributor

@abyrne55 abyrne55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really, really good work and very close to merging.

Comment on lines 29 to 30
//go:embed startup-script.sh
var startupScript string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//go:embed startup-script.sh
var startupScript string
//go:embed systemd-template.sh
var systemdTemplate string

Renaming to make more consistent with userdata-template.yaml. More broadly, I'd like to rebrand your approach as the "systemd curl probe" vs. the "GCP curl probe" so that things don't get confusing if/when we try to use systemd on AWS in the future

Comment on lines 88 to 93
// Check if GCP instance is being used, if so use startup-script instead of userdata-template
// Only GCP should have USE_GCP_STARTUPSCRIPT userDataVariable
// Serves as a workaround for identifying platform type without a new probe interface
if userDataVariables["USE_GCP_STARTUPSCRIPT"] == "true" {
userDataTemplate = startupScript
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Check if GCP instance is being used, if so use startup-script instead of userdata-template
// Only GCP should have USE_GCP_STARTUPSCRIPT userDataVariable
// Serves as a workaround for identifying platform type without a new probe interface
if userDataVariables["USE_GCP_STARTUPSCRIPT"] == "true" {
userDataTemplate = startupScript
}
// Use systemd to run curl (instead of cloud-init) if requested. Useful for
// platforms that don't include cloud-init in their OS images (e.g., GCP)
if userDataVariables["USE_SYSTEMD"] == "true" {
userDataTemplate = systemdTemplate
}

See above "rebrand" comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: is there a benefit to pinning to a specific image version? Does it still work if we just pass "rhel-9" instead of "rhel-9-v20240709"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not work if we pass in "rhel-9"

Comment on lines 48 to 51
// Set instance type to default if not specified and validate it
if vei.InstanceType == "" {
vei.InstanceType = DEFAULT_INSTANCE_TYPE
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Set instance type to default if not specified and validate it
if vei.InstanceType == "" {
vei.InstanceType = DEFAULT_INSTANCE_TYPE
}
// Set instance type to default if not specified and validate it
if vei.InstanceType == "" {
vei.InstanceType, err = vei.CPUArchitecture.DefaultInstanceType(helpers.PlatformGCP)
if err != nil {
return g.Output.AddError(err)
}
g.Logger.Debug(vei.Ctx, fmt.Sprintf("defaulted to instance type %s", vei.InstanceType))
}

Use vei.CPUArchitecture.DefaultInstanceType() instead of a DEFAULT_INSTANCE_TYPE constant in order to enable ARM-friendliness. We also probably need to do a little work on lining up the "user specified a bad instance type; what do?" logic with how it's done on the AWS side, but this works okay for now.

pkg/verifier/gcp/gcp_verifier_functions.go Outdated Show resolved Hide resolved
pkg/verifier/gcp/gcp_verifier.go Show resolved Hide resolved
Copy link
Contributor

@abyrne55 abyrne55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicks in this one. After you fix those, feel free to post in the Aurora channel asking the team to test the code in their GCP accounts

pkg/verifier/gcp/entry_point.go Outdated Show resolved Hide resolved
pkg/probes/curl/curl_json.go Outdated Show resolved Hide resolved
eth1030 and others added 3 commits August 8, 2024 18:13
Co-authored-by: Anthony Byrne <abyrne@redhat.com>
Co-authored-by: Anthony Byrne <abyrne@redhat.com>
// function that tests probe order logic that is part of findUnreachableEndpoints in gcp_verifier.go
// get_tokens checks for the presence of startingToken and endingToken in the consoleOutput
// probe outsput should be between startingToken and endingToken
func get_tokens(consoleOutput string, probe probes.Probe) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick here - use camelcase for function names in Go, i.e. getTokens

Copy link
Contributor

@abyrne55 abyrne55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and confirmed GCP functionality using the following command

GCP_PROJECT_ID=abyrne ./osd-network-verifier egress --platform=gcp-classic --vpc-name=default --subnet-id=default --debug

Output was as-expected: verifiergcp.2024-08-09T15.59.10Z.log.

Also performed a basic test against AWS and confirmed that there were no obvious regressions.

Was not able to test ARM functionality as cmd.go does not pass the value of the --cpu-arch flag into vei.CPUArchitecture in the GCP-specific case (only in the AWS case), but that can be addressed in a future PR.

/approve
/lgtm
/hold

Awesome work @eth1030! Feel free to /unhold whenever you're ready

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Aug 9, 2024
Copy link
Contributor

openshift-ci bot commented Aug 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abyrne55, AlexVulaj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2024
Copy link
Contributor

openshift-ci bot commented Aug 9, 2024

@eth1030: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@eth1030
Copy link
Contributor Author

eth1030 commented Aug 9, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 9, 2024
@abyrne55
Copy link
Contributor

abyrne55 commented Aug 9, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e338e54 into openshift:main Aug 9, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants