Worker Fleet integration tests failing to create ASG's #165

Closed
horsmand opened this issue Oct 9, 2020 · 3 comments · Fixed by #167
Labels
bug This issue is a bug.

Comments


horsmand commented Oct 9, 2020

The Worker Fleet integration tests are failing when trying to create autoscaling groups named similarly to WorkerStructWF1Worker3ASG803D4B7C and WorkerStructWF1Worker2ASG6D4C29FF, both with the reason:

Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

Reproduction Steps

  1. Check out our release candidate branch bump/0.18.0
  2. Set up integ/test-config.sh with these values:
    • DEADLINE_VERSION='10.1.10.6'
    • LINUX_DEADLINE_AMI_ID='ami-05d4887175201bde8'
    • WINDOWS_DEADLINE_AMI_ID='ami-09c712180218564f2'
    • SKIP_deadline_01_repository_TEST=true
    • SKIP_deadline_02_renderQueue_TEST=true
  3. Configure your AWS credentials
  4. Run yarn run e2e from the integ directory

Error Log

2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: BEGIN - ip-###-###-###-###\ec2-user
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Operating System: Linux
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: CPU Architecture: x86_64
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: CPUs: 2
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Video Card: Cirrus Logic GD 5446
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Deadline Worker 10.1 [v10.1.10.6 Release (1a2a926fa)]
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: AccessDeniedException occurred while fetching tags for the instance:
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Got Access Denied when trying to DescribeTags in EC2 Instance EC2 Instance. Please make sure your user has the 'iam:GetUser' IAM Permission to make these error messages better.
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Please make sure your IAM user <unknown> has the following IAM Permission(s) to access EC2 Instance.
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: ec2:DescribeTags
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: (Deadline.AWS.AWSPortalAccessDeniedException)
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Is tracked by resource tracker: true
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:20: Scanning for auto configuration
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:23: Auto Configuration: No auto configuration for Repository Path could be detected, using local configuration
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:23: Connecting to repository
2020-10-08T17:12:26.551-05:00	2020-10-08 22:12:23: Could not connect to Deadline Repository: The configured root CA ('/var/lib/Thinkbox/Deadline10/gateway_certs/ca.crt') does not exist.
2020-10-08T17:12:31.551-05:00	2020-10-08 22:12:23: Deadline Worker will try to connect again in 10 seconds...
2020-10-08T17:12:33.806-05:00	2020-10-08 22:12:33: Could not connect to Deadline Repository: The configured root CA ('/var/lib/Thinkbox/Deadline10/gateway_certs/ca.crt') does not exist.
2020-10-08T17:12:38.551-05:00	2020-10-08 22:12:33: Deadline Worker will try to connect again in 10 seconds...
2020-10-08T17:12:43.812-05:00	2020-10-08 22:12:43: Could not connect to Deadline Repository: The configured root CA ('/var/lib/Thinkbox/Deadline10/gateway_certs/ca.crt') does not exist.
2020-10-08T17:12:48.551-05:00	2020-10-08 22:12:43: Deadline Worker will try to connect again in 10 seconds...
2020-10-08T17:12:54.069-05:00	2020-10-08 22:12:53: Could not connect to Deadline Repository: The configured root CA ('/var/lib/Thinkbox/Deadline10/gateway_certs/ca.crt') does not exist.
2020-10-08T17:12:58.551-05:00	2020-10-08 22:12:53: Deadline Worker will try to connect again in 10 seconds...
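
To summarize a log like the one above, a small script can pull out the missing CA path and count how often the Worker reported it. This is only a sketch: the function name `find_missing_ca_paths` is hypothetical, and the regular expression assumes the message wording stays exactly as it appears in this log.

```python
import re

# Matches the Worker's "root CA does not exist" failure and captures the path.
# Assumption: the wording is exactly what appears in the log above.
MISSING_CA_RE = re.compile(
    r"Could not connect to Deadline Repository: "
    r"The configured root CA \('(?P<path>[^']+)'\) does not exist\."
)


def find_missing_ca_paths(log_lines):
    """Return a dict mapping each missing CA path to the number of reports."""
    counts = {}
    for line in log_lines:
        match = MISSING_CA_RE.search(line)
        if match:
            path = match.group("path")
            counts[path] = counts.get(path, 0) + 1
    return counts
```

Feeding it the lines of the Worker log would show whether every failed connection attempt points at the same certificate path.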

Environment

  • CDK CLI Version: 1.66.0
  • CDK Framework Version: 1.66.0
  • RFDK Version: 0.18.0 (release candidate)
  • Deadline Version: 10.1.10.6
  • Node.js Version: 12.18.3
  • OS: AL2
  • Language (Version): TypeScript (~4.0.3)

Other


This is 🐛 Bug Report

@horsmand horsmand added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. and removed needs-triage This issue or PR still needs to be triaged. labels Oct 9, 2020
@horsmand horsmand changed the title Worker Fleet integration tests failing Worker Fleet integration tests failing to create ASG's Oct 9, 2020
@aws-painec
Contributor

I'm running the integration tests on RFDK 0.17 using both Deadline 10.1.9.2 and 10.1.10.6.

10.1.10.6 failed while creating the first worker tier, with the error in the OP.
10.1.9.2 is still running, but it has successfully created the first render struct, so it was able to create at least one ASG.


horsmand commented Oct 9, 2020

From Daniel:

10.1.10 added these configuration options to deadline.ini for all AWSPortal worker AMIs:

ProxyUseSSL=True
ProxySSLCA=/var/lib/Thinkbox/Deadline10/gateway_certs/ca.crt
ClientSSLAuthentication=NotRequired
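
One way to check whether an instance is affected is to read its deadline.ini and see whether ProxySSLCA names a file that is not there. The sketch below uses Python's standard `configparser`. The function name `find_dangling_proxy_ca` is hypothetical, and the `[Deadline]` section name is an assumption about the file layout rather than something confirmed in this thread.

```python
import configparser
import os


def find_dangling_proxy_ca(ini_text, section="Deadline", path_exists=os.path.exists):
    """Return the ProxySSLCA path if it is set but missing on disk, else None.

    Assumption: the settings live under a [Deadline] section; pass a different
    section name if the file is laid out differently.
    """
    # Disable interpolation so a literal '%' in a value cannot raise an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as written (ProxySSLCA, not proxysslca)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return None
    ca_path = parser.get(section, "ProxySSLCA", fallback="").strip()
    if ca_path and not path_exists(ca_path):
        return ca_path
    return None
```

The `path_exists` parameter is there only so the check can be exercised without touching the real filesystem.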

We configure Workers here:

def configure_deadline( config ):
    """
    Configures Deadline to be able to connect to the given Render Queue
    :param config: The parsed configuration object
    """
    repo_args = ['ChangeRepository','Proxy',config.render_queue.address]
    if config.render_queue.scheme == 'http':
        print( "Configuring Deadline to connect to the Render Queue (%s) using HTTP Traffic" % config.render_queue.address )
        #Ensure SSL is disabled
        call_deadline_command(['SetIniFileSetting','ProxyUseSSL','False'])
    else:
        print("Configuring Deadline to connect to the Render Queue using HTTPS Traffic")
        call_deadline_command(['SetIniFileSetting','ProxyUseSSL','True'])
        try:
            os.makedirs(CERT_DIR)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
        if config.tls_ca:
            """
            If we are configuring Deadline to connect using a CA for trust then we need to:
            * Fetch the cert chain
            * Confirm the chain contains only 1 cert
            * Tell Deadline that SSL Authentication is not required
            """
            cert_path = os.path.join(CERT_DIR,'ca.crt')
            cert_contents = fetch_secret(config.tls_ca)
            if len( CERT_COUNT_RE.findall(cert_contents) ) != 1:
                raise ValueError("The TLS CA Cert must contain exactly 1 certificate")
            with open(cert_path, 'w') as f:
                f.write(cert_contents)
            call_deadline_command(['SetIniFileSetting', 'ProxySSLCA', cert_path])
            call_deadline_command(['SetIniFileSetting', 'ClientSSLAuthentication', 'NotRequired'])
            repo_args.append(cert_path)
        else:
            """
            If we are configuring Deadline to connect using a client cert we need to:
            * Fetch the pkcs12 binary file
            * Optionally fetch the password
            * Tell Deadline that SSL Authentication is Required
            """
            cert_path = os.path.join(CERT_DIR, 'client.pfx')
            cert_contents = fetch_secret(config.client_tls_cert)
            with open(cert_path, 'wb') as f:
                f.write(cert_contents)
            call_deadline_command(['SetIniFileSetting', 'ProxySSLCA', cert_path])
            call_deadline_command(['SetIniFileSetting', 'ClientSSLAuthentication', 'Required'])
            repo_args.append(cert_path)
            if config.client_tls_cert_passphrase:
                passphrase = fetch_secret(config.client_tls_cert_passphrase)
                repo_args.append(passphrase)
    change_repo_results = call_deadline_command(repo_args)
    if change_repo_results.startswith('Deadline configuration error:'):
        print(change_repo_results)
        raise Exception(change_repo_results)

I'm guessing that one of two things is happening:

  1. The Worker is starting up before we have a chance to kill it, and failing to connect due to the existing deadline.ini on the AMI. In which case that message in the Worker log shouldn't matter -- it wouldn't be related to any code from our UserData so it couldn't cause a deployment-fail. Solution here would be to ignore it.
  2. RFDK's Worker configuration script is leaving the ProxySSLCA field unchanged when TLS is disabled, and Deadline is being dumb by trying to verify that the file exists even though it won't be used. Solution here would be to set ProxySSLCA in our no-TLS path to an empty value.
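
If the second explanation turns out to be the right one, the no-TLS path could clear the CA setting along with disabling SSL. The following is a minimal sketch of that idea, not the change that was actually merged. The function name `configure_http_connection` is hypothetical, `call_deadline_command` is passed in as a parameter so the snippet stands alone, and it is an assumption that `SetIniFileSetting` accepts an empty string as a value.

```python
def configure_http_connection(call_deadline_command):
    """Apply the no-TLS settings and clear any CA path left in deadline.ini.

    call_deadline_command: a callable that takes a list of deadlinecommand
    arguments, like the helper of the same name in the script quoted above.
    """
    # Disable SSL for the proxy connection, as the existing HTTP branch does.
    call_deadline_command(['SetIniFileSetting', 'ProxyUseSSL', 'False'])
    # Assumption: writing an empty value stops the Worker from looking for
    # the ca.crt path that the AMI's deadline.ini ships with.
    call_deadline_command(['SetIniFileSetting', 'ProxySSLCA', ''])
```

Passing the command helper in as an argument also makes it easy to substitute a recording stub and confirm which settings get written.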


horsmand commented Oct 9, 2020

Similar to what Claire has seen, the 0.18.0 release candidate can pass the worker fleet integration tests using Deadline version 10.1.9.2. I think this confirms Daniel's suspicions that the new configuration options being set in 10.1.10 are causing a problem.
