[DNM] Fixes for podman v4 #1573

Draft · s0me0ne-unkn0wn wants to merge 1 commit into main
Conversation

s0me0ne-unkn0wn

I've spent a weekend playing around with larger zombienets using podman 4.7.2 as a provider. This PR is a set of dirty fixes and ugly hacks I made to get things working. None of them is complete, well-tested, or carefully designed; I'm not ready to wear a Typescript developer's hat. I just hope someone will pick this PR up and make things pretty using the ideas laid out here. With the patches in this PR, I'm currently running 20 validators and 6 parachains on my local machine, and the network starts up smoothly every time.

I'm going to leave some comments in a self-review to make it easier to understand the underlying ideas.

- } else {
-   uniqueName = `${name}-${mUsedNames[name]}`;
-   mUsedNames[name] += 1;
+ let uniqueName = `${name}-${generateNamespace()}`;
Collaborator

This was causing naming collisions, right? There is another issue with the cleanup, but using some randomness here sounds good :)

Author

podman's --filter option expects a regular expression rather than an exact match. Thus, podman pod ps -f name=temp returns an array of statuses for every pod whose name contains temp, so you'll get temp, temp-2, and even tempo, and checking just the first element of that array doesn't guarantee you're looking at the very pod you asked for.

My naive attempt to anchor it as name='^temp$' didn't succeed for some reason, so I ended up just assigning unique names to nodes. This should definitely be improved.
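
For reference, one alternative to assigning unique names would be to keep the filter but post-filter podman's JSON output by exact name. A minimal sketch of that idea (the helper name getPodStatus and the JSON field names are my assumptions, not code from this PR):

```typescript
import { execFileSync } from "child_process";

// Hypothetical helper: since `--filter name=` matches as a regex/substring,
// post-filter the JSON result by exact name so that "temp" never matches
// "tempo" or "temp-2". Field names (Name, Status) assume podman's JSON output.
function getPodStatus(podName: string): string | undefined {
  const out = execFileSync(
    "podman",
    ["pod", "ps", "--filter", `name=${podName}`, "--format", "json"],
    { encoding: "utf8" },
  );
  const pods: Array<{ Name: string; Status: string }> = JSON.parse(out);
  return pods.find((p) => p.Name === podName)?.Status;
}
```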

@@ -56,6 +56,7 @@ const TMP_DONE = "echo done > /tmp/zombie-tmp-done";
const TRANSFER_CONTAINER_WAIT_LOG = "waiting for tar to finish";
const NODE_CONTAINER_WAIT_LOG = "waiting for copy files to finish";
const WAIT_UNTIL_SCRIPT_SUFIX = `until [ -f ${FINISH_MAGIC_FILE} ]; do echo ${NODE_CONTAINER_WAIT_LOG}; sleep 1; done; echo copy files has finished`;
+ const PODMAN_WAIT_UNTIL_SCRIPT_SUFIX = `sleep 5`; // FIXME
Author

I was surprised to find out that writing the magic file is a no-op for the podman provider. I'm not sure how it was even supposed to work, so I just put in a long enough delay for now.

Collaborator

Yes, I think there was a change in how the pods run in podman between v3 and v4 that removed the need for the magic file. I can review it on v3.

@@ -432,7 +434,7 @@ export class PodmanClient extends Client {

await this.createResource(podDef, false, false);

- await this.wait_pod_ready(name);
+ await this.wait_pod(name, waitExit ? ["Exited"] : ["Running"]);
Author

Race conditions are a big problem for podman. If we're spinning up a node, we want to wait until the pod is running. But if we're generating a chainspec, we need to wait until the pod has finished its job. Every second time the chainspec was generated, I ended up with an empty chainspec in the network directory: we only waited for the pod to start and then tried to copy the chainspec right away, but it wasn't ready yet. The pod had started writing it but hadn't finished, so an empty file got copied.

To address that, I added an explicit waiting condition, and now the races are gone (mostly; see the other comments).
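
To illustrate the waiting condition, here is a rough sketch of what a wait_pod-style poller could look like. This is not the PR's actual implementation, just the idea, under the same assumptions about podman's JSON output as in the sketch above:

```typescript
import { execFileSync } from "child_process";

// Sketch of a wait_pod(name, states)-style poller: wait for ["Running"] when
// spinning up a node, or for ["Exited"] when a one-shot job such as chain-spec
// generation must finish before its output is copied out.
async function waitPod(name: string, states: string[], timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const out = execFileSync(
      "podman",
      ["pod", "ps", "--filter", `name=${name}`, "--format", "json"],
      { encoding: "utf8" },
    );
    const pod = (JSON.parse(out) as Array<{ Name: string; Status: string }>).find(
      (p) => p.Name === name, // exact match, see the filter caveat above
    );
    if (pod && states.some((s) => pod.Status.startsWith(s))) return;
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`pod ${name} did not reach ${states.join("/")} within ${timeoutMs} ms`);
}
```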

Collaborator

Yes, we can reuse the same approach from k8s here: wait for a specific log line to detect that the chain-spec is ready.
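
For completeness, that k8s-style approach could look roughly like this with podman: poll the container logs until an expected marker line shows up. The marker and helper name are illustrative, not zombienet's actual code:

```typescript
import { execFileSync } from "child_process";

// Illustrative sketch: poll `podman logs` until a marker line appears,
// signalling that the chain-spec has been fully written.
async function waitForLogLine(container: string, marker: string, timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const logs = execFileSync("podman", ["logs", container], { encoding: "utf8" });
    if (logs.includes(marker)) return;
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`marker "${marker}" not seen in logs of ${container}`);
}
```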

@@ -449,6 +451,15 @@ export class PodmanClient extends Client {
localFilePath: string,
): Promise<void> {
debug(`cp ${this.tmpDir}/${identifier}${podFilePath} ${localFilePath}`);
+ while (true) {
Author

To copy a file, we need that file to be ready, and that was not always the case: sometimes it's just not there yet, and sometimes the pod has started writing something but the file is still zero-length when we try to copy it. I don't remember the exact reason why I couldn't use wait_pod for the Exited status (as implemented above) in this particular case, so I implemented this loop as my best effort to address the problem. Given that the pause between iterations is 1 sec and PODMAN_WAIT_UNTIL_SCRIPT_SUFIX is a 5 sec sleep, it always succeeds. Nevertheless, we could definitely use a better design here.
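
The retry idea described above, expressed as a self-contained sketch (names and the timeout are illustrative; this is not the exact loop from the PR): wait until the source file both exists and is non-empty before copying it out of the pod's tmp directory.

```typescript
import { promises as fs } from "fs";

// Wait until `path` exists and has a non-zero size; the 1 s pause matches the
// loop described above, the timeout is an added safety net.
async function waitForNonEmptyFile(path: string, timeoutMs = 30_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const stat = await fs.stat(path);
      if (stat.size > 0) return; // file is there and has content
    } catch {
      // file does not exist yet
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`timed out waiting for ${path}`);
}
```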

- if (waitReady) await this.wait_pod_ready(name);
+ if (waitReady) await this.wait_pod(name);

if (typeof this.podSuffix === "undefined") {
Author

The podman v3 and v4 resource naming schemes are different. Here, I try to guess which one is in use (because I believe the naming changed not exactly in v4.0.0 but in some v4.x release). Beware: this wasn't tested with podman v3; I only made sure it works with v4.
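
The guessing step could, for example, probe podman for the resource name produced by each candidate scheme and keep whichever one exists. The sketch below only illustrates that idea; the two candidate patterns are placeholders, not the real v3/v4 naming rules:

```typescript
import { execFileSync } from "child_process";

// Try each candidate naming scheme and return the one podman recognises.
// The patterns below are placeholders for illustration only.
function detectResourceName(podName: string, containerName: string): string {
  const candidates = [
    `${podName}-${containerName}`, // hypothetical "new" scheme
    `${podName}_${containerName}`, // hypothetical "old" scheme
  ];
  for (const candidate of candidates) {
    try {
      // `podman container exists` exits 0 if the container is known
      execFileSync("podman", ["container", "exists", candidate]);
      return candidate;
    } catch {
      // not this scheme, try the next one
    }
  }
  throw new Error(`could not detect the naming scheme for pod ${podName}`);
}
```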

@@ -34,7 +34,7 @@ export interface ContainerPort {
export interface Container {
image: string;
name: string;
- imagePullPolicy: "Always";
+ imagePullPolicy: string;
Author

Forcing this to Always triggers Docker Hub's rate limit rather quickly. I'm manually patching it to IfNotPresent for individual resources when doing a lot of iterations; hence, some freedom is needed at the interface level. I'd love to see a config-file option to explicitly set the image pull policy for every image.

Collaborator

Yes, this should be configurable through the network config and default to IfNotPresent (both here and in k8s).
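
A minimal sketch of what that could look like, keeping the interface type-safe while allowing the policy to come from the network config (the union type, helper, and default are a suggestion, not code from this PR):

```typescript
// Union type instead of the hard-coded "Always" literal.
type ImagePullPolicy = "Always" | "IfNotPresent" | "Never";

export interface Container {
  image: string;
  name: string;
  imagePullPolicy: ImagePullPolicy;
}

// Resolve the policy from a (hypothetical) network-config value,
// defaulting to IfNotPresent as suggested in this thread.
function resolvePullPolicy(fromConfig?: string): ImagePullPolicy {
  if (fromConfig === "Always" || fromConfig === "Never") return fromConfig;
  return "IfNotPresent";
}
```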

@pepoviola (Collaborator)

Hey @s0me0ne-unkn0wn, thanks for your contribution 🙌, it looks great! I will check whether something is broken in podman v3.
Also, it's worth noting that we plan to deprecate this provider (at least as it is now) in the new version in favor of docker as a provider.

Thanks again for your contribution!!
