Skip to content

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Oct 16, 2025

Ensure that DNS lookups for all internal services work on the example system, through both the standard and qorb DNS resolvers.

I did notice that the example system doesn't set up every service. Something worth doing in a followup, but this does catch issues like (manually tested) the repo depot response being too large.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this!

}

/// Returns the list of DNS service names expected in an example system.
fn service_names_to_query(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought was that we'd use strum to iterate over the variants and then maybe a match to say what we expected for each one (sort of combining this function with match_service_names). I guess that won't quite work because of the variants that have data? I see that strum will happily give these to you anyway with data filled in with default() but that won't match what's in our system. Is that we have to unroll all this?

Suggestion: do the above anyway, but wherever we have a variant with the default data, we just replace it with the data that we want from the example system. I'm not sure but I hope that'll be less code, clearer, and impossible to get out of sync in the way that you could today by adding something to match_service_names but not this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So strum is what I first tried to use, but the Default impl for the variants with data just didn't sit right with me. But I agree that the scheme I did come up with is pretty fragile, so maybe the strum approach is the least bad one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this seems reasonable.

logctx.cleanup_successful();
}

/// Returns the list of DNS service names expected in an example system.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite right since there are service names here that aren't expected in the example system, and there are values here too. I guess it's a map with:

  • all DNS service names that carry no data, plus
  • DNS service names that do carry data, with specific data for the example system

and the values are what you'd expect to find by looking each one up in the example system.

Admittedly that's awkward to describe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded as:

    /// Returns the list of potential DNS service names in an example system,
    /// along with whether we expect a successful or error response for each of
    /// them.

Comment on lines 1459 to 1460
// When adding a new variant, ensure that it is covered by
// service_names_to_query above.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is alright but I feel like it's going to be too easy for someone to see this compile error and just add a new entry here without adding the case above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via strum.


out.insert(ServiceName::CruciblePantry, Ok(()));

// XXX the SledAgent service name doesn't appear to be used?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably is unused but arguably should be. I think we could incorporate it in the same way that you did for Crucible below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we don't even register the sled agent service names within blueprint_internal_dns_config today. Have added these to the map with a response of QueryError::NoRecordsFound.

Comment on lines 1209 to 1213
// is fragmented. Add a short timeout (5s) for this.
let lookup_fut = tokio::time::timeout(
Duration::from_secs(5),
resolver.lookup_all_socket_v6(service),
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the client not already do this? I'd rather not add a shorter timeout for a case that we don't expect to come up in a correct test run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client does do this with a 15s timeout. I just thought that wasn't a good experience. I don't care strongly either way so I'll just rely on the client timeout.

);
}

// TODO: add assertions on the returned addresses.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: add assertions on the returned addresses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should check that the addresses match what we expect? I kind of skipped over that because I wanted to detect packet fragmentation first.

// Wait for at least one result to be returned. Like above, we
// don't want to wait forever, so set a reasonable timeout.
let backends = match tokio::time::timeout(
Duration::from_secs(5),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we probably don't already have a client timeout here, I'd make this more generous -- maybe 30s? To avoid flakes on a busy system. If the test is working, we won't hit it, right?

Or are we querying for ones that we expect to find no records for? I think I would just leave those out. We don't care to verify that we don't have records for stuff that's not in the example system. And it's hard with this particular interface to identify that case.

Copy link
Contributor Author

@sunshowers sunshowers Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used a 15s timeout to match the simple DNS resolver timeout.

Or are we querying for ones that we expect to find no records for? I think I would just leave those out. We don't care to verify that we don't have records for stuff that's not in the example system. And it's hard with this particular interface to identify that case.

No, I skip over those checks with the qorb resolver since that is level-triggered (only test them against the simple resolver)

}
};

// TODO: add assertions on the returned addresses.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: add assertions on the returned addresses.

Created using spr 1.3.6-beta.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants