@eshitachandwani
Member

@eshitachandwani eshitachandwani commented Jan 6, 2026

Fixes: #8801

There were two problems here:

  1. Real DNS lookups return different addresses for "localhost" on different machines. The test now replaces the DNS resolver with a manual resolver, so the resolved addresses are the same everywhere.

  2. When a test sends an EDS or cluster error, the xDS management server keeps re-sending the resource error. Each resend triggers an update from the dependency manager, and that update is pushed to a channel. To handle this, the tests use a done channel that is closed when the test ends; the update function checks the done channel and returns if it is closed. In TestAggregateClusterChildError (and other tests), the done channel was closed at the end of the test body, so it was closed only if the test passed. If the test failed, the done channel was never closed and the update function blocked forever.
    The fix makes close(done) a deferred call. It must be the first thing that runs once the test finishes, before the dependency manager is closed, so it is the last defer registered in the test.

RELEASE NOTES: None

@eshitachandwani eshitachandwani added this to the 1.79 Release milestone Jan 6, 2026
@eshitachandwani eshitachandwani added Type: Bug Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jan 6, 2026
@codecov

codecov bot commented Jan 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.40%. Comparing base (88ac703) to head (5359c2b).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8805      +/-   ##
==========================================
+ Coverage   83.30%   83.40%   +0.10%     
==========================================
  Files         418      417       -1     
  Lines       32897    32978      +81     
==========================================
+ Hits        27404    27505     +101     
+ Misses       4093     4078      -15     
+ Partials     1400     1395       -5     

see 41 files with indirect coverage changes

@easwars
Contributor

easwars commented Jan 6, 2026

Were you able to reproduce this on your setup (either on a debian linux workstation or on forge) and ensure that the fix actually solves the problem?

@easwars easwars assigned eshitachandwani and unassigned easwars and arjan-bal Jan 6, 2026
@eshitachandwani
Member Author

> Were you able to reproduce this on your setup (either on a debian linux workstation or on forge) and ensure that the fix actually solves the problem?

Yes, I was able to reproduce the hang by making the test fail on purpose, and I have verified that this change fixes it.

Contributor

@arjan-bal arjan-bal left a comment

LGTM, with some nits.

t.Fatalf("received unexpected error from dependency manager: %v", err)
}
dnsR := replaceDNSResolver(t)
dnsR.InitialState(resolver.State{
Contributor

nit: You can call UpdateState now; after last year's changes, it correctly buffers updates if the resolver hasn't been built yet. The InitialState method is now redundant.

Member Author

Done.

case err := <-watcher.errorCh:
t.Fatalf("received unexpected error from dependency manager: %v", err)
}
dnsR := replaceDNSResolver(t)
Contributor

Can you please move the DNS resolver replacement near the top of the test, before any xDS configuration is sent? This would avoid potential races where the real DNS resolver gets used before replacement.

Member Author

Done.

Comment on lines 1292 to 1301
{
Addresses: []resolver.Address{
{Addr: "127.0.0.1:8081"},
},
},
{
Addresses: []resolver.Address{
{Addr: "[::1]:8081"},
},
},
Contributor

nit: Please revert the formatting changes.

Member Author

Done.

@easwars easwars assigned eshitachandwani and unassigned easwars Jan 8, 2026

Labels

Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Test: TestAggregateCluster

3 participants