Commit
Reduce flakiness in test fts_segment_reset
Test fts_segment_reset has been flaky because FTS sometimes still promotes the mirror when the primary takes a bit longer to restart after getting out of the RESET stage. An example:

- Primary 0 gets out of RESET and is about to be restarted:

  2022-05-23 15:32:53.924540 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4284,

- It takes primary 0 about 2-3 seconds to do so:

  2022-05-23 15:32:56.184117 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","database system is ready to accept connections"

- Unfortunately, before primary 0 can finish restarting, FTS makes one last probe and finds that it is in recovery mode and not making progress (which is "correct" because primary 0 has finished recovery):

  2022-05-23 15:32:56.009206 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: detected segment is in recovery mode and not making progress (content=0) primary dbid=2, mirror dbid=5",,,,,,,0,,"ftsprobe.c",254,
  2022-05-23 15:32:56.065399 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS max (5) retries exhausted (content=0, dbid=2) state=9",,,,,,,0,,"ftsprobe.c",788

Currently, we let the primary stay in the RESET stage for 27 seconds. FTS has a default retry cycle of 5 seconds, at the end of which it makes the promotion decision. That leaves about 3 seconds for the primary to start after getting out of RESET, which is probably too short.

Now make the retry cycle 15 seconds and set the RESET delay to 17 seconds. That leaves about 13 seconds for the primary to start after RESET, which should be ample to reduce the common flakiness.
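
As a quick check on the timings in the log excerpts above, the following minimal Python sketch computes how long primary 0 took to restart and by how much FTS's final decision preceded it. The timestamps are copied from the log lines in this message; the variable and function names are illustrative and are not part of the test or of FTS.

```python
from datetime import datetime

# Timestamp layout used at the start of the log lines quoted above
# (the trailing " UTC" is dropped before parsing).
FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(text):
    """Parse a log timestamp such as '2022-05-23 15:32:53.924540'."""
    return datetime.strptime(text, FMT)


# "all server processes terminated; reinitializing" (postmaster.c)
reinit = parse_ts("2022-05-23 15:32:53.924540")
# "FTS max (5) retries exhausted" (ftsprobe.c)
retries_exhausted = parse_ts("2022-05-23 15:32:56.065399")
# "database system is ready to accept connections"
ready = parse_ts("2022-05-23 15:32:56.184117")

# Time primary 0 needed to come back after leaving RESET.
restart_seconds = (ready - reinit).total_seconds()
# How far ahead of the primary becoming ready FTS gave up.
missed_by_seconds = (ready - retries_exhausted).total_seconds()

print(f"primary restart took {restart_seconds:.3f}s")
print(f"FTS exhausted retries {missed_by_seconds:.3f}s before the primary was ready")
```

In this run the restart took a little over 2 seconds and FTS gave up roughly a tenth of a second before the primary was ready, which is why a margin of about 3 seconds is too tight.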