
[BUG] MetadataRegressionIT tests fail #176

Closed
downsrob opened this issue Oct 29, 2021 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@downsrob
Contributor

downsrob commented Oct 29, 2021

The following issues track this failure in greater detail:
Issue on OpenSearch: opensearch-project/OpenSearch#1473
Issue on opensearch-build: opensearch-project/opensearch-build#953

Describe the bug
All tests under MetadataRegressionIT fail on Index Management main with the following stack trace:

java.lang.IllegalStateException: Message not fully read (request) for requestId [77], action [cluster:monitor/nodes/stats], available [9]; resetting
        at __randomizedtesting.SeedInfo.seed([FF2514888B92F199:59182A48218AD972]:0)
        at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:215)
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:120)
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102)
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:713)
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:155)
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:130)
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:95)
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:87)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:832)

This same failure occurs both locally and on the GitHub CI runner. It also occurs in any Index Management test that extends OpenSearchIntegTestCase; currently MetadataRegressionIT is the only such class, which is why the failure is isolated to it. The tests themselves behave normally but then fail during cleanup. Here is a failing build run.

To Reproduce
Steps to reproduce the behavior:
On Index Management main:
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.*"

Expected behavior
The tests should pass, and clean up without issue.

Additional context
Currently, all of Index Management's integration tests that extend OpenSearchIntegTestCase have been failing since this shard indexing pressure PR was backported to 1.x. Initial debugging suggested that our tests were not always initializing the cluster service, and that the shard indexing pressure change had an implicit dependency on that initialization. A fix was introduced to remove the dependency by including the shard indexing pressure stats whenever the version is 1.2 or later; however, this change did not fix the test failures.
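For context, the version gate that the fix relies on can be illustrated with a minimal sketch. This is not the actual OpenSearch StreamOutput/Version API; the class, constants, and field layout below are hypothetical, simplified stand-ins for the pattern where a newly added stats section is written and read only when both sides are on a version that knows about it:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified sketch of version-gated serialization (hypothetical names,
// not the real OpenSearch classes).
public class VersionGatedStats {
    static final int V_1_2_0 = 10200; // stand-in for Version.V_1_2_0

    // Writer side: only emit the new field for peers on 1.2.0 or later.
    static byte[] write(int peerVersion, long baseStat, long pressureStat) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(baseStat);
        if (peerVersion >= V_1_2_0) {
            out.writeLong(pressureStat); // new shard-indexing-pressure-style field
        }
        return bytes.toByteArray();
    }

    // Reader side must apply the exact same gate, or bytes are left unread.
    static long[] read(int peerVersion, byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        long base = in.readLong();
        long pressure = peerVersion >= V_1_2_0 ? in.readLong() : -1;
        return new long[] { base, pressure };
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = write(V_1_2_0, 7, 42);
        long[] decoded = read(V_1_2_0, payload);
        System.out.println(decoded[0] + "," + decoded[1]); // prints "7,42"
    }
}
```

The key property is symmetry: as long as writer and reader agree on the peer version, every byte written is consumed, and the transport's "message fully read" check passes.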
The failure can be traced to the afterInternal function, called from cleanUpCluster at the end of OpenSearchIntegTestCase.java. That function calls cluster().assertAfterTest(), which, because our integration tests use the external test cluster, triggers ensureEstimatedStats() in ExternalTestCluster.java. When the node stats are requested there to confirm that the cluster cleaned up successfully, the request triggers the streaming error that causes the tests to fail.

Removing the additions to the NodeStats stream in NodeStats.java and CommonStatsFlags.java from this recent PR enables the tests to pass.

Changing the test distribution in the build.gradle file from ARCHIVE to INTEG_TEST also enables the tests to pass.
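The distribution switch described above would look roughly like the following build.gradle fragment. The exact block name depends on the project's build setup (shown here for all test clusters, which is an assumption); the INTEG_TEST and ARCHIVE values are the ones referenced in this issue:

```groovy
// Hypothetical build.gradle fragment: switch the test cluster away from
// the archive snapshot distribution so the stale artifact is not used.
testClusters.all {
    testDistribution = 'INTEG_TEST' // was 'ARCHIVE'
}
```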

After running the Index Management integration tests, if I inspect opensearch-1.2.0-SNAPSHOT.jar in build/testclusters/integTest-0/distro/1.2.0-Archive/lib, I do not see the NodeStats and CommonStatsFlags changes, so an out-of-date OpenSearch 1.2.0 archive snapshot may be causing a discrepancy in the NodeStats wire format.
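The suspected mismatch can be sketched in miniature. Here the "new" writer (the test JVM with the backported change) appends extra stats that the "old" reader (the stale 1.2.0 archive snapshot) does not know about, leaving trailing bytes on the wire. The class names and the exact 9-byte layout below are invented for illustration; they happen to line up with the "available [9]" in the stack trace, but the real payload layout is an assumption:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch (hypothetical names) of a reader/writer version mismatch that
// leaves unread bytes, tripping a "Message not fully read" style check.
public class StaleJarMismatch {
    // "New" writer: base stat plus a new 8-byte stat and a 1-byte flag
    // (9 extra bytes the old reader will never consume).
    static byte[] writeNew(long base, long extra, boolean flag) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(base);
        out.writeLong(extra);
        out.writeBoolean(flag);
        return bytes.toByteArray();
    }

    // "Old" reader from the stale jar: only knows about the base stat.
    static long readOld(DataInputStream in) throws IOException {
        return in.readLong();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = writeNew(7, 42, true);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        long base = readOld(in);
        System.out.println(base + " leftover=" + in.available()); // prints "7 leftover=9"
    }
}
```

In the real transport layer, those leftover bytes are what the InboundHandler detects after dispatching the request, which matches the IllegalStateException at the top of this issue.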

@downsrob
Contributor Author

See the linked issues in OpenSearch and opensearch-build for more details. The artifact used in archive distribution tests was not being updated, causing a serialization mismatch.
