Support for multi node clusters and logging #3448

Closed
monusingh-1 opened this issue Apr 26, 2023 · 15 comments
Labels
enhancement New Enhancement

Comments

@monusingh-1
Contributor

Is your feature request related to a problem? Please describe

During integration tests of cross-cluster replication, clusters with the following topology are created:
Leader: 2 data nodes
Follower: 2 data nodes

The data nodes join to form a cluster in these distribution builds:
2.7.0/opensearch-2.7.0-test.yml, 2.7.0, arm64, 7759, tar
2.7.0/opensearch-2.7.0-test.yml, 2.7.0, x64, 7759, tar
https://build.ci.opensearch.org/blue/organizations/jenkins/integ-test/detail/integ-test/4653/pipeline

The nodes fail to join and form a common cluster in these distributions:
2.7.0/opensearch-2.7.0-test.yml, 2.7.0, x64, 7764, deb
2.7.0/opensearch-2.7.0-test.yml, 2.7.0, arm64, 7764, rpm

https://build.ci.opensearch.org/blue/organizations/jenkins/integ-test/detail/integ-test/4658/pipeline

https://build.ci.opensearch.org/job/integ-test/4660/execution/node/858/log/?consoleFull
In the console log linked above, search for:
Node 1 at 2023-04-26 08:51:30
Node 2 at 2023-04-26 08:50:57
There we can see:

"number_of_nodes":1,"number_of_data_nodes":1
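
The single-node result above could be caught automatically before the tests run. The following is a minimal sketch, assuming the fragment comes from a cluster health response and that each cluster in this topology should report two data nodes. `has_expected_data_nodes` is a hypothetical helper name and is not part of the test workflow.

```python
import json


def has_expected_data_nodes(health_body: str, expected_data_nodes: int) -> bool:
    """Return True if the health response reports the expected data node count."""
    health = json.loads(health_body)
    return health.get("number_of_data_nodes") == expected_data_nodes


# Fragment from the log above, wrapped in braces so it parses as JSON.
observed = '{"number_of_nodes":1,"number_of_data_nodes":1}'

# The topology expects 2 data nodes per cluster, so this check fails.
print(has_expected_data_nodes(observed, expected_data_nodes=2))
```

Failing fast on a check like this would surface the join problem directly, instead of through later connection errors in the tests.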

After investigating, we confirmed that when the data nodes fail to join, the clusters are not reachable and the integ tests start throwing:

java.net.ConnectException: Connection refused
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:953)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:332)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at org.opensearch.replication.MultiClusterRestTestCase.stopAllReplicationJobs(MultiClusterRestTestCase.kt:424)
        at org.opensearch.replication.MultiClusterRestTestCase.wipeIndicesFromCluster(MultiClusterRestTestCase.kt:448)
        at org.opensearch.replication.MultiClusterRestTestCase.wipeCluster(MultiClusterRestTestCase.kt:374)
        at org.opensearch.replication.MultiClusterRestTestCase.wipeClusters(MultiClusterRestTestCase.kt:369)
        at jdk.internal.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)

Due to insufficient cluster-level logging, we are not able to investigate why the nodes fail to join in some distributions.

Describe the solution you'd like

  1. Better support for multi-node, multi-cluster topologies
  2. Support for uploading the cluster logs of all data nodes
  3. The current method of fetching cluster logs is hacky. A dashboard to view and download the logs of individual nodes and clusters would be great.

Describe alternatives you've considered

No response

Additional context

No response

monusingh-1 added the enhancement and untriaged labels on Apr 26, 2023
@monusingh-1
Contributor Author

@rishabh6788 could you please look into this?

@bbarani
Member

bbarani commented Apr 27, 2023

Tagging @gaiksaya and @peterzhuamazon as well.

rishabh6788 removed the untriaged label on Apr 28, 2023
@monusingh-1
Contributor Author

Hi team, can we have an update on this issue?

@gaiksaya
Member

gaiksaya commented Jun 1, 2023

Hi @monusingh-1,

We recently enhanced our testing logs. See #3381 for more information. You can now access all logs for a given distribution.

Also, the integration testing code is located at:
https://github.com/opensearch-project/opensearch-build/blob/main/src/test_workflow/integ_test/distribution_rpm.py
https://github.com/opensearch-project/opensearch-build/blob/main/src/test_workflow/integ_test/distribution_deb.py

Can you look into that if you have the bandwidth and see whether the issue is in our test workflow or in the distribution itself? You can also try installing manually to see whether the cluster for those distributions comes up as expected: https://opensearch.org/docs/latest/install-and-configure/index/

@monusingh-1
Contributor Author

Hi @gaiksaya, #3381 does not account for the case where multiple clusters are running according to the specified topology.
Example: https://ci.opensearch.org/ci/dbc/integ-test/2.8.0/7935/linux/x64/deb/test-results/5105/integ-test/cross-cluster-replication/with-security/local-cluster-logs/stdout.txt

@bbarani
Member

bbarani commented Jun 5, 2023

@zelinh @gaiksaya Do we need to handle this scenario as well?

@rishabh6788
Collaborator

For yum and deb, we need to run two systemctl services for opensearch, one on port 9200 and the other on port 9300, to be able to run them on the same EC2 host.
This will require a deep dive and a PoC to confirm how we can modify the repo files to run two separate clusters on different ports on the same host.

@monusingh-1
Contributor Author

@rishabh6788, the test workflow can currently create multiple clusters when they are specified in the manifest. The pain point is that individual cluster logs are unavailable.

@zelinh
Member

zelinh commented Jun 16, 2023

I think there is a logging issue that applies to all distribution types, including tarball. There are two individual clusters, and each of them creates its own cluster logs (opensearch-service-logs). Our logging implementation (test-recorder) doesn't handle this case, so it wasn't recording both of them, which causes the missing logs and the failure.

@monusingh-1
Contributor Author

Hi zelinh, is this issue on the roadmap?

@monusingh-1
Contributor Author

For example:

FileExistsError: [Errno 17] File exists: '/var/jenkins/workspace/integ-test@4/test-results/5327/integ-test/cross-cluster-replication/with-security/local-cluster-logs/opensearch-service-logs'
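
The error above is consistent with two clusters being recorded under the same `opensearch-service-logs` path. One way to avoid that kind of collision is to give each cluster its own directory. The following is a minimal sketch under that assumption: `cluster_log_dir` and the `cluster_id` suffix are hypothetical names for illustration, and this is not the actual test-recorder code or the fix that was later merged.

```python
import os
import tempfile


def cluster_log_dir(base_dir: str, cluster_id: str) -> str:
    """Create and return a log directory that is unique to one cluster."""
    path = os.path.join(base_dir, f"opensearch-service-logs-{cluster_id}")
    # exist_ok=True avoids FileExistsError if the directory was already created.
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrate with a temporary base directory standing in for local-cluster-logs.
with tempfile.TemporaryDirectory() as base:
    leader_dir = cluster_log_dir(base, "leader")
    follower_dir = cluster_log_dir(base, "follower")
    print(os.path.basename(leader_dir))
    print(os.path.basename(follower_dir))
```

With a distinct suffix per cluster, the leader and follower logs no longer compete for one directory name, and repeated calls for the same cluster do not raise.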

@monusingh-1
Contributor Author

@gaiksaya could we get some traction on this?

@bbarani
Member

bbarani commented Aug 21, 2023

@monusingh-1 @ankitkala This is not on our priority list. We will try to look into it when we have bandwidth. That said, it would be great if you could contribute a fix for this issue to get it in early.

@bbarani
Member

bbarani commented Oct 17, 2023

@peterzhuamazon @zelinh @gaiksaya Can we test and close this issue since the PR is merged?

@gaiksaya
Member

Thanks for the contribution, @monusingh-1. The change works as expected. I tested it with the 2.11.0 artifact here: https://build.ci.opensearch.org/blue/organizations/jenkins/integ-test/detail/integ-test/6457/pipeline

Closing this issue!


6 participants