Zilla is unresponsive sometimes & the app logs Stopped but accepts the CRUD requests #1312

Open
vamsi260 opened this issue Oct 28, 2024 · 3 comments
Labels
bug Something isn't working devx Effects the Developer Experience

Comments

@vamsi260

vamsi260 commented Oct 28, 2024

Hi Team,
I'm running Zilla with a Kafka CRUD use case, focused primarily on POST. Initially the app ran fine on 0.9.89 and returned a 204 response to curl. We now see abnormal behaviour when using curl for CRUD operations: curl hangs, waiting for a response from the server. I see a log entry, but the request does not actually seem to reach the Zilla app.
October 24, 2024 at 22:00 (UTC+5:30)

Zilla:north_http_server [24/Oct/2024:16:30:47 +0000] BINDING_HTTP_REQUEST_ACCEPTED - POST http://zilla-test.exp-aws.net:80/items

Strangely, if the stuck curl is forcibly stopped and the POST is re-run a couple of times, it works and returns 204, but soon afterwards it gets stuck again. The logs show little about what is really happening. I deployed a newer image, 0.9.98, to isolate the issue, but the behaviour is the same. I'd like to know where exactly the problem lies; please advise how I can enable more detailed logging. The only change to the existing Zilla config is telemetry added to each section, which I don't think is the problem.

Observations when we try CRUD operations:

  1. Sometimes the request hangs forever waiting for a response.
  2. Sometimes we get the 204 response.
  3. Sometimes we see "Empty reply from server".
  4. Zilla abruptly logs "stopped", but still accepts requests.
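
The outcomes listed above can be told apart mechanically if each request is given a time limit. This is a minimal sketch, not an official Zilla tool: it relies on curl's documented exit codes (28 for a timed-out operation, 52 for an empty reply from the server), while the helper names `classify_curl_result` and `probe_once`, the 10-second cap, and the JSON payload are assumptions made for illustration.

```shell
# Map a curl exit code and HTTP status to one of the observed outcomes.
# classify_curl_result is a hypothetical helper, not part of Zilla or curl.
classify_curl_result() {
  local exit_code="$1" http_status="$2"
  case "$exit_code" in
    0)
      if [ "$http_status" = "204" ]; then
        echo "ok-204"
      else
        echo "unexpected-status-$http_status"
      fi
      ;;
    28) echo "timed-out" ;;    # the "stuck" case, bounded by --max-time
    52) echo "empty-reply" ;;  # curl reports "Empty reply from server"
    *)  echo "curl-error-$exit_code" ;;
  esac
}

# Send one POST with a 10-second cap and classify the result.
# The payload is a placeholder (assumption); pass the real endpoint URL.
probe_once() {
  local url="$1" status rc
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -X POST -H 'Content-Type: application/json' -d '{"probe":true}' "$url")
  rc=$?
  classify_curl_result "$rc" "$status"
}
```

Running `probe_once http://zilla-test.exp-aws.net:80/items` in a loop and counting the labels would show how often each outcome occurs, which may help correlate the hangs with entries in the Zilla log.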

Here is some additional info. The following appeared in the log when a load run was triggered from JMeter (8 MB/s of network load: 800 TPS with a 10 KB payload size):

2024-10-29T16:59:36.649+05:30.       java.lang.IllegalArgumentException: invalid offset: -65536
2024-10-29T16:59:36.649+05:30.        stopped
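
To check whether every "stopped" entry is preceded by an exception like the one above, the exported log can be scanned for such pairs. This is a rough sketch that assumes the log has been saved to a local text file with one entry per line; the function name `find_engine_stops` is hypothetical.

```shell
# Print each "stopped" line together with the last exception line seen before it.
# find_engine_stops is a hypothetical helper; $1 is a locally saved log file.
find_engine_stops() {
  awk '
    /Exception/ { last_exception = $0 }   # remember the most recent exception
    /stopped[[:space:]]*$/ {              # a line that ends with "stopped"
      print "exception: " last_exception
      print "stopped:   " $0
      last_exception = ""
    }
  ' "$1"
}
```

For example, `find_engine_stops zilla.log` (the file name is an assumption) would print two lines per stop event, making it easier to see whether the `invalid offset` error always comes first.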

Attached a diagnostic archive: zilla_20241028121032.tar.gz

Env: Zilla is deployed on AWS ECS Fargate. The Kafka implementation is Confluent Cloud.

zilla.yaml config:

---
name: Zilla
bindings:
  north_tcp_server:
    type: tcp
    kind: server
    options:
      host: 0.0.0.0
      port:
        - 7114
    routes:
      - when:
          - port: 7114
        exit: north_http_server
    telemetry:
      metrics:
        - stream.*
  north_http_server:
    type: http
    kind: server
    routes:
      - when:
          - headers:
              :scheme: http
        exit: north_http_kafka_mapping
    telemetry:
      metrics:
        - http.*
  north_http_kafka_mapping:
    type: http-kafka
    kind: proxy
    routes:
      - when:
          - method: POST
            path: /items
        exit: north_kafka_cache_client
        with:
          capability: produce
          topic: events
          key: $${idempotencyKey}
      - when:
          - method: DELETE
            path: /items/{id}
        exit: north_kafka_cache_client
        with:
          capability: produce
          topic: events
          key: $${params.id}
    telemetry:
      metrics:
        - stream.*
        - http.*
  north_kafka_cache_client:
    type: kafka
    kind: cache_client
    exit: south_kafka_cache_server
  south_kafka_cache_server:
    type: kafka
    kind: cache_server
    options:
      bootstrap:
        - events
    exit: south_kafka_client
  south_kafka_client:
    type: kafka
    kind: client
    options:
      servers:
        - ${kafka_bootstrap_host}:9092
      sasl:
        mechanism: plain
        username: ${kafka_username}
        password: ${kafka_password}
    exit: south_tls_client
  south_tls_client:
    type: tls
    kind: client
    exit: south_tcp_client
  south_tcp_client:
    type: tcp
    kind: client
telemetry:
  metrics:
    - http.active.requests
    - http.duration
    - http.request.size
    - http.response.size
    - stream.active.received
    - stream.active.sent
    - stream.opens.received
    - stream.opens.sent
    - stream.data.received
    - stream.data.sent
    - stream.errors.received
    - stream.errors.sent
    - stream.closes.received
    - stream.closes.sent
  exporters:
    stdout_logs_exporter:
      type: stdout
    prometheus_metric_exporter:
      type: prometheus
      options:
        endpoints:
          - scheme: http
            path: /metrics
            port: 7190
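
Since the config above exposes Prometheus metrics on port 7190 at /metrics, the error counters can be watched while reproducing the hang. This is a minimal sketch that sums one metric across all bindings from Prometheus text-format input. The helper name `sum_metric` is hypothetical, and the exported metric name used in the usage note (`stream_errors_received`) is an assumption; check the actual /metrics output for the exact names.

```shell
# Sum all samples of one metric from Prometheus text-format input on stdin.
# sum_metric is a hypothetical helper; it assumes integer counters and
# sample lines without a trailing timestamp.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                                 # skip HELP/TYPE comment lines
    {
      metric = $1
      brace = index(metric, "{")                  # strip the label set, if any
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) total += $NF            # last field is the sample value
    }
    END { printf "%d\n", total }
  '
}
```

From inside the container this could be used as `curl -s http://127.0.0.1:7190/metrics | sum_metric stream_errors_received`, comparing the value before and after a stuck request.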

Attached.
zilla.yaml.zip

Slack ref: https://aklivitycommunity.slack.com/archives/C027SEH2KRC/p1730119696631149?thread_ts=1729788282.245839&cid=C027SEH2KRC

Let me know if you need anything more!

@vamsi260 vamsi260 added the bug Something isn't working label Oct 28, 2024
@vordimous vordimous added the devx Effects the Developer Experience label Oct 28, 2024
@bmaidics
Contributor

bmaidics commented Nov 7, 2024

Container definition used:

{
  "name": "zilla",
  "image": "${docker_image}:${docker_version}",
  ${docker_auth}
  "essential": true,
  "command": [ "start" ],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "${loggroup}",
      "awslogs-region": "${region}",
      "awslogs-stream-prefix": "/"
    }
  },
  "dockerLabels": {
    "com.datadoghq.ad.instances": "[{\"openmetrics_endpoint\": \"http://127.0.0.1:7190/metrics\", \"namespace\": \"zilla\", \"metrics\": [\"metastore_status\",\"counter_*\",\"jvm_*\",\"system_*\",\"timer_*\",\"http_*\",\"process_*\",\"stream_*\"], \"tags\": { \"region\":\"${region}\"} }]",
    "com.datadoghq.ad.check_names": "[\"openmetrics\"]",
    "com.datadoghq.ad.init_configs": "[{}]",
    "com.datadoghq.tags.service": "zilla",
    "com.datadoghq.tags.slug": "${env_name}-${region}",
    "com.datadoghq.tags.flavor": "${env_name}"
  },
  "portMappings": [
    {
      "containerPort": 7114,
      "hostPort": 7114
    },
    {
      "containerPort": 7190,
      "hostPort": 7190
    }
  ],
  "healthCheck": {
    "command": ["CMD", "bash", "-c", "echo -n '' > /dev/tcp/127.0.0.1/7114"],
    "interval": 5,
    "timeout": 3,
    "retries": 5,
    "startPeriod": 10
  },
  "environment": [
    {
      "name": "HEAPSIZE",
      "value": "${heapsize}"
    },
    {
      "name": "S3_BUCKET_NAME",
      "value": "${s3_bucket_name}"
    },
    {
      "name": "ZILLA_YAML",
      "value": "${zilla_yaml}"
    },
    {
      "name": "LOG4J_FORMAT_MSG_NO_LOOKUPS",
      "value": "true"
    }
  ],
  "ulimits": [
    {
      "name": "nofile",
      "softLimit": 65536,
      "hardLimit": 65536
    },
    {
      "name": "nproc",
      "softLimit": 65536,
      "hardLimit": 65536
    }
  ],
  "systemControls": [
    {
      "namespace": "net.ipv4.tcp_keepalive_time",
      "value": "${tcp_keepalive_time}"
    },
    {
      "namespace": "net.ipv4.tcp_keepalive_intvl",
      "value": "${tcp_keepalive_intvl}"
    },
    {
      "namespace": "net.ipv4.tcp_keepalive_probes",
      "value": "${tcp_keepalive_probes}"
    }
  ]
}

The task is allocated 4 vCPU and 16 GB of memory.
tcp_keepalive_time = 200 sec
tcp_keepalive_intvl = 30 sec
tcp_keepalive_probes = 2
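
As a rough worked calculation, assuming standard Linux keepalive semantics (the first probe is sent after `tcp_keepalive_time` seconds of idleness, then one probe per `tcp_keepalive_intvl` until `tcp_keepalive_probes` go unanswered), the values above imply how long an idle, dead connection can linger before the kernel drops it. The function name is hypothetical, and these sysctls only affect sockets that have SO_KEEPALIVE enabled.

```shell
# Worst-case seconds before the kernel declares an idle, dead TCP peer gone:
# idle time before the first probe, plus one interval per unanswered probe.
# keepalive_detection_seconds is a hypothetical helper for illustration.
keepalive_detection_seconds() {
  local time_s="$1" intvl_s="$2" probes="$3"
  echo $(( time_s + intvl_s * probes ))
}

keepalive_detection_seconds 200 30 2   # 200 + 30 * 2 = 260
```

With these settings, a silently dropped connection could stay open for roughly 260 seconds, which may be worth comparing against any idle timeout on the network path between Zilla and Kafka.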

@bmaidics
Contributor

bmaidics commented Nov 7, 2024

My suspicion is that something is wrong with Zilla's Kafka connection. Can you verify that Kafka is reachable from Zilla's container?
How I did it:

aws ecs execute-command  \
    --region eu-west-1 \
    --cluster [cluster-name] \
    --task [task id, for example 0f9de17a6465404e8b1b2356dc13c2f8] \
    --container [container name from the task definition] \
    --command "/bin/bash" \
    --interactive

This opens a terminal in the ECS container.
For this you'll need to install the Session Manager plugin. You might also need to add some additional roles if your execute-command fails: https://repost.aws/knowledge-center/ecs-error-execute-command
Once you're inside, run:

apt-get update
apt-get install netcat

Check if you can reach the bootstrap server:
nc -zv <bootstrap server set for zilla> 9092
Please copy the bootstrap server shown in the AWS console for the ECS task, under the Environment variables and files tab.
It should say:
Connection to <bootstrap server set for zilla> (xx.xx.xx.xx) 9092 port [tcp/*] succeeded!
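
If installing netcat in the container is not possible, the same reachability check can be sketched with bash's built-in /dev/tcp redirection, which the container health check already relies on. The function name `can_connect` and the 5-second default are assumptions, and the sketch needs `bash` and the coreutils `timeout` command to be present in the image.

```shell
# Return 0 if a TCP connection to host:port opens within the time limit.
# can_connect is a hypothetical helper; usage: can_connect HOST PORT [SECONDS]
can_connect() {
  local host="$1" port="$2" wait_s="${3:-5}"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `can_connect "<bootstrap server set for zilla>" 9092 && echo reachable`. Like the netcat check, this only shows that the TCP port is reachable; it does not confirm that the TLS handshake or SASL authentication succeeds.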

@vamsi260
Author

I've observed the same behaviour when hitting the Zilla endpoint with the new version (0.9.101). The Zilla container is able to connect to the Kafka cluster: I verified connectivity to Kafka from inside the container with the netcat command while running curl against the Zilla endpoint from another terminal. I've shared the diagnostic log from the new version (0.9.101) so it can be reviewed for any traces.
Adding the Slack conversation here for visibility: https://aklivitycommunity.slack.com/archives/C027SEH2KRC/p1731657046825739?thread_ts=1729788282.245839&cid=C027SEH2KRC
