Memory leak with jaeger tracing enabled #1363

Closed
vesvalo opened this issue Apr 16, 2019 · 14 comments · Fixed by #1374

Comments

@vesvalo

vesvalo commented Apr 16, 2019

Describe the bug
Memory leak with Jaeger tracing enabled.

To Reproduce
Steps to reproduce the behavior:

  1. Run serve all with TRACING_PROVIDER="jaeger", TRACING_PROVIDERS_JAEGER_SAMPLING_TYPE="const", TRACING_PROVIDERS_JAEGER_SAMPLING_VALUE="1" (a sketch of what the const sampler amounts to follows this list)
  2. Run the test plan with JMeter
  3. Hydra's memory usage grows
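
(For reference, a "const" sampler with value 1 means every single request is sampled and reported. A minimal Go sketch of what that roughly corresponds to with jaeger-client-go is below; the agent address is a placeholder and this is not necessarily how Hydra wires up its tracer internally.)

package main

import (
	"io"
	"log"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/uber/jaeger-client-go/config"
)

// newTracer sketches what a "const" sampler with value 1 means: every
// request gets sampled and reported. The agent address is a placeholder.
func newTracer() (opentracing.Tracer, io.Closer, error) {
	cfg := config.Configuration{
		Sampler: &config.SamplerConfig{
			Type:  "const", // constant sampling decision for all traces
			Param: 1,       // 1 = sample everything, 0 = sample nothing
		},
		Reporter: &config.ReporterConfig{
			LocalAgentHostPort: "127.0.0.1:6831", // placeholder agent address
		},
	}
	return cfg.New("HYDRADEV") // service name as set via TRACING_SERVICE_NAME
}

func main() {
	tracer, closer, err := newTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)
}

Sampling everything while under load means a span is created and reported for every request, which is why the memory comparison below is done with the same test plan in both cases.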

Screenshots
[screenshot: Hydra memory usage graph]

Version:

  • Environment: CentOS Linux release 7.5.1804 (Core), Docker version 18.09.2, build 6247962, image oryd/hydra:v1.0.0-rc.8_oryOS.10, Postgres 10.7
  • Version v1.0.0-rc8

Mem profiles with and without tracing:
mem-trc-on.zip
mem-trc-off.zip
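
(It isn't stated how these profiles were captured; in general, a heap profile of a Go process can be written with runtime/pprof and then inspected with go tool pprof. A minimal sketch, with the file name as an example only:)

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes a heap profile to path; inspect it afterwards
// with: go tool pprof <binary> <path>
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC first so the profile reflects live objects
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}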

vesvalo changed the title to "Memory leak with tracing enabled" on Apr 16, 2019
vesvalo changed the title to "Memory leak with jaeger tracing enabled" on Apr 16, 2019
@aeneasr
Member

aeneasr commented Apr 16, 2019

Thank you!

Ping @aaslamin

@aaslamin
Contributor

Hi @vesvalo, what storage backend are you using for your Jaeger deployment?

The memory storage is purely for testing purposes and is not meant for production workloads, as there is no limit to how many traces will be stored; it will just grow indefinitely.

@aeneasr
Member

aeneasr commented Apr 16, 2019

Also, am I assuming correctly that in both cases (off/on) you were running the same load test for the same amount of time?

I've created graphs from the memory profiles to make analyzing this easier. It seems like the bulk of the memory usage comes from "ReadMIMEHeader" (could this be from Jaeger?) as well as WithContext.

I'd also be interested to see whether the memory usage decreases once the load test has ended. Keep in mind that this might take a couple of minutes (10-15), as Go doesn't return memory to the OS immediately. If it doesn't decrease, it's a real memory leak; otherwise it could just be that churning this much data to Jaeger takes up a lot of memory.

With tracing: [heap profile graph profile001]

Without tracing: [heap profile graph profile002]
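
(To make the "does the memory come back" check concrete, here is a sketch, not part of Hydra, of how retained heap can be watched from inside a Go process; debug.FreeOSMemory forces a GC and returns freed pages to the OS, so whatever remains afterwards is genuinely retained by live objects.)

package main

import (
	"log"
	"runtime"
	"runtime/debug"
	"time"
)

// logHeap periodically asks the runtime to return freed memory to the OS
// (FreeOSMemory also forces a GC) and logs how much heap is still in use.
// If the number never drops after the load test ends, the memory is held
// by live objects, i.e. a real leak, rather than lazily returned pages.
func logHeap(interval time.Duration) {
	for range time.Tick(interval) {
		debug.FreeOSMemory()

		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap in use: %d MiB, live objects: %d",
			m.HeapInuse/1024/1024, m.HeapObjects)
	}
}

func main() {
	go logHeap(30 * time.Second)
	select {} // stand-in for the real server loop
}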

@vesvalo
Author

vesvalo commented Apr 16, 2019

The test plan for Hydra with and without tracing is the same, and the test duration is about 15 minutes.

Jaeger is configured to use memory storage with MaxTraces: 100000 (it is not a production environment), and it is running on another server.

Jaeger's memory consumption looks normal (this test run was started at 19:51 and stopped at ~20:15):
[screenshot: Jaeger memory consumption graph]

Hydra releases some memory after the test stops, but still holds more than 2 gigabytes so far:
[screenshot: Hydra memory usage graph]

@aeneasr
Member

aeneasr commented Apr 16, 2019

OK, the release surely comes from the HTTP buffers freeing up, so there's definitely some memory leak going on. Probably some context not being released.
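
(For illustration only, this is not Hydra's code: the class of leak suspected here is per-request state parked in a long-lived store keyed by the *http.Request pointer. Tracing middleware calls r.WithContext, which returns a copy of the request, so a cleanup step that only knows the original pointer never removes the entry. A minimal sketch of that failure mode:)

package main

import (
	"log"
	"net/http"
	"sync"
)

// registry mimics a long-lived, package-level store that keys per-request
// state on the *http.Request pointer (as gorilla/context-style code does).
var (
	mu       sync.Mutex
	registry = map[*http.Request]string{}
)

func setEntry(r *http.Request, v string) {
	mu.Lock()
	registry[r] = v
	mu.Unlock()
}

func clearEntry(r *http.Request) {
	mu.Lock()
	delete(registry, r)
	mu.Unlock()
}

// tracingMiddleware stands in for an OpenTracing wrapper: r.WithContext
// returns a shallow copy, so downstream handlers see a different request
// pointer than the one the outer cleanup knows about.
func tracingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r.WithContext(r.Context()))
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	setEntry(r, "per-request state") // stored under the new request pointer
}

func cleanup(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer clearEntry(r) // deletes under the original pointer: entry leaks
		next.ServeHTTP(w, r)
	})
}

func main() {
	log.Fatal(http.ListenAndServe(":8080",
		cleanup(tracingMiddleware(http.HandlerFunc(handler)))))
}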

@aeneasr
Member

aeneasr commented Apr 18, 2019

@vesvalo could you please provide the full config for your services, e.g. a docker-compose file? @aaslamin was unable to reproduce your findings.

@vesvalo
Author

vesvalo commented Apr 18, 2019

Hydra is running in Docker on a physical server with CentOS 7.3:

docker run \
--name hydradev \
-d \
-p 5444:4444 -p 5445:4445 \
-e SECRETS_SYSTEM="somesystemsecret" \
-e DSN="postgres://postgres:postgres@localhost:5432/hydradev?sslmode=disable" \
-e HTTPS_ALLOW_TERMINATION_FROM="0.0.0.0/0" \
-e OAUTH2_TTL_LOGIN_CONSENT_REQUEST="1h" \
-e OAUTH2_TTL_ACCESS_TOKEN="24h" \
-e OAUTH2_TTL_REFRESH_TOKEN="720h" \
-e OAUTH2_TTL_ID_TOKEN="720h" \
-e OAUTH2_TTL_AUTH_CODE="720h" \
-e URLS_SELF_ISSUER="https://auth.tst" \
-e URLS_SELF_PUBLIC="https://auth.tst" \
-e URLS_LOGIN="https://auth.tst/oauth2/login" \
-e URLS_CONSENT="https://auth.tst/oauth2/consent" \
-e URLS_POST_LOGOUT_REDIRECT="https://auth.tst/oauth2/logout" \
-e URLS_ERROR="https://auth.tst/oauth2/error" \
-e OIDC_SUBJECT_IDENTIFIERS_ENABLED="pairwise" \
-e OIDC_SUBJECT_IDENTIFIERS_PAIRWISE_SALT="somesalt" \
-e OAUTH2_HASHERS_BCRYPT_COST="8" \
-e LOG_FORMAT="json" \
-e LOG_LEVEL="warning" \
-e DISABLE_TELEMETRY="1" \
-e SQA_OPT_OUT="1" \
-e TRACING_PROVIDER="jaeger" \
-e TRACING_SERVICE_NAME="HYDRADEV" \
-e TRACING_PROVIDERS_JAEGER_LOCAL_AGENT_ADDRESS="172.19.10.11:30831" \
-e TRACING_PROVIDERS_JAEGER_SAMPLING_TYPE="const" \
-e TRACING_PROVIDERS_JAEGER_SAMPLING_VALUE="1" \
-e TRACING_PROVIDERS_JAEGER_SAMPLING_SERVER_URL="http://172.19.10.11:30778/sampling" \
oryd/hydra:v1.0.0-rc.8_oryOS.10 serve all --dangerous-force-http

The login and consent app and the Jaeger instance are running in Kubernetes.
Jaeger is installed using helm chart stable/jaeger-operator (chart jaeger-operator-2.4.0, app version 1.11.0).
Jaeger's resource definition:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: default
spec:
  agent:
    image: jaegertracing/jaeger-agent:1.11
    options: null
    resources: {}
    strategy: ""
    volumeMounts: null
    volumes: null
  allInOne:
    image: jaegertracing/all-in-one:1.11
    options: null
    resources: {}
    volumeMounts: null
    volumes: null
  collector:
    image: ""
    options: null
    replicas: null
    resources: {}
    size: 0
    volumeMounts: null
    volumes: null
  ingester:
    image: ""
    options: null
    replicas: null
    resources: {}
    size: 0
    volumeMounts: null
    volumes: null
  ingress:
    enabled: false
    resources: {}
    security: none
    volumeMounts: null
    volumes: null
  query:
    image: ""
    options: null
    replicas: null
    resources: {}
    size: 0
    volumeMounts: null
    volumes: null
  resources: {}
  sampling:
    options: {}
  storage:
    cassandraCreateSchema:
      datacenter: ""
      enabled: null
      image: ""
      mode: ""
    dependencies:
      cassandraClientAuthEnabled: false
      elasticsearchClientNodeOnly: false
      elasticsearchNodesWanOnly: false
      enabled: null
      image: jaegertracing/spark-dependencies
      javaOpts: ""
      schedule: 55 23 * * *
      sparkMaster: ""
    elasticsearch:
      image: quay.io/openshift/origin-logging-elasticsearch5:latest
      nodeCount: 1
      redundancyPolicy: ""
      resources: {}
      storage: {}
    esIndexCleaner:
      enabled: null
      image: jaegertracing/jaeger-es-index-cleaner
      numberOfDays: 7
      schedule: 55 23 * * *
    esRollover:
      conditions: ""
      image: jaegertracing/jaeger-es-rollover
      readTTL: ""
      schedule: '*/30 * * * *'
    options:
      memory.max-traces: "100000"
    secretName: ""
    type: memory
  strategy: allInOne
  ui:
    options: {}
  volumeMounts: null
  volumes: null

@aeneasr
Member

aeneasr commented Apr 18, 2019

And which endpoints are you load testing?

@vesvalo
Author

vesvalo commented Apr 19, 2019

/oauth2/auth
/oauth2/auth/requests/login?challenge={challenge}
/oauth2/auth/requests/login/accept?challenge={challenge}
/oauth2/auth/requests/login/reject?challenge={challenge}
/oauth2/auth/requests/consent?challenge={challenge}
/oauth2/auth/requests/consent/accept?challenge={challenge}
/oauth2/auth/requests/consent/reject?challenge={challenge}
/oauth2/introspect
/oauth2/token

@aeneasr
Member

aeneasr commented Apr 19, 2019

Thank you! It's possible that there's a leak in the CSRF cookie handling (/oauth2/auth) in combination with tracing, @aaslamin. I'm actually pretty sure about that.

@aeneasr
Member

aeneasr commented Apr 19, 2019

@vesvalo would it be possible for you to share the config for the test plan?

@vesvalo
Author

vesvalo commented Apr 19, 2019

You'll need to change the "server name", "client_id" and "secret" parameters ("auth.tst", "some_cliend_id" and "some_secret" in this file) to appropriate values.
The "Login Thread Group" is disabled; in order to use it you'll need to register a bunch of users and export them to a CSV file. It also requires the JMeter Plugins Manager and the "Random CSV Data Set" plugin to be installed.
hydratest.zip

aeneasr added a commit that referenced this issue Apr 19, 2019
Closes #1363

Signed-off-by: aeneasr <aeneas@ory.sh>
@aeneasr
Member

aeneasr commented Apr 19, 2019

Thank you for helping us trace this! @aaslamin your hunch was right: it was indeed the gorilla/sessions package that leaked in combination with Jaeger!

@vesvalo would you mind checking #1374 to see whether it resolves the memory leak? I was able to confirm that it now runs without the memory leak.
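
(Not necessarily the exact change in #1374, but for background: gorilla/sessions historically keeps a per-request registry via gorilla/context, and its documented way to avoid leaking that registry is to wrap the handler chain in context.ClearHandler. A sketch of that pattern, with placeholder names and an example key only:)

package main

import (
	"log"
	"net/http"

	gcontext "github.com/gorilla/context"
	"github.com/gorilla/sessions"
)

// store is a placeholder cookie store; the key is an example value only.
var store = sessions.NewCookieStore([]byte("example-32-byte-long-secret-key"))

func handler(w http.ResponseWriter, r *http.Request) {
	// Get loads or creates the session and registers it for this request.
	session, _ := store.Get(r, "example-session")
	session.Values["seen"] = true
	_ = session.Save(r, w)
}

func main() {
	// ClearHandler drops the per-request entries that gorilla/context keeps
	// on behalf of gorilla/sessions once the response is written; without
	// it, or when middleware swaps out the *http.Request (as tracing
	// wrappers do via WithContext), those entries can pile up.
	log.Fatal(http.ListenAndServe(":8080",
		gcontext.ClearHandler(http.HandlerFunc(handler))))
}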

aeneasr added a commit that referenced this issue Apr 20, 2019
Closes #1363

Signed-off-by: aeneasr <aeneas@ory.sh>
@vesvalo
Author

vesvalo commented Apr 22, 2019

Our test confirms that the issue is fixed, thank you very much!
