
Bug 1625330 - Reduce DEFAULT_BATCH_MAX_DELAY to avoid OOM errors #1233

Merged: relud merged 1 commit into master from sink-file-loads-faster on Apr 9, 2020

Conversation


@relud relud commented Apr 6, 2020

See Bug 1625330 comment 2 for why this is likely worth doing.

I will test this in stage by setting BATCH_MAX_DELAY=10s and BIG_QUERY_OUTPUT_MODE=file_loads, and checking whether we hit OOM errors.

@relud relud requested a review from jklukas April 6, 2020 21:04

codecov-io commented Apr 8, 2020

Codecov Report

Merging #1233 into master will decrease coverage by 3.77%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #1233      +/-   ##
============================================
- Coverage     86.29%   82.52%   -3.78%     
+ Complexity      672      143     -529     
============================================
  Files            89       25      -64     
  Lines          3737      904    -2833     
  Branches        386      120     -266     
============================================
- Hits           3225      746    -2479     
+ Misses          370      113     -257     
+ Partials        142       45      -97     
Flag            | Coverage Δ                | Complexity Δ
#ingestion_beam | ?                         | ?
#ingestion_edge | ?                         | ?
#ingestion_sink | 82.52% <100.00%> (-0.22%) | 143.00 <0.00> (ø)

Impacted Files                                        | Coverage Δ                | Complexity Δ
...la/telemetry/ingestion/sink/config/SinkConfig.java | 89.78% <100.00%> (-0.76%) | 11.00 <0.00> (ø)
...ingestion/core/schema/SchemaNotFoundException.java | 0.00% <0.00%> (-100.00%)  | 0.00% <0.00%> (-2.00%)
...om/mozilla/telemetry/ingestion/core/util/Time.java | 54.54% <0.00%> (-40.91%)  | 1.00% <0.00%> (-4.00%)
...om/mozilla/telemetry/ingestion/core/util/Json.java | 60.97% <0.00%> (-9.76%)   | 20.00% <0.00%> (-4.00%)
...a/telemetry/ingestion/core/schema/SchemaStore.java | 75.34% <0.00%> (-6.85%)   | 22.00% <0.00%> (-3.00%)
...n/java/com/mozilla/telemetry/IpPrivacyDecoder.java |                           |
.../com/mozilla/telemetry/util/NoColonFileNaming.java |                           |
...lla/telemetry/republisher/RepublishPerDocType.java |                           |
...lla/telemetry/transforms/DecodePubsubMessages.java |                           |
...m/mozilla/telemetry/metrics/PerDocTypeCounter.java |                           |
... and 59 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8d07f9d...d53e0f4.


relud commented Apr 8, 2020

Update:

  • Added $JAVA_OPTS to the docker command, so that we can pass flags like -Xmx3584m in Kubernetes to ensure that Java will use more of the memory available to it.
  • Removed getDefaultMaxOutstandingElementCount and getDefaultMaxOutstandingRequestBytes from the GCS and BigQuery file-loads outputs, because the default limits are better for avoiding OOM errors.
  • Tested using BIG_QUERY_OUTPUT_MODE=file_loads instead of STREAMING_DOCTYPES=^$, because the latter wasn't working (still investigating why).
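For context, the $JAVA_OPTS pass-through described in the first bullet might look roughly like the sketch below; the entrypoint path and jar name here are hypothetical, not the actual files in this repo.

```shell
# Hypothetical docker-entrypoint.sh: word-split $JAVA_OPTS into the java
# command so heap flags can be injected at deploy time without a rebuild.
exec java $JAVA_OPTS -jar /app/ingestion-sink.jar

# In a Kubernetes pod spec, the heap cap would then be set via the
# container environment, e.g.:
#   env:
#   - name: JAVA_OPTS
#     value: "-Xmx3584m"
```

Word-splitting $JAVA_OPTS (unquoted) is what allows multiple flags to be passed in a single environment variable.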

With these changes autoscaling is working again, and we're no longer hitting OOM errors.

I am still seeing some unexpected null pointer exceptions, and I need to get BIG_QUERY_OUTPUT_MODE=mixed with STREAMING_DOCTYPES working.


relud commented Apr 8, 2020

> we're no longer hitting OOM errors.

As of 23:38 PDT we are hitting OOM errors again. I'm attempting to mitigate this by reducing the maximum outstanding message bytes.
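The mitigation above — capping outstanding message bytes — is a form of flow control: publishers block once too many bytes are in flight, which bounds heap usage. As an illustration only (ingestion-sink delegates this to its client library's flow-control settings; the class below is not from this repo), a byte-budget limiter can be sketched with a stdlib semaphore:

```java
import java.util.concurrent.Semaphore;

/** Illustrative byte-budget flow controller: blocks senders once the
 *  total bytes of in-flight messages reach a configured cap. */
public class ByteBudget {
    private final Semaphore permits;

    public ByteBudget(int maxOutstandingBytes) {
        // One permit per byte; acquire blocks when the budget is exhausted.
        this.permits = new Semaphore(maxOutstandingBytes);
    }

    /** Call before sending a message; blocks until budget is available. */
    public void acquire(int messageBytes) throws InterruptedException {
        permits.acquire(messageBytes);
    }

    /** Call when a message completes; returns its bytes to the budget. */
    public void release(int messageBytes) {
        permits.release(messageBytes);
    }

    public int availableBytes() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        ByteBudget budget = new ByteBudget(1000);
        budget.acquire(600);                         // 400 bytes left
        budget.acquire(300);                         // 100 bytes left
        System.out.println(budget.availableBytes()); // prints 100
        budget.release(600);
        System.out.println(budget.availableBytes()); // prints 700
    }
}
```

Lowering the cap trades throughput for a hard bound on buffered bytes, which is why a smaller value can prevent OOMs at the cost of slower publishing.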

@relud relud force-pushed the sink-file-loads-faster branch from 2a2f388 to d53e0f4 Compare April 9, 2020 22:36
@relud relud merged commit fd1eb6b into master Apr 9, 2020
@relud relud deleted the sink-file-loads-faster branch April 9, 2020 22:51