
Bug 1625330 - Reduce DEFAULT_BATCH_MAX_DELAY to avoid OOM errors #1233

Merged: relud merged 1 commit into master from sink-file-loads-faster on Apr 9, 2020

Conversation


@relud relud commented Apr 6, 2020

See Bug 1625330 comment 2 for why this is likely worth doing.

I will test this in stage by setting BATCH_MAX_DELAY=10s and BIG_QUERY_OUTPUT_MODE=file_loads, and checking whether we hit OOM errors.

@relud relud requested a review from jklukas April 6, 2020 21:04

codecov-io commented Apr 8, 2020

Codecov Report

Merging #1233 into master will decrease coverage by 3.77%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #1233      +/-   ##
============================================
- Coverage     86.29%   82.52%   -3.78%     
+ Complexity      672      143     -529     
============================================
  Files            89       25      -64     
  Lines          3737      904    -2833     
  Branches        386      120     -266     
============================================
- Hits           3225      746    -2479     
+ Misses          370      113     -257     
+ Partials        142       45      -97     
Flag            | Coverage Δ                | Complexity Δ
#ingestion_beam | ?                         | ?
#ingestion_edge | ?                         | ?
#ingestion_sink | 82.52% <100.00%> (-0.22%) | 143.00 <0.00> (ø)

Impacted Files                                        | Coverage Δ                | Complexity Δ
...la/telemetry/ingestion/sink/config/SinkConfig.java | 89.78% <100.00%> (-0.76%) | 11.00 <0.00> (ø)
...ingestion/core/schema/SchemaNotFoundException.java | 0.00% <0.00%> (-100.00%)  | 0.00% <0.00%> (-2.00%)
...om/mozilla/telemetry/ingestion/core/util/Time.java | 54.54% <0.00%> (-40.91%)  | 1.00% <0.00%> (-4.00%)
...om/mozilla/telemetry/ingestion/core/util/Json.java | 60.97% <0.00%> (-9.76%)   | 20.00% <0.00%> (-4.00%)
...a/telemetry/ingestion/core/schema/SchemaStore.java | 75.34% <0.00%> (-6.85%)   | 22.00% <0.00%> (-3.00%)
...n/java/com/mozilla/telemetry/IpPrivacyDecoder.java |                           |
.../com/mozilla/telemetry/util/NoColonFileNaming.java |                           |
...lla/telemetry/republisher/RepublishPerDocType.java |                           |
...lla/telemetry/transforms/DecodePubsubMessages.java |                           |
...m/mozilla/telemetry/metrics/PerDocTypeCounter.java |                           |
... and 59 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8d07f9d...d53e0f4.


relud commented Apr 8, 2020

Update:

  • Added $JAVA_OPTS to the docker command, so that we can pass flags like -Xmx3584m in Kubernetes to ensure that Java will use more of the memory available to it.
  • Removed getDefaultMaxOutstandingElementCount and getDefaultMaxOutstandingRequestBytes from the GCS and BigQuery file-loads outputs, because the default limits are better for avoiding OOM errors.
  • Tested using BIG_QUERY_OUTPUT_MODE=file_loads instead of STREAMING_DOCTYPES=^$, because the latter wasn't working (still investigating why).
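For context, the $JAVA_OPTS pass-through described in the first bullet might look roughly like the sketch below; the entrypoint path and jar name here are hypothetical, not the actual files in this repo.

```shell
# Hypothetical docker-entrypoint.sh: word-split $JAVA_OPTS into the java
# command so heap flags can be injected at deploy time without a rebuild.
exec java $JAVA_OPTS -jar /app/ingestion-sink.jar

# In a Kubernetes pod spec, the heap cap would then be set via the
# container environment, e.g.:
#   env:
#   - name: JAVA_OPTS
#     value: "-Xmx3584m"
```

Word-splitting $JAVA_OPTS (unquoted) is what allows multiple flags to be passed in a single environment variable.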

With these changes autoscaling is working again, and we're no longer hitting OOM errors.

I am still seeing some unexpected null pointer exceptions, and I need to get BIG_QUERY_OUTPUT_MODE=mixed with STREAMING_DOCTYPES working.


relud commented Apr 8, 2020

> we're no longer hitting OOM errors.

As of 23:38 PDT we are hitting OOM errors again. I'm attempting to mitigate this by reducing the maximum outstanding message bytes.
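The mitigation above — capping outstanding message bytes — is a form of flow control: publishers block once too many bytes are in flight, which bounds heap usage. As an illustration only (ingestion-sink delegates this to its client library's flow-control settings; the class below is not from this repo), a byte-budget limiter can be sketched with a stdlib semaphore:

```java
import java.util.concurrent.Semaphore;

/** Illustrative byte-budget flow controller: blocks senders once the
 *  total bytes of in-flight messages reach a configured cap. */
public class ByteBudget {
    private final Semaphore permits;

    public ByteBudget(int maxOutstandingBytes) {
        // One permit per byte; acquire blocks when the budget is exhausted.
        this.permits = new Semaphore(maxOutstandingBytes);
    }

    /** Call before sending a message; blocks until budget is available. */
    public void acquire(int messageBytes) throws InterruptedException {
        permits.acquire(messageBytes);
    }

    /** Call when a message completes; returns its bytes to the budget. */
    public void release(int messageBytes) {
        permits.release(messageBytes);
    }

    public int availableBytes() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        ByteBudget budget = new ByteBudget(1000);
        budget.acquire(600);                         // 400 bytes left
        budget.acquire(300);                         // 100 bytes left
        System.out.println(budget.availableBytes()); // prints 100
        budget.release(600);
        System.out.println(budget.availableBytes()); // prints 700
    }
}
```

Lowering the cap trades throughput for a hard bound on buffered bytes, which is why a smaller value can prevent OOMs at the cost of slower publishing.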

@relud relud force-pushed the sink-file-loads-faster branch from 2a2f388 to d53e0f4 Compare April 9, 2020 22:36
@relud relud merged commit fd1eb6b into master Apr 9, 2020
@relud relud deleted the sink-file-loads-faster branch April 9, 2020 22:51