S3 sink batch size stuck at 2.4 MB sized files #21696

Open
ElementTech opened this issue Nov 4, 2024 · 5 comments

Labels
type: bug A code related bug.

Comments


ElementTech commented Nov 4, 2024

Problem

I have Vector installed in Kubernetes on AWS. I am using SQS as a source and S3 as a sink. No matter how high I set the batching and buffer parameters, at peak event ingestion my S3 bucket receives objects of exactly 2.4 MB. When an event spike ends, the remaining events are flushed in smaller files until the backlog is drained.

(Screenshot: 2024-11-04 at 15:35:56)

Configuration

api:
  address: 127.0.0.1:8686
  enabled: true
  playground: true
data_dir: /var/lib/vector
expire_metrics_secs: 60
log_schema:
  timestamp_key: date
sinks:
  dropped:
    acknowledgements:
      enabled: true
    batch:
      max_bytes: 10000000
      max_events: 10000000
      timeout_secs: 60
    bucket: xxx-vector-xxx-s3
    buffer:
      max_size: 1073741824
      type: disk
    compression: none
    encoding:
      codec: json
      timestamp_format: unix_ms
    filename_append_uuid: true
    filename_extension: json
    filename_time_format: '%s'
    framing:
      method: newline_delimited
    healthcheck:
      enabled: false
    inputs:
    - remap_sources.dropped
    key_prefix: vector_dropped/year=%Y/month=%m/day=%d/
    region: us-east-1
    server_side_encryption: AES256
    storage_class: ONEZONE_IA
    type: aws_s3
  prometheus_exporter:
    address: 0.0.0.0:9598
    default_namespace: service
    inputs:
    - vector_metrics
    type: prometheus_exporter
  s3_export:
    acknowledgements:
      enabled: true
    batch:
      max_bytes: 10000000
      max_events: 10000000
      timeout_secs: 60
    bucket: dev-vector-analytics-s3
    buffer:
      max_size: 1073741824
      type: disk
    compression: none
    encoding:
      codec: json
      timestamp_format: unix_ms
    filename_append_uuid: true
    filename_extension: json
    filename_time_format: '%s'
    framing:
      method: newline_delimited
    healthcheck:
      enabled: false
    inputs:
    - route_events.passed_events
    key_prefix: playable_events_valid/year=%Y/month=%m/day=%d/
    region: us-east-1
    server_side_encryption: AES256
    storage_class: STANDARD
    type: aws_s3
  s3_export_failed:
    acknowledgements:
      enabled: true
    batch:
      max_bytes: 10000000
      max_events: 10000000
      timeout_secs: 60
    bucket: xxx-vector-xxx-s3
    buffer:
      max_size: 1073741824
      type: disk
    compression: none
    encoding:
      codec: json
      timestamp_format: unix_ms
    filename_append_uuid: true
    filename_extension: json
    filename_time_format: '%s'
    framing:
      method: newline_delimited
    healthcheck:
      enabled: false
    inputs:
    - route_events.failed_events
    key_prefix: playable_events_failed/year=%Y/month=%m/day=%d/
    region: us-east-1
    server_side_encryption: AES256
    storage_class: STANDARD_IA
    type: aws_s3
sources:
  dlq_data:
    queue_url: https://sqs.us-east-1.amazonaws.com/xxxx/dev-analytics-ingestion-dlq
    region: us-east-1
    type: aws_sqs
  offline_data:
    acknowledgements:
      enabled: true
    framing:
      method: newline_delimited
    region: us-east-1
    sqs:
      queue_url: https://sqs.us-east-1.amazonaws.com/xxxx/dev-vector-s3-offline-source
    type: aws_s3
  realtime_data:
    queue_url: https://sqs.us-east-1.amazonaws.com/xxxx/dev-vector-realtime-source
    region: us-east-1
    type: aws_sqs
  vector_logs:
    type: internal_logs
  vector_metrics:
    type: internal_metrics
transforms:
  add_date:
    inputs:
    - ignore_heartbeat
    source: .date = from_unix_timestamp!(.timestamp, "milliseconds")
    type: remap
  ignore_heartbeat:
    condition:
      source: .event_type != "heartbeat"
      type: vrl
    inputs:
    - remap_sources
    type: filter
  logs:
    condition:
      source: '!includes(["INFO", "DEBUG"], .metadata.level)'
      type: vrl
    inputs:
    - vector_logs
    type: filter
  remap_sources:
    drop_on_abort: true
    drop_on_error: true
    inputs:
    - offline_data
    - realtime_data
    - dlq_data
    reroute_dropped: true
    source: . = parse_json!(.message)
    type: remap
  route_events:
    inputs:
    - add_date
    reroute_unmatched: false
    route:
      failed_events: .status != 200
      passed_events: .status == 200
    type: route

Version

0.42.0-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

I have two environments. The only difference between them is the batch.timeout_secs parameter: it is set to 60 in my dev environment and 1800 in production. The exact same issue (2.4 MB files) happens in both.

References

No response

ElementTech added the type: bug (A code related bug.) label on Nov 4, 2024
jszwedko (Member) commented Nov 4, 2024

Hi @ElementTech! Just to make sure I understand: you tried increasing batch.max_bytes and batch.max_events but saw no difference?

Note that batch.max_bytes may not always map to the resulting object size, because the way event sizes are calculated can differ from the serialized size of the batch (see #10020).
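For illustration only, a minimal sketch (an assumption on my part, not a confirmed fix) of oversizing the byte limit relative to the desired object size, since the limit is compared against the in-memory size estimate rather than the serialized output; only the relevant keys of the reporter's s3_export sink are shown:

sinks:
  s3_export:
    type: aws_s3
    bucket: dev-vector-analytics-s3    # from the configuration above
    batch:
      max_bytes: 268435456             # illustrative ~256 MiB in-memory estimate
      max_events: 10000000
      timeout_secs: 60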

ElementTech (Author) commented

Hey @jszwedko, yes, I've played around with those values both up and down. Just for the sake of testing, as you can see, I've set all of the numbers extremely high, but there is still no difference in behavior.

I should also note that each of those .json files contains roughly 3,500 events (single-level dictionaries), though not exactly 3,500; it can deviate by a few hundred. I assume that whatever decides when to write the object stops at a certain size rather than a certain event count. Also, of course, not all events are the same size.

I might be wrong, but I'm also using a disk buffer rather than a memory buffer when collecting the events, and even if it were memory, should the resulting batch size still be this comparatively small?

Thanks!

jszwedko (Member) commented

Apologies for the delayed response!

Hey @jszwedko, yes, I've played around with those values both up and down. Just for the sake of testing, as you can see, I've set all of the numbers extremely high, but there is still no difference in behavior.

Gotcha, that is interesting.

I should also note that each of those .json files contains roughly 3,500 events (single-level dictionaries), though not exactly 3,500; it can deviate by a few hundred. I assume that whatever decides when to write the object stops at a certain size rather than a certain event count. Also, of course, not all events are the same size.

One shot in the dark: can you try setting filename_time_format to ""? I believe that suffix is only added when writing the batch, but I could be wrong, and it may actually be involved in partitioning events such that each object ends up representing roughly one second's worth of events.
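For reference, a minimal sketch of that suggestion applied to the reporter's s3_export sink (only the relevant keys shown):

sinks:
  s3_export:
    type: aws_s3
    bucket: dev-vector-analytics-s3
    key_prefix: playable_events_valid/year=%Y/month=%m/day=%d/
    filename_time_format: ""           # drop the per-object time suffix
    filename_append_uuid: true         # the UUID still keeps object keys unique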

I might be wrong, but I'm also using a disk buffer rather than a memory buffer when collecting the events, and even if it were memory, should the resulting batch size still be this comparatively small?

In Vector's architecture, buffers sit in front of sinks, so from the sink's perspective it makes no difference whether the fronting buffer is memory or disk; the buffer is transparent to the sink.
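For completeness, a sketch of what the equivalent memory buffer would look like on the same sink, if you wanted to rule the disk buffer out experimentally (the max_events value is illustrative):

sinks:
  s3_export:
    buffer:
      type: memory
      max_events: 100000               # illustrative; sized to cover a spike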

ParsonsProjects commented

@jszwedko could this be related to issue #3829, where batching to S3 seems to be a little fragile and doesn't appear to produce the expected output?

jszwedko (Member) commented Jan 6, 2025

@jszwedko could this be related to issue #3829, where batching to S3 seems to be a little fragile and doesn't appear to produce the expected output?

I don't think that issue is related. It is more about incrementally uploading batches when large batches are configured, to avoid having to hold an entire large batch in memory until it is sent (and thus bloating Vector's memory use).
