s3_output: add parquet format support #11312
Conversation
📝 Walkthrough

Adds a Parquet encoder (C++), Arrow/Parquet CMake detection, Snappy framed compression, and a large refactor of the S3 output: multipart lifecycle, presigned auth, upload queue/blob DB, streaming/store, SDS buffer migration, S3 key/timezone enhancements, and extensive unit/runtime tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant Core as Fluent Bit Core
    participant Queue as S3 Queue
    participant Blob as Blob DB
    participant Stream as Stream Processor
    participant Auth as Auth Service
    participant S3 as AWS S3
    App->>Core: emit logs
    Core->>Queue: buffer chunk / enqueue file
    Queue->>Stream: format chunk (JSON or Parquet)
    Stream->>Blob: persist metadata / register parts
    Blob->>Queue: return file/part entries
    Queue->>Auth: request presigned URL (CREATE)
    Auth->>S3: fetch presigned CREATE URL
    S3-->>Auth: return UploadId
    loop per-part
        Queue->>Stream: produce/compress part body
        Queue->>Auth: request presigned URL (UPLOAD_PART)
        Auth->>S3: fetch presigned PUT for part
        S3-->>Auth: return ETag
        Queue->>Blob: mark part uploaded
    end
    Queue->>Auth: request presigned URL (COMPLETE)
    Auth->>S3: submit complete with ETags
    S3-->>Queue: CompleteMultipartUpload
    Queue->>Blob: notify delivery success
    Queue-->>Core: upload complete
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Force-pushed from b4c0a26 to a821fec
💡 Codex Review
Here are some automated review suggestions for this pull request.
Actionable comments posted: 7
🧹 Nitpick comments (10)
src/CMakeLists.txt (1)
525-528: Clarify SASL library linking scope.

The comments indicate "KAFKA SPECIFIC" for SASL library linking, but the linking appears in both the shared library and binary targets. This suggests it's a general requirement when FLB_SASL_ENABLED is true, not specific to this Parquet PR. Consider updating the comment to be more accurate:

🔎 Suggested comment improvement

```diff
-    # KAFKA SPECIFIC - Link SASL library for shared library
+    # Link SASL library when SASL support is enabled (used by Kafka and other plugins)
     if(FLB_SASL_ENABLED)
       target_link_libraries(fluent-bit-shared sasl2)
     endif()

-    # KAFKA SPECIFIC - Link SASL library for binary
+    # Link SASL library when SASL support is enabled (used by Kafka and other plugins)
     if(FLB_SASL_ENABLED)
       target_link_libraries(fluent-bit-bin sasl2)
     endif()
```

Also applies to: 600-603
examples/s3_output/run-s3-parquet-test.sh (1)
38-39: Add error checking for fluent-bit binary and configuration.

The script assumes the Fluent Bit binary and configuration file exist without verification, which could lead to confusing errors.

🔎 Proposed pre-flight checks

```diff
+# Verify fluent-bit binary exists
+if [ ! -f ./build/bin/fluent-bit ]; then
+    echo "Error: fluent-bit binary not found at ./build/bin/fluent-bit"
+    echo "Please build Fluent Bit first:"
+    echo "  cd ../../build && cmake .. && make"
+    exit 1
+fi
+
+# Verify configuration file exists
+if [ ! -f fluent-bit-s3-parquet.conf ]; then
+    echo "Error: Configuration file fluent-bit-s3-parquet.conf not found"
+    exit 1
+fi
+
+# Check AWS credentials
+if [ -z "$AWS_ACCESS_KEY_ID" ] || [ -z "$AWS_SECRET_ACCESS_KEY" ]; then
+    echo "Warning: AWS credentials not set. Fluent Bit may fail to upload to S3."
+    echo "Set credentials with: export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=..."
+    echo ""
+fi
+
 # Run Fluent Bit
 ./build/bin/fluent-bit -c fluent-bit-s3-parquet.conf
```

examples/s3_output/README.md (1)
112-114: Consider adding language identifiers to fenced code blocks.

Static analysis detected some fenced code blocks without language specifiers; adding a specifier improves syntax highlighting and readability.

🔎 Minor markdown improvements

````diff
-```
+```text
 Upload Trigger = upload_timeout reached OR total_file_size reached
```
````

````diff
-```
+```text
 s3://s3-bucket-kafka-sink/fluent-bit-logs/
 ├── putobject/app/day=YYYYMMDD/    # Case 1: Small files via PutObject
 └── multipart/access/day=YYYYMMDD/ # Case 2: Large files via Multipart Upload
```
````

````diff
-```
+```text
 [parquet] Data quality summary for 53206 records:
 [parquet] Missing non-nullable fields (defaults used):
 [parquet]   field='level' count=52506
 [parquet]   field='timestamp' count=52506
```
````

Also applies to: 177-179, 270-272

examples/s3_output/athena-queries.sql (2)

38-45: Use placeholder bucket names in example documentation.

The hardcoded bucket name s3-bucket-kafka-sink appears throughout the file. Consider using a placeholder like <your-bucket-name> or my-fluent-bit-bucket to make it clearer that users should substitute their own bucket name.

99-99: Document the placeholder date format.

The hardcoded date 20251224 is used in all example queries. Consider adding a note at the top of the query section (around line 83) explaining that users should replace this with their actual date, similar to the note at line 346.

cmake/parquet.cmake (2)

151-163: Consider allowing default paths as fallback.

Using NO_DEFAULT_PATH exclusively may miss Arrow/Parquet installations in standard system locations not explicitly listed. Consider removing NO_DEFAULT_PATH or adding a second search without it as a fallback to improve detection success on systems with non-standard installations.

🔎 Suggested approach

```diff
 find_path(ARROW_INCLUDE_DIR
   NAMES arrow/api.h
   PATHS ${SEARCH_PATHS}
   PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
-  NO_DEFAULT_PATH
 )
```

Or add a fallback search without NO_DEFAULT_PATH if the first search fails.
223-226: Unicode characters in CMake messages may cause issues.

The checkmark (✓) and cross (✗) Unicode characters in status messages may not render correctly on all terminals or CI systems. Consider using ASCII alternatives like [OK] and [MISSING].

🔎 Proposed fix

```diff
 if(ARROW_FOUND AND PARQUET_FOUND)
-  message(STATUS "✓ Arrow found: ${ARROW_LIBRARIES}")
+  message(STATUS "[OK] Arrow found: ${ARROW_LIBRARIES}")
   message(STATUS "  Include dirs: ${ARROW_INCLUDE_DIRS}")
-  message(STATUS "✓ Parquet found: ${PARQUET_LIBRARIES}")
+  message(STATUS "[OK] Parquet found: ${PARQUET_LIBRARIES}")
   message(STATUS "  Include dirs: ${PARQUET_INCLUDE_DIRS}")
 else()
-  message(STATUS "✗ Arrow/Parquet not found")
+  message(STATUS "[MISSING] Arrow/Parquet not found")
```

examples/s3_output/fluent-bit-s3-parquet.conf (1)
145-145: Very long schema_str line reduces readability.

The schema_str value is a single long JSON line that's difficult to read and maintain. Consider adding a comment above it with a formatted version of the schema for documentation purposes, or noting in the README that multi-line values would need escaping.
250-251: Appending nullptr to BinaryBuilder may cause undefined behavior.

The append_default_value for the BINARY type passes nullptr with size 0. While this may work, it's safer to use an empty byte array explicitly.

🔎 Proposed fix

```diff
 case arrow::Type::BINARY:
-    return static_cast<arrow::BinaryBuilder*>(builder)->Append(static_cast<const uint8_t*>(nullptr), 0).ok();
+    return static_cast<arrow::BinaryBuilder*>(builder)->Append(reinterpret_cast<const uint8_t*>(""), 0).ok();
```

plugins/out_s3/s3.c (1)
49-56: Duplicate forward declarations.

These forward declarations at lines 49-56 duplicate those at lines 86-91. Remove the duplicates to avoid confusion.

🔎 Proposed fix

```diff
-/* Forward declarations */
-static struct multipart_upload *get_upload(struct flb_s3 *ctx,
-                                           const char *tag, int tag_len);
-static struct multipart_upload *create_upload(struct flb_s3 *ctx,
-                                              const char *tag, int tag_len,
-                                              time_t file_first_log_time);
-static flb_sds_t flb_pack_msgpack_extract_log_key(void *out_context, const char *data,
-                                                  uint64_t bytes, struct flb_config *config);
-
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- .github/workflows/unit-tests.yaml
- CMakeLists.txt
- cmake/parquet.cmake
- examples/s3_output/README.md
- examples/s3_output/athena-queries.sql
- examples/s3_output/fluent-bit-s3-parquet.conf
- examples/s3_output/run-s3-parquet-test.sh
- include/fluent-bit/aws/flb_aws_compress.h
- include/fluent-bit/flb_parquet.h
- plugins/out_s3/s3.c
- plugins/out_s3/s3.h
- src/CMakeLists.txt
- src/aws/CMakeLists.txt
- src/aws/compression/CMakeLists.txt
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/flb_aws_compress.c
- src/flb_parquet_impl.cpp
💤 Files with no reviewable changes (5)
- src/aws/compression/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/CMakeLists.txt
- src/aws/compression/arrow/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.
Applied to files:
examples/s3_output/README.md
include/fluent-bit/flb_parquet.h
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.
Applied to files:
.github/workflows/unit-tests.yaml
cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
cmake/parquet.cmake
CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
cmake/parquet.cmake
CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
src/aws/flb_aws_compress.c
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
src/aws/flb_aws_compress.c
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
src/aws/flb_aws_compress.c
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
src/aws/flb_aws_compress.c
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
src/aws/flb_aws_compress.c
CMakeLists.txt
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.
Applied to files:
.github/workflows/unit-tests.yaml
src/CMakeLists.txt
CMakeLists.txt
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.
Applied to files:
src/CMakeLists.txt
src/aws/flb_aws_compress.c
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.
Applied to files:
plugins/out_s3/s3.c
🧬 Code graph analysis (3)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
flb_free(126-128)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
flb_gzip_compress (157-252)
src/flb_zstd.c (1)
flb_zstd_compress(33-57)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
s3_put_object(1663-1800)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c
[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter
(syntaxError)
🪛 LanguageTool
examples/s3_output/README.md
[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md
112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
270-270: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (14)
include/fluent-bit/flb_parquet.h (1)
82-85: LGTM! Well-documented API with clear ownership semantics.

The function signature and documentation are comprehensive and clear. The API design properly communicates:
- Memory ownership (caller must free with flb_free)
- Error handling (NULL on failure)
- Schema format expectations (Arrow JSON schema)
- Supported types and examples
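To make the ownership contract concrete, here is a minimal caller sketch (not taken from the PR). It assumes the signature quoted elsewhere in this review, `void *flb_msgpack_raw_to_parquet(const void *in_buf, size_t in_size, const char *schema_str, int compression, size_t *out_size)`; the schema literal, its field/type spellings, and the compression value 0 are illustrative assumptions only.

```cpp
// Hypothetical caller sketch; schema field names and type spellings are assumptions.
#include <cstddef>
#include <cstdio>

#include <fluent-bit/flb_mem.h>      /* flb_free() */
#include <fluent-bit/flb_parquet.h>  /* flb_msgpack_raw_to_parquet() */

static int encode_chunk_to_parquet(const void *msgpack_buf, size_t msgpack_size)
{
    /* Arrow-style JSON schema; the exact "fields" layout should be taken
     * from the flb_parquet.h documentation, this literal is illustrative. */
    const char *schema_str =
        "{\"fields\":[{\"name\":\"log\",\"type\":\"utf8\",\"nullable\":true}]}";
    size_t out_size = 0;

    /* 0 is assumed here to mean "no internal Parquet compression". */
    void *parquet_buf = flb_msgpack_raw_to_parquet(msgpack_buf, msgpack_size,
                                                   schema_str, 0, &out_size);
    if (parquet_buf == NULL) {
        return -1;                   /* conversion failed, nothing to free */
    }

    fprintf(stderr, "parquet buffer: %zu bytes\n", out_size);

    flb_free(parquet_buf);           /* caller owns the buffer, per the API contract */
    return 0;
}
```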
include/fluent-bit/aws/flb_aws_compress.h (2)
33-46: Clear deprecation notice and migration guidance.

The documentation clearly separates true compression algorithms from format converters and provides excellent guidance:
- ARROW marked as deprecated with rationale
- PARQUET (5) directs users to use format=parquet instead
- Clear distinction between compression and format conversion
This will help users migrate away from the deprecated patterns.
30-31: No backward compatibility issue: this renumbering is safe.

The compression constants are internal implementation details used only at runtime. Configuration files use string values (compression=gzip, compression=snappy, compression=zstd) which are mapped to numeric constants via flb_aws_compression_get_type(). Since the numeric constant values are never persisted to storage and users never interact with them directly, renumbering from FLB_AWS_COMPRESS_ZSTD=4 to FLB_AWS_COMPRESS_ZSTD=3 does not break backward compatibility.

Likely an incorrect or invalid review comment.
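For reference, a small sketch of the string-to-constant mapping described above, as it typically looks at plugin init time. It relies only on the functions and constants named in this review (flb_output_get_property(), flb_aws_compression_get_type(), FLB_AWS_COMPRESS_NONE); the helper itself is hypothetical, not code from the PR.

```cpp
// Sketch only: configuration strings are resolved to numeric constants at init.
#include <fluent-bit/flb_output.h>            /* flb_output_get_property() */
#include <fluent-bit/aws/flb_aws_compress.h>  /* flb_aws_compression_get_type() */

/* hypothetical init helper; 'ins' is the plugin instance, 'compression' an int slot */
static int resolve_compression(struct flb_output_instance *ins, int *compression)
{
    const char *tmp;

    tmp = flb_output_get_property("compression", ins);
    if (tmp == NULL) {
        *compression = FLB_AWS_COMPRESS_NONE;   /* default: no compression */
        return 0;
    }

    /* "gzip" / "snappy" / "zstd" -> numeric constant, -1 when unknown */
    *compression = flb_aws_compression_get_type(tmp);
    if (*compression == -1) {
        return -1;                              /* reject unknown keywords */
    }
    return 0;
}
```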
CMakeLists.txt (1)
284-286: LGTM! Consistent encoder enablement pattern.

When FLB_ALL is enabled, both Avro and Parquet encoders are automatically enabled, following the existing pattern for optional features. This provides a consistent user experience.

src/CMakeLists.txt (1)
157-166: LGTM! Proper C++ integration for Parquet implementation.

The Parquet encoder support is correctly integrated:
- Source file added conditionally based on FLB_PARQUET_ENCODER
- Explicit CXX language property set for the C++ implementation file
- Follows the existing pattern for optional features like Avro encoder
src/aws/flb_aws_compress.c (2)
31-37: LGTM! Correct adapter pattern for Snappy compression.

The wrapper function properly adapts flb_snappy_compress (which uses char* pointers) to the standard compression interface (which uses void* pointers). This follows the same pattern as the existing flb_gzip_compress and flb_zstd_compress functions shown in the relevant code snippets.
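A minimal sketch of that adapter shape, assuming the char*-based flb_snappy_compress() in/out signature this comment describes; this is not the PR's wrapper, just the pattern it follows.

```cpp
// Sketch: adapt a char*-based compressor to the void*-based table interface.
#include <cstddef>
#include <fluent-bit/flb_snappy.h>   /* flb_snappy_compress(): char* in/out, per the review */

/* Matches the (void *in, size_t, void **out, size_t *) shape used by the
 * compression_options table for gzip/zstd, as described above. */
static int aws_compress_snappy_wrapper(void *in_data, size_t in_len,
                                       void **out_data, size_t *out_len)
{
    return flb_snappy_compress(static_cast<char *>(in_data), in_len,
                               reinterpret_cast<char **>(out_data), out_len);
}
```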
46-76: Excellent documentation and clean separation of concerns.

The updated comments clearly distinguish:
- True compression algorithms (gzip, snappy, zstd)
- Deprecated format converters (ARROW, PARQUET)
- Guidance to use format=parquet instead of compression=parquet

The removal of Arrow/Parquet compression options from this module aligns with the PR's goal of decoupling format conversion from the compression subsystem.
examples/s3_output/README.md (1)
1-518: Exceptional documentation! Comprehensive and production-ready guidance.

This README provides outstanding documentation that covers:
- Clear test scenarios with realistic use cases
- Detailed explanation of upload triggers (upload_timeout vs total_file_size)
- Schema design best practices with nullable/non-nullable guidance
- Data quality monitoring and troubleshooting
- Performance testing methodology
- Athena integration examples
The documentation will significantly help users understand Parquet encoder usage in production scenarios.
plugins/out_s3/s3.h (1)
234-235: LGTM - API signature updated for SDS-based buffers.

The signature change from
char* bodytoflb_sds_t bodyaligns with the SDS-based buffering approach used throughout the refactored code, ensuring consistent memory management.examples/s3_output/fluent-bit-s3-parquet.conf (1)
104-117: LGTM - Well-documented PutObject configuration.The configuration clearly documents the upload strategy, timeout behavior, and use case. The comments explaining the difference between
upload_timeoutandtotal_file_sizeare helpful for users.src/flb_parquet_impl.cpp (1)
696-710: LGTM - Exception handling with proper cleanup.

The exception handlers correctly destroy the msgpack unpacked result and free the output buffer if allocated, preventing memory leaks on error paths.
plugins/out_s3/s3.c (3)
752-774: LGTM - Deprecated compression values handled gracefully.

The code properly warns users about deprecated
arrow/parquetcompression values while still supporting them by mapping to the newformat=parquetapproach. The build-time guard ensures helpful error messages when Parquet isn't available.
1583-1630: LGTM - Parquet conversion with proper error handling.

The Parquet conversion path validates
schema_stris present, handles conversion failures gracefully, and properly manages memory by converting the Parquet buffer toflb_sds_tfor consistent lifecycle management.
3951-3962: Storing raw msgpack improves efficiency for Parquet.

Storing raw msgpack data and deferring format conversion to upload time is the right approach for columnar formats like Parquet, which benefit from batching records together for better compression and schema inference.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)
3701-3725: Leak of chunk SDS in unit_test_flush() path.

In unit_test_flush() you now treat chunk as flb_sds_t, but never destroy it:

```c
s3_store_buffer_put(..., chunk, ...);
ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
...
ret = upload_data(..., buffer, ...);
flb_sds_destroy(buffer);
FLB_OUTPUT_RETURN(ret);
```

construct_request_buffer() only reads from chunk and does not take ownership, so unit_test_flush() should flb_sds_destroy(chunk) once that call returns. As written, every test-mode flush leaks the SDS buffer.

Suggested minimal fix

```diff
 static void unit_test_flush(void *out_context, struct s3_file *upload_file,
                             const char *tag, int tag_len,
                             flb_sds_t chunk, int chunk_size,
                             struct multipart_upload *m_upload_file,
                             time_t file_first_log_time)
 {
     int ret;
     flb_sds_t buffer = NULL;
     size_t buffer_size;
     struct flb_s3 *ctx = out_context;

     s3_store_buffer_put(ctx, upload_file, tag, tag_len,
                         chunk, (size_t) chunk_size, file_first_log_time);

-    ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
+    ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
+    flb_sds_destroy(chunk);
     if (ret < 0) {
         flb_plg_error(ctx->ins, "Could not construct request buffer for %s",
                       upload_file->file_path);
         FLB_OUTPUT_RETURN(FLB_RETRY);
     }

     ret = upload_data(ctx, upload_file, m_upload_file, buffer, buffer_size,
                       tag, tag_len);
     flb_sds_destroy(buffer);
     FLB_OUTPUT_RETURN(ret);
 }
```
♻️ Duplicate comments (7)
examples/s3_output/run-s3-parquet-test.sh (1)
6-16: Logic issue: directories created then immediately cleaned.

The script creates directories (lines 8-11) and then immediately removes their contents (line 15). This sequence is confusing and redundant. If the intent is to ensure clean directories exist, reverse the order or use
mkdir -palone (which is idempotent).🔎 Proposed fix
# Create temporary directories STORE_DIR="/tmp/fluent-bit-test" + +# Clean old data first (if exists) +if [ -d "$STORE_DIR" ]; then + rm -rf "$STORE_DIR" +fi + +# Create fresh directories mkdir -p "$STORE_DIR/parquet-gzip-simple" mkdir -p "$STORE_DIR/json-snappy-complex-low" mkdir -p "$STORE_DIR/json-gzip-complex-medium" mkdir -p "$STORE_DIR/json-zstd-complex-high" echo "✓ Created store directories" - -# Clean old data -rm -rf "$STORE_DIR"/*/ -echo "✓ Cleaned old data"This also addresses the safety concern from the previous review about
rm -rfon a potentially unset variable, by checking directory existence first.plugins/out_s3/s3.h (1)
123-124: Unused apply_outer_compression field in struct flb_s3.
apply_outer_compressionis not referenced anywhere inplugins/out_s3/s3.c(compression behavior is driven byctx->format != FLB_S3_FORMAT_PARQUETinstead), so this field is dead state and adds confusion. Consider removing it or wiring it into the compression path; otherwise future readers might assume it actually controls outer compression.CMakeLists.txt (1)
202-202: Update Parquet C++ standard from C++11 to C++17.

The Parquet encoder block enables C++ and forces
CMAKE_CXX_STANDARD 11, but current Apache Arrow/Parquet C++ releases require at least C++17. Keeping this at 11 will either fail the build or subtly misconfigure Arrow compilation whenFLB_PARQUET_ENCODERis ON. Please bump this to 17 (and consider aligning the earlier simdutfCMAKE_CXX_STANDARDas well).You should double‑check against the Arrow/Parquet version you target (their build docs list the minimum C++ standard) and confirm that setting
CMAKE_CXX_STANDARD 17is required whenFLB_PARQUET_ENCODERis enabled.Also applies to: 278-287, 1311-1316
plugins/out_s3/s3.c (1)
4210-4217: compression config map still writes a string into an int field.

The
compressionentry is declared asFLB_CONFIG_MAP_STRbut its offset points tostruct flb_s3::compression(anint):{ FLB_CONFIG_MAP_STR, "compression", NULL, 0, FLB_TRUE, offsetof(struct flb_s3, compression), ... },During
flb_output_config_map_set(), this causes the parser to store achar *into anintslot, corruptingcompressionand adjacent fields. You now parsecompressionexplicitly viaflb_output_get_property()incb_s3_init(), so this map binding is both unsafe and unnecessary.Recommend either:
- Removing the struct offset (
FLB_FALSE, 0) and treatingcompressionas a pure property, or- Introducing a
char *compression_strfield and mapping the string there.src/flb_parquet_impl.cpp (3)
138-151: Add overflow handling for uint64 → int64 inconvert_to_int64().
MSGPACK_OBJECT_POSITIVE_INTEGERvalues are cast directly fromuint64_ttoint64_t:if (obj->type == MSGPACK_OBJECT_POSITIVE_INTEGER) { return builder->Append(static_cast<int64_t>(obj->via.u64)); }Values greater than
INT64_MAXwill silently wrap, corrupting data in the Parquet file. This mirrors the earlierint32bug you already guarded against.You should add an overflow check similar to
convert_to_int32()and either clamp, append a default, or return an error so the caller can account for it.Please confirm how Arrow/Parquet is expected to handle out‑of‑range
uint64values (e.g., whether to clamp, treat as error, or promote to unsigned logical type) based on the versions you target, and align this converter with that behavior.
213-226: Timestamp conversion ignores schema time unit.
convert_to_timestamp()always treats the input as seconds and multiplies by 1,000,000 to get microseconds:return builder->Append(static_cast<int64_t>(obj->via.u64) * 1000000);However, when parsing the schema you support
"unit": "s"|"ms"|"us"|"ns"and buildarrow::timestamp(time_unit)accordingly. The converter never looks at that unit, so a schema declaringmsornswill still receive values scaled as if they were seconds→µs.You likely want to:
- Downcast
type(or the builder’s type) toarrow::TimestampType,- Inspect its
unit(),- Apply the appropriate multiplier/divider so the stored values actually match the declared unit.
Double‑check the Arrow
TimestampTypesemantics for units (SECOND,MILLI,MICRO,NANO) and ensure the scaling here matches how your upstream logs encode timestamps (seconds vs ms vs ns).Also applies to: 386-424, 505-555
301-321: Storingmsgpack_objectinstances acrossmsgpack_unpack_next()iterations may rely on msgpack zone semantics.You collect
msgpack_objectinstances into astd::vectorand process them after the unpack loop:std::vector<msgpack_object> records; ... while (msgpack_unpack_next(&result, ...) == MSGPACK_UNPACK_SUCCESS) { ... records.push_back(array.ptr[i]); // or result.data } ... for (auto &record : records) { ... }These
msgpack_objects contain pointers into themsgpack_unpackedzone. This is fine only ifmsgpack_unpack_next()never invalidates existing zone allocations while reusing the samemsgpack_unpackedacross iterations; some msgpack documentation/patterns assume you consume each object immediately.If the zone is reused or compacted between iterations,
recordscould end up with dangling pointers once more data is unpacked.Please verify against the msgpack‑c docs for
msgpack_unpackedandmsgpack_unpack_next()whether it is safe to keepmsgpack_objectcopies after subsequentmsgpack_unpack_next()calls, or whether you must either:
- Process each record inside the loop, or
- Deep‑copy the record into your own storage (including map/array/string contents) before the next unpack.
Also applies to: 474-555
🧹 Nitpick comments (9)
examples/s3_output/run-s3-parquet-test.sh (1)
38-39: Relative paths may cause execution failures.The script uses relative paths for both the binary (
./build/bin/fluent-bit) and config file (fluent-bit-s3-parquet.conf). This will fail unless run from the repository root.Consider adding a working directory check or using
$SCRIPT_DIRto make paths relative to the script location.🔎 Proposed improvement
+# Get script directory for relative paths +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" + # Run Fluent Bit -./build/bin/fluent-bit -c fluent-bit-s3-parquet.conf +"$REPO_ROOT/build/bin/fluent-bit" -c "$SCRIPT_DIR/fluent-bit-s3-parquet.conf"examples/s3_output/athena-queries.sql (2)
38-44: Hardcoded bucket name - consider adding placeholder instructions.The S3 bucket
s3-bucket-kafka-sinkis hardcoded throughout. While acceptable for examples, consider adding a comment at the top noting that users should replace this with their own bucket name.
99-101: Hardcoded date filter in example queries.All queries use
day='20251224'which matches "today's date" per the context. For a reusable example, consider using a dynamic date or adding a note about updating this value.examples/s3_output/README.md (3)
112-114: Add language specifier to fenced code block.Per markdownlint, fenced code blocks should have a language specified. This block shows a logical expression.
🔎 Proposed fix
-``` +```text Upload Trigger = upload_timeout reached OR total_file_size reached</details> --- `177-181`: **Add language specifier to fenced code block.** Per markdownlint, fenced code blocks should have a language specified for the directory tree. <details> <summary>🔎 Proposed fix</summary> ```diff -``` +```text s3://s3-bucket-kafka-sink/fluent-bit-logs/ ├── putobject/app/day=YYYYMMDD/ # Case 1: Small files via PutObject └── multipart/access/day=YYYYMMDD/ # Case 2: Large files via Multipart Upload</details> --- `270-275`: **Add language specifier to fenced code block.** Per markdownlint, fenced code blocks should have a language specified for log output examples. <details> <summary>🔎 Proposed fix</summary> ```diff -``` +```text [parquet] Data quality summary for 53206 records: [parquet] Missing non-nullable fields (defaults used): [parquet] field='level' count=52506 [parquet] field='timestamp' count=52506</details> </blockquote></details> <details> <summary>cmake/parquet.cmake (1)</summary><blockquote> `150-163`: **Consider adding default path fallback for Arrow discovery.** The `NO_DEFAULT_PATH` option prevents CMake from searching standard system paths. This might miss Arrow installations in non-standard but common locations not listed in `SEARCH_PATHS`. Consider removing `NO_DEFAULT_PATH` or adding a secondary search without it as a fallback. <details> <summary>🔎 Proposed improvement</summary> ```diff find_path(ARROW_INCLUDE_DIR NAMES arrow/api.h PATHS ${SEARCH_PATHS} PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow - NO_DEFAULT_PATH ) find_library(ARROW_LIBRARY NAMES ${ARROW_LIB_NAMES} PATHS ${SEARCH_PATHS} PATH_SUFFIXES ${LIB_SUFFIXES} - NO_DEFAULT_PATH )Alternatively, keep
NO_DEFAULT_PATHfor the first search and add a fallback search without it if the first fails.include/fluent-bit/flb_parquet.h (2)
23-25: Consider removing unused include.The header includes
<fluent-bit/flb_sds.h>but the function signature doesn't useflb_sds_t. If this is not used elsewhere in the implementation exposed by this header, consider removing it to reduce compilation dependencies.#!/bin/bash # Check if flb_sds types are used in the parquet implementation or header rg -n "flb_sds" src/flb_parquet_impl.cpp include/fluent-bit/flb_parquet.h
82-85: Missing extern "C" guard for C++ compatibility.Since this header will be included by C++ code (
flb_parquet_impl.cpp), consider addingextern "C"guards to ensure proper C linkage when compiled as C++.🔎 Proposed fix
+#ifdef __cplusplus +extern "C" { +#endif + void *flb_msgpack_raw_to_parquet(const void *in_buf, size_t in_size, const char *schema_str, int compression, size_t *out_size); +#ifdef __cplusplus +} +#endif + #endif
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- .github/workflows/unit-tests.yaml
- CMakeLists.txt
- cmake/parquet.cmake
- examples/s3_output/README.md
- examples/s3_output/athena-queries.sql
- examples/s3_output/fluent-bit-s3-parquet.conf
- examples/s3_output/run-s3-parquet-test.sh
- include/fluent-bit/aws/flb_aws_compress.h
- include/fluent-bit/flb_parquet.h
- plugins/out_s3/s3.c
- plugins/out_s3/s3.h
- src/CMakeLists.txt
- src/aws/CMakeLists.txt
- src/aws/compression/CMakeLists.txt
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/flb_aws_compress.c
- src/flb_parquet_impl.cpp
💤 Files with no reviewable changes (5)
- src/aws/compression/arrow/compress.h
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/CMakeLists.txt
- src/aws/CMakeLists.txt
- src/aws/compression/arrow/compress.c
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/unit-tests.yaml
- examples/s3_output/fluent-bit-s3-parquet.conf
🧰 Additional context used
🧠 Learnings (15)
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/aws/flb_aws_compress.c
src/CMakeLists.txt
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Applied to files:
src/aws/flb_aws_compress.c
src/CMakeLists.txt
CMakeLists.txt
src/flb_parquet_impl.cpp
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.
Applied to files:
src/aws/flb_aws_compress.c
src/CMakeLists.txt
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.
Applied to files:
src/aws/flb_aws_compress.c
src/CMakeLists.txt
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.
Applied to files:
src/aws/flb_aws_compress.c
src/CMakeLists.txt
CMakeLists.txt
plugins/out_s3/s3.c
include/fluent-bit/aws/flb_aws_compress.h
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.
Applied to files:
src/aws/flb_aws_compress.c
src/CMakeLists.txt
CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/CMakeLists.txt
CMakeLists.txt
cmake/parquet.cmake
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/CMakeLists.txt
CMakeLists.txt
cmake/parquet.cmake
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.
Applied to files:
src/CMakeLists.txt
CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.
Applied to files:
include/fluent-bit/flb_parquet.h
examples/s3_output/README.md
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.
Applied to files:
plugins/out_s3/s3.c
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.
Applied to files:
cmake/parquet.cmake
🧬 Code graph analysis (4)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
flb_gzip_compress (157-252)
src/flb_zstd.c (1)
flb_zstd_compress(33-57)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
flb_free(126-128)
include/fluent-bit/flb_parquet.h (1)
src/flb_parquet_impl.cpp (2)
flb_msgpack_raw_to_parquet (265-711)
flb_msgpack_raw_to_parquet (265-268)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
s3_put_object(1663-1800)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c
[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter
(syntaxError)
🪛 LanguageTool
examples/s3_output/README.md
[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md
112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
270-270: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (19)
src/aws/flb_aws_compress.c (2)
31-37: LGTM! Wrapper function correctly adapts the signature.

The wrapper properly casts between
void*andchar*to adaptflb_snappy_compress's signature to the uniform compression interface. This follows the same pattern used for gzip and zstd.
45-76: Good refactoring to clarify compression algorithms vs format converters.

The updated comments clearly distinguish true compression algorithms (gzip, snappy, zstd) from format converters (Arrow/Parquet). Removing ARROW/PARQUET from the runtime
compression_optionstable while keeping deprecated macros in the header is a clean backward-compatible approach.One minor observation: the comment on line 59 ("FLB_AWS_COMPRESS_NONE which is 0 is reserved for array footer") is now orphaned after the blank line. Consider moving it closer to the footer entry on line 77.
include/fluent-bit/aws/flb_aws_compress.h (2)
33-46: Good deprecation strategy with clear documentation.

The comments clearly explain that ARROW is deprecated and PARQUET should use
format=parquetinstead. Keeping the macros prevents compilation errors for any downstream code that references them.
28-31: No backward compatibility issue from ZSTD renumbering.

Compression type configuration uses string keywords ("gzip", "snappy", "zstd"), not numeric values. The numeric constants (0, 1, 2, 3, etc.) are internal implementation details used only at runtime for comparisons and branching. The
flb_aws_compression_get_type()function converts configuration strings to these numeric types at startup. Since numeric values are never persisted, serialized, or exposed to external systems, renumbering internal constants has no impact on user configurations or backward compatibility.cmake/parquet.cmake (1)
222-226: Good: Clear status reporting with emoji indicators.The status messages with ✓ and ✗ symbols provide clear visual feedback during CMake configuration. The inclusion of library paths and include directories aids debugging.
src/CMakeLists.txt (3)
157-166: LGTM! Parquet encoder integration follows existing patterns.

The Parquet encoder source is conditionally added and marked as C++ correctly. This follows the same pattern as the AVRO encoder block above.
525-528: SASL linking additions appear correct for Kafka integration.

The SASL library is conditionally linked when
FLB_SASL_ENABLEDis set, properly handling both the shared library and binary targets. This follows standard CMake patterns.Also applies to: 600-605
398-405: The code is correctly guarded. Whenif(FLB_PARQUET_ENCODER)at line 399 is true, it means the parentCMakeLists.txtverified that bothARROW_FOUNDandPARQUET_FOUNDwere true (lines 1320-1328), and it would have setFLB_PARQUET_ENCODER OFFotherwise. The variablesARROW_LIBRARIESandPARQUET_LIBRARIESare set and exported withPARENT_SCOPEincmake/parquet.cmake, ensuring they are available in the subdirectory scope.include/fluent-bit/flb_parquet.h (1)
27-85: Excellent API documentation with comprehensive schema examples.The documentation thoroughly covers:
- Parameter descriptions with format specifications
- Supported type mappings
- Usage example with proper memory management
- Links to Arrow documentation
This level of documentation is exemplary for a public API.
plugins/out_s3/s3.h (1)
197-205: Format fields, macros, ands3_put_objectprototype look consistent.
format/schema_strfields,FLB_S3_FORMAT_JSON/PARQUETmacros, and the updateds3_put_objectprototype (usingflb_sds_t buffer) all match their usage inplugins/out_s3/s3.cand cleanly expose JSON vs Parquet behavior.Also applies to: 234-235
plugins/out_s3/s3.c (8)
192-199: Correctly skippingContent-Encodingfor Parquet payloads.Conditioning the
Content-Encodingheader on(compression == gzip|zstd) && format != FLB_S3_FORMAT_PARQUETensures Parquet objects are not wrapped in an outer HTTP compression layer while still advertising compression for JSON payloads. This matches the “internal only” compression model for Parquet and looks good.Also applies to: 227-239
803-823: Parquet format path: good separation of internal vs outer compression, but hard‑fails withoutschema_str.The Parquet branch in
construct_request_buffer()correctly:
- Avoids outer HTTP compression and uses
ctx->compressiononly as the internal Parquet codec.- Requires
schema_strand bails out early with a clear error if it’s missing.- Converts the Arrow buffer to
flb_sds_tand frees the Arrow‑allocated memory.This behavior is sound: format‑specific requirements are enforced once at conversion time, and buffer ownership is cleanly transferred to Fluent Bit.
Also applies to: 1515-1631
1217-1277: SDS compression path inupload_data()is generally correct, but relies solely onformatto skip outer compression.The new
upload_data()signature and logic look good overall:
bodyis nowflb_sds_t, withbody_to_send/size_to_sendcorrectly switched to the compressed SDS when compression succeeds.- The compression block is gated by
ctx->compression != FLB_AWS_COMPRESS_NONE && ctx->format != FLB_S3_FORMAT_PARQUET, so Parquet payloads are never double‑compressed.- All error paths that allocate
compressed_bodyalso destroy it, and both PutObject and multipart paths clean upcompressed_bodyafter use.No functional issues seen here; the behavior matches the intended “outer compression only for non‑Parquet” design.
Also applies to: 1325-1392
1442-1513: SDS buffer lifecycle aroundconstruct_request_buffer()/upload_data()is consistent (except unit tests).
put_all_chunks(),send_upload_request(), andcb_s3_upload()all:
- Call
construct_request_buffer()to obtain anflb_sds_t buffer.- Pass
bufferintos3_put_object()/upload_data().- Always
flb_sds_destroy(buffer)afterward, regardless of success/failure.This gives a single clear owner for each formatted buffer and avoids leaks in the main upload paths.
Also applies to: 1991-2015, 3440-3488, 3701-3724
1515-1661: Error handling and locking inconstruct_request_buffer()look solid.
- Handles the illegal
(new_data == NULL && chunk == NULL)combination explicitly.- For file‑backed chunks, locks the
s3_fileonly after a successful read, and reliably unlocks on every error path (realloc failure, Parquet schema error, conversion failure).- Frees the intermediate msgpack buffer in all paths (success and failure).
- Ensures
formatted_data/formatted_sizeare both valid before returning and destroys the SDS on conversion failure.This is a good, consistent ownership pattern between the file store, msgpack buffer, and formatted SDS.
1663-1800:s3_put_object()SDS refactor is behavior‑preserving.Switching the body parameter to
flb_sds_tintegrates cleanly:
- MD5 calculation and HTTP request still operate on a contiguous buffer (
body,body_size).- URI/key construction, seq_index updates, error paths, and logging remain unchanged.
- No new ownership expectations are imposed on callers (they still manage the SDS lifetime).
3953-3963: Raw msgpack buffering incb_s3_flush()is a good fit for Parquet.Creating
chunkas an SDS copy of the raw msgpack and deferring JSON/Parquet conversion toconstruct_request_buffer():
- Avoids double conversion when retries occur.
- Makes it easy to plug in additional formats later, since the on‑disk representation stays msgpack.
- Keeps buffer ownership straightforward via
buffer_chunk()/send_upload_request().No issues spotted here.
Also applies to: 4036-4085
4219-4231: Newformat,schema_str,upload_timeout, and bearer token options are wired correctly.
formatis exposed as a documented string option and interpreted viaflb_output_get_property("format", ...).schema_stris mapped intostruct flb_s3::schema_strand used by the Parquet conversion path.upload_timeoutis now a properFLB_CONFIG_MAP_TIMEoption and is referenced consistently in timeout checks and timer setup.authorization_endpoint_bearer_tokenfollows the existing auth option patterns.These mappings and docs align with how the fields are used later in the file.
Also applies to: 4381-4386, 4406-4414
src/flb_parquet_impl.cpp (1)
265-299: Overall Parquet conversion pipeline and memory ownership look good.
- Parameter validation, compression mapping, schema parsing (via cJSON), and Arrow schema construction are all checked with clear error logs.
- Builders are created per field, and all failure paths free cJSON resources and destroy the msgpack zone before returning.
- Record conversion tracks data‑quality stats (missing fields, type failures, int32 overflow, float→int clamps) and emits a concise summary only when issues exist.
- Parquet writing uses
BufferOutputStream(with auto‑growing buffer), closes the writer, then copies the final Arrow buffer intoflb_malloc‑managed memory (output_buffer), which is freed on all exception paths.The high‑level control flow and ownership model are solid; the remaining issues are mainly around numeric edge cases and msgpack object lifetime.
Also applies to: 331-446, 449-611, 616-692
Force-pushed from a821fec to 5e868fc
Actionable comments posted: 0
♻️ Duplicate comments (3)
CMakeLists.txt (1)
1311-1316: Parquet encoder should request C++17, not C++11.

Arrow/Parquet C++ now require at least C++17; forcing
CMAKE_CXX_STANDARD 11in the Parquet block will fail with current Arrow toolchains or silently misconfigure builds. Please bump this to 17 (and keep it consistent with any otherCMAKE_CXX_STANDARDoverrides).if(FLB_PARQUET_ENCODER) enable_language(CXX) set(CMAKE_CXX_STANDARD 17) include(cmake/parquet.cmake) ... endif()#!/bin/bash # Verify Arrow C++ minimum standard in your environment rg -n "CMAKE_CXX_STANDARD" CMakeLists.txt cmake -Scmake/parquet.cmake (1)
149-157: Fix Arrow include search (`include/arrow` suffix still produces wrong include dir)

In the manual search:

```cmake
find_path(ARROW_INCLUDE_DIR
  NAMES arrow/api.h
  PATHS ${SEARCH_PATHS}
  PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
  NO_DEFAULT_PATH
)
```

using `include/arrow` as a suffix causes `ARROW_INCLUDE_DIR` to resolve to `${prefix}/include/arrow`, so the later version check:

```cmake
if(EXISTS "${ARROW_INCLUDE_DIRS}/arrow/util/config.h")
```

looks for `${prefix}/include/arrow/arrow/util/config.h`, which is incorrect for standard Arrow layouts (`${prefix}/include/arrow/...`).

Drop the `include/arrow` suffix and rely on `${INCLUDE_SUFFIXES}` (typically just `include`), e.g.:

```cmake
find_path(ARROW_INCLUDE_DIR
  NAMES arrow/api.h
  PATHS ${SEARCH_PATHS}
  PATH_SUFFIXES ${INCLUDE_SUFFIXES}
  NO_DEFAULT_PATH
)
```

The Parquet search's `include/parquet` suffix is fine and should remain.

Also applies to: 198-216
plugins/out_s3/s3.c (1)
744-775: Backward compatibility broken: `compression=arrow|parquet` now fails instead of mapping to `format=parquet`

The check at line 752 can never be true. When `flb_aws_compression_get_type(tmp)` is called with `"arrow"` or `"parquet"`, it returns `-1` because these strings are not in the `compression_options` table in `flb_aws_compress.c` (which only includes gzip, snappy, and zstd). This causes the code to hit the "Unknown compression type" error path instead of handling these deprecated aliases.

The constants `FLB_AWS_COMPRESS_ARROW` (4) and `FLB_AWS_COMPRESS_PARQUET` (5) are defined but unreachable, contradicting the documented deprecated behavior and the inline comments claiming support.

Handle these strings explicitly before calling `flb_aws_compression_get_type`:

```c
tmp = flb_output_get_property("compression", ins);
if (tmp) {
    if (strcasecmp(tmp, "arrow") == 0 || strcasecmp(tmp, "parquet") == 0) {
        flb_plg_warn(ctx->ins,
                     "DEPRECATED: compression=%s is deprecated. "
                     "Use format=parquet instead.", tmp);
#ifdef FLB_HAVE_PARQUET_ENCODER
        ctx->format = FLB_S3_FORMAT_PARQUET;
#else
        flb_plg_error(ctx->ins,
                      "Parquet format is not supported in this build. "
                      "Rebuild with -DFLB_PARQUET_ENCODER=On.");
        return -1;
#endif
    }
    else {
        ret = flb_aws_compression_get_type(tmp);
        if (ret == -1) {
            flb_plg_error(ctx->ins, "Unknown compression type: %s", tmp);
            return -1;
        }
        ctx->compression = ret;
    }
}
```
🧹 Nitpick comments (3)
plugins/out_s3/s3.c (2)
49-57: Remove duplicate static declarations for `get_upload`/`create_upload`

Both functions are declared twice (once near the top as forward declarations and again before their definitions). This is harmless but noisy and can confuse readers; you only need one forward declaration (or none if you reorder definitions).
Also applies to: 86-92
1218-1277: SDS-based outer compression flow is sound, but pre/post compression logging is now misleading

The refactor in `upload_data` to:

- Use `body_to_send`/`size_to_send` (possibly compressed) while leaving the original `body`/`body_size` unchanged, and
- Wrap the compressed payload in an `flb_sds_t` (`compressed_body`) with consistent cleanup,

looks correct and avoids raw `malloc`/`free` handling in callers.

However, the informational log in the "too small, using PutObject" branch still prints:

```c
flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, ...",
             preCompress_size, body_size);
```

while the actual payload on the wire is `size_to_send`, not `body_size`. For Parquet, this message can also trigger even though no outer compression runs.

Consider:

- Basing the "after compression" size on `size_to_send`, and
- Gating this log on `ctx->format != FLB_S3_FORMAT_PARQUET` to avoid implying any outer compression there.

Also applies to: 1300-1314, 1330-1392
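A sketch that applies both suggestions is shown below; it is illustrative rather than the plugin's literal code, and simply reports the size actually sent while skipping the message for Parquet, which never passes through the outer compressor.

```c
/* Report the payload size actually sent and skip the message for
 * Parquet, which does not use the outer gzip/zstd pass. */
if (ctx->format != FLB_S3_FORMAT_PARQUET &&
    ctx->use_put_object == FLB_FALSE &&
    (ctx->compression == FLB_AWS_COMPRESS_GZIP ||
     ctx->compression == FLB_AWS_COMPRESS_ZSTD)) {
    flb_plg_info(ctx->ins,
                 "Pre-compression upload_chunk_size=%zu, after compression "
                 "chunk is only %zu bytes; using PutObject to upload",
                 preCompress_size, size_to_send);
}
```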
examples/s3_output/README.md (1)
63-72: Add language identifiers to fenced code blocks for better tooling support

The README content is solid and matches the new S3/Parquet behavior, but several fenced blocks (config snippets, shell commands, JSON schemas, SQL) are missing language tags. Adding them will quiet markdownlint and improve rendering/IDE support, e.g.:

```ini
[OUTPUT]
    Name s3
    ...
```

```sh
cmake ..
make -j8
```

```json
{ "fields": [ ... ] }
```

```sql
SELECT ...;
```

Also applies to: 85-95, 144-148, 152-156, 160-164, 178-181, 209-218, 241-251, 296-314, 341-349, 361-365, 384-426, 432-438, 442-456, 460-466, 484-495
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/unit-tests.yaml
- CMakeLists.txt
- cmake/parquet.cmake
- examples/s3_output/README.md
- examples/s3_output/athena-queries.sql
- examples/s3_output/fluent-bit-s3-parquet.conf
- examples/s3_output/run-s3-parquet-test.sh
- include/fluent-bit/aws/flb_aws_compress.h
- plugins/out_s3/s3.c
- plugins/out_s3/s3.h
- src/CMakeLists.txt
- src/aws/CMakeLists.txt
- src/aws/compression/CMakeLists.txt
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/flb_aws_compress.c
💤 Files with no reviewable changes (5)
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/CMakeLists.txt
- src/aws/compression/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- examples/s3_output/athena-queries.sql
- examples/s3_output/fluent-bit-s3-parquet.conf
- examples/s3_output/run-s3-parquet-test.sh
- .github/workflows/unit-tests.yaml
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/CMakeLists.txtcmake/parquet.cmakeCMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/CMakeLists.txtcmake/parquet.cmakeCMakeLists.txt
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Applied to files:
src/CMakeLists.txtinclude/fluent-bit/aws/flb_aws_compress.hplugins/out_s3/s3.cCMakeLists.txtsrc/aws/flb_aws_compress.c
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/CMakeLists.txtinclude/fluent-bit/aws/flb_aws_compress.hplugins/out_s3/s3.cCMakeLists.txtsrc/aws/flb_aws_compress.c
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.
Applied to files:
src/CMakeLists.txtCMakeLists.txt
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.
Applied to files:
src/CMakeLists.txtinclude/fluent-bit/aws/flb_aws_compress.hplugins/out_s3/s3.cCMakeLists.txtsrc/aws/flb_aws_compress.c
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.
Applied to files:
src/CMakeLists.txtinclude/fluent-bit/aws/flb_aws_compress.hplugins/out_s3/s3.cCMakeLists.txtsrc/aws/flb_aws_compress.c
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.
Applied to files:
src/CMakeLists.txtinclude/fluent-bit/aws/flb_aws_compress.hplugins/out_s3/s3.cCMakeLists.txtsrc/aws/flb_aws_compress.c
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.
Applied to files:
src/CMakeLists.txtinclude/fluent-bit/aws/flb_aws_compress.hCMakeLists.txtsrc/aws/flb_aws_compress.c
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.
Applied to files:
examples/s3_output/README.md
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.
Applied to files:
cmake/parquet.cmake
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.
Applied to files:
plugins/out_s3/s3.c
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c
[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter
(syntaxError)
🪛 LanguageTool
examples/s3_output/README.md
[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md
112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
270-270: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (11)
plugins/out_s3/s3.h (1)
123-199: New format/schema fields and S3 format macros look consistent with implementation

Appending `format`, `schema_str`, the JSON/PARQUET macros, and the updated `s3_put_object` prototype matches how `s3.c` now routes JSON vs Parquet and SDS bodies. No structural or API consistency issues here.

Also applies to: 201-204, 233-234
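For orientation, the additions could look roughly like the sketch below; the macro value for PARQUET and the member types are assumptions (the review elsewhere only states that JSON is the default format constant).

```c
/* Illustrative only: macro values and member types are assumptions,
 * except that JSON is documented as the default format. */
#define FLB_S3_FORMAT_JSON     0
#define FLB_S3_FORMAT_PARQUET  1

struct flb_s3 {
    /* ... existing members ... */
    int format;             /* FLB_S3_FORMAT_JSON or FLB_S3_FORMAT_PARQUET */
    flb_sds_t schema_str;   /* user-supplied Parquet schema (JSON text)    */
};
```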
plugins/out_s3/s3.c (7)
192-239: Header logic correctly skips outer Content-Encoding for Parquet

The updated `create_headers` now omits `Content-Encoding` when `ctx->format == FLB_S3_FORMAT_PARQUET`, matching the fact that Parquet uses its own internal compression. This aligns headers with the new upload semantics.
777-823: Format selection and Parquet encoder guards are consistent with new design

The new `format` parsing and `#ifdef FLB_HAVE_PARQUET_ENCODER` checks correctly:

- Default to JSON unless Parquet was set (including via legacy compression mapping once fixed).
- Reject `format=parquet` builds that lack `FLB_PARQUET_ENCODER`.
- For Parquet builds, keep `ctx->compression` for internal Parquet compression only and skip outer compression.

This matches the intended separation of JSON vs Parquet and internal-vs-outer compression.
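A minimal sketch of that selection logic follows, assembled from the option names and macros cited in this review (`flb_output_get_property`, `FLB_HAVE_PARQUET_ENCODER`, `FLB_S3_FORMAT_*`); it is an illustration, not the literal plugin code.

```c
/* Illustrative sketch of the format selection described above. */
tmp = flb_output_get_property("format", ins);
if (tmp && strcasecmp(tmp, "parquet") == 0) {
#ifdef FLB_HAVE_PARQUET_ENCODER
    ctx->format = FLB_S3_FORMAT_PARQUET;
    /* ctx->compression now only selects Parquet's internal codec;
     * no outer gzip/zstd pass is applied to the request body. */
#else
    flb_plg_error(ctx->ins,
                  "Parquet format is not supported in this build. "
                  "Rebuild with -DFLB_PARQUET_ENCODER=On.");
    return -1;
#endif
}
else {
    ctx->format = FLB_S3_FORMAT_JSON;
}
```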
1663-1800: Updated `s3_put_object` signature and SDS handling are consistent

The new `s3_put_object` taking `flb_sds_t body, size_t body_size` is wired correctly:

- Callers pass SDS buffers and later destroy them (or the owning wrappers).
- MD5 computation and HTTP request use the SDS pointer as a regular `char *`.
- Error paths free `uri` and preserve `seq_index` semantics via the `decrement_index` label.

No memory lifetime issues are apparent around the new body type.
1991-2015: SDS buffer ownership in `send_upload_request` and `unit_test_flush` is now explicit

Both paths:

- Create an SDS buffer via `construct_request_buffer`,
- Pass it into `upload_data`,
- Always `flb_sds_destroy(buffer)` after the call,

which removes the previous mixed ownership around raw buffers and avoids leaks in error paths.
Also applies to: 3706-3724
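The pattern reads roughly as follows; the call shapes match the ones quoted elsewhere in this review, while the surrounding details are illustrative.

```c
/* Ownership pattern described above (illustrative): the SDS returned by
 * construct_request_buffer() is destroyed by the caller once
 * upload_data() returns, on success and on failure alike. */
flb_sds_t buffer = NULL;
size_t buffer_size = 0;
int ret;

ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size);
if (ret < 0) {
    return -1;
}

ret = upload_data(ctx, upload_file, m_upload_file,
                  buffer, buffer_size, tag, tag_len);
flb_sds_destroy(buffer);        /* always released here */

return ret;
```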
3953-4089: Flush path correctly preserves raw MsgPack and defers formatting

`cb_s3_flush` now:

- Stores raw MsgPack into an SDS (`chunk`),
- Uses `s3_store_file_get`/`buffer_chunk` to persist raw MsgPack,
- Only converts to JSON/Parquet inside `construct_request_buffer` right before upload.

This is the right abstraction for supporting multiple output formats (JSON vs Parquet) without duplicating buffer logic.
4210-4231: Compression/format/schema/upload_timeout config docs largely match behavior, but note legacy compression caveat

The updated config map entries:

- Document `compression` as `none|gzip|snappy|zstd` with legacy `arrow`/`parquet` values deprecated,
- Introduce `format` and `schema_str` (required for `format=parquet`),
- Add a documented `upload_timeout` parameter, and
- Add `authorization_endpoint_bearer_token`,

which aligns with the new logic in `cb_s3_init` and the Parquet encoder path, once the legacy `compression=arrow|parquet` handling is fixed as noted above.

Also applies to: 4381-4413
1450-1660: Raw MsgPack buffering + deferred JSON/Parquet conversion is correctly implemented

The new `construct_request_buffer`/`put_all_chunks` flow:

- Reads raw MsgPack from `s3_store_file_read`,
- Optionally appends in-memory MsgPack (`new_data`) via `flb_realloc` + `memcpy`,
- Converts once per upload to either `log_key` projection, JSON lines, or Parquet via `flb_msgpack_raw_to_parquet` (with `schema_str` and `ctx->compression`),
- Properly cleans up all internal allocations (msgpack buffer, intermediate formatted data) on all error paths before returning,
- Never sets `*out_buf` on error, ensuring callers receive only valid or NULL pointers, and
- Returns an `flb_sds_t` which all four callers properly destroy on the success path.

Chunk locking/unlocking on error paths is balanced throughout. Memory management is sound.
src/CMakeLists.txt (1)
157-166: Parquet encoder and SASL library wiring in src/CMakeLists.txt look correct

- `flb_parquet_impl.cpp` is only built when `FLB_PARQUET_ENCODER` is enabled and is explicitly marked as C++, matching the Arrow/Parquet dependency.
- `ARROW_LIBRARIES`/`PARQUET_LIBRARIES` are only appended to `FLB_DEPS` under the same flag, consistent with `cmake/parquet.cmake`.
- SASL is linked into both `fluent-bit-shared` and `fluent-bit-bin` only when `FLB_SASL_ENABLED` is set, so non-Kafka or non-SASL builds remain unaffected.

Also applies to: 398-405, 525-528, 600-603
include/fluent-bit/aws/flb_aws_compress.h (1)
23-47: Compression enum layout is fine; ensure callers don't rely on ARROW/PARQUET in `flb_aws_compression_get_type`

The new assignments (NONE=0, GZIP=1, SNAPPY=2, ZSTD=3, ARROW=4 [deprecated], PARQUET=5 ["use format=parquet instead"]) are internally consistent and safe as long as all code uses the macros, not raw integers.

Given `flb_aws_compress.c` now only recognizes `gzip|snappy|zstd`, ARROW/PARQUET are effectively compile-time constants for legacy conditionals; callers must not expect `flb_aws_compression_get_type()` to ever return these values anymore. The S3 plugin currently still does (and should be fixed as noted in `s3.c`).

src/aws/flb_aws_compress.c (1)
24-38: SNAPPY support is wired correctly; note that ARROW/PARQUET are no longer valid compression types

- `flb_snappy_compress_wrapper` correctly adapts `flb_snappy_compress` to the common `(void **out_data, size_t *out_len)` signature and is registered under `"snappy"` in `compression_options`.
- The options table now only contains true compression algorithms (gzip/snappy/zstd); ARROW/PARQUET have been removed, so `flb_aws_compression_get_type("arrow")`/`"parquet"` will now return `-1`.

This matches the new separation between compression and format conversion, but requires callers (notably the S3 plugin) to treat `"arrow"`/`"parquet"` as legacy format aliases rather than compression types; see the earlier comment in `s3.c`.

Also applies to: 57-78
5e868fc to 2555471 Compare
Actionable comments posted: 2
♻️ Duplicate comments (3)
examples/s3_output/athena-queries.sql (1)
168-181: Replace `PERCENTILE` with `approx_percentile` for Athena compatibility.

Athena does not support the `PERCENTILE(duration_ms, 0.95)` syntax. Use `approx_percentile(duration_ms, 0.95)` instead.

🔎 Proposed fix

```diff
 SELECT
     path,
     method,
     COUNT(*) as request_count,
     AVG(duration_ms) as avg_duration_ms,
-    PERCENTILE(duration_ms, 0.95) as p95_duration_ms,
+    approx_percentile(duration_ms, 0.95) as p95_duration_ms,
     MAX(duration_ms) as max_duration_ms
 FROM access_logs_parquet
```

cmake/parquet.cmake (1)
149-170: Remove the `include/arrow` suffix from Arrow header search.

The `PATH_SUFFIXES include/arrow` in `find_path` for Arrow headers is incorrect. When searching for `arrow/api.h`, CMake would find it at `${prefix}/include/arrow/arrow/api.h` which doesn't exist. The correct include directory should be `${prefix}/include` so that `#include <arrow/api.h>` resolves correctly.

🔎 Proposed fix

```diff
 # Search for Arrow
 if(NOT ARROW_FOUND)
   find_path(ARROW_INCLUDE_DIR
     NAMES arrow/api.h
     PATHS ${SEARCH_PATHS}
-    PATH_SUFFIXES ${INCLUDE_SUFFIXES} include/arrow
+    PATH_SUFFIXES ${INCLUDE_SUFFIXES}
     NO_DEFAULT_PATH
   )
```

plugins/out_s3/s3.c (1)
744-824: Legacy compression path should explicitly default to GZIP.

The comment at line 764 says "Keep default gzip compression" but `ctx->compression` remains `FLB_AWS_COMPRESS_NONE` (set at line 745). This contradicts user expectations for backward compatibility.

🔎 Suggested fix

```diff
 #ifndef FLB_HAVE_PARQUET_ENCODER
     flb_plg_error(ctx->ins,
                   "Parquet format is not supported in this build. "
                   "Rebuild with -DFLB_PARQUET_ENCODER=On.");
     return -1;
 #else
     ctx->format = FLB_S3_FORMAT_PARQUET;
-    /* Keep default gzip compression */
+    /* Default to gzip compression for backward compatibility */
+    if (ctx->compression == FLB_AWS_COMPRESS_NONE) {
+        ctx->compression = FLB_AWS_COMPRESS_GZIP;
+    }
 #endif
```
🧹 Nitpick comments (3)
examples/s3_output/README.md (2)
112-114: Add language specifier to fenced code block.The code block at line 112 is missing a language specifier, which affects syntax highlighting and linting.
🔎 Proposed fix
-``` +```text Upload Trigger = upload_timeout reached OR total_file_size reached</details> --- `177-181`: **Add language specifier to fenced code block.** The code block showing S3 upload paths is missing a language specifier. <details> <summary>🔎 Proposed fix</summary> ```diff -``` +```text s3://s3-bucket-kafka-sink/fluent-bit-logs/ ├── putobject/app/day=YYYYMMDD/ # Case 1: Small files via PutObject └── multipart/access/day=YYYYMMDD/ # Case 2: Large files via Multipart Upload</details> </blockquote></details> <details> <summary>src/flb_parquet_impl.cpp (1)</summary><blockquote> `708-736`: **Memory management is correct, but ownership should be documented.** The function allocates output_buffer with flb_malloc (line 710) and copies Arrow data into it (line 716). Exception handlers properly clean up on error (line 726). **Suggestion:** Add a comment in the function documentation (or in the header file) explicitly stating that the caller must free the returned buffer with flb_free(). </blockquote></details> </blockquote></details> <details> <summary>📜 Review details</summary> **Configuration used**: defaults **Review profile**: CHILL **Plan**: Pro <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 5e868fceaf3521954dee257936901981215314ea and 2555471660ff1c95ba1231130e46af6cf36d9017. </details> <details> <summary>📒 Files selected for processing (19)</summary> * `.github/workflows/unit-tests.yaml` * `CMakeLists.txt` * `cmake/parquet.cmake` * `examples/s3_output/README.md` * `examples/s3_output/athena-queries.sql` * `examples/s3_output/fluent-bit-s3-parquet.conf` * `examples/s3_output/run-s3-parquet-test.sh` * `include/fluent-bit/aws/flb_aws_compress.h` * `include/fluent-bit/flb_parquet.h` * `plugins/out_s3/s3.c` * `plugins/out_s3/s3.h` * `src/CMakeLists.txt` * `src/aws/CMakeLists.txt` * `src/aws/compression/CMakeLists.txt` * `src/aws/compression/arrow/CMakeLists.txt` * `src/aws/compression/arrow/compress.c` * `src/aws/compression/arrow/compress.h` * `src/aws/flb_aws_compress.c` * `src/flb_parquet_impl.cpp` </details> <details> <summary>💤 Files with no reviewable changes (5)</summary> * src/aws/compression/arrow/compress.h * src/aws/CMakeLists.txt * src/aws/compression/CMakeLists.txt * src/aws/compression/arrow/CMakeLists.txt * src/aws/compression/arrow/compress.c </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (4)</summary> * include/fluent-bit/flb_parquet.h * src/aws/flb_aws_compress.c * examples/s3_output/run-s3-parquet-test.sh * examples/s3_output/fluent-bit-s3-parquet.conf </details> <details> <summary>🧰 Additional context used</summary> <details> <summary>🧠 Learnings (16)</summary> <details> <summary>📓 Common learnings</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like#ifdef FLB_HAVE_ZSTD. Unlike some other optional components such as ARROW/PARQUET (which use#ifdef FLB_HAVE_ARROWguards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.</details> <details> <summary>📚 Learning: 2025-08-07T10:15:46.187Z</summary>Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.**Applied to files:** - `cmake/parquet.cmake` - `.github/workflows/unit-tests.yaml` </details> <details> <summary>📚 Learning: 2025-08-31T12:46:11.940Z</summary>Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.**Applied to files:** - `cmake/parquet.cmake` - `.github/workflows/unit-tests.yaml` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-31T12:46:11.940Z</summary>Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka isFLB_PREFER_SYSTEM_LIB_KAFKA=ON.**Applied to files:** - `cmake/parquet.cmake` - `.github/workflows/unit-tests.yaml` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-29T06:25:27.250Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like#ifdef FLB_HAVE_ZSTD. Unlike some other optional components such as ARROW/PARQUET (which use#ifdef FLB_HAVE_ARROWguards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.**Applied to files:** - `.github/workflows/unit-tests.yaml` - `include/fluent-bit/aws/flb_aws_compress.h` - `plugins/out_s3/s3.c` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-29T06:24:55.855Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.**Applied to files:** - `.github/workflows/unit-tests.yaml` - `src/flb_parquet_impl.cpp` - `include/fluent-bit/aws/flb_aws_compress.h` - `plugins/out_s3/s3.c` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-29T06:25:27.250Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like#ifdef FLB_HAVE_ZSTD. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.**Applied to files:** - `.github/workflows/unit-tests.yaml` - `include/fluent-bit/aws/flb_aws_compress.h` - `plugins/out_s3/s3.c` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-29T06:24:26.170Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.**Applied to files:** - `.github/workflows/unit-tests.yaml` - `include/fluent-bit/aws/flb_aws_compress.h` - `plugins/out_s3/s3.c` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-29T06:24:44.797Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.**Applied to files:** - `.github/workflows/unit-tests.yaml` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-09-08T11:21:33.975Z</summary>Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.**Applied to files:** - `.github/workflows/unit-tests.yaml` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-08-29T06:25:02.561Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree atlib/zstd-1.5.7and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards likeFLB_HAVE_ZSTDand is always available. Headers like<fluent-bit/flb_zstd.h>can be included directly without guards.**Applied to files:** - `include/fluent-bit/aws/flb_aws_compress.h` - `plugins/out_s3/s3.c` - `src/CMakeLists.txt` - `CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-09-04T12:35:22.872Z</summary>Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.**Applied to files:** - `plugins/out_s3/s3.c` </details> <details> <summary>📚 Learning: 2025-12-22T05:39:02.291Z</summary>Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.**Applied to files:** - `src/CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-09-14T09:46:09.531Z</summary>Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.**Applied to files:** - `src/CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-09-04T07:28:37.083Z</summary>Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.**Applied to files:** - `src/CMakeLists.txt` </details> <details> <summary>📚 Learning: 2025-11-21T06:23:29.770Z</summary>Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.**Applied to files:** - `examples/s3_output/README.md` </details> </details><details> <summary>🧬 Code graph analysis (2)</summary> <details> <summary>src/flb_parquet_impl.cpp (1)</summary><blockquote> <details> <summary>include/fluent-bit/flb_mem.h (1)</summary> * `flb_free` (126-128) </details> </blockquote></details> <details> <summary>plugins/out_s3/s3.h (1)</summary><blockquote> <details> <summary>plugins/out_s3/s3.c (1)</summary> * `s3_put_object` (1663-1800) </details> </blockquote></details> </details><details> <summary>🪛 Cppcheck (2.19.0)</summary> <details> <summary>plugins/out_s3/s3.c</summary> [error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter (syntaxError) </details> </details> <details> <summary>🪛 LanguageTool</summary> <details> <summary>examples/s3_output/README.md</summary> [grammar] ~56-~56: Ensure spelling is correct Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us... (QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1) --- [grammar] ~465-~465: Ensure spelling is correct Context: ...les uploaded when reaching 50MB (before 60min timeout) **Verification Steps:** 1. R... (QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1) </details> </details> <details> <summary>🪛 markdownlint-cli2 (0.18.1)</summary> <details> <summary>examples/s3_output/README.md</summary> 112-112: Fenced code blocks should have a language specified (MD040, fenced-code-language) --- 177-177: Fenced code blocks should have a language specified (MD040, fenced-code-language) --- 270-270: Fenced code blocks should have a language specified (MD040, fenced-code-language) </details> </details> </details> <details> <summary>🔇 Additional comments (28)</summary><blockquote> <details> <summary>.github/workflows/unit-tests.yaml (2)</summary><blockquote> `56-79`: **CI matrix configuration for Parquet encoder looks correct.** The Parquet encoder test configuration properly: - Uses the new `FLB_PARQUET_ENCODER` flag - Excludes Clang compiler (consistent with other specialized build options like coverage) - Aligns with the broader transition from Arrow GLib to C++ Arrow --- `117-125`: **Apache Arrow setup uses correct packages and official distribution source.** The configuration correctly: - Uses `packages.apache.org` as the official distribution source - Installs `libarrow-dev` and `libparquet-dev` (C++ libraries) instead of the GLib variants - Aligns the step name and condition with the new `FLB_PARQUET_ENCODER` flag </blockquote></details> <details> <summary>include/fluent-bit/aws/flb_aws_compress.h (2)</summary><blockquote> `33-46`: **Good documentation distinguishing compression from format conversion.** The comment block clearly explains: - The distinction between actual compression algorithms and file format conversions - Arrow's deprecation status with guidance not to use in new code - That Parquet should be used via `format=parquet` rather than `compression=parquet` This helps prevent user confusion about the difference between compression and format conversion. --- `24-46`: The concern about backward compatibility from renumbering constants is not applicable. The compression type configuration uses string-based names ("gzip", "snappy", "zstd", "parquet") via an internal lookup table, not numeric values. 
The enum constants are purely internal implementation details that never get serialized, persisted, or exposed to external code. Renumbering these internal values has no backward compatibility impact. </blockquote></details> <details> <summary>examples/s3_output/athena-queries.sql (1)</summary><blockquote> `1-80`: **Well-structured Athena table definitions and query examples.** The SQL file provides comprehensive examples covering: - Partition projection for automatic partition discovery - Both JSON and Parquet format table definitions - Realistic query patterns for log analysis This is valuable documentation for users implementing the new Parquet format support. </blockquote></details> <details> <summary>examples/s3_output/README.md (1)</summary><blockquote> `1-80`: **Excellent documentation for Parquet format support.** The README provides comprehensive coverage of: - Upload methods (PutObject vs Multipart) - Format and compression options - Schema definition and type handling with best practices - Troubleshooting guidance - Performance testing scenarios This is high-quality documentation that will help users adopt the new Parquet format feature. </blockquote></details> <details> <summary>plugins/out_s3/s3.h (2)</summary><blockquote> `195-203`: **Format and schema fields properly added for Parquet support.** The new struct members and constants are well-organized: - `format` field for output format selection - `schema_str` for user-defined Parquet schema - Clear format type constants with JSON=0 as default --- `233-234`: **Updated `s3_put_object` signature uses SDS buffer type.** The function signature change from raw buffer to `flb_sds_t` aligns with the SDS-based buffer handling throughout the S3 upload flow. </blockquote></details> <details> <summary>src/CMakeLists.txt (3)</summary><blockquote> `157-166`: **Parquet encoder C++ integration correctly configured.** The Parquet encoder integration: - Adds `flb_parquet_impl.cpp` under `FLB_PARQUET_ENCODER` guard - Properly marks the file as C++ using `set_source_files_properties` - Follows the existing pattern for conditional source inclusion --- `398-405`: **Parquet library dependencies properly linked.** The `ARROW_LIBRARIES` and `PARQUET_LIBRARIES` are correctly added to `FLB_DEPS` when `FLB_PARQUET_ENCODER` is enabled, ensuring proper linkage with the detected Arrow/Parquet libraries from `cmake/parquet.cmake`. --- `525-528`: **Verify SASL linking changes are intentionally part of this PR.** The SASL library linking for Kafka support (`sasl2`) is added here but appears unrelated to the Parquet encoder feature. Please confirm this change is intentional for this PR or if it should be in a separate commit. Also applies to: 600-605 </blockquote></details> <details> <summary>cmake/parquet.cmake (1)</summary><blockquote> `1-83`: **Well-designed cross-platform library detection module.** The parquet.cmake module provides robust library detection with: - Three fallback methods (pkg-config → CMake config → manual search) - Platform-specific path handling for Windows, macOS (including Apple Silicon), and Linux - Proper handling of both shared and static library targets - Clear status messaging </blockquote></details> <details> <summary>CMakeLists.txt (3)</summary><blockquote> `202-202`: **New `FLB_PARQUET_ENCODER` option properly defined.** The option follows the existing naming convention for encoders (`FLB_AVRO_ENCODER`) and is disabled by default, which is appropriate for an optional feature requiring external dependencies. 
--- `284-286`: **`FLB_ALL` properly enables both encoders.** The "Enable all features" mode now consistently enables both Avro and Parquet encoders. --- `1311-1329`: **Parquet encoder integration is well-structured.** The configuration block: - Enables C++ language support as required by Arrow/Parquet - Uses C++17 standard (addressing past review comment about Arrow 10.0+ requiring C++17) - Integrates with `cmake/parquet.cmake` for library detection - Gracefully disables the feature with warnings when dependencies are missing </blockquote></details> <details> <summary>src/flb_parquet_impl.cpp (6)</summary><blockquote> `1-40`: **LGTM: Headers and includes are properly organized.** The extern "C" wrapper correctly isolates C headers from C++ code, and all necessary dependencies are included. --- `41-94`: **LGTM: Converter class design is clean and comprehensive.** The statistics tracking for overflows and type conversions will help users identify data quality issues. --- `139-158`: **Overflow handling correctly implemented.** The uint64 to int64 overflow check at lines 142-146 properly addresses the previous review concern. Values exceeding INT64_MAX are clamped and tracked in statistics. --- `220-257`: **Timestamp unit handling correctly implemented.** The function now properly reads the target time unit from the builder's type (lines 221-223) and applies the appropriate multiplier. This addresses the previous review concern. **Note:** The code assumes input timestamps are in seconds (line 225). Ensure this is documented in the public API or consider accepting a parameter for input unit. --- `458-547`: **Excellent: Immediate record processing avoids dangling pointer issue.** The lambda processes each msgpack record immediately within the unpack loop (lines 469-546), which correctly addresses the past review comment about storing msgpack_object pointers that become invalid. The use of string_view for field lookup (line 493) also avoids unnecessary string copies. --- `332-443`: **Schema parsing is robust with good error messages.** The JSON schema parsing properly validates required fields (lines 365-377) and provides helpful error messages that include the problematic schema content (lines 337-338, 347-348). Timestamp unit parsing (lines 410-420) correctly handles all standard Arrow time units. </blockquote></details> <details> <summary>plugins/out_s3/s3.c (7)</summary><blockquote> `42-57`: **LGTM: Proper conditional compilation for Parquet support.** The FLB_HAVE_PARQUET_ENCODER guard ensures the header is only included when Parquet support is enabled. --- `195-229`: **Correct: Content-Encoding header skipped for Parquet.** Lines 196-197 and 228-229 properly skip the Content-Encoding header when `format=parquet`, since Parquet handles compression internally within the file format. This avoids double-compression overhead. --- `1244-1277`: **LGTM: Compression properly skipped for Parquet with correct memory management.** The check at line 1245 ensures Parquet format bypasses outer compression. The compressed data is properly converted to flb_sds_t (lines 1260-1272) for consistent memory management, and cleanup is handled in all code paths (lines 1333-1335, 1362-1364, 1377-1379, 1388-1391). 
--- `1582-1630`: **Parquet conversion properly integrated with good error handling.** The Parquet conversion block (lines 1582-1630) correctly: - Validates schema_str is provided (lines 1588-1596) - Calls the Parquet encoder with compression setting (lines 1598-1603) - Converts the result to flb_sds_t for consistent memory management (lines 1615-1627) - Frees the intermediate parquet_buffer (line 1617) - Handles all error paths with proper cleanup Verify that the schema_str requirement is documented in user-facing documentation and error messages are clear enough for users to understand the schema format. --- `1663-1664`: **LGTM: Signature change to flb_sds_t improves consistency.** The s3_put_object signature change to accept flb_sds_t (line 1664) aligns with the overall refactoring for consistent SDS-based buffer management. --- `3953-3962`: **Excellent architectural decision with clear documentation.** The comment at lines 3953-3956 clearly explains the design choice to store raw msgpack and defer format conversion until upload. This is particularly beneficial for Parquet, which can batch multiple records into optimized columnar storage. --- `4210-4231`: **Config map correctly defined with comprehensive documentation.** The configuration parameters are properly defined: - `compression` and `format` use offset 0 (lines 4211, 4220), indicating manual parsing in cb_s3_init, which is correct - `schema_str` is mapped to `ctx->schema_str` (line 4228) - Descriptions clearly document supported values, defaults, and usage (lines 4212-4217, 4221-4225, 4229-4231) **Note:** The past review comment about compression type mismatch is not applicable here since offset 0 means no automatic field mapping. </blockquote></details> </blockquote></details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
664cb4e to eff671c Compare
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
plugins/out_s3/s3.c (2)
1217-1324: Usesize_to_send(compressed size) for routing decisions and loggingInside
upload_data(), after optional compression you introduce:flb_sds_t body_to_send = body; size_t size_to_send = body_size; ... preCompress_size = body_size; body_to_send = compressed_body; size_to_send = temp_size;but the subsequent logic still uses the original
body_sizefor decisions and logging:
- Timeout/size checks for whether to
goto put_objectvsmultipart.- The “Pre-compression ... After compression ...” log still uses
body_sizefor the “after” value.- Existing upload path compares
body_sizetoMIN_CHUNKED_UPLOAD_SIZE.With compression enabled, this means routing decisions and logs are based on the uncompressed size, which diverges from the actual payload size now stored in
size_to_send. It also breaks the intent of the informational log (both values end up identical).Switching those uses of
body_sizetosize_to_sendpreserves the old semantics (decisions based on the payload actually sent) and fixes the log.Suggested diff
- else if (body_size >= ctx->file_size) { + else if (size_to_send >= ctx->file_size) { @@ - else if(body_size > MIN_CHUNKED_UPLOAD_SIZE) { + else if(size_to_send > MIN_CHUNKED_UPLOAD_SIZE) { @@ - else { - if ((ctx->use_put_object == FLB_FALSE && (ctx->compression == FLB_AWS_COMPRESS_GZIP || ctx->compression == FLB_AWS_COMPRESS_ZSTD))) { - flb_plg_info(ctx->ins, "Pre-compression upload_chunk_size= %zu, After compression, chunk is only %zu bytes, " - "the chunk was too small, using PutObject to upload", preCompress_size, body_size); - } + else { + if (ctx->use_put_object == FLB_FALSE && + (ctx->compression == FLB_AWS_COMPRESS_GZIP || + ctx->compression == FLB_AWS_COMPRESS_ZSTD)) { + flb_plg_info(ctx->ins, + "Pre-compression upload_chunk_size= %zu, " + "After compression, chunk is only %zu bytes, " + "the chunk was too small, using PutObject to upload", + preCompress_size, size_to_send); + } @@ - if (body_size < MIN_CHUNKED_UPLOAD_SIZE) { + if (size_to_send < MIN_CHUNKED_UPLOAD_SIZE) {
1450-1500: SDS lifetime is mostly correct; fix leak inunit_test_flushThe SDS refactor is generally consistent:
construct_request_buffer()now always returnsflb_sds_tand callers destroy buffers withflb_sds_destroy().send_upload_request(),put_all_chunks(), andcb_s3_upload()correctly treat the returned buffer as an SDS and free it onceupload_data()completes.upload_data()cleans up the optionalcompressed_bodySDS in all exit paths.One leak remains in
unit_test_flush():s3_store_buffer_put(ctx, upload_file, tag, tag_len, chunk, (size_t) chunk_size, file_first_log_time); ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size); ... ret = upload_data(ctx, upload_file, m_upload_file, buffer, buffer_size, tag, tag_len); /* buffer destroyed, but chunk is never freed */Unlike
buffer_chunk()and other call sites,chunkis never destroyed here, so every unit‑test flush leaks one SDS buffer whenFLB_S3_PLUGIN_UNDER_TESTis set.You can mirror
buffer_chunk()and freechunkimmediately afterconstruct_request_buffer():Suggested diff
- s3_store_buffer_put(ctx, upload_file, tag, tag_len, - chunk, (size_t) chunk_size, file_first_log_time); - ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size); - if (ret < 0) { + s3_store_buffer_put(ctx, upload_file, tag, tag_len, + chunk, (size_t) chunk_size, file_first_log_time); + ret = construct_request_buffer(ctx, chunk, upload_file, &buffer, &buffer_size); + flb_sds_destroy(chunk); + if (ret < 0) { flb_plg_error(ctx->ins, "Could not construct request buffer for %s", upload_file->file_path); FLB_OUTPUT_RETURN(FLB_RETRY); }Also applies to: 1991-2015, 3701-3724
🧹 Nitpick comments (9)
examples/s3_output/run-s3-parquet-test.sh (2)
1-5: Consider adding shell script safety options.For test scripts, adding safety options helps catch errors early:
🔎 Suggested improvement
#!/bin/bash +set -euo pipefail echo "=== Fluent Bit S3 Comprehensive Test ===" echo ""
37-39: Cleanup section appears incomplete.The script echoes "Cleaning up..." but doesn't perform any actual cleanup. If cleanup is intentionally omitted (e.g., to allow inspection of test artifacts), consider adding a comment explaining this. Otherwise, add the cleanup logic:
🔎 Suggested completion
# Cleanup echo "" echo "Cleaning up..." +# Uncomment below to remove test artifacts after run +# rm -rf "$STORE_DIR" +echo "Test artifacts preserved in $STORE_DIR for inspection"examples/s3_output/README.md (3)
112-114: Add language specifier to fenced code block.Per markdownlint MD040, fenced code blocks should have a language specified for proper syntax highlighting.
🔎 Suggested fix
-``` +```text Upload Trigger = upload_timeout reached OR total_file_size reached</details> --- `177-181`: **Add language specifier to fenced code block.** <details> <summary>🔎 Suggested fix</summary> ```diff -``` +```text s3://s3-bucket-kafka-sink/fluent-bit-logs/ ├── putobject/app/day=YYYYMMDD/ # Case 1: Small files via PutObject └── multipart/access/day=YYYYMMDD/ # Case 2: Large files via Multipart Upload</details> --- `270-275`: **Add language specifier to fenced code block.** <details> <summary>🔎 Suggested fix</summary> ```diff -``` +```text [parquet] Data quality summary for 53206 records: [parquet] Missing non-nullable fields (defaults used): [parquet] field='level' count=52506 [parquet] field='timestamp' count=52506</details> </blockquote></details> <details> <summary>cmake/parquet.cmake (1)</summary><blockquote> `149-171`: **Manual include-path search is fine but extra `parquet` suffix is probably unnecessary** The manual search for Parquet uses: ```cmake PATH_SUFFIXES ${INCLUDE_SUFFIXES} parquetGiven
INCLUDE_SUFFIXESis alreadyinclude, this will check both${prefix}/include/parquet/...(viainclude+ theparquet/api/...name) and${prefix}/parquet/..., which is a bit unusual. It’s not harmful, but simplifying to just${INCLUDE_SUFFIXES}would avoid probing odd${prefix}/parquettrees and be more in line with the Arrow search just above.Also applies to: 173-195
CMakeLists.txt (1)
679-687: Consider unifying the C++ standard to avoid C++11/C++17 mixThe simdutf block sets
CMAKE_CXX_STANDARD 11, while the Parquet block later sets it to 17. CMake will honor the last assignment (17 when Parquet is enabled), but this mix of settings is slightly confusing and could surprise future refactors.You might want to:
- Drop the explicit
set(CMAKE_CXX_STANDARD 11)in the simdutf block, or- Set it to 17 there as well when Parquet support is enabled.
Functionally it works today; this is mainly about clarity and avoiding subtle build surprises.
Also applies to: 1311-1316
plugins/out_s3/s3.c (2)
42-57: Avoid duplicate forward declarations forget_upload/create_upload
get_uploadandcreate_uploadare forward‑declared twice (once in the new “Forward declarations” block and again just before their definitions). This is harmless but unnecessary noise.You can keep the new declaration for
flb_pack_msgpack_extract_log_keyand drop the duplicatedget_upload/create_uploadprototypes to reduce clutter.Also applies to: 86-91
744-823: Format/compression config flow is solid; consider earlyschema_strvalidationThe three‑step flow in
cb_s3_init:
- Defaults
compressiontoNONE, parses thecompressionproperty, and treatsarrow|parquetas legacy values that map toformat=parquet(with compile‑time checks aroundFLB_HAVE_PARQUET_ENCODER).- Adds an explicit
formatoption (json/parquet) that overrides the legacy compression mapping.- Validates that
format=parquetis not allowed when Parquet support isn’t compiled in.This is a good cleanup and keeps legacy configs working.
However,
schema_stris only validated later insideconstruct_request_buffer()andflb_msgpack_raw_to_parquet, which means a misconfiguredformat=parquetwithoutschema_strwill only surface on first flush, failing every chunk.To fail fast and give clearer feedback, you could add an init‑time check like:
#ifdef FLB_HAVE_PARQUET_ENCODER if (ctx->format == FLB_S3_FORMAT_PARQUET && ctx->schema_str == NULL) { flb_plg_error(ctx->ins, "format=parquet requires 'schema_str' to be set"); return -1; } #endifright after the format/validation block.
Also applies to: 4210-4231
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
.github/workflows/unit-tests.yamlCMakeLists.txtcmake/parquet.cmakeexamples/s3_output/README.mdexamples/s3_output/athena-queries.sqlexamples/s3_output/fluent-bit-s3-parquet.confexamples/s3_output/run-s3-parquet-test.shinclude/fluent-bit/aws/flb_aws_compress.hplugins/out_kinesis_firehose/firehose.cplugins/out_s3/s3.cplugins/out_s3/s3.hsrc/CMakeLists.txtsrc/aws/CMakeLists.txtsrc/aws/compression/CMakeLists.txtsrc/aws/compression/arrow/CMakeLists.txtsrc/aws/compression/arrow/compress.csrc/aws/compression/arrow/compress.hsrc/aws/flb_aws_compress.c
💤 Files with no reviewable changes (5)
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/compression/CMakeLists.txt
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/s3_output/athena-queries.sql
🧰 Additional context used
🧠 Learnings (16)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.
Applied to files:
`.github/workflows/unit-tests.yaml`, `cmake/parquet.cmake`
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
`.github/workflows/unit-tests.yaml`, `CMakeLists.txt`, `cmake/parquet.cmake`, `src/CMakeLists.txt`
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
`.github/workflows/unit-tests.yaml`, `CMakeLists.txt`, `cmake/parquet.cmake`, `src/CMakeLists.txt`
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
`.github/workflows/unit-tests.yaml`, `include/fluent-bit/aws/flb_aws_compress.h`, `src/aws/flb_aws_compress.c`, `CMakeLists.txt`, `plugins/out_kinesis_firehose/firehose.c`, `src/CMakeLists.txt`, `plugins/out_s3/s3.c`
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Applied to files:
`.github/workflows/unit-tests.yaml`, `include/fluent-bit/aws/flb_aws_compress.h`, `src/aws/flb_aws_compress.c`, `CMakeLists.txt`, `plugins/out_kinesis_firehose/firehose.c`, `src/CMakeLists.txt`, `plugins/out_s3/s3.c`
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.
Applied to files:
`.github/workflows/unit-tests.yaml`, `include/fluent-bit/aws/flb_aws_compress.h`, `src/aws/flb_aws_compress.c`, `CMakeLists.txt`, `plugins/out_kinesis_firehose/firehose.c`, `src/CMakeLists.txt`, `plugins/out_s3/s3.c`
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.
Applied to files:
`.github/workflows/unit-tests.yaml`, `include/fluent-bit/aws/flb_aws_compress.h`, `src/aws/flb_aws_compress.c`, `CMakeLists.txt`, `plugins/out_kinesis_firehose/firehose.c`, `src/CMakeLists.txt`, `plugins/out_s3/s3.c`
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.
Applied to files:
`.github/workflows/unit-tests.yaml`, `src/aws/flb_aws_compress.c`, `CMakeLists.txt`, `src/CMakeLists.txt`
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.
Applied to files:
`.github/workflows/unit-tests.yaml`, `CMakeLists.txt`, `src/CMakeLists.txt`
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.
Applied to files:
examples/s3_output/README.md
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.
Applied to files:
`include/fluent-bit/aws/flb_aws_compress.h`, `src/aws/flb_aws_compress.c`, `CMakeLists.txt`, `src/CMakeLists.txt`, `plugins/out_s3/s3.c`
📚 Learning: 2025-12-22T05:39:02.291Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11250
File: lib/ripser-1.2.1/Makefile:1-18
Timestamp: 2025-12-22T05:39:02.291Z
Learning: In fluent/fluent-bit, the Makefile at lib/ripser-1.2.1/Makefile is imported from upstream Ripser and is not used in the actual build process. The project uses CMake for building (lib/ripser-1.2.1/CMakeLists.txt), so changes to the imported Makefile are not necessary.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-14T09:46:09.531Z
Learnt from: aminvakil
Repo: fluent/fluent-bit PR: 10844
File: conf/fluent-bit:13-15
Timestamp: 2025-09-14T09:46:09.531Z
Learning: For fluent-bit Debian packaging, /opt/fluent-bit/bin/ is the appropriate installation path since the package may be installed from non-official Debian sources, making /opt compliant with FHS for optional software packages.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-04T07:28:37.083Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10832
File: src/wasm/CMakeLists.txt:112-131
Timestamp: 2025-09-04T07:28:37.083Z
Learning: In fluent-bit CMake files, the user cosmo0920 prefers treating Git as a command rather than a package, emphasizing that Git is not a pkg-config retrievable package but just a command.
Applied to files:
src/CMakeLists.txt
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.
Applied to files:
plugins/out_s3/s3.c
🧬 Code graph analysis (2)
plugins/out_s3/s3.h (1)
- plugins/out_s3/s3.c (1): `s3_put_object` (1663-1800)

plugins/out_s3/s3.c (6)
- src/flb_output.c (1): `flb_output_get_property` (1108-1111)
- src/aws/flb_aws_compress.c (2): `flb_aws_compression_get_type` (80-97), `flb_aws_compression_compress` (99-116)
- src/flb_sds.c (2): `flb_sds_create_len` (58-76), `flb_sds_destroy` (389-399)
- include/fluent-bit/flb_mem.h (1): `flb_free` (126-128)
- plugins/out_s3/s3_multipart.c (1): `upload_part` (685-799)
- src/flb_pack.c (1): `flb_pack_msgpack_to_json_format` (1169-1450)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c
[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter
(syntaxError)
🪛 LanguageTool
examples/s3_output/README.md
[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md
112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
270-270: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (18)
plugins/out_kinesis_firehose/firehose.c (1)
496-501: LGTM! Documentation updated to reflect new compression options. The help text correctly lists the now-supported compression algorithms: `gzip`, `snappy`, and `zstd`. This aligns with the changes in `flb_aws_compress.c` that add Snappy support and remove Arrow/Parquet compression options.

src/aws/flb_aws_compress.c (2)
31-37: LGTM! Snappy wrapper correctly adapts the function signature. The wrapper properly casts `void *` to `char *` as required by `flb_snappy_compress`, and the return value is passed through directly.
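As a reference, here is a minimal sketch of the adapter idea described in this comment. It assumes the `flb_snappy_compress(char *, size_t, char **, size_t *)` shape implied above; the function name is illustrative and not the exact code from `src/aws/flb_aws_compress.c`.

```c
#include <stddef.h>
#include <fluent-bit/flb_snappy.h>

/*
 * Illustrative adapter: the generic compression slot works on opaque
 * void * buffers, while flb_snappy_compress() (as described in the
 * comment above) takes char * buffers, so the wrapper only casts the
 * pointers and forwards the return value.
 */
static int snappy_compress_sketch(void *in_data, size_t in_len,
                                  void **out_data, size_t *out_len)
{
    return flb_snappy_compress((char *) in_data, in_len,
                               (char **) out_data, out_len);
}
```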
57-78: Well-structured separation of compression algorithms from format converters. The updated `compression_options` table cleanly includes only true compression algorithms (gzip, snappy, zstd), with clear documentation explaining that Arrow/Parquet are format converters handled separately via the `format` parameter.

include/fluent-bit/aws/flb_aws_compress.h (1)
24-46: Well-documented deprecation and backward compatibility approach. The header cleanly separates compression algorithms from format converters, retains deprecated values for backward compatibility, and provides clear guidance to use `format=parquet` instead of `compression=parquet`. Based on learnings, this follows the established pattern where Arrow/Parquet use conditional compilation guards, unlike ZSTD, which is always available.

examples/s3_output/fluent-bit-s3-parquet.conf (2)
143-145: Parquet schema correctly matches the access log data structure. The `schema_str` defines fields that align with the dummy access log inputs: `timestamp`, `level`, `message`, `method`, `path`, `status`, `duration_ms`, `client_ip`, and `user_agent`. Using `int64` for `timestamp` and `status`, and `float64` for `duration_ms`, are appropriate choices for avoiding overflow issues as noted in the README.
1-117: Comprehensive and well-documented test configuration.The configuration provides excellent coverage of both upload scenarios (PutObject vs Multipart) with realistic traffic patterns. The inline comments clearly explain the rationale for each setting, making this a valuable reference for users.
.github/workflows/unit-tests.yaml (2)
56-57: LGTM! Build flag renamed to reflect Parquet encoder focus. The change from `-DFLB_ARROW=On` to `-DFLB_PARQUET_ENCODER=On` correctly reflects the shift from Arrow-based compression to Parquet encoder support.
117-125: Apache Arrow C++ package names are correct. The packages `libarrow-dev` and `libparquet-dev` are the correct C++ development packages for Apache Arrow and Parquet, providing the required headers and libraries for the `flb_parquet_impl.cpp` implementation.

plugins/out_s3/s3.h (3)
196-199: New Parquet format fields properly integrated into struct. The `format` and `schema_str` fields provide clean support for runtime format selection and user-defined Parquet schemas. The placement at the end of the struct with clear section comments improves readability.
201-203: LGTM! Format type macros clearly defined. The `FLB_S3_FORMAT_JSON` (0) and `FLB_S3_FORMAT_PARQUET` (1) macros provide a clean enum-like interface for format selection.
233-235: Function signature change verified as consistent across the codebase. All `s3_put_object` call sites (s3.c:1330 and s3.c:1494) have been properly updated to use the new `flb_sds_t buffer` parameter. The declaration in s3.h and the implementation in s3.c both align with the new signature.

src/CMakeLists.txt (2)
157-166: Parquet encoder source and dependency wiring look consistent. `flb_parquet_impl.cpp` is only added when `FLB_PARQUET_ENCODER` is enabled and is explicitly marked as C++ via `set_source_files_properties`, with Arrow/Parquet libs added to `FLB_DEPS` under the same flag. This keeps Parquet support fully optional and avoids impacting non-Parquet builds.

Also applies to: 398-405
525-528: SASL linkage for Kafka/MSK is correctly scoped to SASL-enabled builds. Conditionally linking `sasl2` into `fluent-bit-shared` and `fluent-bit-bin` under `FLB_SASL_ENABLED` matches how Kafka/MSK SASL is detected and avoids dragging SASL into builds that don't need it.

Also applies to: 600-606
CMakeLists.txt (1)
200-202: Parquet encoder option and C++17 enabling are wired correctly.

`FLB_PARQUET_ENCODER` is now a dedicated option, enabled by `FLB_ALL`, and the Parquet block:
- Calls `enable_language(CXX)` and sets `CMAKE_CXX_STANDARD 17` when the encoder is requested.
- Delegates Arrow/Parquet discovery to `cmake/parquet.cmake`.
- Only defines `FLB_HAVE_PARQUET_ENCODER` and adds include directories when both Arrow and Parquet are found; otherwise it warns and flips `FLB_PARQUET_ENCODER` back to `OFF`.

This keeps Parquet strictly opt-in and prevents partial configuration when dependencies are missing.
Also applies to: 278-287, 1311-1329
plugins/out_s3/s3.c (3)
97-121: Header handling for compression and Parquet looks correct. The new `zstd` `Content-Encoding` header and the guards in `create_headers()` to skip `Content-Encoding` when `ctx->format == FLB_S3_FORMAT_PARQUET` align with the intended behavior:
- GZIP/ZSTD still get the appropriate header for JSON output.
- Parquet relies on its internal codec only, without an outer content‑encoding header.
Also applies to: 183-239
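A minimal sketch of the guard described in this comment, with an illustrative context struct and helper (the real logic lives in `create_headers()` inside the plugin); only the `FLB_S3_FORMAT_*` and `FLB_AWS_COMPRESS_*` values are taken from this PR:

```c
/* Illustrative stand-ins for the plugin context fields discussed above */
#define FLB_S3_FORMAT_JSON      0
#define FLB_S3_FORMAT_PARQUET   1
#define FLB_AWS_COMPRESS_GZIP   1
#define FLB_AWS_COMPRESS_ZSTD   3

struct s3_ctx_sketch {
    int format;
    int compression;
};

/* Returns the Content-Encoding value to send, or NULL to omit the header */
static const char *content_encoding_for(const struct s3_ctx_sketch *ctx)
{
    if (ctx->format == FLB_S3_FORMAT_PARQUET) {
        /* Parquet compresses internally; no outer Content-Encoding */
        return NULL;
    }
    if (ctx->compression == FLB_AWS_COMPRESS_GZIP) {
        return "gzip";
    }
    if (ctx->compression == FLB_AWS_COMPRESS_ZSTD) {
        return "zstd";
    }
    return NULL;
}
```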
1516-1661: SDS-based request/flush pipeline is a nice cleanup. The move to `flb_sds_t` for all request buffers (`construct_request_buffer()`, `upload_data()`, `send_upload_request()`, `put_all_chunks()`, `cb_s3_upload()`, `cb_s3_flush()`) significantly simplifies memory management:

- All conversions (JSON / log_key / Parquet) produce SDS and are freed in a single place per call chain.
- `construct_request_buffer()` now cleanly owns and frees the intermediate msgpack buffer and unlocks chunks on all error paths.
- Callers always destroy the returned SDS (`buffer`) after `upload_data()`/`s3_put_object()` regardless of success or failure, avoiding mixed `free()` vs `flb_sds_destroy()` semantics.

This should make future format additions (or tweaks to Parquet encoding) much less error-prone.
Also applies to: 1943-2015, 3440-3488, 3953-4068
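A small sketch of the single-owner SDS pattern this comment describes; `do_upload()` and `upload_one_chunk()` are placeholders, not the plugin's real helpers:

```c
#include <fluent-bit/flb_sds.h>

/* Placeholder uploader, present only to make the sketch self-contained */
static int do_upload(flb_sds_t body)
{
    (void) body;
    return 0;
}

/*
 * Single-owner pattern: the SDS buffer is created once, passed down, and
 * destroyed by its creator on every path (success or failure), so
 * free() and flb_sds_destroy() are never mixed.
 */
static int upload_one_chunk(const char *raw, size_t raw_len)
{
    int ret;
    flb_sds_t body;

    body = flb_sds_create_len(raw, raw_len);
    if (body == NULL) {
        return -1;
    }

    ret = do_upload(body);
    flb_sds_destroy(body);
    return ret;
}
```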
4210-4231: Config map docs for compression/format/schema/upload_timeout/auth look coherent. The updated config map entries:
- Clarify that `compression` supports `none|gzip|snappy|zstd` and that `arrow|parquet` are deprecated legacy values.
- Introduce an explicit `format` option (`json`/`parquet`) and `schema_str` for Parquet, with docs that match the runtime behavior.
- Document `upload_timeout` semantics more clearly and add `authorization_endpoint_bearer_token`.

These descriptions match the new logic in `cb_s3_init()` and `construct_request_buffer()`.
cmake/parquet.cmake (1)
249-258: No action needed—the URLs in the code already match the official Apache Arrow documentation.The URLs currently use
https://packages.apache.org/artifactory/arrow/…, which is the correct format per official Apache Arrow installation guides. The/artifactory/arrow/segment is not obsolete; it is the official repository path for both APT and RPM packages. Removing it would break the URLs.Likely an incorrect or invalid review comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
examples/s3_output/README.md (1)
112-114: Consider adding language specifiers to fenced code blocks. Several code blocks in this documentation lack language specifiers, which can affect syntax highlighting. Consider adding appropriate language identifiers (e.g., `text`, `ini`, `json`) for better readability.

tests/internal/parquet.c (1)
27-31: Consider including the header instead of redefining compression constants. These compression type constants are duplicated from `flb_aws_compress.h`. If the values change in the source header, these tests will silently use incorrect values.

🔎 Proposed fix

```diff
-/* Compression types from flb_aws_compress.h */
-#define FLB_AWS_COMPRESS_NONE 0
-#define FLB_AWS_COMPRESS_GZIP 1
-#define FLB_AWS_COMPRESS_SNAPPY 2
-#define FLB_AWS_COMPRESS_ZSTD 3
+#include <fluent-bit/aws/flb_aws_compress.h>
```

If the header cannot be included due to dependencies, consider adding a comment noting that these values must be kept in sync with the source file.
src/flb_parquet_impl.cpp (2)
178-197: Unnecessary try/catch block around strtol. The `try/catch` block on lines 195-197 is unnecessary since `strtol` is a C function that does not throw exceptions. The `errno` and `endptr` checks already handle all error cases. The same pattern appears in other string parsing functions.

🔎 Proposed fix
else if (obj->type == MSGPACK_OBJECT_STR) { - try { std::string str(obj->via.str.ptr, obj->via.str.size); char* endptr; errno = 0; long val = std::strtol(str.c_str(), &endptr, 10); if (endptr == str.c_str() || *endptr != '\0' || errno == ERANGE) { return arrow::Status::Invalid("Cannot parse string to int32"); } if (val > INT32_MAX || val < INT32_MIN) { int32_overflow_by_field[field_name]++; return builder->Append(val > INT32_MAX ? INT32_MAX : INT32_MIN); } string_parsed_to_int_by_field[field_name]++; return builder->Append(static_cast<int32_t>(val)); - } catch (...) { - return arrow::Status::Invalid("Cannot parse string to int32"); - } }Note: The
`std::string` constructor can throw `std::bad_alloc`, so if you want to be defensive against memory allocation failures, you could keep a narrower catch or let it propagate to the outer try/catch in `flb_msgpack_raw_to_parquet`.
619-622: Prefer ValueUnsafe() or check status before ValueOrDie().
`ValueOrDie()` will terminate the program if the status is not OK, but here the status is already checked on line 618. Using `ValueUnsafe()` or `MoveValueUnsafe()` would be more semantically correct since you've already verified success.

🔎 Proposed fix

```diff
     if (!schema_result.ok()) {
         flb_error("[parquet] Failed to parse schema: %s\n",
                   schema_result.status().ToString().c_str());
         return NULL;
     }
-    auto schema = schema_result.ValueOrDie();
+    auto schema = std::move(schema_result).ValueUnsafe();
```

This pattern should be applied to other `ValueOrDie()` calls where the status is already checked (lines 635, 859, 877, 899, 922).
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
`CMakeLists.txt`, `cmake/kafka.cmake`, `examples/s3_output/README.md`, `lib/avro/CMakeLists.txt`, `lib/jansson-e23f558/CMakeLists.txt`, `src/flb_parquet_impl.cpp`, `tests/internal/CMakeLists.txt`, `tests/internal/parquet.c`
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.
Applied to files:
examples/s3_output/README.md
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
`CMakeLists.txt`, `cmake/kafka.cmake`
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
`CMakeLists.txt`, `cmake/kafka.cmake`, `src/flb_parquet_impl.cpp`
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.
Applied to files:
CMakeLists.txt
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
`CMakeLists.txt`, `cmake/kafka.cmake`
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Applied to files:
`CMakeLists.txt`, `cmake/kafka.cmake`, `src/flb_parquet_impl.cpp`
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.
Applied to files:
CMakeLists.txt
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.
Applied to files:
CMakeLists.txt
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.
Applied to files:
CMakeLists.txt
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.
Applied to files:
CMakeLists.txt
🧬 Code graph analysis (1)
tests/internal/parquet.c (1)
src/flb_parquet_impl.cpp (2)
`flb_msgpack_raw_to_parquet` (579-952), `flb_msgpack_raw_to_parquet` (579-582)
🪛 LanguageTool
examples/s3_output/README.md
[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~737-~737: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md
112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
286-286: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
294-294: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
305-305: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
315-315: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
328-328: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
344-344: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
362-362: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
436-436: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
447-447: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
542-542: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (10)
lib/avro/CMakeLists.txt (1)
19-19: CMake version bump aligns with project-wide requirements.The minimum CMake version increase to 3.12 is consistent with modern CMake practices and aligns with the Parquet/Arrow integration requirements introduced in this PR.
lib/jansson-e23f558/CMakeLists.txt (1)
567-569: Test target disabled with documented rationale.Commenting out the
checktarget to avoid conflicts with other libraries is acceptable. The comment explains the rationale clearly.cmake/kafka.cmake (1)
35-42: SASL configuration logic is well-structured.The changes correctly enable built-in SASL features (PLAIN/SCRAM/OAUTHBEARER) unconditionally while making Cyrus-SASL optional for GSSAPI/Kerberos. The comments clearly explain the rationale for each setting.
tests/internal/parquet.c (1)
74-103: Comprehensive test coverage with good structure.The test suite is well-organized into logical categories with good coverage of edge cases, type conversions, and error handling. The consistent pattern of setup/execute/verify/cleanup is followed throughout.
CMakeLists.txt (2)
284-287: Good documentation for FLB_PARQUET_ENCODER exclusion from FLB_ALL.The comment explaining that Parquet encoder requires explicit enabling due to external dependencies is helpful for users and maintainers.
1337-1355: Parquet encoder integration is well-structured.The Parquet encoder block properly:
- Enables C++ with C++17 standard (required for Arrow/Parquet)
- Delegates discovery to
cmake/parquet.cmake, which comprehensively handles detection across platforms with three fallback methods (pkg-config, CMake Config, manual search)- Gracefully disables the feature with clear warnings if dependencies are not found
The
cmake/parquet.cmakemodule correctly exports all required variables (ARROW_FOUND,PARQUET_FOUND,ARROW_INCLUDE_DIRS,PARQUET_INCLUDE_DIRS,ARROW_LIBRARIES,PARQUET_LIBRARIES) to the parent scope, with robust platform-specific library search paths, version validation, and helpful installation instructions.src/flb_parquet_impl.cpp (4)
39-42: Forward declaration approach is reasonable.The forward declaration of
flb_msgpack_to_json_strto avoid C++ compilation conflicts with mpack.h is a pragmatic solution. Consider adding a brief comment explaining why the full header cannot be included.
650-727: Record processing logic is well-designed.The on-the-fly processing approach addresses the previous review comment about dangling pointers. The lambda properly handles:
- Non-map records by appending nulls
- Missing fields with nullable vs non-nullable handling
- Conversion failures with appropriate fallbacks
- Statistics tracking for data quality monitoring
939-951: Exception handling properly cleans up resources.The catch blocks correctly free
output_bufferif it was allocated before the exception. However, note thatoutput_bufferis only assigned on line 926, so the null check is appropriate but the variable could be initialized tonullptrfor clarity (which it is on line 586).
863-865: RecordBatch creation uses correct row count.The
record_countvariable correctly tracks the number of processed records and is used when creating the RecordBatch, ensuring consistency between the arrays and the batch metadata.
Force-pushed from a409e9f to 37a602a (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (2)
examples/s3_output/README.md (1)
486-486: Typo already flagged in previous review. This issue was previously identified: "2020-203c" should be "2020-2030".
tests/internal/CMakeLists.txt (1)
136-141: Missing Parquet library linkage will cause linker errors. The Parquet test file is correctly added to `UNIT_TESTS_FILES`, but the `prepare_unit_tests` function (lines 234-236) only links Avro libraries when `FLB_AVRO_ENCODER` is enabled. There's no corresponding linkage for `ARROW_LIBRARIES` and `PARQUET_LIBRARIES` when `FLB_PARQUET_ENCODER` is enabled.

🔎 Proposed fix in prepare_unit_tests function

Add the following after line 236:

```diff
   if(FLB_AVRO_ENCODER)
     target_link_libraries(${source_file_we} avro-static jansson)
   endif()
+
+  if(FLB_PARQUET_ENCODER)
+    target_link_libraries(${source_file_we} ${ARROW_LIBRARIES} ${PARQUET_LIBRARIES})
+  endif()

   add_test(NAME ${source_file_we}
```
🧹 Nitpick comments (5)
examples/s3_output/README.md (2)
465-465: Fix spacing in time reference.Line 465 uses "60min" which should be "60-minute" or "60 minutes" for consistency with similar references elsewhere in the document (e.g., "5 minutes" at line 127).
🔎 Proposed fix
```diff
-  **Fallback**: `upload_timeout=60m` (acts as safety net)
+  **Fallback**: `upload_timeout=60m` (acts as safety net for 60 minutes)
```
112-112: Add language identifiers to fenced code blocks.Markdown linting expects language specifications on all code blocks for syntax highlighting and consistency. Add language identifiers to the three blocks without them:
- Line 112: Plain-text block showing a formula/logic
- Line 177: S3 file path tree structure
- Line 270: Log output example
🔎 Proposed fixes
At line 112, specify a language (or use plain):
-``` +```text Upload Trigger = upload_timeout reached OR total_file_size reached -``` +```At line 177, specify a language for the tree structure:
-``` +``` s3://s3-bucket-kafka-sink/fluent-bit-logs/ ├── putobject/app/day=YYYYMMDD/ # Case 1: Small files via PutObject └── multipart/access/day=YYYYMMDD/ # Case 2: Large files via Multipart Upload -``` +```At line 270, specify a language for log output:
-``` +```log [parquet] Data quality summary for 53206 records: [parquet] Missing non-nullable fields (defaults used): [parquet] field='level' count=52506 [parquet] field='timestamp' count=52506 -``` +```Also applies to: 177-177, 270-270
include/fluent-bit/flb_parquet.h (1)
23-25: Consider removing unused includes. The `flb_sds.h` include appears unnecessary since the API uses `void *` and `size_t` rather than `flb_sds_t`. The `msgpack.h` include is also not required in the public header since input is typed as `const void *`. Only `stdlib.h` (for `size_t`) is strictly needed. Removing unused includes reduces header dependencies and compile times.
🔎 Proposed fix
-#include <fluent-bit/flb_sds.h> -#include <msgpack.h> #include <stdlib.h>cmake/parquet.cmake (1)
222-228: Unicode symbols may not render on all terminals. The checkmarks (✓, ✗) in status messages are visually helpful but may not display correctly on terminals without UTF-8 support or in CI logs. Consider using ASCII alternatives like [OK]/[FAIL] for broader compatibility.

🔎 Alternative using ASCII
if(ARROW_FOUND AND PARQUET_FOUND) - message(STATUS "✓ Arrow found: ${ARROW_LIBRARIES}") + message(STATUS "[OK] Arrow found: ${ARROW_LIBRARIES}") message(STATUS " Include dirs: ${ARROW_INCLUDE_DIRS}") - message(STATUS "✓ Parquet found: ${PARQUET_LIBRARIES}") + message(STATUS "[OK] Parquet found: ${PARQUET_LIBRARIES}") message(STATUS " Include dirs: ${PARQUET_INCLUDE_DIRS}") else() - message(STATUS "✗ Arrow/Parquet not found") + message(STATUS "[NOT FOUND] Arrow/Parquet not found")src/flb_parquet_impl.cpp (1)
51-61: Consider documenting the buffer size parameter. The `256` parameter in `flb_msgpack_to_json_str` appears to be an initial buffer size hint. Adding a brief comment or using a named constant would improve code clarity.

🔎 Suggested improvement
std::string msgpack_object_to_json_string(const msgpack_object* obj) { /* Use Fluent Bit's built-in msgpack to JSON converter */ - char *json_str = flb_msgpack_to_json_str(256, obj, FLB_FALSE); + /* Initial buffer size hint for JSON conversion */ + char *json_str = flb_msgpack_to_json_str(256, obj, FLB_FALSE); if (!json_str) { return "{}";
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
`.github/workflows/unit-tests.yaml`, `CMakeLists.txt`, `cmake/kafka.cmake`, `cmake/parquet.cmake`, `examples/s3_output/README.md`, `examples/s3_output/athena-queries.sql`, `examples/s3_output/fluent-bit-s3-parquet.conf`, `examples/s3_output/run-s3-parquet-test.sh`, `include/fluent-bit/aws/flb_aws_compress.h`, `include/fluent-bit/flb_parquet.h`, `lib/avro/CMakeLists.txt`, `lib/jansson-e23f558/CMakeLists.txt`, `plugins/out_kinesis_firehose/firehose.c`, `plugins/out_s3/s3.c`, `plugins/out_s3/s3.h`, `src/CMakeLists.txt`, `src/aws/CMakeLists.txt`, `src/aws/compression/CMakeLists.txt`, `src/aws/compression/arrow/CMakeLists.txt`, `src/aws/compression/arrow/compress.c`, `src/aws/compression/arrow/compress.h`, `src/aws/flb_aws_compress.c`, `src/flb_parquet_impl.cpp`, `tests/internal/CMakeLists.txt`, `tests/internal/parquet.c`
💤 Files with no reviewable changes (5)
- src/aws/CMakeLists.txt
- src/aws/compression/arrow/CMakeLists.txt
- src/aws/compression/arrow/compress.c
- src/aws/compression/arrow/compress.h
- src/aws/compression/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (6)
- examples/s3_output/fluent-bit-s3-parquet.conf
- .github/workflows/unit-tests.yaml
- examples/s3_output/athena-queries.sql
- plugins/out_kinesis_firehose/firehose.c
- src/CMakeLists.txt
- examples/s3_output/run-s3-parquet-test.sh
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
📚 Learning: 2025-11-21T06:23:29.770Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11171
File: include/fluent-bit/flb_lib.h:52-53
Timestamp: 2025-11-21T06:23:29.770Z
Learning: In Fluent Bit core (fluent/fluent-bit repository), function descriptions/documentation are not required for newly added functions in header files.
Applied to files:
`include/fluent-bit/flb_parquet.h`, `examples/s3_output/README.md`
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
`src/aws/flb_aws_compress.c`, `include/fluent-bit/aws/flb_aws_compress.h`, `cmake/kafka.cmake`, `plugins/out_s3/s3.c`, `CMakeLists.txt`
📚 Learning: 2025-08-29T06:24:55.855Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:52-56
Timestamp: 2025-08-29T06:24:55.855Z
Learning: ZSTD compression is always available in Fluent Bit and does not require conditional compilation guards. Unlike Arrow/Parquet which use #ifdef FLB_HAVE_ARROW guards, ZSTD is built unconditionally with flb_zstd.c included directly in src/CMakeLists.txt and a bundled ZSTD library at lib/zstd-1.5.7/.
Applied to files:
`src/aws/flb_aws_compress.c`, `include/fluent-bit/aws/flb_aws_compress.h`, `src/flb_parquet_impl.cpp`, `plugins/out_s3/s3.c`, `CMakeLists.txt`
📚 Learning: 2025-08-29T06:24:26.170Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:39-42
Timestamp: 2025-08-29T06:24:26.170Z
Learning: In Fluent Bit, ZSTD compression support is enabled by default and does not require conditional compilation guards (like #ifdef FLB_HAVE_ZSTD) around ZSTD-related code declarations and implementations.
Applied to files:
`src/aws/flb_aws_compress.c`, `include/fluent-bit/aws/flb_aws_compress.h`, `plugins/out_s3/s3.c`, `CMakeLists.txt`
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components, ZSTD support is always available and doesn't need build-time conditionals.
Applied to files:
`src/aws/flb_aws_compress.c`, `include/fluent-bit/aws/flb_aws_compress.h`, `plugins/out_s3/s3.c`, `CMakeLists.txt`
📚 Learning: 2025-08-29T06:25:02.561Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:7-7
Timestamp: 2025-08-29T06:25:02.561Z
Learning: In Fluent Bit, ZSTD (zstandard) compression library is bundled directly in the source tree at `lib/zstd-1.5.7` and is built unconditionally as a static library. Unlike optional external dependencies, ZSTD does not use conditional compilation guards like `FLB_HAVE_ZSTD` and is always available. Headers like `<fluent-bit/flb_zstd.h>` can be included directly without guards.
Applied to files:
`src/aws/flb_aws_compress.c`, `include/fluent-bit/aws/flb_aws_compress.h`, `plugins/out_s3/s3.c`, `CMakeLists.txt`
📚 Learning: 2025-08-29T06:24:44.797Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: src/aws/flb_aws_compress.c:26-26
Timestamp: 2025-08-29T06:24:44.797Z
Learning: In Fluent Bit, ZSTD support is always available and enabled by default. The build system automatically detects and uses either the system libzstd library or builds the bundled ZSTD version. Unlike other optional dependencies like Arrow which use conditional compilation guards (e.g., FLB_HAVE_ARROW), ZSTD does not require conditional includes or build flags.
Applied to files:
`src/aws/flb_aws_compress.c`, `CMakeLists.txt`
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
`cmake/kafka.cmake`, `CMakeLists.txt`, `cmake/parquet.cmake`
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
`cmake/kafka.cmake`, `CMakeLists.txt`, `cmake/parquet.cmake`
📚 Learning: 2025-09-04T12:35:22.872Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10825
File: plugins/out_s3/s3.c:1339-1344
Timestamp: 2025-09-04T12:35:22.872Z
Learning: In the Fluent Bit S3 plugin, the user prefers to maintain current retry_limit behavior without special handling for FLB_OUT_RETRY_UNLIMITED (-1), as there's no documentation indicating -1 should be used for infinite retries and consistency with current logic is preferred.
Applied to files:
plugins/out_s3/s3.c
📚 Learning: 2025-09-08T11:21:33.975Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10851
File: include/fluent-bit/flb_simd.h:60-66
Timestamp: 2025-09-08T11:21:33.975Z
Learning: Fluent Bit currently only supports MSVC compiler on Windows, so additional compiler compatibility guards may be unnecessary for Windows-specific code paths.
Applied to files:
CMakeLists.txt
📚 Learning: 2025-08-07T10:15:46.187Z
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 10691
File: .github/workflows/unit-tests.yaml:94-101
Timestamp: 2025-08-07T10:15:46.187Z
Learning: Apache Arrow latest packages have changing checksums, making hardcoded checksum verification impractical for CI workflows that download the latest version.
Applied to files:
cmake/parquet.cmake
🧬 Code graph analysis (6)
plugins/out_s3/s3.h (1)
plugins/out_s3/s3.c (1)
s3_put_object(1663-1800)
include/fluent-bit/flb_parquet.h (1)
src/flb_parquet_impl.cpp (2)
flb_msgpack_raw_to_parquet(579-952)flb_msgpack_raw_to_parquet(579-582)
src/aws/flb_aws_compress.c (2)
src/flb_gzip.c (1)
flb_gzip_compress(157-252)src/flb_zstd.c (1)
flb_zstd_compress(33-57)
src/flb_parquet_impl.cpp (1)
include/fluent-bit/flb_mem.h (1)
flb_free(126-128)
plugins/out_s3/s3.c (5)
src/aws/flb_aws_compress.c (2)
flb_aws_compression_get_type(80-97)flb_aws_compression_compress(99-116)src/flb_sds.c (2)
flb_sds_create_len(58-76)flb_sds_destroy(389-399)include/fluent-bit/flb_mem.h (1)
flb_free(126-128)plugins/out_s3/s3_multipart.c (1)
upload_part(685-799)src/flb_pack.c (1)
flb_pack_msgpack_to_json_format(1169-1450)
tests/internal/parquet.c (2)
src/flb_parquet_impl.cpp (2)
flb_msgpack_raw_to_parquet(579-952)flb_msgpack_raw_to_parquet(579-582)include/fluent-bit/flb_mem.h (1)
flb_free(126-128)
🪛 Cppcheck (2.19.0)
plugins/out_s3/s3.c
[error] 791-791: failed to expand 'flb_plg_error', it is invalid to use a preprocessor directive as macro parameter
(syntaxError)
🪛 LanguageTool
examples/s3_output/README.md
[grammar] ~56-~56: Ensure spelling is correct
Context: ..., 201, 404, 500, 401) - Response times (2ms to 5000ms) - Client information (IP, us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~465-~465: Ensure spelling is correct
Context: ...les uploaded when reaching 50MB (before 60min timeout) Verification Steps: 1. R...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
examples/s3_output/README.md
112-112: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
270-270: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (23)
examples/s3_output/README.md (1)
1-512: Excellent comprehensive documentation for S3 Parquet testing. This new README provides a thorough, well-structured guide covering realistic test scenarios (two input sources, two output configurations), upload trigger mechanics, schema design best practices, data quality considerations, and Athena integration. The pedagogical tone helps users understand why decisions matter (e.g., nullable vs non-nullable fields, PutObject vs Multipart). The inclusion of expected behavior verification steps and troubleshooting guidance adds practical value. This fulfills the `docs-required` label requirement.

src/aws/flb_aws_compress.c (3)
31-37: LGTM! The wrapper function correctly adapts the `flb_snappy_compress` signature to match the generic compression interface. The pointer casts are safe and the return value is properly propagated.
66-70: The FLB_AWS_COMPRESS_SNAPPY constant is properly defined. The constant is defined in `include/fluent-bit/aws/flb_aws_compress.h:30` with value `2`, and the `flb_snappy_compress_wrapper` function is correctly defined in `src/aws/flb_aws_compress.c:32`. The array entry is correctly integrated.
26-26: No action needed. The include statement is correct as written. Snappy is a core bundled dependency (located at `lib/snappy-fef67ac/`) built unconditionally in CMakeLists.txt. Unlike Arrow/Parquet, which required `#ifdef FLB_HAVE_ARROW` guards as optional components, Snappy has no `FLB_HAVE_SNAPPY` conditional guards anywhere in the codebase and is included unconditionally across 12 existing files (http_server, flb_snappy.c, multiple plugins, etc.). The include at line 26 follows the established pattern.

lib/avro/CMakeLists.txt (1)
19-19: CMake version bump looks appropriate. Bumping to CMake 3.12 aligns with the broader build system changes in this PR and enables modern CMake features. This is a reasonable minimum version for 2024+ projects.
lib/jansson-e23f558/CMakeLists.txt (2)
1-1: CMake version bump consistent with the PR.Aligns with the 3.12 minimum version used across other CMake files in this PR.
567-569: Reasonable approach to avoid target name collisions. Disabling the `check` target prevents conflicts with other libraries (e.g., nghttp2) that may define the same target name. The comment adequately explains the rationale.

plugins/out_s3/s3.h (2)
195-203: Clean addition of format support. The new `format` field and `schema_str` pointer cleanly separate output format concerns from compression. The `FLB_S3_FORMAT_JSON`/`FLB_S3_FORMAT_PARQUET` constants follow a clear naming convention consistent with Fluent Bit patterns.
233-234: Function prototype aligns with SDS-based buffer handling. The use of `flb_sds_t buffer` is consistent with Fluent Bit's string handling patterns. Note that the parameter names differ slightly from the implementation in s3.c (`create_time` vs `file_first_log_time`, `buffer` vs `body`), but this is cosmetic and doesn't affect functionality.

include/fluent-bit/flb_parquet.h (1)
27-85: Excellent documentation and clean API design. The header provides comprehensive documentation including:
- Clear parameter descriptions
- Schema format specification with supported types
- Usage examples
- Memory ownership semantics (caller frees with `flb_free`)

This level of documentation is valuable for a public API.
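For orientation, a hypothetical caller sketch. The conversation never quotes the actual prototype, so the argument list below is an assumption for illustration only, and the schema literal is a stand-in rather than real schema syntax; only the ownership rule (the output buffer is heap-allocated and released with `flb_free`) comes from the header documentation described above.

```c
#include <stdio.h>
#include <fluent-bit/flb_mem.h>
#include <fluent-bit/aws/flb_aws_compress.h>
#include <fluent-bit/flb_parquet.h>

/*
 * ASSUMED call shape, for illustration only: msgpack buffer in, schema
 * string and compression type in, heap buffer out. Check flb_parquet.h
 * for the real prototype and the documented schema syntax.
 */
static int convert_chunk_sketch(const void *mp_buf, size_t mp_size)
{
    void *out_buf = NULL;
    size_t out_size = 0;
    /* Placeholder text; the real schema format is defined in flb_parquet.h */
    const char *schema_str = "<schema definition as documented in flb_parquet.h>";
    int ret;

    ret = flb_msgpack_raw_to_parquet(mp_buf, mp_size, schema_str,
                                     FLB_AWS_COMPRESS_SNAPPY,
                                     &out_buf, &out_size);
    if (ret != 0) {
        return -1;
    }

    printf("parquet chunk: %zu bytes\n", out_size);
    flb_free(out_buf);   /* caller-owned, per the header's ownership note */
    return 0;
}
```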
cmake/parquet.cmake (1)
269-278: Variable exports look correct. All discovered variables are properly exported to the parent scope, including conditional export of `ARROW_VERSION`. This ensures the discovery results are available to the main CMakeLists.txt.

cmake/kafka.cmake (2)
9-48: Clean simplification of SASL configuration.The refactoring correctly recognizes that librdkafka has built-in support for SASL/PLAIN, SASL/SCRAM, and SASL/OAUTHBEARER without external dependencies. Only SASL/GSSAPI (Kerberos) requires cyrus-sasl.
Key improvements:

- `FLB_SASL_ENABLED` is always ON (correct for built-in mechanisms)
- `FLB_SASL_OAUTHBEARER_ENABLED` is always ON (built into librdkafka)
- `FLB_KAFKA_MSK_IAM_ENABLED` is always ON (depends on OAuth Bearer)
- `WITH_SASL_CYRUS` is conditional on finding cyrus-sasl (for Kerberos only)

This removes unnecessary complexity while maintaining full functionality.
77-81: Helpful feature summary for build diagnostics. The summary block clearly shows which Kafka features are enabled, aiding troubleshooting of build configurations.

include/fluent-bit/aws/flb_aws_compress.h (1)
include/fluent-bit/aws/flb_aws_compress.h (1)
24-46: Well-documented enum reorganization with clear deprecation guidance. The separation of compression algorithms (NONE=0, GZIP=1, SNAPPY=2, ZSTD=3) from format conversions (ARROW=4, PARQUET=5) improves semantic clarity. The deprecation comments provide clear migration paths.
The codebase exclusively uses symbolic constants (`FLB_AWS_COMPRESS_*`) rather than hardcoded numeric values, so the enum value changes are transparent to all callers and pose no breaking-change risk.

Likely an incorrect or invalid review comment.
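Restated as plain defines for quick reference (values as given in this comment; the exact wording of the comments in the real header may differ):

```c
/* Compression algorithms */
#define FLB_AWS_COMPRESS_NONE      0
#define FLB_AWS_COMPRESS_GZIP      1
#define FLB_AWS_COMPRESS_SNAPPY    2
#define FLB_AWS_COMPRESS_ZSTD      3

/* Format converters, kept only for backward compatibility:
 * prefer format=parquet over compression=arrow/parquet */
#define FLB_AWS_COMPRESS_ARROW     4   /* deprecated */
#define FLB_AWS_COMPRESS_PARQUET   5   /* deprecated */
```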
CMakeLists.txt (2)
1091-1111: Well-designed Homebrew libyaml detection for macOS. The automatic detection of libyaml via Homebrew (`brew --prefix libyaml`) is a helpful developer experience improvement for macOS users. The graceful fallback sequence (pkg-config → Homebrew → compile test) ensures broad compatibility.
1337-1355: Parquet encoder integration looks correct.The Parquet encoder setup properly:
- Enables C++ and sets C++17 standard (addressing past review feedback)
- Includes cmake/parquet.cmake for library detection
- Guards all configuration behind `FLB_PARQUET_ENCODER`
- Provides clear warnings when dependencies are missing
tests/internal/parquet.c (2)
33-66: Excellent test organization and coverage.The test suite provides comprehensive coverage across 38 tests in 9 well-documented categories. The structured organization with clear category headers makes the test suite maintainable and easy to navigate.
74-103: Consistent and correct test pattern. All tests follow a clean pattern: construct msgpack input, invoke the conversion function, assert success, and clean up resources. The consistent use of `flb_free` for output buffers ensures no memory leaks.

src/flb_parquet_impl.cpp (3)
649-727: Immediate record processing correctly avoids dangling pointers. Processing each record immediately during the msgpack unpack loop (rather than storing `msgpack_object *` pointers for later) correctly addresses the lifetime issue flagged in previous reviews. The `msgpack_unpacked` zone is reused on each iteration, so this approach ensures data integrity.
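A sketch of the unpack-and-consume loop this comment refers to, using the standard msgpack-c API; `process_record()` is a placeholder for the per-record conversion step:

```c
#include <msgpack.h>

/* Placeholder: append the record's values to the column builders here */
static void process_record(const msgpack_object *rec)
{
    (void) rec;
}

static int walk_records(const char *buf, size_t size)
{
    size_t off = 0;
    msgpack_unpacked result;

    msgpack_unpacked_init(&result);
    while (msgpack_unpack_next(&result, buf, size, &off) == MSGPACK_UNPACK_SUCCESS) {
        /* consume the record now; keeping pointers into result.data past
         * this point would dangle once the zone is reset */
        process_record(&result.data);

        /* free and re-init the unpacked zone before the next record */
        msgpack_unpacked_destroy(&result);
        msgpack_unpacked_init(&result);
    }
    msgpack_unpacked_destroy(&result);
    return 0;
}
```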
579-952: Robust implementation with comprehensive error handling.The main conversion function demonstrates solid engineering:
- Input validation and compression mapping
- Proper Arrow/Parquet API usage with status checking
- Immediate record processing to avoid lifetime issues
- Statistics tracking for data quality monitoring
- Clean exception handling with resource cleanup
- Correct memory management between Arrow and Fluent Bit allocators
775-847: Fix missing newlines in debug log statements. All `flb_debug` calls in the statistics output section are missing trailing `\n` characters, which will cause log messages to run together incorrectly. Each `flb_debug` call should end with a newline.

🔎 Proposed fix for debug log formatting
if (has_issues) { - flb_debug("[parquet] Data quality summary for %zu records:", record_count); + flb_debug("[parquet] Data quality summary for %zu records:\n", record_count); if (!converter.complex_to_string_by_field.empty()) { - flb_debug("[parquet] Complex types serialized to JSON string:"); + flb_debug("[parquet] Complex types serialized to JSON string:\n"); for (const auto& pair : converter.complex_to_string_by_field) { - flb_debug("[parquet] field='%s' count=%zu", - pair.first.c_str(), pair.second); + flb_debug("[parquet] field='%s' count=%zu\n", + pair.first.c_str(), pair.second); } } if (!converter.string_parsed_to_int_by_field.empty()) { - flb_debug("[parquet] Strings parsed to integers:"); + flb_debug("[parquet] Strings parsed to integers:\n"); for (const auto& pair : converter.string_parsed_to_int_by_field) { - flb_debug("[parquet] field='%s' count=%zu", - pair.first.c_str(), pair.second); + flb_debug("[parquet] field='%s' count=%zu\n", + pair.first.c_str(), pair.second); } } /* Apply similar fixes to remaining flb_debug calls in this section */Apply this pattern to all remaining
`flb_debug` calls through line 847.

Likely an incorrect or invalid review comment.
plugins/out_s3/s3.c (2)
744-824: Verify legacy Parquet compression behavior. The legacy `compression=arrow/parquet` path sets `format=parquet` but leaves `ctx->compression = FLB_AWS_COMPRESS_NONE` (initialized at line 745). A previous review comment suggested this should default to GZIP for backward compatibility. The comment at line 764 says "Keep default no compression," but if the old Arrow/Parquet path applied GZIP by default, this represents a breaking change.

Please verify whether the legacy compression path should default to GZIP when `format=parquet` is set via the deprecated `compression=arrow/parquet` values.

Based on the previous review comment at lines 744-824.
42-44: LGTM: Parquet integration is well-implemented. The Parquet format support is thoughtfully designed:

Clean separation: Parquet uses internal compression (passed to `flb_msgpack_raw_to_parquet`), while outer compression is correctly skipped for the Parquet format throughout the codebase.

Efficient buffering: The architectural change to store raw msgpack data (lines 3953-3962) instead of pre-converting to JSON is a smart optimization that:
- Avoids repeated conversions when buffering
- Enables efficient batching for columnar formats
- Defers format conversion until upload time (line 1598)
Proper memory management: All SDS buffers are consistently managed with `flb_sds_create_len` and `flb_sds_destroy`. Arrow-allocated Parquet buffers are correctly freed with `flb_free` (line 1617) after copying to SDS.

Good error handling: Schema validation (lines 1588-1596), conversion error checking (lines 1605-1613), and cleanup on all error paths are properly implemented.
Also applies to: 1520-1661, 3953-3962
Force-pushed from 37a602a to 7f32dd4 (Compare)
Add comprehensive Parquet and Kafka cmake configurations including:

- New cmake/parquet.cmake with Apache Arrow and Parquet library detection
- Integration of Parquet support in main CMakeLists.txt with FLB_PARQUET option
- AWS compression CMakeLists.txt configurations
- Updated GitHub Actions workflow to test Parquet builds
- Modified src/CMakeLists.txt to include Parquet writer implementation

Guard get_target_property calls with TARGET checks to prevent CMake 3.12+ errors.

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Enhanced AWS utilities for S3 Parquet support including:

- Extended flb_aws_compress with Snappy compression implementation
- Enhanced flb_aws_util with new helper functions for Parquet processing
- Added Snappy compression functions for Parquet column compression
- Improved error handling and memory management
- Added support for AWS-specific data transformations

These changes provide the foundation for Parquet format support in AWS outputs.

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Add comprehensive Parquet writer implementation with:

- Complete C++ implementation using Apache Arrow and Parquet libraries
- Support for converting Fluent Bit msgpack records to Parquet format
- Schema generation from msgpack data structures
- Column writer for all supported data types (int, float, string, binary, etc.)
- Memory-efficient streaming write support
- Error handling and resource management

This implementation enables output plugins to write data in Parquet format for efficient storage and analytics in data lakes and warehouses.

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Extended blob database with comprehensive S3 integration features:

- New file tracking and metadata management APIs
- Multipart upload state tracking (upload_id, s3_key, parts)
- File lifecycle management (pending, ready, uploading states)
- Upload progress tracking and resume capability
- Part-level tracking for multipart uploads
- Database queries for file and part status
- Proper cleanup and error handling

Added functions for ready-file completion using the correct s3_key from the database instead of upload_id, preventing stranded parts when the upload check fails. These enhancements enable reliable file-based buffering with upload resume support for S3 and other blob storage backends.

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Update AWS Signature Version 4 implementation for S3 Parquet operations:
- Improved signature generation for multipart uploads
- Enhanced error handling and validation
- Better support for S3-specific headers
- Code cleanup and formatting improvements

These changes ensure proper authentication for S3 multipart uploads used by Parquet file uploads.

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Add Parquet format support to Kinesis Firehose output plugin. Minor code cleanup to align with Parquet integration. Signed-off-by: Arbin <arbin.cheng@coins.ph>
Update blob file handling to work with enhanced blob database. Improved error handling and compatibility with new blob_db APIs. Signed-off-by: Arbin <arbin.cheng@coins.ph>
Major refactoring and enhancement of S3 output plugin.

New Features:
- Comprehensive Parquet format support with Apache Arrow integration
- Refactored codebase into modular components (auth, blob, multipart, queue, store, stream)
- File-based buffering with multipart upload support
- Upload resume capability for interrupted transfers
- Enhanced configuration options for Parquet and compression

New Modules:
- s3_auth: Authentication and credential management
- s3_blob: Blob storage integration
- s3_multipart: Multipart upload handling
- s3_queue: Upload queue management
- s3_store: File storage management
- s3_stream: Streaming upload support

Bug Fixes:
- Fix ready-file completion using correct s3_key instead of upload_id
- Prevent stranded parts when upload check fails
- Handle flb_hash_table_add failure to prevent orphaned files
- Improved error handling and resource cleanup

These changes enable effic…
Signed-off-by: Arbin <arbin.cheng@coins.ph>
- Cache authorization_endpoint_url length to guard against empty string
- Add NULL check before accessing string indices in URL construction
- Update error logging to use sized format (%.*s) for non-NUL payloads

Signed-off-by: Arbin <arbin.cheng@coins.ph>
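A small sketch of the two guards this commit describes. The `struct flb_s3` field names used below (`authorization_endpoint_url`, `ins`) are assumptions for illustration; the `%.*s` usage with `payload`/`payload_size` is the standard way to log a Fluent Bit HTTP response body that is not NUL terminated.

```c
#include <string.h>
#include <fluent-bit/flb_output_plugin.h>
#include <fluent-bit/flb_http_client.h>
/* assumes the plugin's own "s3.h" for struct flb_s3 */

/* Sketch: guard the endpoint URL before indexing into it and use a
 * sized format specifier for the (non NUL-terminated) response body. */
static int log_auth_failure(struct flb_s3 *ctx, struct flb_http_client *c)
{
    size_t url_len;

    /* cache the length once; an empty string must never be indexed */
    url_len = ctx->authorization_endpoint_url ?
              strlen(ctx->authorization_endpoint_url) : 0;
    if (url_len == 0) {
        flb_plg_error(ctx->ins, "authorization endpoint URL is not set");
        return -1;
    }

    flb_plg_error(ctx->ins, "auth request to %s failed: %.*s",
                  ctx->authorization_endpoint_url,
                  (int) c->resp.payload_size, c->resp.payload);
    return -1;
}
```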
Free and reset msgpack_unpacked zone after processing each record to prevent memory accumulation during streaming operations. Signed-off-by: Arbin <arbin.cheng@coins.ph>
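The pattern that commit refers to looks roughly like the following, using the standard msgpack-c API (a generic sketch, not the plugin's actual streaming code):

```c
#include <msgpack.h>
#include <stddef.h>

/* Iterate a msgpack buffer record by record, releasing the unpacked
 * zone after each record so memory does not accumulate while streaming. */
static void walk_records(const char *buf, size_t size)
{
    size_t off = 0;
    msgpack_unpacked result;

    msgpack_unpacked_init(&result);
    while (msgpack_unpack_next(&result, buf, size, &off) == MSGPACK_UNPACK_SUCCESS) {
        /* ... process result.data here ... */

        /* free the zone owned by this record, then reset for the next one */
        msgpack_unpacked_destroy(&result);
        msgpack_unpacked_init(&result);
    }
    msgpack_unpacked_destroy(&result);
}
```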
- Normalize snappy wrapper -2 error to compression contract's -1
- Add NULL guard for compression_keyword parameter

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Include errno.h header to declare errno and ERANGE symbols used in strtol-based TZ offset parsing logic. Signed-off-by: Arbin <arbin.cheng@coins.ph>
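For context, this is the kind of strtol guard that needs errno and ERANGE from errno.h; the function below is a generic sketch of a "+HHMM"/"-HHMM" offset parser, not the plugin's exact code:

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a timezone offset such as "+0530" or "-0800" into minutes.
 * Returns 0 on success, -1 on malformed or out-of-range input. */
static int parse_tz_offset(const char *s, int *minutes)
{
    char *end = NULL;
    long val;
    long mm;

    errno = 0;
    val = strtol(s, &end, 10);
    if (errno == ERANGE || end == s || *end != '\0') {
        return -1;                 /* overflow, no digits, or trailing junk */
    }

    mm = val % 100;                /* minutes component of HHMM */
    if (val < -1400 || val > 1400 || mm <= -60 || mm >= 60) {
        return -1;                 /* outside the +/-14:00 range */
    }

    *minutes = (int) ((val / 100) * 60 + mm);
    return 0;
}
```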
- Add stdint.h include for SIZE_MAX availability
- Replace function-scoped macro with local const to prevent leakage

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Ensure parquet output files are unlinked in error-path tests to prevent file accumulation during test runs. Signed-off-by: Arbin <arbin.cheng@coins.ph>
- Fail fast when temp paths are NULL in out_s3_blob tests
- Return error status from seed_recovery_db with proper checks in callers

Signed-off-by: Arbin <arbin.cheng@coins.ph>
Add error handling for pthread_mutex_init failure with proper cleanup of allocated resources. Signed-off-by: Arbin <arbin.cheng@coins.ph>
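A minimal sketch of the cleanup path this commit adds; the surrounding context structure and function names are hypothetical:

```c
#include <pthread.h>
#include <fluent-bit/flb_mem.h>
#include <fluent-bit/flb_log.h>

struct queue_ctx {
    pthread_mutex_t lock;
    /* ... other fields ... */
};

static struct queue_ctx *queue_ctx_create(void)
{
    int ret;
    struct queue_ctx *q;

    q = flb_calloc(1, sizeof(struct queue_ctx));
    if (!q) {
        flb_errno();
        return NULL;
    }

    /* pthread_mutex_init() returns a non-zero error code on failure;
     * release the partially built context instead of leaking it */
    ret = pthread_mutex_init(&q->lock, NULL);
    if (ret != 0) {
        flb_error("pthread_mutex_init failed: %d", ret);
        flb_free(q);
        return NULL;
    }
    return q;
}
```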
GROUP BY already ensures unique file rows, making DISTINCT unnecessary. Signed-off-by: Arbin <arbin.cheng@coins.ph>
- Add explicit cfl_sds.h include to avoid fragile transitive dependencies
- Log warnings when stale multipart uploads cannot be aborted without s3_key

Signed-off-by: Arbin <arbin.cheng@coins.ph>
remote_id is the multipart upload_id, not the object key. In check_and_complete_multipart(), prefer s3_key from database file record (authoritative source) over the parameter value from part entries to ensure correct object key is used for CompleteMultipartUpload API calls. Signed-off-by: Arbin <arbin.cheng@coins.ph>
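The precedence described here boils down to a tiny helper like this (names are illustrative, not the plugin's actual symbols); the returned key is what gets passed to CompleteMultipartUpload, while the upload_id remains an identifier only:

```c
#include <string.h>

/* Prefer the object key stored on the blob DB file record; fall back to
 * the key carried by the queued part entry only when the record has none. */
static const char *choose_s3_key(const char *file_record_s3_key,
                                 const char *part_s3_key)
{
    if (file_record_s3_key && strlen(file_record_s3_key) > 0) {
        return file_record_s3_key;   /* authoritative source */
    }
    return part_s3_key;
}
```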
Capture and check the return value of flb_output_set() in flb_test_blob_database_custom_path() to prevent test from passing when output configuration fails. Ensures test properly exercises intended path. Signed-off-by: Arbin <arbin.cheng@coins.ph>
Replace flb_aws_client_mock_clear_generator_instance() with flb_aws_client_mock_destroy_generator() in manual cleanup path to properly free the mock structure. This matches s3_test_cleanup() behavior and prevents memory leak when flb_start() fails. Signed-off-by: Arbin <arbin.cheng@coins.ph>
S3 multipart upload API requires part numbers in range 1-10000 (1-based). The s3_blob_register_parts() function was registering parts starting at 0, causing mismatch between stored part IDs and S3's expected numbering during recovery. Initialize parts counter to 1 instead of 0, pass 1-based part number to flb_blob_db_file_part_insert(), and return actual count (parts - 1) to maintain compatibility with callers expecting part count. Signed-off-by: Arbin <arbin.cheng@coins.ph>
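A sketch of the 1-based registration loop that commit describes. `flb_blob_db_file_part_insert()` is the helper it names, but the exact signature and the `ctx->blob_db` field used here are assumptions for illustration:

```c
#include <stdint.h>
#include <stddef.h>
/* assumes the plugin's own "s3.h" for struct flb_s3 */

/* Register multipart parts with 1-based part numbers (S3 accepts 1-10000)
 * and return the number of parts inserted, or -1 on error. */
static int register_parts(struct flb_s3 *ctx, uint64_t file_id,
                          size_t file_size, size_t chunk_size)
{
    uint64_t parts = 1;                 /* S3 part numbers start at 1 */
    size_t offset_start = 0;
    size_t offset_end;

    while (offset_start < file_size) {
        offset_end = offset_start + chunk_size;
        if (offset_end > file_size) {
            offset_end = file_size;
        }

        /* call signature assumed for illustration */
        if (flb_blob_db_file_part_insert(ctx->blob_db, file_id,
                                         parts,          /* 1-based id */
                                         offset_start, offset_end) != 0) {
            return -1;
        }

        parts++;
        offset_start = offset_end;
    }

    /* callers expect a part count, not the last 1-based part number */
    return (int) (parts - 1);
}
```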
Force-pushed from 47e7c60 to 0f6da36 (Compare)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@plugins/out_s3/s3_queue.c`:
- Around line 136-199: s3_queue_add_part currently calls
flb_sds_create()/flb_sds_create_len on possibly NULL inputs (file_path, s3_key,
upload_id, tag) which can crash; before each flb_sds_create/_len call in
s3_queue_add_part validate the corresponding input pointer (file_path, s3_key,
upload_id, tag) is non-NULL (and for tag use tag_len>0 when appropriate), and if
any are invalid perform the same error handling/cleanup sequence (flb_errno(),
destroy any already-created SDS like
entry->stream_path/entry->s3_key/entry->upload_id, flb_free(entry)) and return
-1 so corrupted DB rows or bad callers won’t cause a crash. Ensure all
validation checks reference the existing symbols (entry->stream_path,
entry->s3_key, entry->upload_id, entry->tag, flb_sds_create, flb_sds_create_len)
and keep cleanup order consistent with current returns.
In `@plugins/out_s3/s3.h`:
- Around line 34-47: The Windows branch of the cross-platform macros is missing
the header that declares _stat64/_fstat64 and struct _stat64; add an `#include`
<sys/stat.h> inside the `#ifdef` _WIN32 block so flb_s3_stat_struct,
flb_s3_stat(), and flb_s3_fstat() resolve correctly on MSVC builds; ensure the
include is placed alongside the existing `#include` <io.h> in the same block.
In `@tests/include/s3_test_helpers.h`:
- Around line 6-15: The Windows build is missing prototypes for _mkdir and
_unlink used in this file; update the Windows include block (the `#else` branch
that currently includes <windows.h>) to also `#include` <direct.h> for _mkdir and
<io.h> for _unlink so MSVC has proper declarations for those functions; ensure
the includes are only under the _WIN32 branch and do not change the POSIX
branch.
In `@tests/runtime/out_s3_queue.c`:
- Around line 21-27: The flb_output_set() calls in
flb_test_queue_multiple_workers (and the other queue test functions) currently
ignore their return values; update each call (e.g., in
flb_test_queue_multiple_workers after S3_TEST_RUN_AND_CLEANUP and the calls at
the other listed sites) to check the return value and fail the test on error
(abort/return non-zero or call the test cleanup/failure helper) so a
configuration failure doesn't make the test pass silently; ensure you use the
same test failure pattern used elsewhere in these tests when flb_output_set()
returns an error.
In `@tests/runtime/out_s3_recovery.c`:
- Around line 191-192: The test currently calls flb_output_set(ctx, out_ffd,
"upload_chunk_size", "5M", NULL) without checking its return value; update the
call sites (including the one in out_s3_recovery.c and the other lines noted) to
capture the return into a variable (e.g., int ret = flb_output_set(...)) and add
an assertion that ret == 0 (matching the pattern used elsewhere) so
configuration failures cause the test to fail rather than produce false
positives; ensure you use the same variable name and assertion style used in
surrounding tests to keep consistency.
🧹 Nitpick comments (2)
tests/runtime/out_s3_error_handling.c (1)
97-98: Minor inconsistency: use `== 0` for consistency with other tests. Line 202 uses `TEST_CHECK(ret == 0)` while this line uses `TEST_CHECK(ret >= 0)`. Since `flb_output_set` returns 0 on success and -1 on failure, `== 0` is more precise and consistent.

🔧 Suggested fix for consistency

```diff
     ret = flb_output_set(ctx, out_ffd, "upload_chunk_size", "5M", NULL);
-    TEST_CHECK(ret >= 0);
+    TEST_CHECK(ret == 0);
```

tests/runtime/out_s3_edge_cases.c (1)
216-374: Consider reusing the S3_TEST_* helpers to reduce duplication in rapid start/stop.
This block repeats the same setup/teardown pattern twice; wrapping it with the existing macros (or a small helper) would shrink the test and reduce future drift.
```c
int s3_queue_add_part(struct flb_s3 *ctx,
                      uint64_t file_id,
                      uint64_t part_db_id,
                      uint64_t part_id,
                      const char *file_path,
                      off_t offset_start,
                      off_t offset_end,
                      const char *s3_key,
                      const char *upload_id,
                      const char *tag,
                      int tag_len)
{
    struct upload_queue *entry;

    entry = flb_calloc(1, sizeof(struct upload_queue));
    if (!entry) {
        flb_errno();
        return -1;
    }

    entry->file_id = file_id;
    entry->part_db_id = part_db_id;
    entry->part_id = part_id;
    entry->upload_file = NULL;
    entry->offset_start = offset_start;
    entry->offset_end = offset_end;
    entry->retry_counter = 0;
    entry->upload_time = time(NULL);
    entry->state = S3_STATE_UPLOAD_PART;

    entry->stream_path = flb_sds_create(file_path);
    if (!entry->stream_path) {
        flb_errno();
        flb_free(entry);
        return -1;
    }

    entry->s3_key = flb_sds_create(s3_key);
    if (!entry->s3_key) {
        flb_errno();
        flb_sds_destroy(entry->stream_path);
        flb_free(entry);
        return -1;
    }

    entry->upload_id = flb_sds_create(upload_id);
    if (!entry->upload_id) {
        flb_errno();
        flb_sds_destroy(entry->stream_path);
        flb_sds_destroy(entry->s3_key);
        flb_free(entry);
        return -1;
    }

    entry->tag = flb_sds_create_len(tag, tag_len);
    if (!entry->tag) {
        flb_errno();
        flb_sds_destroy(entry->stream_path);
        flb_sds_destroy(entry->s3_key);
        flb_sds_destroy(entry->upload_id);
        flb_free(entry);
        return -1;
    }
    entry->tag_len = tag_len;
```
Validate required strings before SDS creation in s3_queue_add_part.
Line 166+ calls flb_sds_create() on inputs that could be NULL if DB rows are corrupt or callers pass unexpected values, which would crash. Add guards before creating SDS strings.
🔧 Suggested fix
```diff
     entry->state = S3_STATE_UPLOAD_PART;

+    if (!file_path || !s3_key || !upload_id || !tag) {
+        flb_plg_warn(ctx->ins, "Invalid part entry: missing required fields");
+        flb_free(entry);
+        return -1;
+    }
+
     entry->stream_path = flb_sds_create(file_path);
```
```c
/* Cross-platform file I/O compatibility macros */
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#define flb_s3_unlink(path)              _unlink(path)
#define flb_s3_access(path, mode)        _access((path), (mode))
#define flb_s3_open(path, flags, ...)    _open((path), ((flags) | O_BINARY), ##__VA_ARGS__)
#define flb_s3_close(fd)                 _close(fd)
#define flb_s3_read(fd, buf, count)      _read((fd), (buf), (count))
#define flb_s3_lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin))
#define flb_s3_stat(path, buf)           _stat64((path), (buf))
#define flb_s3_fstat(fd, buf)            _fstat64((fd), (buf))
#define flb_s3_stat_struct               struct _stat64
#else
```
🧩 Analysis chain
🏁 Scripts executed to verify the report:

```sh
# Locate the plugin header and confirm it is tracked
find . -name "s3.h" -path "*/out_s3/*" 2>/dev/null | head -5
git ls-files | grep -E "(s3\.h|out_s3)" | head -20
cat -n plugins/out_s3/s3.h | head -60

# Check where _stat64 / _fstat64 and the flb_s3_stat macros are used
rg "flb_s3_stat|flb_s3_fstat" plugins/out_s3/ -A 2 -B 2
grep -n "_stat64\|struct _stat64\|flb_s3_stat_struct" plugins/out_s3/*.c plugins/out_s3/*.h
sed -n '1230,1250p' plugins/out_s3/s3_multipart.c

# Check which headers the affected sources already include
head -60 plugins/out_s3/s3_multipart.c | grep -E "#include"
rg "#include.*stat" plugins/out_s3/
rg "sys/stat.h" plugins/

# Look for Windows/MSVC build documentation
find . -name "*.md" -o -name "*.txt" | xargs grep -l "Windows\|MSVC\|WIN32" 2>/dev/null | head -10
```

🌐 Web query: MSVC _stat64 header location, where is it declared?

💡 Result: In MSVC, `_stat64` (and the associated `struct __stat64`) is declared by including `<sys/types.h>` followed by `<sys/stat.h>`; the Microsoft docs list this explicitly for `_stat64`. [1] On disk, that header is the MSVC CRT header `...\VC\Tools\MSVC\<toolset-version>\include\sys\stat.h` (under the MSVC include root such as `...\VC\Tools\MSVC\<ver>\include`). [2]

🌐 Web query: does MSVC's `<io.h>` include `<sys/stat.h>`?

💡 Result: In MSVC, `<io.h>` does not "pull in" `<sys/stat.h>` for you in any way you can rely on. Instead:

- The low-level I/O functions like `_open` require `<io.h>`, and Microsoft's docs list `<sys/stat.h>` (and `<sys/types.h>`) only as optional headers to include when you use the file-permission arguments/macros (for example `_S_IREAD`, `_S_IWRITE`). [1]
- `struct _stat` and the related definitions live in `<sys/stat.h>`, and the docs note that `<sys/types.h>` should be included before `<sys/stat.h>` because `stat.h` uses `_dev_t` from `types.h`. [2]

Typical include set for `_open` with permissions: `<fcntl.h>`, `<sys/types.h>`, `<sys/stat.h>`, `<io.h>`.

Sources: Microsoft Learn on the `_open` headers [1] and the `_stat` / `struct _stat` notes [2].
Citations:
- 1: https://learn.microsoft.com/en-us/previous-versions/z0kc8e3z%28v%3Dvs.140%29?utm_source=openai
- 2: https://learn.microsoft.com/en-us/previous-versions/14h5k7ff%28v%3Dvs.140%29?utm_source=openai
Include <sys/stat.h> in the Windows section for _stat64 / _fstat64 declarations.
The struct _stat64 and _stat64 / _fstat64 function declarations are in <sys/stat.h> on MSVC; <io.h> does not transitively include it. Without this include, code that uses the macros flb_s3_stat_struct, flb_s3_stat(), or flb_s3_fstat() via this header can fail on Windows builds.
🔧 Proposed fix
```diff
 #ifdef _WIN32
 #include <windows.h>
 #include <io.h>
+#include <sys/stat.h>
 #define flb_s3_unlink(path)              _unlink(path)
 #define flb_s3_access(path, mode)        _access((path), (mode))
 #define flb_s3_open(path, flags, ...)    _open((path), ((flags) | O_BINARY), ##__VA_ARGS__)
 #define flb_s3_close(fd)                 _close(fd)
 #define flb_s3_read(fd, buf, count)      _read((fd), (buf), (count))
 #define flb_s3_lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin))
 #define flb_s3_stat(path, buf)           _stat64((path), (buf))
 #define flb_s3_fstat(fd, buf)            _fstat64((fd), (buf))
 #define flb_s3_stat_struct               struct _stat64
```
```c
#include <fluent-bit.h>
#ifndef _WIN32
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>
#include <time.h>
#include <stdlib.h>
#else
#include <windows.h>
#endif
```
🧩 Analysis chain
🏁 Scripts executed to verify the report:

```sh
# Locate the helper header
find . -name "s3_test_helpers.h" -type f

# Search for _mkdir and _unlink usage in the repository
rg "_mkdir|_unlink" --type cpp --type h

# Read the complete helper file
cat -n ./tests/include/s3_test_helpers.h
```
Add missing Windows headers for _mkdir and _unlink function prototypes.
The Windows code path calls _mkdir() (line 321) and _unlink() (line 346), but their prototypes are declared in <direct.h> and <io.h> respectively. Without these headers, MSVC will generate implicit function declaration warnings or errors.
🔧 Proposed fix
```diff
 #else
 #include <windows.h>
+#include <direct.h>
+#include <io.h>
 #endif
```
```c
void flb_test_queue_multiple_workers(void)
{
    S3_TEST_DECLARE_VARS();

    S3_TEST_RUN_AND_CLEANUP("test", "queue_multiple_workers");
    flb_output_set(ctx, out_ffd, "workers", "5", NULL);
```
Check flb_output_set() results in queue tests.
Line 26, Line 72, Line 93, and Line 139 ignore return values; a failed config can make tests pass without exercising the intended path.
🔧 Suggested fix (apply to each call site)
```diff
-    flb_output_set(ctx, out_ffd, "workers", "5", NULL);
+    ret = flb_output_set(ctx, out_ffd, "workers", "5", NULL);
+    TEST_CHECK(ret == 0);
```

Also applies to: 68-73, 92-95, 138-141
```c
    flb_output_set(ctx, out_ffd, "upload_chunk_size", "5M", NULL);
```
Check flb_output_set() return values to avoid false‑positive tests.
A few tests call flb_output_set() without asserting success; if config parsing fails, the test still passes. Capture ret and assert ret == 0 (same pattern used elsewhere).
🔧 Example fix (apply similarly to the other call sites)
```diff
-    flb_output_set(ctx, out_ffd, "upload_chunk_size", "5M", NULL);
+    ret = flb_output_set(ctx, out_ffd, "upload_chunk_size", "5M", NULL);
+    TEST_CHECK(ret == 0);
```

Also applies to: 234-235, 276-277, 360-361, 425-431
Generally speaking, for vendored libraries we do not change them directly here; we submit the change upstream and pull it in once it has been merged.
I tried this out by building from source, and I had two issues with it:
Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
Set the ok-package-test label to test for all targets (requires maintainer to do).

Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
New Features
Improvements
Tests