20233: Refactor example metadata parsing utilities(#20204) by martin-augment · Pull Request #233 · martin-augment/datafusion

martin-augment · 2026-02-09T12:51:15Z

20233: To review by AI

coderabbitai · 2026-02-09T12:51:41Z

Walkthrough

A monolithic documentation generation module at datafusion-examples/src/utils/examples_docs.rs (908 lines) is refactored into a modular structure under datafusion-examples/src/utils/example_metadata/. The new structure splits functionality across seven files: mod.rs for module declarations and re-exports, discover.rs for filesystem scanning, layout.rs for repository structure utilities, model.rs for domain data types, parser.rs for documentation parsing, render.rs for Markdown generation, and test_utils.rs for test helpers. The module path reference in src/bin/examples-docs.rs is updated to use the new example_metadata name, and the corresponding export in src/utils/mod.rs is updated. All previously public APIs are preserved through re-exports in the new module hierarchy.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr-20233-2026-02-09-12-51-11

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

datafusion-examples/src/utils/example_metadata/render.rs (2)

88-95: Consider moving ABBREVIATIONS to model.rs to break the circular dependency.

model.rs imports render::ABBREVIATIONS for format_part, and render.rs imports model::ExampleGroup. This creates a bidirectional dependency between the two sibling modules. Moving ABBREVIATIONS into model.rs (where format_part lives) would keep the dependency graph acyclic: render → model, with no reverse edge.

130-143: Duplicate raw_name extraction logic.

Lines 131–137 duplicate the directory-name extraction already performed inside ExampleGroup::from_dir (model.rs lines 47–53). The duplication exists because Category::for_group needs the raw name before from_dir is called. Consider either having from_dir accept a pre-extracted name, or inlining category resolution inside from_dir to eliminate the redundancy.
datafusion-examples/src/utils/example_metadata/parser.rs (1)
70-95: Use nom::combinator::rest instead of take_while(|_| true).

Line 73 uses take_while(|_| true) to consume the remainder of the input. nom::combinator::rest is the idiomatic nom equivalent and conveys intent more clearly.
♻️ Suggested change
-use nom::{
-    IResult, Parser,
-    bytes::complete::{tag, take_until, take_while},
-    character::complete::multispace0,
-    combinator::all_consuming,
-    sequence::{delimited, preceded},
-};
+use nom::{
+    IResult, Parser,
+    bytes::complete::{tag, take_until},
+    character::complete::multispace0,
+    combinator::{all_consuming, rest},
+    sequence::{delimited, preceded},
+};
     let parser = preceded(
         multispace0,
-        preceded(tag("//!"), preceded(multispace0, take_while(|_| true))),
+        preceded(tag("//!"), preceded(multispace0, rest)),
     );
datafusion-examples/src/utils/example_metadata/model.rs (1)

104-112: Consider whether ExampleEntry fields should be public.

All three fields are pub, allowing direct construction from outside the module (e.g., in tests and parser). This is fine for the current scope, but if you want to enforce construction invariants later, consider providing a constructor and making fields pub(crate).
datafusion-examples/src/utils/example_metadata/test_utils.rs (1)
51-63: Temp directory lifetime is safe here, but worth a brief note.

TempDir is dropped (and the directory deleted) at the end of this function. This works because ExampleGroup::from_dir eagerly reads and parses the file contents before returning. If from_dir were ever changed to lazily read files, this would silently break. A one-line comment noting the intentional eager-read dependency would help future maintainers.
📝 Suggested comment
 pub fn example_group_from_docs(docs: &str) -> Result<ExampleGroup> {
+    // TempDir is dropped at function exit; this is safe because
+    // ExampleGroup::from_dir eagerly reads and parses all file contents.
     let tmp = TempDir::new().map_err(|e| {

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-02-09T12:52:37Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the example metadata parsing utilities within the DataFusion examples project. The original monolithic examples_docs.rs file has been split into several smaller, more focused modules to improve code clarity and maintainability. This change does not introduce new functionality but rather reorganizes existing code for better structure.

Highlights

Refactoring: The primary goal is to refactor the example metadata parsing utilities to improve code organization and maintainability.
New Modules: Several new modules have been added under datafusion-examples/src/utils/example_metadata, including discover, layout, model, parser, render, and test_utils to handle different aspects of example metadata processing.
Code Removal: The datafusion-examples/src/utils/examples_docs.rs file has been removed, with its functionality distributed among the new modules.
Dependency Management: The changes focus on internal organization and do not introduce any new external dependencies.

Changelog

datafusion-examples/src/bin/examples-docs.rs
- Modified to use the new example_metadata module instead of the old examples_docs module.
datafusion-examples/src/utils/example_metadata/discover.rs
- Added a new module for discovering example groups in the repository filesystem.
datafusion-examples/src/utils/example_metadata/layout.rs
- Added a new module for repository layout utilities.
datafusion-examples/src/utils/example_metadata/mod.rs
- Added a new module that acts as the main entry point for example metadata functionality, defining the overall structure and usage.
datafusion-examples/src/utils/example_metadata/model.rs
- Added a new module defining the domain model for DataFusion example documentation.
datafusion-examples/src/utils/example_metadata/parser.rs
- Added a new module for parsing example metadata from main.rs documentation comments.
datafusion-examples/src/utils/example_metadata/render.rs
- Added a new module for rendering Markdown documentation for DataFusion examples.
datafusion-examples/src/utils/example_metadata/test_utils.rs
- Added a new module providing test helpers for example metadata parsing and validation.
datafusion-examples/src/utils/examples_docs.rs
- Removed the original monolithic module, as its functionality has been split into smaller, more focused modules.
datafusion-examples/src/utils/mod.rs
- Modified to replace the examples_docs module with the new example_metadata module.

Activity

The pull request primarily involves code refactoring and reorganization.
No specific reviewer activity is mentioned in the provided context.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

claude · 2026-02-09T12:52:45Z

Pull Request Review: Refactor Example Metadata Parsing Utilities

Summary

This PR refactors datafusion-examples/src/utils/examples_docs.rs (908 lines) into a well-organized module structure with 7 separate files (1,217 lines). The refactoring improves code organization, maintainability, and testability by following the single responsibility principle.

Code Quality: ✅ Excellent

Strengths

Clean module separation - Each module has a clear, focused responsibility:
- discover.rs - Filesystem discovery
- layout.rs - Repository structure
- model.rs - Domain models
- parser.rs - Parsing logic
- render.rs - Markdown generation
- test_utils.rs - Shared test helpers
Comprehensive documentation - All modules have excellent doc comments explaining their purpose and design goals
Strong test coverage - Tests moved with their corresponding modules, maintaining excellent coverage:
- Parser tests (15+ test cases)
- Model validation tests
- Layout tests
- Rendering tests
- Edge case handling
Consistent error handling - Proper use of DataFusionError::Execution with meaningful error messages
Well-structured types - Clean separation between data models (ExampleGroup, GroupName, ExampleEntry, Category)

Best Practices: ✅ Good

Positives

✅ Proper use of nom parser combinators for robust parsing
✅ State machine pattern in ParserState for tracking subcommand-metadata relationships
✅ Helper functions (format_part, is_non_blank_doc_line) are well-scoped
✅ Tests use tempfile::TempDir for isolated filesystem operations
✅ ASF license headers on all files
✅ Public API remains unchanged (exports via mod.rs)

Minor Observations

Module visibility - test_utils.rs is marked #[cfg(test)] in mod.rs but re-exported publicly. Consider:
```
#[cfg(test)]
pub mod test_utils;
```
This is minor since it's already feature-gated, but makes intent clearer.
Const vs static - ABBREVIATIONS in render.rs is defined as pub const but referenced from model.rs. This creates a coupling between modules. Consider:
- Moving to a shared location, or
- Making it private to render.rs and passing as parameter

Potential Issues: ⚠️ Minor

1. Cross-module coupling (model.rs ↔ render.rs)

Location: datafusion-examples/src/utils/example_metadata/model.rs:322

use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};

The format_part function in model.rs depends on ABBREVIATIONS from render.rs. This creates a dependency cycle (render uses model, model uses render's constant).

Recommendation: Move ABBREVIATIONS and format_part to a shared module or make ABBREVIATIONS part of model.rs since it's used for model formatting.

2. Parser error context

Location: parser.rs:831-835

When parsing fails, the error message includes line numbers but not the actual problematic line content. For developer experience:

return Err(DataFusionError::Execution(format!(
    "Metadata without preceding subcommand at {}:{}\nLine: {}",
    path.display(),
    line_no + 1,
    raw_line  // Show the actual line
)));

3. Test helper location

The test_utils.rs module is conditionally compiled but publicly exported. While this works, it's unconventional. If these utilities are only for internal tests, they should be private.

Performance: ✅ No Concerns

No performance regressions introduced
Filesystem operations are appropriately scoped
Parser uses efficient nom combinators
String allocations are minimal and necessary

Security: ✅ No Concerns

No command injection risks
Filesystem operations are properly scoped
No unsafe code
Input validation via parser is robust
Error messages don't leak sensitive information

Test Coverage: ✅ Comprehensive

All original tests preserved and properly organized:

✅ 15+ parser tests (valid/invalid inputs)
✅ Model validation tests (duplicates, ordering, abbreviations)
✅ Layout detection tests
✅ Rendering tests
✅ Integration test (parse_group_example_works)
✅ Edge cases (blank lines, metadata positioning)

Architecture: ✅ Improved

The refactoring significantly improves code organization:

Before: Single 908-line file mixing concerns
After: 7 focused modules with clear boundaries

This makes the codebase:

Easier to navigate
Easier to test individual components
Easier to modify without affecting unrelated functionality
Better self-documenting through module structure

Recommendations

Required: None

Suggested Improvements:

Address circular dependency between model.rs and render.rs by moving ABBREVIATIONS to model.rs or a separate constants.rs module
Enhance error messages in parser to include actual line content for better debugging
Clarify test_utils visibility - either make it truly internal or document why it's public

Example fix for #1:

// In model.rs
const ABBREVIATIONS: &[(&str, &str)] = &[
    ("dataframe", "DataFrame"),
    ("io", "IO"),
    ("sql", "SQL"),
    ("udf", "UDF"),
];

// Then render.rs can re-export if needed:
pub use crate::utils::example_metadata::model::ABBREVIATIONS;

Conclusion

This is a high-quality refactoring that significantly improves code organization without changing functionality. The code follows Rust best practices, has excellent test coverage, and maintains backward compatibility.

Recommendation: Approve with minor suggestions ✅

The suggested improvements are minor and don't block merging. The PR achieves its goal of making the codebase more maintainable and modular.

🤖 Generated by Claude Code

gemini-code-assist

Code Review

This pull request refactors the example metadata parsing and documentation generation logic into a new, well-structured example_metadata module. The code is now more modular and easier to maintain. I've found a couple of areas for improvement: a circular dependency between modules that should be resolved, and a fragile parsing implementation that could be made more robust. My detailed comments provide suggestions for addressing these points.

gemini-code-assist · 2026-02-09T12:55:00Z

datafusion-examples/src/utils/example_metadata/model.rs

+
+use datafusion::error::{DataFusionError, Result};
+
+use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};


This use statement introduces a circular dependency between the model and render modules (model -> render for ABBREVIATIONS, and render -> model for ExampleGroup). This can make the code harder to understand and maintain.

Since ABBREVIATIONS is used for constructing the GroupName title, it's more of a model concern than a rendering one in the current structure. To break the cycle, I suggest moving the ABBREVIATIONS constant from render.rs to this file (model.rs) and updating this use statement accordingly.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The ABBREVIATIONS constant is used only in model.rs, so there is no real need to keep it in render.rs and create a circular dependency between these two modules.

gemini-code-assist · 2026-02-09T12:55:01Z

datafusion-examples/src/utils/example_metadata/parser.rs

+fn parse_metadata_line(input: &str) -> IResult<&str, (&str, &str)> {
+    let parser = preceded(
+        multispace0,
+        preceded(tag("//!"), preceded(multispace0, take_while(|_| true))),
+    );
+    let (rest, payload) = all_consuming(parser).parse(input)?;
+
+    let content = payload
+        .strip_prefix("(")
+        .and_then(|s| s.strip_suffix(")"))
+        .ok_or_else(|| {
+            nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
+        })?;
+
+    let (file, desc) = content
+        .strip_prefix("file:")
+        .ok_or_else(|| {
+            nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
+        })?
+        .split_once(", desc:")
+        .ok_or_else(|| {
+            nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
+        })?;
+
+    Ok((rest, (file.trim(), desc.trim())))
+}


The use of split_once here is a bit fragile. It might incorrectly parse if the file path contains a comma, or if the description contains the string , desc:. A more robust approach would be to use nom combinators to parse the content within the parentheses. This will require updating the nom imports at the top of the file to include rest, separated_pair, and tuple.

fn parse_metadata_line(input: &str) -> IResult<&str, (&str, &str)> { let parser = preceded( multispace0, preceded(tag("//!"), preceded(multispace0, take_while(|_| true))), ); let (rest, payload) = all_consuming(parser).parse(input)?; let content = payload .strip_prefix("(") .and_then(|s| s.strip_suffix(")")) .ok_or_else(|| { nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag)) })?; let file_parser = preceded(nom::sequence::tuple((tag("file:"), multispace0)), take_until(", desc:")); let desc_parser = preceded(nom::sequence::tuple((tag(", desc:"), multispace0)), nom::combinator::rest); let (file, desc) = match nom::sequence::separated_pair(file_parser, multispace0, desc_parser)(content) { Ok( K("", (file, desc))) => (file, desc), _ => { return Err(nom::Err::Error(nom::error::Error::new( payload, nom::error::ErrorKind::Tag, ))) } }; Ok((rest, (file.trim(), desc.trim()))) }

gemini-code-assist · 2026-02-09T12:55:01Z

datafusion-examples/src/utils/example_metadata/render.rs

+pub const ABBREVIATIONS: &[(&str, &str)] = &[
+    ("dataframe", "DataFrame"),
+    ("io", "IO"),
+    ("sql", "SQL"),
+    ("udf", "UDF"),
+];


To break a circular dependency with the model module, this ABBREVIATIONS constant should be moved to model.rs. The model module currently depends on render for this constant (to construct GroupName instances), while render depends on model for ExampleGroup.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The ABBREVIATIONS constant is used only in model.rs, so there is no real need to keep it in render.rs and create a circular dependency between these two modules.

augmentcode · 2026-02-09T12:56:08Z

🤖 Augment PR Summary

Summary: This PR refactors the example README generation utilities by splitting the former monolithic examples_docs implementation into a new utils/example_metadata module with clearer responsibilities.

Changes:

Updated the `examples-docs` binary to use `datafusion_examples::utils::example_metadata`.
Replaced `utils/examples_docs.rs` with a directory-based module (`discover`, `layout`, `model`, `parser`, `render`).
Introduced `RepoLayout` helpers for locating the repository root and the examples directory.
Added filesystem discovery of example groups based on presence of `main.rs`.
Implemented a strict parser for structured `//!` metadata (subcommand + `(file, desc)` lines) and validation rules.
Added Markdown rendering logic for README sections and tables per example group.
Added focused unit tests per submodule plus shared test helpers.

Technical Notes: The refactor centralizes example-doc generation behind generate_examples_readme, keeping the generator fail-fast on malformed documentation while making the code easier to test and maintain.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-02-09T12:56:09Z

datafusion-examples/src/utils/example_metadata/discover.rs

+        let entry = entry?;
+        let path = entry.path();
+
+        if path.is_dir() && path.join("main.rs").exists() {


path.join("main.rs").exists() will also return true if main.rs exists but isn’t a regular file (e.g., a directory or unexpected symlink). Consider checking is_file() (or metadata) so discovery only treats real main.rs files as valid example groups.

Severity: low

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:useful; category:bug; feedback: The Augment AI reviewer is correct! An additional check that main.rs is a file should be added to prevent accepting main.rs folders as a group indicator.

augmentcode · 2026-02-09T12:56:09Z

datafusion-examples/src/utils/example_metadata/parser.rs

+            continue;
+        }
+
+        // If a non-blank doc line interrupts a pending subcommand, reset the state


parse_main_rs_docs only resets ParserState::SeenSubcommand when it encounters an unrelated non-blank doc line, so metadata could still be accepted after intervening non-doc lines (comments/code). If the intent is “metadata must immediately follow the subcommand (allowing only blank doc lines)”, consider also resetting state on other non-empty non-metadata lines.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! The whole state should be reset after seeing new non-empty content after an empty line. This indicates a new subcommand with its own metadata, so the state must be cleared.

martin-augment · 2026-02-09T21:13:27Z

1. Cross-module coupling (model.rs ↔ render.rs)

Location: datafusion-examples/src/utils/example_metadata/model.rs:322
use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};
The format_part function in model.rs depends on ABBREVIATIONS from render.rs. This creates a dependency cycle (render uses model, model uses render's constant).

Recommendation: Move ABBREVIATIONS and format_part to a shared module or make ABBREVIATIONS part of model.rs since it's used for model formatting.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The ABBREVIATIONS constant is used only in model.rs, so there is no real need to keep it in render.rs and create a circular dependency between these two modules.

martin-augment · 2026-02-09T21:15:33Z

2. Parser error context

Location: parser.rs:831-835

When parsing fails, the error message includes line numbers but not the actual problematic line content. For developer experience:
return Err(DataFusionError::Execution(format!(
    "Metadata without preceding subcommand at {}:{}\nLine: {}",
    path.display(),
    line_no + 1,
    raw_line  // Show the actual line
)));

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The file name and the line number is enough information to debug any potential problems. Printing the line itself might be helpful if the text is short but it will add noise for longer lines.

Refactor example metadata parsing utilities(apache#20204)

24a6632

gemini-code-assist bot reviewed Feb 9, 2026

View reviewed changes

augmentcode bot reviewed Feb 9, 2026

View reviewed changes


		use datafusion::error::{DataFusionError, Result};

		use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};

Conversation

martin-augment commented Feb 9, 2026

Uh oh!

coderabbitai bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

gemini-code-assist bot commented Feb 9, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

claude bot commented Feb 9, 2026

Pull Request Review: Refactor Example Metadata Parsing Utilities

Summary

Code Quality: ✅ Excellent

Strengths

Best Practices: ✅ Good

Positives

Minor Observations

Potential Issues: ⚠️ Minor

1. Cross-module coupling (model.rs ↔ render.rs)

2. Parser error context

3. Test helper location

Performance: ✅ No Concerns

Security: ✅ No Concerns

Test Coverage: ✅ Comprehensive

Architecture: ✅ Improved

Recommendations

Required: None

Suggested Improvements:

Example fix for #1:

Conclusion

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot commented Feb 9, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment commented Feb 9, 2026

1. Cross-module coupling (model.rs ↔ render.rs)

Uh oh!

martin-augment commented Feb 9, 2026

2. Parser error context

Uh oh!

Reviewers

Assignees

Labels

Projects

coderabbitai bot commented Feb 9, 2026 •

edited

Loading