Skip to content

20233: Refactor example metadata parsing utilities(#20204)#233

Open
martin-augment wants to merge 1 commit intomainfrom
pr-20233-2026-02-09-12-51-11
Open

20233: Refactor example metadata parsing utilities(#20204)#233
martin-augment wants to merge 1 commit intomainfrom
pr-20233-2026-02-09-12-51-11

Conversation

@martin-augment
Copy link
Owner

20233: To review by AI

@coderabbitai
Copy link

coderabbitai bot commented Feb 9, 2026

Walkthrough

A monolithic documentation generation module at datafusion-examples/src/utils/examples_docs.rs (908 lines) is refactored into a modular structure under datafusion-examples/src/utils/example_metadata/. The new structure splits functionality across seven files: mod.rs for module declarations and re-exports, discover.rs for filesystem scanning, layout.rs for repository structure utilities, model.rs for domain data types, parser.rs for documentation parsing, render.rs for Markdown generation, and test_utils.rs for test helpers. The module path reference in src/bin/examples-docs.rs is updated to use the new example_metadata name, and the corresponding export in src/utils/mod.rs is updated. All previously public APIs are preserved through re-exports in the new module hierarchy.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-20233-2026-02-09-12-51-11

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
datafusion-examples/src/utils/example_metadata/render.rs (2)

88-95: Consider moving ABBREVIATIONS to model.rs to break the circular dependency.

model.rs imports render::ABBREVIATIONS for format_part, and render.rs imports model::ExampleGroup. This creates a bidirectional dependency between the two sibling modules. Moving ABBREVIATIONS into model.rs (where format_part lives) would keep the dependency graph acyclic: render → model, with no reverse edge.


130-143: Duplicate raw_name extraction logic.

Lines 131–137 duplicate the directory-name extraction already performed inside ExampleGroup::from_dir (model.rs lines 47–53). The duplication exists because Category::for_group needs the raw name before from_dir is called. Consider either having from_dir accept a pre-extracted name, or inlining category resolution inside from_dir to eliminate the redundancy.

datafusion-examples/src/utils/example_metadata/parser.rs (1)

70-95: Use nom::combinator::rest instead of take_while(|_| true).

Line 73 uses take_while(|_| true) to consume the remainder of the input. nom::combinator::rest is the idiomatic nom equivalent and conveys intent more clearly.

♻️ Suggested change
-use nom::{
-    IResult, Parser,
-    bytes::complete::{tag, take_until, take_while},
-    character::complete::multispace0,
-    combinator::all_consuming,
-    sequence::{delimited, preceded},
-};
+use nom::{
+    IResult, Parser,
+    bytes::complete::{tag, take_until},
+    character::complete::multispace0,
+    combinator::{all_consuming, rest},
+    sequence::{delimited, preceded},
+};
     let parser = preceded(
         multispace0,
-        preceded(tag("//!"), preceded(multispace0, take_while(|_| true))),
+        preceded(tag("//!"), preceded(multispace0, rest)),
     );
datafusion-examples/src/utils/example_metadata/model.rs (1)

104-112: Consider whether ExampleEntry fields should be public.

All three fields are pub, allowing direct construction from outside the module (e.g., in tests and parser). This is fine for the current scope, but if you want to enforce construction invariants later, consider providing a constructor and making fields pub(crate).

datafusion-examples/src/utils/example_metadata/test_utils.rs (1)

51-63: Temp directory lifetime is safe here, but worth a brief note.

TempDir is dropped (and the directory deleted) at the end of this function. This works because ExampleGroup::from_dir eagerly reads and parses the file contents before returning. If from_dir were ever changed to lazily read files, this would silently break. A one-line comment noting the intentional eager-read dependency would help future maintainers.

📝 Suggested comment
 pub fn example_group_from_docs(docs: &str) -> Result<ExampleGroup> {
+    // TempDir is dropped at function exit; this is safe because
+    // ExampleGroup::from_dir eagerly reads and parses all file contents.
     let tmp = TempDir::new().map_err(|e| {

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the example metadata parsing utilities within the DataFusion examples project. The original monolithic examples_docs.rs file has been split into several smaller, more focused modules to improve code clarity and maintainability. This change does not introduce new functionality but rather reorganizes existing code for better structure.

Highlights

  • Refactoring: The primary goal is to refactor the example metadata parsing utilities to improve code organization and maintainability.
  • New Modules: Several new modules have been added under datafusion-examples/src/utils/example_metadata, including discover, layout, model, parser, render, and test_utils to handle different aspects of example metadata processing.
  • Code Removal: The datafusion-examples/src/utils/examples_docs.rs file has been removed, with its functionality distributed among the new modules.
  • Dependency Management: The changes focus on internal organization and do not introduce any new external dependencies.
Changelog
  • datafusion-examples/src/bin/examples-docs.rs
    • Modified to use the new example_metadata module instead of the old examples_docs module.
  • datafusion-examples/src/utils/example_metadata/discover.rs
    • Added a new module for discovering example groups in the repository filesystem.
  • datafusion-examples/src/utils/example_metadata/layout.rs
    • Added a new module for repository layout utilities.
  • datafusion-examples/src/utils/example_metadata/mod.rs
    • Added a new module that acts as the main entry point for example metadata functionality, defining the overall structure and usage.
  • datafusion-examples/src/utils/example_metadata/model.rs
    • Added a new module defining the domain model for DataFusion example documentation.
  • datafusion-examples/src/utils/example_metadata/parser.rs
    • Added a new module for parsing example metadata from main.rs documentation comments.
  • datafusion-examples/src/utils/example_metadata/render.rs
    • Added a new module for rendering Markdown documentation for DataFusion examples.
  • datafusion-examples/src/utils/example_metadata/test_utils.rs
    • Added a new module providing test helpers for example metadata parsing and validation.
  • datafusion-examples/src/utils/examples_docs.rs
    • Removed the original monolithic module, as its functionality has been split into smaller, more focused modules.
  • datafusion-examples/src/utils/mod.rs
    • Modified to replace the examples_docs module with the new example_metadata module.
Activity
  • The pull request primarily involves code refactoring and reorganization.
  • No specific reviewer activity is mentioned in the provided context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude
Copy link

claude bot commented Feb 9, 2026

Pull Request Review: Refactor Example Metadata Parsing Utilities

Summary

This PR refactors datafusion-examples/src/utils/examples_docs.rs (908 lines) into a well-organized module structure with 7 separate files (1,217 lines). The refactoring improves code organization, maintainability, and testability by following the single responsibility principle.

Code Quality: ✅ Excellent

Strengths

  1. Clean module separation - Each module has a clear, focused responsibility:

    • discover.rs - Filesystem discovery
    • layout.rs - Repository structure
    • model.rs - Domain models
    • parser.rs - Parsing logic
    • render.rs - Markdown generation
    • test_utils.rs - Shared test helpers
  2. Comprehensive documentation - All modules have excellent doc comments explaining their purpose and design goals

  3. Strong test coverage - Tests moved with their corresponding modules, maintaining excellent coverage:

    • Parser tests (15+ test cases)
    • Model validation tests
    • Layout tests
    • Rendering tests
    • Edge case handling
  4. Consistent error handling - Proper use of DataFusionError::Execution with meaningful error messages

  5. Well-structured types - Clean separation between data models (ExampleGroup, GroupName, ExampleEntry, Category)

Best Practices: ✅ Good

Positives

  • ✅ Proper use of nom parser combinators for robust parsing
  • ✅ State machine pattern in ParserState for tracking subcommand-metadata relationships
  • ✅ Helper functions (format_part, is_non_blank_doc_line) are well-scoped
  • ✅ Tests use tempfile::TempDir for isolated filesystem operations
  • ✅ ASF license headers on all files
  • ✅ Public API remains unchanged (exports via mod.rs)

Minor Observations

  1. Module visibility - test_utils.rs is marked #[cfg(test)] in mod.rs but re-exported publicly. Consider:

    #[cfg(test)]
    pub mod test_utils;

    This is minor since it's already feature-gated, but makes intent clearer.

  2. Const vs static - ABBREVIATIONS in render.rs is defined as pub const but referenced from model.rs. This creates a coupling between modules. Consider:

    • Moving to a shared location, or
    • Making it private to render.rs and passing as parameter

Potential Issues: ⚠️ Minor

1. Cross-module coupling (model.rs ↔ render.rs)

Location: datafusion-examples/src/utils/example_metadata/model.rs:322

use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};

The format_part function in model.rs depends on ABBREVIATIONS from render.rs. This creates a dependency cycle (render uses model, model uses render's constant).

Recommendation: Move ABBREVIATIONS and format_part to a shared module or make ABBREVIATIONS part of model.rs since it's used for model formatting.

2. Parser error context

Location: parser.rs:831-835

When parsing fails, the error message includes line numbers but not the actual problematic line content. For developer experience:

return Err(DataFusionError::Execution(format!(
    "Metadata without preceding subcommand at {}:{}\nLine: {}",
    path.display(),
    line_no + 1,
    raw_line  // Show the actual line
)));

3. Test helper location

The test_utils.rs module is conditionally compiled but publicly exported. While this works, it's unconventional. If these utilities are only for internal tests, they should be private.

Performance: ✅ No Concerns

  • No performance regressions introduced
  • Filesystem operations are appropriately scoped
  • Parser uses efficient nom combinators
  • String allocations are minimal and necessary

Security: ✅ No Concerns

  • No command injection risks
  • Filesystem operations are properly scoped
  • No unsafe code
  • Input validation via parser is robust
  • Error messages don't leak sensitive information

Test Coverage: ✅ Comprehensive

All original tests preserved and properly organized:

  • ✅ 15+ parser tests (valid/invalid inputs)
  • ✅ Model validation tests (duplicates, ordering, abbreviations)
  • ✅ Layout detection tests
  • ✅ Rendering tests
  • ✅ Integration test (parse_group_example_works)
  • ✅ Edge cases (blank lines, metadata positioning)

Architecture: ✅ Improved

The refactoring significantly improves code organization:

Before: Single 908-line file mixing concerns
After: 7 focused modules with clear boundaries

This makes the codebase:

  • Easier to navigate
  • Easier to test individual components
  • Easier to modify without affecting unrelated functionality
  • Better self-documenting through module structure

Recommendations

Required: None

Suggested Improvements:

  1. Address circular dependency between model.rs and render.rs by moving ABBREVIATIONS to model.rs or a separate constants.rs module

  2. Enhance error messages in parser to include actual line content for better debugging

  3. Clarify test_utils visibility - either make it truly internal or document why it's public

Example fix for #1:

// In model.rs
const ABBREVIATIONS: &[(&str, &str)] = &[
    ("dataframe", "DataFrame"),
    ("io", "IO"),
    ("sql", "SQL"),
    ("udf", "UDF"),
];

// Then render.rs can re-export if needed:
pub use crate::utils::example_metadata::model::ABBREVIATIONS;

Conclusion

This is a high-quality refactoring that significantly improves code organization without changing functionality. The code follows Rust best practices, has excellent test coverage, and maintains backward compatibility.

Recommendation: Approve with minor suggestions

The suggested improvements are minor and don't block merging. The PR achieves its goal of making the codebase more maintainable and modular.


🤖 Generated by Claude Code

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the example metadata parsing and documentation generation logic into a new, well-structured example_metadata module. The code is now more modular and easier to maintain. I've found a couple of areas for improvement: a circular dependency between modules that should be resolved, and a fragile parsing implementation that could be made more robust. My detailed comments provide suggestions for addressing these points.


use datafusion::error::{DataFusionError, Result};

use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This use statement introduces a circular dependency between the model and render modules (model -> render for ABBREVIATIONS, and render -> model for ExampleGroup). This can make the code harder to understand and maintain.

Since ABBREVIATIONS is used for constructing the GroupName title, it's more of a model concern than a rendering one in the current structure. To break the cycle, I suggest moving the ABBREVIATIONS constant from render.rs to this file (model.rs) and updating this use statement accordingly.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The ABBREVIATIONS constant is used only in model.rs, so there is no real need to keep it in render.rs and create a circular dependency between these two modules.

Comment on lines +70 to +95
fn parse_metadata_line(input: &str) -> IResult<&str, (&str, &str)> {
let parser = preceded(
multispace0,
preceded(tag("//!"), preceded(multispace0, take_while(|_| true))),
);
let (rest, payload) = all_consuming(parser).parse(input)?;

let content = payload
.strip_prefix("(")
.and_then(|s| s.strip_suffix(")"))
.ok_or_else(|| {
nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
})?;

let (file, desc) = content
.strip_prefix("file:")
.ok_or_else(|| {
nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
})?
.split_once(", desc:")
.ok_or_else(|| {
nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
})?;

Ok((rest, (file.trim(), desc.trim())))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of split_once here is a bit fragile. It might incorrectly parse if the file path contains a comma, or if the description contains the string , desc:. A more robust approach would be to use nom combinators to parse the content within the parentheses. This will require updating the nom imports at the top of the file to include rest, separated_pair, and tuple.

fn parse_metadata_line(input: &str) -> IResult<&str, (&str, &str)> {
    let parser = preceded(
        multispace0,
        preceded(tag("//!"), preceded(multispace0, take_while(|_| true))),
    );
    let (rest, payload) = all_consuming(parser).parse(input)?;

    let content = payload
        .strip_prefix("(")
        .and_then(|s| s.strip_suffix(")"))
        .ok_or_else(|| {
            nom::Err::Error(nom::error::Error::new(payload, nom::error::ErrorKind::Tag))
        })?;

    let file_parser = preceded(nom::sequence::tuple((tag("file:"), multispace0)), take_until(", desc:"));
    let desc_parser = preceded(nom::sequence::tuple((tag(", desc:"), multispace0)), nom::combinator::rest);

    let (file, desc) = match nom::sequence::separated_pair(file_parser, multispace0, desc_parser)(content) {
        Ok( K("", (file, desc))) => (file, desc),
        _ => {
            return Err(nom::Err::Error(nom::error::Error::new(
                payload,
                nom::error::ErrorKind::Tag,
            )))
        }
    };

    Ok((rest, (file.trim(), desc.trim())))
}

Comment on lines +90 to +95
pub const ABBREVIATIONS: &[(&str, &str)] = &[
("dataframe", "DataFrame"),
("io", "IO"),
("sql", "SQL"),
("udf", "UDF"),
];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To break a circular dependency with the model module, this ABBREVIATIONS constant should be moved to model.rs. The model module currently depends on render for this constant (to construct GroupName instances), while render depends on model for ExampleGroup.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The ABBREVIATIONS constant is used only in model.rs, so there is no real need to keep it in render.rs and create a circular dependency between these two modules.

@augmentcode
Copy link

augmentcode bot commented Feb 9, 2026

🤖 Augment PR Summary

Summary: This PR refactors the example README generation utilities by splitting the former monolithic examples_docs implementation into a new utils/example_metadata module with clearer responsibilities.

Changes:

  • Updated the `examples-docs` binary to use `datafusion_examples::utils::example_metadata`.
  • Replaced `utils/examples_docs.rs` with a directory-based module (`discover`, `layout`, `model`, `parser`, `render`).
  • Introduced `RepoLayout` helpers for locating the repository root and the examples directory.
  • Added filesystem discovery of example groups based on presence of `main.rs`.
  • Implemented a strict parser for structured `//!` metadata (subcommand + `(file, desc)` lines) and validation rules.
  • Added Markdown rendering logic for README sections and tables per example group.
  • Added focused unit tests per submodule plus shared test helpers.

Technical Notes: The refactor centralizes example-doc generation behind generate_examples_readme, keeping the generator fail-fast on malformed documentation while making the code easier to test and maintain.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

let entry = entry?;
let path = entry.path();

if path.is_dir() && path.join("main.rs").exists() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path.join("main.rs").exists() will also return true if main.rs exists but isn’t a regular file (e.g., a directory or unexpected symlink). Consider checking is_file() (or metadata) so discovery only treats real main.rs files as valid example groups.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! An additional check that main.rs is a file should be added to prevent accepting main.rs folders as a group indicator.

continue;
}

// If a non-blank doc line interrupts a pending subcommand, reset the state
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_main_rs_docs only resets ParserState::SeenSubcommand when it encounters an unrelated non-blank doc line, so metadata could still be accepted after intervening non-doc lines (comments/code). If the intent is “metadata must immediately follow the subcommand (allowing only blank doc lines)”, consider also resetting state on other non-empty non-metadata lines.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Augment AI reviewer is correct! The whole state should be reset after seeing new non-empty content after an empty line. This indicates a new subcommand with its own metadata, so the state must be cleared.

@martin-augment
Copy link
Owner Author

1. Cross-module coupling (model.rs ↔ render.rs)

Location: datafusion-examples/src/utils/example_metadata/model.rs:322

use crate::utils::example_metadata::{parse_main_rs_docs, render::ABBREVIATIONS};

The format_part function in model.rs depends on ABBREVIATIONS from render.rs. This creates a dependency cycle (render uses model, model uses render's constant).

Recommendation: Move ABBREVIATIONS and format_part to a shared module or make ABBREVIATIONS part of model.rs since it's used for model formatting.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The ABBREVIATIONS constant is used only in model.rs, so there is no real need to keep it in render.rs and create a circular dependency between these two modules.

@martin-augment
Copy link
Owner Author

2. Parser error context

Location: parser.rs:831-835

When parsing fails, the error message includes line numbers but not the actual problematic line content. For developer experience:

return Err(DataFusionError::Execution(format!(
    "Metadata without preceding subcommand at {}:{}\nLine: {}",
    path.display(),
    line_no + 1,
    raw_line  // Show the actual line
)));

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The file name and the line number is enough information to debug any potential problems. Printing the line itself might be helpful if the text is short but it will add noise for longer lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants