Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsers / loaders are a bit limited #72

Closed
redhog opened this issue Oct 5, 2024 · 4 comments · Fixed by #81
Closed

Parsers / loaders are a bit limited #72

redhog opened this issue Oct 5, 2024 · 4 comments · Fixed by #81

Comments

@redhog
Copy link
Collaborator

redhog commented Oct 5, 2024

Limitations:

  • Can not use multiple input fields / the entire item
  • Can not take additional parameters from config
  • Can only output a single field
    • Use-cases for multiple fields:
      • Parse some text out of a document and additionally some metadata (set of name/value pairs)
      • Parse all columns of e.g. a csv into separate fields

How can we redesign this to make it more flexible?

@shreyashankar
Copy link
Collaborator

Agreed. One idea is for the custom parsers to accept a document and return a list of documents/dictionaries, which get merged into the original documents, instead of a list of strings. Then the user doesn't have to specify input_key or output_key.

@redhog
Copy link
Collaborator Author

redhog commented Oct 6, 2024

I think that, coupled with arbitrary arguments from the config yaml to the parser, would be good enough. Parsers that wants to use an input key, would just take that as an argument.

We'd just change the yaml to be:

datasets:
  my_dataset:
    type: file
    source: local
    path: "data.json"
    parsing:
      - function: my_parser
        argument_name_1: value
        argument_name_2: value
        argument_name_3: value

And for the existing parsers, we add the parameters input_key and output_key and it will be compatible with old pipeline yaml files too :)

Signature for a parser function would then be:

def my_parser(
    item: dict[str, Any],
    **kw) -> list[dict[str, Any]]:

@shreyashankar
Copy link
Collaborator

shreyashankar commented Oct 7, 2024

I like this idea! I'm on board.

Feel free to take it if you'd like--I'm trying to get a v0 of the UI out this week :-) But i can also get to it later no worries

@redhog
Copy link
Collaborator Author

redhog commented Oct 7, 2024

I'd be happy to do this one, but I want the rate_limit stuff merged first (with the runner argument), to limit merge conflicts...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants