[WIP] Synthetic examples #82

Draft
wants to merge 24 commits into
base: main

jacobmarks
Contributor

Initial concept for generating synthetic examples for a dataset, based on its specific fields and values. This is meant to partially address the problem that GPT doesn't always know the schema of your dataset, especially when it comes to non-standard fields like filepath and metadata, or non-label fields.

It is implemented via a FieldExampleGenerator class, which randomly generates field-type-specific examples from templates. The design is flexible enough to be used for any field type. The only things that need to be done to add a new field type are:

  1. Fill the self.patterns dictionary. The keys should be the patterns to fill, and the values should be the function objects which generate their replacements. These functions need to be implemented, but they are typically one line of code each.
  2. Set the self.example_templates attribute, which should be a list of dicts, each containing a query and a string-form list of view stages.
  3. Change the self.filters attribute if needed - this defines the conditions used when turning these examples into a pandas DataFrame that we can filter later on.
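
The first two steps can be sketched roughly as follows. This is a toy stand-in, not the class from this PR: the class name, the `[FIELD]` and `[VALUE]` placeholder syntax, the `seed` argument, and the `_fill` helper are all hypothetical, and it takes a plain list of values rather than reading them from a dataset. The `self.filters` step is left out because its format is not shown here.

```python
import random


class ToyStringFieldExampleGenerator:
    """Hypothetical stand-in for illustration; not the class from this PR."""

    def __init__(self, field_name, values, seed=None):
        self.field_name = field_name
        self.values = list(values)
        self._rng = random.Random(seed)

        # Step 1: placeholder -> zero-argument function generating its replacement
        # (the bracketed placeholder syntax is an assumption)
        self.patterns = {
            "[FIELD]": lambda: self.field_name,
            "[VALUE]": lambda: self._rng.choice(self.values),
        }

        # Step 2: each template pairs a query with a string-form list of view stages
        self.example_templates = [
            {
                "query": "Only images that have [FIELD] not equal to [VALUE]",
                "stages": "[match(F('[FIELD]') != '[VALUE]')]",
            },
            {
                "query": "Exclude the [FIELD] field from all samples",
                "stages": "[exclude_fields('[FIELD]')]",
            },
        ]

    def _fill(self, template):
        # Draw each replacement once so the query and stages stay consistent
        replacements = {key: fn() for key, fn in self.patterns.items()}
        filled = {}
        for part, text in template.items():
            for key, value in replacements.items():
                text = text.replace(key, value)
            filled[part] = text
        return filled

    def generate_examples(self, num_examples):
        # Pick a random template for each example and fill in its placeholders
        return [
            self._fill(self._rng.choice(self.example_templates))
            for _ in range(num_examples)
        ]


generator = ToyStringFieldExampleGenerator("my_field", ["bed", "sheep", "tie"], seed=0)
examples = generator.generate_examples(5)
```

Drawing the replacements once per example, rather than once per string, is what keeps the value named in the query identical to the one in the stages.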

Usage for string fields looks like this:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

from links.synthetic_example_generator import StringFieldExampleGenerator

dataset = foz.load_zoo_dataset("quickstart")

# Add a string field holding each sample's first ground-truth label
dataset.add_sample_field("my_field", fo.StringField)
view = dataset.set_field("my_field", F("ground_truth.detections.label")[0])
view.save("my_field")

num_examples = 5  # number of examples to generate
sfeg = StringFieldExampleGenerator(dataset, "my_field")
examples = sfeg.generate_examples(num_examples)

This results in the following:

[{'query': ' images where my_field is skis or traffic light',
  'stages': "[match(F('my_field').is_in(['skis', 'traffic light']))]"},
 {'query': 'Exclude the my_field field from all samples',
  'stages': "[exclude_fields('my_field')]"},
 {'query': 'Exclude the my_field field from all samples',
  'stages': "[exclude_fields('my_field')]"},
 {'query': ' images where my_field is sheep or tie',
  'stages': "[match(F('my_field').is_in(['sheep', 'tie']))]"},
 {'query': 'Only images that have my_field not equal to bed',
  'stages': "[match(F('my_field') != 'bed')]"}]

As the repeated exclude_fields example shows, random generation can produce duplicates, so we will also want to keep only the unique examples.
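
One possible way to drop the duplicates, shown as a minimal sketch: it assumes each example is a dict with `query` and `stages` keys, as in the output above, and the `unique_examples` helper name is hypothetical.

```python
def unique_examples(examples):
    """Drop repeated (query, stages) pairs, keeping first occurrences in order."""
    seen = set()
    unique = []
    for example in examples:
        key = (example["query"], example["stages"])
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


examples = [
    {"query": "Exclude the my_field field from all samples",
     "stages": "[exclude_fields('my_field')]"},
    {"query": "Exclude the my_field field from all samples",
     "stages": "[exclude_fields('my_field')]"},
    {"query": "Only images that have my_field not equal to bed",
     "stages": "[match(F('my_field') != 'bed')]"},
]
deduped = unique_examples(examples)
```

Because deduplication shrinks the list, the generator may need to be asked for more examples than are ultimately required.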

To fold this into the rest of the code, the workflow would look something like this:

  1. Given the dataset, generate field-type-specific examples for each field (excluding the default ones)
  2. Compute the embeddings for these and store them separately
  3. In our example selection link, instead of selecting the top 40 examples from the generic candidates, take, e.g., the top 30 of those and the top 10 of these dataset-specific examples
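
The selection step could be sketched along these lines. Everything here is an assumption made for illustration: the function names, the representation of each candidate as a dict with an `embedding` list, and the use of cosine similarity for ranking. The existing example selection link may rank candidates differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, candidates, k):
    """Return the k candidates whose embeddings are most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_embedding, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def select_examples(query_embedding, generic, synthetic,
                    num_generic=30, num_synthetic=10):
    """Combine the top generic examples with the top dataset-specific ones."""
    return (
        top_k(query_embedding, generic, num_generic)
        + top_k(query_embedding, synthetic, num_synthetic)
    )


# Toy two-dimensional embeddings, for illustration only
generic = [
    {"query": "g0", "embedding": [1.0, 0.0]},
    {"query": "g1", "embedding": [0.0, 1.0]},
]
synthetic = [
    {"query": "s0", "embedding": [0.9, 0.1]},
    {"query": "s1", "embedding": [-1.0, 0.0]},
]
selected = select_examples([1.0, 0.0], generic, synthetic,
                           num_generic=1, num_synthetic=1)
```

Keeping the two pools separate guarantees that dataset-specific examples get a fixed share of the prompt instead of competing with the generic candidates on similarity alone.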

@jacobmarks jacobmarks requested a review from brimoor June 1, 2023 23:27
@brimoor brimoor changed the title Synthetic examples [WIP] Synthetic examples Jun 5, 2023