[WIP] Synthetic examples #82

jacobmarks · 2023-06-01T23:27:06Z

Initial concept for generating synthetic examples for a dataset based on its specific fields and values. This is meant to partially address the problem that GPT doesn't always know the schema of your dataset, especially when it comes to non-standard fields like filepath and metadata, or non-label fields.

It is implemented via a FieldExampleGenerator class, which randomly generates field-type-specific examples from templates. The design is flexible enough to be used for any field type. To add a new field type, only the following needs to be done:

Fill the self.patterns dictionary. The keys should be the patterns to fill, and the values should be the functions that generate their replacements. These functions need to be implemented, but they are typically one line of code each.
Set the self.example_templates attribute, which should be a list of dicts, each containing a query and a string-form list of view stages.
Change the self.filters attribute if needed. This defines the conditions used when turning these examples into a pandas DataFrame that we can filter later on.
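To make the three steps above concrete, here is a minimal, self-contained sketch of what a field-type generator might look like. Only the attribute names (patterns, example_templates, filters) come from this description. The class name, the `<FIELD>`/`<VALUE>` placeholder syntax, the constructor arguments, and the shape of `filters` are assumptions made for illustration, not the code in this PR.

```python
import random


class SketchStringFieldGenerator:
    """Hypothetical stand-in for a field-type specific example generator."""

    def __init__(self, field_name, values, seed=None):
        self.field_name = field_name
        self.values = list(values)
        self.rng = random.Random(seed)

        # Keys are the placeholders to fill; values are the functions that
        # generate their replacements (typically one line each).
        self.patterns = {
            "<FIELD>": lambda: self.field_name,
            "<VALUE>": lambda: self.rng.choice(self.values),
        }

        # Each template pairs a query with a string-form list of view stages.
        self.example_templates = [
            {
                "query": "Only images that have <FIELD> not equal to <VALUE>",
                "stages": "[match(F('<FIELD>') != '<VALUE>')]",
            },
            {
                "query": "Exclude the <FIELD> field from all samples",
                "stages": "[exclude_fields('<FIELD>')]",
            },
        ]

        # Assumed shape: conditions attached to the examples for later filtering.
        self.filters = {"field_type": "string"}

    def generate_example(self):
        template = self.rng.choice(self.example_templates)
        # Draw each replacement once so the query and stages stay consistent.
        replacements = {key: fn() for key, fn in self.patterns.items()}
        example = {}
        for part in ("query", "stages"):
            text = template[part]
            for key, value in replacements.items():
                text = text.replace(key, value)
            example[part] = text
        return example

    def generate_examples(self, num_examples):
        return [self.generate_example() for _ in range(num_examples)]
```

With this sketch, `SketchStringFieldGenerator("my_field", ["bed", "sheep"]).generate_examples(5)` returns a list of query/stages dicts in the same shape as the output shown further down.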

This is what it looks like for string fields:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

from links.synthetic_example_generator import StringFieldExampleGenerator

dataset = foz.load_zoo_dataset("quickstart")

# Add a string field holding each sample's first ground truth label
dataset.add_sample_field("my_field", fo.StringField)
view = dataset.set_field("my_field", F("ground_truth.detections.label")[0])
view.save("my_field")

num_examples = 5  # number of examples to generate
sfeg = StringFieldExampleGenerator(dataset, "my_field")
examples = sfeg.generate_examples(num_examples)

This results in the following:

[{'query': ' images where my_field is skis or traffic light',
  'stages': "[match(F('my_field').is_in(['skis', 'traffic light']))]"},
 {'query': 'Exclude the my_field field from all samples',
  'stages': "[exclude_fields('my_field')]"},
 {'query': 'Exclude the my_field field from all samples',
  'stages': "[exclude_fields('my_field')]"},
 {'query': ' images where my_field is sheep or tie',
  'stages': "[match(F('my_field').is_in(['sheep', 'tie']))]"},
 {'query': 'Only images that have my_field not equal to bed',
  'stages': "[match(F('my_field') != 'bed')]"}]

We will also want to keep only the unique examples, so that we don't get any duplicates.
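One possible way to drop the duplicates is sketched below. The helper name `unique_examples` is hypothetical, and treating two examples as duplicates only when both the query and the stages match is an assumption.

```python
def unique_examples(examples):
    """Return examples with exact (query, stages) duplicates removed,
    keeping the first occurrence and preserving order."""
    seen = set()
    unique = []
    for example in examples:
        key = (example["query"], example["stages"])
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


# Three of the generated examples from above, two of which are identical
sample = [
    {"query": "Exclude the my_field field from all samples",
     "stages": "[exclude_fields('my_field')]"},
    {"query": "Exclude the my_field field from all samples",
     "stages": "[exclude_fields('my_field')]"},
    {"query": "Only images that have my_field not equal to bed",
     "stages": "[match(F('my_field') != 'bed')]"},
]
deduplicated = unique_examples(sample)
```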

To fold this into the rest of the code, the workflow would look something like this:

Given the dataset, generate field-type-specific examples for each field (excluding the default ones)
Compute the embeddings for these and store them separately
In our example selection link, instead of selecting the top 40 examples from the generic candidates, take, e.g., the top 30 of those and the top 10 of these dataset-specific examples
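The selection step in this workflow could be sketched roughly as follows. The function names, the (example, embedding) pair layout, and the use of cosine similarity as the ranking score are all assumptions for illustration; the existing example selection link may rank candidates differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, candidates, k):
    """candidates is a list of (example, embedding) pairs."""
    ranked = sorted(
        candidates,
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [example for example, _ in ranked[:k]]


def select_examples(query_embedding, generic, dataset_specific,
                    num_generic=30, num_specific=10):
    """Take the best generic examples plus the best dataset-specific ones."""
    return (
        top_k(query_embedding, generic, num_generic)
        + top_k(query_embedding, dataset_specific, num_specific)
    )
```

Keeping the two pools separate until this point means the 30/10 split stays a pair of tunable parameters rather than something baked into the stored embeddings.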

jacobmarks and others added 24 commits June 1, 2023 18:17

basic concept

64163b9

StringField Example Generation

391b354

contributor and deployment instructions

02c3396

removing spurious .labels

04c1ac1

correct filter_labels hallucination

76e1ccc

fixing issue with adding new examples

e17be5a

adding negation and patches examples

7c8f158

postprocess bool=False

80fb7a5

more false pattern correction

1c9005d

more length() filtering examples

0d6d2bd

handling 'field=' case

bb9e1ed

reformat text_sim query

e8fe6cd

adding token counting util

e7ef1d7

adding tiktoken to requirements.txt

32e1d6d

split doc embeddings (+ new chunk size)

803f2c9

New docs embeddings gen and token counting

9618726

docs QA rules

ed2438d

remove un-needed import

4666e2f

Markdown is nicer, as per docs-search.

a1d64a1

Refactoring to be cleaner and streamlined

13383d9

New docs splitting embeddings

0fbc34b

0.21 release docs

35a2bcb

Update dataset_view_generator.py

4218635

Correcting typo

Allen non-null conf fix

a9e5514

jacobmarks requested a review from brimoor June 1, 2023 23:27

brimoor changed the title ~~Synthetic examples~~ [WIP] Synthetic examples Jun 5, 2023
