# TinyPrograms

This experiment was inspired by the paper [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759), in which the authors used OpenAI GPT models to synthetically generate simple children's stories and showed that training on such data lets extremely tiny models (~10M parameters, yes M as in million) outperform models 10x larger.

I am a software engineer at heart, so I decided to do the same thing, but for coding models. I put my synthetic data hat on and generated TinyPrograms: ~1,000 tiny Python programs written by Anthropic's Haiku 3 model. To generate them with `fastdata`, I used the following definition of what a tiny program is:

```python
from fastcore.utils import basic_repr, store_attr

class TinyProgram:
    """
    A program that satisfies the requirements.
    """
    def __init__(
        self,
        requirements: str,  # A description of the requirements for the program to help the persona.
        code: str,          # The code that satisfies the requirements. Ensure it is well written and documented.
    ): store_attr()

    __repr__ = basic_repr(['requirements', 'code'])
```

I also followed the advice I outline in the accompanying blog post to this experiment, "How To ~~Train~~ Synthesize Your ~~Dragon~~ Data": I added a few examples of what I want a tiny program to look like, and used personas from the recently released PersonaHub to create a diverse set of tiny programs. Here is what the prompt looks like:

```python
prompt_template = """\
Here are some examples:
{examples}
Create requirements and the python program that satisfies them for the following persona: {persona}
"""
```
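
To make this concrete, here is a minimal sketch of how such a prompt can be fed to `fastdata`, assuming the `FastData.generate` interface shown in fastdata's own README; the example strings, personas, and variable names below are illustrative placeholders, not the exact inputs used to build TinyPrograms.

```python
from fastdata.core import FastData

# Illustrative placeholders: the real run drew personas from PersonaHub and used
# a handful of hand-picked tiny programs as the examples.
examples = "\n".join([
    "Requirements: A CLI tip calculator. Code: ...",
    "Requirements: A word-frequency counter for a text file. Code: ...",
])
personas = [
    "A striker who admires the player's defensive prowess and values their support on the field",
    "A librarian digitizing a card catalog of rare maps",
]

fd = FastData(model="claude-3-haiku-20240307")
tiny_programs = fd.generate(
    prompt_template=prompt_template,
    inputs=[{"examples": examples, "persona": p} for p in personas],
    schema=TinyProgram,
)
```

Scaling the persona list up is what gets you to the ~1,000 programs described above.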

Here is one of the programs that Haiku created:

#### Persona
```
A striker who admires the player's defensive prowess and values their support on the field
```

#### Requirements
```
A Python program that simulates a soccer game between a striker and a defensive player. The striker admires the defensive player's prowess and values their support on the field. The program should model the interactions between the striker and the defensive player, demonstrating how the defensive player's support enables the striker to be more effective in scoring goals.
```

#### Program
```python
import random

class SoccerPlayer:
    def __init__(self, name, position):
        self.name = name
        self.position = position
        self.skill_level = random.randint(1, 10)

class Striker(SoccerPlayer):
    def __init__(self, name):
        super().__init__(name, "Striker")
        self.scoring_chance = 0.5

    def shoot(self, defender):
        if defender.skill_level >= self.skill_level:
            print(f"{self.name} shoots, but {defender.name} blocks the shot!")
        else:
            if random.random() < self.scoring_chance:
                print(f"{self.name} scores a goal!")
            else:
                print(f"{self.name} misses the shot.")

class Defender(SoccerPlayer):
    def __init__(self, name):
        super().__init__(name, "Defender")
        self.support_level = 7

    def support(self, striker):
        striker.scoring_chance += self.support_level / 100
        print(f"{self.name} provides support to {striker.name}, increasing their scoring chance.")

def simulate_game():
    striker = Striker("Alex")
    defender = Defender("Sarah")

    print(f"{striker.name} (Striker) vs. {defender.name} (Defender)")

    for _ in range(5):
        defender.support(striker)
        striker.shoot(defender)
        print()

simulate_game()
```

### The Experiment

I decided to deviate from the TinyStories paper by finetuning a model rather than training one from scratch, since I had such a small dataset. I started with Hugging Face's SmolLM-360M model, since it achieves a respectable pass@1 of 11.6% on the popular HumanEval coding benchmark. I then created 5 dataset configurations to test which improves my model the most:

1. The first is simply the 992 tiny Python programs.
2. The second is 992 Python files sampled from the popular The Stack dataset.
3. The third is a high-quality filtered version of the tiny Python programs, where an LLM scores each program against a rubric.
4. The fourth is the same as the third, but applied to the Python files from The Stack.
5. Finally, the fifth mixes equal halves of the high-quality filtered tiny Python programs and the high-quality filtered Python files from The Stack.
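
For reference, here is a minimal sketch of what the finetuning step could look like for one of these configurations. It assumes TRL's `SFTTrainer` trained on the raw program text, with `tiny_programs` being the generated objects from earlier; it is one reasonable way to set this up, not the exact training script behind the numbers below.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Assumption: `tiny_programs` is the list of TinyProgram objects from the generation step.
# Each of the five configurations just swaps in a different list of programs here.
train_ds = Dataset.from_dict({"text": [tp.code for tp in tiny_programs]})

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM-360M",   # the baseline checkpoint on the Hugging Face Hub
    train_dataset=train_ds,
    args=SFTConfig(output_dir="smollm-tinyprograms"),
)
trainer.train()
```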

### Filtering for Quality

To filter the tiny programs, I similarly used `fastdata`:

```python
from typing import Literal

class TinyProgramCritique:
    """
    A critique of a tiny program.
    """
    def __init__(
        self,
        critique: str,  # A critique of the code.
        score: Literal[1, 2, 3, 4, 5],  # A score of the code from 1 to 5.
    ): store_attr()

    __repr__ = basic_repr(["critique", "score"])
```

And here is the prompt I used to guide the model to generate a score:

```python
critique_template = """\
Below is a code snippet. Evaluate its educational value for teaching programming to beginners in this language, using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
- Add 1 point if the code is syntactically correct and runs without errors, providing a basic example of working code in the language.
- Add another point if the code demonstrates fundamental programming concepts (e.g., variables, control structures, functions) in a straightforward manner, even if it's not optimized or doesn't follow all best practices.
- Award a third point if the code is well-commented, explaining key concepts and the purpose of different code sections. It should be readable and illustrate good naming conventions, making it easier for beginners to understand.
- Grant a fourth point if the code showcases language-specific features or common programming patterns in an accessible way. It should provide clear examples of how to apply these concepts practically.
- Bestow a fifth point if the code is an exemplary teaching tool, striking an excellent balance between simplicity and real-world applicability. It should inspire further learning, possibly including deliberate mistakes or opportunities for improvement that a teacher could use as discussion points.
The code snippet:
{code}
After examining the code:
- Briefly justify your total score, up to 100 words, focusing on its effectiveness as a teaching tool for beginners.
- Conclude with the score.
"""
```

This is the distribution of the scores for the 992 tiny Python programs:

| Score | Count |
|-------|-------|
| 1 | 25 |
| 2 | 117 |
| 3 | 96 |
| 4 | 256 |
| 5 | 498 |

And here is the same for 10,000 of the Python files:

| Score | Count |
|-------|-------|
| 1 | 2239 |
| 2 | 5230 |
| 3 | 1545 |
| 4 | 618 |
| 5 | 236 |

I only kept programs with a score of 4 or 5 as high-quality data, for both the tiny Python programs and the Python files from The Stack.
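
Here is a sketch of how the critique pass and the score cut-off can be wired together, reusing the `FastData` object from the generation step; the variable names are illustrative, but the `>= 4` threshold matches the filtering described above.

```python
# Assumption: `fd` and `tiny_programs` come from the generation sketch earlier.
critiques = fd.generate(
    prompt_template=critique_template,
    inputs=[{"code": tp.code} for tp in tiny_programs],
    schema=TinyProgramCritique,
)

# Keep only the programs the critic scored 4 or 5.
high_quality = [tp for tp, c in zip(tiny_programs, critiques) if c.score >= 4]
print(f"Kept {len(high_quality)} of {len(tiny_programs)} programs")
```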

### Results

| Setup | pass@1 |
|---------|--------|
| Baseline | 11.6% |
| TinyPrograms | 9.1% |
| The Stack | 11.0% |
| TinyPrograms Filtered | 12.2% |
| The Stack Filtered | 8.5% |
| Mixed Filtered | 9.8% |
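
For context, here is a hedged sketch of how pass@1 numbers like these are commonly produced, using OpenAI's `human-eval` package and a `transformers` text-generation pipeline. The checkpoint path is assumed from the training sketch above, and this is one common way to run the benchmark rather than necessarily the exact harness used for this table.

```python
from human_eval.data import read_problems, write_jsonl
from transformers import pipeline

# Assumption: the finetuned checkpoint was saved to this local directory.
generator = pipeline("text-generation", model="smollm-tinyprograms")

problems = read_problems()
samples = []
for task_id, problem in problems.items():
    out = generator(problem["prompt"], max_new_tokens=256, do_sample=False)[0]["generated_text"]
    completion = out[len(problem["prompt"]):]  # keep only the generated continuation
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Then score the completions with the human-eval CLI:
#   evaluate_functional_correctness samples.jsonl
```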

### Key findings from the experiment:

1. Quality-filtered synthetic data clearly outperforms quality-filtered GitHub programs (TinyPrograms Filtered at 12.2% vs The Stack Filtered at 8.5%); without filtering, both fall below the baseline and The Stack comes out slightly ahead (11.0% vs 9.1%).
2. Only high-quality synthetic data (TinyPrograms Filtered) slightly improves performance over the baseline.
3. All other setups degrade performance, with the high-quality Python files showing the most significant drop.
4. The unexpectedly poor performance of the high-quality Python files warrants further investigation. Possible explanations include:
    - The scoring system may not be as effective for GitHub programs as for synthetic ones.
    - There might be a lack of diversity in the sampled GitHub programs.

For further exploration, I encourage you to:
1. Replicate this experiment with your own task.
2. Experiment with larger datasets to see how they affect model performance.
3. Share your findings with the community and reach out if you need assistance!

You can follow the rest of this README to reproduce my results and use it as a starting point for your own project!

## Install

Make sure you've installed `fastdata` with the following command from the root of the repo: