
BoLoCo: Boolean Logic Expression Generator

🚀 Version 2.0 - Enhanced with AI/ML Integration

License: MIT · Python 3.8+ · Tests: Passing

BoLoCo is an enhanced toolkit for generating Boolean logic expression datasets with rich metadata, designed for training and evaluating logical reasoning capabilities in AI models. Version 2.0 introduces comprehensive JSON/JSONL data formats, HuggingFace integration, and enhanced metadata tracking.

What's New in Version 2.0

  • 🎯 JSON/JSONL Formats: Rich structured data with comprehensive metadata
  • 📊 Enhanced Metadata: Automatic complexity scoring, operator analysis, nesting depth
  • 🔄 Multiple Formats: Support for JSON, JSONL, and HuggingFace formats
  • 📝 Auto-Generated Dataset Cards: HuggingFace-compatible documentation
  • ✅ Input Validation: Comprehensive error checking and dataset validation
  • 🎨 Rich CLI Experience: Beautiful output with progress indicators (optional)
  • 🤗 HuggingFace Ready: Direct compatibility with datasets library
  • 🔀 Single CLI: Clean, focused interface

🎯 Use Cases

  • 🧠 AI Research: Training logical reasoning models
  • 📚 Educational: Teaching Boolean logic concepts
  • 🔬 Benchmarking: Evaluating model logical capabilities
  • 🏗️ Synthetic Data: Generating structured logical datasets
  • 🎮 Game AI: Rule-based system training

📦 Installation

Using Poetry (Recommended)

git clone https://github.com/klusai/boloco.git
cd boloco
poetry install  # Basic installation

Enhanced Features

poetry install --extras enhanced  # Adds HuggingFace + Rich CLI

All Features

poetry install --extras full  # Includes transformers for advanced features

Development Setup

poetry install --with dev  # All development tools included
make dev-setup             # Complete development environment

Using pip (Alternative)

pip install boloco                    # From PyPI (when published)
pip install "boloco[enhanced]"        # With enhanced features
pip install "boloco[full]"            # With all features

🚀 Quick Start Examples

Enhanced CLI (Recommended)

# Generate a dataset with rich metadata
python3 -m boloco.cli generate --max-tokens 5 --output-dir ./data

# Generate with specific error ratio and format
python3 -m boloco.cli generate \
  --max-tokens 7 \
  --error-ratio 0.1 \
  --output-dir ./my_dataset \
  --format jsonl

# Generate with all formats
python3 -m boloco.cli generate --max-tokens 5 --output-dir ./data --format all

# Note: After installation with Poetry, you can also use 'poetry run boloco' or just 'boloco' directly

🎯 Enhanced Dataset Format

Example Output (JSONL)

{
  "expression": "( T OR F ) AND NOT F",
  "evaluation": "T", 
  "tokens": ["(", "T", "OR", "F", ")", "AND", "NOT", "F"],
  "metadata": {
    "token_count": 8,
    "operator_count": 3,
    "literal_count": 3,
    "nesting_depth": 1,
    "has_negation": true,
    "is_error": false,
    "complexity_score": 15.0
  },
  "reasoning_steps": [],
  "error_type": null,
  "created_at": "2025-01-15T10:30:00Z"
}

Rich Metadata Features

  • Complexity Scoring: Automated difficulty assessment based on multiple factors
  • Operator Analysis: Count and distribution of logical operators (AND, OR, NOT)
  • Structural Analysis: Nesting depth, parentheses usage, token counting
  • Error Classification: Systematic categorization of invalid expressions
  • Provenance Tracking: Complete generation history and configuration
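
As an illustrative sketch (not BoLoCo's actual implementation), most of these metadata fields can be derived directly from a token list. The `analyze` helper below is hypothetical, and the complexity score is omitted because its exact formula is internal to the library:

```python
# Hypothetical re-derivation of the metadata fields shown in the JSONL
# example above; BoLoCo's real analysis may differ in details.
OPERATORS = {"AND", "OR", "NOT"}
LITERALS = {"T", "F"}

def analyze(tokens):
    """Compute structural metadata for a tokenized Boolean expression."""
    depth = max_depth = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif tok == ")":
            depth -= 1
    return {
        "token_count": len(tokens),
        "operator_count": sum(t in OPERATORS for t in tokens),
        "literal_count": sum(t in LITERALS for t in tokens),
        "nesting_depth": max_depth,
        "has_negation": "NOT" in tokens,
    }
```

Running `analyze` on the tokens from the example record (`( T OR F ) AND NOT F`) reproduces the counts shown there.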

🤗 HuggingFace Integration

Direct Dataset Loading

from datasets import load_dataset

# Load from generated files
dataset = load_dataset("json", data_files={
    "train": "data/dataset_train.jsonl",
    "validation": "data/dataset_validation.jsonl",
    "test": "data/dataset_test.jsonl"
})

# Access examples with rich metadata
for example in dataset["train"]:
    print(f"Expression: {example['expression']}")
    print(f"Result: {example['evaluation']}")
    print(f"Complexity: {example['metadata']['complexity_score']}")
    print(f"Has negation: {example['metadata']['has_negation']}")

Programmatic Generation

from boloco.enhanced import BoLoCoDataset, BoLoCoExample
from boloco.cli import BoLoCoGenerator

# Configure generation
config = {
    "max_tokens": 7,
    "error_ratio": 0.1,
    "train_ratio": 0.7,
    "validate_ratio": 0.15,
    "test_ratio": 0.15,
    "seed": 42
}

# Generate dataset
generator = BoLoCoGenerator(config)
dataset = generator.generate_dataset()

# Export in multiple formats
dataset.save_json("complete_dataset.json")
dataset.save_jsonl("dataset.jsonl") 
dataset.save_legacy_format("./legacy/")
dataset.create_dataset_card("README.md")

# Convert to HuggingFace format (if datasets installed)
hf_dataset = dataset.to_huggingface_dataset()
if hf_dataset:
    hf_dataset.save_to_disk("./hf_dataset")

🔧 Configuration Options

Enhanced CLI Parameters

python3 -m boloco.cli generate \
  --max-tokens 10 \
  --error-ratio 0.1 \
  --train-ratio 0.8 \
  --validate-ratio 0.1 \
  --test-ratio 0.1 \
  --seed 42 \
  --output-dir ./data \
  --format all \
  --name "my-dataset" \
  --version "1.0.0"

  • --max-tokens: Expression complexity (1-50)
  • --error-ratio: Proportion of error examples (0.0-1.0)
  • --train-ratio / --validate-ratio / --test-ratio: Split ratios (test ratio is auto-calculated if not specified)
  • --seed: Reproducibility seed
  • --output-dir: Output directory
  • --format: json|jsonl|hf|legacy|all
  • --name: Dataset name
  • --version: Dataset version

Legacy CLI Parameters (Unchanged)

python -m boloco.boloco \
  --mode generate \
  --max_tokens 5 \
  --error_ratio 0.05 \
  --dir data \
  --train_ratio 0.7 \
  --validate_ratio 0.15 \
  --test_ratio 0.15 \
  --seed 42

  • --mode: generate|stats
  • --max_tokens: Maximum tokens per expression
  • --error_ratio: Error proportion
  • --dir: Output directory
  • --train_ratio / --validate_ratio / --test_ratio: Split ratios
  • --seed: Random seed

📁 Output Structure

Enhanced Format Output

data/
├── dataset.json              # Complete dataset with metadata
├── dataset_train.jsonl       # Training split (JSONL)
├── dataset_validation.jsonl  # Validation split
├── dataset_test.jsonl        # Test split
├── README.md                 # Auto-generated dataset card
└── hf_dataset/              # HuggingFace format (if enabled)
    ├── dataset_info.json
    ├── train/
    ├── validation/
    └── test/

🎓 Advanced Usage Examples

Research Workflow

from boloco.cli import BoLoCoGenerator

# Generate research dataset
config = {
    "max_tokens": 15,
    "error_ratio": 0.2,
    "name": "logical-reasoning-benchmark",
    "version": "1.0.0",
    "description": "Boolean logic benchmark for AI reasoning"
}

generator = BoLoCoGenerator(config)
dataset = generator.generate_dataset()

# Analyze complexity distribution
stats = dataset.metadata["statistics"]
print(f"Average complexity: {stats['train']['avg_complexity']:.2f}")
print(f"Max nesting depth: {stats['train']['max_nesting_depth']}")

# Filter by complexity for progressive training
hf_dataset = dataset.to_huggingface_dataset()
if hf_dataset:
    simple_examples = hf_dataset["train"].filter(
        lambda x: x["metadata"]["complexity_score"] < 10
    )
    complex_examples = hf_dataset["train"].filter(
        lambda x: x["metadata"]["complexity_score"] >= 10
    )

Model Training Pipeline

from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset
dataset = load_dataset("json", data_files="dataset.json")

# Prepare for transformer training (GPT-2 ships without a pad token,
# so reuse EOS as the padding token before calling the tokenizer with padding=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def prepare_examples(examples):
    inputs = [f"Evaluate: {expr}" for expr in examples["expression"]]
    targets = examples["evaluation"]
    return tokenizer(inputs, targets, truncation=True, padding=True)

# Tokenize and prepare
tokenized_dataset = dataset.map(prepare_examples, batched=True)

# Filter by complexity for curriculum learning
easy_examples = dataset["train"].filter(
    lambda x: x["metadata"]["complexity_score"] < 8
)
hard_examples = dataset["train"].filter(
    lambda x: x["metadata"]["complexity_score"] >= 8
)

Integration with PyTorch

from torch.utils.data import DataLoader
from datasets import load_dataset
import torch

dataset = load_dataset("json", data_files="dataset.json")
dataloader = DataLoader(dataset["train"], batch_size=32, shuffle=True)

for batch in dataloader:
    expressions = batch["expression"]
    evaluations = batch["evaluation"]
    complexity_scores = batch["metadata"]["complexity_score"]
    
    # Use complexity scores for curriculum learning
    easy_mask = complexity_scores < 8
    hard_mask = complexity_scores >= 8
    
    # Train your model with progressive difficulty
    # model.train_step(expressions[easy_mask], evaluations[easy_mask])

📊 Dataset Statistics & Analysis

The enhanced version automatically computes comprehensive statistics:

  • Distribution Analysis: True/False/Error ratios across splits
  • Complexity Metrics: Average complexity scores and distributions
  • Operator Analysis: AND/OR/NOT usage patterns
  • Structural Analysis: Nesting depth and parentheses usage
  • Quality Metrics: Error rates and validation scores

Example output:

Dataset Statistics:
  Train: 90 examples, avg complexity: 8.45
  Validation: 18 examples, avg complexity: 8.72
  Test: 22 examples, avg complexity: 8.23

Operator Distribution:
  Train: AND=45, OR=38, NOT=23
  Validation: AND=9, OR=8, NOT=5
  Test: AND=11, OR=10, NOT=6

🧪 Testing

Run the comprehensive test suite:

# Run all tests
make test                    # or poetry run pytest tests/

# Run tests with verbose output
make test-verbose           # or poetry run pytest tests/ -vv

# Run with coverage
make test-coverage          # Generate coverage reports

# Quick demo
make demo                   # or poetry run boloco generate --max-tokens 3 --format json

Current Test Status: 4/5 tests passing ✅

  • ✅ BoLoCoExample creation
  • ✅ BoLoCoDataset functionality
  • ✅ CLI configuration validation
  • ⚠️ File operations (minor issue with empty statistics display)
  • ⚠️ HuggingFace integration (requires optional dependency)

🚀 Performance & Scalability

Generation Speed

  • Small datasets (max_tokens=5): ~130 expressions in <0.01s
  • Medium datasets (max_tokens=10): ~1000+ expressions in <0.1s
  • Large datasets (max_tokens=15): ~10000+ expressions in <1s

Memory Efficiency

  • Streaming JSONL: Memory-efficient for large datasets
  • Lazy Loading: Only load data when needed
  • Batch Processing: Efficient handling of multiple files

Format Support

  • Input: Legacy TXT format
  • Output: JSON, JSONL, HuggingFace, Legacy TXT
  • Validation: All formats supported
  • Conversion: Bidirectional between all formats

🤝 Contributing

We welcome contributions! The modern codebase is designed for extensibility:

  1. Fork the repository
  2. Create a feature branch
  3. Add your enhancements
  4. Test with both legacy and modern formats
  5. Submit a pull request

Development Setup

git clone https://github.com/klusai/boloco.git
cd boloco
make dev-setup              # Complete setup with pre-commit hooks

# Run tests
make test                   # or poetry run pytest tests/

# Run quality checks
make quality               # Format, lint, and type-check

# Generate sample data for testing
make demo                  # Quick demo
make run-cli              # Full CLI demo

Architecture Overview

  • boloco/enhanced.py - Enhanced data structures and I/O
  • boloco/cli.py - Enhanced CLI interface
  • tests/ - Comprehensive test suite with pytest
  • pyproject.toml - Poetry configuration and dependencies
  • Makefile - Development workflow automation

📈 Streamlined & Focused

✅ Clean Architecture

  • Single CLI: One focused, enhanced interface
  • Modern Formats: JSON, JSONL, and HuggingFace support
  • Rich Metadata: Comprehensive analysis and statistics
  • Easy Integration: Direct compatibility with ML workflows

🛠️ Development Workflow

BoLoCo uses Poetry for modern Python dependency management and pytest for testing:

Common Commands

make help                   # Show all available commands
make install                # Install basic dependencies
make install-dev            # Install with development tools
make test                   # Run test suite
make lint                   # Check code quality
make format                 # Format code
make build                  # Build distribution packages
make clean                  # Clean build artifacts

Poetry Commands

poetry install              # Install dependencies
poetry add <package>        # Add new dependency
poetry remove <package>     # Remove dependency
poetry update               # Update dependencies
poetry run <command>        # Run command in virtual environment
poetry shell               # Activate virtual environment
poetry build               # Build package
poetry publish             # Publish to PyPI

Quality & Testing

poetry run black .         # Format code
poetry run isort .          # Sort imports
poetry run flake8 .         # Lint code
poetry run mypy boloco      # Type checking
poetry run pytest tests/   # Run tests

📚 Documentation & Resources

  • Enhanced API: See boloco/enhanced.py for full API
  • CLI Reference: poetry run boloco --help for all commands
  • Development: make help for development workflow
  • Test Examples: tests/ for usage patterns
  • Generated Cards: Auto-created README.md files for datasets

🔍 Troubleshooting

Common Issues

Q: "python: command not found"
A: Use python3 instead of python

Q: "No module named 'datasets'"
A: Install with pip install datasets or use pip install -e ".[enhanced]"

Q: "Rich output not showing"
A: Install with pip install rich or use pip install -e ".[enhanced]"

Getting Help

  • Check test suite: make test or poetry run pytest tests/
  • Quick demo: make demo or poetry run boloco generate --max-tokens 3 --output-dir ./test
  • Review logs: Enhanced CLI provides detailed error messages
  • All commands: make help for available development commands

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
