This project is a focused investigation of concept-aware token modification in transformer architectures. Rather than training a complete language model, it isolates and tests the core mechanism of using concept vectors to modify token representations.
- **Token Embeddings** (`src/molm/embeddings/`)
  - Uses the Nomic-Embed-Text model for generating token embeddings
  - Provides a clean interface for embedding individual tokens and sequences
  - Integrates with Ollama for efficient local embedding generation
- **Concept Space** (`src/molm/concepts/`)
  - Two implementations for concept management:
    - `ConceptSpace`: dynamic concept space with real-time relationship computation
    - `StaticConceptSpace`: efficient pre-computed concept vectors and relationships
  - Integrates with WordNet for semantic relationships
  - Uses spaCy for abstraction-level detection
  - Supports concept harvesting and pre-computation for better performance
- **Token Modification** (`src/molm/modification/`)
  - Implements two-stage concept-aware token modification (see the sketch after this list):
    - First-order concept gathering for individual tokens
    - Second-order concept formation across sequences
  - Balances token-specific meaning with sequence-level coherence
  - Configurable weights for concept influence
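The modification stage blends each token's original embedding with its first-order concepts and with a second-order concept vector aggregated over the whole sequence. As a rough illustration only (the real logic lives in `src/molm/modification/token_modifier.py`), a weighted blend with hypothetical weight names might look like this, assuming plain NumPy vectors:

```python
import numpy as np

def blend_token(token_vec, first_order_vec, second_order_vec,
                token_weight=0.6, first_order_weight=0.25, second_order_weight=0.15):
    """Weighted blend of a token embedding with its first- and second-order concept vectors.

    Weight names and defaults are hypothetical and do not reflect the actual TokenModifier API.
    """
    blended = (token_weight * np.asarray(token_vec)
               + first_order_weight * np.asarray(first_order_vec)
               + second_order_weight * np.asarray(second_order_vec))
    # Re-normalise so the modified vector stays comparable in scale to the original.
    return blended / np.linalg.norm(blended)
```

Keeping the token weight dominant preserves the token's own meaning, while the two concept terms nudge it toward its related concepts and toward sequence-level coherence.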
To set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/molm.git
  cd molm
  ```

- Install the package and its dependencies:

  ```bash
  pip install -e .
  ```

- Install and start Ollama:
  - Follow the instructions on Ollama's website
  - Pull the Nomic-Embed-Text model:

    ```bash
    ollama pull nomic-embed-text
    ```

- Download the required NLTK and spaCy data:

  ```bash
  python -m nltk.downloader wordnet omw-1.4
  python -m spacy download en_core_web_lg
  ```
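Before moving on, it can help to confirm that Ollama is running and serving the embedding model (the repository also includes `scripts/verify_embeddings.py` for a fuller check). A minimal sanity check against Ollama's local REST API, assuming the default port 11434:

```python
import requests

# Ask the local Ollama server for an embedding to confirm nomic-embed-text is ready.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "hello world"},
    timeout=30,
)
resp.raise_for_status()
print(f"OK: got a {len(resp.json()['embedding'])}-dimensional embedding")
```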
A minimal end-to-end example:

```python
from molm.embeddings.nomic_embedder import NomicEmbedder
from molm.concepts.static_concept_space import StaticConceptSpace
from molm.modification.token_modifier import TokenModifier

# Initialize components
embedder = NomicEmbedder()
concept_space = StaticConceptSpace(embedder, "data/concepts")
modifier = TokenModifier(embedder, concept_space)

# Get and modify token embeddings
text = "The neural network processes data efficiently"
token_embeddings = modifier.get_token_embeddings(text)
modified_embeddings = modifier.modify_tokens(token_embeddings)
```
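To see how much the concept stage actually moves each token, one simple check is the cosine similarity between the original and modified vectors. A small sketch, assuming both calls return sequences of NumPy-compatible vectors (the concrete return types may differ):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# How far did the concept stage move each token?
for i, (orig, mod) in enumerate(zip(token_embeddings, modified_embeddings)):
    print(f"token {i}: cosine(original, modified) = {cosine(orig, mod):.3f}")
```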
To build the pre-computed concept space:

```bash
python scripts/harvest_concepts.py
```

This will:

- Harvest concepts and relationships from WordNet (a rough sketch of this step follows the list)
- Detect abstraction levels using spaCy
- Compute concept embeddings using Nomic-Embed-Text
- Build and save the concept relationship graph
- Save all data for efficient loading by `StaticConceptSpace`
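The real pipeline lives in `scripts/harvest_concepts.py` and `src/molm/concepts/harvester.py`. As an illustration of the WordNet step only, the sketch below walks noun synsets with NLTK and records hypernym (is-a) links; the function name is illustrative, and the project's actual harvester additionally detects abstraction levels with spaCy and attaches Nomic-Embed-Text embeddings:

```python
from itertools import islice
from nltk.corpus import wordnet as wn

def harvest_wordnet(max_concepts=1000):
    """Collect noun concepts and their hypernym (is-a) relationships from WordNet."""
    concepts, relations = {}, []
    for synset in islice(wn.all_synsets(pos=wn.NOUN), max_concepts):
        name = synset.lemmas()[0].name().replace("_", " ")
        concepts[name] = synset.definition()
        for hyper in synset.hypernyms():
            hyper_name = hyper.lemmas()[0].name().replace("_", " ")
            relations.append((name, "is_a", hyper_name))
    return concepts, relations

concepts, relations = harvest_wordnet()
print(f"{len(concepts)} concepts, {len(relations)} is_a relations")
```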
Run the demonstration scripts:
```bash
# Basic token modification example
python examples/basic_modification.py

# Concept space exploration and analysis
python examples/concept_space_demo.py

# Detailed concept demonstration
python examples/concept_demonstration.py
```
Project layout:

```
molm/
├── src/
│   └── molm/
│       ├── embeddings/
│       │   └── nomic_embedder.py
│       ├── concepts/
│       │   ├── concept_space.py
│       │   ├── static_concept_space.py
│       │   └── harvester.py
│       └── modification/
│           └── token_modifier.py
├── examples/
│   ├── basic_modification.py
│   ├── concept_space_demo.py
│   └── concept_demonstration.py
├── scripts/
│   ├── harvest_concepts.py
│   └── verify_embeddings.py
├── tests/
├── docs/
└── README.md
```
- **Modular Design**: each component is independent and easily modifiable
- **Type Safety**: comprehensive type hints throughout the codebase
- **Efficient Concept Management** (see the checkpointing sketch after this list):
  - Pre-computed concept vectors and relationships
  - Checkpointing for long-running operations
  - Efficient batch processing for large concept spaces
- **Rich Concept Analysis**:
  - Domain and abstraction-level detection
  - First- and second-order concept relationships
  - Cross-domain concept mapping
- **Configurable**: easy to adjust weights, thresholds, and parameters
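The checkpointing and batching mentioned above follow a generic pattern: process concepts in fixed-size batches and persist progress after each batch so a long harvest can resume. A minimal sketch of that idea, with a hypothetical `embed_batch` callable and checkpoint path rather than the project's actual implementation:

```python
import json
import pathlib

def embed_in_batches(names, embed_batch, checkpoint="concepts_checkpoint.json", batch_size=64):
    """Embed concept names in batches, checkpointing after each batch so a long run can resume.

    `embed_batch` is a hypothetical callable returning one JSON-serialisable vector per name.
    """
    path = pathlib.Path(checkpoint)
    done = json.loads(path.read_text()) if path.exists() else {}
    todo = [n for n in names if n not in done]
    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        for name, vector in zip(batch, embed_batch(batch)):
            done[name] = vector
        path.write_text(json.dumps(done))  # resume point if the run is interrupted
    return done
```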
To contribute:

- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code in your research, please cite:
```bibtex
@software{molm2024,
  title  = {MOLM: Modular Token Modification in Language Models},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/yourusername/molm}
}
```