-
Notifications
You must be signed in to change notification settings - Fork 368
Add fenic-datasets integration #1936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
be8c987
ddd17c8
24b351d
3fd5be9
ee3ad5e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,237 @@ | ||
# fenic | ||
|
||
[fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic provides support for reading datasets directly from the Hugging Face Hub. | ||
|
||
<div class="flex justify-center"> | ||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/fenic_hf.png"/> | ||
</div> | ||
|
||
## Getting Started | ||
|
||
To get started, pip install `fenic`: | ||
|
||
```bash | ||
pip install fenic | ||
``` | ||
|
||
### Create a Session | ||
|
||
Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations): | ||
|
||
```python | ||
import fenic as fc | ||
|
||
session = fc.Session.get_or_create(fc.SessionConfig()) | ||
``` | ||
|
||
## Overview | ||
|
||
fenic is an opinionated data processing framework that combines: | ||
- **DataFrame API**: PySpark-inspired operations for familiar data manipulation | ||
- **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering | ||
- **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google) | ||
- **Query Optimization**: Automatic optimization through logical plan transformations | ||
|
||
## Read from Hugging Face Hub | ||
|
||
fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's DataFrameReader interface. | ||
|
||
### Supported Formats | ||
|
||
fenic supports reading the following formats from Hugging Face: | ||
- **Parquet files** (`.parquet`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this just for native parquet datasets, or do you also support auto-converted parquet branches? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Auto converted parquet branches should also be supported. |
||
- **CSV files** (`.csv`) | ||
|
||
### Reading Datasets | ||
|
||
To read a dataset from the Hugging Face Hub: | ||
|
||
```python | ||
import fenic as fc | ||
|
||
session = fc.Session.get_or_create(fc.SessionConfig()) | ||
|
||
# Read a CSV file from a public dataset | ||
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv") | ||
|
||
# Read Parquet files using glob patterns | ||
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") | ||
|
||
# Read from a specific dataset revision | ||
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet") | ||
``` | ||
|
||
### Reading with Schema Management | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. might be good to explain (briefly) what schema means in this context? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good catch! I added a note there with explanation and links to documentation. Auto merging and reconciling schemas is tricky and people should be aware of the limitations. |
||
|
||
```python | ||
# Read multiple CSV files with schema merging | ||
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True) | ||
|
||
# Read multiple Parquet files with schema merging | ||
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True) | ||
``` | ||
|
||
> **Note:** In fenic, a schema is the set of column names and their data types. When you enable `merge_schemas`, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged—consult the fenic docs for [CSV schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.csv) and [Parquet schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.parquet). | ||
|
||
### Authentication | ||
|
||
To read private datasets, you need to set your Hugging Face token as an environment variable: | ||
|
||
```shell | ||
export HF_TOKEN="your_hugging_face_token_here" | ||
``` | ||
|
||
### Path Format | ||
|
||
The Hugging Face path format in fenic follows this structure: | ||
``` | ||
hf://{repo_type}/{repo_id}/{path_to_file} | ||
``` | ||
|
||
You can also specify dataset revisions or versions: | ||
``` | ||
hf://{repo_type}/{repo_id}@{revision}/{path_to_file} | ||
``` | ||
|
||
Features: | ||
- Supports glob patterns (`*`, `**`) | ||
- Dataset revisions/versions using `@` notation: | ||
- Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e` | ||
- Branch: `@refs/convert/parquet` | ||
- Branch alias: `@~parquet` | ||
- Requires `HF_TOKEN` environment variable for private datasets | ||
|
||
### Mixing Data Sources | ||
|
||
fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols: | ||
|
||
```python | ||
# Mix HF and local files in one read call | ||
df = session.read.parquet([ | ||
"hf://datasets/cais/mmlu/astronomy/*.parquet", | ||
"file:///local/data/*.parquet", | ||
"./relative/path/data.parquet" | ||
]) | ||
``` | ||
|
||
This flexibility allows you to seamlessly combine data from Hugging Face Hub and local files in your data processing pipeline. | ||
|
||
## Processing Data from Hugging Face | ||
|
||
Once loaded from Hugging Face, you can use fenic's full DataFrame API: | ||
|
||
### Basic DataFrame Operations | ||
|
||
```python | ||
import fenic as fc | ||
|
||
session = fc.Session.get_or_create(fc.SessionConfig()) | ||
|
||
# Load IMDB dataset from Hugging Face | ||
df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet") | ||
|
||
# Filter and select | ||
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label") | ||
|
||
# Group by and aggregate | ||
label_counts = df.group_by("label").agg( | ||
fc.count("*").alias("count") | ||
) | ||
``` | ||
|
||
### AI-Powered Operations | ||
|
||
To use semantic and embedding operations, configure language and embedding models in your SessionConfig. Once configured: | ||
|
||
```python | ||
import fenic as fc | ||
|
||
# Requires OPENAI_API_KEY to be set for language and embedding calls | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 on having inference providers examples here! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! I'm planning to write a post about the fenic integration with Datasets and there I'll expand even more. |
||
session = fc.Session.get_or_create( | ||
fc.SessionConfig( | ||
semantic=fc.SemanticConfig( | ||
language_models={ | ||
"gpt-4o-mini": fc.OpenAILanguageModel( | ||
model_name="gpt-4o-mini", | ||
rpm=60, | ||
tpm=60000, | ||
) | ||
}, | ||
embedding_models={ | ||
"text-embedding-3-small": fc.OpenAIEmbeddingModel( | ||
model_name="text-embedding-3-small", | ||
rpm=60, | ||
tpm=60000, | ||
) | ||
}, | ||
) | ||
) | ||
) | ||
|
||
# Load a text dataset from Hugging Face | ||
df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet") | ||
|
||
# Add embeddings to text columns | ||
df_with_embeddings = df.select( | ||
"*", | ||
fc.semantic.embed(fc.col("text")).alias("embedding") | ||
) | ||
|
||
# Apply semantic functions for sentiment analysis | ||
df_analyzed = df_with_embeddings.select( | ||
"*", | ||
fc.semantic.analyze_sentiment( | ||
fc.col("text"), | ||
model_alias="gpt-4o-mini", # Optional: specify model | ||
).alias("sentiment") | ||
) | ||
``` | ||
|
||
## Example: Analyzing MMLU Dataset | ||
|
||
```python | ||
import fenic as fc | ||
|
||
# Requires OPENAI_API_KEY to be set for semantic calls | ||
session = fc.Session.get_or_create( | ||
fc.SessionConfig( | ||
semantic=fc.SemanticConfig( | ||
language_models={ | ||
"gpt-4o-mini": fc.OpenAILanguageModel( | ||
model_name="gpt-4o-mini", | ||
rpm=60, | ||
tpm=60000, | ||
) | ||
}, | ||
) | ||
) | ||
) | ||
|
||
# Load MMLU astronomy subset from Hugging Face | ||
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") | ||
|
||
# Process the data | ||
processed_df = (df | ||
# Filter for specific criteria | ||
.filter(fc.col("subject") == "astronomy") | ||
# Select relevant columns | ||
.select("question", "choices", "answer") | ||
# Add difficulty analysis using semantic.map | ||
.select( | ||
"*", | ||
fc.semantic.map( | ||
"Rate the difficulty of this question from 1-5: {{question}}", | ||
question=fc.col("question"), | ||
model_alias="gpt-4o-mini" # Optional: specify model | ||
).alias("difficulty") | ||
) | ||
) | ||
|
||
# Show results | ||
processed_df.show() | ||
``` | ||
|
||
## Resources | ||
|
||
- [fenic GitHub Repository](https://github.com/typedef-ai/fenic) | ||
- [fenic Documentation](https://docs.fenic.ai/latest/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is "fenic" always styled with lowercase "f"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great question @davanstrien and the answer is yes, but there might be places where capital "F"s have escaped.