community: Add Baichuan Text Embedding Model and Baichuan Inc introduction #16568

Merged
merged 1 commit into from
Jan 26, 2024
13 changes: 13 additions & 0 deletions docs/docs/integrations/providers/baichuan.mdx
@@ -0,0 +1,13 @@
# Baichuan

>[Baichuan Inc.](https://www.baichuan-ai.com/) is a Chinese startup in the era of AGI, dedicated to addressing fundamental human needs: Efficiency, Health, and Happiness.

## Visit Us
Visit us at https://www.baichuan-ai.com/.
Register and get an API key if you are trying out our APIs.
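
The embedding integration in this PR reads the key from the `baichuan_api_key` argument or, failing that, from the `BAICHUAN_API_KEY` environment variable, with `BAICHUAN_AUTH_TOKEN` as a fallback. The sketch below mirrors that lookup order in plain Python so it can be checked without network access. `resolve_api_key` is a hypothetical helper for illustration only, not part of `langchain_community`, and it simplifies what the real validator does.

```python
from typing import Mapping, Optional


def resolve_api_key(explicit_key: Optional[str], env: Mapping[str, str]) -> str:
    """Return the first available key: explicit argument, then environment.

    Hypothetical helper; the variable names come from the validator in this PR.
    """
    if explicit_key:
        return explicit_key
    # Try the primary variable first, then the fallback name.
    for name in ("BAICHUAN_API_KEY", "BAICHUAN_AUTH_TOKEN"):
        value = env.get(name)
        if value:
            return value
    raise ValueError(
        "No Baichuan API key found; pass one explicitly or set BAICHUAN_API_KEY."
    )


# A plain dict stands in for os.environ so the example has no side effects.
key = resolve_api_key(None, {"BAICHUAN_AUTH_TOKEN": "sk-example"})
```

In real use, `os.environ` would be passed as `env`.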

## Baichuan Chat Model
An example is available at [example](/docs/integrations/chat/baichuan).

## Baichuan Text Embedding Model
An example is available at [example](/docs/integrations/text_embedding/baichuan).
75 changes: 75 additions & 0 deletions docs/docs/integrations/text_embedding/baichuan.ipynb
@@ -0,0 +1,75 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Baichuan Text Embeddings\n",
"\n",
"As of today (Jan 25th, 2024), BaichuanTextEmbeddings ranks #1 on the C-MTEB (Chinese Multi-Task Embedding Benchmark) leaderboard.\n",
"\n",
"Leaderboard (Under Overall -> Chinese section): https://huggingface.co/spaces/mteb/leaderboard\n",
"\n",
"Official Website: https://platform.baichuan-ai.com/docs/text-Embedding\n",
"An API key is required to use this embedding model. You can get one by registering at https://platform.baichuan-ai.com/docs/text-Embedding.\n",
"BaichuanTextEmbeddings supports a 512-token window and produces vectors with 1024 dimensions.\n",
"\n",
"Please NOTE that BaichuanTextEmbeddings only supports Chinese text embedding. Multi-language support is coming soon.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"from langchain_community.embeddings import BaichuanTextEmbeddings\n",
"\n",
"# Place your Baichuan API-key here.\n",
"embeddings = BaichuanTextEmbeddings(baichuan_api_key=\"sk-*\")\n",
"\n",
"text_1 = \"今天天气不错\"\n",
"text_2 = \"今天阳光很好\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"query_result = embeddings.embed_query(text_1)\n",
"query_result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"doc_result = embeddings.embed_documents([text_1, text_2])\n",
"doc_result"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
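
The two sample sentences in the notebook (`text_1` and `text_2`) are close in meaning, so a natural follow-up is to compare their vectors. A minimal sketch, assuming the values returned by `embed_documents` are plain lists of floats: `cosine_similarity` is a hypothetical helper rather than a LangChain function, and the short vectors are made-up stand-ins for the 1024-dimensional output.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional stand-ins for doc_result[0] and doc_result[1].
vec_1 = [1.0, 2.0, 3.0]
vec_2 = [2.0, 4.0, 6.0]
similarity = cosine_similarity(vec_1, vec_2)
```

With real embeddings, the call would be `cosine_similarity(doc_result[0], doc_result[1])`, provided the API call succeeded and did not return `None`.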
54 changes: 29 additions & 25 deletions docs/docs/integrations/vectorstores/kdbai.ipynb
@@ -167,9 +167,9 @@
],
"source": [
"%%time\n",
-"URL = 'https://www.conseil-constitutionnel.fr/node/3850/pdf'\n",
-"PDF = 'Déclaration_des_droits_de_l_homme_et_du_citoyen.pdf'\n",
-"open(PDF, 'wb').write(requests.get(URL).content)"
+"URL = \"https://www.conseil-constitutionnel.fr/node/3850/pdf\"\n",
+"PDF = \"Déclaration_des_droits_de_l_homme_et_du_citoyen.pdf\"\n",
+"open(PDF, \"wb\").write(requests.get(URL).content)"
]
},
{
@@ -208,7 +208,7 @@
],
"source": [
"%%time\n",
-"print('Read a PDF...')\n",
+"print(\"Read a PDF...\")\n",
"loader = PyPDFLoader(PDF)\n",
"pages = loader.load_and_split()\n",
"len(pages)"
@@ -252,12 +252,14 @@
],
"source": [
"%%time\n",
-"print('Create a Vector Database from PDF text...')\n",
-"embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')\n",
+"print(\"Create a Vector Database from PDF text...\")\n",
+"embeddings = OpenAIEmbeddings(model=\"text-embedding-ada-002\")\n",
"texts = [p.page_content for p in pages]\n",
"metadata = pd.DataFrame(index=list(range(len(texts))))\n",
-"metadata['tag'] = 'law'\n",
-"metadata['title'] = 'Déclaration des Droits de l\\'Homme et du Citoyen de 1789'.encode('utf-8')\n",
+"metadata[\"tag\"] = \"law\"\n",
+"metadata[\"title\"] = \"Déclaration des Droits de l'Homme et du Citoyen de 1789\".encode(\n",
+"    \"utf-8\"\n",
+")\n",
"vectordb = KDBAI(table, embeddings)\n",
"vectordb.add_texts(texts=texts, metadatas=metadata)"
]
@@ -288,11 +290,13 @@
],
"source": [
"%%time\n",
-"print('Create LangChain Pipeline...')\n",
-"qabot = RetrievalQA.from_chain_type(chain_type='stuff',\n",
-" llm=ChatOpenAI(model='gpt-3.5-turbo-16k', temperature=TEMP), \n",
-" retriever=vectordb.as_retriever(search_kwargs=dict(k=K)),\n",
-" return_source_documents=True)"
+"print(\"Create LangChain Pipeline...\")\n",
+"qabot = RetrievalQA.from_chain_type(\n",
+"    chain_type=\"stuff\",\n",
+"    llm=ChatOpenAI(model=\"gpt-3.5-turbo-16k\", temperature=TEMP),\n",
+"    retriever=vectordb.as_retriever(search_kwargs=dict(k=K)),\n",
+"    return_source_documents=True,\n",
+")"
]
},
{
@@ -325,9 +329,9 @@
],
"source": [
"%%time\n",
-"Q = 'Summarize the document in English:'\n",
-"print(f'\\n\\n{Q}\\n')\n",
-"print(qabot.invoke(dict(query=Q))['result'])"
+"Q = \"Summarize the document in English:\"\n",
+"print(f\"\\n\\n{Q}\\n\")\n",
+"print(qabot.invoke(dict(query=Q))[\"result\"])"
]
},
{
@@ -362,9 +366,9 @@
],
"source": [
"%%time\n",
-"Q = 'Is it a fair law and why ?'\n",
-"print(f'\\n\\n{Q}\\n')\n",
-"print(qabot.invoke(dict(query=Q))['result'])"
+"Q = \"Is it a fair law and why ?\"\n",
+"print(f\"\\n\\n{Q}\\n\")\n",
+"print(qabot.invoke(dict(query=Q))[\"result\"])"
]
},
{
@@ -414,9 +418,9 @@
],
"source": [
"%%time\n",
-"Q = 'What are the rights and duties of the man, the citizen and the society ?'\n",
-"print(f'\\n\\n{Q}\\n')\n",
-"print(qabot.invoke(dict(query=Q))['result'])"
+"Q = \"What are the rights and duties of the man, the citizen and the society ?\"\n",
+"print(f\"\\n\\n{Q}\\n\")\n",
+"print(qabot.invoke(dict(query=Q))[\"result\"])"
]
},
{
@@ -441,9 +445,9 @@
],
"source": [
"%%time\n",
-"Q = 'Is this law practical ?'\n",
-"print(f'\\n\\n{Q}\\n')\n",
-"print(qabot.invoke(dict(query=Q))['result'])"
+"Q = \"Is this law practical ?\"\n",
+"print(f\"\\n\\n{Q}\\n\")\n",
+"print(qabot.invoke(dict(query=Q))[\"result\"])"
]
},
{
2 changes: 2 additions & 0 deletions libs/community/langchain_community/embeddings/__init__.py
@@ -20,6 +20,7 @@
)
from langchain_community.embeddings.awa import AwaEmbeddings
from langchain_community.embeddings.azure_openai import AzureOpenAIEmbeddings
+from langchain_community.embeddings.baichuan import BaichuanTextEmbeddings
from langchain_community.embeddings.baidu_qianfan_endpoint import (
QianfanEmbeddingsEndpoint,
)
@@ -91,6 +92,7 @@
__all__ = [
"OpenAIEmbeddings",
"AzureOpenAIEmbeddings",
+"BaichuanTextEmbeddings",
"ClarifaiEmbeddings",
"CohereEmbeddings",
"DatabricksEmbeddings",
113 changes: 113 additions & 0 deletions libs/community/langchain_community/embeddings/baichuan.py
@@ -0,0 +1,113 @@
from typing import Any, Dict, List, Optional

import requests
from langchain_core.embeddings import Embeddings
from langchain_core.pydantic_v1 import BaseModel, SecretStr, root_validator
from langchain_core.utils import convert_to_secret_str, get_from_dict_or_env

BAICHUAN_API_URL: str = "http://api.baichuan-ai.com/v1/embeddings"

# BaichuanTextEmbeddings is an embedding model provided by Baichuan Inc. (https://www.baichuan-ai.com/home).
# As of today (Jan 25th, 2024), BaichuanTextEmbeddings ranks #1 on the C-MTEB
# (Chinese Multi-Task Embedding Benchmark) leaderboard.
# Leaderboard (Under Overall -> Chinese section): https://huggingface.co/spaces/mteb/leaderboard

# Official Website: https://platform.baichuan-ai.com/docs/text-Embedding
# An API key is required to use this embedding model. You can get one by registering
# at https://platform.baichuan-ai.com/docs/text-Embedding.
# BaichuanTextEmbeddings supports a 512-token window and produces vectors with
# 1024 dimensions.


# NOTE!! BaichuanTextEmbeddings only supports Chinese text embedding.
# Multi-language support is coming soon.
class BaichuanTextEmbeddings(BaseModel, Embeddings):
    """Baichuan Text Embedding models."""

    session: Any  #: :meta private:
    model_name: str = "Baichuan-Text-Embedding"
    baichuan_api_key: Optional[SecretStr] = None

    @root_validator(allow_reuse=True)
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that auth token exists in environment."""
        try:
            baichuan_api_key = convert_to_secret_str(
                get_from_dict_or_env(values, "baichuan_api_key", "BAICHUAN_API_KEY")
            )
        except ValueError as original_exc:
            try:
                baichuan_api_key = convert_to_secret_str(
                    get_from_dict_or_env(
                        values, "baichuan_auth_token", "BAICHUAN_AUTH_TOKEN"
                    )
                )
            except ValueError:
                raise original_exc
        session = requests.Session()
        session.headers.update(
            {
                "Authorization": f"Bearer {baichuan_api_key.get_secret_value()}",
                "Accept-Encoding": "identity",
                "Content-type": "application/json",
            }
        )
        values["session"] = session
        return values

    def _embed(self, texts: List[str]) -> Optional[List[List[float]]]:
        """Internal method to call Baichuan Embedding API and return embeddings.

        Args:
            texts: A list of texts to embed.

        Returns:
            A list of list of floats representing the embeddings, or None if an
            error occurs.
        """
        try:
            response = self.session.post(
                BAICHUAN_API_URL, json={"input": texts, "model": self.model_name}
            )
            # Check if the response status code indicates success
            if response.status_code == 200:
                resp = response.json()
                embeddings = resp.get("data", [])
                # Sort resulting embeddings by index
                sorted_embeddings = sorted(embeddings, key=lambda e: e.get("index", 0))
                # Return just the embeddings
                return [result.get("embedding", []) for result in sorted_embeddings]
            else:
                # Log error or handle unsuccessful response appropriately
                print(
                    f"""Error: Received status code {response.status_code} from
                    embedding API"""
                )
                return None
        except Exception as e:
            # Log the exception or handle it as needed
            print(f"Exception occurred while trying to get embeddings: {str(e)}")
            return None

    def embed_documents(self, texts: List[str]) -> Optional[List[List[float]]]:
        """Public method to get embeddings for a list of documents.

        Args:
            texts: The list of texts to embed.

        Returns:
            A list of embeddings, one for each text, or None if an error occurs.
        """
        return self._embed(texts)

    def embed_query(self, text: str) -> Optional[List[float]]:
        """Public method to get embedding for a single query text.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text, or None if an error occurs.
        """
        result = self._embed([text])
        return result[0] if result is not None else None
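
`_embed` above sorts the `data` entries by `index` and returns `None` on any failure, so a caller that goes straight to `len(output)` would see a `TypeError` rather than the underlying problem. The parsing step can be sketched in isolation, assuming the payload shape that `_embed` reads (a `data` list whose items carry `index` and `embedding`). `parse_embeddings` is a hypothetical helper, and raising `ValueError` on a count mismatch is one possible alternative, not what this PR implements.

```python
from typing import Any, Dict, List


def parse_embeddings(payload: Dict[str, Any], expected: int) -> List[List[float]]:
    """Extract embeddings from a response payload, ordered by their index.

    Hypothetical helper; the payload shape is assumed from `_embed` above.
    """
    items = payload.get("data", [])
    if len(items) != expected:
        raise ValueError(f"expected {expected} embeddings, got {len(items)}")
    # The API may return items out of order, so restore the input order.
    ordered = sorted(items, key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in ordered]


# Made-up payload with the two items deliberately out of order.
sample = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}
vectors = parse_embeddings(sample, expected=2)
```

Failing loudly here keeps a short or empty response from silently producing fewer vectors than input texts.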
19 changes: 19 additions & 0 deletions libs/community/tests/integration_tests/embeddings/test_baichuan.py
@@ -0,0 +1,19 @@
"""Test Baichuan Text Embedding."""
from langchain_community.embeddings.baichuan import BaichuanTextEmbeddings


def test_baichuan_embedding_documents() -> None:
    """Test Baichuan Text Embedding for documents."""
    documents = ["今天天气不错", "今天阳光灿烂"]
    embedding = BaichuanTextEmbeddings()
    output = embedding.embed_documents(documents)
    assert len(output) == 2
    assert len(output[0]) == 1024


def test_baichuan_embedding_query() -> None:
    """Test Baichuan Text Embedding for query."""
    document = "所有的小学生都会学过只因兔同笼问题。"
    embedding = BaichuanTextEmbeddings()
    output = embedding.embed_query(document)
    assert len(output) == 1024
1 change: 1 addition & 0 deletions libs/community/tests/unit_tests/embeddings/test_imports.py
@@ -3,6 +3,7 @@
EXPECTED_ALL = [
"OpenAIEmbeddings",
"AzureOpenAIEmbeddings",
+"BaichuanTextEmbeddings",
"ClarifaiEmbeddings",
"CohereEmbeddings",
"DatabricksEmbeddings",