
Can tokenizers.fromPretrained cache directory be specified #29

Open
rmrbytes opened this issue Nov 16, 2024 · 3 comments

Comments

@rmrbytes

Thanks for this convenient library for using HF tokenizers. The downloaded JSON file does seem to get cached, but I could not determine its location. Is there a way to specify it via TokenizerConfig?

Thanks.
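(Editor's note: it is not confirmed here whether TokenizerConfig exposes a cache-directory setting. In Go libraries this kind of knob is usually surfaced through functional options; purely as an illustration of what such an option could look like, with every name below hypothetical rather than the library's actual API:)

```go
package main

import "fmt"

// config mirrors a hypothetical TokenizerConfig; cacheDir is the field the
// question asks about. All names here are illustrative, not the library's API.
type config struct {
	cacheDir string
}

// option is the conventional Go functional-option type.
type option func(*config)

// withCacheDir is a hypothetical option that would direct downloads to dir.
func withCacheDir(dir string) option {
	return func(c *config) { c.cacheDir = dir }
}

// fromPretrained sketches how a loader could apply such options, falling
// back to a temp directory when no cache dir is given.
func fromPretrained(model string, opts ...option) *config {
	cfg := &config{cacheDir: "/tmp"}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	cfg := fromPretrained("google-bert/bert-base-uncased", withCacheDir("/data/tokenizers"))
	fmt.Println(cfg.cacheDir) // /data/tokenizers
}
```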

@daulet
Owner

daulet commented Nov 18, 2024

does get cached but could not determine its location

What do you mean by this? Can you paste a repro for the error?

@rmrbytes
Author

There was no error, @daulet. I only found that every time I ran it, it printed the message:

Successfully downloaded /var/folders/xb/50fkm1vj7mj_mb14nvc18h5r0000gn/T/huggingface-tokenizer-432552157/tokenizer.json

And if I put the tokenizer call (tokenizers.FromPretrained("google-bert/bert-base-uncased")) inside a loop, it printed the message that many times. Hence I wondered about the caching and whether we are supposed to specify a directory as an option.

Just to let you know, I am using this inside a Docker environment where I am building a chunking service in Go, and I wanted your function to check the token limit for any specified model (via .env).

Subsequently I created a singleton as follows; now it prints the message only once, as designed:

package splitters

import (
	"log"
	"sync"

	"github.com/daulet/tokenizers"
)

// Global tokenizer variable
var (
	tokenizerInstance *tokenizers.Tokenizer
	once              sync.Once
)

// initTokenizer initializes the tokenizer only once
func initTokenizer() {
	var err error
	tokenizerInstance, err = tokenizers.FromPretrained("google-bert/bert-base-uncased")
	if err != nil {
		log.Fatalf("Failed to load tokenizer: %v", err)
	}
}

// GetTokenizerInstance provides access to the tokenizer instance
func GetTokenizerInstance() *tokenizers.Tokenizer {
	// Ensure the tokenizer is loaded only once using sync.Once
	once.Do(initTokenizer)
	return tokenizerInstance
}

// getTokenLength calculates the number of tokens in the given text
func getTokenLength(input string) int {
	// Get the tokenizer instance
	tokenizer := GetTokenizerInstance()

	// Encode the input text; Encode returns the token IDs and token strings
	ids, _ := tokenizer.Encode(input, true)

	// Return the number of tokens
	return len(ids)
}
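(Editor's note: the sync.Once pattern above can be checked in isolation. Here is a minimal, self-contained sketch with the tokenizer load stubbed out by a counter, since the real call needs the native library; it shows that the initializer body runs exactly once even with concurrent callers:)

```go
package main

import (
	"fmt"
	"sync"
)

// initCount tracks how many times the expensive init actually runs.
var (
	initCount int
	once      sync.Once
)

// loadTokenizer stands in for tokenizers.FromPretrained (stubbed here);
// sync.Once guarantees the body executes at most once, even under concurrency.
func loadTokenizer() int {
	once.Do(func() {
		initCount++ // the real code would download/parse tokenizer.json here
	})
	return initCount
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); loadTokenizer() }()
	}
	wg.Wait()
	fmt.Println("init ran", initCount, "time(s)") // init ran 1 time(s)
}
```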

Thanks for making this convenient library available in Go.

@rmrbytes
Author

I am able to get the number of tokens as desired when I run using go run main.go, but when I try the same via the Dockerfile it gives me the following error:

Failed to load tokenizer: failed to download mandatory file tokenizer.json: failed to download from https://huggingface.co/google-bert/bert-base-uncased/resolve/main/tokenizer.json: Get "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/tokenizer.json": tls: failed to verify certificate: x509: certificate signed by unknown authority

Is an HF token required? Though, I wondered how it was able to download when run locally.

The following is my Dockerfile:

# Stage 1: Build Go Application
FROM golang:1.23.2-bullseye AS builder

WORKDIR /app

# Copy Go modules manifests and download dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy the source code and fetch script
COPY . .
RUN chmod +x fetch_tokenizer_library.sh && ./fetch_tokenizer_library.sh

# Set environment variables for CGO to link to the downloaded library
ENV CGO_ENABLED=1
ENV CGO_LDFLAGS="-L/app/libs/tokenizers -ltokenizers"  
ENV CGO_CXXFLAGS="--std=c++11"

# Build the Go application
RUN go build -o my-app .

# Stage 2: Create Runtime Image
FROM debian:bullseye-slim

# Copy the Go application binary from the builder stage
COPY --from=builder /app/my-app .

# Create data directory directly in the runtime stage
RUN mkdir -p /data

# Expose the port (if needed)
EXPOSE 8080

# Run the application
CMD ["./my-app"]
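(Editor's note: the x509 "certificate signed by unknown authority" error above is commonly caused by the runtime image, debian:bullseye-slim, shipping without CA root certificates, so TLS verification against huggingface.co fails even though the local machine succeeds. A usual fix, assuming that is the cause here, is to install them in the runtime stage:)

```dockerfile
# Stage 2: Create Runtime Image
FROM debian:bullseye-slim

# Install CA root certificates so HTTPS downloads (e.g. huggingface.co) verify
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
```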

And the following is fetch_tokenizer_library.sh:

#!/bin/bash

set -e

# Define where the libraries should be placed within your project
LIB_DIR="./libs/tokenizers"
mkdir -p "$LIB_DIR"

# Detect platform
PLATFORM=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)

# Determine which library to download
if [ "$PLATFORM" = "linux" ] && [ "$ARCH" = "x86_64" ]; then
  LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.linux-amd64.tar.gz"
elif [ "$PLATFORM" = "linux" ] && [ "$ARCH" = "aarch64" ]; then
  LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.linux-arm64.tar.gz"
elif [ "$PLATFORM" = "darwin" ] && [ "$ARCH" = "arm64" ]; then
  LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.darwin-arm64.tar.gz"
else
  echo "Unsupported platform: $PLATFORM-$ARCH"
  exit 1
fi

# Download and extract the pre-built library
echo "Downloading tokenizer library for $PLATFORM-$ARCH..."
curl -L -o "$LIB_DIR/libtokenizers.tar.gz" "$LIB_URL"

# Extract the library into the target directory
echo "Extracting the library..."
tar -xzf "$LIB_DIR/libtokenizers.tar.gz" -C "$LIB_DIR"

# Clean up
rm "$LIB_DIR/libtokenizers.tar.gz"

echo "Library downloaded and extracted successfully to $LIB_DIR"

Thanks
