The cache in bpe() may occupy a large amount of memory after running for a long time #35

Open
binjie09 opened this issue Apr 1, 2023 · 1 comment

Comments


binjie09 commented Apr 1, 2023

I use a large amount of Chinese text with the GPT service, and the Chinese phrases cached here end up occupying a significant amount of memory.

After running for one day it occupied more than 1 GB of memory, which made me think for a moment that there was a memory leak in my own code.

GPT-3-Encoder/Encoder.js

Lines 87 to 153 in 9df47fc

function bpe(token) {
  // Results are memoized per token in a module-level cache that is never evicted.
  if (cache.has(token)) {
    return cache.get(token)
  }

  let word = token.split('')
  let pairs = get_pairs(word)

  if (!pairs) {
    return token
  }

  while (true) {
    // Pick the pair with the lowest merge rank in bpe_ranks.
    const minPairs = {}
    Array.from(pairs).map(pair => {
      const rank = bpe_ranks[pair]
      minPairs[(isNaN(rank) ? 10e10 : rank)] = pair
    })

    const bigram = minPairs[Math.min(...Object.keys(minPairs).map(x => {
      return parseInt(x)
    }))]

    if (!(bigram in bpe_ranks)) {
      break
    }

    const first = bigram[0]
    const second = bigram[1]
    let new_word = []
    let i = 0

    // Merge every occurrence of (first, second) in the current word.
    while (i < word.length) {
      const j = word.indexOf(first, i)
      if (j === -1) {
        new_word = new_word.concat(word.slice(i))
        break
      }
      new_word = new_word.concat(word.slice(i, j))
      i = j

      if (word[i] === first && i < word.length - 1 && word[i + 1] === second) {
        new_word.push(first + second)
        i = i + 2
      } else {
        new_word.push(word[i])
        i = i + 1
      }
    }

    word = new_word
    if (word.length === 1) {
      break
    } else {
      pairs = get_pairs(word)
    }
  }

  word = word.join(' ')
  // Every distinct token is cached forever; this is what grows over time.
  cache.set(token, word)
  return word
}
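
To make the growth easier to observe from outside the library, here is a rough reproduction sketch. It is only illustrative: it assumes the package is used through its public encode() export (the internal bpe() cache is not exported), and the randomChinese() helper is made up for the demo so that nearly every call hands bpe() a token it has not cached before.

const { encode } = require('gpt-3-encoder')

// Hypothetical helper: build a random CJK string so that almost every call
// produces a token bpe() has not seen (and therefore cached) before.
function randomChinese(length) {
  let s = ''
  for (let i = 0; i < length; i++) {
    s += String.fromCharCode(0x4e00 + Math.floor(Math.random() * 0x5200))
  }
  return s
}

for (let i = 1; i <= 500000; i++) {
  encode(randomChinese(4))
  if (i % 50000 === 0) {
    // Heap usage keeps climbing because every distinct token stays in the cache.
    const mb = process.memoryUsage().heapUsed / 1024 / 1024
    console.log(`${i} strings encoded, heapUsed ≈ ${Math.round(mb)} MB`)
  }
}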

niieani referenced this issue in niieani/gpt-tokenizer Apr 16, 2023
requires providing an additional argument with cache if you want to make it shared

fixes #35

niieani commented Apr 16, 2023

Hi @binjie09. My PR #38 fixes this by requiring the cache to be passed in explicitly. That way you can control it however you like. You could even implement a custom Map that removes old entries once some limit is reached.

If you can't wait for the PR to get merged, I've published my fork as gpt-tokenizer.
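
For illustration, here is a minimal sketch of the bounded cache described above: a Map subclass that evicts its oldest entry once a size limit is reached. How the cache is actually passed into the tokenizer depends on the final API of PR #38 / gpt-tokenizer, so check its README; the class itself is plain JavaScript.

class BoundedMap extends Map {
  constructor(maxSize = 50000) {
    super()
    this.maxSize = maxSize
  }

  set(key, value) {
    // Map preserves insertion order, so the first key is always the oldest entry.
    if (!this.has(key) && this.size >= this.maxSize) {
      this.delete(this.keys().next().value)
    }
    return super.set(key, value)
  }
}

Anything that implements has/get/set like this can stand in for the unbounded module-level cache, since those are the only operations the bpe() code above relies on.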
