
BPE error #1

Closed
ruidongtd opened this issue Sep 7, 2019 · 7 comments

Comments

@ruidongtd

I always got "Unable to fix BPE error". Why?

@flarn2006

flarn2006 commented Sep 9, 2019

I'm getting this too, and it (or something, at least) is also producing incorrect output. Looking at run_simple.py, it's supposed to recover the original message, This is a very secret message!, and print it at the end of the output. But it actually returns something slightly different:

[1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
125
======================================== Encoding ========================================
From this time forward Washington remained with the Army on a number of assignments, serving in various capacities and at various times leading the American forces in the battle of Waterloo. He was head of A Company, which commanded
ppl: 12.60, kl: 0.263, words/bit: 0.30, bits/word: 3.33, entropy: 6.30
Unable to fix BPE error: token received: ĠFrom=3574, text: From this time forward Washington remained with the Army on a number of assignments, serving in various capacities and at various times leading the American forces in the battle of Waterloo. He was head of A Company, which commanded
======================================== Recovered Message ========================================
[0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
================================================================================
A Drake State School teacher has been charged with sexually abusing four students, ages 14, 9, and 5 years.

The 5-year-old and the 3-year-old have been charged with aggravated rape and terroristic threats against the same 10-year-old student.

Paula L. Smith, a Drake

🤔

@ruidongtd
Author

ruidongtd commented Sep 10, 2019

> I'm getting this too. And it, or something at least, is also resulting in incorrect output […]

Hi! It's because tokenizer.encode and tokenizer.decode of GPT-2 are not exact inverses of each other:

# enc is the GPT-2 tokenizer, e.g.
# from pytorch_transformers import GPT2Tokenizer
# enc = GPT2Tokenizer.from_pretrained('gpt2')
test_str = "hello world."
print("test_str = ", test_str)
out = enc.encode(test_str)
print("out = ", out)
decode_str = enc.decode(out)
print("decode_str = ", decode_str)
print("enc.encode(decode_str) = ", enc.encode(decode_str))

output:

test_str =  hello world.
out =  [23748, 995, 13]
decode_str =   hello world.
enc.encode(decode_str) =  [220, 23748, 995, 13]

An additional token 220 (an encoded space) is prepended when the decoded text is re-encoded. So you should add inp = inp[1:] at the start of each decode_xxx() function. For example:

def decode_huffman(model, enc, text, context, bits_per_word, device='cuda'):
    # inp is a list of token indices
    # context is a list of token indices
    inp = enc.encode(text)
    inp = inp[1:]  # drop the spurious leading space token (id 220)
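The workaround above can be made a bit more defensive by only dropping the leading token when it really is the stray space, rather than unconditionally. A minimal sketch (SPACE_TOKEN_ID and strip_spurious_space are illustrative names, not from the repo; 220 is the bare-space token in the GPT-2 vocabulary, as the output above shows):

```python
SPACE_TOKEN_ID = 220  # GPT-2 token id for a bare space ("G" with dot above, printed as Ġ)

def strip_spurious_space(token_ids):
    """Drop a leading space token introduced by the decode/encode round trip."""
    if token_ids and token_ids[0] == SPACE_TOKEN_ID:
        return token_ids[1:]
    return token_ids

print(strip_spurious_space([220, 23748, 995, 13]))  # -> [23748, 995, 13]
print(strip_spurious_space([23748, 995, 13]))       # unchanged
```

This way the decoder still works if a future tokenizer version stops prepending the space.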

@flarn2006

I think this should be reopened; this might be a good workaround but the bug still exists.

@ruidongtd
Author

> I think this should be reopened; this might be a good workaround but the bug still exists.

It's caused by tokenizer.encode and tokenizer.decode of GPT-2.
It has nothing to do with the steganography itself.
You should read the code.

@flarn2006

flarn2006 commented Sep 10, 2019 via email

@ruidongtd ruidongtd reopened this Sep 15, 2019
@thandongtb

@ruidongtd make sure you have the correct package versions. I got the correct result when running python run_simple.py. Here is my env:

pytorch_transformers==1.1.0
torch==1.0.1
bitarray==1.0.1
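Those pins can be installed in one command (assuming an interpreter old enough to have torch 1.0.1 wheels, e.g. Python 3.6 or 3.7):

```shell
pip install pytorch_transformers==1.1.0 torch==1.0.1 bitarray==1.0.1
```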

And here is my result:

[0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
123
======================================== Encoding ========================================
In 1784, Washington began his military career in Virginia. He served as the first African American to occupy the State Capitol, where he was elected Governor of Virginia. In 1785, Washington pushed for the establishment of a special relationship with Japan. In
ppl: 7.43, kl: 0.152, words/bit: 0.38, bits/word: 2.66, entropy: 5.58
======================================== Recovered Message ========================================
[0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
================================================================================
This is a very secret message!<eos>

@jitendracheripally2003

Hi, I tried to install the requirements, but torch 1.0.1 is unavailable for Python 3.10.
Can I know what version of Python is required, and whether CUDA is needed? If so, which version?

If torch 1.0.1 can't be installed with pip, can I know what other ways there are to get it?

Thank you in advance!
