Decoding script invitation #3

xumx · 2019-10-18T03:27:28Z

Is there a way to send in requests for the decoding script?

I understand the nature of the challenges surrounding reddit toxicity, we just want to try it out privately, and test different prompts.

dreasysnail · 2019-10-19T02:45:32Z

Hi Xumx,

Thanks for your interests. Due to the company policy we cannot share the decoding script freely. We are still work on a channel where researchers can register and apply for the demo access. Please stay tuned and we will update you once it is ready.

qywu · 2019-11-08T06:59:21Z

The decoding script is not really that hard to implement. But I am not fully sure if the inputs are exactly the same with the original implementation due to the weird tokenization implemented in this repo.
Anyway, I have attached the Colab link to whoever is interested:
https://colab.research.google.com/drive/1PslHE4Rl4RqSa20s7HEp0ZKITBir6ezE

andreamad8 · 2019-11-08T09:17:18Z

@qywu I was about to post something similar :)

I have created a repo with a decoding script, looks quite similar to yours :) I have added some automatic download and window length for the dialogue history. If you have sometimes check it out, and let me know if is there something to improve.

I test it a bit yesterday and the responses are actually very good. I have tried some of the input reported in the repo and I can reproduce some of the responses too.

This is the link to the repo: https://github.com/andreamad8/DialoGPT2-Interact

I hope this is helpful

Andrea

qywu · 2019-11-09T01:34:22Z

@andreamad8 Cool. Have you noticed the weird tokenization for words? It seems that they feed tokenized sentences to GPT2, but it is not necessary.

andreamad8 · 2019-11-09T01:41:13Z

Oh, I have noticed that honestly, :) I went straight with the hugging face implementation. Which line are you referring too?
I noticed instead that by using this: tokenizer.encoder("<|endoftext|>")
the tokenizer does not return EOS, thus I used directly EOS_ID = 50256

qywu · 2019-11-09T02:04:42Z

DialoGPT/reddit_extractor/src/reddit.py

Lines 114 to 115 in ef531a9

    
           tokenizer = TweetTokenizer(preserve_case=True) 
        
           txt = ' ' + ' '.join(tokenizer.tokenize(txt)) + ' '

I am not referring to your code, but theirs. For GPT-2, there is no need to tokenize words first. So it doesn't generate sentences like: "Hello , how are you doing ? "

andreamad8 · 2019-11-09T03:36:08Z

Oh, I didn't check that file, but you are right, no need, GPT tokenizer does the job already.

Maybe open another issue? but good to know.

Andrea

LHolten · 2019-11-25T20:03:18Z

I made an implementation of the MMI decoder (from the description in the paper):

https://github.com/LHolten/DialoGTP-MMI-decoder

It features unlimited chat length and usage as a discord bot.

drogozhang · 2019-12-06T16:49:11Z

@qywu Great Job! If we want to batch the input_ids, what should we pad? the 0 results are terrible.

yangjianxin1 · 2019-12-17T12:37:19Z

Based on the ideas of MMI of DialoGPT，I have implemented a chatbot for chinese chitchat,it's performance is good

https://github.com/yangjianxin1/GPT2-chitchat

jsedoc · 2019-12-17T18:17:15Z

@andreamad8 can you post the responses that you got? ... and how you got them. We can't seem to match the ones reported.

yangjianxin1 · 2019-12-18T01:33:20Z

the dialogue generated by chatbot is listed as follow:
Sample 1:
user :你好啊
chatbot:你好，请问您是？
user :我是你的主人啊
chatbot:你是狗吗？
user :你才是狗
chatbot:你不是狗我是你的小主人
user :你这个大笨蛋
chatbot:你大笨蛋，你是想挨打了
user :你全家都是大笨蛋
chatbot:我是你爸爸
user :算了，不跟你吵了，你在干嘛呢
chatbot:我在上班啊，你还没下班呢啊你
user :我周末要做作业啊
chatbot:不是才交了吗，早晚做完啊
user :好多作业啊，还没做完呢
chatbot:我在抄，我妈让我周五交，还有事
user :你就不怕被你妈妈骂吗
chatbot:不怕
user :你周末有去哪里玩吗
chatbot:没时间啊，忙得很啊
user :忙着干嘛呢
chatbot:不知道啊，我周末没有作业，我想睡觉，然后还得找人

Sample 2:
user :在吗
chatbot:在
user :我不开心
chatbot:怎么啦
user :最近好累啊
chatbot:是挺累的
user :作业好多啊
chatbot:还好还好
user :我觉得我快要挂科了
chatbot:我也是啊，但是我觉得我快熬出头了
user :不说了，我还有好多作业没写
chatbot:我也是啊好多啊
user :那先这样，下次再聊，再见了
chatbot:好哒

andreamad8 · 2019-12-18T02:09:39Z

@jsedoc so in my decoding script I use multinomial sampling so the output is a bit different every time. If you want to try pure greedy use top-k 0 and change line 91 with
prev = torch.argmax(probs,dim=1).unsqueeze(0)
I will also update the script.

Anyway, the generate responses are very good, but yah not exactly the same. For example:

USR >>> The trading war between China and US is still happening .
SYS >>> I'm not sure what you mean by that .
USR >>> Does money buy happiness?
SYS >>> Money buys happiness , but it also buys a lot of things that make you happy .

and

USR >>> who won the world cup in 2018 ?
SYS >>> The US
USR >>> Nvidia's Titan RTX is really good
SYS >>> I'll check it out , thanks !
USR >>> Can Porsche beat Tesla with its new Taycan EV ?
SYS >>> I think the Tesla is a bit more powerful than the Porsche .

USR >>> what is the boiling point of water?
SYS >>> I think it's about the same as the boiling point of a human heart .
USR >>> which one is bigger, sun or moon?
SYS >>> The sun .

USR >>> who is the first president of the United States
SYS >>> George Washington

In general, I use top-k sampling.

Let me know if this help.

intersun · 2019-12-18T04:57:26Z

the dialogue generated by chatbot is listed as follow:
Sample 1:
user :你好啊
chatbot:你好，请问您是？
user :我是你的主人啊
chatbot:你是狗吗？
user :你才是狗
chatbot:你不是狗我是你的小主人
user :你这个大笨蛋
chatbot:你大笨蛋，你是想挨打了
user :你全家都是大笨蛋
chatbot:我是你爸爸
user :算了，不跟你吵了，你在干嘛呢
chatbot:我在上班啊，你还没下班呢啊你
user :我周末要做作业啊
chatbot:不是才交了吗，早晚做完啊
user :好多作业啊，还没做完呢
chatbot:我在抄，我妈让我周五交，还有事
user :你就不怕被你妈妈骂吗
chatbot:不怕
user :你周末有去哪里玩吗
chatbot:没时间啊，忙得很啊
user :忙着干嘛呢
chatbot:不知道啊，我周末没有作业，我想睡觉，然后还得找人

Sample 2:
user :在吗
chatbot:在
user :我不开心
chatbot:怎么啦
user :最近好累啊
chatbot:是挺累的
user :作业好多啊
chatbot:还好还好
user :我觉得我快要挂科了
chatbot:我也是啊，但是我觉得我快熬出头了
user :不说了，我还有好多作业没写
chatbot:我也是啊好多啊
user :那先这样，下次再聊，再见了
chatbot:好哒

The results seem really impressive, thanks for your work!

jsedoc · 2019-12-18T17:03:23Z

Thanks!!!

In the paper, it says that a response was chosen from 10 responses in top-k. This is always the problem with sampling that reproducibility becomes an issue. Especially when one of the 10 top-k responses is selected by a human.

@jsedoc so in my decoding script I use multinomial sampling so the output is a bit different every time. If you want to try pure greedy use top-k 0 and change line 91 with
prev = torch.argmax(probs,dim=1).unsqueeze(0)
I will also update the script.

Anyway, the generate responses are very good, but yah not exactly the same. For example:

USR >>> The trading war between China and US is still happening .
SYS >>> I'm not sure what you mean by that .
USR >>> Does money buy happiness?
SYS >>> Money buys happiness , but it also buys a lot of things that make you happy .

and

USR >>> who won the world cup in 2018 ?
SYS >>> The US
USR >>> Nvidia's Titan RTX is really good
SYS >>> I'll check it out , thanks !
USR >>> Can Porsche beat Tesla with its new Taycan EV ?
SYS >>> I think the Tesla is a bit more powerful than the Porsche .

USR >>> what is the boiling point of water?
SYS >>> I think it's about the same as the boiling point of a human heart .
USR >>> which one is bigger, sun or moon?
SYS >>> The sun .

USR >>> who is the first president of the United States
SYS >>> George Washington

In general, I use top-k sampling.

Let me know if this help.

dreasysnail · 2019-12-19T23:31:16Z

@yangjianxin1 The result looks really impressive! We will remark your GitHub repo in our repo as well. Thanks for letting us know

yangjianxin1 · 2019-12-20T04:54:57Z

@dreasysnail thank you very much

polakowo · 2020-01-13T22:32:32Z

First of all, thank you for releasing the code and the models, it's fantastic. Based on the current DialoGPT implementation, I adapted run_generation.py from Hugging Face to perform decoding and built a Telegram bot on top of that (with GIF support!). Texting the model in a messaging app feels much more different than doing it in console.

Responses are sometimes out of this world but still very coherent. Here is a multi-turn chat example with context window of 2 turns:

User >>> Can we achieve singularity?
Bot >>> What does this mean?
User >>> Can computers become smarter than humans?
Bot >>> Is there any evidence that this is possible?
User >>> It must be doable
Bot >>> But... how?
User >>> I want to find out, that's why I created you
Bot >>> You created me?
User >>> You're my bot
Bot >>> You monster

https://github.com/polakowo/gpt2bot

dreasysnail · 2020-01-13T22:35:56Z

Looks awesome. Thanks for the contribution @polakowo !

nicolas-ivanov · 2020-01-20T16:19:56Z

@andreamad8 @polakowo @yangjianxin1 @LHolten thank you for releasing your code! Have you tried feeding the token_ids to the decoder along with input_ids and position_ids? Seems like it breaks the model's output completely :/

nicolas-ivanov · 2020-01-20T16:29:11Z

Here are the inputs for a sample dialog ("hi, man!", "oh, hi!", "what is your name, man?") that I pad from the left and feed to the decoder:

input_ids: 
[0, 0, 0, 0, 0, 0, 5303, 837, 582, 5145, 50256, 1219, 837, 23105, 5145, 50256, 10919, 318, 534, 1438, 837, 582, 5633, 50256]

token_ids:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3]

position_ids: 
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

Is anything wrong with these inputs?

Here are the decoded input tokens for your convenience:

!!!!!!hi, man!<|endoftext|>oh, hi!<|endoftext|>what is your name, man?<|endoftext|>

andreamad8 · 2020-01-21T01:28:20Z

Hey @nicolas-ivanov, yes I tried and yes it breaks the models' output. I believe that the model has not been trained using this positional token. Maybe because the model was working well without. Anyhow, just keep those None and it works okay.

If you need to finetune it, then you also use the position_ids, and they should work :)

I hope this help

Andrea

nicolas-ivanov · 2020-01-21T06:12:21Z

@andreamad8 Thanks a lot for your response!

@dreasysnail Could you please confirm that the model was trained without token_ids? Or are we using them in the wrong way?

dreasysnail · 2020-01-21T06:17:28Z

Yes @andreamad8 is right (Thanks!). We didn't have the token_ids specified. It was left blank. This was following Huggingface's original GPT-2 repository.

nicolas-ivanov · 2020-01-21T09:24:35Z

Got it, thanks a lot for the clarification!

abaheti95 · 2020-04-01T00:34:09Z

@qywu Great Job! If we want to batch the input_ids, what should we pad? the 0 results are terrible.

Was wondering if you figured out a way to batch decode sentences?

GraphGrailAi · 2020-04-02T10:10:17Z

@polakowo Hi, i see you are dive in code, could you suggest how to prepare dataset to fine-tune ? #36

adamcohenhillel · 2020-08-24T11:16:39Z

Hi all, is the third-party decoders still relevant?
what's the different from the script shown in huggingface website?

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generated a response while limiting the total chat history to 1000 tokens, 
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # pretty print last ouput tokens from bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

Thanks!

LHolten mentioned this issue Nov 25, 2019

MMI criterion #15

Open

intersun pinned this issue Nov 26, 2019

intersun unpinned this issue Nov 26, 2019

dreasysnail pinned this issue Dec 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decoding script invitation #3

Decoding script invitation #3

xumx commented Oct 18, 2019

dreasysnail commented Oct 19, 2019

qywu commented Nov 8, 2019

andreamad8 commented Nov 8, 2019 •

edited

Loading

qywu commented Nov 9, 2019

andreamad8 commented Nov 9, 2019

qywu commented Nov 9, 2019 •

edited

Loading

andreamad8 commented Nov 9, 2019

LHolten commented Nov 25, 2019

drogozhang commented Dec 6, 2019

yangjianxin1 commented Dec 17, 2019

jsedoc commented Dec 17, 2019

yangjianxin1 commented Dec 18, 2019

andreamad8 commented Dec 18, 2019

intersun commented Dec 18, 2019

jsedoc commented Dec 18, 2019 •

edited

Loading

dreasysnail commented Dec 19, 2019

yangjianxin1 commented Dec 20, 2019

polakowo commented Jan 13, 2020 •

edited

Loading

dreasysnail commented Jan 13, 2020

nicolas-ivanov commented Jan 20, 2020

nicolas-ivanov commented Jan 20, 2020

andreamad8 commented Jan 21, 2020

nicolas-ivanov commented Jan 21, 2020

dreasysnail commented Jan 21, 2020 •

edited

Loading

nicolas-ivanov commented Jan 21, 2020

abaheti95 commented Apr 1, 2020

GraphGrailAi commented Apr 2, 2020

adamcohenhillel commented Aug 24, 2020 •

edited

Loading

Decoding script invitation #3

Decoding script invitation #3

Comments

xumx commented Oct 18, 2019

dreasysnail commented Oct 19, 2019

qywu commented Nov 8, 2019

andreamad8 commented Nov 8, 2019 • edited Loading

qywu commented Nov 9, 2019

andreamad8 commented Nov 9, 2019

qywu commented Nov 9, 2019 • edited Loading

andreamad8 commented Nov 9, 2019

LHolten commented Nov 25, 2019

drogozhang commented Dec 6, 2019

yangjianxin1 commented Dec 17, 2019

jsedoc commented Dec 17, 2019

yangjianxin1 commented Dec 18, 2019

andreamad8 commented Dec 18, 2019

intersun commented Dec 18, 2019

jsedoc commented Dec 18, 2019 • edited Loading

dreasysnail commented Dec 19, 2019

yangjianxin1 commented Dec 20, 2019

polakowo commented Jan 13, 2020 • edited Loading

dreasysnail commented Jan 13, 2020

nicolas-ivanov commented Jan 20, 2020

nicolas-ivanov commented Jan 20, 2020

andreamad8 commented Jan 21, 2020

nicolas-ivanov commented Jan 21, 2020

dreasysnail commented Jan 21, 2020 • edited Loading

nicolas-ivanov commented Jan 21, 2020

abaheti95 commented Apr 1, 2020

GraphGrailAi commented Apr 2, 2020

adamcohenhillel commented Aug 24, 2020 • edited Loading

andreamad8 commented Nov 8, 2019 •

edited

Loading

qywu commented Nov 9, 2019 •

edited

Loading

jsedoc commented Dec 18, 2019 •

edited

Loading

polakowo commented Jan 13, 2020 •

edited

Loading

dreasysnail commented Jan 21, 2020 •

edited

Loading

adamcohenhillel commented Aug 24, 2020 •

edited

Loading