I found that in HotpotQA, each passage in the original dataset is approximately 100-150 tokens long, but in LongBench, some samples in hotpotqa.jsonl have a combined passage length of over 10,000 tokens, averaging around 1,000 tokens per passage. How were such long passages produced?
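For anyone who wants to reproduce these numbers, here is a minimal sketch of how I measured per-passage length. It assumes each line of hotpotqa.jsonl is a JSON object with a `context` field in which passages are introduced by markers such as `Passage 1:\n`; both the field name and the marker format are my assumptions about the file layout. It also counts whitespace-separated words, which only approximates model tokens. The sample string is a made-up stand-in, not real data.

```python
import json
import re

# Assumed marker format inside the context string: "Passage 1:\n", "Passage 2:\n", ...
PASSAGE_MARKER = re.compile(r"Passage \d+:\n")

def passage_lengths(context):
    """Return a whitespace-word count for each passage in one context string."""
    parts = PASSAGE_MARKER.split(context)
    # Skip the empty piece that precedes the first marker.
    return [len(part.split()) for part in parts if part.strip()]

def lengths_for_file(path, field="context"):
    """Yield per-passage lengths for every sample in a JSONL file (field name is assumed)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield passage_lengths(json.loads(line)[field])

# Tiny made-up context, used only to show the shape of the output.
sample_context = (
    "Passage 1:\nArcade (Marvel Comics)\nArcade is a supervillain.\n"
    "Passage 2:\nJohn Byrne\nJohn Byrne is a writer and artist.\n"
)
lengths = passage_lengths(sample_context)
print(lengths, sum(lengths) / len(lengths))
```

Calling `lengths_for_file("hotpotqa.jsonl")` on the real file would give one list of passage lengths per sample, which is enough to check the combined and average lengths quoted above.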
For example, in the second-to-last sample of hotpotqa.jsonl (the question is "Which artist is known for his work on Marvel Team-Up and Batman: Son of the Demon?"), passage 1 reads as follows:
Arcade (Marvel Comics)\nArcade is a supervillain appearing in American comic books published by Marvel Comics. He first appeared in 1978's Marvel Team-Up #65, the creation of writer Chris Claremont and writer/artist John Byrne. The character is a combination of an evil genius and a hitman who carries out his assassinations via various elaborate traps, often referred to as Murderworld.\nArcade's first intended victims were Spider-Man and Captain Britain but since Arcade's Murderworld games always leave the outcome up to chance, the duo defeated Arcade and escaped with their lives. Over the years, Arcade has targeted a multitude of Marvel heroes, often focusing on the X-Men and associated members of X-Factor, X-Force, and Excalibur. In what is considered the "game changer" for Arcade, Avengers Arena, he managed to kidnap 16 superpowered teens and forced them to kill each other for survival in his latest version of Murderworld; unlike most Murderworld schemes, this endeavor yielded several casualties.\nArcade has appeared in a number of other Marvel properties outside of comic books, in X-Men: Evolution voiced by Gabe Khouth, and in the Ultimate Spider-Man animated series voiced by Eric Bauza. He has also appeared as one of the main villains in a number of video games......
However, in the original dataset (hotpot_dev_distractor_v1.json, hotpot_dev_fullwiki_v1.json), the content is:
["Arcade (Marvel Comics)",["Arcade is a fictional supervillain appearing in American comic books published by Marvel Comics."," He first appeared in 1978's "Marvel Team-Up" (vol."," 1) #65, the creation of writer Chris Claremont and writer/artist John Byrne."," The character is a combination of evil genius and hitman who carries out his assassinations via various elaborate traps, often referred to as his "Murderworld"."]]
Where does the content after "often referred to as Murderworld." in the example come from?
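To make the mismatch concrete, here is a rough sketch of the comparison I did. It joins the sentence list of an original `[title, [sentences]]` entry and compares it with the equally long head of the LongBench passage using `difflib.SequenceMatcher` from the standard library. The helper names are my own, and the two strings are shortened copies of the excerpts quoted above.

```python
from difflib import SequenceMatcher

def join_original(entry):
    """Turn a [title, [sentence, ...]] entry into (title, paragraph_text)."""
    title, sentences = entry
    return title, "".join(sentences)

def head_similarity(original_text, longbench_body):
    """Compare the original paragraph with the equally long head of the longer passage."""
    head = longbench_body[: len(original_text)]
    # autojunk=False keeps difflib from ignoring frequent characters in long strings.
    return SequenceMatcher(None, original_text, head, autojunk=False).ratio()

# Shortened copies of the two excerpts quoted in this issue.
original_entry = [
    "Arcade (Marvel Comics)",
    [
        "Arcade is a fictional supervillain appearing in American comic books published by Marvel Comics.",
        " He first appeared in 1978's \"Marvel Team-Up\" (vol.",
        " 1) #65, the creation of writer Chris Claremont and writer/artist John Byrne.",
    ],
]
longbench_passage = (
    "Arcade (Marvel Comics)\n"
    "Arcade is a supervillain appearing in American comic books published by Marvel Comics. "
    "He first appeared in 1978's Marvel Team-Up #65, the creation of writer Chris Claremont "
    "and writer/artist John Byrne."
)

title, text = join_original(original_entry)
lb_title, lb_body = longbench_passage.split("\n", 1)
print(title == lb_title)                        # titles match
print(round(head_similarity(text, lb_body), 2))  # close to, but below, 1.0
```

The titles match, but the opening text is only similar, not identical (for example, "fictional supervillain" versus "supervillain"), so the LongBench passage does not look like the original paragraph with extra text appended.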