update dataset entry to support list of list type #2879
Merged
andreaskoepf
merged 4 commits into
LAION-AI:main
from
CloseChoice:feature/ds-for-webgpt
Apr 27, 2023
CloseChoice
requested review from
theblackcat102,
sanagno,
dvruette,
andreaskoepf and
yk
as code owners
April 24, 2023 15:09
…ts script to count language hits, add constant lang parameter to some datasets
andreaskoepf
approved these changes
Apr 27, 2023
grgau
pushed a commit
to grgau/Open-Assistant
that referenced
this pull request
May 8, 2023
In the [PR to introduce RM for the dataset entry class](LAION-AI#2867) I forgot that if we have RM, we'll have multiple answers per question, so `[Q1, (A11, A12)]`. But I had just introduced questions and answers as `list` types, and therefore we could not connect a question to its answers: e.g. with `questions=[Q1, Q2]` and `answers=[A11, A12, A21, A22]` there was no way to figure out that `A11, A12` belong to `Q1` and `A21, A22` to `Q2`. So I introduced a `list[list[str]]` type for the answers, so that we can connect questions and answers by index: `questions=[Q1, Q2]` and `answers=[[A11, A12], [A21, A22]]`, so `answers[0]` belongs to `questions[0]`. Note that this is backwards compatible since `answers` is a union type of `list[str] | list[list[str]]`. Also added tests for this.

Ran `python check_dataset_appearances.py -d webgpt --cache_dir .cache --mode rm` and found one entry with an empty question:

```python
DatasetEntry(
    questions=[''],
    answers=[[
        'Lebensraum is a German geopolitical concept that means "living space." The term was originally used to support colonialism, and was later adapted by Nazi leader Adolf Hitler to support his quest for German expansion to the east . German geographer and ethnographer Friedrich Ratzel first published an essay called "Der Lebensraum" ("The Living Space") in 1901, in which he posited that all people, animals, and plants need to expand their living space in order to survive . According to Ratzel, species that successfully adapted to one location would spread naturally to others . Hitler believed that Germany required Lebensraum in order to survive, and this conviction that this living space could be gained only in the east and, specifically, from Russia, shaped his policy after his take-over of power in Germany in 1933 . The Nazi Generalplan Ost policy (\'Master Plan for the East\') was based on the tenets of Lebensraum . It stipulated that Germany required a Lebensraum necessary for its survival and that most of the indigenous populations of Central and Eastern Europe would have to be removed permanently (either through mass deportation to Siberia, extermination, or enslavement) .',
        'There are several ways to unblock blocked websites. One way is to use a good web-based proxy server . Another way is to type in the URL of the blocked site you want to access in the address bar, and then press Go or Enter . The web content will be sent to the proxy server where it can then be viewed from your device . This may make browsing a bit slower, but you should still be able to access any of your favorite websites . Another way to unblock blocked websites is to use a VPN (Virtual Private Network) . A VPN can be used to access region-restricted websites, shield your web browsing activities on public WiFi networks, and more .',
    ]],
    context=None,
    lang=None,
    length=None,
    quality=None,
    humor=None,
    creativity=None,
)
```

So this was the result:

```bash
'Found the following occurances in TRAIN webgpt:'
{re.compile('^[\\s\\n]*$'): ['']}
```
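For readers who want to reproduce the empty-question finding, here is a minimal sketch of such a check. Only the regular expression is taken from the script output quoted in the commit message. The function name `find_pattern_hits` and its return shape are assumptions made for illustration and are not taken from `check_dataset_appearances.py`.

```python
import re

# Pattern copied from the reported output: matches empty or whitespace-only strings.
EMPTY_PATTERN = re.compile(r"^[\s\n]*$")


def find_pattern_hits(questions, patterns=(EMPTY_PATTERN,)):
    """Hypothetical helper: map each pattern to the questions it matches."""
    hits = {}
    for pattern in patterns:
        matched = [q for q in questions if pattern.search(q)]
        if matched:
            # Only record patterns that matched at least one question.
            hits[pattern] = matched
    return hits


sample_questions = ["", "How can I unblock a website?", "  \n"]
hits = find_pattern_hits(sample_questions)
print(hits)
```

Keying the result by the compiled pattern mirrors the shape of the reported output, where the dictionary key is the `re.compile(...)` object and the value is the list of offending strings.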
#2827
In the PR to introduce RM for the dataset entry class I forgot that if we have RM, we'll have multiple answers per question, so `[Q1, (A11, A12)]`. But I had just introduced questions and answers as `list` types, and therefore we could not connect a question to its answers: e.g. with `questions=[Q1, Q2]` and `answers=[A11, A12, A21, A22]` there was no way to figure out that `A11, A12` belong to `Q1` and `A21, A22` to `Q2`. So I introduced a `list[list[str]]` type for the answers, so that we can connect questions and answers by index: `questions=[Q1, Q2]` and `answers=[[A11, A12], [A21, A22]]`, so `answers[0]` belongs to `questions[0]`. Note that this is backwards compatible since `answers` is a union type of `list[str] | list[list[str]]`. Also added tests for this.

Ran `python check_dataset_appearances.py -d webgpt --cache_dir .cache --mode rm` and found one entry with an empty question. The offending entry and the script output are quoted in full in the commit message above.
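The index-based pairing described in this PR can be sketched as follows. `EntrySketch` is a hypothetical stand-in, not the project's `DatasetEntry` class, and the treatment of the flat `list[str]` form (exactly one answer per question) is an assumption made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EntrySketch:
    """Hypothetical stand-in for the dataset entry class."""

    questions: list[str]
    # Union type: flat list (assumed one answer per question) or one list per question.
    answers: list[str] | list[list[str]]

    def answers_for(self, i: int) -> list[str]:
        """Return all candidate answers for questions[i]."""
        group = self.answers[i]
        # Flat form: wrap the single answer string so callers always get a list.
        return [group] if isinstance(group, str) else list(group)

    def pairs(self) -> list[tuple[str, list[str]]]:
        """Pair each question with its answers by shared index."""
        if len(self.questions) != len(self.answers):
            raise ValueError("questions and answers must have the same length")
        return [(q, self.answers_for(i)) for i, q in enumerate(self.questions)]


rm_entry = EntrySketch(questions=["Q1", "Q2"], answers=[["A11", "A12"], ["A21", "A22"]])
flat_entry = EntrySketch(questions=["Q1", "Q2"], answers=["A1", "A2"])
print(rm_entry.pairs())
print(flat_entry.pairs())
```

Because both forms are normalized to a list of answers per question, code that consumes `pairs()` does not need to know which variant of the union type an entry was built with.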