TAPAS tokenizer & tokenizer tests #8482
Conversation
        PreTokenizedInput,
        EncodedInput,
    ],
    answer_coordinate: Optional[List[Tuple]] = None,
Maybe it's more logical to call it `answer_coordinates` (plural) rather than `answer_coordinate`, because a table-question pair typically has an answer spanning multiple coordinates. Here's a screenshot of the SQA TSV format (the column is called "answer_coordinates").
This will also require a change in the `encode_plus` and `_encode_plus` methods.
@@ -528,7 +1354,7 @@ def _get_cell_token_indexes(self, column_ids, row_ids, column_id, row_id):
             if column_ids[index] - 1 == column_id and row_ids[index] - 1 == row_id:
                 yield index

-    def _add_numeric_column_ranks(self, column_ids, row_ids, table, features):
+    def _get_numeric_column_ranks(self, column_ids, row_ids, table):
         """Adds column ranks for all numeric columns."""
To be consistent with `_get_numeric_relations` and the other methods below, the docstring can be changed to "Returns numeric column rank embeddings for all numeric columns".
            cell_trim_length=cell_trim_length,
            max_column_id=max_column_id,
            max_row_id=max_row_id,
            strip_column_names=strip_column_names,
            update_answer_coordinates=update_answer_coordinates,
            drop_rows_to_fit=drop_rows_to_fit,
Is this necessary? Didn't know that 😄
This is necessary because all those arguments passed to the super class will then be saved in `self.init_kwargs`, which will be used when saving the tokenizer. It is important to have those here, so that when saving/reloading the tokenizer, the exact same config is used!
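As a minimal sketch of that mechanism (a hypothetical subclass, not the actual TAPAS code):

```python
from transformers import PreTrainedTokenizer

class TableTokenizerSketch(PreTrainedTokenizer):  # hypothetical illustration
    def __init__(self, cell_trim_length=-1, drop_rows_to_fit=False, **kwargs):
        # Forwarding these arguments is what gets them into self.init_kwargs,
        # which save_pretrained() serializes (tokenizer_config.json) and
        # from_pretrained() uses to rebuild the exact same configuration.
        super().__init__(
            cell_trim_length=cell_trim_length,
            drop_rows_to_fit=drop_rows_to_fit,
            **kwargs,
        )
        self.cell_trim_length = cell_trim_length
        self.drop_rows_to_fit = drop_rows_to_fit
```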
""" | ||
Adds numeric relation embeddings to 'features' | ||
Return numeric relations embeddings |
(nit) ReturnS
        segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data)
        column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data)
        row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data)
        prev_label_ids = [0] * len(row_ids)
I see that you set the previous label ids (i.e. which tokens of the table were an answer to the previous question) here. This is OK for a single example, but in case of a batch of examples, the `prev_label_ids` must be set to the `label_ids` of the previous table-question pair in the batch (which are calculated based on the `get_answer_ids` function). This can be seen here in the original implementation: they use the index of the table-question pair in the batch to determine whether it's the first, second, ... question of the batch. However, I'm not entirely sure whether this is working well; I've submitted an issue to get this resolved.

I see you removed the `position_to_label_ids` dictionary that implemented this. It's a dictionary that maps a position to the label ids.
You're correct, this is my bad. I will fix this.
OK, maybe it's better to hold off on this until the issue mentioned above is resolved. It seems to me that this can be implemented in a better way than in the original implementation.
Update: after receiving a response from one of the authors (see issue above), I now understand how these are created in the original implementation (I actually implemented it in the wrong way with that dictionary). The correct way to do this (in case of a batch) is to set the `prev_label_ids` equal to `get_answer_ids(queries[index - 1])`, with `index` indicating whether the question is the first, second, ... in a batch (note that a batch should always contain questions that refer to the same table). This function in turn calls `_get_answer_ids(column_ids, row_ids, question)`, with the `column_ids` and `row_ids` of the current table-question pair. So it's important that before calling `get_answer_ids`, the `column_ids` and `row_ids` are set to those of the current table-question pair.

This will require some changes, also to the signature of several methods (such as `get_answer_ids`, which currently accepts a lot more parameters than just `question`), to reflect the original implementation.
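A self-contained toy sketch of that batch logic (simplified signatures; the real `get_answer_ids` takes more arguments, and the helper names here are hypothetical):

```python
from typing import List, Tuple

def get_answer_ids(column_ids: List[int], row_ids: List[int],
                   answer_coordinates: List[Tuple[int, int]]) -> List[int]:
    # Toy stand-in: mark tokens whose (row, column) matches an answer
    # coordinate with 1, all others with 0.
    answers = set(answer_coordinates)
    return [int((row, col) in answers) for row, col in zip(row_ids, column_ids)]

def compute_prev_label_ids(column_ids: List[int], row_ids: List[int],
                           batch_answer_coordinates: List[List[Tuple[int, int]]]) -> List[List[int]]:
    # prev_label_ids for question i are the answer ids of question i - 1,
    # evaluated against the column/row ids of the current pair (all questions
    # in a batch refer to the same table).
    prev_label_ids = []
    for index in range(len(batch_answer_coordinates)):
        if index == 0:
            prev_label_ids.append([0] * len(row_ids))  # first question: no previous answer
        else:
            prev_label_ids.append(get_answer_ids(column_ids, row_ids, batch_answer_coordinates[index - 1]))
    return prev_label_ids
```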
Done
    raw_queries: Union[
        TextInput,
        PreTokenizedInput,
        EncodedInput,
I assume `raw_queries` must be a `List`.
            table=table,
            query=query,
Note that I'm using `answer_coordinates` here, with an "s" (see comment with screenshot).
-            table=table,
-            query=query,
+            table=table,
+            query=query,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text
Done
        _, _, num_tokens = self._get_table_boundaries(table_tokens)

        table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
Similar to a comment above, I suggest the following:
-        _, _, num_tokens = self._get_table_boundaries(table_tokens)
-        table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
+        _, _, max_num_tokens = self._get_table_boundaries(table_tokens)
+        if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
+            max_num_tokens = self.cell_trim_length
+        table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens))
        _, _, num_tokens = self._get_table_boundaries(table_tokens)
        table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
Again, suggestion:
-        _, _, num_tokens = self._get_table_boundaries(table_tokens)
-        table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
+        _, _, max_num_tokens = self._get_table_boundaries(table_tokens)
+        if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
+            max_num_tokens = self.cell_trim_length
+        table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens))
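For intuition, a standalone version of the capping in the suggestion above (illustrative values only):

```python
cell_trim_length = 2   # e.g. self.cell_trim_length
max_num_tokens = 5     # longest cell of the table, in tokens
if cell_trim_length >= 0 and max_num_tokens > cell_trim_length:
    max_num_tokens = cell_trim_length
print(max_num_tokens)  # 2: every cell is trimmed to at most 2 tokens
```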
@@ -384,7 +1210,7 @@ def _get_token_budget(self, question_tokens):
         """
         return self.model_max_length - self._question_encoding_cost(question_tokens)

-    def _get_table_values(self, table, num_columns, num_rows, num_tokens):
+    def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]:
This makes it clearer.
-    def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]:
+    def _get_table_values(self, table, num_columns, num_rows, max_num_tokens) -> Generator[TableValue, None, None]:
Thank you! ❗ This is a preliminary review, I'm not finished with it. Two important things for now:

SQA: https://colab.research.google.com/drive/1BNxrKkrwpWuE2TthZL5qQlERtcK4ZbIt?usp=sharing

Normally, the
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens.

        This implementation does not add special tokens and this method should be overridden in a subclass.
This line can be removed, since it's overridden here.
Thanks for your great preliminary review. I've fixed a few of the issues and just pushed a commit. There are a few things you mention that definitely need a deeper look. I can do so in the coming days, but I'll let you finish your review first so that I may batch everything. Thank you!
            List[EncodedInput],
        ]
    ] = None,
    answer_coordinates: Optional[List[Tuple]] = None,
Actually, the `answer_coordinates` are a list of lists of tuples (in case of a batch), because each table-question pair has a list of tuples as answer coordinates. In the case of a single example, it's indeed a list of tuples.

I also wonder whether we should include `PreTokenizedInput`, since we have a `NotImplementedError` further down stating that "Currently TapasTokenizer only supports questions as strings."
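Illustrative shapes (hypothetical values):

```python
# single example: a list of (row, column) tuples
answer_coordinates = [(0, 0), (1, 2)]

# batch: one list of tuples per table-question pair
batch_answer_coordinates = [
    [(0, 0)],          # answers of the first pair
    [(1, 2), (2, 2)],  # answers of the second pair
]
```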
Done (however, what about `PreTokenizedInput`?)
        ]
    ] = None,
    answer_coordinates: Optional[List[Tuple]] = None,
    answer_texts: Optional[List[TextInput]] = None,
Also, the `answer_texts` is a list of lists of `TextInput`s in case of a batch. A single example can have multiple answer texts, corresponding to the different cells. Here's an example row from the SQA dev set:

list two pylons that are at most, 80 m in height.	table_csv/203-375.csv	['(-1, -1)', '(-1, -1)']	['Mittersill goods aerial tramway', 'Singapore cable car']	NONE

Maybe it would also be better to rename this to `answer_text` (singular), to have the same column name as the SQA format. This will also make sure we have the same argument names in both the batch and non-batch methods. This will of course require some changes to the other encoding methods.
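Parsed into batch inputs for the tokenizer, that row would look like this (variable names follow this PR; values are taken from the quoted row):

```python
queries = ["list two pylons that are at most, 80 m in height."]
answer_coordinates = [[(-1, -1), (-1, -1)]]  # one list of tuples per question
answer_text = [["Mittersill goods aerial tramway", "Singapore cable car"]]  # one list of strings per question
```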
Done
    answer_coordinates: Optional[List[Tuple]] = None,
    answer_texts: Optional[List[TextInput]] = None,
-    answer_coordinates: Optional[List[Tuple]] = None,
-    answer_texts: Optional[List[TextInput]] = None,
+    answer_coordinates: Optional[List[List[Tuple]]] = None,
+    answer_texts: Optional[List[List[TextInput]]] = None,
Done
    answer_coordinates: Optional[List[Tuple]] = None,
    answer_texts: Optional[List[TextInput]] = None,
-    answer_coordinates: Optional[List[Tuple]] = None,
-    answer_texts: Optional[List[TextInput]] = None,
+    answer_coordinates: Optional[List[List[Tuple]]] = None,
+    answer_texts: Optional[List[List[TextInput]]] = None,
Done
        for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip(
            queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts
If we change the `answer_texts` parameter in all encoding methods to `answer_text`, then we have to update this code (and a lot of other places):
-        for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip(
-            queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts
+        for query_ids, raw_query, query_tokens, answer_coords, answer_txt in zip(
+            queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_text
Done
            table_data=table_data,
            query_tokens=query_tokens,
            answer_coordinates=answer_coords,
            answer_text=answer_text,
If we change the `answer_texts` parameter to `answer_text`, then this becomes:
-            answer_text=answer_text,
+            answer_text=answer_txt,
Done
    answer_coordinates: Optional[List[Tuple]] = None,
    answer_texts: Optional[List[TextInput]] = None,
-    answer_coordinates: Optional[List[Tuple]] = None,
-    answer_texts: Optional[List[TextInput]] = None,
+    answer_coordinates: Optional[List[List[Tuple]]] = None,
+    answer_texts: Optional[List[List[TextInput]]] = None,
Done
            label_ids = self.get_answer_ids(
                column_ids, row_ids, table_data, query_tokens, answer_text, answer_coordinates
            )
            numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids, columns_to_numeric_values)
            numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids)

            encoded_inputs["label_ids"] = label_ids
            encoded_inputs["numeric_values"] = numeric_values
            encoded_inputs["numeric_values_scale"] = numeric_values_scale
These are created, but it seems that the padding/truncation is not working when `self.pad()` is applied to them. This currently results in an error (when I run a Colab notebook from this branch):
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    607
--> 608                 tensor = as_tensor(value)
    609
ValueError: expected sequence of length 41 at dim 1 (got 45)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
6 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    623                 )
    624                 raise ValueError(
--> 625                     "Unable to create tensor, you should probably activate truncation and/or padding "
    626                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    627                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
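One way to address this (a hedged sketch, not necessarily how the PR fixed it): override the base tokenizer's `_pad` hook so the TAPAS-specific keys are padded alongside `input_ids`. The class name and the padding values (`0`, `NaN`, `1.0`) here are assumptions.

```python
import math
from transformers import PreTrainedTokenizer

class PaddedTableTokenizerSketch(PreTrainedTokenizer):  # hypothetical illustration
    def _pad(self, encoded_inputs, max_length=None, **kwargs):
        # Pad the TAPAS-specific keys before delegating the standard keys
        # (input_ids, attention_mask, ...) to the base implementation.
        difference = 0
        if max_length is not None:
            difference = max_length - len(encoded_inputs["input_ids"])
        if difference > 0:
            if "label_ids" in encoded_inputs:
                encoded_inputs["label_ids"] = encoded_inputs["label_ids"] + [0] * difference
            if "numeric_values" in encoded_inputs:
                # assumption: NaN marks "no numeric value" in padded positions
                encoded_inputs["numeric_values"] = encoded_inputs["numeric_values"] + [math.nan] * difference
            if "numeric_values_scale" in encoded_inputs:
                encoded_inputs["numeric_values_scale"] = encoded_inputs["numeric_values_scale"] + [1.0] * difference
        return super()._pad(encoded_inputs, max_length=max_length, **kwargs)
```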
Padding has been fixed
            token_ids_0 (:obj:`List[int]`): The first tokenized sequence.
            token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence.
-            token_ids_0 (:obj:`List[int]`): The first tokenized sequence.
-            token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence.
+            token_ids_0 (:obj:`List[int]`): The ids of the question.
+            token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table.
Done
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens.
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens.
+        Build model inputs from a question and flattened table for sequence classification or question answering tasks
+        by concatenating and adding special tokens.
Done
        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
         Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
+            token_ids_0 (:obj:`List[int]`):
+                List of question IDs.
+            token_ids_1 (:obj:`List[int]`, `optional`):
+                List of flattened table IDs.
Done
            ],
        }

        new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0], padding="max_length")
This test is to prepare inputs for TAPAS for inference. We also need to include tests that prepare inputs for TAPAS for training (fine-tuning). In that case, `answer_coordinates` and `answer_text` also need to be provided to the tokenizer, and it should create `label_ids`, `numeric_values` and `numeric_values_scale`.
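Something along these lines (a sketch only, assuming this PR's argument names and that a pretrained checkpoint such as `google/tapas-base` is available):

```python
import pandas as pd
from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")  # assumed checkpoint
table = pd.DataFrame({"Actors": ["Brad Pitt", "Leonardo Di Caprio"], "Age": ["56", "45"]})

# Fine-tuning inputs: answer supervision is passed alongside the question.
encoded_inputs = tokenizer(
    table=table,
    queries=["How old is Brad Pitt?"],
    answer_coordinates=[[(0, 1)]],  # one list of (row, column) tuples per query
    answer_text=[["56"]],           # one list of answer strings per query
    padding="max_length",
)

# Besides input_ids etc., the supervision signals should now be present
# (key names follow this PR's naming):
print(sorted(encoded_inputs.keys()))  # expect label_ids, numeric_values, numeric_values_scale
```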
@LysandreJik I have finished reviewing; I've added more (mostly documentation-related) comments. The most important thing is that when

Besides this, the other important things are:
This PR aims to implement the tokenizer API for the TAPAS model, as well as the tests. It is based on `tapas-style`, which contains all the changes done by black & isort on top of the `nielsrogge/tapas_v3` branch in #8113.

The API is akin to our other tokenizers': it is based on the `__call__` method, which dispatches to `encode_plus` or `batch_encode_plus` according to the inputs. These two methods then dispatch to `_encode_plus` and `_batch_encode_plus`, which themselves dispatch to `prepare_for_model` and `_batch_prepare_for_model` (a condensed sketch of this dispatch chain follows below).
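A hypothetical, self-contained sketch of that dispatch chain (routing logic only; method bodies are stubs, not the real implementation):

```python
class TokenizerDispatchSketch:  # illustration only, not the real class
    def __call__(self, table, queries=None, **kwargs):
        # batched input -> batch_encode_plus, single query -> encode_plus
        if isinstance(queries, (list, tuple)):
            return self.batch_encode_plus(table, queries, **kwargs)
        return self.encode_plus(table, queries, **kwargs)

    def encode_plus(self, table, query, **kwargs):
        return self._encode_plus(table, query, **kwargs)

    def _encode_plus(self, table, query, **kwargs):
        return self.prepare_for_model(table, query, **kwargs)

    def batch_encode_plus(self, table, queries, **kwargs):
        return self._batch_encode_plus(table, queries, **kwargs)

    def _batch_encode_plus(self, table, queries, **kwargs):
        return self._batch_prepare_for_model(table, queries, **kwargs)

    def prepare_for_model(self, table, query, **kwargs):
        return {"input_ids": []}  # stub: the real method builds the full encoding

    def _batch_prepare_for_model(self, table, queries, **kwargs):
        return [self.prepare_for_model(table, q, **kwargs) for q in queries]
```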
Here are the remaining tasks for the tokenizers, from what I could observe:

- Tables are currently handled as `pd.DataFrame`s. It should be very simple to switch from these to `datasets.Dataset`, which serve the same purpose.

Once this PR is merged, I'll open a PR from `tapas-style` to `nielsrogge/tapas_v3`, as explained in #8113 (comment).