
TAPAS tokenizer & tokenizer tests #8482

Merged: 2 commits merged into tapas-style from tapas-final on Nov 16, 2020
Conversation

@LysandreJik (Member) commented Nov 12, 2020

This PR implements the tokenizer API for the TAPAS model, as well as its tests. It is based on tapas-style, which contains all the changes made by black & isort on top of the nielsrogge/tapas_v3 branch from #8113.

The API is akin to that of our other tokenizers: it is based on the __call__ method, which dispatches to encode_plus or batch_encode_plus according to the inputs.

These two methods then dispatch to _encode_plus and _batch_encode_plus, which themselves dispatch to prepare_for_model and _batch_prepare_for_model.
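For reference, here is a minimal sketch of this dispatch chain; the method names match the ones above, but the bodies are placeholders, not the actual implementation:

```python
# Minimal sketch of the dispatch chain described above; method names
# match the PR, the bodies are placeholders only.
from typing import List, Union

class TokenizerDispatchSketch:
    def __call__(self, table, queries: Union[str, List[str]], **kwargs):
        # A list of queries goes through the batch path, a single query
        # through the single-example path.
        if isinstance(queries, list):
            return self.batch_encode_plus(table, queries, **kwargs)
        return self.encode_plus(table, queries, **kwargs)

    def encode_plus(self, table, query, **kwargs):
        return self._encode_plus(table, query, **kwargs)

    def _encode_plus(self, table, query, **kwargs):
        return self.prepare_for_model(table, query, **kwargs)

    def batch_encode_plus(self, table, queries, **kwargs):
        return self._batch_encode_plus(table, queries, **kwargs)

    def _batch_encode_plus(self, table, queries, **kwargs):
        return self._batch_prepare_for_model(table, queries, **kwargs)

    def prepare_for_model(self, table, query, **kwargs):
        return {"input_ids": []}  # placeholder

    def _batch_prepare_for_model(self, table, queries, **kwargs):
        return {"input_ids": [[] for _ in queries]}  # placeholder
```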

Here are the remaining tasks for the tokenizers, from what I could observe:

  • Two tokenizer tests are failing, but only because no checkpoint is currently available.
  • Truncation does not behave as it did before these changes: previously, if a row of the dataframe had to be truncated, the whole row was removed; now only the overflowing tokens are removed. This is probably an important change that will need to be reverted (i.e. reimplemented in the new API).
  • The tokenizer is based on pd.DataFrames. It should be very simple to switch from these to datasets.Dataset, which serves the same purpose. A usage sketch of the current DataFrame-based API follows this list.
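A hedged sketch of that usage (the table contents are made up, and since no checkpoint exists yet, the from_pretrained call is hypothetical):

```python
import pandas as pd

# A made-up table; the tokenizer in this PR expects a pd.DataFrame.
data = {
    "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
    "Age": ["56", "45", "59"],
}
table = pd.DataFrame.from_dict(data)
queries = ["How old is Brad Pitt?", "How many actors are there?"]

# Hypothetical until a checkpoint is available (two tests currently
# fail for exactly this reason):
# tokenizer = TapasTokenizer.from_pretrained("<checkpoint>")
# encoded = tokenizer.encode_plus(table=table, query=queries[0])
```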

Once this PR is merged, I'll open a PR from tapas-style to nielsrogge/tapas_v3 as explained in #8113 (comment)

PreTokenizedInput,
EncodedInput,
],
answer_coordinate: Optional[List[Tuple]] = None,
Contributor

Maybe it's more logical to call it answer_coordinates (plural) rather than answer_coordinate, because a table-question pair typically has an answer spanning multiple coordinates. Here's a screenshot of the SQA TSV format (column is called "answer_coordinates"):

[screenshot: SQA TSV format, showing the "answer_coordinates" column]

This will also require a change in the encode_plus and _encode_plus methods.

@@ -528,7 +1354,7 @@ def _get_cell_token_indexes(self, column_ids, row_ids, column_id, row_id):
if column_ids[index] - 1 == column_id and row_ids[index] - 1 == row_id:
yield index

def _add_numeric_column_ranks(self, column_ids, row_ids, table, features):
def _get_numeric_column_ranks(self, column_ids, row_ids, table):
"""Adds column ranks for all numeric columns."""
Contributor

To be consistent with _get_numeric_relations and the other methods below, the docstring can be changed to "Returns numeric column rank embeddings for all numeric columns".

Comment on lines +235 to +240
cell_trim_length=cell_trim_length,
max_column_id=max_column_id,
max_row_id=max_row_id,
strip_column_names=strip_column_names,
update_answer_coordinates=update_answer_coordinates,
drop_rows_to_fit=drop_rows_to_fit,
Contributor

Is this necessary? Didn't know that 😄

Member Author

This is necessary because all the arguments passed to the superclass are saved in self.init_kwargs, which is used when saving the tokenizer. It is important to have them here so that the exact same config is used when saving and reloading the tokenizer!
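A minimal sketch of the mechanism (class names here are illustrative; only init_kwargs matches the real base class):

```python
# Whatever reaches the base class __init__ lands in self.init_kwargs,
# which is what gets serialized when the tokenizer is saved.
class BaseTokenizerSketch:
    def __init__(self, **kwargs):
        self.init_kwargs = kwargs  # saved to the tokenizer config

class TapasTokenizerSketch(BaseTokenizerSketch):
    def __init__(self, cell_trim_length=-1, drop_rows_to_fit=False, **kwargs):
        # Forwarding the TAPAS-specific arguments is what makes a
        # save/reload round-trip reproduce the exact same config.
        super().__init__(
            cell_trim_length=cell_trim_length,
            drop_rows_to_fit=drop_rows_to_fit,
            **kwargs,
        )
        self.cell_trim_length = cell_trim_length
        self.drop_rows_to_fit = drop_rows_to_fit

tok = TapasTokenizerSketch(cell_trim_length=2)
assert tok.init_kwargs["cell_trim_length"] == 2
```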

"""
Adds numeric relation embeddings to 'features'
Return numeric relations embeddings
Contributor

(nit) ReturnS

segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data)
column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data)
row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data)
prev_label_ids = [0] * len(row_ids)
Contributor

I see that you set the previous label ids (i.e. which tokens of the table were an answer to the previous question) here. This is OK for a single example, but for a batch of examples the prev_label_ids must be set to the label_ids of the previous table-question pair in the batch (which are calculated with the get_answer_ids function). This can be seen here in the original implementation: they use the index of the table-question pair in the batch to determine whether it's the first, second, ... question of the batch. However, I'm not entirely sure this works well, so I've submitted an issue to get it resolved.

I see you removed the position_to_label_ids dictionary that implemented this. It's a dictionary that maps a position to the label ids.

Member Author

You're correct, this is my bad. I will fix this.

@NielsRogge (Contributor), Nov 13, 2020

OK, maybe it's better to hold off on this until the issue mentioned above is resolved. It seems to me that this can be implemented in a better way than in the original implementation.

@NielsRogge (Contributor), Nov 13, 2020

Update: after receiving a response from one of the authors (see the issue above), I now understand how these are created in the original implementation (I had actually implemented it the wrong way with that dictionary). The correct way (in the case of a batch) is to set prev_label_ids equal to get_answer_ids(queries[index - 1]), where index indicates whether the question is the first, second, ... of the batch (note that a batch should always contain questions that refer to the same table). This function in turn calls _get_answer_ids(column_ids, row_ids, question) with the column_ids and row_ids of the current table-question pair, so it's important that these are set to those of the current pair before calling get_answer_ids.

This will require some changes, including to the signatures of several methods (get_answer_ids, for example, currently accepts far more parameters than just question), to match the original implementation.
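A minimal sketch of this scheme, assuming a simplified get_answer_ids(question) signature (the real one takes more parameters, as noted above):

```python
# Sketch of the prev_label_ids scheme described above. All questions in
# a batch refer to the same table, and get_answer_ids is assumed to see
# the column/row ids of the current table-question pair.
def build_prev_label_ids(queries, get_answer_ids, seq_length):
    all_prev_label_ids = []
    for index, _query in enumerate(queries):
        if index == 0:
            # First question of the batch: there is no previous answer.
            prev_label_ids = [0] * seq_length
        else:
            # Label ids of the previous table-question pair.
            prev_label_ids = get_answer_ids(queries[index - 1])
        all_prev_label_ids.append(prev_label_ids)
    return all_prev_label_ids
```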

Contributor

Done

Comment on lines 697 to 700
raw_queries: Union[
TextInput,
PreTokenizedInput,
EncodedInput,
Contributor

I assume raw_queries must be a List

Comment on lines +889 to +890
table=table,
query=query,
Contributor

Note that I'm using answer_coordinateS here, with an s (see comment with screenshot)

Suggested change
table=table,
query=query,
table=table,
query=query,
answer_coordinates=answer_coordinates,
answer_text=answer_text

Contributor

Done

Comment on lines +949 to +951
_, _, num_tokens = self._get_table_boundaries(table_tokens)

table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
Contributor

Similar to a comment above, I suggest the following:

Suggested change
_, _, num_tokens = self._get_table_boundaries(table_tokens)
table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
_, _, max_num_tokens = self._get_table_boundaries(table_tokens)
if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
max_num_tokens = self.cell_trim_length
table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens))

Comment on lines +1038 to +1039
_, _, num_tokens = self._get_table_boundaries(table_tokens)
table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
Contributor

Again, suggestion:

Suggested change
_, _, num_tokens = self._get_table_boundaries(table_tokens)
table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens))
_, _, max_num_tokens = self._get_table_boundaries(table_tokens)
if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
max_num_tokens = self.cell_trim_length
table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens))

@@ -384,7 +1210,7 @@ def _get_token_budget(self, question_tokens):
"""
return self.model_max_length - self._question_encoding_cost(question_tokens)

def _get_table_values(self, table, num_columns, num_rows, num_tokens):
def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]:
Contributor

This makes it clearer.

Suggested change
def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]:
def _get_table_values(self, table, num_columns, num_rows, max_num_tokens) -> Generator[TableValue, None, None]:

@NielsRogge (Contributor) commented Nov 12, 2020

Thank you!

❗ This is a preliminary review; I'm not finished with it. Two important things for now:

  1. I am also testing the Colab demos with this branch. Currently I'm getting an error when providing answer_coordinates and answer_texts to the tokenizer:

SQA: https://colab.research.google.com/drive/1BNxrKkrwpWuE2TthZL5qQlERtcK4ZbIt?usp=sharing
WTQ: https://colab.research.google.com/drive/1K8ZeNQyBqo-A03D8RL8_j34n-Ubggb9U?usp=sharing

Normally, the label_ids, numeric_values and numeric_values_scale should also be padded when I set padding='max_length'.

  2. I've got an updated version of the creation of the numeric values (they are currently not computed correctly) in a branch named tapas_v3_up_to_date_with_master. Either you incorporate those changes in your branch before making a PR, or I make them after this PR is merged (whichever you prefer; the latter is probably easier).
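For reference, a sketch of the kind of training-time call that currently fails; argument names follow this PR as I read it, and the table, queries and supervision inputs come from the SQA notebook:

```python
# Hypothetical names throughout. The call creates label_ids,
# numeric_values and numeric_values_scale, but padding them to
# max_length currently fails, which breaks tensor conversion (see the
# traceback further down in this thread).
def encode_sqa_batch(tokenizer, table, queries, answer_coordinates, answer_texts):
    return tokenizer(
        table=table,
        queries=queries,
        answer_coordinates=answer_coordinates,
        answer_texts=answer_texts,
        padding="max_length",
        return_tensors="pt",
    )
```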

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens.

This implementation does not add special tokens and this method should be overridden in a subclass.
@NielsRogge (Contributor), Nov 12, 2020

This line can be removed, since it's overridden here.

@LysandreJik (Member, Author)

Thanks for your great preliminary review! I've fixed a few of the issues and just pushed a commit. A few things you mention definitely need a deeper look; I can do that in the coming days, but I'll let you finish your review first so that I can batch everything. Thank you!

List[EncodedInput],
]
] = None,
answer_coordinates: Optional[List[Tuple]] = None,
@NielsRogge (Contributor), Nov 13, 2020

Actually, answer_coordinates is a List of Lists of Tuples in the case of a batch, because each table-question pair has a list of tuples as its answer coordinates.

In the case of a single example, it's indeed a list of tuples.

I also wonder whether we should include PreTokenizedInput, since there's a NotImplementedError further down stating that "Currently TapasTokenizer only supports questions as strings."
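For concreteness, the shapes being discussed (values made up):

```python
# Single table-question pair: a list of (row, column) tuples.
answer_coordinates = [(0, 0), (2, 1)]

# Batch of pairs: one list of tuples per question in the batch.
batch_answer_coordinates = [
    [(0, 0)],          # answer to the first question
    [(1, 2), (3, 2)],  # answer to the second question
]
```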

Contributor

Done (however, what about PreTokenizedInput?)

]
] = None,
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
@NielsRogge (Contributor), Nov 13, 2020

Also, answer_texts is a List of Lists of TextInputs in the case of a batch. A single example can have multiple answer texts, corresponding to different cells; here's an example row from the SQA dev set:

list two pylons that are at most, 80 m in height. table_csv/203-375.csv ['(-1, -1)', '(-1, -1)'] ['Mittersill goods aerial tramway', 'Singapore cable car'] NONE

Maybe it would also be better to rename this to answer_text (singular), to match the column name in the SQA format. That also ensures we have the same argument names in the batch and non-batch methods, although it will of course require some changes to the other encoding methods.

Contributor

Done

Comment on lines +529 to +530
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
Contributor

Suggested change
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
answer_coordinates: Optional[List[List[Tuple]]] = None,
answer_texts: Optional[List[List[TextInput]]] = None,

Contributor

Done

Comment on lines +620 to +621
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
@NielsRogge (Contributor), Nov 13, 2020

Suggested change
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
answer_coordinates: Optional[List[List[Tuple]]] = None,
answer_texts: Optional[List[List[TextInput]]] = None,

Contributor

Done

Comment on lines +738 to +739
for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip(
queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts
@NielsRogge (Contributor), Nov 13, 2020

If we change the answer_texts parameter in all encoding methods to answer_text, then we have to update this code (and quite a few other places):

Suggested change
for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip(
queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts
for query_ids, raw_query, query_tokens, answer_coords, answer_txt in zip(
queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_text

Contributor

Done

table_data=table_data,
query_tokens=query_tokens,
answer_coordinates=answer_coords,
answer_text=answer_text,
@NielsRogge (Contributor), Nov 13, 2020

If we change the answer_texts parameter to answer_text, then this becomes:

Suggested change
answer_text=answer_text,
answer_text=answer_txt,

Contributor

Done

Comment on lines +696 to +697
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
Contributor

Suggested change
answer_coordinates: Optional[List[Tuple]] = None,
answer_texts: Optional[List[TextInput]] = None,
answer_coordinates: Optional[List[List[Tuple]]] = None,
answer_texts: Optional[List[List[TextInput]]] = None,

Contributor

Done

Comment on lines +1087 to +1095
label_ids = self.get_answer_ids(
column_ids, row_ids, table_data, query_tokens, answer_text, answer_coordinates
)
numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids, columns_to_numeric_values)
numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids)

encoded_inputs["label_ids"] = label_ids
encoded_inputs["numeric_values"] = numeric_values
encoded_inputs["numeric_values_scale"] = numeric_values_scale
@NielsRogge (Contributor), Nov 13, 2020

These are created, but it seems the padding/truncation does not work when self.pad() is applied to them. This currently results in an error when I run a Colab notebook from this branch:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    607 
--> 608                 tensor = as_tensor(value)
    609 

ValueError: expected sequence of length 41 at dim 1 (got 45)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
6 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    623                     )
    624                 raise ValueError(
--> 625                     "Unable to create tensor, you should probably activate truncation and/or padding "
    626                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    627                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
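A minimal sketch of what padding needs to do for these extra keys; the pad values are assumptions (NaN for a missing numeric value and 1.0 for the scale, mirroring the defaults used elsewhere in the TAPAS code, as far as I can tell):

```python
# Sketch only: pad the TAPAS-specific keys alongside input_ids so all
# examples in a batch reach the same length.
def pad_tapas_extras(encoded, max_length):
    pad_len = max_length - len(encoded["input_ids"])
    encoded["input_ids"] += [0] * pad_len
    encoded["attention_mask"] += [0] * pad_len
    encoded["label_ids"] += [0] * pad_len
    encoded["numeric_values"] += [float("nan")] * pad_len  # assumed pad value
    encoded["numeric_values_scale"] += [1.0] * pad_len     # assumed pad value
    return encoded
```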

Contributor

Padding has been fixed

Comment on lines +358 to +359
token_ids_0 (:obj:`List[int]`): The first tokenized sequence.
token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence.
@NielsRogge (Contributor), Nov 13, 2020

Suggested change
token_ids_0 (:obj:`List[int]`): The first tokenized sequence.
token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence.
token_ids_0 (:obj:`List[int]`): The ids of the question.
token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table.

Contributor

Done

Comment on lines +354 to +355
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens.
@NielsRogge (Contributor), Nov 13, 2020

Suggested change
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens.
Build model inputs from a question and flattened table for sequence classification or question answering tasks by concatenating and
adding special tokens.

Contributor

Done

Comment on lines +376 to +380
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Contributor

Suggested change
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Args:
token_ids_0 (:obj:`List[int]`):
List of question IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
List of flattened table IDs.

Contributor

Done

],
}

new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0], padding="max_length")
Contributor

This test prepares inputs for TAPAS inference. We also need tests that prepare inputs for TAPAS training (fine-tuning); in that case, answer_coordinates and answer_text must also be provided to the tokenizer, and it should create label_ids, numeric_values and numeric_values_scale.
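A hedged sketch of such a training-mode test; argument names follow this PR's encode_plus, and the expected keys follow the review comments above:

```python
# Sketch of a training-mode test, not the actual test suite.
def test_encode_plus_for_training(tokenizer, table, queries,
                                  answer_coordinates, answer_text):
    encoded = tokenizer.encode_plus(
        table=table,
        query=queries[0],
        answer_coordinates=answer_coordinates[0],
        answer_text=answer_text[0],
        padding="max_length",
    )
    # Supervision signals must be present for fine-tuning.
    for key in ("label_ids", "numeric_values", "numeric_values_scale"):
        assert key in encoded
```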

@NielsRogge (Contributor)

@LysandreJik I have finished reviewing and have added more (mostly documentation-related) comments.

The most important thing is that when answer_coordinates and answer_text are provided to the tokenizer, an error is currently thrown because padding of the resulting label_ids, numeric_values and numeric_values_scale is not working.

Besides this, the other important things are:

  • a correct implementation of the creation of the prev_label_ids when a batch of table-question pairs is provided
  • a correct implementation of drop_rows_to_fit and cell_trim_length

@LysandreJik LysandreJik merged commit 81d56b2 into tapas-style Nov 16, 2020
@LysandreJik LysandreJik deleted the tapas-final branch November 16, 2020 16:30
@LysandreJik LysandreJik restored the tapas-final branch November 16, 2020 16:30
@LysandreJik LysandreJik deleted the tapas-final branch November 16, 2020 16:30
@LysandreJik LysandreJik restored the tapas-final branch November 16, 2020 16:30
@LysandreJik LysandreJik deleted the tapas-final branch April 27, 2022 15:42