From 738a608e06534e0d1edb6695013aa52433b87c8e Mon Sep 17 00:00:00 2001
From: NielsRogge
Date: Thu, 29 Sep 2022 09:17:48 +0000
Subject: [PATCH] Add more tips

---
 docs/source/en/model_doc/markuplm.mdx | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/docs/source/en/model_doc/markuplm.mdx b/docs/source/en/model_doc/markuplm.mdx
index a7b2a95f6027cc..314ef750ce3ff4 100644
--- a/docs/source/en/model_doc/markuplm.mdx
+++ b/docs/source/en/model_doc/markuplm.mdx
@@ -16,7 +16,14 @@ specific language governing permissions and limitations under the License.
 
 The MarkupLM model was proposed in [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document
 Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. MarkupLM is BERT, but
-applied to HTML pages instead of raw text documents.
+applied to HTML pages instead of raw text documents. The model incorporates additional embedding layers to improve
+performance, similar to [LayoutLM](layoutlm).
+
+The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains
+state-of-the-art results on 2 important benchmarks:
+- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
+- [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset
+for information extraction from web pages (basically named-entity recognition on web pages)
 
 The abstract from the paper is the following:
 
@@ -30,10 +37,15 @@ pre-trained MarkupLM significantly outperforms the existing strong baseline mode
 tasks. The pre-trained model and code will be publicly available.*
 
 Tips:
-- One can use [`MarkupLMProcessor`] to prepare all data for the model. This processor internally combines a [`MarkupLMFeatureExtractor`] to first
-extract all nodes and xpaths from one or more HTML strings, which are then fed to [`MarkupLMTokenizerFast`], which will turn them into token-level
-`input_ids`, `attention_mask`, `token_type_ids` etc. Optionally, one can also provide `node_labels`, which the tokenizer will turn into token-level
-`labels`.
+- In addition to `input_ids`, [`~MarkupLMModel.forward`] expects 2 additional inputs, namely `xpath_tags_seq` and `xpath_subs_seq`.
+These are the XPath tags and subscripts respectively for each token in the input sequence.
+- One can use [`MarkupLMProcessor`] to prepare all data for the model. Refer to the [usage guide](#usage-markuplmprocessor) for more info.
+- Demo notebooks can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/MarkupLM).
+
+<small> MarkupLM architecture. Taken from the original paper. </small>
 
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/markuplm).
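
As a quick illustration of the flow the tips describe, here is a minimal sketch of preparing an HTML string with [`MarkupLMProcessor`] and feeding the resulting `xpath_tags_seq` and `xpath_subs_seq` to the model. The `microsoft/markuplm-base` checkpoint name is an assumption (it is not named in this patch); any MarkupLM checkpoint should work the same way.

```python
from transformers import MarkupLMProcessor, MarkupLMModel

# Checkpoint name is an assumption; substitute any MarkupLM checkpoint.
processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html_string = "<html><body><h1>Welcome</h1><p>Hello world</p></body></html>"

# The processor extracts nodes and their XPaths from the HTML, then tokenizes
# them, producing input_ids, attention_mask, token_type_ids, xpath_tags_seq
# and xpath_subs_seq in a single call.
encoding = processor(html_string, return_tensors="pt")
print(encoding.keys())

# xpath_tags_seq and xpath_subs_seq encode, for each token, the tag names and
# subscripts along its XPath (e.g. /html/body/p[1]).
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)
```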