Add more tips
NielsRogge committed Sep 29, 2022
1 parent 9680740 commit 738a608
Showing 1 changed file with 17 additions and 5 deletions.
docs/source/en/model_doc/markuplm.mdx

The MarkupLM model was proposed in [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document
Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. MarkupLM is BERT, but
applied to HTML pages instead of raw text documents. The model incorporates additional embedding layers to improve
performance, similar to [LayoutLM](layoutlm).

The model can be used for tasks like question answering on web pages or information extraction from web pages (see the
sketch after the list below). It obtains state-of-the-art results on 2 important benchmarks:
- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
- [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset
for information extraction from web pages (basically named-entity recognition on web pages)
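As a quick illustration of the question answering use case, here is a minimal sketch using [`MarkupLMForQuestionAnswering`]. The checkpoint name `microsoft/markuplm-base-finetuned-websrc`, the toy HTML string, and the question are assumptions for illustration; any MarkupLM QA checkpoint works the same way.

```python
from transformers import MarkupLMProcessor, MarkupLMForQuestionAnswering
import torch

# assumed checkpoint: a MarkupLM base model fine-tuned on WebSRC
processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base-finetuned-websrc")
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base-finetuned-websrc")

html_string = "<html><body><h1>My name is Niels.</h1></body></html>"
question = "What's his name?"

# the processor pairs the question with the parsed HTML
encoding = processor(html_string, questions=question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# decode the highest-scoring answer span
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
print(processor.decode(encoding.input_ids[0, start : end + 1]))
```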

The abstract from the paper is the following:

*... the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding
tasks. The pre-trained model and code will be publicly available.*

Tips:
- In addition to `input_ids`, [`~MarkupLMModel.forward`] expects 2 additional inputs, namely `xpath_tags_seq` and `xpath_subs_seq`.
These are the XPath tags and subscripts, respectively, for each token in the input sequence (see the sketch after this list).
- One can use [`MarkupLMProcessor`] to prepare all data for the model. Refer to the [usage guide](#usage-markuplmprocessor) for more info.
- Demo notebooks can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/MarkupLM).
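To make the two tips above concrete, here is a minimal sketch that runs a raw HTML string through [`MarkupLMProcessor`] and then through the base model. The tiny HTML string is a stand-in; `microsoft/markuplm-base` is the base checkpoint.

```python
from transformers import MarkupLMProcessor, MarkupLMModel

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html_string = "<html><body><h1>Hello world</h1></body></html>"

# the processor extracts nodes and xpaths, then tokenizes them
encoding = processor(html_string, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])

# xpath_tags_seq and xpath_subs_seq are passed to the model along with input_ids
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```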

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/markuplm_architecture.jpg"
alt="drawing" width="600"/>

<small> MarkupLM architecture. Taken from the <a href="https://arxiv.org/abs/2110.08518">original paper.</a> </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/markuplm).

