Hi, I'm currently compiling additional, more modern HTML documents with gold-standard content and comments for use in training dragnet models, and I have a few questions:
- Should I consider the text of embedded tweets, posts, quotations, image captions, and other rich media to be content?
- Should I include the author byline, publication date, etc. at the start of an article as content? What about typical addenda at the bottom of the article?
- Should I include non-English language content?
Thanks for your help!