Skip to content
This repository has been archived by the owner on May 10, 2024. It is now read-only.

HTML tags should be removed when classifying a page for Brave ads #1918

Closed
tmancey opened this issue Nov 7, 2019 · 4 comments
Closed

HTML tags should be removed when classifying a page for Brave ads #1918

tmancey opened this issue Nov 7, 2019 · 4 comments

Comments

@tmancey
Copy link
Collaborator

tmancey commented Nov 7, 2019

Description:

HTML tags should be removed when classifying a page for Brave ads otherwise the classifier will likely return "Technology & Computing-Software"

Steps to Reproduce

  1. Enable Brave Rewards
  2. Visit www.law.com

Actual result:
Page is classified as "technology & computing-software"

Expected result:
Page should be classified as "law-law"

Reproduces how often: [Easily reproduced, Intermittent Issue]
Easily reproduced

Brave Version:

Device details:

Website problems only:

  • did you check with Brave Shields down?
  • did you check in Safari/Firefox (WkWebView-based browsers)?

Additional Information

Search for "Successfully classified page at" in the logs to see the page classification

@iccub
Copy link
Contributor

iccub commented Nov 7, 2019

Only stripping html tags is enough, or do we need to do more like extracting only <body> contents or removing css/script contents?

@kylehickinson kylehickinson self-assigned this Nov 7, 2019
@kylehickinson kylehickinson added this to the 1.14 milestone Nov 7, 2019
@kylehickinson
Copy link
Collaborator

We're going to just use document.body.innerText as it provides the most accurate and quickest solution

@tmancey
Copy link
Collaborator Author

tmancey commented Nov 7, 2019

I can confirm that tests @kylehickinson showed me using document.body.innerText are as good as the desktop page classification

kylehickinson added a commit that referenced this issue Nov 11, 2019
Instead of sending plain HTML, we now send in the document's `innerText`, which doesn't include any HTML tags, images, etc.
This allows the ads classifier to properly classify pages to their correct type
kylehickinson added a commit that referenced this issue Nov 11, 2019
Instead of sending plain HTML, we now send in the document's `innerText`, which doesn't include any HTML tags, images, etc.
This allows the ads classifier to properly classify pages to their correct type
kylehickinson added a commit that referenced this issue Nov 11, 2019
Instead of sending plain HTML, we now send in the document's `innerText`, which doesn't include any HTML tags, images, etc.
This allows the ads classifier to properly classify pages to their correct type
kylehickinson added a commit that referenced this issue Nov 12, 2019
Instead of sending plain HTML, we now send in the document's `innerText`, which doesn't include any HTML tags, images, etc.
This allows the ads classifier to properly classify pages to their correct type
kylehickinson added a commit that referenced this issue Nov 12, 2019
Instead of sending plain HTML, we now send in the document's `innerText`, which doesn't include any HTML tags, images, etc.
This allows the ads classifier to properly classify pages to their correct type
kylehickinson added a commit that referenced this issue Nov 13, 2019
Instead of sending plain HTML, we now send in the document's `innerText`, which doesn't include any HTML tags, images, etc.
This allows the ads classifier to properly classify pages to their correct type
@srirambv
Copy link
Contributor

Verification passed on iPhone XR with iOS 13.2 running 1.14(19.11.22.15)


Verification passed on iPhone 7+ with iOS 13.2 running 1.14(19.11.22.15)


Verification passed on iPhone 6 with iOS 12.4 running 1.14(19.11.22.15)


Verification passed on iPad Pro with iOS 13.2 running 1.14(19.11.22.15)


Verification passed on iPad Pro with iOS 12.4 running 1.14(19.11.22.15)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants