Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canonical url intergration batch 3 #905

Merged
merged 36 commits into from
Oct 28, 2023
Merged

Canonical url intergration batch 3 #905

merged 36 commits into from
Oct 28, 2023

Conversation

jknndy
Copy link
Collaborator

@jknndy jknndy commented Oct 15, 2023

Batch 3, scrapers M - R, few reworks but nothing major in this one!

Non-scraper changes

  • Added \u00C2 (Â) to normalize_string util

@jknndy jknndy marked this pull request as ready for review October 16, 2023 02:31
@jayaddison
Copy link
Collaborator

@jknndy ah, I think I see what you mean about the GitHub-suggested recommended commands.

I think what to do instead is to ensure that you have an up-to-date copy of the main branch (git checkout main; git pull origin main) and then merge that into this branch (git checkout canonical_url_intergration_batch_3; git merge main).

Then you'll find the conflicts locally. A shortcut for two of them -- the HTML files, where we don't want to merge two files but instead simply take the one from this branch -- is to reset the file to the version stored in a specific branch: git reset canonical_url_intergration_batch_3 -- tests/test_data/maangchi.testhtml). The conflicts in tests/test_matprat_1.py will require manual resolution, though.

@jknndy
Copy link
Collaborator Author

jknndy commented Oct 26, 2023

Thanks very much! got it & i think we're in good shape for now. Once #915 is merged ill run through this again and then we should be good to go.

edit: not sure whats causing this failure, the files listed in the linter error weren't changed in this PR. & the coverage failures are related to #909

generate.py:211:96: E231 missing whitespace after ','
generate.py:222:85: E231 missing whitespace after ','
recipe_scrapers/goustojson.py:20:20: E231 missing whitespace after ':'
recipe_scrapers/grandfrais.py:57:40: E231 missing whitespace after ':'
recipe_scrapers/kptncook.py:32:26: E231 missing whitespace after ':'
recipe_scrapers/kuchniadomowa.py:15:23: E231 missing whitespace after ':'
recipe_scrapers/mindmegette.py:26:41: E231 missing whitespace after ':'
recipe_scrapers/nihhealthyeating.py:62:37: E231 missing whitespace after ':'
recipe_scrapers/weightwatchers.py:118:55: E702 multiple statements on one line (semicolon)
recipe_scrapers/woolworths.py:14:27: E231 missing whitespace after ':'

@jayaddison
Copy link
Collaborator

That is strange indeed about the lint failures. The list of files there looks similar to the differences between the main and v15 branches - but nothing about this pull request should involve the v15 branch. Odd. I'll try to take more of a look soon.

@jayaddison

This comment was marked as resolved.

@jayaddison

This comment was marked as outdated.

@jayaddison

This comment was marked as outdated.

@jayaddison

This comment was marked as off-topic.

@jayaddison
Copy link
Collaborator

@jknndy it seems that the coverage and lint workflows don't work yet with Python 3.12 - and that became the default for the setup-python action recently. I've pinned those workflows to use lower versions of Python short-term (#917, #918) until we can upgrade.

If you merge the latest changes from main into this branch, tests should pass again 🤞

Copy link
Collaborator

@jayaddison jayaddison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thank you for your patience with this one @jknndy! Almost there with the canonical URL batches :)

"1 peça de filé mignon para rosbife (cerca de 750 g)",
"1 colher (chá) de mostarda amarela em pó",
"1 colher (chá) de páprica defumada",
"750 g de filé mignon em peça para rosbife",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arg. I've noticed while doing a final readthrough that there's a problem with the text decoding here. I'm taking a bit of a look at that now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

charchef does a good job of fixing these malencoded text entries, but is fairly slow the first time it runs:

>>> from charchef import aa_convert_utf8_to_ascii_,aa_repair_bad_conversion_to_utf8,aa_replace_non_printable_chars
>>> text = '1 colher (chá) de páprica defumada'
>>> aa_repair_bad_conversion_to_utf8(text)
['1 colher (chá) de páprica defumada']

(that took 10 seconds or so, maybe slightly longer)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LatinFixer by the same author can achieve the same result here, and fast!

>>> from LatinFixer import LatinFix
>>> text = '1 colher (chá) de páprica defumada'
>>> LatinFix(text).apply_wrong_chars()
1 colher (chá) de páprica defumada

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a possible change to do this; it isn't perfect yet, though. Some ingredient lines begin with fractional amounts -- , for example -- and those aren't being decoded clearly yet.

@jknndy do you remember what browser you downloaded the testhtml files with? There's a possibility that something in the encoding headers/negotiation that it used at the time has resulted in unusual results in the downloaded HTML. When I view the original recipe page live in a web browser, I find that it's more readable than when I view the corresponding .testhtml file locally in the same browser.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've resolved this problem by re-downloading a copy of the HTML; the text encoding received from the site and/or the encoding used to write the file seem to have resolved the problem. So despite that LatinFixer adventure (it's useful to know about that library), it isn't added as a dependency/string handler here at the moment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grabbed all the updated test html using the header defined in _abstract so it's strange that would have caused an issue. Once I'm down with the last batch I have a few to revisit that I skipped over, I'll add this one to this list & try some other recipes to see what happens.

@jayaddison jayaddison merged commit 935dda2 into hhursev:main Oct 28, 2023
16 checks passed
@jknndy jknndy deleted the canonical_url_intergration_batch_3 branch October 28, 2023 15:48
strangetom pushed a commit to strangetom/recipe-scrapers that referenced this pull request Nov 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants