
Scrapes appear frozen time to time in Zimfarm #1756

Closed · kelson42 opened this issue Jan 27, 2023 · 30 comments

@kelson42
Collaborator

This recipe has its very own freeze pattern at the very start, seemingly even before scraping begins!

See https://farm.openzim.org/pipeline/32b22c3a583bd32c94c53d36/debug

kelson42 added the bug label on Jan 27, 2023
kelson42 added this to the 1.12.0 milestone on Jan 27, 2023
@uriesk
Collaborator

uriesk commented Jan 27, 2023

It works locally:

npm start -- --adminEmail=contact@kiwix.org --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv --customZimDescription="A selection of the best 1 Million Wikipedia articles" --customZimFavicon=https://en.wikipedia.org/static/images/project-logos/enwiki.png --customZimTitle="Wikipedia's 1 Million Top Articles" --filenamePrefix=wikipedia_en_top1m --mwUrl=https://en.wikipedia.org/ --webp

and it works with Docker:

docker run -v /root/test:/output:rw --name mwoffliner_wikipedia_en_top1m ghcr.io/openzim/mwoffliner:TEST-redisupdate mwoffliner --adminEmail=contact@kiwix.org --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv --customZimDescription="A selection of the best 1 Million Wikipedia articles" --customZimFavicon=https://en.wikipedia.org/static/images/project-logos/enwiki.png --customZimTitle="Wikipedia's 1 Million Top Articles" --filenamePrefix=wikipedia_en_top1m --mwUrl=https://en.wikipedia.org/ --webp

Both enter the article-downloading stage fine.

Is there anything special on the Zimfarm?

@kelson42
Collaborator Author

@uriesk Not that I know of... and we are used to downloading lists from that server... but never of that size.

@holta

holta commented Jan 28, 2023

@uriesk https://kinsta.com/blog/increase-max-upload-size-wordpress/#increase-the-max-upload-file-size-in-nginx suggests that the NGINX error "413 Request Entity Too Large" might be overcome by increasing these default settings in /etc/php/7.4/fpm/php.ini and /etc/php/7.4/cli/php.ini (or equivalent):

upload_max_filesize = 64M
post_max_size = 128M

Others recommend increasing these too:

max_execution_time = 30
max_input_time = 60
memory_limit = 128M
max_input_vars = 1000

Followed by something like:

systemctl restart php7.4-fpm
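
Since the 413 is returned by NGINX itself, the NGINX-side knob is client_max_body_size (the same setting referenced later in this thread). A minimal sketch, assuming the relevant server block lives in /etc/nginx/nginx.conf or a site config under /etc/nginx/conf.d/, and using 512m purely as an example value:

http {
    # allow large log/status uploads; 0 would disable the size check entirely
    client_max_body_size 512m;
}

Then validate and reload:

nginx -t && systemctl reload nginx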


@kelson42
Collaborator Author

@rgaudin Any idea? These errors are in the task-worker log: https://farm.openzim.org/pipeline/32b22c3a583bd32c94c53d36/debug

@rgaudin
Member

rgaudin commented Jan 28, 2023

137 is OOM
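
For context, 137 is 128 + 9, i.e. the process received SIGKILL, which is what the kernel/Docker OOM killer sends. A quick way to confirm this locally (a sketch, assuming the container from the docker command above is still present):

docker inspect mwoffliner_wikipedia_en_top1m \
  --format 'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}}'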

@rgaudin
Member

rgaudin commented Jan 28, 2023

Ah sorry, yes, I've seen those. We need a ticket on ZF, but it doesn't affect the scraping.

@kelson42
Collaborator Author

@rgaudin OK, I have already increased the available memory and restarted a scrape.

@kelson42
Collaborator Author

@rgaudin Almost all the time (we had one that somehow managed to get past this and then died later with 137), the scrape is just stuck right after the Redis server starts... so super early. This is the only recipe with that behaviour, and it seems to be the only one with the errors from openzim/zimfarm#738. Locally it seems to work fine, as reported by @uriesk. I have a fairly strong feeling that the problem might come from the Zimfarm (worker?) itself. Would you please be able to verify and maybe find other clues?

@rgaudin
Member

rgaudin commented Jan 29, 2023

I'll look into it, but the task you referenced here failed after 1 day and 50 minutes… not exactly upon startup.

@holta

holta commented Jan 29, 2023

I'll look into it, but the task you referenced here failed after 1 day and 50 minutes… not exactly upon startup.

Indeed, this time https://farm.openzim.org/recipes/wikipedia_en_top1m failed after about half a day according to https://farm.openzim.org/pipeline/81507cc3e015578b61a95d36 ("10 hours, 50 minutes" ?)

ASIDE: This ~40 GB ZIM file will be a Lifeline for people who just cannot afford large microSD cards.

In essentially all countries.

So I'd like to help wherever I can!

@uriesk
Collaborator

uriesk commented Jan 29, 2023

Better not run it on verbose.

The three oldest available builds failed with error code 137 (out of memory).
All scrapes since then ran with verbose.
And when you download the 500 MB log file of the most recent one, you see:

[error] [2023-01-29T11:50:40.491Z] Error downloading article Nottingham_City_Hospital
Failed to run mwoffliner after [49493s]: {
	"message": "Request failed with status code 500",
	"name": "AxiosError",
	"stack": "AxiosError: Request failed with status code 500\n    at settle (file:///tmp/mwoffliner/node_modules/axios/lib/core/settle.js:19:12)\n    at IncomingMessage.handleStreamEnd (file:///tmp/mwoffliner/node_modules/axios/lib/adapters/http.js:512:11)\n    at IncomingMessage.emit (node:events:539:35)\n    at IncomingMessage.emit (node:domain:475:12)\n    at endReadableNT (node:internal/streams/readable:1344:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)",
	

So just one failed article, which we can exclude with articleListToIgnore once 1.12 releases.
Edit: Or just remove it from the article list, ahem.
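
A minimal sketch of that exclusion, assuming the 1.12 flag keeps the articleListToIgnore name and accepts a file the way articleList does:

# hypothetical local ignore list containing the one failing title
echo "Nottingham_City_Hospital" > ignore.tsv
npm start -- --adminEmail=contact@kiwix.org --mwUrl=https://en.wikipedia.org/ \
  --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv \
  --articleListToIgnore=ignore.tsv --filenamePrefix=wikipedia_en_top1m --webp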

I think the Zimfarm just has some issues with the massive verbose output of large scrapes.

@rgaudin
Member

rgaudin commented Jan 30, 2023

wikipedia_en_top1m never worked

When (end)     | Duration                    | Reason
2 months ago   | 1 day, 8 hours, 50 minutes  | OOM (10GB)
1 month ago    | 1 day, 20 hours, 30 minutes | OOM (10GB)
29 days ago    | 1 day, 10 hours, 30 minutes | OOM (15GB)
28 days ago    | 1 day, 3 hours, 30 minutes  | Canceled
23 days ago    | 10 hours, 30 minutes        | Canceled
23 days ago    | 1 hour, 2 minutes           | Canceled
10 days ago    | 20 hours, 3 minutes         | Canceled
2 days ago     | 1 day, 0 hours, 50 minutes  | OOM (15GB)
1 day ago      | 10 hours, 50 minutes        | Exit-code 2 (Request failed with status code 500)
21 minutes ago | 20 hours, 60 minutes        | Canceled

This doesn't match the ticket description at all. This looks like a memory-hungry task that didn't fit within 15GB.

I see that another one has been launched following the openzim/zimfarm#738 fix. I doubt it will have much impact, as that bug was just preventing the task worker from uploading the log. Even if that log were kept in memory, we're talking about 500MB… and the task worker is not resource-limited. So for it to have an impact, the worker would have to be completely maxed out on RAM and you'd have to hope the kernel decides to kill the scraper…

I am ruling Zimfarm out for the moment; please let me know if your findings lead back to it.

I would suggest you test locally by specifying a memory resource limit on your docker command.
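
For instance, a sketch reusing the docker command from earlier in this thread, capped at the farm's 15GB limit (the exact values are an assumption chosen to mirror the table above; setting --memory-swap equal to --memory also disables swap, so the container OOMs the way the farm does):

docker run --memory=15g --memory-swap=15g -v /root/test:/output:rw \
  --name mwoffliner_wikipedia_en_top1m ghcr.io/openzim/mwoffliner:TEST-redisupdate mwoffliner \
  --adminEmail=contact@kiwix.org \
  --articleList=http://download.openzim.org/wp1/enwiki/tops/1000000.tsv \
  --customZimFavicon=https://en.wikipedia.org/static/images/project-logos/enwiki.png \
  --filenamePrefix=wikipedia_en_top1m --mwUrl=https://en.wikipedia.org/ --webp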

@uriesk
Collaborator

uriesk commented Jan 30, 2023

@rgaudin
then I suggest looking at the actual issue and not just trying to work around it (like I wanted with that "let's not do verbose").

The 413 Request Entity Too Large regularly happened with verbose output while non-verbose was OK.
Here you get a different one: https://farm.openzim.org/pipeline/4d2a6c335d2fa49beb7afb36/debug, and here: https://farm.openzim.org/pipeline/ba86dc6329a4e5323ec6eb36/debug

The Zimfarm will not be the reason why the scrapes fail. But it can be the reason why they appear frozen (even if they might not be). All those Canceled ones had frozen output. They should have either given us an OOM error, or running output, or whatever error actually appeared (like the one a day ago).
But if the Zimfarm is just preventing the task worker from uploading the log, we don't know what is going on, and when you hit cancel, you don't get the log file either.
The output of the last wikipedia_ceb_all appeared frozen for two days or more, then magically updated and we got a legitimate error.
Frozen output is an issue.
mwoffliner:TEST-redisupdate was supposed to solve freezes that happen within mwoffliner, with timers that monitor the crucial parts and cancel execution with a legitimate error when something freezes. It did do that successfully once, so I guess that is working.
But if the whole Node process freezes, a timer within that frozen process can't help.
Nor does it help when the process actually isn't frozen and we just don't see the output.

So I ask you to stick around, watch those builds, and if something appears to be frozen... check what is going on in the container.
Even if you think it is not a Zimfarm issue anymore since the client_max_body_size change, let's watch whether that is the case; and if freezes still appear and are not Zimfarm-related, you can give the hint that leads the way.
Maybe shell into it and check the node process: is it still running, is it taking CPU? Run redis-cli and check whether the Redis connections still exist and whether commands are still going in.
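
A sketch of that inspection, assuming shell access to the worker host, a hypothetical scraper container name, and Redis listening on its default port inside the container:

docker exec -it <scraper-container> sh
ps aux | grep -i node      # is the node process still there and accumulating CPU time?
redis-cli client list      # are the scraper's Redis connections still open?
redis-cli --stat           # are keys/commands still changing over time?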

I cannot locally test a full scrape of a 6-million-article Wikipedia like ceb or en, or even just a 1-million one. I can only check whether it reaches one of the earlier stages without freezing.
And I never saw a freeze with the ones that are small enough for me to try.
We can only debug this on the Zimfarm.

We can rename this issue to something like "scrapes appear frozen sometimes" if that helps.

kelson42 changed the title from "WPEN top1m recipe stuck at very start" to "scrapes appear frozen time to time in Zimfarm" on Jan 30, 2023
@rgaudin
Member

rgaudin commented Jan 30, 2023

That makes more sense. I understand that, indeed, if you rely on timestamps in the live-updated stdout to tell a running task from a stuck one, then the ZF issue would have made you think those tasks were stuck.

That said:

  • I don't understand the references to the start/beginning
  • Given most runs failed/got stuck after 20h+, I don't understand how it was considered to work locally but not on the farm.

I guess none of this matters now that we have eliminated the main culprit. Hopefully, the current run will enlighten us.

As for monitoring, ping me here or on Slack with a task when you want me to connect and dig up information for you; I'd be happy to help.

@uriesk
Collaborator

uriesk commented Jan 30, 2023

Because they appeared stuck right after starting Redis, before any scraping started:
https://farm.openzim.org/pipeline/4f192c6329a4e5327c141b36/debug

Thanks, let's hope it works out 👍

kelson42 changed the title from "scrapes appear frozen time to time in Zimfarm" to "Scrapes appear frozen time to time in Zimfarm" on Jan 31, 2023
kelson42 modified the milestone from 1.12.0 to 1.13.0 on Feb 1, 2023
@holta

holta commented Feb 1, 2023

FWIW yesterday's run contained:

  • lots of "413 Request Entity Too Large" warnings
  • one "couldn't patch task status=scraper_running HTTP 502: ResponseError (not JSON)" "502 Bad Gateway" warning

Does Scraper stderr /usr/local/sbin/mwoffliner: line 2: 13 Killed mean the job was manually killed?

Thank you to @kelson42 who launched another ZF attempt 3.5 hours ago:
https://farm.openzim.org/pipeline/28c070f7906bf9674d93ad36

@rgaudin
Member

rgaudin commented Feb 1, 2023

FWIW yesterday's run contained:

* lots of "413 Request Entity Too Large" warnings
* one "couldn't patch task status=scraper_running HTTP 502: ResponseError (not JSON)" "502 Bad Gateway" warning

Both were from before the fix. Sorry about the timing conflict

Does Scraper stderr /usr/local/sbin/mwoffliner: line 2: 13 Killed mean the job was manually killed?

No, it was killed by docker due to lack of RAM

@rgaudin
Member

rgaudin commented Feb 1, 2023

I'd like to mention that yesterday's run had 20GB of RAM yet OOM'd. Keep in mind that redis is completely in RAM as well.

The new one running is bound to 25GB.
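
Since Redis keeps the whole scrape state in RAM, a quick check inside the running container (a sketch, assuming Redis on its default port) shows how much of that 25GB Redis itself is consuming:

redis-cli info memory | grep -E 'used_memory_human|used_memory_peak_human'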

@holta

holta commented Feb 1, 2023

I'd like to mention that yesterday's run had 20GB of RAM yet OOM'd. Keep in mind that redis is completely in RAM as well.

The new one running is bound to 25GB.

I should have seen and realized the 137 OOM yesterday, right!

(5GB extra RAM during each attempt can't hurt, if indeed it's that simple!)

🙏

@uriesk
Collaborator

uriesk commented Feb 1, 2023

wikipedia_en_all_maxi made it through the downloading-articles stage with 20GB RAM and is fetching 6.6 million articles. Something fishy is going on.

TEST-redisupdate is merged now, so we can do the next try with mwoffliner:dev.
And I would appreciate it without verbose; I don't enjoy downloading those multi-GB log files.

@holta

holta commented Feb 1, 2023

And I would appreciate it without verbose; I don't enjoy downloading those multi-GB log files.

Great Question @uriesk, given the importance of rapid/continuous improvement:

ASIDE, "Scrapper Log" should really be "Scraper Log" on every job's "Debug" tab (Debug page) like:
https://farm.openzim.org/pipeline/f7b5acc98307812c7e5c7d36/debug

@rgaudin
Member

rgaudin commented Feb 2, 2023

ASIDE, "Scrapper Log" should really be "Scraper Log" on every job's "Debug" tab (Debug page) like: https://farm.openzim.org/pipeline/f7b5acc98307812c7e5c7d36/debug

Fixed the typo (openzim/zimfarm@f83b6ee). Please refresh. Note that I only caught what you meant because I happened to be following this ticket; I don't contribute to mwoffliner, so Zimfarm bugs reported here have limited chances of getting fixed 😉

kelson42 closed this as completed on Feb 2, 2023
@holta

holta commented Feb 2, 2023

So far so good — the latest scrape of wikipedia_en_top1m is still running after almost 34h:

https://farm.openzim.org/pipeline/28c070f7906bf9674d93ad36

@uriesk
Collaborator

uriesk commented Feb 3, 2023

it's done, with an epic 200 GB file 🤔

And one 413 Request Entity Too Large at the very end

@kelson42
Collaborator Author

kelson42 commented Feb 3, 2023

Yes, I get it, the novid format should be configured on this recipe ;)

@holta

holta commented Feb 3, 2023

Yes, I get it, the novid format should be configured on this recipe ;)

Very awesome that it appears so close; thanks to everyone!

Can someone restart the job with --format="novid:maxi" ?

ASIDE: Why does wikipedia_en_all_maxi use --addNamespaces="100" ?

@kelson42
Collaborator Author

kelson42 commented Feb 3, 2023

@holta yes, will do

@uriesk
Collaborator

uriesk commented Feb 3, 2023

@holta @kelson42 that build gave very good insight into the storage requirements of media; I analyzed it in #1767

@holta

holta commented Feb 3, 2023

@holta @kelson42 that build gave very good insight into the storage requirements of media; I analyzed it in #1767

Heroic. Millions of people should thank you and everyone helping here once we're all done, reliably and compactly delivering the "world's knowledge" (i.e. a very meaningful snapshot subset of Wikipedia, in English for starters) every single month!

@holta

holta commented Feb 11, 2023

@uriesk thanks for your help completing this major accomplishment, which has the very real potential to help millions of kids get legit, not-stale encyclopedic knowledge every single month. 🎆

(i.e. if the wikipedia_en_top1m scraping recipe hopefully proves much more reliable/ruggedized than https://farm.openzim.org/recipes/wikipedia_en_all_maxi ❓❗)
