
High CPU Usage and MySQL Server Inaccessibility during catalog_product_flat reindexing #3309

Closed
Winfle opened this issue Jun 7, 2023 · 13 comments

@Winfle

Winfle commented Jun 7, 2023

Preconditions (*)

OpenMage 19.5.0-rc3
MySQL server: AWS Aurora MySQL (3.03.0, compatible with MySQL 8.0.26)
MySQL CPU at 100% during the reindex process

The issue

Hello, guys.
I'm encountering a critical issue during the reindexing process of the catalog_product_flat index in my OpenMage project. The problem leads to a spike in CPU usage, causing the MySQL server to become inaccessible. I have also noticed significant waits on MySQL SYNCH and IO events, which may be related to the issue.

Here are some statistics regarding the problem, taken from the wait events reported by Performance Insights for Aurora MySQL:

wait/synch/sxlock/innodb/trx_purge_latch: 55.54 AAS
wait/io/table/sql/handler: 21.32 AAS
wait/synch/mutex/innodb/fil_system_mutex: 14.02 AAS
These statistics indicate abnormal behavior, as the normal total AAS should be under 16.
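For reference, the same wait events can also be read directly from performance_schema (just a sketch; these counters are cumulative since server start, not per-second AAS):

```sql
-- Cumulative wait-event totals since server start (timers are in picoseconds).
SELECT EVENT_NAME,
       COUNT_STAR,
       SUM_TIMER_WAIT / 1e12 AS total_wait_seconds
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME IN (
    'wait/synch/sxlock/innodb/trx_purge_latch',
    'wait/io/table/sql/handler',
    'wait/synch/mutex/innodb/fil_system_mutex'
)
ORDER BY SUM_TIMER_WAIT DESC;
```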

Waits graph: [screenshot]

CPU graph: [screenshot]

From the official documentation (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/ams-waits.sx-lock-hash-table-locks.html) I can see that the wait/synch/sxlock/innodb/trx_purge_latch wait event occurs when the workload must access data in files on disk or when blocks are freed from or added to the buffer pool's LRU list.
So it could be related to the fact that indexes are used in a non-optimal way.
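Since trx_purge_latch points at the purge subsystem, one thing I plan to watch during the reindex is whether the purge backlog (history list length) keeps growing; a rough sketch:

```sql
-- InnoDB purge backlog; the same value appears in
-- SHOW ENGINE INNODB STATUS under the TRANSACTIONS section.
SELECT NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';
```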

The only workaround for now is to restart the MySQL server and disable the reindex process entirely.
Has anybody faced this issue before?
Any help will be appreciated.

Thanks

@pquerner
Contributor

pquerner commented Jun 8, 2023

My knowledge about the Magento (CE) indexer is limited and possibly outdated (or forgotten), but I have had my fun time with it as well.

How many products and categories are we talking about in your instance? Is it the stock indexer, or do you use some external module on top of it?

In the past we had to resort to a custom module (I think it was AsyncIndex by Magento-Hackathon), which indexed the products for us in batches, since we had many product changes, happening quite often, on a very large catalog.
(I think we had something else on top of it, but I am not sure anymore... I will edit this once I remember.)

Does the same thing happen on a developer machine, so you can pin-point the issue better perhaps?

@luigifab
Contributor

luigifab commented Jun 8, 2023

Sorry for my stupid question: is this new with rc3, or do you have the same problem with the previous version?

@Winfle
Author

Winfle commented Jun 8, 2023

@pquerner Hey, thank you for the feedback!
We have around 368,000 products and around 1,180 categories, with 4 stores.
We have a plugin for another index type (URL Rewrites reindex): https://www.dnd.fr/magento-patch-how-to-optimize-re-index-processing-time-for-url-rewrite/

As far as I know, the issue appeared around 10 days ago, but we didn't make any changes.
So it could be related to some internal limit being reached, and then it started to crash.

So, my goal for now is to investigate which factor affects the MySQL state.
Thanks

@Winfle
Author

Winfle commented Jun 8, 2023

Hello, @luigifab. Thank you for your input!
It was stable for 3 months, but about 10 days ago this index started to crash.
We migrated to OpenMage 3 months ago, but we hadn't noticed this issue until now.

@addison74
Contributor

If the solution worked until 10 days ago, it means that the problem lies elsewhere than in OpenMage, provided that the source code has not been modified. I would check the updates of the server packages from the last month, especially the MySQL version.

In such situations, it must be established from the beginning whether the problem comes from the source code of the framework or from the applications that run it. I would move the whole OpenMage directory and database to a local test environment.

@Winfle
Author

Winfle commented Jun 8, 2023

@addison74 Hello.
I think this is a complex issue, in the sense that during a particular index workload it can lock specific resources that are also used by the current frontend load.
Of course, it might not be a code issue, but understanding the processes that happen inside is important, since we have a correlation between this particular index and the MySQL server outage.
I just want to investigate all the factors that can lead to this problem, including the codebase.
Maybe the solution will be to simply delete unneeded products, but I want to know the exact cause.
Also, I would like to add that certain queries that get stuck are frontend ones, but I think they are just "victims" of the corrupted/locked state during reindex.
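To confirm that, something like the following should show which frontend queries are blocked and by what during the reindex (a sketch, assuming the MySQL 8.0 sys schema is available on Aurora 3.x):

```sql
-- Currently blocked statements and the statements blocking them.
SELECT waiting_pid, waiting_query,
       blocking_pid, blocking_query,
       wait_age_secs
FROM sys.innodb_lock_waits
ORDER BY wait_age_secs DESC;
```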

So this is probably more of a case-specific issue than a problem with OpenMage in general. However, it can be useful for other people, because the solution generally lies not only in the code.
Appreciate your input!

@addison74
Contributor

In this case we will keep the ticket open for a period of 2 weeks; then, if there are no updates, we will move it to the Discussions > Q&A section.

@Winfle
Author

Winfle commented Jun 8, 2023

@addison74 thank you for the help!

@elidrissidev added the "performance" label on Jun 8, 2023
@rvelhote
Contributor

rvelhote commented Jun 15, 2023

@Winfle Just some brainstorming of things I thought of when reading this:

@tmotyl
Contributor

tmotyl commented Jun 15, 2023

FYI, PR #3267 will help make the issue smaller by limiting the number of entries to index.

@Winfle
Author

Winfle commented Jul 3, 2023

@addison74 I will post a solution here tomorrow.
I had to use a different reindexing strategy.

@pquerner
Contributor

pquerner commented Jul 4, 2023

The oldest lie on the internet "will post solution tomorrow" :D

@Winfle
Author

Winfle commented Jul 4, 2023

Attribute Optimization: We sorted the attributes by usage and disabled those that were associated with fewer than 10,000 products. By reducing the number of attributes in the flat table, we minimized the workload during reindexing. (We had 128 attributes and 400k products.)
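Roughly, ranking attributes by actual usage can be done with something like this (illustrative sketch only; varchar backend shown, repeat for the _int, _decimal, _text and _datetime value tables):

```sql
-- How many products actually have a value for each attribute (varchar backend).
SELECT ea.attribute_code,
       COUNT(DISTINCT v.entity_id) AS products_with_value
FROM catalog_product_entity_varchar v
JOIN eav_attribute ea ON ea.attribute_id = v.attribute_id
GROUP BY ea.attribute_code
ORDER BY products_with_value ASC;
```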

Temporary Flat Table (the actual fix): Instead of altering the current catalog flat table, we created a temporary flat table to perform the necessary updates. This approach eliminated the CPU lock that occurred during the alteration process. After completing the reindex, we renamed the temporary table to replace the current flat table.
I think the main problem is that the flat table has a foreign key to the catalog_product_entity table.
So, when it performs an update, a row lock occurs.
Since the reindex process works horizontally (one attribute at a time for all products), this has a significant impact.
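A simplified sketch of the swap (the table name assumes a single store view; the real flat tables are per store, e.g. catalog_product_flat_1):

```sql
-- CREATE TABLE ... LIKE copies columns and indexes but NOT the foreign key
-- to catalog_product_entity, so updates on the copy don't take those row locks.
CREATE TABLE catalog_product_flat_1_tmp LIKE catalog_product_flat_1;

-- ... populate catalog_product_flat_1_tmp with the reindexed data here ...

-- Atomic swap, then drop the old table.
RENAME TABLE catalog_product_flat_1 TO catalog_product_flat_1_old,
             catalog_product_flat_1_tmp TO catalog_product_flat_1;
DROP TABLE catalog_product_flat_1_old;
```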

Disable foreign key checks: Before starting the reindexing process, we temporarily disabled foreign key checks in MySQL. This can help alleviate the row-locking issue. However, it's important to note that this approach should be used with caution and only in a controlled environment.
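In practice that is just the following, scoped to the session that runs the reindex:

```sql
SET FOREIGN_KEY_CHECKS = 0;
-- ... run the catalog_product_flat reindex ...
SET FOREIGN_KEY_CHECKS = 1;
```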

By implementing these steps, we successfully resolved the CPU usage spike and improved the efficiency of the reindexing process for the catalog_product_flat index in our OpenMage project.

Additional tools: We used the OtterTune service to monitor performance and provide recommended settings for the MySQL server.
I think it was really helpful, although it didn't make an impact on this issue; it was just a general optimization for the server.

If somebody is interested, I can post this solution here.
It's not perfect, but it did the trick.

Thanks for the support all!
