Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Object Hydration performance on large result sets #10035

Closed
wants to merge 5 commits into from

Conversation

tucksaun
Copy link

@tucksaun tucksaun commented Sep 9, 2022

Hi folks!

While assessing a client's project performance I found some optimizations regarding the ORM hydration process that improves speed when working with large result sets and collections.

Expand to get some context on how we reached this. The context is about online carts on a B2B e-commerce platform with a high number of items and complex personalization logic. We reached the point in the optimization process where most of the time was spent in the ORM (after serialization) when fetching a cart.

This makes sense because of the volume and the logic at stake.
Denormalization or caching does not make sense for us because of the low read/write ratio and the business logic involved so we first went with the 2-step hydration technic.

While analyzing profiling results we more or less came to the same conclusion as #8390: the hydration process converts data over and over again even if the entity is already hydrated>

But after trying to work around performance for specific DBAL types that were costly to convert (namely Symfony's UUID and Dunglas's JSON document) and digging deeper into profiles I concluded something could be done directly in the Hydrator for the benefit of everyone instead of writing our specific optimized hydrator:
Screenshot 2022-09-08 at 17 51 38

This PR acts according to one consideration: most parts of the hydration process are not slow per se but are repeated for every row and column. By applying some small optimizations, moving some code before the first iteration, or some code at the very last moment, we can improve the overall process.
The more rows and columns, the more the process is accelerated.

For examples:

  • type conversion can be costly but only identifiers are required to determine if the entity is already hydrated or not: the actual data can be converted later on in the process only if needed using only the columns required for the current entity (not the data for other entities in relations);
  • we can prepare the mapping information in a way that prevents repeating some part of the process thousands of times;
  • and we can avoid using the reflection to retrieve relations properties if they are available in another mapping.

With this patch, I manage to reach the following results:

Metric Before After Notes
->getQuery()->getResult() 726.39 ms 358.67 ms -51%, best of 20 iterations, query result cache enabled, 11 826 entities created
Doctrine\ORM\PersistentCollection::* 164 634 77 345
TypedNoDefaultReflectionProperty::getValue 162 770 125 669
Doctrine\ORM\Internal\Hydration\AbstractHydrator::resultSetMapping 441 742 19 745
Doctrine\ORM\Internal\Hydration\AbstractHydrator::hydrateColumnInfo 946 970 337
Doctrine\DBAL\Types\IntegerType::convertToPHPValue 406 982 279 366
Doctrine\DBAL\Types\StringType::convertToPHPValue 181 207 63 473
Doctrine\DBAL\Types\BooleanType::convertToPHPValue 75 163 7 865
Doctrine\DBAL\Types\DateTimeImmutableType::convertToPHPValue 25 118 789
App\DBAL\Types\UuidType::convertToPHPValue 13 056 413

I hope we can manage to get this work so that everyone enjoys better performance 🤗
If you are willing to move in this direction, we can probably adapt ArrayHydrator the same way.

@beberlei
Copy link
Member

Hey @tucksaun, this looks like fantastic work! Thank you! It will take some time to review the PR through as its touching essentially everything about hydrators, so I am not sure when we will get to it at this time.

@sips-richard
Copy link

I found this PR in relation to an issue we've been having when hydrating a graph with a couple left joins. Specifically that Doctrine calls convertToPHPValue() on all the NULL values for each column in any empty left join records, prior to hydrating the entities and then discarding those empty records. We noticed this because we have a custom data type that throws an exception if it encounters a NULL (the fix for now was to eliminate the check).

But I was wondering whether it is necessary for the calls to convertToPHPValue() up front when ultimately the record ID might be NULL and the empty data set gets thrown in the trash, or could the code could be reworked to only convert the values for columns belonging to entities that will actually be hydrated?

@tucksaun
Copy link
Author

Hey @tucksaun, this looks like fantastic work! Thank you! It will take some time to review the PR through as its touching essentially everything about hydrators, so I am not sure when we will get to it at this time.

@beberlei sure, I understand as this is quite a change.
If it can help you: I tried to keep a "logical" order in commits so that maybe it is easier to follow.
Also, feel free to ping me if you want to chat about it if you want.

I will try to maintain this PR as up-to-date as possible in the meantime.

@tucksaun
Copy link
Author

I found this PR in relation to an issue we've been having when hydrating a graph with a couple left joins. Specifically that Doctrine calls convertToPHPValue() on all the NULL values for each column in any empty left join records, prior to hydrating the entities and then discarding those empty records. We noticed this because we have a custom data type that throws an exception if it encounters a NULL (the fix for now was to eliminate the check).

But I was wondering whether it is necessary for the calls to convertToPHPValue() up front when ultimately the record ID might be NULL and the empty data set gets thrown in the trash, or could the code could be reworked to only convert the values for columns belonging to entities that will actually be hydrated?

@sips-richard I would say that the hydrator has to maintain the call to convertToPHPValue because the logic of what is done with a NULL value in the database might be reflected differently in PHP depending on the needs and the type implementation.
But if I got your problem correctly, the number of calls might be lower with this patch.

@tucksaun tucksaun force-pushed the perf/optim-hydration-2.13 branch 3 times, most recently from bd6be8e to 898cf1e Compare September 28, 2022 14:18
@tucksaun tucksaun force-pushed the perf/optim-hydration-2.13 branch from 898cf1e to bd8be73 Compare November 8, 2022 18:50
@greg0ire
Copy link
Member

greg0ire commented Nov 8, 2022

Hi! git rebase --exec vendor/bin/phpcbf should fix some (but not all) cs issues :)

@tucksaun tucksaun force-pushed the perf/optim-hydration-2.13 branch 2 times, most recently from 3ee9ab1 to 423b645 Compare November 8, 2022 20:36
@tucksaun
Copy link
Author

tucksaun commented Nov 8, 2022

@greg0ire I though I already fixed them all and only casually rebased this PR today so thank you for the heads up about he CS issues.
They should be fixed now.

@tucksaun tucksaun force-pushed the perf/optim-hydration-2.13 branch from 423b645 to 95213ea Compare November 8, 2022 20:57
@derrabus derrabus changed the base branch from 2.13.x to 2.14.x November 8, 2022 22:18
@derrabus derrabus requested a review from beberlei November 8, 2022 22:19
$newObjectsData[$objIndex]['args'][$argIndex] = $value;
}

/** @psalm-suppress LessSpecificReturnStatement Psalm does not detect that args keys is always defined */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't suppress Psalm errors via comments. We usually configure this in psalm.xml.dist.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@derrabus I don't manage to suppress it for this specific method inside psalm.xml.
And to suppress it for the whole file looks a bit too wide to me.

Do you have any recommendations?
Or should I add this to the baseline?
Or does it mean this is a legitimate use case to you?

Comment on lines +561 to 563
$resultKey = isset($resultSetMapping->indexByMap['scalars'])
? $row[$resultSetMapping->indexByMap['scalars']]
: $this->resultCounter - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$resultKey = isset($resultSetMapping->indexByMap['scalars'])
? $row[$resultSetMapping->indexByMap['scalars']]
: $this->resultCounter - 1;
$resultKey = $resultSetMapping->indexByMap['scalars'] ?? $this->resultCounter - 1;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum does not look correct to me as we are accessing $row, not only $resultSetMapping->indexByMap ? 🤔

tests/Doctrine/Tests/ORM/Functional/Ticket/DDC6303Test.php Outdated Show resolved Hide resolved
lib/Doctrine/ORM/Internal/Hydration/AbstractHydrator.php Outdated Show resolved Hide resolved
lib/Doctrine/ORM/Internal/Hydration/AbstractHydrator.php Outdated Show resolved Hide resolved
*
* @psalm-return array<array-key, array<array-key, mixed>> An array with all the ids (name => value) of the row.
*/
protected function gatherRowIDs(array $data, array &$id, array &$nonemptyComponents)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected function gatherRowIDs(array $data, array &$id, array &$nonemptyComponents)
private function gatherRowIDs(array $data, array &$id, array &$nonemptyComponents): array

lib/Doctrine/ORM/Internal/Hydration/AbstractHydrator.php Outdated Show resolved Hide resolved
*
* @psalm-return array<array-key, mixed>
*/
protected function gatherRowDataFromScalarsMapping(array $data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected function gatherRowDataFromScalarsMapping(array $data)
private function gatherRowDataFromScalarsMapping(array $data): array

*
* @psalm-suppress MoreSpecificReturnType Psalm does not detect that args keys is always defined
*/
protected function gatherRowDataFromNewObjectsMapping(array $data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected function gatherRowDataFromNewObjectsMapping(array $data)
private function gatherRowDataFromNewObjectsMapping(array $data): array

lib/Doctrine/ORM/Internal/Hydration/AbstractHydrator.php Outdated Show resolved Hide resolved
@sips-richard
Copy link

sips-richard commented Nov 9, 2022

I found this PR in relation to an issue we've been having when hydrating a graph with a couple left joins. Specifically that Doctrine calls convertToPHPValue() on all the NULL values for each column in any empty left join records, prior to hydrating the entities and then discarding those empty records. We noticed this because we have a custom data type that throws an exception if it encounters a NULL (the fix for now was to eliminate the check).
But I was wondering whether it is necessary for the calls to convertToPHPValue() up front when ultimately the record ID might be NULL and the empty data set gets thrown in the trash, or could the code could be reworked to only convert the values for columns belonging to entities that will actually be hydrated?

@sips-richard I would say that the hydrator has to maintain the call to convertToPHPValue because the logic of what is done with a NULL value in the database might be reflected differently in PHP depending on the needs and the type implementation. But if I got your problem correctly, the number of calls might be lower with this patch.

@tucksaun Absolutely this will speed up our code, but I will try to clarify on our problem scenario because I think a change in the internals might be beneficial here.

Our ORM query included several left joins. Internally Doctrine was passing all the null values for any non-existent left join entities to the convertToPHPValue() method and only determining afterward that there was no relationship and then throwing these values away.

The convertToPHPValue method on the type for one of our columns did not permit null values and would throw an exception (because this would never happen in normal operation), but because Doctrine is hydrating null values for the empty left joins it was.

My question therefore is: can the logic be reworked in such a way that Doctrine could determine whether or not the ID columns represent a joined entity or an empty left join up front, prior to any hydration.

It would save a lot of function calls where there are empty left joins for a start.

@tucksaun tucksaun closed this Nov 15, 2022
@tucksaun tucksaun deleted the perf/optim-hydration-2.13 branch November 15, 2022 11:17
@tucksaun tucksaun restored the perf/optim-hydration-2.13 branch November 15, 2022 11:18
@tucksaun tucksaun reopened this Nov 15, 2022
@tucksaun tucksaun force-pushed the perf/optim-hydration-2.13 branch from 95213ea to 890d7d7 Compare November 15, 2022 11:18
Do not store $resultPointers in a property as this is slower to access than a local variable
Cache the result of $this->resultSetMapping() to avoid repeating null checks as well as method calls over and over again
Rework how the hydrator works for better performance by moving most of the column information gathering before the result iteration
so that the gathering is not done X times.
Additionally, this allows to group column info by dql alias which in turns allows to get easily the column list associated with a given dql alias.
This allows to conver the column data lazily, only when needed.
When iterating over a result set representing a high cardinality relation, this avoid using and converting over and over again each column for a given entity.
The only data we systematically need is the identifiers columns in order to determine if the entity is already present in the identity map.

Lowering the impact of hydration for such high cardinality relation lower the thresold where a 2 step hydration process is required.
PHP Reflection API is known to not be the fastest.
This is OK when only a couple of callls are made but here it is used in a tight loop whereas references to values fetched via Reflection are already available or not even used.

Moving the use as late as possible allows to optimize performance a bit.
The current implementation uses collections object being hydrated as array, this introduces several calls to ArrayAccess implementation
causing a noticeable overhead especially on big collections.
While those checks would seem necessary in a regular use of collections they seem a bit too much in the hydration process where we can check differently the integrity of the collection.
@tucksaun tucksaun force-pushed the perf/optim-hydration-2.13 branch from 890d7d7 to b7aaf10 Compare November 15, 2022 11:50
@NoiseByNorthwest
Copy link

It looks very close to the issue I've described here #8390
And I've shared a custom hydrator which fixes the useless "repeated rows hydration" with a simple caching logic.

Does my workaround make sense ? Should it be implemented into ORM ?

Copy link
Contributor

There hasn't been any activity on this pull request in the past 90 days, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 7 days.
If you want to continue working on it, please leave a comment.

@github-actions github-actions bot added the Stale label Nov 11, 2024
Copy link
Contributor

This pull request was closed due to inactivity.

@github-actions github-actions bot closed this Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants