Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The hydration cost can easily & significantly be decreased in case of repeated row data + costly type conversion #8390

Open
NoiseByNorthwest opened this issue Dec 13, 2020 · 2 comments

Comments

@NoiseByNorthwest
Copy link

The well known performance issue with fetch joins on collection-like relations (*-to-many) is that the result set will contain repeated data. The simple case being an entity with one of its collection relations fetched which will then cause the (root) entity data to be repeated as many times as there are items in the collection.

It get obviously exponentially worse when nested or sibling collections are also fetched. This issue could BTW be avoided with a feature like this one (not merged yet) #1569 .

That being said, I've discovered another performance issue last week, which does not need thousands of rows to be noticeable. It occurs when the repeated data are mapped to a type with a costly conversion. In my case this is a json_document column, this type is provided by https://github.com/dunglas/doctrine-json-odm .

More concretly my query targets the table A, and fetch a collection of table B records. A contains a json_document column and for each A record there are ~66 B records.

The result set then contains 526 rows representing 526 distinct B records and only 8 distinct A records.

The query was to slow to meet my requirements and after having profiled that (in tracing mode to get actual call counts & sampling mode to get accurate timings) I've discovered that 70% of the hydration time is taken by JsonDocumentType::convertToPHPValue() which is called 526 times (instead of 8 as logically required).

And this slowness has obviously more to do with the fact that JsonDocumentType::convertToPHPValue() is called 526 times where it could be called only 8 times than with the own slowness of JsonDocumentType::convertToPHPValue() which is a bit expected and with probably not so much room for optimizations.

The problem lies in the fact that AbstractHydrator::gatherRowData() which is, among other things, responsible to call the proper type conversion function for each column of the given row, does not do any caching logic to avoid doing this process multiple times for the same logical set of column values (representing the root or a fetched relation AKA DQL alias).

So I've written my own hydrator, which extends ObjectHydrator and after having copy/pasted the gatherRowData() function I've added this caching logic.

When used, I've observed a 78% decrease of the hydrateAllData() cost. Here is a table below with more details:

Metric Original hydrator Custom hydrator Decrease (%)
hydrateAllData() wall time (ms) 232 52 -78%
gatherRowData() call count 526 526 -
gatherRowData() wall time (ms) 200 26 -87%
JsonDocumentType::convertToPHPValue() call count 526 8 -
JsonDocumentType::convertToPHPValue() wall time (ms) 162 5 -97%

My questions:

  • am I doing something wrong here ?
  • if not, shouldn't this caching logic be added to the original gatherRowData() implementation (considering its simplicity and the fact that this use case may not be so uncommon) ?

The custom hydrator (original gatherRowData() implementation):

<?php

namespace App\Framework\DoctrineExtensions\ORM;

use Doctrine\ORM\Internal\Hydration\ObjectHydrator;

class OptimizedObjectHydrator extends ObjectHydrator
{
    protected function gatherRowData(array $data, array &$id, array &$nonemptyComponents)
    {
        $rowData = ['data' => []];

        // this first $data traversal is required in order to have the ids resolved ahead of the
        // rest of the processing since we need them in order to compute the cache keys
        foreach ($data as $key => $value) {
            if (($cacheKeyInfo = $this->hydrateColumnInfo($key)) === null) {
                continue;
            }

            switch (true) {
                case isset($cacheKeyInfo['isNewObjectParameter']):
                case isset($cacheKeyInfo['isScalar']):
                    break;

                default:
                    $dqlAlias = $cacheKeyInfo['dqlAlias'];

                    if ($cacheKeyInfo['isIdentifier'] && $value !== null) {
                        $id[$dqlAlias] .= '|' . $value;
                        $nonemptyComponents[$dqlAlias] = true;
                    }

                    break;
            }
        }

        // this flag serves as a caching toggle
        $cachingEnabled = true;
        if ($cachingEnabled) {
            // building the cache keys
            $processedRowCacheKeys = [];
            foreach ($id as $dqlAlias => $entityId) {
                $processedRowCacheKeys[$dqlAlias] = \sprintf(
                    '%s::%s::%s::%s',
                    self::class,
                    'processedRows',
                    $dqlAlias,
                    $entityId
                );
            }

            // pre-fill $rowData['data'] from cache if HIT for each DQL alias
            foreach ($id as $dqlAlias => $entityId) {
                if (isset($this->_cache[$processedRowCacheKeys[$dqlAlias]])) {
                    $rowData['data'][$dqlAlias] = $this->_cache[$processedRowCacheKeys[$dqlAlias]];
                }
            }
        }

        foreach ($data as $key => $value) {
            if (($cacheKeyInfo = $this->hydrateColumnInfo($key)) === null) {
                continue;
            }

            $fieldName = $cacheKeyInfo['fieldName'];

            switch (true) {
                case isset($cacheKeyInfo['isNewObjectParameter']):
                    $argIndex = $cacheKeyInfo['argIndex'];
                    $objIndex = $cacheKeyInfo['objIndex'];
                    $type = $cacheKeyInfo['type'];
                    $value = $type->convertToPHPValue($value, $this->_platform);

                    $rowData['newObjects'][$objIndex]['class'] = $cacheKeyInfo['class'];
                    $rowData['newObjects'][$objIndex]['args'][$argIndex] = $value;

                    break;

                case isset($cacheKeyInfo['isScalar']):
                    $type = $cacheKeyInfo['type'];
                    $value = $type->convertToPHPValue($value, $this->_platform);

                    $rowData['scalars'][$fieldName] = $value;

                    break;

                //case (isset($cacheKeyInfo['isMetaColumn'])):
                default:
                    $dqlAlias = $cacheKeyInfo['dqlAlias'];
                    $type = $cacheKeyInfo['type'];

                    // If there are field name collisions in the child class, then we need
                    // to only hydrate if we are looking at the correct discriminator value
                    if (isset($cacheKeyInfo['discriminatorColumn'], $data[$cacheKeyInfo['discriminatorColumn']])
                        && !\in_array((string) $data[$cacheKeyInfo['discriminatorColumn']], $cacheKeyInfo['discriminatorValues'], true)
                    ) {
                        break;
                    }

                    // in an inheritance hierarchy the same field could be defined several times.
                    // We overwrite this value so long we don't have a non-null value, that value we keep.
                    // Per definition it cannot be that a field is defined several times and has several values.
                    // /!\ This test is also required to take advantage of the processed row cache.
                    if (isset($rowData['data'][$dqlAlias][$fieldName])) {
                        break;
                    }

                    $rowData['data'][$dqlAlias][$fieldName] = $type
                        ? $type->convertToPHPValue($value, $this->_platform)
                        : $value;

                    break;
            }
        }

        if ($cachingEnabled) {
            // save $rowData['data'] to cache if MISS for each DQL alias
            foreach ($id as $dqlAlias => $entityId) {
                if (!isset($this->_cache[$processedRowCacheKeys[$dqlAlias]])) {
                    $this->_cache[$processedRowCacheKeys[$dqlAlias]] = $rowData['data'][$dqlAlias];
                }
            }
        }

        return $rowData;
    }
}

The Doctrine bundle configuration:

doctrine:
    orm:
        hydrators:
            OptimizedObjectHydrator: App\Framework\DoctrineExtensions\ORM\OptimizedObjectHydrator

When querying:

$entities = $query
    ->setHint(Query::HINT_INCLUDE_META_COLUMNS, true)
    ->getResult('OptimizedObjectHydrator')
;
@holtkamp
Copy link
Contributor

Not totally sure it is related, but concerning the costly conversion it be nice to have a look at creof/doctrine2-spatial#121 and #1241

@NoiseByNorthwest
Copy link
Author

Yes in both cases there is a costly conversion involved, but in my case the conversion cost on its own is not noticeable. It is actually noticeable because it is unnecessarily repeated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants