MC-21651: Proposal to refactor customer import #323
larsroettig wants to merge 1 commit into magento:master from larsroettig:refactor-customer-import

# MC-21651 Proposal to refactor the customer and address import depending on AUTO_INCREMENT and showTableStatus

## Current usage in customer import (Open Source Edition)

* `\Magento\CustomerImportExport\Model\Import\Address::_getNextEntityId()` **@api**
  * Only called when no address id is specified for the address that is being imported. The result is then used to
    predict the id of the address that is being imported. The `AUTO_INCREMENT` value is only fetched once, if it has
    not been fetched before. Every further call returns the cached value and increments it (post-increment) for the
    next call. This can lead to off-by-N issues.
* `\Magento\CustomerImportExport\Model\Import\Customer::_getNextEntityId()` **@api**
  * Only called when no customer id is specified for the customer that is being imported. The result is then used
    to predict the id of the customer that is being imported.
    The `AUTO_INCREMENT` value is only fetched once, if it has not been fetched before. Every further call returns the
    cached value and increments it (post-increment) for the next call. This can lead to off-by-N issues; see the
    sketch of this pattern below.
* `\Magento\CustomerImportExport\Model\Import\CustomerComposite::__construct()` **@api**
  * The result is used to predict the next available customer id in case it is not already specified in the
    optional constructor parameter `$data`.

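For reference, both `_getNextEntityId()` implementations follow roughly the pattern below (paraphrased; see the module source for the exact code). The `AUTO_INCREMENT` value comes from `SHOW TABLE STATUS` and is afterwards only incremented in PHP, so any row inserted by another process in the meantime makes the predicted ids drift (the off-by-N issue).

```php
protected function _getNextEntityId()
{
    if (!$this->_nextEntityId) {
        // Fetched once per import run via SHOW TABLE STATUS (showTableStatus).
        $this->_nextEntityId = $this->_resourceHelper->getNextAutoincrement($this->_entityTable);
    }
    // Cached value, post-incremented in PHP only; concurrent inserts are not seen.
    return $this->_nextEntityId++;
}
```
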
## Ideas for Refactoring

### 1. Use the `Sequence` implementation from the Framework

Every sequence implements the API `Magento\Framework\DB\Sequence\SequenceInterface`.

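For reference, the interface exposes two methods; the sketch below reflects its shape (see the framework source for the authoritative definition):

```php
namespace Magento\Framework\DB\Sequence;

interface SequenceInterface
{
    /**
     * Retrieve the current value of the sequence.
     */
    public function getCurrentValue();

    /**
     * Reserve and retrieve the next value of the sequence.
     */
    public function getNextValue();
}
```
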
### Benchmark: importing 1,000 customers, median of 3 measurements

The benchmarking process is as follows:

1. Delete all customers
1. In the admin section navigate to _System > Import_
1. In the field _Entity Type_ select "Customers Main File"
1. In the field _Import Behavior_ select "Add/Update Complex Data"
1. In the field _Select File to Import_ upload the file containing the customer data
1. Hit the button _Check Data_
1. Start measuring time
1. Hit the button _Import_
1. Stop measuring time
1. Start over

**System setup:**

* NginX 1.14.0
* PHP-FPM 7.2.9 with Zend OPcache v7.3.9
* Magento in Developer Mode
* MySQL 5.6

#### Unmodified

* Time: 3.86 sec
* IO: 245 ms
* CPU: 3.62 sec
* Network: 1.05 MB
* SQL queries: 200 ms, 87 requests

Concept 1 - Use a `class CustomerSequence implements SequenceInterface` implementation
----------------------

```php
protected function _getNextEntityId()
{
    return $this->sequence->getNextValue();
}
```

* Time: 13.3 sec
* IO: 4.64 sec
* CPU: 8.66 sec
* Network: 1.08 MB
* SQL queries: 4.78 sec, 1,085 requests

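A minimal sketch of what the `CustomerSequence` named in this concept could look like, loosely modelled on the existing sales sequence pattern; the constructor arguments and the `sequence_customer` table name are assumptions, not a final design:

```php
class CustomerSequence implements \Magento\Framework\DB\Sequence\SequenceInterface
{
    private $resource;
    private $connectionName;
    private $sequenceTable;
    private $lastValue;

    public function __construct(
        \Magento\Framework\App\ResourceConnection $resource,
        string $connectionName = 'default',
        string $sequenceTable = 'sequence_customer'
    ) {
        $this->resource = $resource;
        $this->connectionName = $connectionName;
        $this->sequenceTable = $sequenceTable;
    }

    public function getCurrentValue()
    {
        // Last value reserved by getNextValue() in this process.
        return $this->lastValue;
    }

    public function getNextValue()
    {
        // Reserve the next value by inserting an empty row into the
        // auto-increment sequence table and reading back the generated id.
        $connection = $this->resource->getConnection($this->connectionName);
        $table = $this->resource->getTableName($this->sequenceTable);
        $connection->insert($table, []);
        $this->lastValue = (int)$connection->lastInsertId($table);

        return $this->lastValue;
    }
}
```
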
**Advantage:**

- For the import we only need to add an optional parameter to `\Magento\CustomerImportExport\Model\Import\Customer::__construct()`
- More flexibility

**Disadvantage:**

- Slower because it triggers many more queries; this gets worse when an *AWS* database cluster is used for the import
- We need to maintain the implementation ourselves. Maybe it is better to use the `MetadataPool` implementation from the core

Concept 2 - `Magento\Framework\EntityManager\MetadataPool` implementation
----------------------

```php
protected function _getNextEntityId()
{
    return $this->metadataPool->getMetadata(CustomerInterface::class)->generateIdentifier();
}
```

**di.xml:**

```xml
<type name="Magento\Framework\EntityManager\MetadataPool">
    <arguments>
        <argument name="metadata" xsi:type="array">
            <item name="Magento\Customer\Api\Data\CustomerInterface" xsi:type="array">
                <item name="sequenceTable" xsi:type="string">sequence_customer</item>
            </item>
        </argument>
    </arguments>
</type>
```

* Time: 10.6 sec
* IO: 4.3 sec
* CPU: 6.3 sec
* Network: 1.08 MB
* SQL queries: 4.4 sec, 1,085 requests

**Advantage:**

- For the import we only need to add an optional parameter to `\Magento\CustomerImportExport\Model\Import\Customer::__construct()`

**Disadvantage:**

- Slower because it triggers many more queries; this gets worse when an *AWS* database cluster is used for the import

### Summary of Concept 1 and Concept 2

Both concepts add two queries per new customer.

If we now think about importing 200,000 new customers on **AWS**, this is much slower than before. For these reasons we
need a concept that helps us to optimize the queries that we send to MySQL.

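To put the two extra queries per customer into perspective: importing 200,000 new customers adds roughly 400,000 extra queries. Assuming (as a rough estimate, not measured here) about 1 ms of network round trip per query to a remote AWS database, that alone is around 400 seconds spent purely on generating ids.
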
Concept 3 - optimize queries
----------

We need to split the import bunch by whether we already know the customer; we can achieve this with a query like the one below.

To implement this we need to move **magento2ce/app/code/Magento/CustomerImportExport/Model/Import/Customer.php**:537-555 into two service classes.

```mysql
SELECT entity_id, email
FROM customer_entity
WHERE CONCAT(email, website_id) IN (
    'malcolm85@gmail.com1',
    'khuel@yahoo.com1',
    'newcustomer@yahoo.com1'
);
```

**Result:**

| entity\_id | email               |
| :--------- | :------------------ |
| 4290       | khuel@yahoo.com     |
| 4288       | malcolm85@gmail.com |

This helps to find all customers that have already been created in Magento. With this information we can refactor `protected function _importData()`:

```php
while ($bunch = $this->_dataSourceModel->getNextBunch()) {
    $this->prepareCustomerData($bunch);
    $entitiesToCreate = [];
    $entitiesToUpdate = [];
    $entitiesToDelete = [];
    $attributesToSave = [];

    $bunchDataByMail = [];
    $customerAddresses = [];
    foreach ($bunch as $rowNumber => $rowData) {
        if (!$this->validateRow($rowData, $rowNumber)) {
            continue;
        }
        if ($this->getErrorAggregator()->hasToBeTerminated()) {
            $this->getErrorAggregator()->addRowToSkip($rowNumber);
            continue;
        }
        $email = $rowData[self::COLUMN_EMAIL];
        $customerAddresses[] = $email . $rowData['website_id'];
        $bunchDataByMail[$email] = $rowData;
    }

    $query = $this->_connection->select()->from(
        'customer_entity',
        ['entity_id', 'email']
    )->where('CONCAT(email,website_id) IN (?)', $customerAddresses);

    if ($this->getBehavior($rowData) == Import::BEHAVIOR_DELETE) {
        $entitiesToDelete = $this->_connection->fetchCol($query);
    } elseif ($this->getBehavior($rowData) == Import::BEHAVIOR_ADD_UPDATE) {
        $entitiesToUpdate = $this->_connection->fetchAll($query);

        /* filter $bunchDataByMail by $entitiesToUpdate and split the rows
           into the two arrays $entitiesToUpdate and $entitiesToCreate */
    }

    // ... save the collected entities (create/update/delete) as before ...
}
```

With these two arrays we can use the rows, keyed by email, to build the data for the import.

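A minimal sketch of the split mentioned in the code comment above, assuming `$entitiesToUpdate` holds the rows returned by the lookup query and `$bunchDataByMail` is keyed by email (variable names are illustrative only):

```php
// Index the existing customers by (lower-cased) email for a cheap lookup.
$existingIdByEmail = [];
foreach ($entitiesToUpdate as $existingRow) {
    $existingIdByEmail[strtolower($existingRow['email'])] = (int)$existingRow['entity_id'];
}

// Split the validated bunch rows into "update existing" and "create new".
$rowsToUpdate = [];
$rowsToCreate = [];
foreach ($bunchDataByMail as $email => $rowData) {
    $key = strtolower($email);
    if (isset($existingIdByEmail[$key])) {
        $rowsToUpdate[$existingIdByEmail[$key]] = $rowData; // keyed by existing entity_id
    } else {
        $rowsToCreate[] = $rowData; // needs a newly generated id
    }
}
```
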
To generate the new ids I recommend adding the following functions:

`Sequence`

```php
public function getNextValueBunch(int $count): array
{
    $currentValue = $this->getCurrentValue();
    $nextValue = $currentValue + $count;
    // Reserve the whole range at once by writing the new upper bound to the sequence table.
    $this->resource->getConnection($this->connectionName)
        ->insert($this->resource->getTableName($this->sequenceTable), ['sequence_value' => $nextValue]);
    // Return only the newly reserved values, excluding the already used current value.
    return range($currentValue + 1, $nextValue);
}
```

`EntityMetadata`

```php
public function generateIdentifierBunch(int $count): array
{
    $nextIdentifiers = [];
    if ($this->sequence) {
        $nextIdentifiers = $this->sequence->getNextValueBunch($count);
    }
    return $nextIdentifiers;
}
```

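A hypothetical usage inside `_importData()`, assuming the split from Concept 3 produced a `$rowsToCreate` array; the helper `prepareEntityRow()` is made up for illustration:

```php
// Reserve exactly one new id per row that creates a customer in a single call,
// then consume the reserved ids while building the insert rows.
$newIds = $this->metadataPool
    ->getMetadata(CustomerInterface::class)
    ->generateIdentifierBunch(count($rowsToCreate));

foreach ($rowsToCreate as $rowData) {
    $entityId = array_shift($newIds);
    $entitiesToCreate[] = $this->prepareEntityRow($entityId, $rowData); // hypothetical helper
}
```
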
**Advantage:**

* We can apply this approach to both the Address and the Customer import, which should improve performance and stability

Concept 4 - Stored Function to generate Sequences
----------------------

As a basis I used the following technical article: https://www.percona.com/blog/2008/04/02/stored-function-to-generate-sequences/

The SQL I ran before starting the performance testing:

```mysql
CREATE TABLE `sequence` (
  `type` varchar(20) NOT NULL,
  `value` int(10) unsigned NOT NULL,
  PRIMARY KEY (`type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO sequence VALUES ('customer', 1);

DELIMITER //
CREATE FUNCTION sequence(seq_type char(20)) RETURNS int
BEGIN
  UPDATE sequence SET value = last_insert_id(value + 1) WHERE type = seq_type;
  RETURN last_insert_id();
END
//
DELIMITER ;
```

Implementation of `_getNextEntityId()`:

```php
protected function _getNextEntityId()
{
    return $this->_connection->query("select sequence('customer')")->fetchColumn();
}
```

* Time: 3.92 sec
* IO: 511 ms
* CPU: 3.4 sec
* Network: 1.15 MB
* SQL queries: 538.5 ms, 1,084 requests

## Summary

For all 4 concepts we need a migration from the `customer_entity` table into the sequence table, something like
`migrateSequenceColumnData(customer_entity, entity_id)`. I think we will get the most benefit from implementing
Concept 3, because we send fewer queries to the database.
All concepts can be implemented in a backward compatible way, because we only touch constructors and protected functions
of the import.

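A minimal sketch of what such a migration helper could look like, assuming the `sequence_customer` table from Concept 2 with an auto-increment `sequence_value` column; the method name and its surroundings are only an idea, not existing API:

```php
/**
 * Illustrative sketch only: seed the sequence table with the current maximum
 * entity id so that newly generated ids cannot collide with existing rows.
 */
public function migrateSequenceColumnData(string $entityTable, string $identifierColumn): void
{
    $connection = $this->resource->getConnection();
    $select = $connection->select()->from(
        $this->resource->getTableName($entityTable),
        [new \Zend_Db_Expr('MAX(' . $identifierColumn . ')')]
    );
    $maxId = (int)$connection->fetchOne($select);

    // Assumed sequence table/column from Concept 2; adjust to the final schema.
    $connection->insertOnDuplicate(
        $this->resource->getTableName('sequence_customer'),
        ['sequence_value' => $maxId],
        ['sequence_value']
    );
}
```
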
Ideally yes, but in reality many extensions/customizations depend on the DB structure. As a result, https://devdocs.magento.com/guides/v2.3/extension-dev-guide/versioning/codebase-changes.html classifies DB schema changes as MINOR/MAJOR changes. This should be taken into account.

For distributed systems it will not be possible to scale this kind of solution, where we rely on the database itself to generate ids; we need to control at the application level how we want to generate the IDs. Twitter was using a similar MySQL sequence based on AUTO_INCREMENT, but realized it was not possible to scale this approach.

I would suggest looking into Twitter Snowflake:
https://github.com/twitter-archive/snowflake/tree/snowflake-2010

Another article on this subject:
https://www.callicoder.com/distributed-unique-id-sequence-number-generator/

Although many Magento customers currently run on a single-node database, in the cloud we would need to support large-scale bulk operations in a distributed style, in which case we would want to allow multiple threads pushing data; so an optimized service that can handle this scale, and does not rely on hitting the database for ids, would be the better approach.

Snowflake uses 8-byte unsigned integers, while Magento currently uses 4-byte unsigned integers; I am not sure whether that would be a BIC, it seems like a PATCH according to https://devdocs.magento.com/guides/v2.3/extension-dev-guide/versioning/codebase-changes.html.

But the idea is to centralize the ID generation and take it outside of the database itself; the service should be as scalable as the other services to support the load. This service can provide bulk IDs as well, and many implementations I have seen use Redis or a similar cache to store that bunch of IDs in order to avoid unnecessary calls to the ID generation service. We may still be able to use Magento's `SequenceInterface`, while under the hood calling this new distributed ID service.

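For illustration, a minimal sketch of the Snowflake-style 64-bit layout referenced above (41-bit millisecond timestamp, 10-bit worker id, 12-bit per-millisecond sequence); this is only the bit arithmetic, not existing Magento or Twitter code:

```php
/**
 * Illustrative sketch of a Snowflake-style 64-bit id.
 */
function snowflakeId(int $workerId, int $sequence, ?int $timestampMs = null): int
{
    $epoch = 1288834974657;                            // custom epoch in ms (Twitter's)
    $timestampMs = $timestampMs ?? (int)(microtime(true) * 1000);

    return (($timestampMs - $epoch) << 22)             // 41 bits: ms since custom epoch
        | (($workerId & 0x3FF) << 12)                  // 10 bits: worker/machine id
        | ($sequence & 0xFFF);                         // 12 bits: per-millisecond counter
}

// Example: two ids generated in the same millisecond on worker 7 stay unique.
echo snowflakeId(7, 0), PHP_EOL;
echo snowflakeId(7, 1), PHP_EOL;
```
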
@larsroettig please review my comment above and let me know your thoughts.