Duplicate detection for CRM integrations #1347

jamesmacwhite · 2023-03-10T19:30:08Z

CRM systems have to deal with the problem of duplicate records. Key entities such as contacts and organisations, which are intended to be singular, often end up duplicated because of multiple data capture streams, particularly when running multiple campaigns or offering different products/services.

It could be advantageous for Formie to have duplicate detection when mapped against types like contacts, accounts/organisations to prevent duplicate records being created.

The actual way to check for duplicates would differ for each CRM integration and the API methods available, but extending the existing payload/send payload logic to incorporate duplicate checking is one possible way to achieve it. You could argue that deduplication should happen on the CRM side, but the counter-argument is that preventing duplicates from being created in the first place helps the overall process and creates less work on the CRM side.

For Dynamics 365 CRM, there are two ways to check for potential duplicates:

A filter query using parameters from the payload, e.g. firstname, lastname, emailaddress1.
The RetrieveDuplicates endpoint, which applies the duplicate detection rules defined in the CRM environment.

With either method, you query the result to determine whether a match was returned. Further checking may be needed to reduce false positives. If the record is deemed a duplicate, take the unique ID/GUID of the returned record and use it for any further relational data, rather than creating a new record. This would only apply to certain entities; others are intended to be duplicated, e.g. leads.

Further thoughts

Detecting duplicates based on the payload data could be tricky, given that any form could have any number of fields, which may or may not be present for use in a duplicate check.
Duplicate detection should be an opt-in setting if it existed, i.e. an additional setting on the mapping for relevant entities to turn the behaviour on or off, defaulting to off.
If a duplicate is found but the payload contains data that isn't currently present in the CRM, that data shouldn't be discarded; it should be written to the existing record instead.
Duplicate detection carries a risk of false positives, so the rules would need to be pretty robust to do this programmatically. Linking to the wrong contact is a real possibility.

engram-design · 2023-03-10T20:36:56Z

Maintainer

The main issue I see here is that the burden is on the user to define a bunch of things. What counts as a duplicate would need to be defined per-integration and per-object. How do we also determine what values to check against existing data? Is it email, first name + last name, etc? If so, those fields would need to be mapped and also be required in the submission.

We could maybe define these ourselves (contact requires email is easy enough), but things like leads get tricky to pick a unique value. I don't want to have to get users to define what fields are used to determine duplicate checks - this just complicates the mapping process for novice users.

Like your thoughts point out, there's a bunch of things to consider, and I can already see the support requests coming through with people complaining their things don't appear in the destination platform. Most integration platforms are smart enough to throw an error if a duplicate is made (and not allowed). This raises another point that sometimes duplicates are desired, which as you say should be opt-in.

It's something I'll consider, but it'll be a bit of work to implement for every integration + every data object.

jamesmacwhite · 2023-03-10T22:03:46Z

Author

It is certainly an advanced function, and I totally get your viewpoint on additional settings and configuration. I'd be inclined to treat this as an advanced feature and make it less visible by default. For me, duplicate detection is for singular entities such as contacts and organisations. Leads will often be duplicated, so I don't think it should be applied to them. Much like a relational database, one contact can have many leads, so having multiple leads is absolutely right to me. Duplicates are sometimes required, which is why this should be an opt-in feature.

Scenario for creating many duplicates: you offer a range of courses, and each course page has a "register your interest" form which generates a lead. Each course page passes its own element ID to the form as a hidden field, so the course entry is captured against the form submission and sent to the CRM with the course details. Someone could be interested in several courses, so they submit three forms on three different course pages. You've now got three contact records with the same data, with each lead linked to a different one. You could argue for a single form with a dropdown of course options instead, but the idea is that you aren't presenting the course as a selectable field; you are populating it automatically, because you know they are on that course page at that time.

To end up with a single contact and all leads under that parent, these records now need to be merged. With a duplicate check, the first form submission would have created a record, because the check would have returned no match (assuming this person has never been seen before). The other two submissions would go through the same check and would most likely have been prevented from creating a contact, instead obtaining the unique ID/GUID of the first contact record to link to. The lead data then goes to the original contact record, not a duplicate one.

Duplicate rules need to be very robust for sure. I have looked at the rules active in our own CRM environment and I don't trust them enough, so even the official RetrieveDuplicates endpoint is out. If we were to do it, I'd build our own lookup query and err on the side of caution by requiring a minimum of three personally identifiable markers, e.g. First Name, Last Name, Email or First Name, Last Name, Date of birth. It would likely be too risky to use email alone: what if two people shared the same email but had different first and last names? Requiring more markers makes the duplicate check more robust at the cost of missing some simpler duplicates, but the objective would be to reduce duplicates, not remove them entirely, given there will always be a data management requirement for any CRM.

Dynamics 365 CRM can outright block duplicates with a simple header on the create request. However, it's off by default, likely for good reason: in most cases you don't want to stop the data from being created, but you might want to check for an existing record first. That basically means implementing some form of search/lookup on the entity payload data before sending a POST to create. Of course, that's easier said than done across the various areas.
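
For reference, the header in question appears to be MSCRM.SuppressDuplicateDetection, which the Dynamics 365 Web API documents as suppressing duplicate detection unless it is set to false. A minimal sketch of building the headers for a create request follows. buildCreateHeaders() is a hypothetical helper for illustration and not part of Formie:

```php
<?php

/**
 * Hypothetical helper (not part of Formie): builds the HTTP headers for a
 * Dynamics 365 Web API create (POST) request.
 *
 * When $detectDuplicates is true, the documented header
 * MSCRM.SuppressDuplicateDetection is sent with the value "false", asking the
 * CRM to apply its own duplicate detection rules to the create.
 */
function buildCreateHeaders(bool $detectDuplicates): array
{
    $headers = [
        'Content-Type' => 'application/json',
    ];

    if ($detectDuplicates) {
        // "false" means: do NOT suppress duplicate detection
        $headers['MSCRM.SuppressDuplicateDetection'] = 'false';
    }

    return $headers;
}
```

With the header present, the CRM rejects a create that matches one of its duplicate detection rules, so the integration would need to handle that error rather than look up and reuse the existing record.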

It's possible this borders on bespoke business requirements, in which case extending the CRM integration class and building it into a custom integration is the more likely route, but I thought I'd create a discussion to gather thoughts. I'm primarily looking at it from a Dynamics 365 CRM angle, and I'm not sure how many Formie users are extending the Dynamics 365 CRM integration beyond the out-of-the-box version. Likewise, other CRM products like Salesforce, Pardot etc. may handle the problem better themselves, but I'll leave it here for anyone else to comment.

1 reply

engram-design Mar 10, 2023
Maintainer

I'll say from experience that Salesforce doesn't like duplicate leads, so we actually create a note if it encounters a duplicate. That gets tricky, because should we actually update an already existing lead? How do we define what to match on? Assume there's an email or similar?

Good to open up the discussion for sure.

jamesmacwhite · 2023-03-11T10:34:39Z

Author

I've tested a working proof of concept. It is very opinionated towards Dynamics 365, given it's implemented directly against it, but it is perhaps a starting point for a future wider CRM feature, or at least provides some theory and implementation ideas if this was something to be added.

The sendPayload function in the CRM integration class can be modified to perform the duplicate check, given this is the point where a record is created and is likely the most suitable place to implement it. An additional service is needed to perform the duplicate check so it can be reused across multiple entities. I'm calling this from a module. The benefit of doing it in sendPayload is that you have easy access to the payload values, which you need both for the duplicate match and possibly for sending an update to an existing record rather than a create. The other side to this is making sure the form data is not discarded if it contains information that wasn't already there.

Example of modifying the contact map logic to include a duplicate check, while maintaining the existing create new record logic.

if ($this->mapToContact) {
    $contactPayload = $contactValues;

    // Only look for an existing contact when the per-form option is enabled
    $duplicateRecord = $this->duplicateContactCheck
        ? $myService->findExistingRecord('contacts', $contactPayload)
        : false;

    if ($duplicateRecord !== false) {
        // Match found: PATCH the existing record instead of creating a new one
        $response = $this->deliverPayload($submission, "contacts($duplicateRecord)?\$select=contactid", $contactPayload, 'PATCH');
    } else {
        // No match (or check disabled): create a new contact as before
        $response = $this->deliverPayload($submission, 'contacts?$select=contactid', $contactPayload);
    }

    // Unchanged from the existing create logic
    if ($response === false) {
        return true;
    }

    $contactId = $response['contactid'] ?? '';

    if (!$contactId) {
        Integration::error($this, Craft::t('nottingham-college-module', 'Missing return contactid {response}. Sent payload {payload}', [
            'response' => Json::encode($response),
            'payload' => Json::encode($contactPayload),
        ]), true);

        return false;
    }
}

Adding the extra properties of duplicateContactCheck and duplicateAccountCheck allows this behaviour to be limited to an entity and be on a per form basis.

One benefit of the Dynamics 365 API is that you can send a PATCH request when targeting a specific record, so only the values present in the payload are modified, not the entire record, which is much safer. Given a match was found, some of the data will already be present on the contact record, so you can update it without risk, along with any other data passed over that wasn't already there. As far as I can tell, any null values in the mapping are stripped at this stage, so there is less risk of removing data accidentally.

There's also an extra defence: you can prevent an upsert-type request from creating a record if it didn't already exist.
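
The defence referred to here is presumably the If-Match: * header, which the Dynamics 365 Web API documents as the way to stop a PATCH (upsert) from creating a record when the targeted one does not exist. A minimal sketch, where withUpdateOnlyHeader() is a hypothetical helper name used for illustration:

```php
<?php

/**
 * Hypothetical helper (not part of Formie): adds the If-Match header to a set
 * of request headers so that a PATCH against entity(GUID) can only update an
 * existing record and never create a new one.
 */
function withUpdateOnlyHeader(array $headers = []): array
{
    // "*" matches any existing version of the record; if no record exists,
    // the request fails instead of falling back to a create
    $headers['If-Match'] = '*';

    return $headers;
}
```

This would only be attached to the PATCH sent after a duplicate match, so a stale or mistyped GUID results in an error rather than an unintended new contact.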

The findExistingRecord() function then uses one or more match rules on properties that would potentially be present, although they are not guaranteed to be. A check has to be done to determine whether all the properties for a rule are present before querying on them. If not, it returns false, given we can't reliably match the data without all of them. For safety, we must have a reasonable minimum of data present to make an accurate judgement on a duplicate. Dynamics 365 duplicate rules can afford to be broader, because the usual process is to run a duplicate query, get results back and make a judgement on the possible duplicates reported. That review is done by a human, and some of the suggested records might not be duplicates at all. Given we are trusting a lookup result automatically, we have to be more careful.

In the findExistingRecord function, we can define match rules and check the payload to make sure the properties are present. If not all the data for a rule is available, the rule is skipped, because the results would be too broad.

public function findExistingRecord(string $entity, array $payload): bool|string
{
    // Properties to use for checking entities for duplicates.
    // Each inner array is one match rule; every field in it must be present.
    $propertiesForMatching = [
        'contacts' => [
            ['firstname', 'lastname', 'birthdate'],
            ['firstname', 'lastname', 'emailaddress1'],
        ],
        'accounts' => [
            ['name', 'address1_postalcode'],
        ],
    ];

    $matchRules = $propertiesForMatching[$entity] ?? null;

    if (!$matchRules) {
        return false;
    }

    // Stays null if every rule is skipped or no rule returns a match
    $existingRecord = null;

    foreach ($matchRules as $fields) {
        $fieldsInPayload = count(array_intersect_key(array_flip($fields), $payload));

        // If we don't have all the properties for matching, skip the rule
        if ($fieldsInPayload !== count($fields)) {
            continue;
        }

        // Make sure we always query active records only
        $filter = ['statecode eq 0'];

        foreach ($fields as $field) {
            // Single quotes inside an OData string literal must be doubled (e.g. O'Brien)
            $value = str_replace("'", "''", (string)$payload[$field]);

            // Date values use a different comparison method
            if ($field === 'birthdate') {
                $filter[] = "Microsoft.Dynamics.CRM.On(PropertyName='$field',PropertyValue='$value')";
            } else {
                $filter[] = "$field eq '$value'";
            }
        }

        $response = $this->getCrmIntegration()?->request('GET', $entity, [
            'query' => [
                '$select' => implode(',', $fields),
                '$filter' => implode(' and ', $filter),
            ],
        ]);

        // Take the first record returned for this rule, if any
        $existingRecord = $response['value'][0] ?? null;

        if ($existingRecord) {
            break;
        }
    }

    // GUID exists under key accountid, contactid etc...
    $entityId = rtrim($entity, 's') . 'id';

    return $existingRecord[$entityId] ?? false;
}

Using the handy Formie::$plugin->getIntegrations()->getIntegrationByHandle('crmIntegration'); we can obtain an authenticated client from anywhere it's needed, to perform custom Dynamics 365 API requests in a service class.

Further thoughts

What happens when a duplicate query returns more than one result? Technically, all the records returned would have met the criteria, even with the extra active-only check. You could always link to the first result, and leave removing the other duplicate records to the usual data deduplication process.
Extending the existing integration settings to toggle the duplicate check on or off per entity is probably advisable. If there are edge cases or problems, being able to turn it off without modifying the integration class seems safer.
First Name, Last Name and Birthday technically still has a very small chance of matching two different people.
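
On the first point, a stricter alternative to linking to the first result would be to accept a match only when the query returns exactly one record, and otherwise fall back to creating a new one. This is just a sketch of that trade-off, not what the proof of concept above does. pickSingleMatch() is a hypothetical helper, and $records stands for the value array of the Dynamics 365 response:

```php
<?php

/**
 * Hypothetical helper: returns the GUID of a matched record only when the
 * lookup was unambiguous (exactly one record returned), otherwise null.
 *
 * @param array  $records The "value" array from the lookup response
 * @param string $idKey   The primary key name, e.g. "contactid"
 */
function pickSingleMatch(array $records, string $idKey): ?string
{
    // Zero matches means nothing to link to; several means we cannot tell
    // which record is the right one
    if (count($records) !== 1) {
        return null;
    }

    $id = $records[0][$idKey] ?? null;

    return is_string($id) && $id !== '' ? $id : null;
}
```

The cost is that existing duplicates in the CRM keep attracting new ones until they are merged, so whether this is preferable depends on how much a wrong link would hurt compared with one more duplicate.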

Just some thoughts and testing.
