docs(frontend): adding a use-case for fuzzy encrypted name comparison

zama-ai · Aug 21, 2024 · 956a3df · 956a3df
1 parent 971be1f
commit 956a3df
Show file tree

Hide file tree

Showing 3 changed files with 476 additions and 43 deletions.
diff --git a/frontends/concrete-python/examples/levenshtein_distance/IBAN_name_check.ipynb b/frontends/concrete-python/examples/levenshtein_distance/IBAN_name_check.ipynb
@@ -0,0 +1,356 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "bf2a3e8d-bf57-4841-a863-1eab96c91373",
+   "metadata": {},
+   "source": [
+    "# Comparing encrypted IBAN names\n",
+    "\n",
+    "When doing a transfer between Bank A and Bank B, Bank B has the obligation to check that the IBAN and the name of the recipient match. This is essential to combat frauds (fraudster impersonating someone else) and to avoid misdirected payments. Bank B would usually not reject a transfer if the name is close enough but doesn’t match exactly the recipient’s actual name. This is essential to make room for small spelling mistakes considering the impact of a rejected transfer (days / weeks of delays that can harm a business or a buyer, extra costs to handle the error, …). It is therefore important for Bank A to pre-check the name and inform the sender that the name is likely not matching, before initiating the transfer. For privacy reason however, it's better to do this pre-check over encrypted names.\n",
+    "\n",
+    "In this small tutorial, we show how to use our TFHE Levenshtein distance computations to perform such a privacy-preserving check, very simply and directly in Python. This tutorial can be easily configured, to change for example the way strings are normalized before encryption and comparison. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc96a80f-0b14-4e64-a33f-31a60351453d",
+   "metadata": {},
+   "source": [
+    "## Importing our FHE Levenshtein computations\n",
+    "\n",
+    "One can have a look to this file to see how the FHE computations are handled."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "56ba9e20-ca46-4aa6-a0f7-86ca13480a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from levenshtein_distance import *\n",
+    "from time import time"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "494cd58c-ea28-4547-92c5-80ed4ba83964",
+   "metadata": {},
+   "source": [
+    "## Define the comparison functions\n",
+    "\n",
+    "FHE computation will happen in `calculate_and_return`, if `fhe_or_simulate` is set to `fhe`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "2410b6b3-0c21-4178-b8dc-f734ec0afd40",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def normalized_string(st):\n",
+    "    \"\"\"Normalize a string, to later make that the distance between non-normalized\n",
+    "    string 'John Doe' and 'doe john' is small. This function can be configured depending\n",
+    "    on the needs.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    # Force lower case\n",
+    "    st = st.lower()\n",
+    "\n",
+    "    # Replace - and . by spaces\n",
+    "    st = st.replace('-', ' ')\n",
+    "    st = st.replace('.', ' ')\n",
+    "    \n",
+    "    # Sort the words and join\n",
+    "    words = st.split()\n",
+    "    st = \"\".join(sorted(words))\n",
+    "\n",
+    "    return st\n",
+    "\n",
+    "def compare_IBAN_names(string0: str, string1: str, fhe_or_simulate: str):\n",
+    "    \"\"\"Compare two IBAN names: first, normalize the strings, then compute in FHE (look in \n",
+    "    calculate_and_return for FHE details).\"\"\"\n",
+    "    # Normalize strings\n",
+    "    string0 = normalized_string(string0)\n",
+    "    string1 = normalized_string(string1)\n",
+    "    max_string_length = max(len(string0), len(string1))\n",
+    "\n",
+    "    alphabet = Alphabet.init_by_name(\"name\")\n",
+    "    levenshtein_distance = LevenshteinDistance(alphabet, max_string_length, show_mlir = False, show_optimizer = False)\n",
+    "    time_begin = time()\n",
+    "    distance = levenshtein_distance.calculate_and_return(string0, string1, mode=fhe_or_simulate)    \n",
+    "    time_end = time()\n",
+    "    \n",
+    "    max_len = max(len(string0), len(string1))\n",
+    "    similarity = (max_len - distance) / max_len\n",
+    "\n",
+    "    print(f\"Similarity between the two strings is {similarity:.4f}, computed in {time_end - time_begin: .2f} seconds\")\n",
+    "    return similarity"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9006416-8240-4d8b-be8d-9011547f4719",
+   "metadata": {},
+   "source": [
+    "This is the option to set to \"fhe\" to run computations in FHE. If you set it to \"simulate\", only simulation will be done, which is sufficient to debug what happens, but should not be used in production settings. Remark that computations in FHE can be long, especially if the strings are long. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "3b79df6b-8aff-4bfe-b119-e4aec78e04af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fhe_or_simulate = \"fhe\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4062d22f-ae05-4493-a1ab-6a6a16bbc1f3",
+   "metadata": {},
+   "source": [
+    "## Make a few comparisons in a private setting\n",
+    "\n",
+    "First, with equal strings, the match is perfect."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "60ccaded-7579-4bd4-a972-7eea98d5d585",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 1.0000, computed in  149.59 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "string0 = \"John Doe\"\n",
+    "string1 = \"John Doe\"\n",
+    "\n",
+    "assert compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate) == 1.0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f92bc91b-cd26-4ced-af4e-49811bea2353",
+   "metadata": {},
+   "source": [
+    "With reversed names, the match is also perfect, thanks to our definition of `normalized_string`. If it is a non-desired property, we can change it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "c9658e10-94dd-4e6a-8352-639493ac36f7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 1.0000, computed in  154.02 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "string0 = \"John Doe\"\n",
+    "string1 = \"Doe John\"\n",
+    "\n",
+    "assert compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate) == 1.0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c871b320-f93c-4fdb-9a70-5d423811961e",
+   "metadata": {},
+   "source": [
+    "With a typo, similarity is smaller, but still quite high."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "a822a188-a7ae-466f-8caa-15d91131fc5c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 0.8571, computed in  133.71 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "string0 = \"John Doe\"\n",
+    "string1 = \"John Do\"\n",
+    "\n",
+    "assert round(compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate), 2) == 0.86"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7c26654-08da-4755-8eba-25aef6d49e2a",
+   "metadata": {},
+   "source": [
+    "With an added letter, it is also high."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "fba38c06-d26a-4dc8-9442-d1f128068d1b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 0.8750, computed in  166.83 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "string0 = \"John Doe\"\n",
+    "string1 = \"John W Doe\"\n",
+    "\n",
+    "assert round(compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate), 2) == 0.88"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fab3e31c-5533-4983-a854-bfb9bb360611",
+   "metadata": {},
+   "source": [
+    "With the way we have normalized strings, we consider '-' and ' ' as equal."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "fc8a70c6-65ee-40c5-98a4-bbc681f2b873",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 1.0000, computed in  150.00 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "string0 = \"John Doe\"\n",
+    "string1 = \"John-Doe\"\n",
+    "\n",
+    "assert round(compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate), 2) == 1.0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7e2db4f0-e1ef-4726-b9d4-d9aedf20ad43",
+   "metadata": {},
+   "source": [
+    "Finally, with totally different names, we can see a very low similarity."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "330de097-fc30-4d46-b2bb-459ab8e00a27",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 0.1429, computed in  148.66 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "string0 = \"John Doe\"\n",
+    "string1 = \"Gill Cot\"\n",
+    "\n",
+    "assert round(compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate), 2) == 0.14"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "001c7c1e-37db-4488-925f-2c46a902d962",
+   "metadata": {},
+   "source": [
+    "Remark that, as we sort words in `normalized_string`, typos in the first letter can have bad impacts. It's not obvious to find a function which accepts word reordering but at the same time is not too impacted by mistakes on the first word letters. Choices can be done depending by the banks to fit their preference."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "c5600fde-3c42-4f52-ad0c-fa0ebda9b0cf",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Similarity between the two strings is 0.1429, computed in  155.03 seconds\n",
+      "Similarity between the two strings is 0.8571, computed in  148.72 seconds\n"
+     ]
+    }
+   ],
+   "source": [
+    "# One typo in the first letter\n",
+    "string0 = \"John Doe\"\n",
+    "string1 = \"John Poe\"\n",
+    "\n",
+    "assert round(compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate), 2) == 0.14\n",
+    "\n",
+    "# One typo in the last letter\n",
+    "string0 = \"John Doe\"\n",
+    "string1 = \"John Doy\"\n",
+    "\n",
+    "assert round(compare_IBAN_names(string0, string1, fhe_or_simulate = fhe_or_simulate), 2) == 0.86"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a61bc76-7c38-4251-bafd-51d7843dd3c7",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "We have shown how to use Levenshtein distances in FHE, to perform IBAN checks in a private way. And since the code is open-source and in Python, it's pretty easy for developers to modify it, to fine-tune it to their specific needs, eg in terms of string normalization."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}