diff --git a/services/data-hub/README.md b/services/data-hub/README.md
index 3dd233963..3d8112134 100644
--- a/services/data-hub/README.md
+++ b/services/data-hub/README.md
@@ -42,6 +42,10 @@ VERSION=latest npm run build:docker
 
 Kubernetes descriptors can be found in the [k8s](./k8s) directory.
 
+## Import & Merge
+
+Find more details [here](./src/minhash-poc/README.md)
+
 ## Environment variables
 
 ### General
diff --git a/services/data-hub/src/minhash-poc/README.md b/services/data-hub/src/minhash-poc/README.md
index 142ecda8e..0b0e65f76 100644
--- a/services/data-hub/src/minhash-poc/README.md
+++ b/services/data-hub/src/minhash-poc/README.md
@@ -33,6 +33,6 @@ Then different data sets were checked for the existence of a common identity. Th
 
 The "minhashing" of small data sets basically works here and is suitable for establishing the same identities in different data sets.
 
-In contrast to the conventional application of these algorithms, the OIH's context tends to handle tiny-sized data sets. Each set usually contains a small amount of characters and therefore results more often in a meager set of shingles. In the end this can lead to accumulated false positives.
+In contrast to the conventional application of these algorithms, the OIH's context tends to handle tiny-sized data sets. Each set usually contains a small number of characters and therefore results more often in a meager set of shingles. In the end, this can lead to a higher amount of false positives.
 
 One possible conclusion is that automatic detection using the Minhash method should not be accepted without a further verification process. This should be done by algorithmic post-processing or plain human interaction on a suited merge interface.
\ No newline at end of file