Update readme
heggert committed Feb 10, 2021
1 parent 78c30dc commit 3c202c7
Showing 2 changed files with 5 additions and 1 deletion.
4 changes: 4 additions & 0 deletions services/data-hub/README.md
@@ -42,6 +42,10 @@ VERSION=latest npm run build:docker

Kubernetes descriptors can be found in the [k8s](./k8s) directory.

+## Import & Merge
+
+Find more details [here](./src/minhash-poc/README.md)
+
## Environment variables

### General
2 changes: 1 addition & 1 deletion services/data-hub/src/minhash-poc/README.md
@@ -33,6 +33,6 @@ Then different data sets were checked for the existence of a common identity. Th

The "minhashing" of small data sets basically works here and is suitable for establishing the same identities in different data sets.

-In contrast to the conventional application of these algorithms, the OIH's context tends to handle tiny-sized data sets. Each set usually contains a small amount of characters and therefore results more often in a meager set of shingles. In the end this can lead to accumulated false positives.
+In contrast to the conventional application of these algorithms, the OIH's context tends to handle tiny-sized data sets. Each set usually contains a small number of characters and therefore results more often in a meager set of shingles. In the end, this can lead to a higher amount of false positives.

One possible conclusion is that automatic detection using the Minhash method should not be accepted without a further verification process. This should be done by algorithmic post-processing or plain human interaction on a suited merge interface.
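
The effect of meager shingle sets described above can be illustrated with a minimal sketch. It is not the implementation used in this proof of concept: the function names, the shingle length `k=3`, the 64 seeded SHA-1 hash functions, and the two sample names are all assumptions chosen for illustration. It shows how two short records belonging to different people can share almost all of their shingles, which is why a similarity threshold alone may report them as the same identity.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-shingles of a lower-cased string."""
    normalized = text.lower().strip()
    if len(normalized) < k:
        # Strings shorter than k yield a single shingle (or none if empty).
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=64):
    """Build a MinHash signature: the minimum hash per seeded hash function.

    Assumes `items` is non-empty. Seeding SHA-1 with an index is only one
    simple way to simulate a family of hash functions.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8],
                "big",
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical records for two different people.
a = shingles("Anna Meier")   # 8 shingles
b = shingles("Hanna Meier")  # 9 shingles, 8 of them shared with `a`

exact = jaccard(a, b)  # 8/9, roughly 0.89
estimate = estimated_similarity(minhash_signature(a), minhash_signature(b))
print(f"exact Jaccard: {exact:.2f}, MinHash estimate: {estimate:.2f}")
```

With only eight or nine shingles per record, a single extra character leaves the score near 0.89, so any threshold below that would merge the two records. This is the kind of case the verification step above is meant to catch.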
