[4 curation] query script needed to detect and handle duplicated entries #142

Open
czhci opened this issue Mar 3, 2020 · 2 comments
Labels
curation enhancement New feature or request help wanted Extra attention is needed

Comments

@czhci
Member

czhci commented Mar 3, 2020

examples:
http://biii.eu/active-cells-3d
http://biii.eu/active-cells-3d-0

It seems that when the same software/training/dataset/etc. is entered more than once, the duplicates are likely to be given the same title, such as "Active Cells 3D", and the generated node URLs then differ only by an appended hyphen and number.

We should have a query script that returns a list of such suspicious candidates so that curators can verify and correct them.
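
As a rough illustration of such a query, the sketch below flags URL slugs that end in a hyphen plus a number when the slug without that suffix also exists. It assumes the slugs have already been exported into a Python list; the function name `find_suffixed_duplicates` and the sample slugs other than the two from the example above are hypothetical.

```python
import re
from collections import defaultdict

# Matches a trailing "-<digits>" suffix, e.g. "active-cells-3d-0" -> base "active-cells-3d".
SUFFIX = re.compile(r"^(?P<base>.+)-(?P<n>\d+)$")


def find_suffixed_duplicates(slugs):
    """Group slugs that look like auto-numbered copies of another existing slug."""
    known = set(slugs)
    groups = defaultdict(list)
    for slug in slugs:
        match = SUFFIX.match(slug)
        # Only flag the slug if its base form is itself an existing entry.
        if match and match.group("base") in known:
            groups[match.group("base")].append(slug)
    return {base: sorted(copies) for base, copies in groups.items()}


# Hypothetical input: the first two slugs come from the example URLs above.
slugs = ["active-cells-3d", "active-cells-3d-0", "deconvolutionlab2", "cellprofiler"]
candidates = find_suffixed_duplicates(slugs)
print(candidates)
```

A legitimate series such as a "tutorial-1" entry next to a "tutorial" entry would also be flagged, so the result is only a candidate list for curators.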

@czhci czhci added curation enhancement New feature or request labels Mar 3, 2020
@albangaignard
Member

Here is a short Python snippet to compute a syntactic distance between names.

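import jellyfish

# `names` is assumed to be the list of entry titles exported from the site.
distances = []
for n1 in names:
    for n2 in names:
        if n1 != n2:
            # Alternative metric: edit distance (lower means closer).
            # distance = {'distance': jellyfish.levenshtein_distance(n1, n2), 'n1': n1, 'n2': n2}
            # Jaro-Winkler is a similarity in [0, 1] (higher means closer).
            # Older jellyfish releases expose it as jellyfish.jaro_winkler.
            distance = {'distance': jellyfish.jaro_winkler_similarity(n1, n2), 'n1': n1, 'n2': n2}
            distances.append(distance)

# Most similar pairs first; each pair appears twice, once per ordering.
sorted_distances = sorted(distances, key=lambda x: x['distance'], reverse=True)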

Here is the output it can produce:

[{'distance': 0.9925925925925926,
  'n1': 'Neuron Tracing Vaa3D (MOST)',
  'n2': 'Neuron Tracing Vaa3D (MST)'},
 {'distance': 0.9925925925925926,
  'n1': 'Neuron Tracing Vaa3D (MST)',
  'n2': 'Neuron Tracing Vaa3D (MOST)'},
 {'distance': 0.9882352941176471,
  'n1': 'DeconvolutionLab2',
  'n2': 'DeconvolutionLab'},
 {'distance': 0.9882352941176471,
  'n1': 'DeconvolutionLab',
  'n2': 'DeconvolutionLab2'},
 {'distance': 0.9857142857142858,
  'n1': 'ROI image process tutorial 2',
  'n2': 'ROI image process tutorial 1'},
 {'distance': 0.9857142857142858,
  'n1': 'ROI image process tutorial 1',
  'n2': 'ROI image process tutorial 2'},
 {'distance': 0.9826086956521739,
  'n1': 'Math operations (Icy)',
  'n2': 'Math operations++ (Icy)'},
 {'distance': 0.9826086956521739,
  'n1': 'Math operations++ (Icy)',
  'n2': 'Math operations (Icy)'},
 {'distance': 0.9818181818181818,
  'n1': 'Icy Overlay tutorial 1',
  'n2': 'Icy Overlay tutorial 2'},
 {'distance': 0.9818181818181818, 'n1': 'ImageJFIJI', 'n2': 'ImageJ/FIJI'},
 {'distance': 0.9818181818181818, 'n1': 'ImageJ/FIJI', 'n2': 'ImageJFIJI'},
 {'distance': 0.9818181818181818,
  'n1': 'Icy Overlay tutorial 2',
  'n2': 'Icy Overlay tutorial 1'},
 {'distance': 0.9806451612903226,
  'n1': 'Microscope Live 3D (deprecated)',
  'n2': 'Microscope Live (deprecated)'},
 {'distance': 0.9806451612903226,
  'n1': 'Microscope Live (deprecated)',
  'n2': 'Microscope Live 3D (deprecated)'},
 {'distance': 0.98, 'n1': 'CellTrack', 'n2': 'CellTrack '},
 {'distance': 0.98, 'n1': 'CellTrack ', 'n2': 'CellTrack'},
 {'distance': 0.9798757763975156,
  'n1': 'CSBDeep, a toolbox for Content-aware Image Restoration (CARE) in Fiji',
  'n2': 'CSBDeep, a toolbox for Content-aware Image Restoration (CARE) in Knime'},
 {'distance': 0.9798757763975156,
  'n1': 'CSBDeep, a toolbox for Content-aware Image Restoration (CARE) in Knime',
  'n2': 'CSBDeep, a toolbox for Content-aware Image Restoration (CARE) in Fiji'},
 {'distance': 0.9764705882352941,
  'n1': 'EBImage Transpose',
  'n2': 'EBImage transpose'},
 {'distance': 0.9764705882352941,
  'n1': 'EBImage transpose',
  'n2': 'EBImage Transpose'},
...
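
Several of the pairs above ('CellTrack' / 'CellTrack ', 'EBImage Transpose' / 'EBImage transpose', 'ImageJFIJI' / 'ImageJ/FIJI') differ only in case, whitespace or punctuation. A cheap complementary check, sketched here with hypothetical helper names, is to group names that become identical after a simple normalization. The normalization rule (lowercase, keep only ASCII letters and digits) is an assumption and would need adjusting for non-ASCII titles.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def groups_equal_after_normalization(names):
    """Return groups of distinct names that share the same normalized form."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Names taken from the output above.
names = [
    "CellTrack", "CellTrack ",
    "EBImage Transpose", "EBImage transpose",
    "ImageJFIJI", "ImageJ/FIJI",
    "DeconvolutionLab", "DeconvolutionLab2",
]
duplicate_groups = groups_equal_after_normalization(names)
for group in duplicate_groups:
    print(group)
```

Dropping punctuation can also merge names that are meant to be distinct, for example 'Math operations (Icy)' and 'Math operations++ (Icy)', so these groups still need a curator's eye.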

@PerrineGilloteaux PerrineGilloteaux added the help wanted Extra attention is needed label May 15, 2023
@czhci
Member Author

czhci commented May 17, 2023

The entries in @albangaignard's output have now been curated.

The script by @albangaignard should be scheduled to run regularly.
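
For a recurring run, it may help to report each pair only once and only above some cut-off, so that the list stays short enough to review. The sketch below is one possible shape for that, not the script from this thread: it swaps in `difflib.SequenceMatcher` from the standard library in place of jellyfish so that it has no dependencies, and the function name `similar_pairs` and the 0.9 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Return (score, n1, n2) for each unordered pair scoring at or above threshold."""
    pairs = []
    # combinations() yields each unordered pair once, so (a, b) and (b, a)
    # are not both reported.
    for n1, n2 in combinations(sorted(set(names)), 2):
        # ratio() is a similarity in [0, 1]; it is not the Jaro-Winkler score.
        score = SequenceMatcher(None, n1, n2).ratio()
        if score >= threshold:
            pairs.append((score, n1, n2))
    # Highest similarity first.
    return sorted(pairs, reverse=True)


# Sample names; the first two appear in the output earlier in this thread.
report = similar_pairs(["DeconvolutionLab", "DeconvolutionLab2", "CellTrack", "Icy"])
for score, n1, n2 in report:
    print(f"{score:.3f}  {n1!r}  {n2!r}")
```

Pairs that curators have already confirmed as distinct entries could be kept in an allow-list and filtered out before printing, so that each scheduled run only surfaces new candidates.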

Projects
Status: Todo

3 participants