ETL crashes, claiming duplicate key 'null' #6179

Closed
4 of 16 tasks
rspeer opened this issue May 21, 2016 · 2 comments

rspeer commented May 21, 2016

Expected behavior and actual behavior

I'm trying to import ConceptNet 5 into OrientDB despite my historical lack of luck with graph databases. I've distilled it down to what I think is its simplest form: a tab-separated CSV file of (predicate, start node, end node) triples. There's more I'd consider including, but making this work would be the first step.

The file (simple.csv) looks like this:

/r/Antonym      /c/en/a_little/r        /c/en/not_a_little
/r/Antonym      /c/en/abash/v   /c/en/reassure
/r/Antonym      /c/en/abash/v   /c/en/unabash
/r/Antonym      /c/en/abbreviate/v/wikt/en_1    /c/en/lengthen
/r/Antonym      /c/en/able/a    /c/en/unable

and so on.

Here's my ETL file (conceptnet-import.json), adapted from this StackOverflow question because none of the examples in your documentation take in simple lists of edges between a single type of node:

{
  "source": { "file": { "path": "/home/rspeer/conceptnet5/data/assertions/simple.csv" } },
  "extractor": {
      "csv": {
          "separator": "\t",
          "columnsOnFirstLine": false,
          "columns": ["rel:string", "start:string", "end:string"]
      }
  },
  "transformers": [
    {
        "merge": {
            "unresolvedLinkAction": "CREATE",
            "joinFieldName": "start",
            "lookup": "Term.uri"
        }
    },
    {
        "vertex": {
            "class": "Term"
        }
    },
    {
        "edge": {
            "unresolvedLinkAction": "CREATE",
            "class": "Assertion",
            "joinFieldName": "end",
            "lookup": "Term.uri"
        }
    },
  ],
  "loader": {
    "orientdb": {
       "dbURL": "plocal:/home/rspeer/conceptnet5/data/tmp/orient-conceptnet",
       "dbType": "graph",
       "wal": false,
       "batchCommit": 1000,
       "tx": false,
       "txUseLog": false,
       "classes": [
         {"name": "Term", "extends": "V"},
         {"name": "Assertion", "extends": "E"}
       ], "indexes": [
         {"class":"Term", "fields":["uri:string"], "type":"UNIQUE" }
       ]
    }
  }
}

ConceptNet has no information in its nodes (terms), only in its edges (assertions).
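
For reference, the intended shape of the graph can be sketched in plain Python, independent of OrientDB: every distinct URI in the start or end column becomes one Term keyed by its uri, and every row becomes one Assertion edge. The function name `build_graph` and the dict/list representation are illustrative assumptions, not part of OrientDB or its ETL module.

```python
import csv
import io


def build_graph(tsv_text):
    """Turn tab-separated (rel, start, end) rows into terms and assertions.

    Hypothetical helper for illustration only; it mirrors what the ETL
    pipeline is meant to produce, not how OrientDB implements it.
    """
    terms = {}        # uri -> vertex record, one per distinct URI
    assertions = []   # one (rel, start_uri, end_uri) edge per input row
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if not row:
            continue  # skip blank lines
        rel, start, end = row
        # Look up each endpoint by uri and create it only if it is missing.
        for uri in (start, end):
            terms.setdefault(uri, {"uri": uri})
        assertions.append((rel, start, end))
    return terms, assertions


sample = (
    "/r/Antonym\t/c/en/abash/v\t/c/en/reassure\n"
    "/r/Antonym\t/c/en/abash/v\t/c/en/unabash\n"
)
terms, assertions = build_graph(sample)
print(sorted(terms))
print(len(assertions))
```

The point of the sketch is that each Term carries a non-null `uri`, which is the property the UNIQUE index on `Term.uri` relies on.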

I would expect that running ETL on this file would get me a simple graph of ConceptNet. Instead, it crashes:

$ ./oetl.sh conceptnet-import.json
OrientDB etl v.2.2.0 (build develop@r79d281140b01c0bc3b566a46a64f1573cb359783; 2016-05-18 14:14:32+0000) www.orientdb.com
[csv] INFO column types: {rel=STRING, start=STRING, end=STRING}
BEGIN ETL PROCESSOR
[file] INFO Reading from file /home/rspeer/conceptnet5/data/assertions/simple.csv with encoding UTF-8
Started execution with 1 worker threads
Error in Pipeline execution: com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record Term{rel:/r/Antonym,start:/c/en/abash/v,end:/c/en/reassure}: found duplicated key 'null' in index 'Term.uri' previously assigned to the record #17:0
        Storage URL="plocal:/home/rspeer/conceptnet5/data/tmp/orient-conceptnet"INDEX=Term.uri RID=#17:0
+ extracted 503 rows (0 rows/sec) - 503 rows -> loaded 1 vertices (0 vertices/sec) Total time: 999ms [0 warnings, 1 errors]
+ extracted 503 rows (0 rows/sec) - 503 rows -> loaded 1 vertices (0 vertices/sec) Total time: 1999ms [0 warnings, 1 errors]
+ extracted 503 rows (0 rows/sec) - 503 rows -> loaded 1 vertices (0 vertices/sec) Total time: 2999ms [0 warnings, 1 errors]
+ extracted 503 rows (0 rows/sec) - 503 rows -> loaded 1 vertices (0 vertices/sec) Total time: 4s [0 warnings, 1 errors]
+ extracted 503 rows (0 rows/sec) - 503 rows -> loaded 1 vertices (0 vertices/sec) Total time: 5s [0 warnings, 1 errors]

Steps to reproduce the problem

Put the given tab-separated data in /home/rspeer/conceptnet5/data/assertions/simple.csv.

Save the above ETL file as conceptnet-import.json in the orientdb/bin directory (oetl.sh doesn't seem to find it unless it's in the same directory) and run:

./oetl.sh conceptnet-import.json

Important Questions

Running Mode

  • Embedded, using PLOCAL access mode
  • Embedded, using MEMORY access mode
  • Remote

Misc

  • I have a distributed setup with multiple servers. How many?
  • I'm using the Enterprise Edition

OrientDB Version

  • v2.0.x - Please specify last number:
  • v2.1.x - Please specify last number:
  • v2.2.x - Please specify last number: 0

Operating System

  • Linux
  • MacOSX
  • Windows
  • Other Unix
  • Other, name?

Java Version

  • 6
  • 7
  • 8

robfrank self-assigned this May 21, 2016
robfrank added this to the 2.2.x (next hotfix) milestone May 21, 2016
lvca closed this as completed May 21, 2016

lvca (Member) commented May 21, 2016

The message is clear: "found duplicated key 'null' in index 'Term.uri' previously assigned to the record #17:0". You aren't setting the uri field, so it's null, and you cannot have multiple null keys because it's a UNIQUE index.

Am I missing anything? If you don't want null keys to count toward uniqueness, you should create the index with {ignoreNulls:true}. Docs: http://orientdb.com/docs/last/Indexes.html#indexes-and-null-values
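
The failure mode described above can be reproduced with a toy unique index in Python. This is a simplified model written for illustration, not OrientDB's implementation: the class `UniqueIndex`, its `ignore_nulls` flag, the record layout, and the record id `#17:1` are all hypothetical.

```python
class DuplicateKeyError(Exception):
    """Raised when a key is already present in the unique index."""


class UniqueIndex:
    """Toy unique index over one field (illustrative, not OrientDB code)."""

    def __init__(self, field, ignore_nulls=False):
        self.field = field
        self.ignore_nulls = ignore_nulls
        self.entries = {}  # key -> record id

    def put(self, record_id, record):
        key = record.get(self.field)  # missing field -> None (null key)
        if key is None and self.ignore_nulls:
            return  # null keys are simply not indexed
        if key in self.entries:
            raise DuplicateKeyError(
                f"found duplicated key {key!r} previously assigned to {self.entries[key]}"
            )
        self.entries[key] = record_id


# Records shaped like the one in the error message: no "uri" field is set.
rows = [
    {"rel": "/r/Antonym", "start": "/c/en/a_little/r", "end": "/c/en/not_a_little"},
    {"rel": "/r/Antonym", "start": "/c/en/abash/v", "end": "/c/en/reassure"},
]

strict = UniqueIndex("uri")
strict.put("#17:0", rows[0])       # first null key is accepted
try:
    strict.put("#17:1", rows[1])   # second null key collides with the first
except DuplicateKeyError as exc:
    print(exc)
```

In this model, constructing the index with `ignore_nulls=True` makes both inserts succeed but leaves the records unindexed, whereas giving each record a distinct `uri` value before it reaches the index avoids the collision without relaxing uniqueness.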

lvca assigned lvca and unassigned robfrank May 21, 2016
lvca added the question label May 21, 2016

rspeer (Author) commented May 23, 2016

I intend to be setting the URI field. Of course I don't want it to be null. How should I set it?

If I've missed some documentation -- it seems strange that I would have to rely on StackOverflow answers for the case of loading a graph from a list of edges -- please point me to it.

robfrank modified the milestones: 2.2.x (next hotfix), 2.2.1 Jun 9, 2016