-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ETL: loading of records is somewhat slower than should be #8732
Comments
After adding index |
I will close this issue now. |
After initial spurt looked very encouraging for approximately 1 hour ( 1/2 million records loaded ) things went back to much slower pace. In the past hour only about 200000 records loaded, 2.5 times slower, than "fast" interval. So, it looks like as load advances forward into the CSV file something impacts lookup time. |
Oh... I am removing CC_FIPS field in the ETL script from City vertex. So, no index can be applied. Looks like after ETL completes, perhaps on Suday, I will need to do some debugging round and report back with some analysis, potentially resolution. |
Interesting. I had a throery, and dropped / created same index on Country.
|
The load completed about in approximately 8 hours, loading 2410476 rows to vertex City and edge CountryCity, both, after index introduction, prior to that in approximately 9 hours ETL process loaded 1/2 million rows, in all average speed without index 15 vertex + edge / second with index roughly 100 / second.
|
OrientDB Version: 3.0.12
Java Version:
bin>java -version
java version "9.0.4"
Java(TM) SE Runtime Environment (build 9.0.4+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.4+11, mixed mode)
OS: Microsoft Windows 10
Expected behavior
Loading at least 100 records / second
Actual behavior
loading approximately ( very rough ) 15 records / second
Steps to reproduce
Create database GEO as following:
CREATE DATABASE remote:localhost/GEO root password PLOCAL GRAPH
Add vertex Country, City and edge CountryCity as following:
CREATE CLASS EarthPoliticalMap EXTENDS V ABSTRACT;
CREATE CLASS Country EXTENDS EarthPoliticalMap;
CREATE PROPERTY Country.ID STRING;
CREATE PROPERTY Country.CC_FIPS STRING;
CREATE PROPERTY Country.CC_ISO STRING;
CREATE PROPERTY Country.TLD STRING;
CREATE PROPERTY Country.NAME STRING;
CREATE CLASS City EXTENDS EarthPoliticalMap;
CREATE PROPERTY City.ID STRING;
CREATE PROPERTY City.NAME STRING;
CREATE EDGE CountryCity EXTENDS E;
Exceute ETL, ETL configuration JSON file and file with data attached.
GEO-Extract-City.zip
GEODATASOURCE-CITIES-FREE.zip
Invoke ETL as following:
oetl.bat C:\dev\app\orientdb-3.0.12\code\DATABASE_GEO\GEO-Extract-City.json -sourceDirPath=C:\dev\app\orientdb-3.0.12\code<TOP_FOLDER_GEO_DATA>
****** Note
Vertex "Country" loaded in under a second. A different ETL, where data was sourced from same database GEO and for transform was used "command" to create edge(s) from raw data using "CREATE EDGE" and $input context variable, worked in some nanoseconds for 3 records. It looked so encouraging that I was thinking of some thousands records loaded per second, in other words, my initial expectation was something like a 15 min - 1 hour ( worst case ) load time.
Looking at how slow ETL is progressing, I started load around 9:30AM and it is nearly 7:00PM now on the West Coast of USA, and 500,000 recordss loaded, it might take 3 calendar days. Not something an production instance can tolerate, unless it really "has to" :),
So, it makes sense at least to assess the differences between this, apparently, slow execution and the fast one.
#1 of course volume of recods. I may need to warm up caches and tune WAL, which I think is good idea to use, for robustness sakes.
#2 between the fast and slow load the difference in terms of configuration is:
a. when creating edtes extract from OrientDB vs. CSV file ( attached )
b. when loading Country vertex no edges were loaded, and the process went into milliseconds for some 266 rows.
#3 somehow country CSV loaded without any issues, as far as file text goes, while City file requires to enforce the following configuration setting: ignoreEmptyLines
#4 features that were absent in fast ETL(s) are
a. field transform with action "remove"
b. edge transfrom with lookup
c. config setting "parallel": true ( was false )
d. the context switching between loading both, edge and vertex in a single step
One of the things I have explored when loading edge is creating edge off of outer matching set ( the large one ) vs. the innter set ( the small one ). And it did look, like even on 3 record strong load the difference was like 40 nanoseconds for row by row approach, iterating over outer set, vs 18 nanoseconds, iterating over small - inner set. So, I may revisit the strategy and load City as vertex first and than pair the values in 3rd table and than load edges.
Oh yes, before I forget, the load of edges where source was OrientDB database GEO was performed over set of vertex classes linked together using links, so, the edge creation strategy was cached in the table(s). Perhaps it's the lookup via link vs explicit lookup via un - indexed field CC_FIPS, that causes the trouble. Let me try to add index, as ETL is running, and see if it speeds things along.
The text was updated successfully, but these errors were encountered: