Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETL: loading of records is somewhat slower than should be #8732

Closed
MironAtHome opened this issue Jan 19, 2019 · 6 comments
Closed

ETL: loading of records is somewhat slower than should be #8732

MironAtHome opened this issue Jan 19, 2019 · 6 comments

Comments

@MironAtHome
Copy link

OrientDB Version: 3.0.12

Java Version:

bin>java -version
java version "9.0.4"
Java(TM) SE Runtime Environment (build 9.0.4+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.4+11, mixed mode)

OS: Microsoft Windows 10

Expected behavior

Loading at least 100 records / second

Actual behavior

loading approximately ( very rough ) 15 records / second

Steps to reproduce

Create database GEO as following:

CREATE DATABASE remote:localhost/GEO root password PLOCAL GRAPH

Add vertex Country, City and edge CountryCity as following:
CREATE CLASS EarthPoliticalMap EXTENDS V ABSTRACT;

CREATE CLASS Country EXTENDS EarthPoliticalMap;
CREATE PROPERTY Country.ID STRING;
CREATE PROPERTY Country.CC_FIPS STRING;
CREATE PROPERTY Country.CC_ISO STRING;
CREATE PROPERTY Country.TLD STRING;
CREATE PROPERTY Country.NAME STRING;

CREATE CLASS City EXTENDS EarthPoliticalMap;
CREATE PROPERTY City.ID STRING;
CREATE PROPERTY City.NAME STRING;

CREATE EDGE CountryCity EXTENDS E;

Exceute ETL, ETL configuration JSON file and file with data attached.
GEO-Extract-City.zip
GEODATASOURCE-CITIES-FREE.zip
Invoke ETL as following:
oetl.bat C:\dev\app\orientdb-3.0.12\code\DATABASE_GEO\GEO-Extract-City.json -sourceDirPath=C:\dev\app\orientdb-3.0.12\code<TOP_FOLDER_GEO_DATA>


****** Note


Vertex "Country" loaded in under a second. A different ETL, where data was sourced from same database GEO and for transform was used "command" to create edge(s) from raw data using "CREATE EDGE" and $input context variable, worked in some nanoseconds for 3 records. It looked so encouraging that I was thinking of some thousands records loaded per second, in other words, my initial expectation was something like a 15 min - 1 hour ( worst case ) load time.

Looking at how slow ETL is progressing, I started load around 9:30AM and it is nearly 7:00PM now on the West Coast of USA, and 500,000 recordss loaded, it might take 3 calendar days. Not something an production instance can tolerate, unless it really "has to" :),
So, it makes sense at least to assess the differences between this, apparently, slow execution and the fast one.
#1 of course volume of recods. I may need to warm up caches and tune WAL, which I think is good idea to use, for robustness sakes.
#2 between the fast and slow load the difference in terms of configuration is:
a. when creating edtes extract from OrientDB vs. CSV file ( attached )
b. when loading Country vertex no edges were loaded, and the process went into milliseconds for some 266 rows.
#3 somehow country CSV loaded without any issues, as far as file text goes, while City file requires to enforce the following configuration setting: ignoreEmptyLines
#4 features that were absent in fast ETL(s) are
a. field transform with action "remove"
b. edge transfrom with lookup
c. config setting "parallel": true ( was false )
d. the context switching between loading both, edge and vertex in a single step


One of the things I have explored when loading edge is creating edge off of outer matching set ( the large one ) vs. the innter set ( the small one ). And it did look, like even on 3 record strong load the difference was like 40 nanoseconds for row by row approach, iterating over outer set, vs 18 nanoseconds, iterating over small - inner set. So, I may revisit the strategy and load City as vertex first and than pair the values in 3rd table and than load edges.
Oh yes, before I forget, the load of edges where source was OrientDB database GEO was performed over set of vertex classes linked together using links, so, the edge creation strategy was cached in the table(s). Perhaps it's the lookup via link vs explicit lookup via un - indexed field CC_FIPS, that causes the trouble. Let me try to add index, as ETL is running, and see if it speeds things along.

@MironAtHome
Copy link
Author

After adding index
CREATE INDEX IX_Country_CC_FIPS ON Country (CC_FIPS) NOTUNIQUE_HASH_INDEX;
the load speed changed drammatically.
If previous 1/2 million records loaded in 9 hours, the next 1/2 million records loaded in 1/2 hour.

@MironAtHome
Copy link
Author

I will close this issue now.

@MironAtHome
Copy link
Author

After initial spurt looked very encouraging for approximately 1 hour ( 1/2 million records loaded ) things went back to much slower pace. In the past hour only about 200000 records loaded, 2.5 times slower, than "fast" interval. So, it looks like as load advances forward into the CSV file something impacts lookup time.
Perhaps an index on the FK used for lookup field during edge transform creation can help. Will try it now and report back if any speed improvements noted.

@MironAtHome
Copy link
Author

Oh... I am removing CC_FIPS field in the ETL script from City vertex. So, no index can be applied. Looks like after ETL completes, perhaps on Suday, I will need to do some debugging round and report back with some analysis, potentially resolution.

@MironAtHome
Copy link
Author

Interesting. I had a throery, and dropped / created same index on Country.
I can't say with certainty this made the difference, but in the past 3 hours the following changes can be found:

  1. a million of records was loaded ( good )
  2. 1/2 a Gigabyte memory RAM was taken by java.exe process, that was prior at 2GB.
    In other words, each of the spurts of OrientDB record ingestion can be mapped to resource utilization increase. And I could be wrong, but from my side it's not a linear or "as you go" kind of increase. It's more like speed increase = more resource grabbed.
    Which might point to garbage collection or some other resource reuse / release or might it be, leak of sorts, which in java.exe ( and other machines architected similarly to Java as an OS architecture within OS ) is good at sandboxing its own globally owned resources, but inside of machine the resources get pinned, lost, latched and locked in a way that requires, essentially, stopping activity ( threads ) and explicit resource release.
    At any rate, this resource usage was welcome and its translting to faster execution was precisely what I was looking for. The only question is do I really have to drop / create index again now. Will monitor and confirm.

@MironAtHome
Copy link
Author

The load completed about in approximately 8 hours, loading 2410476 rows to vertex City and edge CountryCity, both, after index introduction, prior to that in approximately 9 hours ETL process loaded 1/2 million rows, in all average speed without index 15 vertex + edge / second with index roughly 100 / second.
I'd say this is slower than it should be. Perhaps loading flat file into vertex first and then building edges by materializing links, faster approach. Will report back, in case of significant speed up.
In addition, the following distribution of load quality observed:
SELECT COUNT() FROM City WHERE out_CountryCity.size() = 1
2706590
SELECT COUNT(
) FROM City WHERE out_CountryCity.size() = 0
6612
SELECT COUNT() FROM City WHERE out_CountryCity.size() = 2
195295
SELECT COUNT(
) FROM City WHERE out_CountryCity.size() = 3
3325
Expected value for edge count for each City is 1. Wonder what is going on.
So, to sum it up, the following directions need to be verified

  1. how to load large files faster
  2. how to ensure correctness of loaded data
  3. from previous issue logged here, extend UI to handle "remote" engine connection for loader
    Will be updating with progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants