Use of OFFSET very inefficient with large Postgres DB #18

Open
jeroenvandijk opened this issue Jul 25, 2012 · 1 comment

@jeroenvandijk

When using maple to import a 40 GB+ Postgres database, I noticed that queries became too slow and the complete Hadoop job failed because of the use of OFFSET.

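For reference, the split queries are built with OFFSET-based pagination, roughly like this (a sketch of the general pattern, not maple's exact code); Postgres still has to read and discard every skipped row, so each later split pays for scanning everything that came before it:

            // OFFSET-based pagination (sketch): split N skips split.getStart() rows,
            // which Postgres must fetch and throw away before returning anything.
            query.append(" LIMIT ").append(split.getLength());
            query.append(" OFFSET ").append(split.getStart());
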
After changing this line to this:

            // HARDCODING PRIMARY KEY: paginate by key range instead of OFFSET,
            // so Postgres can walk the index on id rather than scanning and
            // discarding the skipped rows.
            query.append(" WHERE id >= ").append(split.getStart());
            query.append(" LIMIT ").append(split.getLength());

With this change the query time no longer grows with the offset and stays roughly constant. The above is not a generic solution (e.g. your primary key might not be id). Do you have suggestions for handling this situation? I'm also not sure how other JDBC databases handle OFFSET.

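One way to make this more generic might be to pass the key column in rather than hardcoding it. A minimal sketch, assuming the key column is indexed and numeric; the buildSplitQuery helper and its parameters are hypothetical, not part of maple:

            // Hypothetical sketch: keyset pagination on a caller-supplied key column
            // instead of OFFSET. Assumes split boundaries are key values (so the
            // splitter would have to compute the key's min/max), not row offsets.
            static String buildSplitQuery(String baseQuery, String keyColumn,
                                          long splitStart, long splitLength) {
              StringBuilder query = new StringBuilder(baseQuery);
              query.append(" WHERE ").append(keyColumn).append(" >= ").append(splitStart);
              query.append(" ORDER BY ").append(keyColumn);
              query.append(" LIMIT ").append(splitLength);
              return query.toString();
            }
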
Has this library been used on large Postgres DBs before? I would like to gain some insight into best practices. Even with the above optimization my import time is around 3 hours.

Thanks for your work on maple.

Cheers,
Jeroen

@azymnis (Contributor) commented Jul 25, 2012

Hi Jeroen,

We have only used this Tap on relatively small MySQL tables, so no, this has not been tested on a large dataset coming from a DB.

Keep in mind that there is a version of JDBCScheme where the primary key is defined. You can use that instead, rather than hardcoding the primary key.

Argyris
