-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use of sqlite database on Lustre/NFS filesystems #39
Comments
No ideas yet, but will look into this. In principle database access is abstracted. csv2db uploads data for mysql, postgres and sqlite, while CGATReport uses sqlalchemy to connect to mysql, sqlite and postgres. However, there are many instances of within pipeline database access that are sqlite specific and will need refactoring. Can you try setting: [general] I expect that the problem is due to concurrent write/write or read/write access. Fortunately, our upload tasks are usually distinct from tasks that read the database. So let us try in that order:
|
Hi Andreas, So setting the jobs_limit doesn't help, at least on the Lustre system. Although I find it difficult to confirm that the jobs_limit is doing what it says it is, running a csv2db.py command on the commandline gives the same error. However, the command will run without error on an NFS location. I read somewhere that sqlite confirms its writes by querying something about the low-level disk state, which can't be done on distributed file systems, I don't know how IFS gets around this. One solution might be to store the database file on a seperate NFS filespace. This would still require restricting db jobs to 1 as concurrent NFS access attempts are unsafe in sqlite, but it might at least run. However, Pipeline.load at the moment assumes that you want to write to ./csvdb I'm going to try to modify this and give it a go. Not a longer solution at the moment because I only get 50GB NFS storage (unlimited lustre), but the alternative solution would not only require re-factoring the pipelines, but also buying/renting a machine to run as a db server. |
Thanks, I see. Are there any databases installed on your systems? What do people use for data storage other than the file system? |
Is there any additional error/output log file from sqlite that you can look at? (apart from to Python's traceback) That would help to narrow down the problem. Other interesting reading: NB: We use ifs through NFSv3. |
It seems to be something that ruffus struggles with as well: https://code.google.com/p/ruffus/issues/detail?id=59 From reading around, it seems that in principle in should not be a problem in lustre. The locking mechanisms that sqlite requires are there, but might be not behaving in the way that sqlite expects. |
There is this thread:
Not sure if that is an option. |
I can ask the HPC people here... |
Running the databases in a seperate localtion seems to be working so far. Don't know how well it will work if the databases start getting too large to live on my 100Gb NFS share... |
I asked about mounting with -o flock: apparently if slows the entire system down by a large amount. |
Thanks. I have implemented the fundamentals for other database access - mysql should work, but the code will still need to develop. |
Hi Andreas, At the moment things seem to be doing okay with sperating the database off in a serpate location. Not ideal, but working for the moment. May move to mysql in the future, but currently I don't have any machines I can run one on . |
Ok, thanks - I will close this issue for now to be reopened. |
I recently started trying to use the pipelines in earnest on our HPC here, and have immediately run into a pretty serious problem: I would appear that one cannot use sqlite databases on a lustre filesystem (which is the file system we use here for our high performance storage). I've not yet fully determined whether this problem is caused by any access or only attempts at concurrent access, but all my pipelines invariably fail with the following error:
My reading around the net suggests that sqlite won't really work with NFS filesystems either; although I've definately managed to get pipeline_annotations to run on an NFS location at least once (although it didn't neccessarily run all the way through without error). This seems to rule out the file systems that most people would want to use for this sort of thing.
Are there alternatives to to sqlite? I saw talk of making things work with MySQL, but I don't know how far along that is? (Plus I'd have to find a machine that could run a MySql server).
Ideas anyone?
The text was updated successfully, but these errors were encountered: