The new implementation reads the schema and imports the data straight after upload.
It will need to be benchmarked, but even though reading from the filesystem is slow, it may well be quicker to read the schema by iterating over all the rows first and then only import the selected fields. Right now, importing 400k tweets takes about 40 minutes, and the subsequent delete query (pre-index; the indexed version is being tested now) takes an additional 20 minutes as a single query, and more than 60 minutes as individual queries.
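As a rough sketch of that two-pass idea (all names here are hypothetical helpers for illustration, not the actual nlp4all code): iterate over the uploaded file once to collect the nested key paths, then pull out only the fields the user selects.

```python
import json
from collections import Counter

def collect_paths(obj, prefix=""):
    """Recursively yield dotted paths for every nested key in a JSON object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from collect_paths(value, path)

def infer_schema(path):
    """First pass: stream the uploaded JSONL file and count how often each key path occurs."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            counts.update(collect_paths(json.loads(line)))
    return counts

def extract_fields(doc, selected):
    """Keep only the selected (possibly nested) paths from a document."""
    out = {}
    for path in selected:
        node = doc
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                break
            node = node[part]
        else:
            out[path] = node
    return out
```

A second pass would then stream the file again and insert only `extract_fields(doc, selected)` for each row, so the bulk of unused tweet metadata never touches the database.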
While we're at it, consider using MongoDB for document storage and joining on a unique key (or the document ID).
If we go this route (probably more performant), it will require hooks on init-db and drop-db, as well as when adding and deleting data sources.
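A minimal sketch of what that split could look like, assuming PyMongo and a relational row that keeps only the Mongo document ID (the database, collection, and function names are made up for illustration):

```python
from pymongo import MongoClient

# Hypothetical names; the real nlp4all database/collection may differ.
client = MongoClient("mongodb://localhost:27017")
docs = client["nlp4all"]["data_source_documents"]

def add_document(data_source_id: int, raw_doc: dict) -> str:
    """Store the raw document in MongoDB and return its ID.

    The relational side would keep only this ID (plus the data_source_id),
    so rows can be joined back to their full documents.
    """
    result = docs.insert_one({"data_source_id": data_source_id, "doc": raw_doc})
    return str(result.inserted_id)

def on_delete_data_source(data_source_id: int) -> None:
    """Hook for data-source deletion: remove the matching documents too."""
    docs.delete_many({"data_source_id": data_source_id})

def on_drop_db() -> None:
    """Hook for drop-db: remove the whole collection."""
    docs.drop()
```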
Update
The version with a `gin` index on the `document` column actually takes longer for both import and property deletion. This makes sense: it has to update more information at each step, and the indexing probably doesn't extend to such deep nesting (it could, if the structure were consistent). Seems like MongoDB may be the way to go.
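For reference, this is roughly the kind of setup that was tested (a hedged SQLAlchemy sketch, not the actual nlp4all model; the table and column names are assumptions):

```python
from sqlalchemy import Column, Integer, Index
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TweetDocument(Base):
    """Hypothetical model: one JSONB column holding the full raw tweet."""
    __tablename__ = "tweet_document"

    id = Column(Integer, primary_key=True)
    document = Column(JSONB, nullable=False)

    # GIN index over the whole JSONB column. Every insert and every key
    # deletion has to maintain this index as well, which is consistent with
    # the indexed version being slower for both import and property deletion.
    __table_args__ = (
        Index("ix_tweet_document_gin", "document", postgresql_using="gin"),
    )
```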
UPDATE 2
MongoDB has now been implemented, which brings the import down to about 3 minutes. Key deletion still takes around 8 minutes, which leaves one remaining task: process the schema BEFORE import, and only import the required keys. The whole process will then be much quicker, probably totalling about the same 3 minutes.
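Tying the pieces together, the planned flow would be something like the following (again a sketch: `extract_fields` is the hypothetical helper from the first snippet, `collection` a PyMongo collection, and the batch size is arbitrary):

```python
import json

BATCH_SIZE = 5000  # arbitrary; tune against the 400k-tweet benchmark

def import_selected(path, data_source_id, selected_keys, collection):
    """Second pass: stream the file again and bulk-insert only the selected keys."""
    batch = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            doc = extract_fields(json.loads(line), selected_keys)
            batch.append({"data_source_id": data_source_id, "doc": doc})
            if len(batch) >= BATCH_SIZE:
                collection.insert_many(batch)
                batch = []
    if batch:
        collection.insert_many(batch)
```

Because the unwanted keys never make it into MongoDB, the 8-minute key-deletion step disappears entirely.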
See also: https://github.com/NLP4ALL/nlp4all/wiki/Performance