Skip to content
This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

Support COPY for CSV files #1371

Merged
merged 42 commits into from
Jun 8, 2018
Merged

Support COPY for CSV files #1371

merged 42 commits into from
Jun 8, 2018

Conversation

pmenon
Copy link
Member

@pmenon pmenon commented May 17, 2018

Summary

This PR adds support for psql's COPY command for bulk loading CSV files into the database. Using COPY, one can quickly load millions of rows into a table in a few seconds. For example, I was able to load 20M rows into a table with four integer columns in under three seconds; I also loaded an SF-1 lineitem table from TPC-H in about 20 seconds (which isn't great, but faster than through oltpbench). I find it very convenient to use this to quickly load a crap-tonne of data into the database fast to do benchmarks.

Right now, we only support CSV files, but the quoting, escaping, and delimiter characters can be configured. I tried to make the parser fairly robust to erroneous files, but we're not as generous as Postgres (which goes to great lengths to try to understand your CSV).

Modifications

  1. Changes in the parser to look for format, delimiter, escape, and quote characters.
  2. Added logical and physical ExternalFileScan and ExportExternalFile operators to optimizer for copy-from and copy-to.
  3. Added CSVScanPlan and ExportExternalFilePlan to planner.
  4. Added CSVScanTranslator to codegen. Also added runtime helper CSVScanner that accepts a callback function to invoke per-row in the CSV.
  5. Added integer, decimal, string, and date parsing logic into the functions namespace.
  6. Lots of tests.

Reviewers

  • @chenboy Can you take a look at the changes to the optimizer? I didn't add any costs (though one probably could use the schema and file size to estimate the number of rows).
  • @tcm-marcel Want to take a stab at this?

@coveralls
Copy link

coveralls commented May 17, 2018

Coverage Status

Coverage decreased (-0.3%) to 77.066% when pulling ec280f9 on pmenon:csv into 881a8e6 on cmu-db:master.

@pmenon pmenon force-pushed the csv branch 2 times, most recently from 29f1d1a to 15b6f8d Compare May 17, 2018 16:30
@pmenon pmenon requested review from chenboy and tcm-marcel May 17, 2018 22:52
@pmenon pmenon force-pushed the csv branch 8 times, most recently from 379f3c7 to 2c5b5dc Compare May 22, 2018 20:59
@pmenon pmenon force-pushed the csv branch 2 times, most recently from 103c2d3 to 63bcbe4 Compare May 26, 2018 12:39
apavlo
apavlo previously requested changes May 29, 2018
Copy link
Member

@apavlo apavlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot here to review. We still need @tcm-marcel to take a look.

std::string ret;
for (uint32_t i = 0; i < tuple_.size(); i++) {
if (i != 0) ret.append(",");
ret.append(tuple_[i].ToString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that we are not escaping commas. Is there a CSV library that we could use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a utility function to pretty print tuples for debugging purposes. It isn't used in execution.

auto *printf_fn = LookupBuiltin("printf");
if (printf_fn == nullptr) {
#if GCC_AT_LEAST_6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to explain what this does?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a short comment explaining this.

TL;DR; GCC 6+ complains when it doesn't see you using all function attributes (i.e., nothrow, throw, notnull etc.). We use decltype to get the signature for some functions, so we don't need the attributes. This fails compilation. We ignore this warning for this section of code.

static constexpr char kMemcmpFnName[] = "memcmp";
auto *memcmp_fn = LookupBuiltin(kMemcmpFnName);
if (memcmp_fn == nullptr) {
#if GCC_AT_LEAST_6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, please explain why we need this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a short comment explaining this.

TL;DR; GCC 6+ complains when it doesn't see you using all function attributes (i.e., nothrow, throw, notnull etc.). We use decltype to get the signature for some functions, so we don't need the attributes. This fails compilation. We ignore this warning for this section of code.

codegen.Const8(scan.GetEscapeChar())});
}

namespace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an anonymous namespace. I use it so I can create classes in .cpp files without worrying about name collisions.


Type::Type(const SqlType &sql_type, bool _nullable)
: Type(sql_type.TypeId(), _nullable) {}

bool Type::operator==(const Type &other) const {
// TODO(pmenon): This isn't correct; we need to check all other fields ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be an easy fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this, but it's a little more involved because of some assumptions we make in the code. We can probably address this in a separate PR.

Copy link
Contributor

@tcm-marcel tcm-marcel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first part of my review. I am not done yet.

@@ -22,6 +22,7 @@ namespace test {
class RpcQueryPlanTests : public PelotonTest {};

TEST_F(RpcQueryPlanTests, BasicTest) {
#if 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code not needed anymore?

};

/**
* Common base class for all codegen tests. This class four test tables that all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"This class four test" verb missing


TEST_F(DecimalFunctionsTests, SqrtTest) {
TEST_F(NumericFunctionsTest, SqrtTest) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our naming conventions for test suite classes actually say that they should end with the suffix Tests. [1] Catching this up, because #1257 will create a static check for this.

[1] https://github.com/cmu-db/peloton/blob/master/test/README.md

llvm::Constant *ConstString(const std::string &s) const;
llvm::Value *ConstString(const std::string &str_val,
const std::string &name) const;
llvm::Value *ConstType(const type::Type &type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment that describes the functionality of ConstType? It is not obvious to me.

codegen.Const8(scan.GetEscapeChar())});
}

namespace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats is that for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an anonymous namespace. I use it so I can create classes in .cpp files without worrying about name collisions.

pmenon added 2 commits June 6, 2018 16:14
… during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
Copy link
Contributor

@tcm-marcel tcm-marcel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I saw most of the code. Some part in the optimizer is missing, but I don't understand that part. The code works great, I used it several times already!

Out of curiosity: What is the reason/Is there a reaon you decided to read chunks into a buffer instead of using mmap?

column_accessors.emplace_back(output_attributes_[i], cols,
scan.GetNullString(), null_str);
}
for (uint32_t i = 0; i < output_attributes_.size(); i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this loop be merged with the one before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be two loops because we build up a vector in the first loop and insert pointers to vector elements in the second loop. If it's in the same loop, the pointers may be invalid due to resizing.

We could reserve space beforehand and use one loop. We use this pattern in other places.

char delimiter, char quote, char escape) {
// Forward to constructor
new (&scanner)
CSVScanner(*executor_context.GetPool(), file_path, col_types, num_cols,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarification: Is the CSVScanner initialized at runtime and not during code generation to make the plan reusable for different paths etc.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, the CSVScanner is created and initialized at runtime per-invocation.

In fact, all the classes in the util directory are runtime classes. We should rename it to peloton::codegen::runtime to make this clear.

last_was_escape = false;
}

// Process the new-line character. If we a new-line and we're not currently
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix.

}
}

// Flush remaining valid bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this part, can you explain briefly? (When does this occur?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, all bytes between buffer_pos_ and curr_buffer_pos are valid bytes that should be part of the current line. It is possible that we exit the loop not having written all valid bytes to the line buffer. This section ensures we flush these valid bytes before retuning.

break;
}

*out++ = c;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly, that this line basically writes the current character to the same position in the same buffer - except for the offset that was created by quotes and escapes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, we're writing non-quote characters (unless they're escaped).

// finish
if (c == delimiter || c == '\0') {
col_end = out;
iter--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is iter decremented here? The next column loop iteration will read the delimiter or null-byte again, will it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The invariant we want to hold is that when this loop exists, iter either points to a newline character or a delimiter. We decrement here because we incremented earlier on line 300. Happy to take suggestions on how to make this clearer. CSV parsing is tricky 😓

llvm::Value *cols = codegen->CreateLoad(codegen->CreateConstInBoundsGEP2_32(
CSVScannerProxy::GetType(codegen), LoadStatePtr(scanner_id_), 0, 1));

llvm::Value *null_str = codegen.ConstString(scan.GetNullString(), "null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance that not all types have "null" as null-string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second null argument is the name of the constant variable, not the value we're comparing with. A null string exists in the plan node.

}

// Eat the rest
while (consumed_ptr < end && *consumed_ptr == ' ') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean, we basically only allow spaces after the actual parsed decimal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow spaces both before and after actually. std::strtod handles spaces beforehand. We handle trailing spaces.

@pmenon
Copy link
Member Author

pmenon commented Jun 7, 2018

@tcm-marcel I didn't use mmap mostly because it wouldn't help performance. This is a giant sequential read so IO prefetching will kick in. If not, we can fadvise a hint. We're doing read() which bypasses the page cache, avoiding duplicated data.

There is an argument in software engineering simplicity using mmap, but the interaction would be similar - instead of copying from a temporary buffer managed by us to a line buffer, we'd be copying from a temporary kernel mmap buffer into our line buffer.

@tcm-marcel
Copy link
Contributor

@pmenon Thank you for the explanation! I didn't know read() bypasses the page cache.

@pmenon
Copy link
Member Author

pmenon commented Jun 7, 2018

@tcm-marcel I misspoke. read() will use the page cache unless we use direct IO - we don't do this. I was thinking of fread() that does library-level buffering that we avoid.

@pmenon pmenon removed the request for review from chenboy June 7, 2018 16:49
@apavlo apavlo merged commit 078a438 into cmu-db:master Jun 8, 2018
mbutrovich pushed a commit to mbutrovich/peloton that referenced this pull request Jun 11, 2018
* Add more information to CopyStatement. Cleaned up includes.

* Add CSVScan node to ToString and FromString. Removed BINARY external format for now.

* Move COPY from DDL to DML processing. COPY now goes through planner/optimization.

* Propagate external file information

* Removed unused serialization stuff from plan nodes

* Codegen can now have constant generic/opaque bytes in module

* When no columns specified during copy, all columns are inserted

* Added function to throw expception with ill-formatted input string when converting to number

* Removed serialization

* Added input functions in prepartion to read table data from files

* All SQL types must now provide an input function to convert a string into a SQL type

* Added test for value integrity

* First take at CSV Scan translator

* Fix after rebase

* file api

* CSV scanner reads lines

* Process CSV line in scanner

* Free memory when re-allocating line buffer

* Added memcmp to codegen interface. Renamed CallPrintf() to Printf().

* Cleaned up CSV scan translator. Added null checking.

* Moved TupleRuntime::CreateVarlen() into ValuesRuntime::WriteVarlen(). Better code organization and clearer name.

* Added error handling for long columns. Added null-terminator byte for when read-buffers are copied to line-buffers.

* Added inputs for decimal types

* Moved type-specific functions into function namespace

* REALLY simple Date support

* Compile fixes for GCC 6+

* Get string inputs working

* Beefed up tests

* Simple CSV scan test

* Updated optimize to continue support for old/weird/strange AF copy executor

* Extracted implementation into CPP file for plan node

* * Propagatge file options through optimization.
* Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug.
* Beefed up overflow checks in NumericRuntime.
* Fixed tests.

* Fixes after rebase

* Simple function to convert tuple to string CSV

* Fix void* -> i8* conversion

* More tests

* Address reviews

* Revert "Removed serialization"

This reverts commit d055ff9.

* Revert "Removed unused serialization stuff from plan nodes"

This reverts commit 74427c7.

* Beefed up tests, which caught more bugs

* Fix tests

* Reducing copying overhead for columns, constraints and loop variables during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
mbutrovich pushed a commit to mbutrovich/peloton that referenced this pull request Jun 11, 2018
* Add more information to CopyStatement. Cleaned up includes.

* Add CSVScan node to ToString and FromString. Removed BINARY external format for now.

* Move COPY from DDL to DML processing. COPY now goes through planner/optimization.

* Propagate external file information

* Removed unused serialization stuff from plan nodes

* Codegen can now have constant generic/opaque bytes in module

* When no columns specified during copy, all columns are inserted

* Added function to throw expception with ill-formatted input string when converting to number

* Removed serialization

* Added input functions in prepartion to read table data from files

* All SQL types must now provide an input function to convert a string into a SQL type

* Added test for value integrity

* First take at CSV Scan translator

* Fix after rebase

* file api

* CSV scanner reads lines

* Process CSV line in scanner

* Free memory when re-allocating line buffer

* Added memcmp to codegen interface. Renamed CallPrintf() to Printf().

* Cleaned up CSV scan translator. Added null checking.

* Moved TupleRuntime::CreateVarlen() into ValuesRuntime::WriteVarlen(). Better code organization and clearer name.

* Added error handling for long columns. Added null-terminator byte for when read-buffers are copied to line-buffers.

* Added inputs for decimal types

* Moved type-specific functions into function namespace

* REALLY simple Date support

* Compile fixes for GCC 6+

* Get string inputs working

* Beefed up tests

* Simple CSV scan test

* Updated optimize to continue support for old/weird/strange AF copy executor

* Extracted implementation into CPP file for plan node

* * Propagatge file options through optimization.
* Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug.
* Beefed up overflow checks in NumericRuntime.
* Fixed tests.

* Fixes after rebase

* Simple function to convert tuple to string CSV

* Fix void* -> i8* conversion

* More tests

* Address reviews

* Revert "Removed serialization"

This reverts commit d055ff9.

* Revert "Removed unused serialization stuff from plan nodes"

This reverts commit 74427c7.

* Beefed up tests, which caught more bugs

* Fix tests

* Reducing copying overhead for columns, constraints and loop variables during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
mtunique pushed a commit to mtunique/peloton that referenced this pull request Apr 16, 2019
* Add more information to CopyStatement. Cleaned up includes.

* Add CSVScan node to ToString and FromString. Removed BINARY external format for now.

* Move COPY from DDL to DML processing. COPY now goes through planner/optimization.

* Propagate external file information

* Removed unused serialization stuff from plan nodes

* Codegen can now have constant generic/opaque bytes in module

* When no columns specified during copy, all columns are inserted

* Added function to throw expception with ill-formatted input string when converting to number

* Removed serialization

* Added input functions in prepartion to read table data from files

* All SQL types must now provide an input function to convert a string into a SQL type

* Added test for value integrity

* First take at CSV Scan translator

* Fix after rebase

* file api

* CSV scanner reads lines

* Process CSV line in scanner

* Free memory when re-allocating line buffer

* Added memcmp to codegen interface. Renamed CallPrintf() to Printf().

* Cleaned up CSV scan translator. Added null checking.

* Moved TupleRuntime::CreateVarlen() into ValuesRuntime::WriteVarlen(). Better code organization and clearer name.

* Added error handling for long columns. Added null-terminator byte for when read-buffers are copied to line-buffers.

* Added inputs for decimal types

* Moved type-specific functions into function namespace

* REALLY simple Date support

* Compile fixes for GCC 6+

* Get string inputs working

* Beefed up tests

* Simple CSV scan test

* Updated optimize to continue support for old/weird/strange AF copy executor

* Extracted implementation into CPP file for plan node

* * Propagatge file options through optimization.
* Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug.
* Beefed up overflow checks in NumericRuntime.
* Fixed tests.

* Fixes after rebase

* Simple function to convert tuple to string CSV

* Fix void* -> i8* conversion

* More tests

* Address reviews

* Revert "Removed serialization"

This reverts commit d055ff9.

* Revert "Removed unused serialization stuff from plan nodes"

This reverts commit 74427c7.

* Beefed up tests, which caught more bugs

* Fix tests

* Reducing copying overhead for columns, constraints and loop variables during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants