-
Notifications
You must be signed in to change notification settings - Fork 623
Conversation
29f1d1a
to
15b6f8d
Compare
379f3c7
to
2c5b5dc
Compare
103c2d3
to
63bcbe4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot here to review. We still need @tcm-marcel to take a look.
std::string ret; | ||
for (uint32_t i = 0; i < tuple_.size(); i++) { | ||
if (i != 0) ret.append(","); | ||
ret.append(tuple_[i].ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this is that we are not escaping commas. Is there a CSV library that we could use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a utility function to pretty print tuples for debugging purposes. It isn't used in execution.
auto *printf_fn = LookupBuiltin("printf"); | ||
if (printf_fn == nullptr) { | ||
#if GCC_AT_LEAST_6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment to explain what this does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a short comment explaining this.
TL;DR; GCC 6+ complains when it doesn't see you using all function attributes (i.e., nothrow, throw, notnull etc.). We use decltype
to get the signature for some functions, so we don't need the attributes. This fails compilation. We ignore this warning for this section of code.
static constexpr char kMemcmpFnName[] = "memcmp"; | ||
auto *memcmp_fn = LookupBuiltin(kMemcmpFnName); | ||
if (memcmp_fn == nullptr) { | ||
#if GCC_AT_LEAST_6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, please explain why we need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a short comment explaining this.
TL;DR; GCC 6+ complains when it doesn't see you using all function attributes (i.e., nothrow, throw, notnull etc.). We use decltype
to get the signature for some functions, so we don't need the attributes. This fails compilation. We ignore this warning for this section of code.
codegen.Const8(scan.GetEscapeChar())}); | ||
} | ||
|
||
namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an anonymous namespace. I use it so I can create classes in .cpp
files without worrying about name collisions.
|
||
Type::Type(const SqlType &sql_type, bool _nullable) | ||
: Type(sql_type.TypeId(), _nullable) {} | ||
|
||
bool Type::operator==(const Type &other) const { | ||
// TODO(pmenon): This isn't correct; we need to check all other fields ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be an easy fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but it's a little more involved because of some assumptions we make in the code. We can probably address this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first part of my review. I am not done yet.
test/network/rpc_queryplan_test.cpp
Outdated
@@ -22,6 +22,7 @@ namespace test { | |||
class RpcQueryPlanTests : public PelotonTest {}; | |||
|
|||
TEST_F(RpcQueryPlanTests, BasicTest) { | |||
#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this code not needed anymore?
}; | ||
|
||
/** | ||
* Common base class for all codegen tests. This class four test tables that all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This class four test" verb missing
|
||
TEST_F(DecimalFunctionsTests, SqrtTest) { | ||
TEST_F(NumericFunctionsTest, SqrtTest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our naming conventions for test suite classes actually say that they should end with the suffix Tests
. [1] Catching this up, because #1257 will create a static check for this.
[1] https://github.com/cmu-db/peloton/blob/master/test/README.md
src/include/codegen/codegen.h
Outdated
llvm::Constant *ConstString(const std::string &s) const; | ||
llvm::Value *ConstString(const std::string &str_val, | ||
const std::string &name) const; | ||
llvm::Value *ConstType(const type::Type &type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment that describes the functionality of ConstType
? It is not obvious to me.
codegen.Const8(scan.GetEscapeChar())}); | ||
} | ||
|
||
namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats is that for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an anonymous namespace. I use it so I can create classes in .cpp
files without worrying about name collisions.
…en converting to number
… during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I saw most of the code. Some part in the optimizer is missing, but I don't understand that part. The code works great, I used it several times already!
Out of curiosity: What is the reason/Is there a reaon you decided to read chunks into a buffer instead of using mmap
?
column_accessors.emplace_back(output_attributes_[i], cols, | ||
scan.GetNullString(), null_str); | ||
} | ||
for (uint32_t i = 0; i < output_attributes_.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this loop be merged with the one before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be two loops because we build up a vector in the first loop and insert pointers to vector elements in the second loop. If it's in the same loop, the pointers may be invalid due to resizing.
We could reserve space beforehand and use one loop. We use this pattern in other places.
char delimiter, char quote, char escape) { | ||
// Forward to constructor | ||
new (&scanner) | ||
CSVScanner(*executor_context.GetPool(), file_path, col_types, num_cols, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarification: Is the CSVScanner initialized at runtime and not during code generation to make the plan reusable for different paths etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, the CSVScanner
is created and initialized at runtime per-invocation.
In fact, all the classes in the util
directory are runtime classes. We should rename it to peloton::codegen::runtime
to make this clear.
last_was_escape = false; | ||
} | ||
|
||
// Process the new-line character. If we a new-line and we're not currently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will fix.
} | ||
} | ||
|
||
// Flush remaining valid bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get this part, can you explain briefly? (When does this occur?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially, all bytes between buffer_pos_
and curr_buffer_pos
are valid bytes that should be part of the current line. It is possible that we exit the loop not having written all valid bytes to the line buffer. This section ensures we flush these valid bytes before retuning.
break; | ||
} | ||
|
||
*out++ = c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly, that this line basically writes the current character to the same position in the same buffer - except for the offset that was created by quotes and escapes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, we're writing non-quote characters (unless they're escaped).
// finish | ||
if (c == delimiter || c == '\0') { | ||
col_end = out; | ||
iter--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is iter
decremented here? The next column loop iteration will read the delimiter or null-byte again, will it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The invariant we want to hold is that when this loop exists, iter
either points to a newline character or a delimiter. We decrement here because we incremented earlier on line 300. Happy to take suggestions on how to make this clearer. CSV parsing is tricky 😓
llvm::Value *cols = codegen->CreateLoad(codegen->CreateConstInBoundsGEP2_32( | ||
CSVScannerProxy::GetType(codegen), LoadStatePtr(scanner_id_), 0, 1)); | ||
|
||
llvm::Value *null_str = codegen.ConstString(scan.GetNullString(), "null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a chance that not all types have "null" as null-string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second null
argument is the name of the constant variable, not the value we're comparing with. A null string exists in the plan node.
} | ||
|
||
// Eat the rest | ||
while (consumed_ptr < end && *consumed_ptr == ' ') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean, we basically only allow spaces after the actual parsed decimal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We allow spaces both before and after actually. std::strtod
handles spaces beforehand. We handle trailing spaces.
@tcm-marcel I didn't use There is an argument in software engineering simplicity using |
@pmenon Thank you for the explanation! I didn't know |
@tcm-marcel I misspoke. |
* Add more information to CopyStatement. Cleaned up includes. * Add CSVScan node to ToString and FromString. Removed BINARY external format for now. * Move COPY from DDL to DML processing. COPY now goes through planner/optimization. * Propagate external file information * Removed unused serialization stuff from plan nodes * Codegen can now have constant generic/opaque bytes in module * When no columns specified during copy, all columns are inserted * Added function to throw expception with ill-formatted input string when converting to number * Removed serialization * Added input functions in prepartion to read table data from files * All SQL types must now provide an input function to convert a string into a SQL type * Added test for value integrity * First take at CSV Scan translator * Fix after rebase * file api * CSV scanner reads lines * Process CSV line in scanner * Free memory when re-allocating line buffer * Added memcmp to codegen interface. Renamed CallPrintf() to Printf(). * Cleaned up CSV scan translator. Added null checking. * Moved TupleRuntime::CreateVarlen() into ValuesRuntime::WriteVarlen(). Better code organization and clearer name. * Added error handling for long columns. Added null-terminator byte for when read-buffers are copied to line-buffers. * Added inputs for decimal types * Moved type-specific functions into function namespace * REALLY simple Date support * Compile fixes for GCC 6+ * Get string inputs working * Beefed up tests * Simple CSV scan test * Updated optimize to continue support for old/weird/strange AF copy executor * Extracted implementation into CPP file for plan node * * Propagatge file options through optimization. * Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug. * Beefed up overflow checks in NumericRuntime. * Fixed tests. * Fixes after rebase * Simple function to convert tuple to string CSV * Fix void* -> i8* conversion * More tests * Address reviews * Revert "Removed serialization" This reverts commit d055ff9. * Revert "Removed unused serialization stuff from plan nodes" This reverts commit 74427c7. * Beefed up tests, which caught more bugs * Fix tests * Reducing copying overhead for columns, constraints and loop variables during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
* Add more information to CopyStatement. Cleaned up includes. * Add CSVScan node to ToString and FromString. Removed BINARY external format for now. * Move COPY from DDL to DML processing. COPY now goes through planner/optimization. * Propagate external file information * Removed unused serialization stuff from plan nodes * Codegen can now have constant generic/opaque bytes in module * When no columns specified during copy, all columns are inserted * Added function to throw expception with ill-formatted input string when converting to number * Removed serialization * Added input functions in prepartion to read table data from files * All SQL types must now provide an input function to convert a string into a SQL type * Added test for value integrity * First take at CSV Scan translator * Fix after rebase * file api * CSV scanner reads lines * Process CSV line in scanner * Free memory when re-allocating line buffer * Added memcmp to codegen interface. Renamed CallPrintf() to Printf(). * Cleaned up CSV scan translator. Added null checking. * Moved TupleRuntime::CreateVarlen() into ValuesRuntime::WriteVarlen(). Better code organization and clearer name. * Added error handling for long columns. Added null-terminator byte for when read-buffers are copied to line-buffers. * Added inputs for decimal types * Moved type-specific functions into function namespace * REALLY simple Date support * Compile fixes for GCC 6+ * Get string inputs working * Beefed up tests * Simple CSV scan test * Updated optimize to continue support for old/weird/strange AF copy executor * Extracted implementation into CPP file for plan node * * Propagatge file options through optimization. * Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug. * Beefed up overflow checks in NumericRuntime. * Fixed tests. * Fixes after rebase * Simple function to convert tuple to string CSV * Fix void* -> i8* conversion * More tests * Address reviews * Revert "Removed serialization" This reverts commit d055ff9. * Revert "Removed unused serialization stuff from plan nodes" This reverts commit 74427c7. * Beefed up tests, which caught more bugs * Fix tests * Reducing copying overhead for columns, constraints and loop variables during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
* Add more information to CopyStatement. Cleaned up includes. * Add CSVScan node to ToString and FromString. Removed BINARY external format for now. * Move COPY from DDL to DML processing. COPY now goes through planner/optimization. * Propagate external file information * Removed unused serialization stuff from plan nodes * Codegen can now have constant generic/opaque bytes in module * When no columns specified during copy, all columns are inserted * Added function to throw expception with ill-formatted input string when converting to number * Removed serialization * Added input functions in prepartion to read table data from files * All SQL types must now provide an input function to convert a string into a SQL type * Added test for value integrity * First take at CSV Scan translator * Fix after rebase * file api * CSV scanner reads lines * Process CSV line in scanner * Free memory when re-allocating line buffer * Added memcmp to codegen interface. Renamed CallPrintf() to Printf(). * Cleaned up CSV scan translator. Added null checking. * Moved TupleRuntime::CreateVarlen() into ValuesRuntime::WriteVarlen(). Better code organization and clearer name. * Added error handling for long columns. Added null-terminator byte for when read-buffers are copied to line-buffers. * Added inputs for decimal types * Moved type-specific functions into function namespace * REALLY simple Date support * Compile fixes for GCC 6+ * Get string inputs working * Beefed up tests * Simple CSV scan test * Updated optimize to continue support for old/weird/strange AF copy executor * Extracted implementation into CPP file for plan node * * Propagatge file options through optimization. * Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug. * Beefed up overflow checks in NumericRuntime. * Fixed tests. * Fixes after rebase * Simple function to convert tuple to string CSV * Fix void* -> i8* conversion * More tests * Address reviews * Revert "Removed serialization" This reverts commit d055ff9. * Revert "Removed unused serialization stuff from plan nodes" This reverts commit 74427c7. * Beefed up tests, which caught more bugs * Fix tests * Reducing copying overhead for columns, constraints and loop variables during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!
Summary
This PR adds support for psql's
COPY
command for bulk loading CSV files into the database. UsingCOPY
, one can quickly load millions of rows into a table in a few seconds. For example, I was able to load 20M rows into a table with four integer columns in under three seconds; I also loaded an SF-1lineitem
table from TPC-H in about 20 seconds (which isn't great, but faster than through oltpbench). I find it very convenient to use this to quickly load a crap-tonne of data into the database fast to do benchmarks.Right now, we only support CSV files, but the quoting, escaping, and delimiter characters can be configured. I tried to make the parser fairly robust to erroneous files, but we're not as generous as Postgres (which goes to great lengths to try to understand your CSV).
Modifications
format
,delimiter
,escape
, andquote
characters.ExternalFileScan
andExportExternalFile
operators to optimizer for copy-from and copy-to.CSVScanPlan
andExportExternalFilePlan
to planner.CSVScanTranslator
to codegen. Also added runtime helperCSVScanner
that accepts a callback function to invoke per-row in the CSV.functions
namespace.Reviewers