Support COPY for CSV files #1371

pmenon · 2018-05-17T04:16:07Z

Summary

This PR adds support for psql's COPY command for bulk loading CSV files into the database. Using COPY, one can load millions of rows into a table in a few seconds. For example, I was able to load 20M rows into a table with four integer columns in under three seconds; I also loaded an SF-1 lineitem table from TPC-H in about 20 seconds (which isn't great, but is faster than going through oltpbench). I find this very convenient for loading a crap-tonne of data into the database quickly to run benchmarks.

Right now, we only support CSV files, but the quoting, escaping, and delimiter characters can be configured. I tried to make the parser fairly robust to erroneous files, but we're not as generous as Postgres (which goes to great lengths to try to understand your CSV).

Modifications

Changes in the parser to look for format, delimiter, escape, and quote characters.
Added logical and physical ExternalFileScan and ExportExternalFile operators to optimizer for copy-from and copy-to.
Added CSVScanPlan and ExportExternalFilePlan to planner.
Added CSVScanTranslator to codegen. Also added runtime helper CSVScanner that accepts a callback function to invoke per-row in the CSV.
Added integer, decimal, string, and date parsing logic into the functions namespace.
Lots of tests.
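
To make the callback-driven scanning mentioned above more concrete, here is a minimal sketch of a CSV scanner that invokes a callback once per row and takes configurable delimiter, quote, and escape characters. The names `CsvOptions`, `RowCallback`, and `ScanCsv` are hypothetical and are not the PR's `CSVScanner` API. The sketch also works on an in-memory string, whereas the PR reads a file in chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical option bundle; the names are illustrative, not Peloton's.
struct CsvOptions {
  char delimiter = ',';
  char quote = '"';
  char escape = '"';
};

using RowCallback = std::function<void(const std::vector<std::string> &)>;

// Splits `input` into rows and fields and calls `on_row` once per row.
// Blank lines are skipped. Returns the number of rows delivered.
inline std::size_t ScanCsv(const std::string &input, const CsvOptions &opts,
                           const RowCallback &on_row) {
  assert(opts.delimiter != opts.quote);
  std::vector<std::string> fields;
  std::string field;
  bool in_quotes = false;
  bool row_has_data = false;
  std::size_t rows = 0;

  auto finish_row = [&]() {
    fields.push_back(field);
    field.clear();
    on_row(fields);
    fields.clear();
    row_has_data = false;
    rows++;
  };

  for (std::size_t i = 0; i < input.size(); i++) {
    const char c = input[i];
    if (in_quotes) {
      const bool has_next = i + 1 < input.size();
      if (c == opts.escape && has_next &&
          (input[i + 1] == opts.quote || input[i + 1] == opts.escape)) {
        field.push_back(input[++i]);  // escaped quote/escape: keep literally
      } else if (c == opts.quote) {
        in_quotes = false;  // closing quote
      } else {
        field.push_back(c);  // delimiters and newlines are data here
      }
    } else if (c == opts.quote) {
      in_quotes = true;
      row_has_data = true;
    } else if (c == opts.delimiter) {
      fields.push_back(field);
      field.clear();
      row_has_data = true;
    } else if (c == '\n') {
      if (row_has_data) finish_row();
    } else if (c != '\r') {
      field.push_back(c);
      row_has_data = true;
    }
  }
  if (row_has_data) finish_row();  // last row without a trailing newline
  return rows;
}
```

A caller would pass a lambda that converts each field to the column's type and appends the tuple to the table. Malformed input such as an unterminated quote is not diagnosed in this sketch.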

Reviewers

@chenboy Can you take a look at the changes to the optimizer? I didn't add any costs (though one probably could use the schema and file size to estimate the number of rows).
@tcm-marcel Want to take a stab at this?
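
As a rough illustration of the row estimate suggested above (file size divided by an approximate row width derived from the schema), a sketch could look like the following. `ColumnKind`, `AssumedTextWidth`, `EstimateRowCount`, and the per-type byte widths are all assumptions made for this example. None of them come from the PR or from Peloton's optimizer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical column categories, for illustration only.
enum class ColumnKind { Integer, Decimal, Date, Varchar };

// Assumed average width of a value's text form, in bytes. These are guesses.
inline uint64_t AssumedTextWidth(ColumnKind kind) {
  switch (kind) {
    case ColumnKind::Integer: return 6;
    case ColumnKind::Decimal: return 8;
    case ColumnKind::Date:    return 10;  // e.g. "YYYY-MM-DD"
    case ColumnKind::Varchar: return 16;
  }
  return 16;
}

// Estimated rows = file size / estimated bytes per row.
inline uint64_t EstimateRowCount(uint64_t file_bytes,
                                 const std::vector<ColumnKind> &schema) {
  assert(!schema.empty());
  // (n - 1) delimiters plus one newline per row gives n extra bytes.
  uint64_t row_bytes = schema.size();
  for (ColumnKind kind : schema) row_bytes += AssumedTextWidth(kind);
  return file_bytes / row_bytes;
}
```

With four integer columns the assumed row width is 28 bytes, so a 2,800-byte file is estimated at 100 rows. A real cost model would more likely sample the file than rely on fixed widths.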

coveralls · 2018-05-17T05:23:19Z

Coverage decreased (-0.3%) to 77.066% when pulling ec280f9 on pmenon:csv into 881a8e6 on cmu-db:master.

apavlo

There is a lot here to review. We still need @tcm-marcel to take a look.

apavlo · 2018-05-29T15:12:58Z

src/codegen/buffering_consumer.cpp

+  std::string ret;
+  for (uint32_t i = 0; i < tuple_.size(); i++) {
+    if (i != 0) ret.append(",");
+    ret.append(tuple_[i].ToString());


The problem with this is that we are not escaping commas. Is there a CSV library that we could use?

This is just a utility function to pretty print tuples for debugging purposes. It isn't used in execution.
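
For reference, if the output ever did need to be valid CSV, escaping could be done without an external library. The sketch below quotes a field only when it contains the delimiter, a quote, or a line break, and doubles any embedded quotes. `QuoteCsvField` and `JoinCsvFields` are hypothetical helpers, not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Wraps `field` in quotes if it contains a character that would otherwise
// break the row structure; embedded quotes are doubled.
inline std::string QuoteCsvField(const std::string &field,
                                 char delimiter = ',', char quote = '"') {
  assert(delimiter != quote);
  const bool needs_quotes = field.find(delimiter) != std::string::npos ||
                            field.find(quote) != std::string::npos ||
                            field.find('\n') != std::string::npos ||
                            field.find('\r') != std::string::npos;
  if (!needs_quotes) return field;

  std::string out(1, quote);
  for (char c : field) {
    if (c == quote) out.push_back(quote);  // double the embedded quote
    out.push_back(c);
  }
  out.push_back(quote);
  return out;
}

// Joins already-stringified values into one CSV row (no trailing newline).
inline std::string JoinCsvFields(const std::vector<std::string> &fields,
                                 char delimiter = ',', char quote = '"') {
  std::string row;
  for (std::size_t i = 0; i < fields.size(); i++) {
    if (i != 0) row.push_back(delimiter);
    row.append(QuoteCsvField(fields[i], delimiter, quote));
  }
  return row;
}
```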

apavlo · 2018-05-29T15:13:44Z

src/codegen/codegen.cpp

  auto *printf_fn = LookupBuiltin("printf");
  if (printf_fn == nullptr) {
+#if GCC_AT_LEAST_6


Can you add a comment to explain what this does?

I've added a short comment explaining this.

TL;DR: GCC 6+ complains when function attributes (e.g., nothrow, nonnull) get dropped from a type. We use decltype to get the signature of some functions and don't need the attributes there, but the warning fails compilation. We ignore this warning for this section of code.
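
A minimal sketch of the suppression pattern, assuming the warning in question is GCC's `-Wignored-attributes` (the thread does not name the flag). `BuiltinSignature`, `MemcmpSignature`, and `CallMemcmp` are hypothetical stand-ins for the codegen's builtin registration. The compiler check is written out directly instead of using the `GCC_AT_LEAST_6` macro from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper that carries a function pointer type as a template
// argument, as a decltype-based builtin registration might.
template <typename FnPtr>
struct BuiltinSignature {
  using Pointer = FnPtr;
  static constexpr bool kIsFunctionPointer =
      std::is_pointer<FnPtr>::value &&
      std::is_function<typename std::remove_pointer<FnPtr>::type>::value;
};

// libc declarations can carry attributes such as nonnull. Attributes are not
// kept on template arguments, which GCC 6+ may report as a warning.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 6
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wignored-attributes"
#endif

using MemcmpSignature = BuiltinSignature<decltype(&std::memcmp)>;

#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 6
#pragma GCC diagnostic pop
#endif

static_assert(MemcmpSignature::kIsFunctionPointer,
              "decltype(&std::memcmp) should be a function pointer type");

// Calls memcmp through the recorded pointer type.
inline int CallMemcmp(const void *a, const void *b, std::size_t n) {
  assert(a != nullptr && b != nullptr);
  MemcmpSignature::Pointer fn = &std::memcmp;
  return fn(a, b, n);
}
```

The push/pop pair keeps the suppression local, so the warning stays active for the rest of the translation unit.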

apavlo · 2018-05-29T15:14:23Z

src/codegen/codegen.cpp

+  static constexpr char kMemcmpFnName[] = "memcmp";
+  auto *memcmp_fn = LookupBuiltin(kMemcmpFnName);
+  if (memcmp_fn == nullptr) {
+#if GCC_AT_LEAST_6


Again, please explain why we need this.

I've added a short comment explaining this.

TL;DR: GCC 6+ complains when function attributes (e.g., nothrow, nonnull) get dropped from a type. We use decltype to get the signature of some functions and don't need the attributes there, but the warning fails compilation. We ignore this warning for this section of code.

apavlo · 2018-05-29T15:14:59Z

src/codegen/operator/csv_scan_translator.cpp

+                codegen.Const8(scan.GetEscapeChar())});
+}
+
+namespace {


This is just an anonymous namespace. I use it so I can create classes in .cpp files without worrying about name collisions.

apavlo · 2018-05-29T15:19:07Z

src/codegen/type/type.cpp


 Type::Type(const SqlType &sql_type, bool _nullable)
    : Type(sql_type.TypeId(), _nullable) {}

 bool Type::operator==(const Type &other) const {
+  // TODO(pmenon): This isn't correct; we need to check all other fields ...


Shouldn't this be an easy fix?

I tried this, but it's a little more involved because of some assumptions we make in the code. We can probably address this in a separate PR.

tcm-marcel

This is the first part of my review. I am not done yet.

tcm-marcel · 2018-05-25T19:23:27Z

test/network/rpc_queryplan_test.cpp

@@ -22,6 +22,7 @@ namespace test {
 class RpcQueryPlanTests : public PelotonTest {};

 TEST_F(RpcQueryPlanTests, BasicTest) {
+#if 0


Is this code not needed anymore?

tcm-marcel · 2018-05-25T19:24:22Z

test/include/codegen/testing_codegen_util.h

+};
+
+/**
+ * Common base class for all codegen tests. This class four test tables that all


"This class four test" verb missing

tcm-marcel · 2018-05-25T19:50:41Z

test/function/numeric_functions_test.cpp


-TEST_F(DecimalFunctionsTests, SqrtTest) {
+TEST_F(NumericFunctionsTest, SqrtTest) {


Our naming conventions for test suite classes actually say that they should end with the suffix Tests. [1] I'm bringing this up because #1257 will add a static check for this.

[1] https://github.com/cmu-db/peloton/blob/master/test/README.md

tcm-marcel · 2018-05-25T20:05:47Z

src/include/codegen/codegen.h

-  llvm::Constant *ConstString(const std::string &s) const;
+  llvm::Value *ConstString(const std::string &str_val,
+                           const std::string &name) const;
+  llvm::Value *ConstType(const type::Type &type);


Can you add a comment that describes the functionality of ConstType? It is not obvious to me.

tcm-marcel · 2018-05-25T21:40:30Z

src/codegen/operator/csv_scan_translator.cpp

+                codegen.Const8(scan.GetEscapeChar())});
+}
+
+namespace {


What is that for?

This is just an anonymous namespace. I use it so I can create classes in .cpp files without worrying about name collisions.

…format for now.

…ptimization.

…en converting to number

…ecutor

* Added codegen.cpp to source validator whitelist, since we have the ability to call printf() from codegen for debug.
* Beefed up overflow checks in NumericRuntime.
* Fixed tests.

This reverts commit d055ff9.

This reverts commit 74427c7.

… during CheckConstraints(). We were spending 50% of our time here during bulk insertions into wide tables due to unnecessary copying!

tcm-marcel

I think I saw most of the code. Some part in the optimizer is missing, but I don't understand that part. The code works great, I used it several times already!

Out of curiosity: is there a reason you decided to read chunks into a buffer instead of using mmap?

tcm-marcel · 2018-06-06T22:32:00Z

src/codegen/operator/csv_scan_translator.cpp

+      column_accessors.emplace_back(output_attributes_[i], cols,
+                                    scan.GetNullString(), null_str);
+    }
+    for (uint32_t i = 0; i < output_attributes_.size(); i++) {


Can this loop be merged with the one before?

It needs to be two loops because we build up a vector in the first loop and insert pointers to vector elements in the second loop. If it's in the same loop, the pointers may be invalid due to resizing.

We could reserve space beforehand and use one loop. We use this pattern in other places.
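
The two approaches discussed here can be sketched side by side. `ColumnAccessor` below is a simplified, hypothetical stand-in for the class in `csv_scan_translator.cpp`, and both function names are made up for the example. The point is only that `reserve()` guarantees no reallocation until the reserved capacity is exceeded, so pointers taken inside a single loop stay valid.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the real accessor type.
struct ColumnAccessor {
  std::string name;
};

// Two loops: fill the vector completely, then take element addresses.
inline std::vector<const ColumnAccessor *> BuildWithTwoLoops(
    const std::vector<std::string> &names,
    std::vector<ColumnAccessor> &storage) {
  storage.clear();
  for (const auto &name : names) storage.push_back(ColumnAccessor{name});

  std::vector<const ColumnAccessor *> pointers;
  for (const auto &accessor : storage) pointers.push_back(&accessor);
  return pointers;
}

// One loop: reserving up front means push_back never reallocates, so the
// address of each element stays valid while later elements are added.
inline std::vector<const ColumnAccessor *> BuildWithReserve(
    const std::vector<std::string> &names,
    std::vector<ColumnAccessor> &storage) {
  storage.clear();
  storage.reserve(names.size());
  assert(storage.capacity() >= names.size());

  std::vector<const ColumnAccessor *> pointers;
  pointers.reserve(names.size());
  for (const auto &name : names) {
    storage.push_back(ColumnAccessor{name});
    pointers.push_back(&storage.back());
  }
  return pointers;
}
```

In both versions the pointers are only valid while `storage` is alive and not grown further.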

tcm-marcel · 2018-06-06T22:40:26Z