Switching to using an instance of the SmarterCSV::Reader class (#279)
* move all module method to instance methods

* update writer

* updating docs

* pre-release 1.12.0.pre1
tilo authored Jul 8, 2024
1 parent 7274d9a commit a89ca11
Showing 45 changed files with 1,391 additions and 877 deletions.
10 changes: 8 additions & 2 deletions .rubocop.yml
@@ -25,7 +25,7 @@ Metrics/BlockNesting:
Metrics/ClassLength:
Enabled: false

Metrics/CyclomaticComplexity: # BS rule
Metrics/CyclomaticComplexity:
Enabled: false

Metrics/MethodLength:
@@ -34,7 +34,7 @@ Metrics/MethodLength:
Metrics/ModuleLength:
Enabled: false

Metrics/PerceivedComplexity: # BS rule
Metrics/PerceivedComplexity:
Enabled: false

Naming/PredicateName:
@@ -46,6 +46,9 @@ Naming/VariableName:
Naming/VariableNumber:
Enabled: false

Style/AccessorGrouping: # not needed
Enabled: false

Style/ClassEqualityComparison:
Enabled: false

@@ -88,6 +91,9 @@ Style/IfInsideElse:
Style/IfUnlessModifier:
Enabled: false

Style/InverseMethods:
Enabled: false

Style/NestedTernaryOperator:
Enabled: false

28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,34 @@

# SmarterCSV 1.x Change Log

## 1.12.0 (2024-07-08)
* added SmarterCSV::Reader to process CSV files ([PR #277](https://github.com/tilo/smarter_csv/pull/277))
* SmarterCSV::Writer changed default row separator to the system's row separator (`\n` on Linux, `\r\n` on Windows)
* added a lot of docs

* POTENTIAL ISSUE:

Version 1.12.x has a change of the underlying implementation of `SmarterCSV.process(file_or_input, options, &block)`.
Underneath it now uses this interface:
```ruby
reader = SmarterCSV::Reader.new(file_or_input, options)
# either simple one-liner:
data = reader.process
# or block format:
data = reader.process do
# do something here
end
```
It still supports calling `SmarterCSV.process` for backwards compatibility, but this no longer provides access to the internal state, e.g. `raw_headers`.

* `SmarterCSV.raw_headers` -> `reader.raw_headers`
* `SmarterCSV.headers` -> `reader.headers`

If you need these features, please update your code to create an instance of `SmarterCSV::Reader` as shown above.


## 1.11.2 (2024-07-06)
* fixing missing errors definition

427 changes: 31 additions & 396 deletions README.md

Large diffs are not rendered by default.

40 changes: 40 additions & 0 deletions docs/_introduction.md
@@ -0,0 +1,40 @@

# SmarterCSV Introduction

`smarter_csv` is a Ruby Gem for smarter importing of CSV files as arrays of hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking off batch jobs with Sidekiq, or uploading data to S3.


## Why another CSV library?

Ruby's built-in 'csv' library has a dated API, and its array-of-arrays output format feels unnecessarily 'close to the metal'. That output is not easy to work with, especially when you need hashes to create database records, generate JSON, or pass data to Sidekiq or S3. Another shortcoming is that Ruby's 'csv' library has poor support for huge CSV files: there is no built-in batching or parallel processing of the CSV content (e.g. with Sidekiq jobs).

When SmarterCSV was envisioned, I needed to run nightly imports of very large CSV data sets that had to be upserted into a database and, because of the sheer volume, processed in parallel.
The CSV processing also needed to be robust against variations in the input data.

## Benefits of using SmarterCSV

* Improved Robustness:
Typically you have little control over the data quality of CSV files that need to be imported. SmarterCSV's intelligent defaults and auto-detection of typical formats improve the robustness of your CSV imports without manual tweaking of options.

* Easy-to-use Format:
By using a Ruby hash to represent a CSV row, SmarterCSV allows you to directly use this data and insert it into a database, or use it with Sidekiq, S3, message queues, etc.

* Normalized Headers:
SmarterCSV automatically transforms CSV headers to Ruby symbols, stripping leading or trailing whitespace.
There are many ways to customize the header transformation to your liking. You can re-map CSV headers to hash keys, and you can ignore CSV columns.

* Normalized Data:
SmarterCSV transforms the data in each CSV row automatically, stripping whitespace, converting numerical data into numbers, ignoring nil or empty fields, and more. There are many ways to customize this. You can even add your own value converters.

* Batch Processing of large CSV files:
Processing large CSV files in chunks reduces the memory impact and allows for faster / parallel processing.
By adding the option `chunk_size: numeric_value`, you can switch to batch processing. SmarterCSV will then return arrays of hashes, which makes parallel processing easy: you can pass whole chunks of data to Sidekiq, bulk-insert them into a database, or send them to other data sinks.
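
As a rough plain-Ruby sketch (illustrative only, not SmarterCSV's actual implementation), the default header normalization described above behaves roughly like this:

```ruby
# Illustrative sketch of default header normalization: strip surrounding
# whitespace, downcase, replace inner whitespace with underscores, and
# convert the result to a symbol. This is NOT the gem's actual code.
def normalize_header(raw)
  raw.strip.downcase.gsub(/\s+/, '_').to_sym
end

normalize_header("  First Name ")  # => :first_name
```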

## Additional Features

* Header Validation:
You can validate that a set of hash keys is present in each record after header transformations are applied.
This helps ensure that imported data has consistent quality.

* Data Validations
(planned feature)
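
The idea behind the header validation described above can be sketched in plain Ruby. This is an illustration of the concept only; the actual option name and error class in the library differ:

```ruby
# Conceptual sketch of header validation (NOT the library's API):
# after header transformations, verify that all required keys are present.
def validate_headers!(row, required_keys)
  missing = required_keys - row.keys
  raise "Missing required keys: #{missing.inspect}" unless missing.empty?
end

validate_headers!({ id: 1, name: "Ada" }, %i[id name])  # passes silently
```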
140 changes: 140 additions & 0 deletions docs/basic_api.md
@@ -0,0 +1,140 @@

# SmarterCSV API

Let's explore the basic APIs for reading and writing CSV files. There is a simplified API (backwards compatible with previous SmarterCSV versions) and the full API, which allows you to access the internal state of the reader or writer instance after processing.

## Reading CSV

SmarterCSV has convenient defaults for automatically detecting row and column separators based on the given data. This provides more robust parsing of input files when you have no control over the data, e.g. when users upload CSV files.
Learn more about this [in this section](docs/examples/row_col_sep.md).

### Simplified Interface

The simplified call to read CSV files is:

```ruby
array_of_hashes = SmarterCSV.process(file_or_input, options, &block)
```
It can also be used with a block:

```ruby
SmarterCSV.process(file_or_input, options) do |hash|
  # process one row of CSV
end
```

It can also be used for processing batches of rows:

```ruby
SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes|
  # process one chunk of up to 100 rows of CSV data
end
```

### Full Interface

The simplified API works in most cases, but if you need access to the internal state and detailed results of the CSV-parsing, you should use this form:

```ruby
reader = SmarterCSV::Reader.new(file_or_input, options)
data = reader.process

puts reader.raw_headers
```
It can also be used with a block:

```ruby
reader = SmarterCSV::Reader.new(file_or_input, options)
data = reader.process do
  # do something here
end

puts reader.raw_headers
```

This allows you access to the internal state of the `reader` instance after processing.


## Interface for Writing CSV

To generate a CSV file, we use the `<<` operator to append new data to the file.

The `<<` operator accepts a single hash, an array of hashes, or an array of arrays of hashes, and can be called one or multiple times for each file.

One smart feature of writing CSV data is the discovery of headers.

If you have hashes of data where each hash can have different keys, `SmarterCSV::Writer` automatically discovers the superset of keys and uses it as the headers of the CSV file. This can be disabled by providing one of the options `headers`, `map_headers`, or `discover_headers: false`.
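
The header discovery described above can be illustrated with a plain-Ruby sketch (the hashes and keys here are made-up example data, not part of the library):

```ruby
# Illustrative sketch of header discovery: the resulting CSV headers are
# the superset of all keys seen across the hashes, in order of first
# appearance. This is NOT the gem's actual implementation.
rows = [
  { name: "Ada",   role: "engineer" },
  { name: "Grace", email: "grace@example.com" }  # hypothetical data
]
headers = rows.flat_map(&:keys).uniq
# headers == [:name, :role, :email]
```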


### Simplified Interface

The simplified interface takes a block:

```ruby
SmarterCSV.generate(filename, options) do |csv_writer|
  MyModel.find_in_batches(batch_size: 100) do |batch|
    batch.pluck(:name, :description, :instructor).each do |record|
      csv_writer << record
    end
  end
end
```

### Full Interface

```ruby
writer = SmarterCSV::Writer.new(file_path, options)

MyModel.find_in_batches(batch_size: 100) do |batch|
  batch.pluck(:name, :description, :instructor).each do |record|
    writer << record
  end
end

writer.finalize
```

## Rescue from Exceptions

While SmarterCSV uses sensible defaults to process the most common CSV files, it raises exceptions if it cannot auto-detect `col_sep` or `row_sep`, or if it encounters other problems. Therefore, please rescue from `SmarterCSV::Error` and handle outliers according to your requirements.

If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.

## Troubleshooting

In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).

```
$ hexdump -C spec/fixtures/bom_test_feff.csv
00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
```

## Assumptions / Limitations

* the escape character is `\`, as on UNIX and Windows systems.
* quote characters around fields must be balanced, e.g. valid: `"field"`, invalid: `"field\"`,
  i.e. an escaped `quote_char` does not denote the end of a field.


## NOTES about File Encodings:
* if you have a CSV file which contains unicode characters, you can process it as follows:

```ruby
File.open(filename, "r:bom|utf-8") do |f|
  data = SmarterCSV.process(f)
end
```
* if the CSV file with unicode characters is in a remote location, similarly you need to give the encoding as an option to the `open` call:
```ruby
require 'open-uri'
file_location = 'http://your.remote.org/sample.csv'
open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
data = SmarterCSV.process(f)
end
```
53 changes: 53 additions & 0 deletions docs/batch_processing.md
@@ -0,0 +1,53 @@

# Batch Processing

Processing CSV data in batches (chunks) allows you to parallelize the workload of importing data.
This comes in handy when you don't want to slow down the CSV import of large files.

Setting the option `chunk_size` sets the max batch size.


## Example 1: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
Please note how the returned array contains two sub-arrays (the chunks which were read), each containing 2 hashes.
If the number of rows is not evenly divisible by `:chunk_size`, the last chunk contains fewer hashes.

```ruby
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
=> [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
[ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
]
```

## Example 2: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
Please note how the given block receives the data for each chunk as its parameter (an array of hashes),
and how the `process` method returns the number of chunks when called with a block:

```ruby
> total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
chunk.each do |h| # you can post-process the data from each row to your heart's content, and also create virtual attributes:
h[:full_name] = [h[:first],h[:last]].join(' ') # create a virtual attribute
h.delete(:first) ; h.delete(:last) # remove two keys
end
puts chunk.inspect # we could at this point pass the chunk to a Resque worker..
end

[{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
[{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
=> 2
```

## Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
```ruby
# using chunks:
filename = '/tmp/some.csv'
options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |chunk|
# we're passing a block in, to process each resulting hash / row (block takes array of hashes)
# when chunking is enabled, there are up to :chunk_size hashes in each chunk
MyModel.collection.insert( chunk ) # insert up to 100 records at a time
end

# n => number of chunks processed
```


32 changes: 32 additions & 0 deletions docs/data_transformations.md
@@ -0,0 +1,32 @@
# Data Transformations

SmarterCSV automatically transforms the values in each column in order to normalize the data.
This behavior can be customized or disabled.

## Remove Empty Values
`remove_empty_values` is enabled by default.
It removes any values which are `nil` or empty strings.

## Convert Values to Numeric
`convert_values_to_numeric` is enabled by default.
SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
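
As an illustrative plain-Ruby sketch (not the gem's actual code), the conversion behaves roughly like this:

```ruby
# Rough sketch of convert_values_to_numeric: integer-looking strings
# become Integers, float-looking strings become Floats, everything
# else is left untouched. This is NOT SmarterCSV's implementation.
def convert_numeric(value)
  case value
  when /\A-?\d+\z/      then value.to_i
  when /\A-?\d*\.\d+\z/ then value.to_f
  else value
  end
end

convert_numeric("42")    # => 42
convert_numeric("3.14")  # => 3.14
convert_numeric("N/A")   # => "N/A"
```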

## Remove Zero Values
`remove_zero_values` is disabled by default.
When enabled, it removes key/value pairs which have a numeric value equal to zero.

## Remove Values Matching
`remove_values_matching` is disabled by default.
When enabled, this can help remove key/value pairs from result hashes which would cause problems.

e.g.
* `remove_values_matching: /^\$0\.0+$/` would remove $0.00
* `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
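
As a plain-Ruby sketch of what this option does (illustrative only, not the library's implementation), with a made-up example row:

```ruby
# Illustrative sketch of remove_values_matching: drop key/value pairs
# whose value matches the given pattern. Row data is hypothetical.
row     = { price: "$0.00", name: "Widget", status: "#VALUE!" }
pattern = /\A(\$0\.0+|#VALUE!)\z/
cleaned = row.reject { |_key, value| value =~ pattern }
# cleaned == { name: "Widget" }
```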

## Empty Hashes

It can happen that, after all transformations, a row of the CSV file produces a completely empty hash.

By default, SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.

This can be set to `false` to keep the empty hashes in the results.
