Switching to using an instance of the SmarterCSV::Reader class (#279)
* move all module method to instance methods

* update writer

* updating docs

* pre-release 1.12.0.pre1
tilo authored Jul 8, 2024
1 parent 7274d9a commit a89ca11
Showing 45 changed files with 1,391 additions and 877 deletions.
10 changes: 8 additions & 2 deletions .rubocop.yml
@@ -25,7 +25,7 @@ Metrics/BlockNesting:
Metrics/ClassLength:
Enabled: false

Metrics/CyclomaticComplexity: # BS rule
Metrics/CyclomaticComplexity:
Enabled: false

Metrics/MethodLength:
@@ -34,7 +34,7 @@ Metrics/MethodLength:
Metrics/ModuleLength:
Enabled: false

Metrics/PerceivedComplexity: # BS rule
Metrics/PerceivedComplexity:
Enabled: false

Naming/PredicateName:
@@ -46,6 +46,9 @@ Naming/VariableName:
Naming/VariableNumber:
Enabled: false

Style/AccessorGrouping: # not needed
Enabled: false

Style/ClassEqualityComparison:
Enabled: false

@@ -88,6 +91,9 @@ Style/IfInsideElse:
Style/IfUnlessModifier:
Enabled: false

Style/InverseMethods:
Enabled: false

Style/NestedTernaryOperator:
Enabled: false

28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,34 @@

# SmarterCSV 1.x Change Log

## 1.12.0 (2024-07-08)
* added SmarterCSV::Reader to process CSV files ([PR #277](https://github.com/tilo/smarter_csv/pull/277))
* SmarterCSV::Writer changed default row separator to the system's row separator (`\n` on Linux, `\r\n` on Windows)
* added a lot of docs

* POTENTIAL ISSUE:

Version 1.12.x has a change of the underlying implementation of `SmarterCSV.process(file_or_input, options, &block)`.
Underneath it now uses this interface:
```ruby
reader = SmarterCSV::Reader.new(file_or_input, options)
# either simple one-liner:
data = reader.process
# or block format:
data = reader.process do
# do something here
end
```
It still supports calling `SmarterCSV.process` for backwards compatibility, but this no longer provides access to the internal state, e.g. `raw_headers`.

* `SmarterCSV.raw_headers` -> `reader.raw_headers`
* `SmarterCSV.headers` -> `reader.headers`

If you need these features, please update your code to create an instance of `SmarterCSV::Reader` as shown above.


## 1.11.2 (2024-07-06)
* fixing missing errors definition

427 changes: 31 additions & 396 deletions README.md

Large diffs are not rendered by default.

40 changes: 40 additions & 0 deletions docs/_introduction.md
@@ -0,0 +1,40 @@

# SmarterCSV Introduction

`smarter_csv` is a Ruby Gem for smarter importing of CSV files as arrays of hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking off batch jobs with Sidekiq, or uploading data to S3.


## Why another CSV library?

Ruby's built-in 'csv' library has a dated API, and its array-of-arrays output format feels unnecessarily 'close to the metal'. That output is not easy to work with, especially when you need hashes to create database records, generate JSON, or pass data to Sidekiq or S3. Another shortcoming is that Ruby's 'csv' library has poor support for huge CSV files: there is no built-in batching or parallel processing of the CSV content (e.g. with Sidekiq jobs).

When SmarterCSV was envisioned, I needed to run nightly imports of very large CSV data sets that had to be upserted into a database and, because of the sheer volume, processed in parallel.
The CSV processing also needed to be robust against variations in the input data.

## Benefits of using SmarterCSV

* Improved Robustness:
Typically you have little control over the data quality of CSV files that need to be imported. SmarterCSV's intelligent defaults and auto-detection of typical formats improve the robustness of your CSV imports without manual tweaking of options.

* Easy-to-use Format:
By using a Ruby hash to represent a CSV row, SmarterCSV allows you to directly use this data and insert it into a database, or use it with Sidekiq, S3, message queues, etc.

* Normalized Headers:
SmarterCSV automatically transforms CSV headers to Ruby symbols, stripping leading or trailing whitespace.
There are many ways to customize the header transformation to your liking. You can re-map CSV headers to hash keys, and you can ignore CSV columns.

* Normalized Data:
SmarterCSV transforms the data in each CSV row automatically, stripping whitespace, converting numerical data into numbers, ignoring nil or empty fields, and more. There are many ways to customize this. You can even add your own value converters.

* Batch Processing of large CSV files:
Processing large CSV files in chunks reduces the memory impact and allows for faster / parallel processing.
By adding the option `chunk_size: numeric_value`, you can switch to batch processing. SmarterCSV will then return arrays of hashes, which makes parallel processing easy: you can pass whole chunks of data to Sidekiq, bulk-insert them into a database, or send them to other data sinks.
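
As a rough plain-Ruby sketch (illustrative only, not SmarterCSV's actual implementation), the default header normalization described above behaves roughly like this:

```ruby
# Illustrative sketch of default header normalization: strip surrounding
# whitespace, downcase, replace inner whitespace with underscores, and
# convert the result to a symbol. This is NOT the gem's actual code.
def normalize_header(raw)
  raw.strip.downcase.gsub(/\s+/, '_').to_sym
end

normalize_header("  First Name ")  # => :first_name
```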

## Additional Features

* Header Validation:
You can validate that a set of hash keys is present in each record after header transformations are applied.
This helps ensure that imported data has consistent quality.

* Data Validations
(planned feature)
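
The idea behind the header validation described above can be sketched in plain Ruby. This is an illustration of the concept only; the actual option name and error class in the library differ:

```ruby
# Conceptual sketch of header validation (NOT the library's API):
# after header transformations, verify that all required keys are present.
def validate_headers!(row, required_keys)
  missing = required_keys - row.keys
  raise "Missing required keys: #{missing.inspect}" unless missing.empty?
end

validate_headers!({ id: 1, name: "Ada" }, %i[id name])  # passes silently
```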
140 changes: 140 additions & 0 deletions docs/basic_api.md
@@ -0,0 +1,140 @@

# SmarterCSV API

Let's explore the basic APIs for reading and writing CSV files. There is a simplified API (backwards compatible with previous SmarterCSV versions) and the full API, which allows you to access the internal state of the reader or writer instance after processing.

## Reading CSV

SmarterCSV has convenient defaults for automatically detecting row and column separators based on the given data. This provides more robust parsing of input files when you have no control over the data, e.g. when users upload CSV files.
Learn more about this [in this section](docs/examples/row_col_sep.md).

### Simplified Interface

The simplified call to read CSV files is:

```ruby
array_of_hashes = SmarterCSV.process(file_or_input, options, &block)
```
It can also be used with a block:

```ruby
SmarterCSV.process(file_or_input, options) do |hash|
  # process one row of CSV
end
```

It can also be used for processing batches of rows:

```ruby
SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes|
  # process one chunk of up to 100 rows of CSV data
end
```

### Full Interface

The simplified API works in most cases, but if you need access to the internal state and detailed results of the CSV-parsing, you should use this form:

```ruby
reader = SmarterCSV::Reader.new(file_or_input, options)
data = reader.process

puts reader.raw_headers
```
It can also be used with a block:

```ruby
reader = SmarterCSV::Reader.new(file_or_input, options)
data = reader.process do
  # do something here
end

puts reader.raw_headers
```

This allows you access to the internal state of the `reader` instance after processing.


## Interface for Writing CSV

To generate a CSV file, we use the `<<` operator to append new data to the file.

The `<<` operator accepts a single hash, an array of hashes, or an array of arrays of hashes, and can be called one or multiple times for each file.

One smart feature of writing CSV data is the discovery of headers.

If you have hashes of data where each hash can have different keys, `SmarterCSV::Writer` automatically discovers the superset of keys and uses it as the headers of the CSV file. This can be disabled by providing one of the options `headers`, `map_headers`, or `discover_headers: false`.
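
The header discovery described above can be illustrated with a plain-Ruby sketch (the hashes and keys here are made-up example data, not part of the library):

```ruby
# Illustrative sketch of header discovery: the resulting CSV headers are
# the superset of all keys seen across the hashes, in order of first
# appearance. This is NOT the gem's actual implementation.
rows = [
  { name: "Ada",   role: "engineer" },
  { name: "Grace", email: "grace@example.com" }  # hypothetical data
]
headers = rows.flat_map(&:keys).uniq
# headers == [:name, :role, :email]
```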


### Simplified Interface

The simplified interface takes a block:

```ruby
SmarterCSV.generate(filename, options) do |csv_writer|
  MyModel.find_in_batches(batch_size: 100) do |batch|
    batch.pluck(:name, :description, :instructor).each do |record|
      csv_writer << record
    end
  end
end
```

### Full Interface

```ruby
writer = SmarterCSV::Writer.new(file_path, options)

MyModel.find_in_batches(batch_size: 100) do |batch|
  batch.pluck(:name, :description, :instructor).each do |record|
    writer << record
  end
end

writer.finalize
```

## Rescue from Exceptions

While SmarterCSV uses sensible defaults to process the most common CSV files, it raises exceptions if it cannot auto-detect `col_sep` or `row_sep`, or if it encounters other problems. Therefore, please rescue from `SmarterCSV::Error` and handle outliers according to your requirements.

If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.

## Troubleshooting

In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).

```
$ hexdump -C spec/fixtures/bom_test_feff.csv
00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
```

## Assumptions / Limitations

* the escape character is `\`, as on UNIX and Windows systems.
* quote characters around fields must be balanced, e.g. valid: `"field"`, invalid: `"field\"`,
  i.e. an escaped `quote_char` does not denote the end of a field.


## NOTES about File Encodings:
* if you have a CSV file which contains unicode characters, you can process it as follows:

```ruby
File.open(filename, "r:bom|utf-8") do |f|
  data = SmarterCSV.process(f)
end
```
* if the CSV file with unicode characters is in a remote location, similarly you need to give the encoding as an option to the `open` call:
```ruby
require 'open-uri'
file_location = 'http://your.remote.org/sample.csv'
open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
data = SmarterCSV.process(f)
end
```
53 changes: 53 additions & 0 deletions docs/batch_processing.md
@@ -0,0 +1,53 @@

# Batch Processing

Processing CSV data in batches (chunks) allows you to parallelize the workload of importing data.
This comes in handy when you don't want to slow down the CSV import of large files.

Setting the option `chunk_size` sets the max batch size.


## Example 1: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
Please note how the returned array contains two sub-arrays (the chunks which were read), each containing 2 hashes.
If the number of rows is not evenly divisible by `:chunk_size`, the last chunk contains fewer hashes.

```ruby
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
=> [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
[ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
]
```

## Example 2: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
Please note how the given block receives the data for each chunk as its parameter (an array of hashes),
and how the `process` method returns the number of chunks when called with a block:

```ruby
> total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
chunk.each do |h| # you can post-process the data from each row to your heart's content, and also create virtual attributes:
h[:full_name] = [h[:first],h[:last]].join(' ') # create a virtual attribute
h.delete(:first) ; h.delete(:last) # remove two keys
end
puts chunk.inspect # we could at this point pass the chunk to a Resque worker..
end

[{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
[{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
=> 2
```

## Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
```ruby
# using chunks:
filename = '/tmp/some.csv'
options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |chunk|
# we're passing a block in, to process each resulting hash / row (block takes array of hashes)
# when chunking is enabled, there are up to :chunk_size hashes in each chunk
MyModel.collection.insert( chunk ) # insert up to 100 records at a time
end

# n => number of chunks processed
```


32 changes: 32 additions & 0 deletions docs/data_transformations.md
@@ -0,0 +1,32 @@
# Data Transformations

SmarterCSV automatically transforms the values in each column in order to normalize the data.
This behavior can be customized or disabled.

## Remove Empty Values
`remove_empty_values` is enabled by default.
It removes any values which are `nil` or empty strings.

## Convert Values to Numeric
`convert_values_to_numeric` is enabled by default.
SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
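
As an illustrative plain-Ruby sketch (not the gem's actual code), the conversion behaves roughly like this:

```ruby
# Rough sketch of convert_values_to_numeric: integer-looking strings
# become Integers, float-looking strings become Floats, everything
# else is left untouched. This is NOT SmarterCSV's implementation.
def convert_numeric(value)
  case value
  when /\A-?\d+\z/      then value.to_i
  when /\A-?\d*\.\d+\z/ then value.to_f
  else value
  end
end

convert_numeric("42")    # => 42
convert_numeric("3.14")  # => 3.14
convert_numeric("N/A")   # => "N/A"
```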

## Remove Zero Values
`remove_zero_values` is disabled by default.
When enabled, it removes key/value pairs which have a numeric value equal to zero.

## Remove Values Matching
`remove_values_matching` is disabled by default.
When enabled, this can help remove key/value pairs from result hashes which would cause problems.

e.g.
* `remove_values_matching: /^\$0\.0+$/` would remove $0.00
* `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
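
As a plain-Ruby sketch of what this option does (illustrative only, not the library's implementation), with a made-up example row:

```ruby
# Illustrative sketch of remove_values_matching: drop key/value pairs
# whose value matches the given pattern. Row data is hypothetical.
row     = { price: "$0.00", name: "Widget", status: "#VALUE!" }
pattern = /\A(\$0\.0+|#VALUE!)\z/
cleaned = row.reject { |_key, value| value =~ pattern }
# cleaned == { name: "Widget" }
```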

## Empty Hashes

It can happen that, after all transformations, a row of the CSV file produces a completely empty hash.

By default, SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.

This can be set to `false` to keep the empty hashes in the results.
