Commit

WP-12916 handle encoding detector returning confidence of None (#40)
* fix universal detector returning confidence of None

* remove duplicate lines for encoding detection
ChrisLing1 authored Nov 14, 2022
1 parent fbf3b5f commit d14345b
Showing 2 changed files with 9 additions and 4 deletions.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "tap-s3-csv"
-version = "1.4.6"
+version = "1.4.7"
description = "Singer.io tap for extracting CSV files from S3"
authors = ["Stitch"]
classifiers = ["Programming Language :: Python :: 3 :: Only"]
11 changes: 8 additions & 3 deletions tap_s3_csv/dialect.py
@@ -104,6 +104,9 @@ def detect_dialect(config, s3_file, table):
interesting.append(i)
interesting.sort()

+# get rid of repeating lines that may skew detector
+interesting = list(set(interesting))

# feed selected lines to universal detector
detector = chardet.UniversalDetector()
detector.MINIMUM_THRESHOLD = 0.70
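
Note on the deduplication step above: `list(set(...))` removes repeats but does not keep the order produced by the preceding `interesting.sort()`. Whether the detector is sensitive to line order is not stated in this commit, so the following is only a sketch of an order-preserving alternative, not the code that was merged. The helper name `dedupe_preserving_order` and the sample byte strings are hypothetical.

```python
def dedupe_preserving_order(lines):
    """Drop repeated items while keeping the first-seen order.

    dict keys are unique and, since Python 3.7, keep insertion order,
    so dict.fromkeys() acts as an ordered set for hashable items.
    """
    return list(dict.fromkeys(lines))


# Hypothetical sample standing in for the `interesting` lines.
sample = [b"c,d", b"a,b", b"a,b", b"c,d", b"e,f"]
deduped = dedupe_preserving_order(sorted(sample))
print(deduped)
```

Because the input is sorted first, the result stays sorted after duplicates are dropped, which a plain `set` round trip does not guarantee.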
@@ -126,9 +129,11 @@ def detect_dialect(config, s3_file, table):
encoding = detector_results.get('encoding', 'utf-8')
confidence = detector_results.get('confidence', 1.0)

-# 1. ignore detector if confidence was low
-# 2. utf-8 is backwards compatible with ascii and supports more characters
-if confidence < .70 or encoding == 'ascii':
+# 1. cchardet confidence can sometimes have a value of None (WP-12916 not sure exact cause)
+# 2. ignore detector if confidence was low
+# 3. just in case if encoding is None, we default to utf-8
+# 4. utf-8 is backwards compatible with ascii and supports more characters
+if confidence is None or confidence < .70 or encoding is None or encoding == 'ascii':
encoding = 'utf-8'

table['encoding'] = encoding
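
Why the extra `None` checks matter: `dict.get(key, default)` only returns the default when the key is missing, so a result such as `{'encoding': None, 'confidence': None}` passes `None` through, and in Python 3 the comparison `None < .70` raises `TypeError`. The sketch below isolates the fallback logic from this hunk in a standalone function so it can be exercised without S3 or the detector. The function name `choose_encoding` and its `threshold` and `default` parameters are hypothetical and are not part of `tap_s3_csv`.

```python
def choose_encoding(detector_results, threshold=0.70, default="utf-8"):
    """Pick an encoding from a detector result dict, falling back to `default`.

    Mirrors the condition in the hunk above; `detector_results` is assumed
    to be a dict with optional 'encoding' and 'confidence' keys.
    """
    encoding = detector_results.get("encoding", default)
    confidence = detector_results.get("confidence", 1.0)

    # Check for None before comparing, because `None < 0.70` raises TypeError.
    if (
        confidence is None
        or confidence < threshold
        or encoding is None
        or encoding == "ascii"
    ):
        return default
    return encoding


# A detector result with no usable values falls back to the default.
print(choose_encoding({"encoding": None, "confidence": None}))
# A confident, non-ASCII detection is kept as reported.
print(choose_encoding({"encoding": "ISO-8859-1", "confidence": 0.9}))
```

Checking `confidence is None` first relies on short-circuit evaluation of `or`, so the numeric comparison is never reached for a `None` value.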
