CSV sniffing falsely detects space as a delimiter #88843

Open
ptokarski mannequin opened this issue Jul 19, 2021 · 4 comments
Assignees
Labels
3.8 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ptokarski
Mannequin

ptokarski mannequin commented Jul 19, 2021

BPO 44677
Nosy @rhettinger, @ptokarski
PRs
  • Do not allow to parse line breaks between quotes in sniffer regexes. #27256

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-07-19.17:40:33.684>
    labels = ['3.8', 'type-bug', 'library']
    title = 'CSV sniffing falsely detects space as a delimiter'
    updated_at = <Date 2021-07-22.02:15:15.339>
    user = 'https://github.com/ptokarski'

    bugs.python.org fields:

    activity = <Date 2021-07-22.02:15:15.339>
    actor = 'rhettinger'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2021-07-19.17:40:33.684>
    creator = 'pt12lol'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44677
    keywords = ['patch']
    message_count = 4.0
    messages = ['397821', '397859', '397860', '397974']
    nosy_count = 3.0
    nosy_names = ['rhettinger', 'python-dev', 'pt12lol']
    pr_nums = ['27256']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue44677'
    versions = ['Python 3.8']

    @ptokarski
    Mannequin Author

    ptokarski mannequin commented Jul 19, 2021

    Consider the following CSV content: "a|b\nc| 'd\ne|' f". The real delimiter in this case is the '|' character, but ' ' is sniffed instead. A verbose example is attached.

    The problem lies in csv.py, in the following code:

            matches = []
            for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",
                          r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',   #  ".*?",
                          r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',   # ,".*?"
                          r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
                regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
                matches = regexp.findall(data)
                if matches:
                    break
    

    Because of re.DOTALL, .*? can span the line break between the two apostrophes. This makes matches non-empty, so further processing continues with the delimiter falsely set to ' '.
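
A minimal sketch of the failure, using only the first pattern from the excerpt above and the `re` module directly (the names `sample` and `first_pattern` are chosen here for illustration and do not appear in csv.py). The apostrophes on the second and third lines are treated as one quoted field, and the space before and after it is taken as the delimiter:

```python
import re

sample = "a|b\nc| 'd\ne|' f"

# First pattern from the excerpt above:
# delimiter, optional space, quoted text, then the same delimiter again.
first_pattern = r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)'

regexp = re.compile(first_pattern, re.DOTALL | re.MULTILINE)
matches = regexp.findall(sample)

# Each tuple holds the groups in order: (delim, space, quote).
print(matches)  # [(' ', '', "'")]
```

The single match has a space as `delim` and an apostrophe as `quote`, which matches the dialect reported in the actual output.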

    @ptokarski ptokarski mannequin added 3.8 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 19, 2021
    @ptokarski
    Mannequin Author

    ptokarski mannequin commented Jul 20, 2021

    Test sample:

    import csv
    from io import StringIO
    
    
    def csv_text():
        return StringIO("a|b\nc| 'd\ne|' f")
    
    
    with csv_text() as input_file:
        print('The following text is going to be parsed:')
        print(input_file.read())
        print()
    
    
    with csv_text() as input_file:
        dialect_params = [
            'delimiter',
            'quotechar',
            'escapechar',
            'lineterminator',
            'quoting',
            'doublequote',
            'skipinitialspace'
        ]
        dialect = csv.Sniffer().sniff(input_file.read())
        print('The following dialect has been detected:')
        for dialect_param in dialect_params:
            print(f'- {dialect_param}: {repr(getattr(dialect, dialect_param))}')
        print()
    
    
    with csv_text() as input_file:
        print('Parsed csv text:')
        for entry in csv.reader(input_file, dialect=dialect):
            print(f'- {entry}')
        print()
    

    Actual output:

    The following text is going to be parsed:
    a|b
    c| 'd
    e|' f
    
    The following dialect has been detected:
    - delimiter: ' '
    - quotechar: "'"
    - escapechar: None
    - lineterminator: '\r\n'
    - quoting: 0
    - doublequote: False
    - skipinitialspace: False
    
    Parsed csv text:
    - ['a|b']
    - ['c|', 'd\ne|', 'f']
    
    

    Expected output:

    The following text is going to be parsed:
    a|b
    c| 'd
    e|' f
    
    The following dialect has been detected:
    - delimiter: '|'
    - quotechar: '"'
    - escapechar: None
    - lineterminator: '\r\n'
    - quoting: 0
    - doublequote: False
    - skipinitialspace: False
    
    Parsed csv text:
    - ['a', 'b']
    - ['c', " 'd"]
    - ['e', "' f"]
    
    

    @ptokarski
    Mannequin Author

    ptokarski mannequin commented Jul 20, 2021

    I think changing (?P<quote>["\']).*?(?P=quote) to (?P<quote>["\'])[^\n]*?(?P=quote) in all regexes does the trick, doesn't it?
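
One way to check this suggestion against the sample from this issue, as a sketch and not the actual patch in #27256: apply the substitution to the four patterns quoted earlier and run them with the same flags. The names `patterns`, `fixed_patterns`, and `results` are illustrative only.

```python
import re

sample = "a|b\nc| 'd\ne|' f"

# The four patterns as quoted from csv.py earlier in this issue.
patterns = [
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
]

# Proposed change: do not let the quoted part cross a line break.
fixed_patterns = [p.replace(r'.*?', r'[^\n]*?') for p in patterns]

flags = re.DOTALL | re.MULTILINE
results = [re.compile(p, flags).findall(sample) for p in fixed_patterns]

print(results)  # [[], [], [], []]
```

With no quoted-field match, the space is no longer picked up by these patterns for this sample. What the rest of the sniffer then returns is not checked here. The trade-off is that quoted fields that legitimately contain line breaks would no longer be matched by these patterns either.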

    @rhettinger
    Contributor

    Changing the sniffer logic is risky because it could break existing code that relies on the current predictions.

    FWIW, in your example, the sniffer gets the desired result if given a delimiter hint:

    >>> import pprint
    >>> from csv import Sniffer
    >>> s = "a|b\nc| 'd\ne|' f"
    >>> pprint.pp(dict(vars(Sniffer().sniff(s, '|'))))
    {'__module__': 'csv',
     '_name': 'sniffed',
     'lineterminator': '\r\n',
     'quoting': 0,
     '__doc__': None,
     'doublequote': False,
     'delimiter': '|',
     'quotechar': "'",
     'skipinitialspace': False}

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 9, 2024