Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add "global" option to regexp_replace #4331

Open
prrao87 opened this issue Oct 1, 2024 · 1 comment · May be fixed by #4375
Open

Feature: Add "global" option to regexp_replace #4331

prrao87 opened this issue Oct 1, 2024 · 1 comment · May be fixed by #4375
Assignees
Labels
feature New features or missing components of existing features

Comments

@prrao87
Copy link
Member

prrao87 commented Oct 1, 2024

API

Other

Description

We have a function called regexp_replace that replaces only the first match of a string based on a regex.

However, we should also provide an option to replace all occurrences of a match, via a function regexp_replace_all. Example below.

I have this JSON file persons.json:

[
  {
    "name": "Xavier Williams",
    "date_of_birth": "1989-06-28",
    "residence": {
      "address": "901 Oak Lane",
      "city": "San Francisco",
      "state": "CA",
      "zip_code": "94122"
    }
  }
]

I want to generate a unique string primary key that's a concatenation of all the address fields, that I can then use for relationship creation downstream. During this process, I want to remove all whitespaces.

I'm doing this right now:

LOAD FROM 'persons.json'
WITH DISTINCT residence AS r
RETURN
    concat(
        regexp_replace(lower(r.address), ' ', ''),
        regexp_replace(lower(r.city), ' ', ''),
        regexp_replace(lower(r.state), ' ', ''),
        r.zip_code
    ) AS id,
    r.address,
    r.city,
    r.state,
    r.zip_code

Because only the first occurrence is matched, I get this result for the id field:

901oak lanesanfranciscoca94122

There's an extra space in the middle that wasn't removed.

If we had a regexp_replace_all function, this would be resolved.

@prrao87 prrao87 added the feature New features or missing components of existing features label Oct 1, 2024
@prrao87
Copy link
Member Author

prrao87 commented Oct 16, 2024

@acquamarin DuckDB has a "global" option available via the 'g' keyword as per their regexp_replace function. Implementing this in our case will solve the issue, and this also fixes #4332 simultaneously.

This is how it would be used:

D select regexp_replace('901  Oak    Lane', '\s', '', 'g');
┌───────────────────────────────────────────────────┐
│ regexp_replace('901  Oak    Lane', '\s', '', 'g') │
│                      varchar                      │
├───────────────────────────────────────────────────┤
│ 901OakLane                                        │
└───────────────────────────────────────────────────┘

Once this is implemented, I can add some docs describing this.

@prrao87 prrao87 changed the title Feature: Add regexp_replace_all function Feature: Add "global" option to regexp_replace Oct 16, 2024
@acquamarin acquamarin linked a pull request Oct 16, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New features or missing components of existing features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants