Add regex replace custom function #270

Open: JamezQ wants to merge 1 commit into base: master

JamezQ commented Aug 12, 2021

This feature lets q derive a new column from a regex substitution on another column, and even run a GROUP BY on the derived column.

Example: Find the disk usage by extension for the current directory:

du -sk *|grep "\."|q -b -t 'SELECT resub(c2,".*\.","") as ex,SUM(c1) FROM - GROUP BY ex ORDER BY SUM(c1)'
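For readers who want to see the mechanism, here is a minimal sketch of how a function like resub could be exposed to SQL through Python's sqlite3 module. It is not the code from this PR: the function body, the table name `du`, and the sample rows are assumptions made for illustration, and only the argument order (value, pattern, replacement) is taken from the example above.

```python
import re
import sqlite3


def resub(value, pattern, replacement):
    """Hypothetical resub(value, pattern, replacement): regex substitution on one cell."""
    if value is None:
        return None
    return re.sub(pattern, replacement, str(value))


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it with three arguments.
conn.create_function("resub", 3, resub)

# Made-up stand-in for the `du -sk *` output: size in KB, file name.
conn.execute("CREATE TABLE du (c1 INTEGER, c2 TEXT)")
conn.executemany(
    "INSERT INTO du VALUES (?, ?)",
    [(4, "setup.py"), (8, "README.md"), (12, "notes.md"), (16, "archive.tar.gz")],
)

# Same shape as the q query: strip everything up to the last dot, then group.
rows = conn.execute(
    r"SELECT resub(c2, '.*\.', '') AS ex, SUM(c1) FROM du GROUP BY ex ORDER BY SUM(c1)"
).fetchall()
print(rows)
```

Because `.*` is greedy, `archive.tar.gz` is grouped under `gz`, not `tar.gz`.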

This compares well with the other answers at https://unix.stackexchange.com/questions/308846/how-to-find-total-filesize-grouped-by-extension (slightly modified here to look only in the current directory):

find . -type f -depth 1 | egrep -o "\.[a-zA-Z0-9]+$" | sort -u|xargs -I '%' find . -type f -name "*%" -depth 1 -exec du -ch {} + -exec echo % \;|egrep "^\.[a-zA-Z0-9]+$|total$"|uniq|paste - -

find . -name '?*.*' -type f -depth 1 -print0 |
  perl -0ne '
    if (@s = stat$_){
      ($ext = $_) =~ s/.*\.//s;
      $s{$ext} += $s[12];
      $n{$ext}++;
    }
    END {
      for (sort{$s{$a} <=> $s{$b}} keys %s) {
        printf "%15d %4d %s\n",  $s{$_}<<9, $n{$_}, $_;
      }
    }'

Of course, it is possible to use q without this feature quite succinctly as well:

du -sk *|grep "\."|sed 's/\t.*\.\(.*\)$/ \1/g'|q -b 'SELECT c2,SUM(c1) FROM - GROUP BY c2 ORDER BY SUM(c1)'

However, as the extraction becomes more complex, for example when several capture groups must be pulled from the same column, I believe resub becomes much more useful.

As another example, I have already used this modified q to analyze a log file containing user agents and to create a summary of the most common OS/device combinations (simply group by both columns and count). This involved two extractions from the same field to get the OS and the device, using the capture-group feature of re.sub.
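The double extraction described above can be sketched in plain Python as follows. The user-agent strings and the pattern are simplified, hypothetical examples (real user agents vary far more), and `resub` here is just a thin stand-in around re.sub, not the function from this PR.

```python
import re
from collections import Counter

# Hypothetical, simplified user-agent strings.
agents = [
    "Mozilla/5.0 (Linux; Android 11; Pixel 5) AppleWebKit/537.36",
    "Mozilla/5.0 (Linux; Android 11; Pixel 5) AppleWebKit/537.36",
    "Mozilla/5.0 (Linux; Android 10; SM-G973F) AppleWebKit/537.36",
]

# One pattern, two capture groups: group 1 is the OS, group 2 is the device.
UA = r".*\(Linux; (Android \d+); ([^)]+)\).*"


def resub(value, pattern, replacement):
    """Stand-in for the SQL function: plain re.sub with backreferences."""
    return re.sub(pattern, replacement, value)


# Equivalent of: SELECT resub(c1, UA, '\1'), resub(c1, UA, '\2'), COUNT(*) ... GROUP BY 1, 2
counts = Counter((resub(a, UA, r"\1"), resub(a, UA, r"\2")) for a in agents)
print(counts.most_common())
```

The same pattern is applied twice with a different backreference each time, which is what makes a single SQL-level function sufficient for both columns.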

Finally, resub gets to reuse q's existing row and column parsing, which is hard to replicate with sed (for example, commas inside quoted fields).
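To illustrate the quoted-field problem, the sketch below compares a naive comma split with a real CSV parser. The sample line is made up, and Python's csv module stands in for whatever parsing q does internally.

```python
import csv
import io
import re

# A made-up CSV row whose second field contains a comma inside quotes.
line = '42,"report, final.pdf",2021-08-12\n'

# Naive splitting (what a sed/cut pipeline effectively does) breaks the field apart.
naive = line.strip().split(",")

# A CSV-aware parser keeps the quoted field intact.
parsed = next(csv.reader(io.StringIO(line)))

# The regex then runs on a clean column value.
ext = re.sub(r".*\.", "", parsed[1])

print(naive)
print(parsed)
print(ext)
```

With the naive split the row has four pieces instead of three, so any column-based regex afterwards would operate on the wrong text.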

JamezQ (Author) commented Aug 12, 2021

This passes make test.

harelba (Owner) commented Oct 22, 2021

Hi,

This is a good addition, although I'm aiming to add a full set of regexp functions, including optimizing the pattern compilation. Using re.sub or any other re method directly compiles the pattern on every call to the function, which is extremely slow.

Regarding filename handling, I believe it's important enough to have its own functions. Regexp parsing of filenames can be error-prone and lead to inconsistencies, for example when a filename has no extension.
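The inconsistency is easy to reproduce. In the sketch below, the regex from the original example is compared with `os.path.splitext`. The `file_ext` helper is a stand-in written for this illustration and is not necessarily how q implements its function of the same name.

```python
import os.path
import re


def regex_ext(name):
    """Extension via the regex from the PR example: drop everything up to the last dot."""
    return re.sub(r".*\.", "", name)


def file_ext(name):
    """Illustrative stand-in: extension (with leading dot) via os.path.splitext."""
    return os.path.splitext(name)[1]


names = ["setup.py", "README", "archive.tar.gz", ".bashrc"]
for name in names:
    print(f"{name!r:18} regex={regex_ext(name)!r:12} splitext={file_ext(name)!r}")
```

With no dot in the name, the regex leaves the value unchanged, so `README` ends up as its own "extension" group, whereas `splitext` returns an empty string. A dotfile such as `.bashrc` is also treated differently by the two approaches.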

Now that q 3.1.0-beta is merged into master, it's possible to create new PRs. I've created a PR (#282) for adding multiple requested user functions, and added several filename-parsing functions. Achieving the per-extension total sizes would then look something like this:

$ du -ks * | q -b -t "select file_ext(c2),sum(c1) from - group by 1 order by 2"
.bzl     	4
.ini     	4
.markdown	8
.py      	8
.sh      	12
.txt     	12
.egg-info	24
.md      	28
.alal    	29948
.qsql    	44068
         	288636

I'll add the set of regexp functions on that PR as well.
