-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sub and gsub when the second argument is a stream #89
Comments
Did a bit of trawling through the git histories because I wanted to see if
The specific lines: def sub(re; s):
. as $in
| [match(re)]
| .[0]
| . as $r
# create the \"capture\" object:
| reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
({}; . + $pair)
| if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]
else (. | s)
end ; with the description:
Other (minor) updateslength check bug: jqlang/jq#586 diff --git a/builtin.c b/builtin.c
index 6f66894..7ca0f81 100644
--- a/builtin.c
+++ b/builtin.c
@@ -1051,17 +1051,19 @@ static const char* const jq_builtins[] = {
"def sub($re; s):"
" . as $in"
" | [match($re)]"
- " | .[0]"
- " | . as $r"
- // # create the \"capture\" object:
- " | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair"
- " ({}; . + $pair)"
- " | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]"
- " else (. | s)"
+ " | if length == 0 then $in"
+ " else .[0]"
+ " | . as $r"
+ // # create the \"capture\" object:
+ " | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair"
+ " ({}; . + $pair)"
+ " | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]"
+ " else (. | s)"
+ " end"
" end ;", Substitution bug: jqlang/jq#600 diff --git a/builtin.c b/builtin.c
index ddd207b..0629ca9 100644
--- a/builtin.c
+++ b/builtin.c
@@ -1057,9 +1057,7 @@ static const char* const jq_builtins[] = {
// # create the \"capture\" object:
" | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair"
" ({}; . + $pair)"
- " | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]"
- " else (. | s)"
- " end"
+ " | $in[0:$r.offset] + s + $in[$r.offset+$r.length:]"
" end ;",
//
// repeated substitution of re (which may contain named captures) Meanwhile sub was originally implemented in jaq in 491b332 def sub(re; f; flags): . as $s |
reduce captures(re; flags)[], null as $m
( { i: 0, s: "" };
if $m == null
then .s += $s[.i:]
else .s += $s[.i:$m[0].offset] + ($m | capture_of_match | f) |
.i = ($m[0] | .offset + .length)
end
) | .s; Split into separate functionsmake_builtin!("split_matches", 2, |r, f| Self::Matches(r, f, (true, true))),
...
Self::Matches(re, flags, sm) => {
Box::new(Self::cartesian(flags, re, (cv.0, cv.1.clone())).map(
move |(flags, re)| Ok(Val::Arr(cv.1.captures(&re?, &flags?, *sm)?.into())),
))
}
...
pub fn captures(&self, re: &Self, flags: &Self, sm: (bool, bool)) -> Result<Vec<Val>, Error> {
let s = self.as_str()?;
let flags = Flags::new(flags.as_str()?).map_err(Error::RegexFlag)?;
let re = flags.regex(re.as_str()?).unwrap();
let (split, matches) = sm;
let mut last_byte = 0;
let mut bc = ByteChar::new(s);
let captures = re.captures_iter(s);
let len = re.captures_len();
let cap = if split { len + 1 } else { 0 } + if matches { len } else { 0 };
let mut out = Vec::with_capacity(cap);
for c in captures {
let whole = c.get(0).unwrap();
if whole.start() >= s.len() || (flags.n && whole.as_str().is_empty()) {
continue;
}
let vs = c
.iter()
.zip(re.capture_names())
.filter_map(|(match_, name)| Some(capture(&mut bc, match_?, name)));
if split {
out.push(Val::Str(s[last_byte..whole.start()].to_string().into()));
last_byte = whole.end();
}
if matches {
out.push(Val::Arr(vs.collect::<Vec<_>>().into()));
}
if !flags.g {
break;
}
}
if split {
out.push(Val::Str(s[last_byte..].to_string().into()));
}
Ok(out)
} def sub(re; f; flags): reduce split_matches(re; flags)[] as $x
(""; . + if $x | isstring then $x else $x | capture_of_match | f end); |
Ah, I totally missed your recent addition to jq: jqlang/jq#2641 That (and the linked issues/comments) def helps clarify things |
@baud-rate - Currently (July 7), both jq and jqMaster give the same results as jaq. |
Yes, it just wasn't clear to me the significance of how "regular" the function was, until I read the discussion in this issue: jqlang/jq#513 Although I think I'm still missing a few logical steps from reaching the conclusion that the regularity you described is "clearly intended". These are the cases described in the documentation: "abc" | sub("b"; "x")
"axc"
"abc" | sub("(?<x>b)c"; "\(.x)Y")
"abY" Naively, there is no situation I can think of where it's useful for |
@baod-rate - On the question of clarity: In the case of a match, the def of sub (whether in jq or gojq) makes it clear that (barring errors), sub(_; $x1, $x2, ..., $xn) should emit n values. gojq's author emphasized that to me too. What's not clear (at least to me) is whether there is any reason why On the question of utility - I cannot really say much except that the documentation should be clear :-)
|
I would say that this issue scratches the top of the iceberg, because the behaviour of multiple values for regex filters is quite strange even within jq itself. Observe:
Here, when we pass multiple values as regex and flags, we get the same as binding (1) regex and (2) flags to variables, then evaluate the filter. Fair enough. Let's try
What? Doing the same thing as above, we get different results. Why? This is of course utterly confusing --- however, this probably got unnoticed, because I suppose that few people actually tried to venture into the land of horror where regex filters and multiple output values are joined in unholy union. I would propose to fix an evaluation order for all regex filters ( It would be best if jq/gojq could be made to do the same. What do you think? |
When I'm already at it: The next source of confusion is implicit global search.
This is, as far as I can see, also not documented. jaq currently implements most filters without If I could break backwards compatibility in jq, then I would disable |
Oh, I just realised by @pkoppstein's comment (#75 (comment)) that the So let me ask you: What should be the output of:
I opted for 1) because it is very simple to implement and does not incur overhead if |
|
Yes, thanks for spotting this!
That's also what I thought first, but then your example (#75 (comment)) does not work anymore: Consider the version before and after the transformation that you propose:
|
My apologies. My attempt to formalize the main idea was wrong. But there is not supposed to be anything magical about gsub. Let me try again. Given gsub($regex; s1, s2, …), the result should be the sequence gsub($regex; si) as i ranges from 1 on. |
This might work if There's nothing magical about |
Consider the discrepancy between the number of results produced by these two invocations of jaq:
jaq -n '"x"|sub("x"; "a", "b")'
"a"
"b"
whereas:
jaq -n '"x"|sub("y"; "a", "b")'
"x"
In the language of "regularity" (*), we can say that the sub and gsub
filters are regular in the second argument whenever a match is made.
Given that this is clearly intended, it is apparent that the behavior for
non-matches is at best anomalous, and at worst a bug.
Unfortunately, the waters are muddied here somewhat by the fact
that currently this anomaly can be found in jq and gojq as well.
What should the resolution of this anomaly be?
(*) If
E
is an expression,f
a jq filter of arity 1,A
a non-emptyarray of JSON values, then f is said to be regular in the first
argument if, for all E and A:
[E | f(A[])] == [E | A[] as $a | f($a)]
The concept generalizes to filters of any arity greater than 0, and to an position.
The text was updated successfully, but these errors were encountered: