
Chunking

Yasuhiro Yamada edited this page Feb 27, 2023 · 2 revisions

What is a "chunk"?

A "chunk" is a contiguous piece of standard input. teip divides standard input into multiple chunks: some fall outside the holes in the masking tape and some fall inside them. Only the chunks inside the holes are bypassed to the targeted command.

How does chunking work?

A chunk that does not match the pattern is printed to standard output as is. A matched chunk, on the other hand, is passed to the standard input of the targeted command and is then replaced with that command's result.

In the next example, the standard input is divided into four chunks as follows.

$ echo ABC100EFG200 | teip -og '\d+' -- sed 's/.*/@@@/g'
ABC => Chunk(1)
100 => Chunk(2) -- Matched
EFG => Chunk(3)
200 => Chunk(4) -- Matched
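The same split can be approximated with standard tools, without teip. The sketch below is only an emulation: it assumes `grep -oE` with the POSIX class `[0-9]` as a stand-in for `-og '\d+'`, and the variable names are made up for illustration.

```shell
# Emulation only: print every chunk of the input on its own line.
# '[0-9]+' stands in for the matched chunks, '[^0-9]+' for the unmatched ones.
input='ABC100EFG200'
chunks=$(printf '%s\n' "$input" | grep -oE '[0-9]+|[^0-9]+')
printf '%s\n' "$chunks"
```

This prints four lines (ABC, 100, EFG, 200), one per chunk, in input order.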

By default, the matched chunks are joined with line breaks and used as the new standard input for the targeted command. Imagine that teip executes the following command internally.

$ printf "100\n200\n" | sed 's/.*/@@@/g'
@@@ # => Result of Chunk(2)
@@@ # => Result of Chunk(4)

(This is not technically accurate, but it shows why $1 is used rather than $3 in one of the examples in "Getting Started".)
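To make the $1 point concrete, here is a small sketch with a hypothetical input line (it is not the example from "Getting Started"). It assumes `cut` as a stand-in for selecting the third field; the targeted command then sees only that chunk on a line of its own, so the value sits in its first field.

```shell
# Hypothetical input: the third comma-separated field is the selected chunk.
line='AA,BB,30'
# Stand-in for the selection step: extract only the third field.
field3=$(printf '%s\n' "$line" | cut -d, -f3)
# The command receives a line containing just "30", so it reads $1, not $3.
doubled=$(printf '%s\n' "$field3" | awk '{ print $1 * 2 }')
printf '%s\n' "$doubled"
```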

Next, each matched chunk is replaced with the corresponding line of the command's output.

ABC => Chunk(1)
@@@ => Chunk(2) -- Replaced
EFG => Chunk(3)
@@@ => Chunk(4) -- Replaced

Finally, all the chunks are concatenated and the following result is printed.

ABC@@@EFG@@@
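The whole round trip can be sketched in plain shell. This is a rough emulation under stated assumptions, not how teip is implemented: `grep -oE` with `[0-9]+` stands in for `-og '\d+'`, all variable names are hypothetical, and the sketch runs sequentially.

```shell
input='ABC100EFG200'

# 1. Split the input into chunks, one per line (matched and unmatched).
chunks=$(printf '%s\n' "$input" | grep -oE '[0-9]+|[^0-9]+')

# 2. Feed only the matched chunks, joined by newlines, to the command.
replies=$(printf '%s\n' "$input" | grep -oE '[0-9]+' | sed 's/.*/@@@/g')

# 3. Walk the chunks in order; each matched chunk takes the next reply line.
set -- $replies            # one positional parameter per reply line
result=''
while IFS= read -r chunk; do
  case $chunk in
    [0-9]*) result="$result$1"; shift ;;   # matched: use the command's output
    *)      result="$result$chunk" ;;      # unmatched: keep as is
  esac
done <<EOF
$chunks
EOF

printf '%s\n' "$result"
```

The unquoted `$replies` relies on the reply lines containing no spaces or glob characters, which holds for this input.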

In practice, the above process runs asynchronously: chunks are printed sequentially as they become available.

Here is another example. It selects numbers and replaces them with the string "@@@". However, the result is longer than you might expect: 100 is replaced with "@@@@@@@@@".

$ echo ABC100EFG200 | teip -og '\d' -- sed 's/.*/@@@/g'
ABC@@@@@@@@@EFG@@@@@@@@@

So many @ are printed in the example above because the input is broken up into many chunks, as the following shows.

$ echo ABC100EFG200 | teip -og '\d'
ABC[1][0][0]EFG[2][0][0]

teip treats each match of the entire regular expression as a single chunk. \d matches a single digit, so every digit becomes its own chunk.

ABC => Chunk(1)
1   => Chunk(2) -- Matched
0   => Chunk(3) -- Matched
0   => Chunk(4) -- Matched
EFG => Chunk(5)
2   => Chunk(6) -- Matched
0   => Chunk(7) -- Matched
0   => Chunk(8) -- Matched

Therefore, sed receives many lines, one per matched chunk.

$ printf "1\n0\n0\n2\n0\n0\n" | sed 's/.*/@@@/g'
@@@ # => Result of Chunk(2)
@@@ # => Result of Chunk(3)
@@@ # => Result of Chunk(4)
@@@ # => Result of Chunk(6)
@@@ # => Result of Chunk(7)
@@@ # => Result of Chunk(8)

The final chunks look like this.

ABC => Chunk(1)
@@@ => Chunk(2) -- Replaced
@@@ => Chunk(3) -- Replaced
@@@ => Chunk(4) -- Replaced
EFG => Chunk(5)
@@@ => Chunk(6) -- Replaced
@@@ => Chunk(7) -- Replaced
@@@ => Chunk(8) -- Replaced

And, here is the final result.

ABC@@@@@@@@@EFG@@@@@@@@@
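The difference between the two patterns comes down to how many matches each one produces. The following sketch counts them with standard tools; it assumes `grep -oE` with `[0-9]` and `[0-9]+` as stand-ins for `\d` and `\d+`, and the variable names are illustrative only.

```shell
input='ABC100EFG200'

# One match per digit: each digit would become its own chunk.
single=$(( $(printf '%s\n' "$input" | grep -oE '[0-9]' | wc -l) ))

# One match per run of digits: each number would be a single chunk.
greedy=$(( $(printf '%s\n' "$input" | grep -oE '[0-9]+' | wc -l) ))

printf 'single=%d greedy=%d\n' "$single" "$greedy"
```

Six matches mean six lines sent to sed and six "@@@" replacements, while two matches mean only two.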

The concept of chunking is also used for other options. For example, if you use -f to specify a range of A-B, each field will be a separate chunk. Also, the field delimiter is always an unmatched chunk.

$ echo "AA,BB,CC" | teip -f 2-3 -d,
AA,[BB],[CC]
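The field-wise chunking can be imitated with awk. This is only a sketch of the idea, assuming a comma delimiter and the range 2-3 from the example above; it is not teip's implementation.

```shell
line='AA,BB,CC'
# Bracket fields 2 through 3 individually; the delimiters stay outside the brackets.
marked=$(printf '%s\n' "$line" |
  awk -F, -v OFS=, '{ for (i = 2; i <= 3 && i <= NF; i++) $i = "[" $i "]"; print }')
printf '%s\n' "$marked"
```

Each selected field gets its own pair of brackets, and the commas between them remain unmatched.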

With the -c option, adjacent characters are treated as the same chunk even if their positions are listed separately with , in the option argument.

$ echo "ABCDEFGHI" | teip -c1,2,3,7-9
[ABC]DEF[GHI]
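The merging of adjacent positions can be sketched with awk as well. The position list below is the expanded form of 1,2,3,7-9, written out by hand; the script and its variable names are assumptions for illustration and do not reflect teip's internals.

```shell
line='ABCDEFGHI'
marked=$(printf '%s\n' "$line" | awk '
BEGIN {
  # Selected character positions (1,2,3,7-9 expanded by hand).
  n = split("1 2 3 7 8 9", p, " ")
  for (k = 1; k <= n; k++) sel[p[k]] = 1
}
{
  out = ""; inside = 0
  for (i = 1; i <= length($0); i++) {
    if ((i in sel) && !inside)  { out = out "["; inside = 1 }  # a run starts
    if (!(i in sel) && inside)  { out = out "]"; inside = 0 }  # a run ends
    out = out substr($0, i, 1)
  }
  if (inside) out = out "]"   # close a run that reaches the end of the line
  print out
}')
printf '%s\n' "$marked"
```

Positions 1, 2, and 3 end up in one bracketed run because they are adjacent, regardless of how they were listed.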