-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
/
strings.md
1200 lines (914 loc) · 41.5 KB
/
strings.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# [Strings](@id man-strings)
Strings are finite sequences of characters. Of course, the real trouble comes when one asks what
a character is. The characters that English speakers are familiar with are the letters `A`, `B`,
`C`, etc., together with numerals and common punctuation symbols. These characters are standardized
together with a mapping to integer values between 0 and 127 by the [ASCII](https://en.wikipedia.org/wiki/ASCII)
standard. There are, of course, many other characters used in non-English languages, including
variants of the ASCII characters with accents and other modifications, related scripts such as
Cyrillic and Greek, and scripts completely unrelated to ASCII and English, including Arabic, Chinese,
Hebrew, Hindi, Japanese, and Korean. The [Unicode](https://en.wikipedia.org/wiki/Unicode) standard
tackles the complexities of what exactly a character is, and is generally accepted as the definitive
standard addressing this problem. Depending on your needs, you can either ignore these complexities
entirely and just pretend that only ASCII characters exist, or you can write code that can handle
any of the characters or encodings that one may encounter when handling non-ASCII text. Julia
makes dealing with plain ASCII text simple and efficient, and handling Unicode is as simple and
efficient as possible. In particular, you can write C-style string code to process ASCII strings,
and they will work as expected, both in terms of performance and semantics. If such code encounters
non-ASCII text, it will gracefully fail with a clear error message, rather than silently introducing
corrupt results. When this happens, modifying the code to handle non-ASCII data is straightforward.
There are a few noteworthy high-level features about Julia's strings:
* The built-in concrete type used for strings (and string literals) in Julia is [`String`](@ref).
This supports the full range of [Unicode](https://en.wikipedia.org/wiki/Unicode) characters via
the [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. (A [`transcode`](@ref) function is
provided to convert to/from other Unicode encodings.)
* All string types are subtypes of the abstract type `AbstractString`, and external packages define
additional `AbstractString` subtypes (e.g. for other encodings). If you define a function expecting
a string argument, you should declare the type as `AbstractString` in order to accept any string
type.
* Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing
a single character, called [`AbstractChar`](@ref). The built-in [`Char`](@ref) subtype of `AbstractChar`
is a 32-bit primitive type that can represent any Unicode character (and which is based
on the UTF-8 encoding).
* As in Java, strings are immutable: the value of an `AbstractString` object cannot be changed.
To construct a different string value, you construct a new string from parts of other strings.
* Conceptually, a string is a *partial function* from indices to characters: for some index values,
no character value is returned, and instead an exception is thrown. This allows for efficient
indexing into strings by the byte index of an encoded representation rather than by a character
index, which cannot be implemented both efficiently and simply for variable-width encodings of
Unicode strings.
## [Characters](@id man-characters)
A `Char` value represents a single character: it is just a 32-bit primitive type with a special literal
representation and appropriate arithmetic behaviors, and which can be converted
to a numeric value representing a
[Unicode code point](https://en.wikipedia.org/wiki/Code_point). (Julia packages may define
other subtypes of `AbstractChar`, e.g. to optimize operations for other
[text encodings](https://en.wikipedia.org/wiki/Character_encoding).) Here is how `Char` values are
input and shown:
```jldoctest
julia> c = 'x'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> typeof(c)
Char
```
You can easily convert a `Char` to its integer value, i.e. code point:
```jldoctest
julia> c = Int('x')
120
julia> typeof(c)
Int64
```
On 32-bit architectures, [`typeof(c)`](@ref) will be [`Int32`](@ref). You can convert an
integer value back to a `Char` just as easily:
```jldoctest
julia> Char(120)
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
```
Not all integer values are valid Unicode code points, but for performance, the `Char` conversion
does not check that every character value is valid. If you want to check that each converted value
is a valid code point, use the [`isvalid`](@ref) function:
```jldoctest
julia> Char(0x110000)
'\U110000': Unicode U+110000 (category In: Invalid, too high)
julia> isvalid(Char, 0x110000)
false
```
As of this writing, the valid Unicode code points are `U+0000` through `U+D7FF` and `U+E000` through
`U+10FFFF`. These have not all been assigned intelligible meanings yet, nor are they necessarily
interpretable by applications, but all of these values are considered to be valid Unicode characters.
You can input any Unicode character in single quotes using `\u` followed by up to four hexadecimal
digits or `\U` followed by up to eight hexadecimal digits (the longest valid value only requires
six):
```jldoctest
julia> '\u0'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)
julia> '\u78'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> '\u2200'
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> '\U10ffff'
'\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)
```
Julia uses your system's locale and language settings to determine which characters can be printed
as-is and which must be output using the generic, escaped `\u` or `\U` input forms. In addition
to these Unicode escape forms, all of [C's traditional escaped input forms](https://en.wikipedia.org/wiki/C_syntax#Backslash_escapes)
can also be used:
```jldoctest
julia> Int('\0')
0
julia> Int('\t')
9
julia> Int('\n')
10
julia> Int('\e')
27
julia> Int('\x7f')
127
julia> Int('\177')
127
```
You can do comparisons and a limited amount of arithmetic with `Char` values:
```jldoctest
julia> 'A' < 'a'
true
julia> 'A' <= 'a' <= 'Z'
false
julia> 'A' <= 'X' <= 'Z'
true
julia> 'x' - 'a'
23
julia> 'A' + 1
'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
```
## String Basics
String literals are delimited by double quotes or triple double quotes:
```jldoctest helloworldstring
julia> str = "Hello, world.\n"
"Hello, world.\n"
julia> """Contains "quote" characters"""
"Contains \"quote\" characters"
```
Long lines in strings can be broken up by preceding the newline with a backslash (`\`):
```jldoctest
julia> "This is a long \
line"
"This is a long line"
```
If you want to extract a character from a string, you index into it:
```jldoctest helloworldstring
julia> str[begin]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
julia> str[1]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
julia> str[6]
',': ASCII/Unicode U+002C (category Po: Punctuation, other)
julia> str[end]
'\n': ASCII/Unicode U+000A (category Cc: Other, control)
```
Many Julia objects, including strings, can be indexed with integers. The index of the first
element (the first character of a string) is returned by [`firstindex(str)`](@ref), and the index of the last element (character)
with [`lastindex(str)`](@ref). The keywords `begin` and `end` can be used inside an indexing
operation as shorthand for the first and last indices, respectively, along the given dimension.
String indexing, like most indexing in Julia, is 1-based: `firstindex` always returns `1` for any `AbstractString`.
As we will see below, however, `lastindex(str)` is *not* in general the same as `length(str)` for a string,
because some Unicode characters can occupy multiple "code units".
You can perform arithmetic and other operations with [`end`](@ref), just like
a normal value:
```jldoctest helloworldstring
julia> str[end-1]
'.': ASCII/Unicode U+002E (category Po: Punctuation, other)
julia> str[end÷2]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
```
Using an index less than `begin` (`1`) or greater than `end` raises an error:
```jldoctest helloworldstring
julia> str[begin-1]
ERROR: BoundsError: attempt to access 14-codeunit String at index [0]
[...]
julia> str[end+1]
ERROR: BoundsError: attempt to access 14-codeunit String at index [15]
[...]
```
You can also extract a substring using range indexing:
```jldoctest helloworldstring
julia> str[4:9]
"lo, wo"
```
Notice that the expressions `str[k]` and `str[k:k]` do not give the same result:
```jldoctest helloworldstring
julia> str[6]
',': ASCII/Unicode U+002C (category Po: Punctuation, other)
julia> str[6:6]
","
```
The former is a single character value of type `Char`, while the latter is a string value that
happens to contain only a single character. In Julia these are very different things.
Range indexing makes a copy of the selected part of the original string.
Alternatively, it is possible to create a view into a string using the type [`SubString`](@ref).
More simply, using the [`@views`](@ref) macro on a block of code converts all string slices
into substrings. For example:
```jldoctest
julia> str = "long string"
"long string"
julia> substr = SubString(str, 1, 4)
"long"
julia> typeof(substr)
SubString{String}
julia> @views typeof(str[1:4]) # @views converts slices to SubStrings
SubString{String}
```
Several standard functions like [`chop`](@ref), [`chomp`](@ref) or [`strip`](@ref)
return a [`SubString`](@ref).
## Unicode and UTF-8
Julia fully supports Unicode characters and strings. As [discussed above](@ref man-characters), in character
literals, Unicode code points can be represented using Unicode `\u` and `\U` escape sequences,
as well as all the standard C escape sequences. These can likewise be used to write string literals:
```jldoctest unicodestring
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
```
Whether these Unicode characters are displayed as escapes or shown as special characters depends
on your terminal's locale settings and its support for Unicode. String literals are encoded using
the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded
in the same number of bytes ("code units"). In UTF-8, ASCII characters — i.e. those with code points less than
0x80 (128) -- are encoded as they are in ASCII, using a single byte, while code points 0x80 and
above are encoded using multiple bytes — up to four per character.
String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that
are used to encode arbitrary characters (code points). This means that not every
index into a `String` is necessarily a valid index for a character. If you index into
a string at such an invalid byte index, an error is thrown:
```jldoctest unicodestring
julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> s[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]
julia> s[3]
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]
julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
```
In this case, the character `∀` is a three-byte character, so the indices 2 and 3 are invalid
and the next character's index is 4; this next valid index can be computed by [`nextind(s,1)`](@ref),
and the next index after that by `nextind(s,4)` and so on.
Since `end` is always the last valid index into a collection, `end-1` references an invalid
byte index if the second-to-last character is multibyte.
```jldoctest unicodestring
julia> s[end-1]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
julia> s[end-2]
ERROR: StringIndexError: invalid index [9], valid nearby indices [7]=>'∃', [10]=>' '
Stacktrace:
[...]
julia> s[prevind(s, end, 2)]
'∃': Unicode U+2203 (category Sm: Symbol, math)
```
The first case works, because the last character `y` and the space are one-byte characters,
whereas `end-2` indexes into the middle of the `∃` multibyte representation. The correct
way for this case is using `prevind(s, lastindex(s), 2)` or, if you're using that value to index
into `s` you can write `s[prevind(s, end, 2)]` and `end` expands to `lastindex(s)`.
Extraction of a substring using range indexing also expects valid byte indices or an error is thrown:
```jldoctest unicodestring
julia> s[1:1]
"∀"
julia> s[1:2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]
julia> s[1:4]
"∀ "
```
Because of variable-length encodings, the number of characters in a string (given by [`length(s)`](@ref))
is not always the same as the last index. If you iterate through the indices 1 through [`lastindex(s)`](@ref)
and index into `s`, the sequence of characters returned when errors aren't thrown is the sequence
of characters comprising the string `s`. Thus `length(s) <= lastindex(s)`,
since each character in a string must have its own index. The following is an inefficient and
verbose way to iterate through the characters of `s`:
```jldoctest unicodestring
julia> for i = firstindex(s):lastindex(s)
try
println(s[i])
catch
# ignore the index error
end
end
∀
x
∃
y
```
The blank lines actually have spaces on them. Fortunately, the above awkward idiom is unnecessary
for iterating through the characters in a string, since you can just use the string as an iterable
object, no exception handling required:
```jldoctest unicodestring
julia> for c in s
println(c)
end
∀
x
∃
y
```
If you need to obtain valid indices for a string, you can use the [`nextind`](@ref) and
[`prevind`](@ref) functions to increment/decrement to the next/previous valid index, as mentioned above.
You can also use the [`eachindex`](@ref) function to iterate over the valid character indices:
```jldoctest unicodestring
julia> collect(eachindex(s))
7-element Vector{Int64}:
1
4
5
6
7
10
11
```
To access the raw code units (bytes for UTF-8) of the encoding, you can use the [`codeunit(s,i)`](@ref)
function, where the index `i` runs consecutively from `1` to [`ncodeunits(s)`](@ref). The [`codeunits(s)`](@ref)
function returns an `AbstractVector{UInt8}` wrapper that lets you access these raw codeunits (bytes) as an array.
Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows to
treat any byte sequence as a `String`. In such situations a rule is that when parsing
a sequence of code units from left to right characters are formed by the longest sequence of
8-bit code units that matches the start of one of the following bit patterns
(each `x` can be `0` or `1`):
* `0xxxxxxx`;
* `110xxxxx` `10xxxxxx`;
* `1110xxxx` `10xxxxxx` `10xxxxxx`;
* `11110xxx` `10xxxxxx` `10xxxxxx` `10xxxxxx`;
* `10xxxxxx`;
* `11111xxx`.
In particular this means that overlong and too-high code unit sequences and prefixes thereof are treated
as a single invalid character rather than multiple invalid characters.
This rule may be best explained with an example:
```julia-repl
julia> s = "\xc0\xa0\xe2\x88\xe2|"
"\xc0\xa0\xe2\x88\xe2|"
julia> foreach(display, s)
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
'|': ASCII/Unicode U+007C (category Sm: Symbol, math)
julia> isvalid.(collect(s))
4-element BitArray{1}:
0
0
0
1
julia> s2 = "\xf7\xbf\xbf\xbf"
"\U1fffff"
julia> foreach(display, s2)
'\U1fffff': Unicode U+1FFFFF (category In: Invalid, too high)
```
We can see that the first two code units in the string `s` form an overlong encoding of
space character. It is invalid, but is accepted in a string as a single character.
The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth
code unit `\xe2` is not its valid continuation. Therefore code units 3 and 4 are also
interpreted as malformed characters in this string. Similarly code unit 5 forms a malformed
character because `|` is not a valid continuation to it. Finally the string `s2` contains
one too high code point.
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
For example, the [LegacyStrings.jl](https://github.com/JuliaStrings/LegacyStrings.jl) package
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and
how to implement support for them is beyond the scope of this document for the time being. For
further discussion of UTF-8 encoding issues, see the section below on [byte array literals](@ref man-byte-array-literals).
The [`transcode`](@ref) function is provided to convert data between the various UTF-xx encodings,
primarily for working with external data and libraries.
## [Concatenation](@id man-concatenation)
One of the most common and useful string operations is concatenation:
```jldoctest stringconcat
julia> greet = "Hello"
"Hello"
julia> whom = "world"
"world"
julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
```
It's important to be aware of potentially dangerous situations such as concatenation of invalid UTF-8 strings.
The resulting string may contain different characters than the input strings,
and its number of characters may be lower than sum of numbers of characters
of the concatenated strings, e.g.:
```julia-repl
julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")
julia> c = string(a, b)
"∀"
julia> collect.([a, b, c])
3-element Vector{Vector{Char}}:
['\xe2\x88']
['\x80']
['∀']
julia> length.([a, b, c])
3-element Vector{Int64}:
1
1
1
```
This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings
concatenation preserves all characters in strings and additivity of string lengths.
Julia also provides [`*`](@ref) for string concatenation:
```jldoctest stringconcat
julia> greet * ", " * whom * ".\n"
"Hello, world.\n"
```
While `*` may seem like a surprising choice to users of languages that provide `+` for string
concatenation, this use of `*` has precedent in mathematics, particularly in abstract algebra.
In mathematics, `+` usually denotes a *commutative* operation, where the order of the operands does
not matter. An example of this is matrix addition, where `A + B == B + A` for any matrices `A` and `B`
that have the same shape. In contrast, `*` typically denotes a *noncommutative* operation, where the
order of the operands *does* matter. An example of this is matrix multiplication, where in general
`A * B != B * A`. As with matrix multiplication, string concatenation is noncommutative:
`greet * whom != whom * greet`. As such, `*` is a more natural choice for an infix string concatenation
operator, consistent with common mathematical use.
More precisely, the set of all finite-length strings *S* together with the string concatenation operator
`*` forms a [free monoid](https://en.wikipedia.org/wiki/Free_monoid) (*S*, `*`). The identity element
of this set is the empty string, `""`. Whenever a free monoid is not commutative, the operation is
typically represented as `\cdot`, `*`, or a similar symbol, rather than `+`, which as stated usually
implies commutativity.
## [Interpolation](@id string-interpolation)
Constructing strings using concatenation can become a bit cumbersome, however. To reduce the need for these
verbose calls to [`string`](@ref) or repeated multiplications, Julia allows interpolation into string literals
using `$`, as in Perl:
```jldoctest stringconcat
julia> "$greet, $whom.\n"
"Hello, world.\n"
```
This is more readable and convenient and equivalent to the above string concatenation -- the system
rewrites this apparent single string literal into the call `string(greet, ", ", whom, ".\n")`.
The shortest complete expression after the `$` is taken as the expression whose value is to be
interpolated into the string. Thus, you can interpolate any expression into a string using parentheses:
```jldoctest
julia> "1 + 2 = $(1 + 2)"
"1 + 2 = 3"
```
Both concatenation and string interpolation call [`string`](@ref) to convert objects into string
form. However, `string` actually just returns the output of [`print`](@ref), so new types
should add methods to [`print`](@ref) or [`show`](@ref) instead of `string`.
Most non-`AbstractString` objects are converted to strings closely corresponding to how
they are entered as literal expressions:
```jldoctest
julia> v = [1,2,3]
3-element Vector{Int64}:
1
2
3
julia> "v: $v"
"v: [1, 2, 3]"
```
[`string`](@ref) is the identity for `AbstractString` and `AbstractChar` values, so these are interpolated
into strings as themselves, unquoted and unescaped:
```jldoctest
julia> c = 'x'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> "hi, $c"
"hi, x"
```
To include a literal `$` in a string literal, escape it with a backslash:
```jldoctest
julia> print("I have \$100 in my account.\n")
I have $100 in my account.
```
## Triple-Quoted String Literals
When strings are created using triple-quotes (`"""..."""`) they have some special behavior that
can be useful for creating longer blocks of text.
First, triple-quoted strings are also dedented to the level of the least-indented line.
This is useful for defining strings within code that is indented. For example:
```jldoctest
julia> str = """
Hello,
world.
"""
" Hello,\n world.\n"
```
In this case the final (empty) line before the closing `"""` sets the indentation level.
The dedentation level is determined as the longest common starting sequence of spaces or
tabs in all lines, excluding the line following the opening `"""` and lines containing
only spaces or tabs (the line containing the closing `"""` is always included).
Then for all lines, excluding the text following the opening `"""`, the common starting
sequence is removed (including lines containing only spaces and tabs if they start with
this sequence), e.g.:
```jldoctest
julia> """ This
is
a test"""
" This\nis\n a test"
```
Next, if the opening `"""` is followed by a newline,
the newline is stripped from the resulting string.
```julia
"""hello"""
```
is equivalent to
```julia
"""
hello"""
```
but
```julia
"""
hello"""
```
will contain a literal newline at the beginning.
Stripping of the newline is performed after the dedentation. For example:
```jldoctest
julia> """
Hello,
world."""
"Hello,\nworld."
```
If the newline is removed using a backslash, dedentation will be respected as well:
```jldoctest
julia> """
Averylong\
word"""
"Averylongword"
```
Trailing whitespace is left unaltered.
Triple-quoted string literals can contain `"` characters without escaping.
Note that line breaks in literal strings, whether single- or triple-quoted, result in a newline
(LF) character `\n` in the string, even if your editor uses a carriage return `\r` (CR) or CRLF
combination to end lines. To include a CR in a string, use an explicit escape `\r`; for example,
you can enter the literal string `"a CRLF line ending\r\n"`.
## Common Operations
You can lexicographically compare strings using the standard comparison operators:
```jldoctest
julia> "abracadabra" < "xylophone"
true
julia> "abracadabra" == "xylophone"
false
julia> "Hello, world." != "Goodbye, world."
true
julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true
```
You can search for the index of a particular character using the
[`findfirst`](@ref) and [`findlast`](@ref) functions:
```jldoctest
julia> findfirst('o', "xylophone")
4
julia> findlast('o', "xylophone")
7
julia> findfirst('z', "xylophone")
```
You can start the search for a character at a given offset by using
the functions [`findnext`](@ref) and [`findprev`](@ref):
```jldoctest
julia> findnext('o', "xylophone", 1)
4
julia> findnext('o', "xylophone", 5)
7
julia> findprev('o', "xylophone", 5)
4
julia> findnext('o', "xylophone", 8)
```
You can use the [`occursin`](@ref) function to check if a substring is found within a string:
```jldoctest
julia> occursin("world", "Hello, world.")
true
julia> occursin("o", "Xylophon")
true
julia> occursin("a", "Xylophon")
false
julia> occursin('o', "Xylophon")
true
```
The last example shows that [`occursin`](@ref) can also look for a character literal.
Two other handy string functions are [`repeat`](@ref) and [`join`](@ref):
```jldoctest
julia> repeat(".:Z:.", 10)
".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."
julia> join(["apples", "bananas", "pineapples"], ", ", " and ")
"apples, bananas and pineapples"
```
Some other useful functions include:
* [`firstindex(str)`](@ref) gives the minimal (byte) index that can be used to index into `str` (always 1 for strings, not necessarily true for other containers).
* [`lastindex(str)`](@ref) gives the maximal (byte) index that can be used to index into `str`.
* [`length(str)`](@ref) the number of characters in `str`.
* [`length(str, i, j)`](@ref) the number of valid character indices in `str` from `i` to `j`.
* [`ncodeunits(str)`](@ref) number of [code units](https://en.wikipedia.org/wiki/Character_encoding#Terminology) in a string.
* [`codeunit(str, i)`](@ref) gives the code unit value in the string `str` at index `i`.
* [`thisind(str, i)`](@ref) given an arbitrary index into a string find the first index of the character into which the index points.
* [`nextind(str, i, n=1)`](@ref) find the start of the `n`th character starting after index `i`.
* [`prevind(str, i, n=1)`](@ref) find the start of the `n`th character starting before index `i`.
## [Non-Standard String Literals](@id non-standard-string-literals)
There are situations when you want to construct a string or use string semantics, but the behavior
of the standard string construct is not quite what is needed. For these kinds of situations, Julia
provides non-standard string literals. A non-standard string literal looks like a regular
double-quoted string literal,
but is immediately prefixed by an identifier, and may behave differently from a normal string literal.
[Regular expressions](@ref man-regex-literals), [byte array literals](@ref man-byte-array-literals),
and [version number literals](@ref man-version-number-literals), as described below,
are some examples of non-standard string literals. Users and packages may also define new non-standard string literals.
Further documentation is given in the [Metaprogramming](@ref meta-non-standard-string-literals) section.
## [Regular Expressions](@id man-regex-literals)
Sometimes you are not looking for an exact string, but a particular *pattern*. For example, suppose you are trying to extract a single date from a large text file. You don’t know what that date is (that’s why you are searching for it), but you do know it will look something like `YYYY-MM-DD`. Regular expressions allow you to specify these patterns and search for them.
Julia uses version 2 of Perl-compatible regular expressions (regexes), as provided by the [PCRE](https://www.pcre.org/)
library (see the [PCRE2 syntax description](https://www.pcre.org/current/doc/html/pcre2syntax.html) for more details). Regular expressions are related to strings in two ways: the obvious connection is that
regular expressions are used to find regular patterns in strings; the other connection is that
regular expressions are themselves input as strings, which are parsed into a state machine that
can be used to efficiently search for patterns in strings. In Julia, regular expressions are input
using non-standard string literals prefixed with various identifiers beginning with `r`. The most
basic regular expression literal without any options turned on just uses `r"..."`:
```jldoctest
julia> re = r"^\s*(?:#|$)"
r"^\s*(?:#|$)"
julia> typeof(re)
Regex
```
To check if a regex matches a string, use [`occursin`](@ref):
```jldoctest
julia> occursin(r"^\s*(?:#|$)", "not a comment")
false
julia> occursin(r"^\s*(?:#|$)", "# a comment")
true
```
As one can see here, [`occursin`](@ref) simply returns true or false, indicating whether a
match for the given regex occurs in the string. Commonly, however, one wants to know not
just whether a string matched, but also *how* it matched. To capture this information about
a match, use the [`match`](@ref) function instead:
```jldoctest
julia> match(r"^\s*(?:#|$)", "not a comment")
julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")
```
If the regular expression does not match the given string, [`match`](@ref) returns [`nothing`](@ref)
-- a special value that does not print anything at the interactive prompt. Other than not printing,
it is a completely normal value and you can test for it programmatically:
```julia
m = match(r"^\s*(?:#|$)", line)
if m === nothing
println("not a comment")
else
println("blank or comment")
end
```
If a regular expression does match, the value returned by [`match`](@ref) is a [`RegexMatch`](@ref)
object. These objects record how the expression matches, including the substring that the pattern
matches and any captured substrings, if there are any. This example only captures the portion
of the substring that matches, but perhaps we want to capture any non-blank text after the comment
character. We could do the following:
```jldoctest
julia> m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")
```
When calling [`match`](@ref), you have the option to specify an index at which to start the
search. For example:
```jldoctest
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",1)
RegexMatch("1")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",6)
RegexMatch("2")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",11)
RegexMatch("3")
```
You can extract the following info from a `RegexMatch` object:
* the entire substring matched: `m.match`
* the captured substrings as an array of strings: `m.captures`
* the offset at which the whole match begins: `m.offset`
* the offsets of the captured substrings as a vector: `m.offsets`
For when a capture doesn't match, instead of a substring, `m.captures` contains `nothing` in that
position, and `m.offsets` has a zero offset (recall that indices in Julia are 1-based, so a zero
offset into a string is invalid). Here is a pair of somewhat contrived examples:
```jldoctest acdmatch
julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")
julia> m.match
"acd"
julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
"a"
"c"
"d"
julia> m.offset
1
julia> m.offsets
3-element Vector{Int64}:
1
2
3
julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")
julia> m.match
"ad"
julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
"a"
nothing
"d"
julia> m.offset
1
julia> m.offsets
3-element Vector{Int64}:
1
0
2
```
It is convenient to have captures returned as an array so that one can use destructuring syntax
to bind them to local variables. As a convenience, the `RegexMatch` object implements iterator methods that pass through to the `captures` field, so you can destructure the match object directly:
```jldoctest acdmatch
julia> first, second, third = m; first
"a"
```
Captures can also be accessed by indexing the `RegexMatch` object with the number or name of the
capture group:
```jldoctest
julia> m=match(r"(?<hour>\d+):(?<minute>\d+)","12:45")
RegexMatch("12:45", hour="12", minute="45")
julia> m[:minute]
"45"
julia> m[2]
"45"
```
Captures can be referenced in a substitution string when using [`replace`](@ref) by using `\n`
to refer to the nth capture group and prefixing the substitution string with `s`. Capture group
0 refers to the entire match object. Named capture groups can be referenced in the substitution
with `\g<groupname>`. For example:
```jldoctest
julia> replace("first second", r"(\w+) (?<agroup>\w+)" => s"\g<agroup> \1")
"second first"
```
Numbered capture groups can also be referenced as `\g<n>` for disambiguation, as in:
```jldoctest
julia> replace("a", r"." => s"\g<0>1")
"a1"
```
You can modify the behavior of regular expressions by some combination of the flags `i`, `m`,
`s`, and `x` after the closing double quote mark. These flags have the same meaning as they do
in Perl, as explained in this excerpt from the [perlre manpage](https://perldoc.perl.org/perlre#Modifiers):
```
i Do case-insensitive pattern matching.
If locale matching rules are in effect, the case map is taken
from the current locale for code points less than 255, and
from Unicode rules for larger code points. However, matches
that would cross the Unicode rules/non-Unicode rules boundary
(ords 255/256) will not succeed.
m Treat string as multiple lines. That is, change "^" and "$"
from matching the start or end of the string to matching the
start or end of any line anywhere within the string.
s Treat string as single line. That is, change "." to match any
character whatsoever, even a newline, which normally it would
not match.
Used together, as r""ms, they let the "." match any character
whatsoever, while still allowing "^" and "$" to match,
respectively, just after and just before newlines within the
string.
x Tells the regular expression parser to ignore most whitespace
that is neither backslashed nor within a character class. You
can use this to break up your regular expression into
(slightly) more readable parts. The '#' character is also
treated as a metacharacter introducing a comment, just as in
ordinary code.
```
For example, the following regex has all three flags turned on:
```jldoctest
julia> r"a+.*b+.*?d$"ism
r"a+.*b+.*?d$"ims
julia> match(r"a+.*b+.*?d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
```
The `r"..."` literal is constructed without interpolation and unescaping (except for
quotation mark `"` which still has to be escaped). Here is an example
showing the difference from standard string literals:
```julia-repl
julia> x = 10
10
julia> r"$x"
r"$x"
julia> "$x"