1 | =head1 NAME |
---|
2 | |
---|
3 | perlunicode - Unicode support in Perl |
---|
4 | |
---|
5 | =head1 DESCRIPTION |
---|
6 | |
---|
7 | =head2 Important Caveats |
---|
8 | |
---|
9 | Unicode support is an extensive requirement. While Perl does not |
---|
10 | implement the Unicode standard or the accompanying technical reports |
---|
11 | from cover to cover, Perl does support many Unicode features. |
---|
12 | |
---|
13 | =over 4 |
---|
14 | |
---|
15 | =item Input and Output Layers |
---|
16 | |
---|
17 | Perl knows when a filehandle uses Perl's internal Unicode encodings |
---|
18 | (UTF-8, or UTF-EBCDIC on EBCDIC platforms) if the filehandle is opened with |
---|
19 | the ":utf8" layer. Other encodings can be converted to Perl's |
---|
20 | encoding on input or from Perl's encoding on output by use of the |
---|
21 | ":encoding(...)" layer. See L<open>. |
---|
22 | |
---|
23 | To indicate that Perl source itself is using a particular encoding, |
---|
24 | see L<encoding>. |
---|
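As a minimal sketch of the layers described above, the following writes one character through an C<:encoding(UTF-8)> layer and then reads the file back twice: once as raw bytes and once as decoded characters. The scratch file and the variable names are assumptions made for this illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one character (U+263A) through an encoding layer.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "\x{263A}";
close $out;

# Read it back without any decoding: we see the encoded bytes.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the same layer: we see one character.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

The raw read sees the three bytes of the UTF-8 encoding, while the layered read sees a single character.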
25 | |
---|
26 | =item Regular Expressions |
---|
27 | |
---|
28 | The regular expression compiler produces polymorphic opcodes. That is, |
---|
29 | the pattern adapts to the data and automatically switches to the Unicode |
---|
30 | character scheme when presented with Unicode data--or instead uses |
---|
31 | a traditional byte scheme when presented with byte data. |
---|
32 | |
---|
33 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
---|
34 | |
---|
35 | As a compatibility measure, the C<use utf8> pragma must be explicitly |
---|
36 | included to enable recognition of UTF-8 in the Perl scripts themselves |
---|
37 | (in string or regular expression literals, or in identifier names) on |
---|
38 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based |
---|
39 | machines. B<These are the only times when an explicit C<use utf8> |
---|
40 | is needed.> See L<utf8>. |
---|
41 | |
---|
42 | You can also use the C<encoding> pragma to change the default encoding |
---|
43 | of the data in your script; see L<encoding>. |
---|
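The effect of the pragma on literals can be sketched as follows. This example assumes that the source file itself is saved as UTF-8 and that the literal is the precomposed character U+00E9; the variable names are illustrative only.

```perl
use strict;
use warnings;

# The same two-byte UTF-8 sequence for U+00E9 appears in both literals.
my $as_bytes = do { no utf8;  "é" };   # parser keeps two separate bytes
my $as_chars = do { use utf8; "é" };   # parser decodes one character

printf "without utf8: %d, with utf8: %d\n",
    length($as_bytes), length($as_chars);
```

Because the pragma is lexically scoped, each C<do> block sees its own setting.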
44 | |
---|
45 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings |
---|
46 | |
---|
47 | By default, there is a fundamental asymmetry in Perl's Unicode model: |
---|
48 | implicit upgrading from byte strings to Unicode strings assumes that |
---|
49 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are |
---|
50 | downgraded with UTF-8 encoding. This happens because the first 256 |
---|
51 | code points in Unicode happen to agree with Latin-1. |
---|
52 | |
---|
53 | If you wish to interpret byte strings as UTF-8 instead, use the |
---|
54 | C<encoding> pragma: |
---|
55 | |
---|
56 | use encoding 'utf8'; |
---|
57 | |
---|
58 | See L</"Byte and Character Semantics"> for more details. |
---|
59 | |
---|
60 | =back |
---|
61 | |
---|
62 | =head2 Byte and Character Semantics |
---|
63 | |
---|
64 | Beginning with version 5.6, Perl uses logically-wide characters to |
---|
65 | represent strings internally. |
---|
66 | |
---|
67 | In future, Perl-level operations will be expected to work with |
---|
68 | characters rather than bytes. |
---|
69 | |
---|
70 | However, as an interim compatibility measure, Perl aims to |
---|
71 | provide a safe migration path from byte semantics to character |
---|
72 | semantics for programs. For operations where Perl can unambiguously |
---|
73 | decide that the input data are characters, Perl switches to |
---|
74 | character semantics. For operations where this determination cannot |
---|
75 | be made without additional information from the user, Perl decides in |
---|
76 | favor of compatibility and chooses to use byte semantics. |
---|
77 | |
---|
78 | This behavior preserves compatibility with earlier versions of Perl, |
---|
79 | which allowed byte semantics in Perl operations only if |
---|
80 | none of the program's inputs were marked as being a source of Unicode |
---|
81 | character data. Such data may come from filehandles, from calls to |
---|
82 | external programs, from information provided by the system (such as %ENV), |
---|
83 | or from literals and constants in the source text. |
---|
84 | |
---|
85 | The C<bytes> pragma will always, regardless of platform, force byte |
---|
86 | semantics in a particular lexical scope. See L<bytes>. |
---|
87 | |
---|
88 | The C<utf8> pragma is primarily a compatibility device that enables |
---|
89 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
---|
90 | Note that this pragma is only required while Perl defaults to byte |
---|
91 | semantics; when character semantics become the default, this pragma |
---|
92 | may become a no-op. See L<utf8>. |
---|
93 | |
---|
94 | Unless explicitly stated, Perl operators use character semantics |
---|
95 | for Unicode data and byte semantics for non-Unicode data. |
---|
96 | The decision to use character semantics is made transparently. If |
---|
97 | input data comes from a Unicode source--for example, if a character |
---|
98 | encoding layer is added to a filehandle or a literal Unicode |
---|
99 | string constant appears in a program--character semantics apply. |
---|
100 | Otherwise, byte semantics are in effect. The C<bytes> pragma should |
---|
101 | be used to force byte semantics on Unicode data. |
---|
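A small sketch of the C<bytes> pragma in action, assuming an ASCII-based platform where the internal encoding is UTF-8 (the variable names are illustrative):

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";          # one character, ordinal above 255

# Character semantics: the string is one character long.
my $char_len = length $smiley;

# Byte semantics, limited to the lexical scope of this block.
my $byte_len = do { use bytes; length $smiley };

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;
```

Outside the C<do> block, character semantics remain in effect for the same string.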
102 | |
---|
103 | If strings operating under byte semantics and strings with Unicode |
---|
104 | character data are concatenated, the new string will be created by |
---|
105 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the |
---|
106 | old Unicode string used EBCDIC. This translation is done without |
---|
107 | regard to the system's native 8-bit encoding. To change this for |
---|
108 | systems with non-Latin-1 and non-EBCDIC native encodings, use the |
---|
109 | C<encoding> pragma. See L<encoding>. |
---|
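The Latin-1 assumption made during concatenation can be observed with a short sketch such as the following, which assumes an ASCII-based platform and no C<encoding> pragma in effect (the variable names are illustrative):

```perl
use strict;
use warnings;

my $byte_string = "\xE9";      # a single byte, not marked as Unicode
my $wide_string = "\x{263A}";  # forces Unicode character data

my $joined = $byte_string . $wide_string;

# The byte 0xE9 was interpreted as Latin-1, i.e. as code point U+00E9.
printf "length %d, first code point U+%04X\n",
    length($joined), ord($joined);
```

If the byte had really been part of a UTF-8 sequence, this implicit upgrade would not decode it; that is the asymmetry described earlier.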
110 | |
---|
111 | Under character semantics, many operations that formerly operated on |
---|
112 | bytes now operate on characters. A character in Perl is |
---|
113 | logically just a number ranging from 0 to 2**31 or so. Larger |
---|
114 | characters may encode into longer sequences of bytes internally, but |
---|
115 | this internal detail is mostly hidden from Perl code. |
---|
116 | See L<perluniintro> for more. |
---|
117 | |
---|
118 | =head2 Effects of Character Semantics |
---|
119 | |
---|
120 | Character semantics have the following effects: |
---|
121 | |
---|
122 | =over 4 |
---|
123 | |
---|
124 | =item * |
---|
125 | |
---|
126 | Strings--including hash keys--and regular expression patterns may |
---|
127 | contain characters that have an ordinal value larger than 255. |
---|
128 | |
---|
129 | If you use a Unicode editor to edit your program, Unicode characters |
---|
130 | may occur directly within the literal strings in one of the various |
---|
131 | Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized |
---|
132 | as such and converted to Perl's internal representation only if the |
---|
133 | appropriate L<encoding> is specified. |
---|
134 | |
---|
135 | Unicode characters can also be added to a string by using the |
---|
136 | C<\x{...}> notation. The Unicode code for the desired character, in |
---|
137 | hexadecimal, should be placed in the braces. For instance, a smiley |
---|
138 | face is C<\x{263A}>. This encoding scheme only works for characters |
---|
139 | with a code of 0x100 or above. |
---|
140 | |
---|
141 | Additionally, if you |
---|
142 | |
---|
143 | use charnames ':full'; |
---|
144 | |
---|
145 | you can use the C<\N{...}> notation and put the official Unicode |
---|
146 | character name within the braces, such as C<\N{WHITE SMILING FACE}>. |
---|
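As a brief illustration of the two notations above, this sketch builds the same character by number and by name and then looks the name up again with C<charnames::viacode>. The variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_number = "\x{263A}";
my $by_name   = "\N{WHITE SMILING FACE}";

# Both notations produce the same one-character string.
my $same = $by_number eq $by_name ? 1 : 0;

# Map the code point back to its official Unicode name.
my $name = charnames::viacode(ord $by_name);

printf "same: %d, U+%04X is %s\n", $same, ord($by_name), $name;
```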
147 | |
---|
148 | |
---|
149 | =item * |
---|
150 | |
---|
151 | If an appropriate L<encoding> is specified, identifiers within the |
---|
152 | Perl script may contain Unicode alphanumeric characters, including |
---|
153 | ideographs. Perl does not currently attempt to canonicalize variable |
---|
154 | names. |
---|
155 | |
---|
156 | =item * |
---|
157 | |
---|
158 | Regular expressions match characters instead of bytes. "." matches |
---|
159 | a character instead of a byte. The C<\C> pattern is provided to force |
---|
160 | a match of a single byte--a C<char> in C, hence C<\C>. |
---|
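A minimal sketch of C<"."> matching whole characters, using a string of two characters that occupy several bytes each internally (the variable names are illustrative):

```perl
use strict;
use warnings;

my $text = "\x{263A}\x{263B}";   # two characters above 255

# "." consumes one character per match, not one byte.
my @dots = $text =~ /(.)/g;

printf "matches: %d\n", scalar @dots;
```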
161 | |
---|
162 | =item * |
---|
163 | |
---|
164 | Character classes in regular expressions match characters instead of |
---|
165 | bytes and match against the character properties specified in the |
---|
166 | Unicode properties database. C<\w> can be used to match a Japanese |
---|
167 | ideograph, for instance. |
---|
168 | |
---|
169 | (However, and as a limitation of the current implementation, using |
---|
170 | C<\w> or C<\W> I<inside> a C<[...]> character class will still match |
---|
171 | with byte semantics.) |
---|
172 | |
---|
173 | =item * |
---|
174 | |
---|
175 | Named Unicode properties, scripts, and block ranges may be used like |
---|
176 | character classes via the C<\p{}> "matches property" construct and |
---|
177 | the C<\P{}> negation, "doesn't match property". |
---|
178 | |
---|
179 | For instance, C<\p{Lu}> matches any character with the Unicode "Lu" |
---|
180 | (Letter, uppercase) property, while C<\p{M}> matches any character |
---|
181 | with an "M" (mark--accents and such) property. Brackets are not |
---|
182 | required for single letter properties, so C<\p{M}> is equivalent to |
---|
183 | C<\pM>. Many predefined properties are available, such as |
---|
184 | C<\p{Mirrored}> and C<\p{Tibetan}>. |
---|
185 | |
---|
186 | The official Unicode script and block names have spaces and dashes as |
---|
187 | separators, but for convenience you can use dashes, spaces, or |
---|
188 | underbars, and case is unimportant. It is recommended, however, that |
---|
189 | for consistency you use the following naming: the official Unicode |
---|
190 | script, property, or block name (see below for the additional rules |
---|
191 | that apply to block names) with whitespace and dashes removed, and the |
---|
192 | words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus |
---|
193 | becomes C<Latin1Supplement>. |
---|
194 | |
---|
195 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
---|
196 | (^) between the first brace and the property name: C<\p{^Tamil}> is |
---|
197 | equal to C<\P{Tamil}>. |
---|
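The property constructs described so far can be tried out with a sketch like this one. The sample characters are chosen for illustration, and the variable names are assumptions of the example.

```perl
use strict;
use warnings;

my $cyrillic_a = "\x{0410}";   # CYRILLIC CAPITAL LETTER A
my $acute      = "\x{0301}";   # COMBINING ACUTE ACCENT
my $tamil_a    = "\x{0B85}";   # TAMIL LETTER A

# General Category tests, with and without braces.
my $is_upper = $cyrillic_a =~ /^\p{Lu}$/ ? 1 : 0;
my $is_mark  = $acute      =~ /^\pM$/    ? 1 : 0;

# \p{^Tamil} and \P{Tamil} should agree on every input.
my $caret_form = $tamil_a =~ /\p{^Tamil}/ ? 1 : 0;
my $big_p_form = $tamil_a =~ /\P{Tamil}/  ? 1 : 0;

printf "Lu: %d, M: %d, ^Tamil: %d, \\P{Tamil}: %d\n",
    $is_upper, $is_mark, $caret_form, $big_p_form;
```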
198 | |
---|
199 | B<NOTE: the properties, scripts, and blocks listed here are as of |
---|
200 | Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0 |
---|
201 | came out in April 2003, and Perl 5.8.1 in September 2003.> |
---|
202 | |
---|
203 | Here are the basic Unicode General Category properties, followed by their |
---|
204 | long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, |
---|
205 | for instance, are identical. |
---|
206 | |
---|
207 | Short Long |
---|
208 | |
---|
209 | L Letter |
---|
210 | Lu UppercaseLetter |
---|
211 | Ll LowercaseLetter |
---|
212 | Lt TitlecaseLetter |
---|
213 | Lm ModifierLetter |
---|
214 | Lo OtherLetter |
---|
215 | |
---|
216 | M Mark |
---|
217 | Mn NonspacingMark |
---|
218 | Mc SpacingMark |
---|
219 | Me EnclosingMark |
---|
220 | |
---|
221 | N Number |
---|
222 | Nd DecimalNumber |
---|
223 | Nl LetterNumber |
---|
224 | No OtherNumber |
---|
225 | |
---|
226 | P Punctuation |
---|
227 | Pc ConnectorPunctuation |
---|
228 | Pd DashPunctuation |
---|
229 | Ps OpenPunctuation |
---|
230 | Pe ClosePunctuation |
---|
231 | Pi InitialPunctuation |
---|
232 | (may behave like Ps or Pe depending on usage) |
---|
233 | Pf FinalPunctuation |
---|
234 | (may behave like Ps or Pe depending on usage) |
---|
235 | Po OtherPunctuation |
---|
236 | |
---|
237 | S Symbol |
---|
238 | Sm MathSymbol |
---|
239 | Sc CurrencySymbol |
---|
240 | Sk ModifierSymbol |
---|
241 | So OtherSymbol |
---|
242 | |
---|
243 | Z Separator |
---|
244 | Zs SpaceSeparator |
---|
245 | Zl LineSeparator |
---|
246 | Zp ParagraphSeparator |
---|
247 | |
---|
248 | C Other |
---|
249 | Cc Control |
---|
250 | Cf Format |
---|
251 | Cs Surrogate (not usable) |
---|
252 | Co PrivateUse |
---|
253 | Cn Unassigned |
---|
254 | |
---|
255 | Single-letter properties match all characters in any of the |
---|
256 | two-letter sub-properties starting with the same letter. |
---|
257 | C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>. |
---|
258 | |
---|
259 | Because Perl hides the need for the user to understand the internal |
---|
260 | representation of Unicode characters, there is no need to implement |
---|
261 | the somewhat messy concept of surrogates. C<Cs> is therefore not |
---|
262 | supported. |
---|
263 | |
---|
264 | Because scripts differ in their directionality--Hebrew is |
---|
265 | written right to left, for example--Unicode supplies these properties: |
---|
266 | |
---|
267 | Property Meaning |
---|
268 | |
---|
269 | BidiL Left-to-Right |
---|
270 | BidiLRE Left-to-Right Embedding |
---|
271 | BidiLRO Left-to-Right Override |
---|
272 | BidiR Right-to-Left |
---|
273 | BidiAL Right-to-Left Arabic |
---|
274 | BidiRLE Right-to-Left Embedding |
---|
275 | BidiRLO Right-to-Left Override |
---|
276 | BidiPDF Pop Directional Format |
---|
277 | BidiEN European Number |
---|
278 | BidiES European Number Separator |
---|
279 | BidiET European Number Terminator |
---|
280 | BidiAN Arabic Number |
---|
281 | BidiCS Common Number Separator |
---|
282 | BidiNSM Non-Spacing Mark |
---|
283 | BidiBN Boundary Neutral |
---|
284 | BidiB Paragraph Separator |
---|
285 | BidiS Segment Separator |
---|
286 | BidiWS Whitespace |
---|
287 | BidiON Other Neutrals |
---|
288 | |
---|
289 | For example, C<\p{BidiR}> matches characters that are normally |
---|
290 | written right to left. |
---|
291 | |
---|
292 | =back |
---|
293 | |
---|
294 | =head2 Scripts |
---|
295 | |
---|
296 | The script names which can be used by C<\p{...}> and C<\P{...}>, |
---|
297 | such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: |
---|
298 | |
---|
299 | Arabic |
---|
300 | Armenian |
---|
301 | Bengali |
---|
302 | Bopomofo |
---|
303 | Buhid |
---|
304 | CanadianAboriginal |
---|
305 | Cherokee |
---|
306 | Cyrillic |
---|
307 | Deseret |
---|
308 | Devanagari |
---|
309 | Ethiopic |
---|
310 | Georgian |
---|
311 | Gothic |
---|
312 | Greek |
---|
313 | Gujarati |
---|
314 | Gurmukhi |
---|
315 | Han |
---|
316 | Hangul |
---|
317 | Hanunoo |
---|
318 | Hebrew |
---|
319 | Hiragana |
---|
320 | Inherited |
---|
321 | Kannada |
---|
322 | Katakana |
---|
323 | Khmer |
---|
324 | Lao |
---|
325 | Latin |
---|
326 | Malayalam |
---|
327 | Mongolian |
---|
328 | Myanmar |
---|
329 | Ogham |
---|
330 | OldItalic |
---|
331 | Oriya |
---|
332 | Runic |
---|
333 | Sinhala |
---|
334 | Syriac |
---|
335 | Tagalog |
---|
336 | Tagbanwa |
---|
337 | Tamil |
---|
338 | Telugu |
---|
339 | Thaana |
---|
340 | Thai |
---|
341 | Tibetan |
---|
342 | Yi |
---|
343 | |
---|
344 | Extended property classes can supplement the basic |
---|
345 | properties, defined by the F<PropList> Unicode database: |
---|
346 | |
---|
347 | ASCIIHexDigit |
---|
348 | BidiControl |
---|
349 | Dash |
---|
350 | Deprecated |
---|
351 | Diacritic |
---|
352 | Extender |
---|
353 | GraphemeLink |
---|
354 | HexDigit |
---|
355 | Hyphen |
---|
356 | Ideographic |
---|
357 | IDSBinaryOperator |
---|
358 | IDSTrinaryOperator |
---|
359 | JoinControl |
---|
360 | LogicalOrderException |
---|
361 | NoncharacterCodePoint |
---|
362 | OtherAlphabetic |
---|
363 | OtherDefaultIgnorableCodePoint |
---|
364 | OtherGraphemeExtend |
---|
365 | OtherLowercase |
---|
366 | OtherMath |
---|
367 | OtherUppercase |
---|
368 | QuotationMark |
---|
369 | Radical |
---|
370 | SoftDotted |
---|
371 | TerminalPunctuation |
---|
372 | UnifiedIdeograph |
---|
373 | WhiteSpace |
---|
374 | |
---|
375 | and there are further derived properties: |
---|
376 | |
---|
377 | Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic |
---|
378 | Lowercase Ll + OtherLowercase |
---|
379 | Uppercase Lu + OtherUppercase |
---|
380 | Math Sm + OtherMath |
---|
381 | |
---|
382 | ID_Start Lu + Ll + Lt + Lm + Lo + Nl |
---|
383 | ID_Continue ID_Start + Mn + Mc + Nd + Pc |
---|
384 | |
---|
385 | Any Any character |
---|
386 | Assigned Any non-Cn character (i.e. synonym for \P{Cn}) |
---|
387 | Unassigned Synonym for \p{Cn} |
---|
388 | Common Any character (or unassigned code point) |
---|
389 | not explicitly assigned to a script |
---|
390 | |
---|
391 | For backward compatibility (with Perl 5.6), all properties mentioned |
---|
392 | so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for |
---|
393 | example, is equal to C<\P{Lu}>. |
---|
394 | |
---|
395 | =head2 Blocks |
---|
396 | |
---|
397 | In addition to B<scripts>, Unicode also defines B<blocks> of |
---|
398 | characters. The difference between scripts and blocks is that the |
---|
399 | concept of scripts is closer to natural languages, while the concept |
---|
400 | of blocks is more of an artificial grouping based on groups of 256 |
---|
401 | Unicode characters. For example, the C<Latin> script contains letters |
---|
402 | from many blocks but does not contain all the characters from those |
---|
403 | blocks. It does not, for example, contain digits, because digits are |
---|
404 | shared across many scripts. Digits and similar groups, like |
---|
405 | punctuation, are in a category called C<Common>. |
---|
406 | |
---|
407 | For more about scripts, see UTR #24: |
---|
408 | |
---|
409 | http://www.unicode.org/unicode/reports/tr24/ |
---|
410 | |
---|
411 | For more about blocks, see: |
---|
412 | |
---|
413 | http://www.unicode.org/Public/UNIDATA/Blocks.txt |
---|
414 | |
---|
415 | Block names are given with the C<In> prefix. For example, the |
---|
416 | Katakana block is referenced via C<\p{InKatakana}>. The C<In> |
---|
417 | prefix may be omitted if there is no naming conflict with a script |
---|
418 | or any other property, but it is recommended that C<In> always be used |
---|
419 | for block tests to avoid confusion. |
---|
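The difference between a block test and a script test can be sketched as follows. The example relies on the ASCII digits being in the Basic Latin block but not in the C<Latin> script, as described above; the hash and its key names are illustrative.

```perl
use strict;
use warnings;

my $digit    = "1";
my $katakana = "\x{30A2}";   # KATAKANA LETTER A

my %result = (
    digit_in_basic_latin_block => ($digit    =~ /\p{InBasicLatin}/ ? 1 : 0),
    digit_in_latin_script      => ($digit    =~ /\p{Latin}/        ? 1 : 0),
    digit_in_common            => ($digit    =~ /\p{Common}/       ? 1 : 0),
    kana_in_katakana_block     => ($katakana =~ /\p{InKatakana}/   ? 1 : 0),
);

printf "%-28s %d\n", $_, $result{$_} for sort keys %result;
```

Only the script test on the digit is expected to fail, since digits belong to C<Common> rather than to any single script.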
420 | |
---|
421 | These block names are supported: |
---|
422 | |
---|
423 | InAlphabeticPresentationForms |
---|
424 | InArabic |
---|
425 | InArabicPresentationFormsA |
---|
426 | InArabicPresentationFormsB |
---|
427 | InArmenian |
---|
428 | InArrows |
---|
429 | InBasicLatin |
---|
430 | InBengali |
---|
431 | InBlockElements |
---|
432 | InBopomofo |
---|
433 | InBopomofoExtended |
---|
434 | InBoxDrawing |
---|
435 | InBraillePatterns |
---|
436 | InBuhid |
---|
437 | InByzantineMusicalSymbols |
---|
438 | InCJKCompatibility |
---|
439 | InCJKCompatibilityForms |
---|
440 | InCJKCompatibilityIdeographs |
---|
441 | InCJKCompatibilityIdeographsSupplement |
---|
442 | InCJKRadicalsSupplement |
---|
443 | InCJKSymbolsAndPunctuation |
---|
444 | InCJKUnifiedIdeographs |
---|
445 | InCJKUnifiedIdeographsExtensionA |
---|
446 | InCJKUnifiedIdeographsExtensionB |
---|
447 | InCherokee |
---|
448 | InCombiningDiacriticalMarks |
---|
449 | InCombiningDiacriticalMarksforSymbols |
---|
450 | InCombiningHalfMarks |
---|
451 | InControlPictures |
---|
452 | InCurrencySymbols |
---|
453 | InCyrillic |
---|
454 | InCyrillicSupplementary |
---|
455 | InDeseret |
---|
456 | InDevanagari |
---|
457 | InDingbats |
---|
458 | InEnclosedAlphanumerics |
---|
459 | InEnclosedCJKLettersAndMonths |
---|
460 | InEthiopic |
---|
461 | InGeneralPunctuation |
---|
462 | InGeometricShapes |
---|
463 | InGeorgian |
---|
464 | InGothic |
---|
465 | InGreekExtended |
---|
466 | InGreekAndCoptic |
---|
467 | InGujarati |
---|
468 | InGurmukhi |
---|
469 | InHalfwidthAndFullwidthForms |
---|
470 | InHangulCompatibilityJamo |
---|
471 | InHangulJamo |
---|
472 | InHangulSyllables |
---|
473 | InHanunoo |
---|
474 | InHebrew |
---|
475 | InHighPrivateUseSurrogates |
---|
476 | InHighSurrogates |
---|
477 | InHiragana |
---|
478 | InIPAExtensions |
---|
479 | InIdeographicDescriptionCharacters |
---|
480 | InKanbun |
---|
481 | InKangxiRadicals |
---|
482 | InKannada |
---|
483 | InKatakana |
---|
484 | InKatakanaPhoneticExtensions |
---|
485 | InKhmer |
---|
486 | InLao |
---|
487 | InLatin1Supplement |
---|
488 | InLatinExtendedA |
---|
489 | InLatinExtendedAdditional |
---|
490 | InLatinExtendedB |
---|
491 | InLetterlikeSymbols |
---|
492 | InLowSurrogates |
---|
493 | InMalayalam |
---|
494 | InMathematicalAlphanumericSymbols |
---|
495 | InMathematicalOperators |
---|
496 | InMiscellaneousMathematicalSymbolsA |
---|
497 | InMiscellaneousMathematicalSymbolsB |
---|
498 | InMiscellaneousSymbols |
---|
499 | InMiscellaneousTechnical |
---|
500 | InMongolian |
---|
501 | InMusicalSymbols |
---|
502 | InMyanmar |
---|
503 | InNumberForms |
---|
504 | InOgham |
---|
505 | InOldItalic |
---|
506 | InOpticalCharacterRecognition |
---|
507 | InOriya |
---|
508 | InPrivateUseArea |
---|
509 | InRunic |
---|
510 | InSinhala |
---|
511 | InSmallFormVariants |
---|
512 | InSpacingModifierLetters |
---|
513 | InSpecials |
---|
514 | InSuperscriptsAndSubscripts |
---|
515 | InSupplementalArrowsA |
---|
516 | InSupplementalArrowsB |
---|
517 | InSupplementalMathematicalOperators |
---|
518 | InSupplementaryPrivateUseAreaA |
---|
519 | InSupplementaryPrivateUseAreaB |
---|
520 | InSyriac |
---|
521 | InTagalog |
---|
522 | InTagbanwa |
---|
523 | InTags |
---|
524 | InTamil |
---|
525 | InTelugu |
---|
526 | InThaana |
---|
527 | InThai |
---|
528 | InTibetan |
---|
529 | InUnifiedCanadianAboriginalSyllabics |
---|
530 | InVariationSelectors |
---|
531 | InYiRadicals |
---|
532 | InYiSyllables |
---|
533 | |
---|
534 | =over 4 |
---|
535 | |
---|
536 | =item * |
---|
537 | |
---|
538 | The special pattern C<\X> matches any extended Unicode |
---|
539 | sequence--"a combining character sequence" in Standardese--where the |
---|
540 | first character is a base character and subsequent characters are mark |
---|
541 | characters that apply to the base character. C<\X> is equivalent to |
---|
542 | C<(?:\PM\pM*)>. |
---|
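As a short sketch of C<\X>, the string below holds three characters but only two combining character sequences, because the accent attaches to the preceding base character (the variable names are illustrative):

```perl
use strict;
use warnings;

my $text = "e\x{0301}a";   # "e", COMBINING ACUTE ACCENT, "a"

# Each \X match takes a base character plus any following marks.
my @sequences = $text =~ /(\X)/g;

printf "characters: %d, sequences: %d\n",
    length($text), scalar @sequences;
```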
543 | |
---|
544 | =item * |
---|
545 | |
---|
546 | The C<tr///> operator translates characters instead of bytes. Note |
---|
547 | that the C<tr///CU> functionality has been removed. For similar |
---|
548 | functionality see pack('U0', ...) and pack('C0', ...). |
---|
549 | |
---|
550 | =item * |
---|
551 | |
---|
552 | Case translation operators use the Unicode case translation tables |
---|
553 | when character input is provided. Note that C<uc()>, or C<\U> in |
---|
554 | interpolated strings, translates to uppercase, while C<ucfirst>, |
---|
555 | or C<\u> in interpolated strings, translates to titlecase in languages |
---|
556 | that make the distinction. |
---|
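The uppercase/titlecase distinction shows up with digraph characters. This sketch uses U+01C6, whose uppercase and titlecase forms are different characters; the variable names are illustrative.

```perl
use strict;
use warnings;

my $dz = "\x{01C6}";        # LATIN SMALL LETTER DZ WITH CARON

my $upper = uc $dz;         # uppercase form
my $title = ucfirst $dz;    # titlecase form

printf "uc: U+%04X, ucfirst: U+%04X\n", ord($upper), ord($title);
```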
557 | |
---|
558 | =item * |
---|
559 | |
---|
560 | Most operators that deal with positions or lengths in a string will |
---|
561 | automatically switch to using character positions, including |
---|
562 | C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, |
---|
563 | C<sprintf()>, C<write()>, and C<length()>. Operators that |
---|
564 | specifically do not switch include C<vec()>, C<pack()>, and |
---|
565 | C<unpack()>. Operators that really don't care include C<chomp()>, |
---|
566 | operators that treat strings as a bucket of bits such as C<sort()>, |
---|
567 | and operators dealing with filenames. |
---|
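A minimal sketch of character-based positions and lengths, using a string whose first character occupies several bytes internally (the variable names are illustrative):

```perl
use strict;
use warnings;

my $text = "\x{263A}abc";

my $len   = length $text;          # counts characters
my $pos_b = index $text, "b";      # character offset of "b"
my $slice = substr $text, 1, 2;    # two characters starting at offset 1

my $copy = $text;
chop $copy;                        # removes one character, "c"

printf "length %d, index %d, substr '%s', after chop %d\n",
    $len, $pos_b, $slice, length $copy;
```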
568 | |
---|
569 | =item * |
---|
570 | |
---|
571 | The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change, |
---|
572 | since they are often used for byte-oriented formats. Again, think |
---|
573 | C<char> in the C language. |
---|
574 | |
---|
575 | There is a new C<U> specifier that converts between Unicode characters |
---|
576 | and code points. |
---|
577 | |
---|
578 | =item * |
---|
579 | |
---|
580 | The C<chr()> and C<ord()> functions work on characters, similar to |
---|
581 | C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and |
---|
582 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for |
---|
583 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings. |
---|
584 | While these methods reveal the internal encoding of Unicode strings, |
---|
585 | that is not something one normally needs to care about at all. |
---|
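The correspondence between C<chr()>/C<ord()> and the C<U> specifier can be checked with a sketch like the following (the variable names are illustrative):

```perl
use strict;
use warnings;

my $from_chr  = chr(0x263A);
my $from_pack = pack("U", 0x263A);

my $via_ord      = ord($from_chr);
my ($via_unpack) = unpack("U", $from_chr);

printf "same string: %d, ord U+%04X, unpack U+%04X\n",
    ($from_chr eq $from_pack ? 1 : 0), $via_ord, $via_unpack;
```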
586 | |
---|
587 | =item * |
---|
588 | |
---|
589 | The bit string operators, C<& | ^ ~>, can operate on character data. |
---|
590 | However, for backward compatibility, such as when using bit string |
---|
591 | operations when characters are all less than 256 in ordinal value, one |
---|
592 | should not use C<~> (the bit complement) with characters of both |
---|
593 | values less than 256 and values greater than 256. Most importantly, |
---|
594 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) |
---|
595 | will not hold. The reason for this mathematical I<faux pas> is that |
---|
596 | the complement cannot return B<both> the 8-bit (byte-wide) bit |
---|
597 | complement B<and> the full character-wide bit complement. |
---|
598 | |
---|
599 | =item * |
---|
600 | |
---|
601 | lc(), uc(), lcfirst(), and ucfirst() work for the following cases: |
---|
602 | |
---|
603 | =over 8 |
---|
604 | |
---|
605 | =item * |
---|
606 | |
---|
607 | the case mapping is from a single Unicode character to another |
---|
608 | single Unicode character, or |
---|
609 | |
---|
610 | =item * |
---|
611 | |
---|
612 | the case mapping is from a single Unicode character to more |
---|
613 | than one Unicode character. |
---|
614 | |
---|
615 | =back |
---|
616 | |
---|
617 | Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work |
---|
618 | since Perl does not understand the concept of Unicode locales. |
---|
619 | |
---|
620 | See the Unicode Technical Report #21, Case Mappings, for more details. |
---|
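The one-to-many case listed above can be illustrated with a ligature whose uppercase form is two separate letters. This is a sketch with illustrative variable names.

```perl
use strict;
use warnings;

my $ligature = "\x{FB00}";   # LATIN SMALL LIGATURE FF

# The uppercase mapping expands one character into two.
my $upper = uc $ligature;

printf "before: %d character, after: %d characters (%s)\n",
    length($ligature), length($upper), $upper;
```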
621 | |
---|
622 | =back |
---|
623 | |
---|
624 | =over 4 |
---|
625 | |
---|
626 | =item * |
---|
627 | |
---|
628 | And finally, C<scalar reverse()> reverses by character rather than by byte. |
---|
629 | |
---|
630 | =back |
---|
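A short sketch of character-wise reversal, in which the multi-byte character in the middle stays intact (the variable names are illustrative):

```perl
use strict;
use warnings;

my $text     = "a\x{263A}b";
my $reversed = scalar reverse $text;

# Print code points rather than the characters themselves.
print join(" ", map { sprintf "U+%04X", ord } split //, $reversed), "\n";
```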
631 | |
---|
632 | =head2 User-Defined Character Properties |
---|
633 | |
---|
634 | You can define your own character properties by defining subroutines |
---|
635 | whose names begin with "In" or "Is". The subroutines must be defined |
---|
636 | in the C<main> package. The user-defined properties can be used in the |
---|
637 | regular expression C<\p> and C<\P> constructs. Note that the effect |
---|
638 | is compile-time and immutable once defined. |
---|
639 | |
---|
640 | The subroutines must return a specially-formatted string, with one |
---|
641 | or more newline-separated lines. Each line must be one of the following: |
---|
642 | |
---|
643 | =over 4 |
---|
644 | |
---|
645 | =item * |
---|
646 | |
---|
647 | Two hexadecimal numbers separated by horizontal whitespace (space or |
---|
648 | tab characters) denoting a range of Unicode code points to include. |
---|
649 | |
---|
650 | =item * |
---|
651 | |
---|
652 | Something to include, prefixed by "+": a built-in character |
---|
653 | property (prefixed by "utf8::"), to represent all the characters in that |
---|
654 | property; two hexadecimal code points for a range; or a single |
---|
655 | hexadecimal code point. |
---|
656 | |
---|
657 | =item * |
---|
658 | |
---|
659 | Something to exclude, prefixed by "-": an existing character |
---|
660 | property (prefixed by "utf8::"), for all the characters in that |
---|
661 | property; two hexadecimal code points for a range; or a single |
---|
662 | hexadecimal code point. |
---|
663 | |
---|
664 | =item * |
---|
665 | |
---|
666 | Something to negate, prefixed by "!": an existing character |
---|
667 | property (prefixed by "utf8::") for all the characters except the |
---|
668 | characters in the property; two hexadecimal code points for a range; |
---|
669 | or a single hexadecimal code point. |
---|
670 | |
---|
671 | =back |
---|
672 | |
---|
673 | For example, to define a property that covers both the Japanese |
---|
674 | syllabaries (hiragana and katakana), you can define |
---|
675 | |
---|
676 | sub InKana { |
---|
677 | return <<END; |
---|
678 | 3040\t309F |
---|
679 | 30A0\t30FF |
---|
680 | END |
---|
681 | } |
---|
682 | |
---|
683 | Imagine that the here-doc end marker is at the beginning of the line. |
---|
684 | Now you can use C<\p{InKana}> and C<\P{InKana}>. |
---|
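A usage sketch for the property defined above. The subroutine returns the same two ranges, written here with explicit tab and newline escapes instead of a here-document, and the test strings are chosen for illustration.

```perl
use strict;
use warnings;

# Same ranges as above: the Hiragana and Katakana blocks.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $latin_a    = "\x{0100}";   # LATIN CAPITAL LETTER A WITH MACRON

my $kana_matches     = $hiragana_a =~ /^\p{InKana}$/ ? 1 : 0;
my $non_kana_matches = $latin_a    =~ /^\P{InKana}$/ ? 1 : 0;

printf "\\p{InKana}: %d, \\P{InKana}: %d\n",
    $kana_matches, $non_kana_matches;
```

Both test strings contain characters above 255, so they are marked as Unicode character data.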
685 | |
---|
686 | You could also have used the existing block property names: |
---|
687 | |
---|
688 | sub InKana { |
---|
689 | return <<'END'; |
---|
690 | +utf8::InHiragana |
---|
691 | +utf8::InKatakana |
---|
692 | END |
---|
693 | } |
---|
694 | |
---|
695 | Suppose you wanted to match only the allocated characters, |
---|
696 | not the raw block ranges: in other words, you want to remove |
---|
697 | the non-characters: |
---|
698 | |
---|
699 | sub InKana { |
---|
700 | return <<'END'; |
---|
701 | +utf8::InHiragana |
---|
702 | +utf8::InKatakana |
---|
703 | -utf8::IsCn |
---|
704 | END |
---|
705 | } |
---|
706 | |
---|
707 | The negation is useful for defining (surprise!) negated classes. |
---|
708 | |
---|
709 | sub InNotKana { |
---|
710 | return <<'END'; |
---|
711 | !utf8::InHiragana |
---|
712 | -utf8::InKatakana |
---|
713 | +utf8::IsCn |
---|
714 | END |
---|
715 | } |
---|
716 | |
---|
717 | You can also define your own mappings to be used in the lc(), |
---|
718 | lcfirst(), uc(), and ucfirst() (or their string-inlined versions). |
---|
719 | The principle is the same: define subroutines in the C<main> package |
---|
720 | with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for |
---|
721 | the first character in ucfirst()), and C<ToUpper> (for uc(), and the |
---|
722 | rest of the characters in ucfirst()). |
---|
723 | |
---|
724 | The string returned by the subroutines must now consist of three |
---|
725 | hexadecimal numbers separated by tabs: start of the source |
---|
726 | range, end of the source range, and start of the destination range. |
---|
727 | For example: |
---|
728 | |
---|
729 | sub ToUpper { |
---|
730 | return <<END; |
---|
731 | 0061\t0063\t0041 |
---|
732 | END |
---|
733 | } |
---|
734 | |
---|
735 | defines an uc() mapping that causes only the characters "a", "b", and |
---|
736 | "c" to be mapped to "A", "B", "C"; all other characters will remain |
---|
737 | unchanged. |
---|
738 | |
---|
739 | If there is no source range to speak of, that is, the mapping is from |
---|
740 | a single character to another single character, leave the end of the |
---|
741 | source range empty, but the two tab characters are still needed. |
---|
742 | For example: |
---|
743 | |
---|
744 | sub ToLower { |
---|
745 | return <<END; |
---|
746 | 0041\t\t0061 |
---|
747 | END |
---|
748 | } |
---|
749 | |
---|
750 | defines a lc() mapping that causes only "A" to be mapped to "a"; all |
---|
751 | other characters will remain unchanged. |
---|
752 | |
---|
753 | (For serious hackers only) If you want to introspect the default |
---|
754 | mappings, you can find the data in the directory |
---|
755 | C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as |
---|
756 | the here-document, and the C<utf8::ToSpecFoo> are special exception |
---|
757 | mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>. |
---|
758 | The C<Digit> and C<Fold> mappings that one can see in the directory |
---|
759 | are not directly user-accessible; one can use either the |
---|
760 | C<Unicode::UCD> module, or just match case-insensitively (that's when |
---|
761 | the C<Fold> mapping is used). |
---|
762 | |
---|
763 | A final note on the user-defined property tests and mappings: they |
---|
764 | will be used only if the scalar has been marked as having Unicode |
---|
765 | characters. Old byte-style strings will not be affected. |
---|
766 | |
---|
767 | =head2 Character Encodings for Input and Output |
---|
768 | |
---|
769 | See L<Encode>. |
---|
770 | |
---|
771 | =head2 Unicode Regular Expression Support Level |
---|
772 | |
---|
773 | The following list of Unicode support for regular expressions describes |
---|
774 | all the features currently supported. The references to "Level N" |
---|
775 | and the section numbers refer to the Unicode Technical Report 18, |
---|
776 | "Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0, |
---|
777 | Perl 5.8.0). |
---|
778 | |
---|
779 | =over 4 |
---|
780 | |
---|
781 | =item * |
---|
782 | |
---|
783 | Level 1 - Basic Unicode Support |
---|
784 | |
---|
785 | 2.1 Hex Notation - done [1] |
---|
786 | Named Notation - done [2] |
---|
787 | 2.2 Categories - done [3][4] |
---|
788 | 2.3 Subtraction - MISSING [5][6] |
---|
789 | 2.4 Simple Word Boundaries - done [7] |
---|
790 | 2.5 Simple Loose Matches - done [8] |
---|
791 | 2.6 End of Line - MISSING [9][10] |
---|
792 | |
---|
793 | [ 1] \x{...} |
---|
794 | [ 2] \N{...} |
---|
795 | [ 3] . \p{...} \P{...} |
---|
796 | [ 4] now scripts (see UTR#24 Script Names) in addition to blocks |
---|
797 | [ 5] have negation |
---|
798 | [ 6] can use regular expression look-ahead [a] |
---|
799 | or user-defined character properties [b] to emulate subtraction |
---|
800 | [ 7] include Letters in word characters |
---|
801 | [ 8] note that Perl does Full case-folding in matching, not Simple: |
---|
802 | for example U+1F88 is equivalent to U+1F00 U+03B9, |
---|
803 | not to U+1F80. This difference matters for certain Greek |
---|
804 | capital letters with certain modifiers: the Full case-folding |
---|
805 | decomposes the letter, while the Simple case-folding would map |
---|
806 | it to a single character. |
---|
807 | [ 9] see UTR #13 Unicode Newline Guidelines |
---|
808 | [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029} |
---|
809 | (should also affect <>, $., and script line numbers) |
---|
810 | (the \x{85}, \x{2028} and \x{2029} do match \s) |
---|
811 | |
---|
812 | [a] You can mimic class subtraction using lookahead. |
---|
813 | For example, what UTR #18 might write as |
---|
814 | |
---|
815 | [{Greek}-[{UNASSIGNED}]] |
---|
816 | |
---|
817 | in Perl can be written as: |
---|
818 | |
---|
819 | (?!\p{Unassigned})\p{InGreekAndCoptic} |
---|
820 | (?=\p{Assigned})\p{InGreekAndCoptic} |
---|
821 | |
---|
822 | But in this particular example, you probably really want |
---|
823 | |
---|
824 | \p{Greek} |
---|
825 | |
---|
826 | which will match assigned characters known to be part of the Greek script. |
---|
827 | |
---|
828 | Also see the Unicode::Regex::Set module, which implements the full |
---|
829 | UTR #18 grouping, intersection, union, and removal (subtraction) syntax. |
---|
830 | |
---|
831 | [b] See L</"User-Defined Character Properties">. |
---|
832 | |
---|
833 | =item * |
---|
834 | |
---|
835 | Level 2 - Extended Unicode Support |
---|
836 | |
---|
837 | 3.1 Surrogates - MISSING [11] |
---|
838 | 3.2 Canonical Equivalents - MISSING [12][13] |
---|
839 | 3.3 Locale-Independent Graphemes - MISSING [14] |
---|
840 | 3.4 Locale-Independent Words - MISSING [15] |
---|
841 | 3.5 Locale-Independent Loose Matches - MISSING [16] |
---|
842 | |
---|
843 | [11] Surrogates are solely a UTF-16 concept and Perl's internal |
---|
844 | representation is UTF-8. The Encode module does UTF-16, though. |
---|
845 | [12] see UTR#15 Unicode Normalization |
---|
846 | [13] have Unicode::Normalize but not integrated into regexes |
---|
847 | [14] have \X but at this level . should equal that |
---|
848 | [15] need three classes, not just \w and \W |
---|
849 | [16] see UTR#21 Case Mappings |
---|
850 | |
---|
851 | =item * |
---|
852 | |
---|
853 | Level 3 - Locale-Sensitive Support |
---|
854 | |
---|
855 | 4.1 Locale-Dependent Categories - MISSING |
---|
856 | 4.2 Locale-Dependent Graphemes - MISSING [17][18] |
---|
857 | 4.3 Locale-Dependent Words - MISSING |
---|
858 | 4.4 Locale-Dependent Loose Matches - MISSING |
---|
859 | 4.5 Locale-Dependent Ranges - MISSING |
---|
860 | |
---|
861 | [17] see UTR#10 Unicode Collation Algorithms |
---|
862 | [18] have Unicode::Collate but not integrated into regexes |
---|
863 | |
---|
864 | =back |
---|
865 | |
---|
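Two of the notes above are easy to try out. The following is a minimal sketch, not a statement about how the regex engine is implemented: it checks the lookahead emulation of subtraction from note [a], and it shows that canonical equivalence (notes [12] and [13]) has to be handled by normalizing explicitly, while C<\X> (note [14]) already sees a base letter plus combining mark as one unit. The helper name C<is_assigned_greek> and the variable names are made up for illustration, the block is spelled with the explicit C<Block=> form, and U+03A2 is assumed to be unassigned, as it is in the Unicode versions known at the time of writing.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Note [a]: emulate [{Greek}-[{UNASSIGNED}]] with a negative lookahead.
# "In the Greek and Coptic block, but not an unassigned code point."
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek_And_Coptic}/;

# Hypothetical helper: true if the code point is an assigned character
# of the Greek and Coptic block.
sub is_assigned_greek {
    my ($code_point) = @_;
    return chr($code_point) =~ /\A$assigned_greek\z/ ? 1 : 0;
}

# Notes [12] and [13]: the regex engine does not apply canonical
# equivalence, so both sides are normalized before comparing.
my $composed   = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";    # "e" followed by COMBINING ACUTE ACCENT

my $raw_equal = $composed eq $decomposed           ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Note [14]: \X matches one grapheme cluster, so the two-code-point
# form still counts as a single cluster.
my @clusters = $decomposed =~ /(\X)/g;
```

With these definitions, C<is_assigned_greek(0x03B1)> (GREEK SMALL LETTER ALPHA) is true, while the hole at U+03A2 and anything outside the block, such as U+0041, are rejected.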
866 | =head2 Unicode Encodings |
---|
867 | |
---|
868 | Unicode characters are assigned to I<code points>, which are abstract |
---|
869 | numbers. To use these numbers, various encodings are needed. |
---|
870 | |
---|
871 | =over 4 |
---|
872 | |
---|
873 | =item * |
---|
874 | |
---|
875 | UTF-8 |
---|
876 | |
---|
877 | UTF-8 is a variable-length (1 to 6 bytes; current character allocations |
---|
878 | require at most 4 bytes), byte-order independent encoding. For ASCII (and we |
---|
879 | really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is |
---|
880 | transparent. |
---|
881 | |
---|
882 | The following table is from Unicode 3.2. |
---|
883 | |
---|
884 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
---|
885 | |
---|
886 | U+0000..U+007F 00..7F |
---|
887 | U+0080..U+07FF C2..DF 80..BF |
---|
888 | U+0800..U+0FFF E0 A0..BF 80..BF |
---|
889 | U+1000..U+CFFF E1..EC 80..BF 80..BF |
---|
890 | U+D000..U+D7FF ED 80..9F 80..BF |
---|
891 | U+D800..U+DFFF ******* ill-formed ******* |
---|
892 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
---|
893 | U+10000..U+3FFFF F0 90..BF 80..BF 80..BF |
---|
894 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF |
---|
895 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF |
---|
896 | |
---|
897 | Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in |
---|
898 | C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the |
---|
899 | C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal |
---|
900 | UTF-8 avoiding non-shortest encodings: it is technically possible to |
---|
901 | UTF-8-encode a single code point in different ways, but that is |
---|
902 | explicitly forbidden, and the shortest possible encoding should always |
---|
903 | be used. So that's what Perl does. |
---|
904 | |
---|
905 | Another way to look at it is via bits: |
---|
906 | |
---|
907 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
---|
908 | |
---|
909 | 0aaaaaaa 0aaaaaaa |
---|
910 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa |
---|
911 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa |
---|
912 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa |
---|
913 | |
---|
914 | As you can see, the continuation bytes all begin with C<10>, and the |
---|
915 | leading bits of the start byte tell how many bytes there are in the |
---|
916 | encoded character. |
---|
917 | |
---|
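The bit layout above can be turned into a few lines of Perl. This is only an illustrative sketch for code points up to U+10FFFF, and it makes no attempt to reject surrogates. The names C<utf8_bytes>, C<perl_utf8_bytes>, and C<as_hex> are made up for this example. Perl's own encoder is reached here through C<utf8::encode>, so the hand-rolled version can be compared with it.

```perl
use strict;
use warnings;

# Encode one code point by hand, following the bit layout in the table.
sub utf8_bytes {
    my ($cp) = @_;
    die "code point out of range" if $cp < 0 || $cp > 0x10FFFF;
    return ($cp) if $cp < 0x80;                       # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                        # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # 10aaaaaa
    return (0xE0 | ($cp >> 12),                       # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;    # 10aaaaaa
    return (0xF0 | ($cp >> 18),                       # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),              # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F));                     # 10aaaaaa
}

# Let Perl encode the same code point, for comparison.
sub perl_utf8_bytes {
    my ($cp) = @_;
    my $string = chr($cp);
    utf8::encode($string);          # characters to UTF-8 octets, in place
    return unpack 'C*', $string;
}

# Format a list of byte values as space-separated hex.
sub as_hex {
    return join ' ', map { sprintf '%02X', $_ } @_;
}
```

For example, C<as_hex(utf8_bytes(0x20AC))> gives C<E2 82 AC> for the EURO SIGN, and C<perl_utf8_bytes(0x20AC)> yields the same three bytes.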
918 | =item * |
---|
919 | |
---|
920 | UTF-EBCDIC |
---|
921 | |
---|
922 | Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
---|
923 | |
---|
924 | =item * |
---|
925 | |
---|
926 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) |
---|
927 | |
---|
928 | The following items are mostly for reference and general Unicode |
---|
929 | knowledge; Perl doesn't use these constructs internally. |
---|
930 | |
---|
931 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points |
---|
932 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code |
---|
933 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is |
---|
934 | using I<surrogates>, the first 16-bit unit being the I<high |
---|
935 | surrogate>, and the second being the I<low surrogate>. |
---|
936 | |
---|
937 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
---|
938 | range of Unicode code points in pairs of 16-bit units. The I<high |
---|
939 | surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates> |
---|
940 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
---|
941 | |
---|
942 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
---|
943 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; |
---|
944 | |
---|
945 | and the decoding is |
---|
946 | |
---|
947 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
---|
948 | |
---|
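The two formulas can be wrapped into a pair of helpers and checked against each other. This is a sketch only, and the names C<to_surrogates> and C<from_surrogates> are invented for the example. Note the explicit C<int()>: Perl's C</> is floating-point division unless C<use integer> is in effect.

```perl
use strict;
use warnings;

# Split a supplementary code point into (high, low) surrogate values.
sub to_surrogates {
    my ($uni) = @_;
    die "not in U+10000..U+10FFFF" if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # top 10 bits
    my $lo = $offset % 0x400 + 0xDC00;        # bottom 10 bits
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    die "not a high surrogate" if $hi < 0xD800 || $hi > 0xDBFF;
    die "not a low surrogate"  if $lo < 0xDC00 || $lo > 0xDFFF;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF), C<to_surrogates(0x1D11E)> returns C<(0xD834, 0xDD1E)>, and C<from_surrogates(0xD834, 0xDD1E)> gives back C<0x1D11E>.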
949 | If you try to generate surrogates (for example by using chr()), you |
---|
950 | will get a warning if warnings are turned on, because those code |
---|
951 | points are not valid for a Unicode character. |
---|
952 | |
---|
953 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
---|
954 | itself can be used for in-memory computations, but if storage or |
---|
955 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
---|
956 | (little-endian) encodings must be chosen. |
---|
957 | |
---|
958 | This introduces another problem: what if you just know that your data |
---|
959 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
---|
960 | BOMs, are a solution to this. A special character has been reserved |
---|
961 | in Unicode to function as a byte order marker: the character with the |
---|
962 | code point C<U+FEFF> is the BOM. |
---|
963 | |
---|
964 | The trick is that if you read a BOM, you will know the byte order, |
---|
965 | since if it was written on a big-endian platform, you will read the |
---|
966 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, |
---|
967 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform |
---|
968 | was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.) |
---|
969 | |
---|
970 | The way this trick works is that the character with the code point |
---|
971 | C<U+FFFE> is guaranteed not to be a valid Unicode character, so the |
---|
972 | sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in |
---|
973 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
---|
974 | format". |
---|
975 | |
---|
976 | =item * |
---|
977 | |
---|
978 | UTF-32, UTF-32BE, UTF-32LE |
---|
979 | |
---|
980 | The UTF-32 family is pretty much like the UTF-16 family, except that |
---|
981 | the units are 32-bit, and therefore the surrogate scheme is not |
---|
982 | needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and |
---|
983 | C<0xFF 0xFE 0x00 0x00> for LE. |
---|
984 | |
---|
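The BOM signatures given for the UTF-16 and UTF-32 families, plus the UTF-8 one, can be checked with a small byte-level test. The sketch below assumes the input is a byte string holding the start of the data; the function name C<sniff_bom> is made up for this example. The UTF-32 signatures are tested first because C<0xFF 0xFE 0x00 0x00> also begins with the UTF-16LE signature. That sequence is genuinely ambiguous (it could be a UTF-16LE BOM followed by U+0000), and preferring UTF-32LE is a choice made here, not a rule from the text above.

```perl
use strict;
use warnings;

# Guess the encoding from a leading byte order mark, if there is one.
# Returns an encoding name, or nothing when no BOM is recognized.
sub sniff_bom {
    my ($octets) = @_;
    return 'UTF-32BE' if $octets =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $octets =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;
    return;
}
```

Data without a recognized BOM yields an empty return value, in which case the encoding has to be known by other means.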
985 | =item * |
---|
986 | |
---|
987 | UCS-2, UCS-4 |
---|
988 | |
---|
989 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
---|
990 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
---|
991 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
---|
992 | functionally identical to UTF-32. |
---|
993 | |
---|
994 | =item * |
---|
995 | |
---|
996 | UTF-7 |
---|
997 | |
---|
998 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
---|
999 | transport or storage is not eight-bit safe. Defined by RFC 2152. |
---|
1000 | |
---|
1001 | =back |
---|
1002 | |
---|
1003 | =head2 Security Implications of Unicode |
---|
1004 | |
---|
1005 | =over 4 |
---|
1006 | |
---|
1007 | =item * |
---|
1008 | |
---|
1009 | Malformed UTF-8 |
---|
1010 | |
---|
1011 | Unfortunately, the specification of UTF-8 leaves some room for |
---|
1012 | interpretation of how many bytes of encoded output one should generate |
---|
1013 | from one input Unicode character. Strictly speaking, the shortest |
---|
1014 | possible sequence of UTF-8 bytes should be generated, |
---|
1015 | because otherwise there is potential for an input buffer overflow at |
---|
1016 | the receiving end of a UTF-8 connection. Perl always generates the |
---|
1017 | shortest length UTF-8, and with warnings on Perl will warn about |
---|
1018 | non-shortest length UTF-8 along with other malformations, such as the |
---|
1019 | surrogates, which are not real Unicode code points. |
---|
1020 | |
---|
1021 | =item * |
---|
1022 | |
---|
1023 | Regular expressions behave slightly differently between byte data and |
---|
1024 | character (Unicode) data. For example, the "word character" character |
---|
1025 | class C<\w> will work differently depending on whether the data is eight-bit bytes |
---|
1026 | or Unicode. |
---|
1027 | |
---|
1028 | In the first case, the set of C<\w> characters is either small--the |
---|
1029 | default set of alphabetic characters, digits, and the "_"--or, if you |
---|
1030 | are using a locale (see L<perllocale>), the C<\w> might contain a few |
---|
1031 | more letters according to your language and country. |
---|
1032 | |
---|
1033 | In the second case, the C<\w> set of characters is much, much larger. |
---|
1034 | Most importantly, even in the set of the first 256 characters, it will |
---|
1035 | probably match different characters: unlike most locales, which are |
---|
1036 | specific to a language and country pair, Unicode classifies all the |
---|
1037 | characters that are letters I<somewhere> as C<\w>. For example, your |
---|
1038 | locale might not think that LATIN SMALL LETTER ETH is a letter (unless |
---|
1039 | you happen to speak Icelandic), but Unicode does. |
---|
1040 | |
---|
1041 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
---|
1042 | each of two worlds: the old world of bytes and the new world of |
---|
1043 | characters, upgrading from bytes to characters when necessary. |
---|
1044 | If your legacy code does not explicitly use Unicode, no automatic |
---|
1045 | switch-over to characters should happen. Characters shouldn't get |
---|
1046 | downgraded to bytes, either. It is possible to accidentally mix bytes |
---|
1047 | and characters, however (see L<perluniintro>), in which case C<\w> in |
---|
1048 | regular expressions might start behaving differently. Review your |
---|
1049 | code. Use warnings and the C<strict> pragma. |
---|
1050 | |
---|
1051 | =back |
---|
1052 | |
---|
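The C<\w> difference described above can be observed directly. The sketch below assumes Perl's default matching rules: no C<use locale>, no C<use feature 'unicode_strings'>, and no C</u> or C</a> modifier, any of which would change the outcome. The variable names are arbitrary.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER ETH, once stored as a plain byte and once as
# character (UTF-8 encoded) data.
my $bytes = "\xF0";
my $chars = "\xF0";
utf8::upgrade($chars);      # same character, different internal form

# The two scalars compare equal as strings...
my $same_text = $bytes eq $chars ? 1 : 0;

# ...but \w is decided by byte rules for one and Unicode rules for
# the other.
my $byte_is_word = $bytes =~ /\w/ ? 1 : 0;
my $char_is_word = $chars =~ /\w/ ? 1 : 0;
```

Under those assumptions C<$same_text> is 1, C<$byte_is_word> is 0, and C<$char_is_word> is 1, which is exactly the kind of surprise that accidentally mixing bytes and characters can produce.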
1053 | =head2 Unicode in Perl on EBCDIC |
---|
1054 | |
---|
1055 | The way Unicode is handled on EBCDIC platforms is still |
---|
1056 | experimental. On such platforms, references to UTF-8 encoding in this |
---|
1057 | document and elsewhere should be read as meaning the UTF-EBCDIC |
---|
1058 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues |
---|
1059 | are specifically discussed. There is no C<utfebcdic> pragma or |
---|
1060 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean |
---|
1061 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> |
---|
1062 | for more discussion of the issues. |
---|
1063 | |
---|
1064 | =head2 Locales |
---|
1065 | |
---|
1066 | Usually locale settings and Unicode do not affect each other, but |
---|
1067 | there are a couple of exceptions: |
---|
1068 | |
---|
1069 | =over 4 |
---|
1070 | |
---|
1071 | =item * |
---|
1072 | |
---|
1073 | You can enable automatic UTF-8-ification of your standard file |
---|
1074 | handles, default C<open()> layer, and C<@ARGV> by using either |
---|
1075 | the C<-C> command line switch or the C<PERL_UNICODE> environment |
---|
1076 | variable; see L<perlrun> for the documentation of the C<-C> switch. |
---|
1077 | |
---|
1078 | =item * |
---|
1079 | |
---|
1080 | Perl tries really hard to work both with Unicode and the old |
---|
1081 | byte-oriented world. Most often this is nice, but sometimes Perl's |
---|
1082 | straddling of the proverbial fence causes problems. |
---|
1083 | |
---|
1084 | =back |
---|
1085 | |
---|
1086 | =head2 When Unicode Does Not Happen |
---|
1087 | |
---|
1088 | While Perl does have extensive ways to input and output in Unicode, |
---|
1089 | and a few other 'entry points' like C<@ARGV> which can be interpreted |
---|
1090 | as Unicode (UTF-8), there still are many places where Unicode (in some |
---|
1091 | encoding or another) could be given as arguments or received as |
---|
1092 | results, or both, but it is not. |
---|
1093 | |
---|
1094 | The following are such interfaces. For all of these interfaces Perl |
---|
1095 | currently (as of 5.8.3) simply assumes byte strings both as arguments |
---|
1096 | and results, or UTF-8 strings if the C<encoding> pragma has been used. |
---|
1097 | |
---|
1098 | One reason why Perl does not attempt to resolve the role of Unicode in |
---|
1099 | these cases is that the answers are highly dependent on the operating |
---|
1100 | system and the file system(s). For example, whether filenames can be |
---|
1101 | in Unicode, and in exactly what kind of encoding, is not exactly a |
---|
1102 | portable concept. Similarly for C<qx> and C<system>: how well will the |
---|
1103 | 'command line interface' (and which of them?) handle Unicode? |
---|
1104 | |
---|
1105 | =over 4 |
---|
1106 | |
---|
1107 | =item * |
---|
1108 | |
---|
1109 | chdir, chmod, chown, chroot, exec, link, lstat, mkdir, |
---|
1110 | rename, rmdir, stat, symlink, truncate, unlink, utime, -X |
---|
1111 | |
---|
1112 | =item * |
---|
1113 | |
---|
1114 | %ENV |
---|
1115 | |
---|
1116 | =item * |
---|
1117 | |
---|
1118 | glob (aka the <*>) |
---|
1119 | |
---|
1120 | =item * |
---|
1121 | |
---|
1122 | open, opendir, sysopen |
---|
1123 | |
---|
1124 | =item * |
---|
1125 | |
---|
1126 | qx (aka the backtick operator), system |
---|
1127 | |
---|
1128 | =item * |
---|
1129 | |
---|
1130 | readdir, readlink |
---|
1131 | |
---|
1132 | =back |
---|
1133 | |
---|
1134 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
---|
1135 | |
---|
1136 | Sometimes (see L</"When Unicode Does Not Happen">) there are |
---|
1137 | situations where you simply need to force a byte string into |
---|
1138 | UTF-8, or vice versa. The low-level calls |
---|
1139 | utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are |
---|
1140 | the answers. |
---|
1141 | |
---|
1142 | Do not use them without careful thought, though: Perl may easily get |
---|
1143 | very confused, angry, or even crash, if you suddenly change the 'nature' |
---|
1144 | of a scalar like that. Be especially careful if you use |
---|
1145 | utf8::upgrade(): not every byte string is valid UTF-8. |
---|
1146 | |
---|
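To see what the two calls do to a scalar, here is a small sketch. It uses C<utf8::is_utf8()> to inspect the internal flag and the optional second argument of C<utf8::downgrade()> (FAIL_OK) to avoid dying on failure. The variable names are arbitrary, and the behaviour shown is what current Perls do.

```perl
use strict;
use warnings;

my $string = "caf\xE9";                 # four characters, stored as bytes
my $flag_before = utf8::is_utf8($string) ? 1 : 0;

utf8::upgrade($string);                 # same characters, stored as UTF-8
my $flag_after = utf8::is_utf8($string) ? 1 : 0;
my $unchanged  = ($string eq "caf\xE9" && length($string) == 4) ? 1 : 0;

utf8::downgrade($string);               # back to one byte per character
my $flag_final = utf8::is_utf8($string) ? 1 : 0;

# A character above 0xFF cannot be stored in one byte.  With a true
# second argument the call returns false instead of dying.
my $wide = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;
```

The flag goes from off to on and back to off, while the string's value as seen from Perl code stays the same throughout. The attempt to downgrade C<$wide> fails and leaves it untouched.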
1147 | =head2 Using Unicode in XS |
---|
1148 | |
---|
1149 | If you want to handle Perl Unicode in XS extensions, you may find the |
---|
1150 | following C APIs useful. See also L<perlguts/"Unicode Support"> for an |
---|
1151 | explanation about Unicode at the XS level, and L<perlapi> for the API |
---|
1152 | details. |
---|
1153 | |
---|
1154 | =over 4 |
---|
1155 | |
---|
1156 | =item * |
---|
1157 | |
---|
1158 | C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes |
---|
1159 | pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8> |
---|
1160 | flag is on; the bytes pragma is ignored. The C<UTF8> flag being on |
---|
1161 | does B<not> mean that there are any characters of code points greater |
---|
1162 | than 255 (or 127) in the scalar or that there are even any characters |
---|
1163 | in the scalar. What the C<UTF8> flag means is that the sequence of |
---|
1164 | octets in the representation of the scalar is the sequence of UTF-8 |
---|
1165 | encoded code points of the characters of a string. The C<UTF8> flag |
---|
1166 | being off means that each octet in this representation encodes a |
---|
1167 | single character with code point 0..255 within the string. Perl's |
---|
1168 | Unicode model is not to use UTF-8 until it is absolutely necessary. |
---|
1169 | |
---|
1170 | =item * |
---|
1171 | |
---|
1172 | C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into |
---|
1173 | a buffer encoding the code point as UTF-8, and returns a pointer |
---|
1174 | pointing after the UTF-8 bytes. |
---|
1175 | |
---|
1176 | =item * |
---|
1177 | |
---|
1178 | C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and |
---|
1179 | returns the Unicode character code point and, optionally, the length of |
---|
1180 | the UTF-8 byte sequence. |
---|
1181 | |
---|
1182 | =item * |
---|
1183 | |
---|
1184 | C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer |
---|
1185 | in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded |
---|
1186 | scalar, likewise in characters. |
---|
1187 | |
---|
1188 | =item * |
---|
1189 | |
---|
1190 | C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8 |
---|
1191 | encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if |
---|
1192 | possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that |
---|
1193 | it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the |
---|
1194 | opposite of C<sv_utf8_encode()>. Note that none of these are to be |
---|
1195 | used as general-purpose encoding or decoding interfaces: C<use Encode> |
---|
1196 | for that. C<sv_utf8_upgrade()> is affected by the encoding pragma |
---|
1197 | but C<sv_utf8_downgrade()> is not (since the encoding pragma is |
---|
1198 | designed to be a one-way street). |
---|
1199 | |
---|
1200 | =item * |
---|
1201 | |
---|
1202 | C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8 |
---|
1203 | character. |
---|
1204 | |
---|
1205 | =item * |
---|
1206 | |
---|
1207 | C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer |
---|
1208 | are valid UTF-8. |
---|
1209 | |
---|
1210 | =item * |
---|
1211 | |
---|
1212 | C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded |
---|
1213 | character in the buffer. C<UNISKIP(chr)> will return the number of bytes |
---|
1214 | required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()> |
---|
1215 | is useful, for example, for iterating over the characters of a UTF-8 |
---|
1216 | encoded buffer; C<UNISKIP()> is useful, for example, in computing |
---|
1217 | the size required for a UTF-8 encoded buffer. |
---|
1218 | |
---|
1219 | =item * |
---|
1220 | |
---|
1221 | C<utf8_distance(a, b)> will tell the distance in characters between the |
---|
1222 | two pointers pointing to the same UTF-8 encoded buffer. |
---|
1223 | |
---|
1224 | =item * |
---|
1225 | |
---|
1226 | C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer |
---|
1227 | that is C<off> (positive or negative) Unicode characters displaced |
---|
1228 | from the UTF-8 buffer C<s>. Be careful not to overstep the buffer: |
---|
1229 | C<utf8_hop()> will merrily run off the end or the beginning of the |
---|
1230 | buffer if told to do so. |
---|
1231 | |
---|
1232 | =item * |
---|
1233 | |
---|
1234 | C<pv_uni_display(dsv, spv, len, pvlim, flags)> and |
---|
1235 | C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the |
---|
1236 | output of Unicode strings and scalars. By default they are useful |
---|
1237 | only for debugging--they display B<all> characters as hexadecimal code |
---|
1238 | points--but with the flags C<UNI_DISPLAY_ISPRINT>, |
---|
1239 | C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the |
---|
1240 | output more readable. |
---|
1241 | |
---|
1242 | =item * |
---|
1243 | |
---|
1244 | C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to |
---|
1245 | compare two strings case-insensitively in Unicode. For case-sensitive |
---|
1246 | comparisons you can just use C<memEQ()> and C<memNE()> as usual. |
---|
1247 | |
---|
1248 | =back |
---|
1249 | |
---|
1250 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h> |
---|
1251 | in the Perl source code distribution. |
---|
1252 | |
---|
1253 | =head1 BUGS |
---|
1254 | |
---|
1255 | =head2 Interaction with Locales |
---|
1256 | |
---|
1257 | Use of locales with Unicode data may lead to odd results. Currently, |
---|
1258 | Perl attempts to attach 8-bit locale info to characters in the range |
---|
1259 | 0..255, but this technique is demonstrably incorrect for locales that |
---|
1260 | use characters above that range when mapped into Unicode. Perl's |
---|
1261 | Unicode support will also tend to run slower. Use of locales with |
---|
1262 | Unicode is discouraged. |
---|
1263 | |
---|
1264 | =head2 Interaction with Extensions |
---|
1265 | |
---|
1266 | When Perl exchanges data with an extension, the extension should be |
---|
1267 | able to understand the UTF-8 flag and act accordingly. If the |
---|
1268 | extension doesn't know about the flag, it's likely that the extension |
---|
1269 | will return incorrectly-flagged data. |
---|
1270 | |
---|
1271 | So if you're working with Unicode data, consult the documentation of |
---|
1272 | every module you're using to see whether there are any issues with Unicode data |
---|
1273 | exchange. If the documentation does not talk about Unicode at all, |
---|
1274 | suspect the worst and probably look at the source to learn how the |
---|
1275 | module is implemented. Modules written completely in Perl shouldn't |
---|
1276 | cause problems. Modules that directly or indirectly access code written |
---|
1277 | in other programming languages are at risk. |
---|
1278 | |
---|
1279 | For affected functions, the simple strategy to avoid data corruption is |
---|
1280 | to always make the encoding of the exchanged data explicit. Choose an |
---|
1281 | encoding that you know the extension can handle. Convert arguments passed |
---|
1282 | to the extensions to that encoding and convert results back from that |
---|
1283 | encoding. Write wrapper functions that do the conversions for you, so |
---|
1284 | you can later change the functions when the extension catches up. |
---|
1285 | |
---|
1286 | To provide an example, let's say the popular Foo::Bar::escape_html |
---|
1287 | function doesn't deal with Unicode data yet. The wrapper function |
---|
1288 | would convert the argument to raw UTF-8 and convert the result back to |
---|
1289 | Perl's internal representation like so: |
---|
1290 | |
---|
1291 | sub my_escape_html ($) { |
---|
1292 | my($what) = shift; |
---|
1293 | return unless defined $what; |
---|
1294 | Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what))); |
---|
1295 | } |
---|
1296 | |
---|
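Since C<Foo::Bar::escape_html> is only a placeholder, the same wrapper pattern can be tried with a stand-in. In the sketch below, C<legacy_escape_html> is a hypothetical function written for this example to play the role of an extension that only understands octets. Only the C<Encode::encode_utf8> and C<Encode::decode_utf8> calls are real API.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that works on octets
# and knows nothing about Perl's character strings.
sub legacy_escape_html {
    my ($octets) = @_;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Wrapper: characters in, characters out, explicit UTF-8 in between.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    my $octets  = Encode::encode_utf8($what);      # characters to octets
    my $escaped = legacy_escape_html($octets);     # octet-only processing
    return Encode::decode_utf8($escaped);          # octets to characters
}
```

Calling C<my_escape_html("\x{263A} E<lt>bE<gt>")> returns a character string in which the smiley is still one character and the angle brackets have been escaped.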
1297 | Sometimes, when the extension does not convert data but just stores |
---|
1298 | and retrieves them, you will be in a position to use the otherwise |
---|
1299 | dangerous Encode::_utf8_on() function. Let's say the popular |
---|
1300 | C<Foo::Bar> extension, written in C, provides a C<param> method that |
---|
1301 | lets you store and retrieve data according to these prototypes: |
---|
1302 | |
---|
1303 | $self->param($name, $value); # set a scalar |
---|
1304 | $value = $self->param($name); # retrieve a scalar |
---|
1305 | |
---|
1306 | If it does not yet provide support for any encoding, one could write a |
---|
1307 | derived class with such a C<param> method: |
---|
1308 | |
---|
1309 | sub param { |
---|
1310 | my($self,$name,$value) = @_; |
---|
1311 | utf8::upgrade($name); # make sure it is UTF-8 encoded |
---|
1312 | if (defined $value) { |
---|
1313 | utf8::upgrade($value); # make sure it is UTF-8 encoded |
---|
1314 | return $self->SUPER::param($name,$value); |
---|
1315 | } else { |
---|
1316 | my $ret = $self->SUPER::param($name); |
---|
1317 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded |
---|
1318 | return $ret; |
---|
1319 | } |
---|
1320 | } |
---|
1321 | |
---|
1322 | Some extensions provide filters on data entry/exit points, such as |
---|
1323 | DB_File::filter_store_key and family. Look out for such filters in |
---|
1324 | the documentation of your extensions; they can make the transition to |
---|
1325 | Unicode data much easier. |
---|
1326 | |
---|
1327 | =head2 Speed |
---|
1328 | |
---|
1329 | Some functions are slower when working on UTF-8 encoded strings than |
---|
1330 | on byte encoded strings. All functions that need to hop over |
---|
1331 | characters such as length(), substr() or index(), or matching regular |
---|
1332 | expressions can work B<much> faster when the underlying data are |
---|
1333 | byte-encoded. |
---|
1334 | |
---|
1335 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 |
---|
1336 | a caching scheme was introduced which will hopefully make the slowness |
---|
1337 | somewhat less spectacular, at least for some operations. In general, |
---|
1338 | operations with UTF-8 encoded strings are still slower. As an example, |
---|
1339 | the Unicode properties (character classes) like C<\p{Nd}> are known to |
---|
1340 | be quite a bit slower (5-20 times) than their simpler counterparts |
---|
1341 | like C<\d> (then again, there are 268 Unicode characters matching C<Nd> |
---|
1342 | compared with the 10 ASCII characters matching C<\d>). |
---|
1343 | |
---|
1344 | =head2 Porting code from perl-5.6.X |
---|
1345 | |
---|
1346 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer |
---|
1347 | was required to use the C<utf8> pragma to declare that a given scope |
---|
1348 | expected to deal with Unicode data and had to make sure that only |
---|
1349 | Unicode data were reaching that scope. If you have code that is |
---|
1350 | working with 5.6, you will need some of the following adjustments to |
---|
1351 | your code. The examples are written such that the code will continue |
---|
1352 | to work under 5.6, so you should be safe to try them out. |
---|
1353 | |
---|
1354 | =over 4 |
---|
1355 | |
---|
1356 | =item * |
---|
1357 | |
---|
1358 | A filehandle that should read or write UTF-8 |
---|
1359 | |
---|
1360 | if ($] > 5.007) { |
---|
1361 | binmode $fh, ":utf8"; |
---|
1362 | } |
---|
1363 | |
---|
1364 | =item * |
---|
1365 | |
---|
1366 | A scalar that is going to be passed to some extension |
---|
1367 | |
---|
1368 | Be it Compress::Zlib, Apache::Request or any extension that has no |
---|
1369 | mention of Unicode in the manpage, you need to make sure that the |
---|
1370 | UTF-8 flag is stripped off. Note that at the time of this writing |
---|
1371 | (October 2002) the mentioned modules are not UTF-8-aware. Please |
---|
1372 | check the documentation to verify if this is still true. |
---|
1373 | |
---|
1374 | if ($] > 5.007) { |
---|
1375 | require Encode; |
---|
1376 | $val = Encode::encode_utf8($val); # make octets |
---|
1377 | } |
---|
1378 | |
---|
1379 | =item * |
---|
1380 | |
---|
1381 | A scalar we got back from an extension |
---|
1382 | |
---|
1383 | If you believe the scalar comes back as UTF-8, you will most likely |
---|
1384 | want the UTF-8 flag restored: |
---|
1385 | |
---|
1386 | if ($] > 5.007) { |
---|
1387 | require Encode; |
---|
1388 | $val = Encode::decode_utf8($val); |
---|
1389 | } |
---|
1390 | |
---|
1391 | =item * |
---|
1392 | |
---|
1393 | Same thing, if you are really sure it is UTF-8 |
---|
1394 | |
---|
1395 | if ($] > 5.007) { |
---|
1396 | require Encode; |
---|
1397 | Encode::_utf8_on($val); |
---|
1398 | } |
---|
1399 | |
---|
1400 | =item * |
---|
1401 | |
---|
1402 | A wrapper for fetchrow_array and fetchrow_hashref |
---|
1403 | |
---|
1404 | When the database contains only UTF-8, a wrapper function or method is |
---|
1405 | a convenient way to replace all your fetchrow_array and |
---|
1406 | fetchrow_hashref calls. A wrapper function will also make it easier to |
---|
1407 | adapt to future enhancements in your database driver. Note that at the |
---|
1408 | time of this writing (October 2002), the DBI has no standardized way |
---|
1409 | to deal with UTF-8 data. Please check the documentation to verify if |
---|
1410 | that is still true. |
---|
1411 | |
---|
1412 | sub fetchrow { |
---|
1413 | my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref} |
---|
1414 | if ($] < 5.007) { |
---|
1415 | return $sth->$what; |
---|
1416 | } else { |
---|
1417 | require Encode; |
---|
1418 | if (wantarray) { |
---|
1419 | my @arr = $sth->$what; |
---|
1420 | for (@arr) { |
---|
1421 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); |
---|
1422 | } |
---|
1423 | return @arr; |
---|
1424 | } else { |
---|
1425 | my $ret = $sth->$what; |
---|
1426 | if (ref $ret) { |
---|
1427 | for my $k (keys %$ret) { |
---|
1428 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k}; |
---|
1429 | } |
---|
1430 | return $ret; |
---|
1431 | } else { |
---|
1432 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; |
---|
1433 | return $ret; |
---|
1434 | } |
---|
1435 | } |
---|
1436 | } |
---|
1437 | } |
---|
1438 | |
---|
1439 | |
---|
1440 | =item * |
---|
1441 | |
---|
1442 | A large scalar that you know can only contain ASCII |
---|
1443 | |
---|
1444 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes |
---|
1445 | a drag to your program. If you recognize such a situation, just remove |
---|
1446 | the UTF-8 flag: |
---|
1447 | |
---|
1448 | utf8::downgrade($val) if $] > 5.007; |
---|
1449 | |
---|
1450 | =back |
---|
1451 | |
---|
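The first adjustment in the list above, the guarded C<binmode>, can be tried without touching the disk by writing to an in-memory filehandle. This is a sketch under the assumption that in-memory handles are available (they are on Perl 5.8 and later); the variable names are arbitrary.

```perl
use strict;
use warnings;

# Write one wide character through a filehandle that has the :utf8
# layer, then look at the octets that actually arrived.
my $octets = '';
open(my $fh, '>', \$octets) or die "cannot open in-memory handle: $!";
binmode($fh, ':utf8') if $] > 5.007;
print {$fh} "\x{263A}";                 # WHITE SMILING FACE
close($fh) or die "close failed: $!";

my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
```

One character went in and three octets came out: C<$hex> holds C<E2 98 BA>, the UTF-8 encoding of U+263A.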
1452 | =head1 SEE ALSO |
---|
1453 | |
---|
1454 | L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
---|
1455 | L<perlretut>, L<perlvar/"${^UNICODE}"> |
---|
1456 | |
---|
1457 | =cut |
---|