source: trunk/third/perl/pod/perlunicode.pod @ 20075

Revision 20075, 47.8 KB checked in by zacheiss, 21 years ago
This commit was generated by cvs2svn to compensate for changes in r20074, which included commits to RCS files with non-trunk default branches.
1=head1 NAME
2
3perlunicode - Unicode support in Perl
4
5=head1 DESCRIPTION
6
7=head2 Important Caveats
8
9Unicode support is an extensive requirement. While Perl does not
10implement the Unicode standard or the accompanying technical reports
11from cover to cover, Perl does support many Unicode features.
12
13=over 4
14
15=item Input and Output Layers
16
17Perl knows when a filehandle uses Perl's internal Unicode encodings
18(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
19the ":utf8" layer.  Other encodings can be converted to Perl's
20encoding on input or from Perl's encoding on output by use of the
21":encoding(...)"  layer.  See L<open>.
22
23To indicate that Perl source itself is using a particular encoding,
24see L<encoding>.
25
26=item Regular Expressions
27
28The regular expression compiler produces polymorphic opcodes.  That is,
29the pattern adapts to the data and automatically switches to the Unicode
30character scheme when presented with Unicode data--or instead uses
31a traditional byte scheme when presented with byte data.
32
33=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
34
35As a compatibility measure, the C<use utf8> pragma must be explicitly
36included to enable recognition of UTF-8 in the Perl scripts themselves
37(in string or regular expression literals, or in identifier names) on
38ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
39machines.  B<These are the only times when an explicit C<use utf8>
40is needed.>  See L<utf8>.
41
42You can also use the C<encoding> pragma to change the default encoding
43of the data in your script; see L<encoding>.
44
45=item C<use encoding> needed to upgrade non-Latin-1 byte strings
46
By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
code points in Unicode happen to agree with Latin-1.
52
53If you wish to interpret byte strings as UTF-8 instead, use the
54C<encoding> pragma:
55
56    use encoding 'utf8';
57
58See L</"Byte and Character Semantics"> for more details.
59
60=back
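
The layers described under "Input and Output Layers" can be tried out
with a short script.  The following is only a minimal sketch: the
scratch file comes from the core C<File::Temp> module, and the variable
names are made up for illustration.  It writes a string through an
C<:encoding(UTF-8)> layer, then reads the file back once with decoding
and once as raw bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file; its name is whatever File::Temp picks.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my $text = "smiley: \x{263A}";    # 9 characters, one of them above 0xFF

# Encode to UTF-8 on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Decode from UTF-8 on input: the characters come back.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $decoded = do { local $/; <$in> };
close $in;

# Read the same file without decoding: the encoded bytes are visible.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $octets = do { local $/; <$raw> };
close $raw;

printf "%d characters, %d bytes on disk\n", length($decoded), length($octets);
```

The decoded string compares equal to the original, while the raw read
is two bytes longer because U+263A occupies three bytes in UTF-8.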
61
62=head2 Byte and Character Semantics
63
64Beginning with version 5.6, Perl uses logically-wide characters to
65represent strings internally.
66
67In future, Perl-level operations will be expected to work with
68characters rather than bytes.
69
70However, as an interim compatibility measure, Perl aims to
71provide a safe migration path from byte semantics to character
72semantics for programs.  For operations where Perl can unambiguously
73decide that the input data are characters, Perl switches to
74character semantics.  For operations where this determination cannot
75be made without additional information from the user, Perl decides in
76favor of compatibility and chooses to use byte semantics.
77
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
84
85The C<bytes> pragma will always, regardless of platform, force byte
86semantics in a particular lexical scope.  See L<bytes>.
87
88The C<utf8> pragma is primarily a compatibility device that enables
89recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
90Note that this pragma is only required while Perl defaults to byte
91semantics; when character semantics become the default, this pragma
92may become a no-op.  See L<utf8>.
93
94Unless explicitly stated, Perl operators use character semantics
95for Unicode data and byte semantics for non-Unicode data.
96The decision to use character semantics is made transparently.  If
97input data comes from a Unicode source--for example, if a character
98encoding layer is added to a filehandle or a literal Unicode
99string constant appears in a program--character semantics apply.
100Otherwise, byte semantics are in effect.  The C<bytes> pragma should
101be used to force byte semantics on Unicode data.
102
103If strings operating under byte semantics and strings with Unicode
104character data are concatenated, the new string will be created by
105decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
106old Unicode string used EBCDIC.  This translation is done without
107regard to the system's native 8-bit encoding.  To change this for
108systems with non-Latin-1 and non-EBCDIC native encodings, use the
109C<encoding> pragma.  See L<encoding>.
110
111Under character semantics, many operations that formerly operated on
112bytes now operate on characters. A character in Perl is
113logically just a number ranging from 0 to 2**31 or so. Larger
114characters may encode into longer sequences of bytes internally, but
115this internal detail is mostly hidden for Perl code.
116See L<perluniintro> for more.
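
The difference between the two semantics can be observed directly.
The sketch below is illustrative only (the strings and variable names
are invented, and it assumes an ASCII-based platform): it concatenates
a byte string with a Unicode string and then compares C<length()> with
and without the C<bytes> pragma.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";      # four characters, all below 0x100
my $wide   = "\x{263A}";     # one character above 0xFF

# Concatenation upgrades the byte string as if it were Latin-1,
# so 0xE9 remains the single character U+00E9.
my $joined = $latin1 . $wide;

my $char_count = length($joined);               # counted in characters
my $fourth     = ord(substr($joined, 3, 1));    # still 0xE9

# Inside "use bytes", length() reports the internal byte count instead.
my $byte_count = do { use bytes; length($wide) };

printf "%d characters; 4th is U+%04X; smiley occupies %d bytes internally\n",
    $char_count, $fourth, $byte_count;
```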
117
118=head2 Effects of Character Semantics
119
120Character semantics have the following effects:
121
122=over 4
123
124=item *
125
126Strings--including hash keys--and regular expression patterns may
127contain characters that have an ordinal value larger than 255.
128
129If you use a Unicode editor to edit your program, Unicode characters
130may occur directly within the literal strings in one of the various
131Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
132as such and converted to Perl's internal representation only if the
133appropriate L<encoding> is specified.
134
135Unicode characters can also be added to a string by using the
136C<\x{...}> notation.  The Unicode code for the desired character, in
137hexadecimal, should be placed in the braces. For instance, a smiley
138face is C<\x{263A}>.  This encoding scheme only works for characters
139with a code of 0x100 or above.
140
141Additionally, if you
142
143   use charnames ':full';
144
145you can use the C<\N{...}> notation and put the official Unicode
146character name within the braces, such as C<\N{WHITE SMILING FACE}>.
147
148
149=item *
150
151If an appropriate L<encoding> is specified, identifiers within the
152Perl script may contain Unicode alphanumeric characters, including
153ideographs.  Perl does not currently attempt to canonicalize variable
154names.
155
156=item *
157
Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.  The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.
161
162=item *
163
164Character classes in regular expressions match characters instead of
165bytes and match against the character properties specified in the
166Unicode properties database.  C<\w> can be used to match a Japanese
167ideograph, for instance.
168
169(However, and as a limitation of the current implementation, using
170C<\w> or C<\W> I<inside> a C<[...]> character class will still match
171with byte semantics.)
172
173=item *
174
175Named Unicode properties, scripts, and block ranges may be used like
176character classes via the C<\p{}> "matches property" construct and
177the  C<\P{}> negation, "doesn't match property".
178
For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property.  Braces are not
required for single-letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.
185
186The official Unicode script and block names have spaces and dashes as
187separators, but for convenience you can use dashes, spaces, or
188underbars, and case is unimportant. It is recommended, however, that
189for consistency you use the following naming: the official Unicode
190script, property, or block name (see below for the additional rules
191that apply to block names) with whitespace and dashes removed, and the
192words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
193becomes C<Latin1Supplement>.
194
195You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
196(^) between the first brace and the property name: C<\p{^Tamil}> is
197equal to C<\P{Tamil}>.
198
199B<NOTE: the properties, scripts, and blocks listed here are as of
200Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002.  Unicode 4.0.0
201came out in April 2003, and Perl 5.8.1 in September 2003.>
202
203Here are the basic Unicode General Category properties, followed by their
204long form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
205for instance, are identical.
206
207    Short       Long
208
209    L           Letter
210    Lu          UppercaseLetter
211    Ll          LowercaseLetter
212    Lt          TitlecaseLetter
213    Lm          ModifierLetter
214    Lo          OtherLetter
215
216    M           Mark
217    Mn          NonspacingMark
218    Mc          SpacingMark
219    Me          EnclosingMark
220
221    N           Number
222    Nd          DecimalNumber
223    Nl          LetterNumber
224    No          OtherNumber
225
226    P           Punctuation
227    Pc          ConnectorPunctuation
228    Pd          DashPunctuation
229    Ps          OpenPunctuation
230    Pe          ClosePunctuation
231    Pi          InitialPunctuation
232                (may behave like Ps or Pe depending on usage)
233    Pf          FinalPunctuation
234                (may behave like Ps or Pe depending on usage)
235    Po          OtherPunctuation
236
237    S           Symbol
238    Sm          MathSymbol
239    Sc          CurrencySymbol
240    Sk          ModifierSymbol
241    So          OtherSymbol
242
243    Z           Separator
244    Zs          SpaceSeparator
245    Zl          LineSeparator
246    Zp          ParagraphSeparator
247
248    C           Other
249    Cc          Control
250    Cf          Format
251    Cs          Surrogate   (not usable)
252    Co          PrivateUse
253    Cn          Unassigned
254
255Single-letter properties match all characters in any of the
256two-letter sub-properties starting with the same letter.
257C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
258
259Because Perl hides the need for the user to understand the internal
260representation of Unicode characters, there is no need to implement
261the somewhat messy concept of surrogates. C<Cs> is therefore not
262supported.
263
264Because scripts differ in their directionality--Hebrew is
265written right to left, for example--Unicode supplies these properties:
266
267    Property    Meaning
268
269    BidiL       Left-to-Right
270    BidiLRE     Left-to-Right Embedding
271    BidiLRO     Left-to-Right Override
272    BidiR       Right-to-Left
273    BidiAL      Right-to-Left Arabic
274    BidiRLE     Right-to-Left Embedding
275    BidiRLO     Right-to-Left Override
276    BidiPDF     Pop Directional Format
277    BidiEN      European Number
278    BidiES      European Number Separator
279    BidiET      European Number Terminator
280    BidiAN      Arabic Number
281    BidiCS      Common Number Separator
282    BidiNSM     Non-Spacing Mark
283    BidiBN      Boundary Neutral
284    BidiB       Paragraph Separator
285    BidiS       Segment Separator
286    BidiWS      Whitespace
287    BidiON      Other Neutrals
288
289For example, C<\p{BidiR}> matches characters that are normally
290written right to left.
291
292=back
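
The two literal notations mentioned in the list above, C<\x{...}> and
C<\N{...}>, can be compared side by side.  This is a small sketch with
invented variable names; it also uses the character as a hash key to
show that wide characters are accepted there.

```perl
use strict;
use warnings;
use charnames ':full';

my $by_code = "\x{263A}";                  # hexadecimal code point
my $by_name = "\N{WHITE SMILING FACE}";    # official Unicode character name

# Wide characters are also fine as hash keys.
my %label = ($by_code => 'smiley');

printf "same character: %s; ordinal: U+%04X; label: %s\n",
    ($by_code eq $by_name ? 'yes' : 'no'), ord($by_code), $label{$by_name};
```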
293
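The property forms described above (short and long names, the C<Is>
prefix, the brace-less single-letter form, and the two ways to negate)
can be checked with a few matches.  The following is a sketch only;
the test characters and the C<%result> hash are chosen for illustration.

```perl
use strict;
use warnings;

my $upper_zhe = "\x{0416}";   # CYRILLIC CAPITAL LETTER ZHE
my $lower_zhe = "\x{0436}";   # CYRILLIC SMALL LETTER ZHE
my $acute     = "\x{0301}";   # COMBINING ACUTE ACCENT

my %result = (
    short_form   => ($upper_zhe =~ /^\p{Lu}$/              ? 1 : 0),
    long_form    => ($upper_zhe =~ /^\p{UppercaseLetter}$/ ? 1 : 0),
    is_prefix    => ($upper_zhe =~ /^\p{IsLu}$/            ? 1 : 0),
    lower_is_Lu  => ($lower_zhe =~ /^\p{Lu}$/              ? 1 : 0),
    caret_negate => ($lower_zhe =~ /^\p{^Lu}$/             ? 1 : 0),
    big_P_negate => ($lower_zhe =~ /^\P{Lu}$/              ? 1 : 0),
    mark_braces  => ($acute     =~ /^\p{M}$/               ? 1 : 0),
    mark_bare    => ($acute     =~ /^\pM$/                 ? 1 : 0),
);

print "$_ = $result{$_}\n" for sort keys %result;
```

Every entry except C<lower_is_Lu> should come out true.
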
294=head2 Scripts
295
296The script names which can be used by C<\p{...}> and C<\P{...}>,
297such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
298
299    Arabic
300    Armenian
301    Bengali
302    Bopomofo
303    Buhid
304    CanadianAboriginal
305    Cherokee
306    Cyrillic
307    Deseret
308    Devanagari
309    Ethiopic
310    Georgian
311    Gothic
312    Greek
313    Gujarati
314    Gurmukhi
315    Han
316    Hangul
317    Hanunoo
318    Hebrew
319    Hiragana
320    Inherited
321    Kannada
322    Katakana
323    Khmer
324    Lao
325    Latin
326    Malayalam
327    Mongolian
328    Myanmar
329    Ogham
330    OldItalic
331    Oriya
332    Runic
333    Sinhala
334    Syriac
335    Tagalog
336    Tagbanwa
337    Tamil
338    Telugu
339    Thaana
340    Thai
341    Tibetan
342    Yi
343
344Extended property classes can supplement the basic
345properties, defined by the F<PropList> Unicode database:
346
347    ASCIIHexDigit
348    BidiControl
349    Dash
350    Deprecated
351    Diacritic
352    Extender
353    GraphemeLink
354    HexDigit
355    Hyphen
356    Ideographic
357    IDSBinaryOperator
358    IDSTrinaryOperator
359    JoinControl
360    LogicalOrderException
361    NoncharacterCodePoint
362    OtherAlphabetic
363    OtherDefaultIgnorableCodePoint
364    OtherGraphemeExtend
365    OtherLowercase
366    OtherMath
367    OtherUppercase
368    QuotationMark
369    Radical
370    SoftDotted
371    TerminalPunctuation
372    UnifiedIdeograph
373    WhiteSpace
374
375and there are further derived properties:
376
377    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
378    Lowercase       Ll + OtherLowercase
379    Uppercase       Lu + OtherUppercase
380    Math            Sm + OtherMath
381
382    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
383    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
384
385    Any             Any character
386    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
387    Unassigned      Synonym for \p{Cn}
388    Common          Any character (or unassigned code point)
389                    not explicitly assigned to a script
390
391For backward compatibility (with Perl 5.6), all properties mentioned
392so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
393example, is equal to C<\P{Lu}>.
394
395=head2 Blocks
396
397In addition to B<scripts>, Unicode also defines B<blocks> of
398characters.  The difference between scripts and blocks is that the
399concept of scripts is closer to natural languages, while the concept
400of blocks is more of an artificial grouping based on groups of 256
401Unicode characters. For example, the C<Latin> script contains letters
402from many blocks but does not contain all the characters from those
403blocks. It does not, for example, contain digits, because digits are
404shared across many scripts. Digits and similar groups, like
405punctuation, are in a category called C<Common>.
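
The point about digits can be verified with script properties.  The
helper C<script_of> below is hypothetical and checks only three
scripts; it is a sketch of the idea, not a general script detector.

```perl
use strict;
use warnings;

# Only three scripts are tested, in this order.
my @scripts = (
    [ Latin    => qr/^\p{Latin}$/    ],
    [ Cyrillic => qr/^\p{Cyrillic}$/ ],
    [ Common   => qr/^\p{Common}$/   ],
);

# Return the name of the first listed script the character belongs to.
sub script_of {
    my ($char) = @_;
    for my $entry (@scripts) {
        my ($name, $pattern) = @$entry;
        return $name if $char =~ $pattern;
    }
    return 'other';
}

printf "U+%04X is %s\n", ord($_), script_of($_) for "a", "7", "\x{0436}";
```

The digit is reported as C<Common> rather than C<Latin>, even though
it lives in the same block as the Latin letter.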
406
407For more about scripts, see the UTR #24:
408
409   http://www.unicode.org/unicode/reports/tr24/
410
411For more about blocks, see:
412
413   http://www.unicode.org/Public/UNIDATA/Blocks.txt
414
415Block names are given with the C<In> prefix. For example, the
416Katakana block is referenced via C<\p{InKatakana}>.  The C<In>
417prefix may be omitted if there is no naming conflict with a script
418or any other property, but it is recommended that C<In> always be used
419for block tests to avoid confusion.
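
To see why keeping the C<In> prefix for block tests helps, compare a
block test with the script test of the same name.  In this sketch (the
two sample characters and the C<%seen> hash are chosen for
illustration), one Katakana letter lies outside the Katakana block
but still belongs to the Katakana script.

```perl
use strict;
use warnings;

my %char = (
    'KATAKANA LETTER A'        => "\x{30A2}",   # in the Katakana block
    'KATAKANA LETTER SMALL KU' => "\x{31F0}",   # in Katakana Phonetic Extensions
);

my %seen;
for my $name (sort keys %char) {
    $seen{$name} = {
        block  => ($char{$name} =~ /^\p{InKatakana}$/ ? 1 : 0),
        script => ($char{$name} =~ /^\p{Katakana}$/   ? 1 : 0),
    };
    printf "%-26s block=%d script=%d\n",
        $name, $seen{$name}{block}, $seen{$name}{script};
}
```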
420
421These block names are supported:
422
423    InAlphabeticPresentationForms
424    InArabic
425    InArabicPresentationFormsA
426    InArabicPresentationFormsB
427    InArmenian
428    InArrows
429    InBasicLatin
430    InBengali
431    InBlockElements
432    InBopomofo
433    InBopomofoExtended
434    InBoxDrawing
435    InBraillePatterns
436    InBuhid
437    InByzantineMusicalSymbols
438    InCJKCompatibility
439    InCJKCompatibilityForms
440    InCJKCompatibilityIdeographs
441    InCJKCompatibilityIdeographsSupplement
442    InCJKRadicalsSupplement
443    InCJKSymbolsAndPunctuation
444    InCJKUnifiedIdeographs
445    InCJKUnifiedIdeographsExtensionA
446    InCJKUnifiedIdeographsExtensionB
447    InCherokee
448    InCombiningDiacriticalMarks
449    InCombiningDiacriticalMarksforSymbols
450    InCombiningHalfMarks
451    InControlPictures
452    InCurrencySymbols
453    InCyrillic
454    InCyrillicSupplementary
455    InDeseret
456    InDevanagari
457    InDingbats
458    InEnclosedAlphanumerics
459    InEnclosedCJKLettersAndMonths
460    InEthiopic
461    InGeneralPunctuation
462    InGeometricShapes
463    InGeorgian
464    InGothic
465    InGreekExtended
466    InGreekAndCoptic
467    InGujarati
468    InGurmukhi
469    InHalfwidthAndFullwidthForms
470    InHangulCompatibilityJamo
471    InHangulJamo
472    InHangulSyllables
473    InHanunoo
474    InHebrew
475    InHighPrivateUseSurrogates
476    InHighSurrogates
477    InHiragana
478    InIPAExtensions
479    InIdeographicDescriptionCharacters
480    InKanbun
481    InKangxiRadicals
482    InKannada
483    InKatakana
484    InKatakanaPhoneticExtensions
485    InKhmer
486    InLao
487    InLatin1Supplement
488    InLatinExtendedA
489    InLatinExtendedAdditional
490    InLatinExtendedB
491    InLetterlikeSymbols
492    InLowSurrogates
493    InMalayalam
494    InMathematicalAlphanumericSymbols
495    InMathematicalOperators
496    InMiscellaneousMathematicalSymbolsA
497    InMiscellaneousMathematicalSymbolsB
498    InMiscellaneousSymbols
499    InMiscellaneousTechnical
500    InMongolian
501    InMusicalSymbols
502    InMyanmar
503    InNumberForms
504    InOgham
505    InOldItalic
506    InOpticalCharacterRecognition
507    InOriya
508    InPrivateUseArea
509    InRunic
510    InSinhala
511    InSmallFormVariants
512    InSpacingModifierLetters
513    InSpecials
514    InSuperscriptsAndSubscripts
515    InSupplementalArrowsA
516    InSupplementalArrowsB
517    InSupplementalMathematicalOperators
518    InSupplementaryPrivateUseAreaA
519    InSupplementaryPrivateUseAreaB
520    InSyriac
521    InTagalog
522    InTagbanwa
523    InTags
524    InTamil
525    InTelugu
526    InThaana
527    InThai
528    InTibetan
529    InUnifiedCanadianAboriginalSyllabics
530    InVariationSelectors
531    InYiRadicals
532    InYiSyllables
533
534=over 4
535
536=item *
537
538The special pattern C<\X> matches any extended Unicode
539sequence--"a combining character sequence" in Standardese--where the
540first character is a base character and subsequent characters are mark
541characters that apply to the base character.  C<\X> is equivalent to
542C<(?:\PM\pM*)>.
543
544=item *
545
546The C<tr///> operator translates characters instead of bytes.  Note
547that the C<tr///CU> functionality has been removed.  For similar
548functionality see pack('U0', ...) and pack('C0', ...).
549
550=item *
551
552Case translation operators use the Unicode case translation tables
553when character input is provided.  Note that C<uc()>, or C<\U> in
554interpolated strings, translates to uppercase, while C<ucfirst>,
555or C<\u> in interpolated strings, translates to titlecase in languages
556that make the distinction.
557
558=item *
559
Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>.  Operators that really don't care include C<chomp()>,
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.
568
569=item *
570
571The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
572since they are often used for byte-oriented formats.  Again, think
573C<char> in the C language.
574
575There is a new C<U> specifier that converts between Unicode characters
576and code points.
577
578=item *
579
580The C<chr()> and C<ord()> functions work on characters, similar to
581C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
582C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
583emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
584While these methods reveal the internal encoding of Unicode strings,
585that is not something one normally needs to care about at all.
586
587=item *
588
The bit string operators, C<& | ^ ~>, can operate on character data.
However, because bit string operations on strings whose characters are
all less than 256 in ordinal value must keep their old behavior, one
should not use C<~> (the bit complement) on strings that mix characters
with values less than 256 and characters with values greater than 256.
Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold.  The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.
598
599=item *
600
601lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
602
603=over 8
604
605=item *
606
607the case mapping is from a single Unicode character to another
608single Unicode character, or
609
610=item *
611
612the case mapping is from a single Unicode character to more
613than one Unicode character.
614
615=back
616
Locale-specific case mappings (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.
619
620See the Unicode Technical Report #21, Case Mappings, for more details.
621
622=back
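
As an illustration of C<\X> and of the code-point view shared by
C<chr()>, C<ord()>, and the C<U> pack template, consider the following
sketch.  The sample string and variable names are invented for the
example.

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT, then a plain "a".
my $text = "e\x{0301}a";

my $code_points = length($text);       # counts characters
my @clusters    = $text =~ /(\X)/g;    # base character plus its marks

# chr()/ord() and pack("U")/unpack("U") all deal in code points.
my $smiley  = chr(0x263A);
my $packed  = pack("U", 0x263A);
my ($value) = unpack("U", $smiley);

printf "%d code points, %d combining sequences, U+%04X\n",
    $code_points, scalar(@clusters), $value;
```

The string has three characters but only two combining character
sequences, because the accent is grouped with the preceding "e".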
623
624=over 4
625
626=item *
627
628And finally, C<scalar reverse()> reverses by character rather than by byte.
629
630=back
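
The character-based behavior of the position, reversal, and case
operators listed above can be seen in a short script.  This is a
sketch with made-up sample data; the titlecase example relies on
U+01C6, whose uppercase and titlecase forms are different characters.

```perl
use strict;
use warnings;

my $str = "\x{263A}abc";                 # smiley followed by "abc"

my $length   = length($str);             # characters, not bytes
my $position = index($str, "b");         # character offset
my $first    = substr($str, 0, 1);       # the whole smiley
my $reversed = scalar reverse($str);     # reversed by character

my $dz       = "\x{01C6}";               # LATIN SMALL LETTER DZ WITH CARON
my $upper    = uc($dz);                  # uppercase form
my $title    = ucfirst($dz);             # titlecase form
my $expanded = uc("\x{FB00}");           # LATIN SMALL LIGATURE FF, one-to-many

printf "length=%d index=%d upper=U+%04X title=U+%04X expanded=%s\n",
    $length, $position, ord($upper), ord($title), $expanded;
```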
631
632=head2 User-Defined Character Properties
633
634You can define your own character properties by defining subroutines
635whose names begin with "In" or "Is".  The subroutines must be defined
636in the C<main> package.  The user-defined properties can be used in the
637regular expression C<\p> and C<\P> constructs.  Note that the effect
638is compile-time and immutable once defined.
639
640The subroutines must return a specially-formatted string, with one
641or more newline-separated lines.  Each line must be one of the following:
642
643=over 4
644
645=item *
646
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
649
650=item *
651
652Something to include, prefixed by "+": a built-in character
653property (prefixed by "utf8::"), to represent all the characters in that
654property; two hexadecimal code points for a range; or a single
655hexadecimal code point.
656
657=item *
658
659Something to exclude, prefixed by "-": an existing character
660property (prefixed by "utf8::"), for all the characters in that
661property; two hexadecimal code points for a range; or a single
662hexadecimal code point.
663
664=item *
665
666Something to negate, prefixed "!": an existing character
667property (prefixed by "utf8::") for all the characters except the
668characters in the property; two hexadecimal code points for a range;
669or a single hexadecimal code point.
670
671=back
672
673For example, to define a property that covers both the Japanese
674syllabaries (hiragana and katakana), you can define
675
676    sub InKana {
677        return <<END;
678    3040\t309F
679    30A0\t30FF
680    END
681    }
682
683Imagine that the here-doc end marker is at the beginning of the line.
684Now you can use C<\p{InKana}> and C<\P{InKana}>.
685
686You could also have used the existing block property names:
687
688    sub InKana {
689        return <<'END';
690    +utf8::InHiragana
691    +utf8::InKatakana
692    END
693    }
694
Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:
698
699    sub InKana {
700        return <<'END';
701    +utf8::InHiragana
702    +utf8::InKatakana
703    -utf8::IsCn
704    END
705    }
706
707The negation is useful for defining (surprise!) negated classes.
708
709    sub InNotKana {
710        return <<'END';
711    !utf8::InHiragana
712    -utf8::InKatakana
713    +utf8::IsCn
714    END
715    }
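
Putting one of these definitions to work might look like the
following sketch.  The property definition is the C<InKana> variant
shown earlier that excludes unassigned code points; the test
characters and variable names are chosen for illustration.

```perl
use strict;
use warnings;

# User-defined property: both kana blocks minus unassigned code points.
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # unassigned code point in the Hiragana block

my @is_kana  = map { /^\p{InKana}$/ ? 1 : 0 } ($hiragana_a, $reserved, "a");
my $not_kana = "a" =~ /^\P{InKana}$/ ? 1 : 0;

print "InKana: @is_kana; 'a' matches \\P{InKana}: $not_kana\n";
```

Only the assigned Hiragana letter matches C<\p{InKana}>; the reserved
code point is removed by the C<-utf8::IsCn> line.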
716
717You can also define your own mappings to be used in the lc(),
718lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
719The principle is the same: define subroutines in the C<main> package
720with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
721the first character in ucfirst()), and C<ToUpper> (for uc(), and the
722rest of the characters in ucfirst()).
723
The string returned by these subroutines must consist of three
hexadecimal numbers separated by tab characters: the start of the source
range, the end of the source range, and the start of the destination range.
For example:
728
729    sub ToUpper {
730        return <<END;
731    0061\t0063\t0041
732    END
733    }
734
defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", and "C"; all other characters remain
unchanged.
738
If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty; the two tab characters are still needed.
For example:
743
744    sub ToLower {
745        return <<END;
746    0041\t\t0061
747    END
748    }
749
defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters remain unchanged.
752
(For serious hackers only)  If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
a here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; instead, one can use either the
C<Unicode::UCD> module or just match case-insensitively (that's when
the C<Fold> mapping is used).
762
763A final note on the user-defined property tests and mappings: they
764will be used only if the scalar has been marked as having Unicode
765characters.  Old byte-style strings will not be affected.
766
767=head2 Character Encodings for Input and Output
768
769See L<Encode>.
770
771=head2 Unicode Regular Expression Support Level
772
773The following list of Unicode support for regular expressions describes
774all the features currently supported.  The references to "Level N"
775and the section numbers refer to the Unicode Technical Report 18,
776"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
777Perl 5.8.0).
778
779=over 4
780
781=item *
782
783Level 1 - Basic Unicode Support
784
785        2.1 Hex Notation                        - done          [1]
786            Named Notation                      - done          [2]
787        2.2 Categories                          - done          [3][4]
788        2.3 Subtraction                         - MISSING       [5][6]
789        2.4 Simple Word Boundaries              - done          [7]
790        2.5 Simple Loose Matches                - done          [8]
791        2.6 End of Line                         - MISSING       [9][10]
792
793        [ 1] \x{...}
794        [ 2] \N{...}
795        [ 3] . \p{...} \P{...}
796        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
797        [ 5] have negation
798        [ 6] can use regular expression look-ahead [a]
799             or user-defined character properties [b] to emulate subtraction
800        [ 7] include Letters in word characters
801        [ 8] note that Perl does Full case-folding in matching, not Simple:
802             for example U+1F88 is equivalent with U+1F00 U+03B9,
803             not with 1F80.  This difference matters for certain Greek
804             capital letters with certain modifiers: the Full case-folding
805             decomposes the letter, while the Simple case-folding would map
806             it to a single character.
807        [ 9] see UTR #13 Unicode Newline Guidelines
808        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
809             (should also affect <>, $., and script line numbers)
810             (the \x{85}, \x{2028} and \x{2029} do match \s)
811
812[a] You can mimic class subtraction using lookahead.
813For example, what UTR #18 might write as
814
815    [{Greek}-[{UNASSIGNED}]]
816
817in Perl can be written as:
818
819    (?!\p{Unassigned})\p{InGreekAndCoptic}
820    (?=\p{Assigned})\p{InGreekAndCoptic}
821
But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.
827
Also see the Unicode::Regex::Set module, which implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
830
831[b] See L</"User-Defined Character Properties">.
832
833=item *
834
835Level 2 - Extended Unicode Support
836
837        3.1 Surrogates                          - MISSING       [11]
838        3.2 Canonical Equivalents               - MISSING       [12][13]
839        3.3 Locale-Independent Graphemes        - MISSING       [14]
840        3.4 Locale-Independent Words            - MISSING       [15]
841        3.5 Locale-Independent Loose Matches    - MISSING       [16]
842
843        [11] Surrogates are solely a UTF-16 concept and Perl's internal
844             representation is UTF-8.  The Encode module does UTF-16, though.
845        [12] see UTR#15 Unicode Normalization
846        [13] have Unicode::Normalize but not integrated to regexes
847        [14] have \X but at this level . should equal that
848        [15] need three classes, not just \w and \W
849        [16] see UTR#21 Case Mappings
850
851=item *
852
853Level 3 - Locale-Sensitive Support
854
        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes
863
864=back
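
The lookahead idiom from note [a] above can be exercised on any block
that contains unassigned code points.  The sketch below applies it to
the Hiragana block rather than the Greek one, purely because U+3040 is
a convenient reserved code point there; that choice and the variable
names are assumptions made for the example.

```perl
use strict;
use warnings;

my $block_only    = qr/^\p{InHiragana}$/;
my $not_unassign  = qr/^(?!\p{Unassigned})\p{InHiragana}$/;   # negative lookahead
my $only_assigned = qr/^(?=\p{Assigned})\p{InHiragana}$/;     # positive lookahead

my $letter   = "\x{3042}";   # HIRAGANA LETTER A
my $reserved = "\x{3040}";   # unassigned, but inside the block

for my $char ($letter, $reserved) {
    printf "U+%04X block=%d subtract(neg)=%d subtract(pos)=%d\n",
        ord($char),
        ($char =~ $block_only    ? 1 : 0),
        ($char =~ $not_unassign  ? 1 : 0),
        ($char =~ $only_assigned ? 1 : 0);
}
```

Both lookahead forms accept the assigned letter and reject the
reserved code point, while the plain block test accepts both.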
865
866=head2 Unicode Encodings
867
868Unicode characters are assigned to I<code points>, which are abstract
869numbers.  To use these numbers, various encodings are needed.
870
871=over 4
872
873=item *
874
875UTF-8
876
877UTF-8 is a variable-length (1 to 6 bytes, current character allocations
878require 4 bytes), byte-order independent encoding. For ASCII (and we
879really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
880transparent.
881
882The following table is from Unicode 3.2.
883
884 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
885
886   U+0000..U+007F       00..7F
887   U+0080..U+07FF       C2..DF    80..BF
888   U+0800..U+0FFF       E0        A0..BF    80..BF
889   U+1000..U+CFFF       E1..EC    80..BF    80..BF
890   U+D000..U+D7FF       ED        80..9F    80..BF
891   U+D800..U+DFFF       ******* ill-formed *******
892   U+E000..U+FFFF       EE..EF    80..BF    80..BF
893  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
894  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
895 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
896
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.
904
905Another way to look at it is via bits:
906
907 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
908
909                    0aaaaaaa     0aaaaaaa
910            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
911            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
912  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
913
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
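
The bit layout above can be turned into a small encoder.  The following
is only a sketch: C<utf8_bytes> is a name invented for this
illustration, it assumes its argument is a code point in the range
C<0..0x10FFFF> that is not a surrogate, and it does no error checking.
In real code, let Perl or the L<Encode> module do the encoding.

    use strict;
    use warnings;

    # Hypothetical helper: return the UTF-8 bytes (as integers) that
    # encode a single code point, following the bit table above.
    sub utf8_bytes {
        my ($cp) = @_;
        return ($cp) if $cp < 0x80;                 # 0aaaaaaa
        return (0xC0 | ($cp >> 6),                  # 110bbbbb
                0x80 | ($cp & 0x3F))                # 10aaaaaa
            if $cp < 0x800;
        return (0xE0 | ($cp >> 12),                 # 1110cccc
                0x80 | (($cp >> 6) & 0x3F),         # 10bbbbbb
                0x80 | ($cp & 0x3F))                # 10aaaaaa
            if $cp < 0x10000;
        return (0xF0 | ($cp >> 18),                 # 11110ddd
                0x80 | (($cp >> 12) & 0x3F),        # 10cccccc
                0x80 | (($cp >> 6) & 0x3F),         # 10bbbbbb
                0x80 | ($cp & 0x3F));               # 10aaaaaa
    }

    for my $cp (0x41, 0xE9, 0x20AC, 0x10400) {
        printf "U+%04X => %s\n", $cp,
            join " ", map { sprintf "%02X", $_ } utf8_bytes($cp);
    }

For example, C<U+20AC> comes out as the three bytes C<E2 82 AC>, which
falls in the C<U+1000..U+CFFF> row of the first table.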
917
918=item *
919
920UTF-EBCDIC
921
922Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
923
924=item *
925
926UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
927
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
930
931UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
932C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
933points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
934using I<surrogates>, the first 16-bit unit being the I<high
935surrogate>, and the second being the I<low surrogate>.
936
937Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
938range of Unicode code points in pairs of 16-bit units.  The I<high
939surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
940are the range C<U+DC00..U+DFFF>.  The surrogate encoding is
941
        $hi = int(($uni - 0x10000) / 0x400) + 0xD800;  # int(): "/" is not integer division
        $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
944
945and the decoding is
946
947        $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
948
949If you try to generate surrogates (for example by using chr()), you
950will get a warning if warnings are turned on, because those code
951points are not valid for a Unicode character.
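
The surrogate arithmetic given earlier can be checked with a round
trip.  This is a sketch under stated assumptions: the helper names
C<to_surrogates> and C<from_surrogates> are invented here, the input is
assumed to lie in C<U+10000..U+10FFFF>, and bit shifts are used in
place of division and multiplication by C<0x400> (the two are
equivalent for non-negative integers).

    use strict;
    use warnings;

    # Hypothetical helper: split a supplementary code point into
    # (high surrogate, low surrogate).
    sub to_surrogates {
        my ($uni) = @_;
        my $v = $uni - 0x10000;
        return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
    }

    # Hypothetical helper: combine a surrogate pair back into a code point.
    sub from_surrogates {
        my ($hi, $lo) = @_;
        return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
    }

    my ($hi, $lo) = to_surrogates(0x1D11E);
    printf "U+1D11E => %04X %04X => U+%X\n",
        $hi, $lo, from_surrogates($hi, $lo);

Only plain integers are handled here; no surrogate characters are
created with chr(), so the warning mentioned above is not triggered.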
952
953Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
954itself can be used for in-memory computations, but if storage or
955transfer is required either UTF-16BE (big-endian) or UTF-16LE
956(little-endian) encodings must be chosen.
957
958This introduces another problem: what if you just know that your data
959is UTF-16, but you don't know which endianness?  Byte Order Marks, or
960BOMs, are a solution to this.  A special character has been reserved
961in Unicode to function as a byte order marker: the character with the
962code point C<U+FEFF> is the BOM.
963
964The trick is that if you read a BOM, you will know the byte order,
965since if it was written on a big-endian platform, you will read the
966bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
967you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
968was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
969
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".
975
976=item *
977
978UTF-32, UTF-32BE, UTF-32LE
979
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
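
The BOM signatures listed for the UTF-16 and UTF-32 families (and for
UTF-8) can be sniffed from the first few octets of a buffer.  The
function below is a sketch, not a Perl built-in: C<bom_encoding> is a
made-up name, and it assumes that a buffer starting with
C<0xFF 0xFE 0x00 0x00> is UTF-32LE rather than UTF-16LE text that
begins with a BOM followed by C<U+0000>.

    use strict;
    use warnings;

    # Hypothetical helper: guess the encoding from a leading BOM.
    # The UTF-32 tests come first because the UTF-32LE signature
    # starts with the same two bytes as the UTF-16LE one.
    sub bom_encoding {
        my ($octets) = @_;
        return "UTF-32BE" if $octets =~ /\A\x00\x00\xFE\xFF/;
        return "UTF-32LE" if $octets =~ /\A\xFF\xFE\x00\x00/;
        return "UTF-8"    if $octets =~ /\A\xEF\xBB\xBF/;
        return "UTF-16BE" if $octets =~ /\A\xFE\xFF/;
        return "UTF-16LE" if $octets =~ /\A\xFF\xFE/;
        return;                       # no BOM found
    }

    my $guess = bom_encoding("\xFF\xFEA\x00");
    print defined $guess ? "$guess\n" : "no BOM\n";

A missing BOM proves nothing about the encoding, so a caller still
needs a default or some out-of-band information.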
984
985=item *
986
987UCS-2, UCS-4
988
989Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
990encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
991because it does not use surrogates.  UCS-4 is a 32-bit encoding,
992functionally identical to UTF-32.
993
994=item *
995
996UTF-7
997
998A seven-bit safe (non-eight-bit) encoding, which is useful if the
999transport or storage is not eight-bit safe.  Defined by RFC 2152.
1000
1001=back
1002
1003=head2 Security Implications of Unicode
1004
1005=over 4
1006
1007=item *
1008
1009Malformed UTF-8
1010
1011Unfortunately, the specification of UTF-8 leaves some room for
1012interpretation of how many bytes of encoded output one should generate
1013from one input Unicode character.  Strictly speaking, the shortest
1014possible sequence of UTF-8 bytes should be generated,
1015because otherwise there is potential for an input buffer overflow at
1016the receiving end of a UTF-8 connection.  Perl always generates the
1017shortest length UTF-8, and with warnings on Perl will warn about
1018non-shortest length UTF-8 along with other malformations, such as the
1019surrogates, which are not real Unicode code points.
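
When octets arrive from an untrusted source, they can be checked before
use.  The sketch below assumes a version of the L<Encode> module in
which the name C<"UTF-8"> selects the strict decoder (one that rejects
non-shortest forms and surrogates); C<is_strict_utf8> is a name
invented for this example.

    use strict;
    use warnings;
    use Encode ();

    # Hypothetical helper: true if $octets decodes as strict UTF-8.
    sub is_strict_utf8 {
        my ($octets) = @_;
        my $copy = $octets;  # decode() may modify its argument when CHECK is set
        return eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK); 1 }
            ? 1 : 0;
    }

    print is_strict_utf8("\xC3\xA9"), "\n";      # well-formed U+00E9
    print is_strict_utf8("\xC0\xAF"), "\n";      # non-shortest form of "/"
    print is_strict_utf8("\xED\xA0\x80"), "\n";  # encoded surrogate U+D800

Rejecting such input outright is usually safer than trying to repair it.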
1020
1021=item *
1022
1023Regular expressions behave slightly differently between byte data and
1024character (Unicode) data.  For example, the "word character" character
1025class C<\w> will work differently depending on if data is eight-bit bytes
1026or Unicode.
1027
1028In the first case, the set of C<\w> characters is either small--the
1029default set of alphabetic characters, digits, and the "_"--or, if you
1030are using a locale (see L<perllocale>), the C<\w> might contain a few
1031more letters according to your language and country.
1032
1033In the second case, the C<\w> set of characters is much, much larger.
1034Most importantly, even in the set of the first 256 characters, it will
1035probably match different characters: unlike most locales, which are
1036specific to a language and country pair, Unicode classifies all the
1037characters that are letters I<somewhere> as C<\w>.  For example, your
1038locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1039you happen to speak Icelandic), but Unicode does.
1040
1041As discussed elsewhere, Perl has one foot (two hooves?) planted in
1042each of two worlds: the old world of bytes and the new world of
1043characters, upgrading from bytes to characters when necessary.
1044If your legacy code does not explicitly use Unicode, no automatic
1045switch-over to characters should happen.  Characters shouldn't get
1046downgraded to bytes, either.  It is possible to accidentally mix bytes
1047and characters, however (see L<perluniintro>), in which case C<\w> in
1048regular expressions might start behaving differently.  Review your
1049code.  Use warnings and the C<strict> pragma.
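
The difference can be seen with a single character.  This sketch
assumes that no locale is in effect and that nothing in scope changes
the matching rules (for example a regular expression modifier or a
feature that forces Unicode semantics on all strings); C<is_word> is a
name made up for the illustration.

    use strict;
    use warnings;

    # Hypothetical helper: does the string contain a \w character?
    sub is_word { return $_[0] =~ /\w/ ? 1 : 0 }

    my $eth = "\xF0";             # LATIN SMALL LETTER ETH, as a byte string
    my $as_bytes = is_word($eth);

    utf8::upgrade($eth);          # same character, now stored as UTF-8
    my $as_chars = is_word($eth);

    print "bytes: $as_bytes, characters: $as_chars\n";

Under those assumptions the byte string does not match C<\w>, while the
upgraded string does, even though the two compare equal with C<eq>.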
1050
1051=back
1052
1053=head2 Unicode in Perl on EBCDIC
1054
1055The way Unicode is handled on EBCDIC platforms is still
1056experimental.  On such platforms, references to UTF-8 encoding in this
1057document and elsewhere should be read as meaning the UTF-EBCDIC
1058specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
1059are specifically discussed. There is no C<utfebcdic> pragma or
1060":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
1061the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1062for more discussion of the issues.
1063
1064=head2 Locales
1065
1066Usually locale settings and Unicode do not affect each other, but
1067there are a couple of exceptions:
1068
1069=over 4
1070
1071=item *
1072
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable.  See L<perlrun> for the documentation of the C<-C> switch.
1077
1078=item *
1079
1080Perl tries really hard to work both with Unicode and the old
1081byte-oriented world. Most often this is nice, but sometimes Perl's
1082straddling of the proverbial fence causes problems.
1083
1084=back
1085
1086=head2 When Unicode Does Not Happen
1087
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> that can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but is not.
1093
1094The following are such interfaces.  For all of these interfaces Perl
1095currently (as of 5.8.3) simply assumes byte strings both as arguments
1096and results, or UTF-8 strings if the C<encoding> pragma has been used.
1097
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1104
1105=over 4
1106
1107=item *
1108
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
1111
1112=item *
1113
1114%ENV
1115
1116=item *
1117
1118glob (aka the <*>)
1119
1120=item *
1121
1122open, opendir, sysopen
1123
1124=item *
1125
1126qx (aka the backtick operator), system
1127
1128=item *
1129
1130readdir, readlink
1131
1132=back
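
Since these interfaces deal in bytes, one workable approach is to
encode and decode names explicitly at the boundary.  The following is a
sketch that assumes the file system stores names as UTF-8 and does not
normalize them; C<$fs_encoding> and the file name are made up for the
example, and other platforms may need a different encoding.

    use strict;
    use warnings;
    use Encode qw(encode decode);
    use File::Temp qw(tempdir);

    my $fs_encoding = "UTF-8";    # assumption about the file system
    my $dir  = tempdir(CLEANUP => 1);
    my $name = "smiley-\x{263A}.txt";

    # Characters to octets before handing the name to open()
    my $path = "$dir/" . encode($fs_encoding, $name);
    open(my $fh, ">", $path) or die "open: $!";
    close($fh) or die "close: $!";

    # Octets back to characters after readdir()
    opendir(my $dh, $dir) or die "opendir: $!";
    my @names = map  { decode($fs_encoding, $_) }
                grep { !/\A\.\.?\z/ } readdir($dh);
    closedir($dh);

    print $names[0] eq $name ? "round trip ok\n" : "names differ\n";
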
1133
1134=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1135
1136Sometimes (see L</"When Unicode Does Not Happen">) there are
1137situations where you simply need to force Perl to believe that a byte
1138string is UTF-8, or vice versa.  The low-level calls
1139utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1140the answers.
1141
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  Be especially careful with
utf8::upgrade(): not every byte string is valid UTF-8.
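
As a small illustration of what these calls change, the sketch below
assumes a Perl that provides C<utf8::is_utf8()>; the variable names are
made up.  The internal representation changes, but the string compares
equal before and after.

    use strict;
    use warnings;

    my $s    = "caf\xE9";         # four characters, all below 256
    my $orig = $s;

    utf8::upgrade($s);            # now stored internally as UTF-8
    my $flag_on = utf8::is_utf8($s) ? 1 : 0;
    my $same    = ($s eq $orig && length($s) == 4) ? 1 : 0;

    my $down_ok = utf8::downgrade($s) ? 1 : 0;    # back to bytes

    my $wide    = "\x{263A}";     # cannot be represented in one byte
    my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;  # FAIL_OK: no die

    print "flag=$flag_on same=$same down=$down_ok wide=$wide_ok\n";

The second argument to C<utf8::downgrade()> makes it return false
instead of dying when the string contains characters above 255.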
1146
1147=head2 Using Unicode in XS
1148
1149If you want to handle Perl Unicode in XS extensions, you may find the
1150following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
1151explanation about Unicode at the XS level, and L<perlapi> for the API
1152details.
1153
1154=over 4
1155
1156=item *
1157
1158C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
1160flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
1161does B<not> mean that there are any characters of code points greater
1162than 255 (or 127) in the scalar or that there are even any characters
1163in the scalar.  What the C<UTF8> flag means is that the sequence of
1164octets in the representation of the scalar is the sequence of UTF-8
1165encoded code points of the characters of a string.  The C<UTF8> flag
1166being off means that each octet in this representation encodes a
1167single character with code point 0..255 within the string.  Perl's
1168Unicode model is not to use UTF-8 until it is absolutely necessary.
1169
1170=item *
1171
1172C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1173a buffer encoding the code point as UTF-8, and returns a pointer
1174pointing after the UTF-8 bytes.
1175
1176=item *
1177
1178C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1179returns the Unicode character code point and, optionally, the length of
1180the UTF-8 byte sequence.
1181
1182=item *
1183
1184C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1185in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
1186scalar.
1187
1188=item *
1189
1190C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1191encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
1192possible.  C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
1193it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
1194opposite of C<sv_utf8_encode()>.  Note that none of these are to be
1195used as general-purpose encoding or decoding interfaces: C<use Encode>
1196for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
1197but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1198designed to be a one-way street).
1199
1200=item *
1201
1202C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
1203character.
1204
1205=item *
1206
1207C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
1208are valid UTF-8.
1209
1210=item *
1211
1212C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1213character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
1214required to UTF-8-encode the Unicode character code point.  C<UTF8SKIP()>
1215is useful for example for iterating over the characters of a UTF-8
1216encoded buffer; C<UNISKIP()> is useful, for example, in computing
1217the size required for a UTF-8 encoded buffer.
1218
1219=item *
1220
1221C<utf8_distance(a, b)> will tell the distance in characters between the
1222two pointers pointing to the same UTF-8 encoded buffer.
1223
1224=item *
1225
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
1227that is C<off> (positive or negative) Unicode characters displaced
1228from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
1229C<utf8_hop()> will merrily run off the end or the beginning of the
1230buffer if told to do so.
1231
1232=item *
1233
1234C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1235C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1236output of Unicode strings and scalars.  By default they are useful
1237only for debugging--they display B<all> characters as hexadecimal code
1238points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1239C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1240output more readable.
1241
1242=item *
1243
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
1247
1248=back
1249
1250For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1251in the Perl source code distribution.
1252
1253=head1 BUGS
1254
1255=head2 Interaction with Locales
1256
1257Use of locales with Unicode data may lead to odd results.  Currently,
1258Perl attempts to attach 8-bit locale info to characters in the range
12590..255, but this technique is demonstrably incorrect for locales that
1260use characters above that range when mapped into Unicode.  Perl's
1261Unicode support will also tend to run slower.  Use of locales with
1262Unicode is discouraged.
1263
1264=head2 Interaction with Extensions
1265
1266When Perl exchanges data with an extension, the extension should be
1267able to understand the UTF-8 flag and act accordingly. If the
1268extension doesn't know about the flag, it's likely that the extension
1269will return incorrectly-flagged data.
1270
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode at all,
1274suspect the worst and probably look at the source to learn how the
1275module is implemented. Modules written completely in Perl shouldn't
1276cause problems. Modules that directly or indirectly access code written
1277in other programming languages are at risk.
1278
1279For affected functions, the simple strategy to avoid data corruption is
1280to always make the encoding of the exchanged data explicit. Choose an
1281encoding that you know the extension can handle. Convert arguments passed
1282to the extensions to that encoding and convert results back from that
1283encoding. Write wrapper functions that do the conversions for you, so
1284you can later change the functions when the extension catches up.
1285
1286To provide an example, let's say the popular Foo::Bar::escape_html
1287function doesn't deal with Unicode data yet. The wrapper function
1288would convert the argument to raw UTF-8 and convert the result back to
1289Perl's internal representation like so:
1290
1291    sub my_escape_html ($) {
1292      my($what) = shift;
1293      return unless defined $what;
1294      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
1295    }
1296
1297Sometimes, when the extension does not convert data but just stores
1298and retrieves them, you will be in a position to use the otherwise
1299dangerous Encode::_utf8_on() function. Let's say the popular
1300C<Foo::Bar> extension, written in C, provides a C<param> method that
1301lets you store and retrieve data according to these prototypes:
1302
1303    $self->param($name, $value);            # set a scalar
1304    $value = $self->param($name);           # retrieve a scalar
1305
1306If it does not yet provide support for any encoding, one could write a
1307derived class with such a C<param> method:
1308
    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
1321
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1326
1327=head2 Speed
1328
1329Some functions are slower when working on UTF-8 encoded strings than
1330on byte encoded strings.  All functions that need to hop over
1331characters such as length(), substr() or index(), or matching regular
1332expressions can work B<much> faster when the underlying data are
1333byte-encoded.
1334
1335In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
1336a caching scheme was introduced which will hopefully make the slowness
1337somewhat less spectacular, at least for some operations.  In general,
1338operations with UTF-8 encoded strings are still slower. As an example,
1339the Unicode properties (character classes) like C<\p{Nd}> are known to
1340be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
1343
1344=head2 Porting code from perl-5.6.X
1345
1346Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1347was required to use the C<utf8> pragma to declare that a given scope
1348expected to deal with Unicode data and had to make sure that only
1349Unicode data were reaching that scope. If you have code that is
1350working with 5.6, you will need some of the following adjustments to
1351your code. The examples are written such that the code will continue
1352to work under 5.6, so you should be safe to try them out.
1353
1354=over 4
1355
1356=item *
1357
1358A filehandle that should read or write UTF-8
1359
1360  if ($] > 5.007) {
1361    binmode $fh, ":utf8";
1362  }
1363
1364=item *
1365
1366A scalar that is going to be passed to some extension
1367
1368Be it Compress::Zlib, Apache::Request or any extension that has no
1369mention of Unicode in the manpage, you need to make sure that the
1370UTF-8 flag is stripped off. Note that at the time of this writing
1371(October 2002) the mentioned modules are not UTF-8-aware. Please
1372check the documentation to verify if this is still true.
1373
1374  if ($] > 5.007) {
1375    require Encode;
1376    $val = Encode::encode_utf8($val); # make octets
1377  }
1378
1379=item *
1380
1381A scalar we got back from an extension
1382
1383If you believe the scalar comes back as UTF-8, you will most likely
1384want the UTF-8 flag restored:
1385
1386  if ($] > 5.007) {
1387    require Encode;
1388    $val = Encode::decode_utf8($val);
1389  }
1390
1391=item *
1392
1393Same thing, if you are really sure it is UTF-8
1394
1395  if ($] > 5.007) {
1396    require Encode;
1397    Encode::_utf8_on($val);
1398  }
1399
1400=item *
1401
1402A wrapper for fetchrow_array and fetchrow_hashref
1403
1404When the database contains only UTF-8, a wrapper function or method is
1405a convenient way to replace all your fetchrow_array and
1406fetchrow_hashref calls. A wrapper function will also make it easier to
1407adapt to future enhancements in your database driver. Note that at the
1408time of this writing (October 2002), the DBI has no standardized way
1409to deal with UTF-8 data. Please check the documentation to verify if
1410that is still true.
1411
1412  sub fetchrow {
1413    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
1414    if ($] < 5.007) {
1415      return $sth->$what;
1416    } else {
1417      require Encode;
1418      if (wantarray) {
1419        my @arr = $sth->$what;
1420        for (@arr) {
1421          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1422        }
1423        return @arr;
1424      } else {
1425        my $ret = $sth->$what;
1426        if (ref $ret) {
1427          for my $k (keys %$ret) {
1428            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
1429          }
1430          return $ret;
1431        } else {
1432          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1433          return $ret;
1434        }
1435      }
1436    }
1437  }
1438
1439
1440=item *
1441
1442A large scalar that you know can only contain ASCII
1443
1444Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1445a drag to your program. If you recognize such a situation, just remove
1446the UTF-8 flag:
1447
1448  utf8::downgrade($val) if $] > 5.007;
1449
1450=back
1451
1452=head1 SEE ALSO
1453
1454L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
1455L<perlretut>, L<perlvar/"${^UNICODE}">
1456
1457=cut