Noncharacters

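[1] In Unicode, a number of code points are designated as noncharacters.

[19] The term does not simply mean "not a character"; noncharacter is a specific category, written as the single word "noncharacter". Noncharacters are not characters, but every one of them is a Unicode code point.

Definitions and explanation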

[10]

C2
A process shall not interpret a noncharacter code point as an abstract character.
  • The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.

[11]

C7
When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.
  • (omitted)
  • If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point.
  • (remainder omitted)
D12 Coded character sequence
An ordered sequence of one or more code points.
  • A coded character sequence is also known as a coded character representation.
  • Normally a coded character sequence consists of a sequence of encoded characters, but it may also include noncharacters or reserved code points.
  • Internally, a process may choose to make use of noncharacter code points in its coded character sequences. However, such noncharacter code points may not be interpreted as abstract characters (see conformance clause C2), and their removal by a conformant process does not constitute modification of interpretation of the coded character sequence (see conformance clause C7).
  • (remainder omitted)
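The note under C7 about property APIs can be observed with Python's standard `unicodedata` module. This is only a small sketch: it assumes that U+0378 is still an unassigned code point in the Unicode version bundled with the interpreter, and it checks just two properties (General_Category and the character name).

```python
import unicodedata

noncharacter = "\uFFFF"   # a noncharacter
unassigned = "\u0378"     # assumed to be unassigned in the bundled Unicode data

# Both report the default General_Category for unassigned code points.
print(unicodedata.category(noncharacter))  # Cn
print(unicodedata.category(unassigned))    # Cn

# Neither has a character name, so the supplied default is returned.
print(unicodedata.name(noncharacter, None))  # None
print(unicodedata.name(unassigned, None))    # None
```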

[12]

D14 Noncharacter
A code point that is permanently reserved for internal use and that should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
  • For more information, see Section 16.7, Noncharacters.
  • These code points are permanently reserved as noncharacters.
D15 Reserved code point
Any code point of the Unicode Standard that is reserved for future assignment. Also known as an unassigned code point.
  • Surrogate code points and noncharacters are considered assigned code points, but not assigned characters.
  • (remainder omitted)

[13]

16.7 Noncharacters

Noncharacters: U+FFFE, U+FFFF, and Others

Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data. See Section 3.4, Characters and Encoding, for the formal definition of noncharacters and conformance requirements related to their use.

The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.
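The count of 66 can be checked mechanically. The following is a minimal sketch; the function name `is_noncharacter` is chosen for this illustration and is not taken from any specification.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point cp is a Unicode noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF: the contiguous range of 32 code points in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF: the last two code points of each of the 17 planes.
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
```

The mask `0xFFFE` ignores the lowest bit, so a single comparison covers both U+nFFFE and U+nFFFF in every plane.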

Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See conformance clause C7 in Section 3.2, Conformance Requirements.)
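Removing noncharacters from received text, the "appropriate action" suggested above, might look like the following sketch. The helper name `strip_noncharacters` is hypothetical, and removal is only one of the permitted options.

```python
import re

# Character class covering U+FDD0..U+FDEF plus the last two code points
# of each of the 17 planes (U+nFFFE and U+nFFFF).
_PLANE_ENDS = "".join(
    chr((plane << 16) | 0xFFFE) + chr((plane << 16) | 0xFFFF)
    for plane in range(17)
)
_NONCHARACTER_RE = re.compile("[\uFDD0-\uFDEF" + _PLANE_ENDS + "]")


def strip_noncharacters(text: str) -> str:
    """Return text with every noncharacter code point removed."""
    return _NONCHARACTER_RE.sub("", text)


print(strip_noncharacters("a\uFFFEb\U0010FFFFc\uFDD0"))  # abc
```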

In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.

[9]

These codes are intended for process-internal uses, but are not permitted for interchange.

List

[32] The list of noncharacters in the latest version of Unicode is at >>31.

Handling in various applications

HTML

[4] In HTML5, authors must not include noncharacters in documents. HTML parsers must treat a noncharacter as a parse error and replace it with U+FFFD. (HTML5)
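The behavior described in [4] can be sketched as a preprocessing step. This is not an HTML parser and is not taken from the HTML5 text; the function names and the representation of parse errors as a list of offsets are assumptions made for the illustration.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_noncharacters(text: str) -> tuple[str, list[int]]:
    """Replace each noncharacter with U+FFFD and report where errors occurred."""
    output = []
    error_offsets = []  # hypothetical stand-in for reported parse errors
    for offset, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            error_offsets.append(offset)
            output.append(REPLACEMENT_CHARACTER)
        else:
            output.append(ch)
    return "".join(output), error_offsets


cleaned, errors = replace_noncharacters("a\uFDD0b\uFFFF")
print(errors)  # [1, 3]
```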

XML

[6] In XML, a document that contains U+FFFE or U+FFFF is not well-formed. The other noncharacters may be included, but a Note marks them as discouraged.
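The distinction in [6] can be written as a small classifier. This sketch covers only noncharacters; it says nothing about other code points that XML excludes, and the function name and return strings are made up for the example.

```python
def xml_noncharacter_status(cp: int) -> str:
    """Classify a code point according to the XML treatment of noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp in (0xFFFE, 0xFFFF):
        return "not well-formed"
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "discouraged"
    return "not a noncharacter"


for cp in (0xFFFE, 0xFDD0, 0x1FFFF, 0x0041):
    print(f"U+{cp:04X}: {xml_noncharacter_status(cp)}")
```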

U+FDD0..U+FDEF

Range errors

[2] Several specifications mistakenly gave the range of these noncharacters as "U+FDD0..U+FDDF" instead of "U+FDD0..U+FDEF".
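A quick count shows why the mistaken range cannot be right: it covers only half of the 32 code points.

```python
correct = range(0xFDD0, 0xFDEF + 1)
mistaken = range(0xFDD0, 0xFDDF + 1)

print(len(correct))   # 32
print(len(mistaken))  # 16

# Noncharacters that the mistaken range fails to cover.
missed = [f"U+{cp:04X}" for cp in correct if cp not in mistaken]
print(missed[0], missed[-1])  # U+FDE0 U+FDEF
```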

[8] Even the Unicode 5.1 Code Chart PDF contains the incorrect statement:

This block also contains 32 noncharacters in the range FDD0‐FDDF.

<http://www.unicode.org/charts/PDF/UFB50.pdf> "04-Apr-2008 09:52 342K" (as of February 2009)
The Unicode 4.0 PDF <http://www.unicode.org/charts/PDF/Unicode-4.0/U40-FB50.pdf> does not appear to have contained the corresponding passage at all.

[5] XML corrected this error in XML 1.0 4e E02 and XML 1.1 2e E02 (15 August 2007).

[3] HTML5 corrected this error in r2708 (January 2009).

History

[16] These 32 code points were designated as noncharacters in Unicode 3.1.

U+FFFE

[7] So that it can be distinguished from the BOM U+FEFF, U+FFFE is designated as a noncharacter, that is, a code point to which no character is ever assigned.

[15]

U+FFFE. This noncharacter has the intended peculiarity that, when represented in UTF-16 and then serialized, it has the opposite byte sequence of U+FEFF, the byte order mark. This means that applications should reserve U+FFFE as an internal signal that a UTF-16 text stream is in a reversed byte format. Detection of U+FFFE at the start of an input stream should be taken as a strong indication that the input stream should be byte-swapped before interpretation. For more on the use of the byte order mark and its interaction with the noncharacter U+FFFE, see Section 16.8, Specials.
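The byte-order check described in this passage can be sketched as follows. The function name and the returned strings are hypothetical, and the sketch looks only at the first 16-bit code unit of the stream.

```python
import struct


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its first code unit."""
    if len(data) < 2:
        return "unknown"
    # Read the first code unit as if the stream were big-endian.
    (unit,) = struct.unpack(">H", data[:2])
    if unit == 0xFEFF:
        return "big-endian"
    if unit == 0xFFFE:
        # The noncharacter U+FFFE: a BOM seen in reversed byte order.
        return "little-endian"
    return "unknown"


print(detect_utf16_byte_order("\uFEFFhi".encode("utf-16-le")))  # little-endian
print(detect_utf16_byte_order("\uFEFFhi".encode("utf-16-be")))  # big-endian
```

Because U+FFFE can never be a legitimate character, seeing it in the first position is unambiguous evidence of swapped bytes rather than real content.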

U+FFFF

[14]

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
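As one illustration of the sentinel use mentioned here, U+10FFFF can mark the end of each input in a merge of two sorted sequences. This is a sketch under the assumption that neither input contains U+10FFFF itself; the function name is made up for the example.

```python
SENTINEL = "\U0010FFFF"  # noncharacter; no code point compares higher


def merge_sorted(a: str, b: str) -> str:
    """Merge two strings whose characters are in ascending code point order."""
    a += SENTINEL
    b += SENTINEL
    i = j = 0
    merged = []
    # The sentinel is never smaller than a real character, so an exhausted
    # input simply stops contributing until both sentinels are reached.
    while a[i] != SENTINEL or b[j] != SENTINEL:
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    return "".join(merged)


print(merge_sorted("acd", "bz"))  # abcdz
```

The sentinel is appended and consumed inside the function, so it never appears in the returned string.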

U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ..., U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF

History

[17] Unicode 3.1 states that these 32 code points were already noncharacters in Unicode 3.0, but I could not confirm which part of Unicode 3.0 specifies this.

There are 34 specific code points in Unicode 3.0 that are characterized as noncharacters.

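[18] These 32 code points first appeared in the code charts as noncharacters in Unicode 3.1. (Before that, there were no code charts for the surrogate portion.)

[25] According to a post on the Unicode mailing list (>>26), they were added because of a provision in ISO/IEC 10646, and by the time this was discovered it was already too late.

[27] Clause 7 of JIS X 0221-1:2001, which corresponds to ISO/IEC 10646-1:2000, states (translated here from the Japanese):

In no plane shall the code positions FFFE and FFFF be used.

ISO/IEC 10646-1:2000 corresponds to Unicode 3.0. This provision is presumably why Unicode, too, had to treat these code points as noncharacters.

[28] Whether this provision already existed in ISO/IEC 10646-1:1993 has not been confirmed.

[29] The reservation of FFFE can be explained as a way to prevent misidentification of the UCS-4 BOM (although it is unclear which came first, the introduction of the BOM or the reservation of FFFE).

[30] The reason FFFF is reserved is unknown.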

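Processing model

[33] An implementation is allowed to assign its own implementation-dependent meaning to a noncharacter. Consequently, interoperability is not guaranteed for strings that contain noncharacters: a string that one implementation handles correctly may cause unintended behavior in another.

[34] That said, many implementations give noncharacters no special meaning. Apart from some protocols that treat them as errors or replace them with U+FFFD, they appear to be handled in most cases like ordinary unassigned code points.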
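An implementation-dependent meaning of the kind mentioned in [33] might be an internal field delimiter. The sketch below is purely hypothetical: the choice of U+FDD0 and the function names are assumptions, and the point is only that the noncharacter is removed before the data leaves the process.

```python
DELIMITER = "\uFDD0"  # hypothetical process-internal field delimiter


def pack(fields: list[str]) -> str:
    """Join fields into one internal string separated by the noncharacter."""
    for field in fields:
        if DELIMITER in field:
            raise ValueError("field already contains the internal delimiter")
    return DELIMITER.join(fields)


def unpack(packed: str) -> list[str]:
    """Split an internal string back into its fields."""
    return packed.split(DELIMITER)


def export(packed: str, separator: str = ", ") -> str:
    """Produce text for interchange; the noncharacter never leaves the process."""
    return separator.join(unpack(packed))


internal = pack(["alpha", "beta,gamma"])
print(unpack(internal))  # ['alpha', 'beta,gamma']
print(export(internal))  # alpha, beta,gamma
```

Another implementation that received the internal form unchanged would have no way of knowing what U+FDD0 was meant to signify, which is the interoperability problem described in [33].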

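Related

[20] Surrogate code points are not characters, but they are not noncharacters either.

[21] Control characters and U+FEFF ZWNBSP are characters, not noncharacters.

[22] The BOM is a bit combination that depends on the encoding scheme; it is neither a character, nor a noncharacter, nor a code point.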
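The distinctions in [20] and [21] show up in the General_Category values reported by Python's standard `unicodedata` module. This is a small sketch; the labels in the dictionary are informal and chosen for this example.

```python
import unicodedata

samples = {
    "U+0000 (control character)": "\u0000",
    "U+FEFF (ZWNBSP)": "\uFEFF",
    "U+D800 (surrogate code point)": "\uD800",
    "U+FFFF (noncharacter)": "\uFFFF",
}

categories = {label: unicodedata.category(ch) for label, ch in samples.items()}
for label, category in categories.items():
    print(f"{label}: {category}")
```

The control character and ZWNBSP report the categories Cc and Cf, the surrogate reports Cs, and the noncharacter reports Cn, the same value used for unassigned code points.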

[23] unicode - What's the purpose of the noncharacters U+FDD0 to U+FDEF? - Stack Overflow <http://stackoverflow.com/questions/5188679/whats-the-purpose-of-the-noncharacters-ufdd0-to-ufdef>

[24] Unicode Character Encoding Stability Policy <http://unicode.org/policies/stability_policy.html#Property_Value>

[35] Define control and noncharacter (by annevk) <https://github.com/whatwg/infra/commit/ad1b87aecce01759096fcdbf6acc2bd6096c3168>

[36] Editorial: use noncharacter and control from Infra (by annevk) <https://github.com/whatwg/html/commit/70925237a88d9802bfe7224fe9c78b146af615be>

[37] Editorial: use noncharacter from Infra (by annevk) <https://github.com/whatwg/url/commit/4a4c55959bec4f091373723bd0d507432d4b3dac>