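UTF-8 (character encoding)

[34] UTF-8 is a standard character encoding used widely around the world; on Linux and on the Web it is the encoding used as a rule.

Unicode and UTF-8

[44] Unicode assigns codes to a wide range of characters. There is more than one way to represent those codes, and several methods have been used historically. The most widely used of them today is UTF-8. UTF-8 represents the code of each character as a sequence of 1 to 4 bytes derived by a fixed rule.

[45] The character "a" (LATIN SMALL LETTER A) is assigned the code point U+0061 in Unicode. In UTF-8 it is represented by the byte sequence 0x61.

[46] The character "字" (CJK UNIFIED IDEOGRAPH-5B57) is assigned the code point U+5B57 in Unicode. In UTF-8 it is represented by the byte sequence 0xE5 0xAD 0x97.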
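The two examples in [45] and [46] can be checked with a short Python sketch. The helper name utf8_hex is made up for this illustration; str.encode is the standard library call doing the actual work.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


# U+0061 fits in 7 bits, so it is a single byte equal to its ASCII value.
print(utf8_hex("a"))   # 61

# U+5B57 = 0101 101101 010111 in binary, which is spread over three bytes:
# 1110_0101 (0xE5), 10_101101 (0xAD), 10_010111 (0x97).
print(utf8_hex("字"))  # E5 AD 97
```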

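Definition

Properties

[38] It is an ASCII-compatible encoding.

[39] The bytes 0x00-0x7F always represent U+0000-U+007F.

[40] Every code point in U+0000-U+10FFFF can be encoded (surrogate code points excepted; see the section on noncharacters).

[48] The byte sequence corresponding to each code point is uniquely determined.

[49] Substring matching on strings and substring matching on the corresponding byte sequences give the same result.

[50] Sorting strings by code point and sorting the corresponding byte sequences give the same order.

[51] Not every byte sequence represents a valid string.

[52] As long as no BOM is used, simply concatenating the byte sequences of strings yields the byte sequence of the concatenated string.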
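Several of the properties above ([49] to [52]) can be spot-checked in Python. This is only a small sketch on a handful of sample strings, not a proof; the sample data and the helper is_valid_utf8 are assumptions made for the illustration.

```python
words = ["z", "é", "字", "a", "\U0001F600", "\uFFFF"]

# [50] Code point order and byte order agree (unlike UTF-16, where
# U+FFFF would sort after the surrogate pair of U+1F600).
by_code_point = sorted(words, key=lambda w: [ord(c) for c in w])
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
same_order = by_code_point == by_bytes

# [49] A substring match on strings agrees with one on the bytes.
same_match = ("字" in "a字b") == ("字".encode("utf-8") in "a字b".encode("utf-8"))

# [52] Concatenating encoded strings equals encoding the concatenation.
concat_ok = ("a" + "字").encode("utf-8") == "a".encode("utf-8") + "字".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """[51] Report whether data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```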

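Relationship with the BOM

[2] A BOM is not required in UTF-8. See BOM (>>9) for details. The claim that a BOM is mandatory is a rumor spread by people who wanted it to be so.

[8] Using a BOM sacrifices ASCII compatibility, one of the important properties of UTF-8. In most cases using a BOM is considered inappropriate.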
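The loss of ASCII compatibility described in [8] can be seen with Python's "utf-8-sig" codec, which writes a BOM when encoding and strips one leading BOM when decoding. The variable names below are only for this sketch.

```python
without_bom = "abc".encode("utf-8")      # plain UTF-8
with_bom = "abc".encode("utf-8-sig")     # UTF-8 preceded by EF BB BF

# Without a BOM, pure-ASCII text is byte-for-byte identical to ASCII.
ascii_compatible = without_bom == "abc".encode("ascii")

# With a BOM, the bytes are no longer valid ASCII.
try:
    with_bom.decode("ascii")
    bom_is_ascii = True
except UnicodeDecodeError:
    bom_is_ascii = False

# Naive concatenation also breaks: the second BOM survives as U+FEFF.
joined = (with_bom + with_bom).decode("utf-8-sig")
```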

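Relationship with noncharacters

[3] Since Unicode 3.0, code points in the S-area (that is, their UTF-8 representations) have been prohibited. In contrast, noncharacters such as U+FFFF (that is, their UTF-8 representations) are not prohibited. Noncharacters are not intended for information interchange, so one that has slipped into interchanged data is an error; but they may be used for internal processing, and the reasoning appears to be that prohibiting them in the UTFs would leave them with no purpose.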

Charset name

[13] The charset name registered with IANA is utf-8.

[14] In HTTP, utf8 is occasionally used by mistake as the charset name.

Relationship with Unicode versions

[15] unicode-1-1-utf-8: ・・・ RFC 1641, RFC 2279

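UTF-8 in HTML

[42] If an XML document contains <meta charset>, its value must be UTF-8 (ASCII case-insensitive). Consequently the character encoding of that document must be UTF-8.

See <meta charset>.

[43] This constraint does not apply to HTML documents, or to XML documents that do not contain <meta charset>.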

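Encoding

[108] UTF-8 encode takes a code point stream stream and proceeds as follows >>85.

[109] Return the result of running encode with stream set to stream and encoding set to UTF-8.

[113] It is used in the following contexts.

[132] Contexts in which UTF-8 encode is used

[111] The encode method of TextEncoder does not use UTF-8 encode, but is effectively equivalent to it.

[110] In addition, the encode operation is sometimes run with UTF-8 as the encoding.

[140] Note that even where a specification uses UTF-8 encode, the result may in practice amount to isomorphic encode.

[95] No BOM is generated. If the string begins with ZWNBSP, decode and UTF-8 decode treat it as a BOM, so the ZWNBSP is lost when the bytes are turned back into a string.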
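The round-trip loss noted in [95] can be reproduced in Python, using the "utf-8-sig" codec as a stand-in for a decoder that strips a leading BOM. This is an analogy only; it is not the Encoding Standard's algorithm itself.

```python
original = "\ufeffabc"                       # starts with ZWNBSP (U+FEFF)

encoded = original.encode("utf-8")           # no BOM is added by the encoder
bom_stripping = encoded.decode("utf-8-sig")  # decoder that drops a leading BOM
bom_preserving = encoded.decode("utf-8")     # decoder that keeps U+FEFF

# The leading ZWNBSP is indistinguishable from a BOM, so it disappears.
lost = bom_stripping != original
```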

Decoding

[86] UTF-8 decode takes a byte stream stream and proceeds as follows >>85.

[87] Set buffer to an empty byte sequence.
[88] Read three bytes from stream into buffer.
[89] If buffer is not 0xEF 0xBB 0xBF,
1. [90] Prepend buffer to stream.
[91] Set output to a code point stream.
[92] Run the UTF-8 decoder, with stream as input and output as output.
[93] Return output.

[94] In other words, this operation ignores a leading BOM.
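The steps [87] to [93] can be sketched in Python as follows. The sketch works on a complete bytes object instead of a stream, and it delegates the decoder itself to Python's built-in codec with errors="replace", on the assumption that its U+FFFD placement is close enough for illustration. The function name utf8_decode is hypothetical.

```python
BOM = b"\xef\xbb\xbf"


def utf8_decode(stream: bytes) -> str:
    """Decode UTF-8, dropping one leading BOM if present."""
    # [87]-[88] Read up to three bytes into a buffer.
    buffer, rest = stream[:3], stream[3:]
    # [89]-[90] If they are not a BOM, put them back in front of the stream.
    if buffer != BOM:
        rest = buffer + rest
    # [91]-[93] Run the decoder; invalid input becomes U+FFFD.
    return rest.decode("utf-8", errors="replace")
```

Only the first BOM is dropped; a second U+FEFF immediately after it is kept as an ordinary character.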

[96] UTF-8 decode without BOM takes a byte stream stream and proceeds as follows >>85.

[97] Set output to a code point stream.
[98] Run the UTF-8 decoder, with stream as input and output as output.
[99] Return output.

[100] Even if a BOM appears at the start, it is treated as ZWNBSP rather than as a BOM.

[101] UTF-8 decode without BOM or fail takes a byte stream stream and proceeds as follows >>85.

[102] Set output to a code point stream.
[103] Set result to the result of running the UTF-8 decoder, with stream as input, output as output, and error mode fatal.
[105] If result is error,
1. [106] Return failure.
[107] Otherwise,
1. [104] Return output.
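The two BOM-less operations ([96] and [101]) differ only in their error mode. A minimal Python sketch, assuming the built-in codec as the decoder and using hypothetical function names, with None standing in for "failure":

```python
from typing import Optional


def utf8_decode_without_bom(data: bytes) -> str:
    """Decode UTF-8, keeping a leading U+FEFF and replacing errors with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def utf8_decode_without_bom_or_fail(data: bytes) -> Optional[str]:
    """Decode UTF-8 in fatal mode; return None (failure) on any error."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

Neither function strips a BOM, so a leading EF BB BF comes back as U+FEFF.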

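[112] The specification describes these operations as procedures that run synchronously. In practice, however, depending on the calling context, the expectation appears to be that the input stream is read incrementally, that what has been read is fed to the decoder in order, and that the results are written to the output stream as they become available.

[114] For the contexts in which these operations are invoked, see decode.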

[117] The UTF-8 decoder object has the following state >>116.

UTF-8 code point: initially 0.
UTF-8 bytes seen: initially 0.
UTF-8 bytes needed: initially 0.
UTF-8 lower boundary: initially 0x80.
UTF-8 upper boundary: initially 0xBF.

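[118] The handler of a UTF-8 decoder object decoder, given stream and token (called byte in the steps below), proceeds as follows >>116.

[119] If token is end-of-stream,
1. [122] If decoder's UTF-8 bytes needed is not 0,
  1. [120] Set decoder's UTF-8 bytes needed to 0.
  2. [121] Return error.
2. [123] Otherwise,
  1. [124] Return finished.
[125] Otherwise, if decoder's UTF-8 bytes needed is 0,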
1. [126] Depending on byte, set decoder's UTF-8 bytes needed, UTF-8 code point, UTF-8 upper boundary, and UTF-8 lower boundary wherever the following table specifies a value, and return the return value.
  Byte                       | UTF-8 bytes needed | UTF-8 code point | UTF-8 lower boundary | UTF-8 upper boundary | Return value
  [ 0x00, 0x7F ]             | -                  | -                | -                    | -                    | code point with the same value as byte
  [ 0xC2, 0xDF ]             | 1                  | byte & 0x1F      | -                    | -                    | continue
  0xE0                       | 2                  | byte & 0xF       | 0xA0                 | -                    | continue
  [ 0xE1, 0xEC ], 0xEE, 0xEF | 2                  | byte & 0xF       | -                    | -                    | continue
  0xED                       | 2                  | byte & 0xF       | -                    | 0x9F                 | continue
  0xF0                       | 3                  | byte & 0x7       | 0x90                 | -                    | continue
  [ 0xF1, 0xF3 ]             | 3                  | byte & 0x7       | -                    | -                    | continue
  0xF4                       | 3                  | byte & 0x7       | -                    | 0x8F                 | continue
  anything else              | -                  | -                | -                    | -                    | error
[146] Otherwise, if byte is not in the range [ decoder's UTF-8 lower boundary, decoder's UTF-8 upper boundary ],
1. [147] Set decoder's UTF-8 bytes needed to 0.
2. [150] Set decoder's UTF-8 bytes seen to 0.
3. [148] Set decoder's UTF-8 code point to 0.
4. [149] Set decoder's UTF-8 lower boundary to 0x80.
5. [151] Set decoder's UTF-8 upper boundary to 0xBF.
6. [152] Prepend byte to stream.
7. [153] Return error.
[154] Otherwise,
1. [156] Increment decoder's UTF-8 bytes seen.
2. [157] Set decoder's UTF-8 code point to (decoder's UTF-8 code point << 6) | (byte & 0x3F).
3. [158] Set decoder's UTF-8 lower boundary to 0x80.
4. [159] Set decoder's UTF-8 upper boundary to 0xBF.
5. [155] If decoder's UTF-8 bytes needed and decoder's UTF-8 bytes seen are not equal,
  1. [160] Return continue.
6. [161] Otherwise,
  1. [162] Set code point to a code point whose value is decoder's UTF-8 code point.
  2. [163] Set decoder's UTF-8 code point to 0.
  3. [164] Set decoder's UTF-8 bytes seen to 0.
  4. [165] Set decoder's UTF-8 bytes needed to 0.
  5. [166] Return code point.

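[127] As a result, for ill-formed sequences such as non-shortest forms, encodings of U-00110000 and above, and surrogates, each constituent byte is replaced with U+FFFD. A sequence that starts validly but is cut short yields a single U+FFFD for the bytes consumed so far, and the offending byte is then processed again from the initial state.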
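The handler steps above ([117] to [166]) can be transcribed fairly directly into Python. The following is a sketch for study, not a production decoder: the class name, the string sentinels, and the driver function decode (which maps every error to U+FFFD) are assumptions made for this example, and the stream is modelled as a deque of byte values so that a byte can be prepended.

```python
from collections import deque

# Hypothetical sentinels for the handler's non-code-point results.
CONTINUE, FINISHED, ERROR = "continue", "finished", "error"


class Utf8Decoder:
    def __init__(self) -> None:
        self.code_point = 0    # UTF-8 code point
        self.bytes_seen = 0    # UTF-8 bytes seen
        self.bytes_needed = 0  # UTF-8 bytes needed
        self.lower = 0x80      # UTF-8 lower boundary
        self.upper = 0xBF      # UTF-8 upper boundary

    def handler(self, stream: deque, byte):
        """Handle one byte (None = end-of-stream); return an int or a sentinel."""
        if byte is None:
            if self.bytes_needed != 0:
                self.bytes_needed = 0
                return ERROR
            return FINISHED
        if self.bytes_needed == 0:
            # Lead byte: consult the table in [126].
            if byte <= 0x7F:
                return byte
            if 0xC2 <= byte <= 0xDF:
                self.bytes_needed, self.code_point = 1, byte & 0x1F
            elif 0xE0 <= byte <= 0xEF:
                if byte == 0xE0:
                    self.lower = 0xA0  # rejects non-shortest 3-byte forms
                elif byte == 0xED:
                    self.upper = 0x9F  # rejects surrogates
                self.bytes_needed, self.code_point = 2, byte & 0xF
            elif 0xF0 <= byte <= 0xF4:
                if byte == 0xF0:
                    self.lower = 0x90  # rejects non-shortest 4-byte forms
                elif byte == 0xF4:
                    self.upper = 0x8F  # rejects values above U+10FFFF
                self.bytes_needed, self.code_point = 3, byte & 0x7
            else:
                return ERROR
            return CONTINUE
        if not (self.lower <= byte <= self.upper):
            # [146]-[153] Reset, push the byte back, and report an error.
            self.code_point = self.bytes_seen = self.bytes_needed = 0
            self.lower, self.upper = 0x80, 0xBF
            stream.appendleft(byte)
            return ERROR
        # [154]-[166] Valid continuation byte.
        self.bytes_seen += 1
        self.code_point = (self.code_point << 6) | (byte & 0x3F)
        self.lower, self.upper = 0x80, 0xBF
        if self.bytes_seen != self.bytes_needed:
            return CONTINUE
        code_point = self.code_point
        self.code_point = self.bytes_seen = self.bytes_needed = 0
        return code_point


def decode(data: bytes) -> str:
    """Drive the handler over data, emitting U+FFFD for each error."""
    decoder, stream, output = Utf8Decoder(), deque(data), []
    while True:
        byte = stream.popleft() if stream else None
        result = decoder.handler(stream, byte)
        if result == FINISHED:
            return "".join(output)
        if result == ERROR:
            output.append("\ufffd")
        elif result != CONTINUE:
            output.append(chr(result))
```

For example, decode(b"\xed\xa0\x80") (an encoded surrogate) gives three U+FFFD, while decode(b"\xe3\x81A") gives one U+FFFD followed by "A", matching the behaviour described in [127].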

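The `UTF-8` of HTTP authentication

[54] The charset="" parameter of Basic and Digest authentication can take the value UTF-8 >>53, >>58, >>69.

[55] This means normalizing the string to NFC and then converting it to a byte sequence with RFC 3629 UTF-8 >>53, >>69.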
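For Basic authentication, the conversion in [55] might look like the following Python sketch. The helper name basic_credentials is hypothetical, and the sketch covers only the NFC, UTF-8, and Base64 steps; it does not apply or check the RFC 7613 profiles.

```python
import base64
import unicodedata


def basic_credentials(user_id: str, password: str) -> str:
    """Build the Basic credentials token for charset="UTF-8" (sketch)."""
    if ":" in user_id:
        # ":" separates user-id and password, so it cannot occur in the user-id.
        raise ValueError("user-id must not contain ':'")
    user_pass = unicodedata.normalize("NFC", f"{user_id}:{password}")
    return base64.b64encode(user_pass.encode("utf-8")).decode("ascii")


# The result goes into the request header as: Authorization: Basic <token>
token = basic_credentials("test", "123£")
```

Because of the NFC step, a user-id typed as "e" followed by U+0301 and one typed as the precomposed U+00E9 produce the same token.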

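[56] For user identifiers, recipients must support all characters of the RFC 7613 UsernameCasePreserved profile except : >>53, >>69.

[70] The : appears to be excluded because it is used as a delimiter, in credentials for Basic authentication and in A1 for Digest authentication.

[57] For passwords, recipients must support all characters of the RFC 7613 OpaqueString profile >>53, >>69.

[71] Support for the characters of the profiles is required, but applying the profiles themselves does not appear to be. Nor does the use of characters that the profiles prohibit appear to be forbidden.

[76] It is unclear what "support" means here. Beyond the constraints of HTTP itself, the characters usable in user names and passwords are likely to be constrained by applications, organizations, and so on. It is hard to imagine that the IETF intends to interfere with an organization's security policy or an application's database design, so even characters included in a profile may well be unusable. A server normally answers with a 401 error whenever the user name or password fails to match the real one, so whether a server supports the characters of these profiles at the HTTP level would seem to be unobservable from outside. If "support" means applying the mappings and normalization defined by the profiles, that would be observable, but does any such implementation exist?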

History

FSS-UTF

[203] wg20-n193-fss-utf.pdf, 2009-02-13T23:43:50.000Z, 2023-07-16T14:42:44.520Z https://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf

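Definition in ISO/IEC 10646

[36] In ISO/IEC 10646, code points run up to U-7FFFFFFF, and all of them can be represented in UTF-8. In this definition a single character takes at most 6 bytes in UTF-8.

[37] No characters are assigned to code points at or above U-00110000 (there used to be private use areas there, but they have been removed), yet as far as UTF-8 is concerned those code points should still exist.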
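The historical 1-to-6-byte scheme mentioned in [36] can be sketched as follows. This illustrates the older definition only, not today's UTF-8, which stops at 4 bytes and U+10FFFF; the function name encode_utf8_original is made up for this example, and no check for surrogates is made.

```python
def encode_utf8_original(cp: int) -> bytes:
    """Encode cp (0 to 0x7FFFFFFF) with the original up-to-6-byte scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, total length in bytes, lead byte marker)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in layouts:
        if cp < limit:
            first = lead | (cp >> (6 * (length - 1)))
            # Each trailing byte carries 6 bits, most significant first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(length - 2, -1, -1)]
            return bytes([first] + tail)
    raise AssertionError("unreachable")
```

Under this scheme U-7FFFFFFF becomes FD BF BF BF BF BF, and U-00110000 becomes F4 90 80 80, which a current UTF-8 decoder rejects.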

Definition by RFCs

[196] RFC 2044 - UTF-8, a transformation format of Unicode and ISO 10646, 2021-01-31T15:16:01.000Z, 2021-03-22T09:58:50.095Z https://tools.ietf.org/html/rfc2044

[197] RFC 2279 - UTF-8, a transformation format of ISO 10646, 2021-03-14T22:21:32.000Z, 2021-03-23T12:03:30.417Z https://tools.ietf.org/html/rfc2279

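[6] 2003-11-10 23:49:29 +00:00 Anonymous: RFC 3629 (= STD 63), an IETF Full Standard, has finally been published.

[41] Net-Unicode is a profile that adds to it provisions on the use of control characters and on normalization.

[199] Specifications that reference RFC 3629: the IETF version of JSON, SASL ANONYMOUS

[200] JOSE references RFC 3629 and describes its specifications using the function-like notation UTF8(STRING). >>191, >>195, >>202, >>187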

[191] RFC 7515 - JSON Web Signature (JWS) (2020-03-29 16:13:43 +09:00) https://tools.ietf.org/html/rfc7515#section-1.1
UTF8(STRING) denotes the octets of the UTF-8 [RFC3629] representation
of STRING, where STRING is a sequence of zero or more Unicode
[UNICODE] characters.
[195] RFC 7516 - JSON Web Encryption (JWE), 2022-11-23T02:45:29.000Z https://datatracker.ietf.org/doc/html/rfc7516#section-1.1
[202] RFC 7517: JSON Web Key (JWK), 2022-11-25T08:33:34.000Z https://www.rfc-editor.org/rfc/rfc7517.html#section-1.1
[187] RFC 7518 - JSON Web Algorithms (JWA) (2019-11-27 04:11:14 +09:00) https://tools.ietf.org/html/rfc7518#section-1.1
UTF8(STRING) denotes the octets of the UTF-8 [RFC3629] representation
of STRING, where STRING is a sequence of zero or more Unicode
[UNICODE] characters.

[201] There is also a similar ASCII(STRING).

MLSF

UTF-8b

CESU-8

[7] CESU-8 represents code points at or above U+10000 with surrogate pairs, as UTF-16 does, each surrogate being encoded as a three-byte sequence.
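A CESU-8 encoder can be sketched in Python by splitting supplementary code points into a surrogate pair and encoding each half as three bytes. The function name encode_cesu8 is hypothetical; the "surrogatepass" error handler is a standard Python feature that lets the UTF-8 codec emit bytes for surrogates.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8 (sketch; lone surrogates in text raise an error)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

U+10000 thus becomes the six bytes ED A0 80 ED B0 80 instead of the four bytes F0 90 80 80 of UTF-8.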

Web UTF-8

[19] Web Applications 1.0 specifies how to "decode a byte string as UTF-8, with error handling". It defines how invalid byte sequences are replaced with U+FFFD as appropriate when a UTF-8 byte sequence is decoded into a string.

[18] Web Applications 1.0 http://www.whatwg.org/specs/web-apps/current-work/complete.html#decoded-as-utf-8,-with-error-handling

[17] Web Applications 1.0 r5530 Tighten up UTF-8 error handling definitions Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=9663 ( (2010-09-29 04:16:00 +09:00 版)) http://html5.org/tools/web-apps-tracker?from=5529&to=5530

[47] This later led to the definition in the Encoding Standard. The current HTML Standard references the Encoding Standard.

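Implementations

UTF-8 in Emacs

[11] If opening a UTF-8 file in Emacs garbles all the kanji for some reason (while kana and the like are fine), putting the following in .emacs or a similar file may fix it: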

(prefer-coding-system 'utf-8-unix)

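UTF-8 in Perl

[12] Perl's Encode module has both "utf8" and "utf-8". "utf8" is Perl's internal encoding, which uses a UTF-8-like scheme; "utf-8" is Unicode's UTF-8.

The utf8 flag and utf8 strings

・・・

`use utf8`

use utf8

The `utf8` encoding

[66] Perl's Encode module supports both "utf8" and "utf-8" as character encoding names. The former denotes UTF-8 as the internal encoding of Perl strings (or rather its representation as a byte string); the latter denotes UTF-8, the character encoding scheme of Unicode.

The `:utf8` PerlIO layer

[67] As perldoc PerlIO does in fact document, input through the PerlIO layer :utf8 silently accepts invalid input. Even if the file being read is an invalid byte sequence as UTF-8, it apparently becomes an SV with the utf8 flag set, without any warning at all. So when the input may be an invalid byte sequence, :encoding(utf8) must be used. That layer apparently replaces invalid bytes with \xHH.

:utf8 is probably the faster of the two, but if invalid bytes have slipped in, nothing happens at read time; the error Malformed UTF-8 character (fatal) appears only later, when some operation is performed on the string that was read, which may make debugging harder.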

Relationship with EBCDIC environments

・・・

Memo

[68] Perl, utf8 フラグ, ハッシュ, リテラル, => - 冬通りに消え行く制服ガールは、夢物語にリアルを求めない。 - subtech (2010-03-25 09:56:06 +09:00 版) https://subtech.g.hatena.ne.jp/cho45/20100323/1269329227

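Java's UTF

[1] The UTF-8 variant implemented by Java uses 0xC0 0x80 (instead of 0x00) to represent U+0000. (This keeps the byte 0x00 out of encoded strings, so that it can serve as a string terminator.)

[4] Such "non-shortest" representations are prohibited in UTF-8. The standards, however, initially overlooked this. The designers are said to have been careful about it from the start, but it apparently slipped through somewhere along the way.
[5] In practice, some UTF-8 decoders in the wild decode non-shortest forms "as intended", but doing so is prohibited.

[9] Java modified UTF-8 (which used to be called simply UTF-8 in the Java world) is apparently >>1 combined with CESU-8.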
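Taking [1] and [9] at face value, a modified UTF-8 encoder can be sketched by walking the UTF-16 code units of a string. The function name encode_modified_utf8 is made up for this example, and the sketch ignores the length prefix that Java's DataOutput format adds.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as Java-style modified UTF-8 (sketch, no length prefix)."""
    units = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = (units[i] << 8) | units[i + 1]
        if unit == 0:
            out += b"\xc0\x80"  # U+0000 as a two-byte non-shortest form
        else:
            # Surrogates are encoded individually, as in CESU-8.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def strict_utf8_accepts(data: bytes) -> bool:
    """Report whether a standard UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

As [4] and [5] point out, the C0 80 form is not valid UTF-8, so strict_utf8_accepts(b"\xc0\x80") is False.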

Supplementary Characters in the Java Platform http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

[10] JNI Types and Data Structures http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html

[204] DataInput (Java Platform SE 8 ), 2023-09-25T04:33:17.000Z https://docs.oracle.com/javase/jp/8/docs/api/java/io/DataInput.html#modified-utf-8

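MySQL's UTF8

[25] In MySQL, the original utf8 could represent only sequences of up to 3 bytes. It was later given the alias utf8mb3, and utf8mb4, which can represent up to 4 bytes, was added separately.

[61] MySQL's CHARSET utf8 and utf8mb3 are UTF-8 restricted to the range in which one character fits in 3 bytes or fewer.

[62] To be able to represent the full range of Unicode, utf8mb4 must be used.

Memo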
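Whether a given string fits within the 3-byte limit of utf8mb3 can be checked on the application side with a sketch like the one below. The function name fits_utf8mb3 is hypothetical, and this is not a MySQL API; it only measures UTF-8 lengths.

```python
def fits_utf8mb3(text: str) -> bool:
    """Report whether every character of text needs at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Characters in the Basic Multilingual Plane fit; emoji such as U+1F600 do not.
samples = {"字": fits_utf8mb3("字"), "\U0001F600": fits_utf8mb3("\U0001F600")}
```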

[16] Official Google Blog: Unicode nearing 50% of the web (2010-01-29 05:36:25 +09:00 版) http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

[20] Web Applications 1.0 r5940 typo in the allowed UTF-8 ranges ( (2011-03-04 11:06:00 +09:00 版)) http://html5.org/tools/web-apps-tracker?from=5939&to=5940

[21] Web Applications 1.0 r5942 Fix the UTF-8 decoder error handling to handle a few errors I'd missed, including in particular surrogate halves. This may be a mistake; if I'm forgetting something please let me know so I can fix it. (e.g. did we decide not to catch surrogates or something?) ( (2011-03-04 11:56:00 +09:00 版)) http://html5.org/tools/web-apps-tracker?from=5941&to=5942

[22] RFC 6120 - Extensible Messaging and Presence Protocol (XMPP): Core ( (2011-03-31 08:23:45 +09:00 版)) http://tools.ietf.org/html/rfc6120#section-11.6

[23] 「私のために争わないで」文字コードのUTF8さん、自殺 : bogusnews ( (2012-04-01 10:06:51 +09:00 版)) http://bogusne.ws/article/41580267.html

[24] IRC logs: freenode / #whatwg / 20120419 ( (2012-04-24 21:49:36 +09:00 版)) http://krijnhoetmer.nl/irc-logs/whatwg/20120419

[26] Web Applications 1.0 r7647 Embrace the Encodings specification. ( (2013-01-24 10:38:00 +09:00 版)) http://html5.org/tools/web-apps-tracker?from=7646&to=7647

[27] Provide better encoding label guidance. (Basically require utf-8 all ove... · a454d2e · whatwg/encoding ( (2013-02-16 12:48:30 +09:00 版)) https://github.com/whatwg/encoding/commit/a454d2e543964b8d5432778ff917324e8032b78c

[28] Web Applications 1.0 r7782 Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec. ( (2013-03-30 03:45:00 +09:00 版)) http://html5.org/tools/web-apps-tracker?from=7781&to=7782

[29] 3 - JNI Types and Data Structures ( (2011-09-07 21:57:56 +09:00 版)) http://docs.oracle.com/javase/1.3/docs/guide/jni/spec/types.doc.html#16542

[30] jwerle/libutf8 ( (2013-12-16 00:12:32 +09:00 版)) https://github.com/jwerle/libutf8

[31] Web Applications 1.0 r8405 Various editorial tweaks. ( (2014-01-17 17:12:00 +09:00 版)) http://html5.org/tools/web-apps-tracker?from=8404&to=8405

[32] Gmail's default setting (for Japanese?) appears to have changed from ISO-2022-JP to UTF-8 around mid-April 2014. (Between the 13th and the 17th?)

[33] JNI Types and Data Structures ( (2014-05-08 13:10:38 +09:00 版)) http://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/types.html#wp16542

[35] Web Applications 1.0 r8722 Adjust notes on encoding detection ( (2014-08-28 08:12:00 +09:00 版)) http://html5.org/r/8722

[60] Fix #19: reword utf-8 decoder step to avoid extra parenthesis · whatwg/encoding@05d9649 (2015-12-16 12:30:40 +09:00 版) https://github.com/whatwg/encoding/commit/05d96490cba5e800e40b76bfd4acc7e7ff2981ae

[77] 【開発】.net C#でRSS/Atom feedとか読んでみる | 鍋風呂 (2016-01-07 17:51:33 +09:00 版) http://blog.ahh.jp/?p=1007

<?xml encode=”***”>が適当。結構多かった。”utf-8”となるべきところ、”utf8”となっている。

[78] Add utf-8 decode without BOM or fail for HTML · whatwg/encoding@4b20911 (2016-02-10 21:42:33 +09:00 版) https://github.com/whatwg/encoding/commit/4b209111f7ab450eb1935551159b98413b5c23e0

[79] Use utf-8 decode without BOM rather than UTF-8 decoder · whatwg/html@39a2e6c (2016-02-10 21:44:11 +09:00 版) https://github.com/whatwg/html/commit/39a2e6cde3b4820db56fabe1859de0dc0e6ed8d9

[80] Drop dependencies on Encoding Standard's decoder concept · whatwg/url@37f9329 (2016-02-11 12:00:57 +09:00 版) https://github.com/whatwg/url/commit/37f932928378c0df521034cfd223f4ba603ef476

[81] Update integration with Encoding Standard · whatwg/html@6a31c26 (2016-02-14 18:47:22 +09:00 版) https://github.com/whatwg/html/commit/6a31c26cf12e39dab1a488e75dd56c03d6786d39

[82] RFC 4194 - The S Hexdump Format (2015-12-13 15:30:39 +09:00 版) https://tools.ietf.org/html/rfc4194#section-9

Required parameters: charset
This parameter must exist and must be set to "UTF-8". No other
character sets are allowed for transporting SHF data. The character
set designator MUST be uppercase.

[83] Editorial: avoid teaching bad UTF-8 math · whatwg/encoding@19a25b5 (2016-04-07 11:08:35 +09:00 版) https://github.com/whatwg/encoding/commit/19a25b5fcae895853d964b7ee6afa2fe9b6070a8

[84] UTF-8 processing using SIMD (SSE4) (2016-05-25 18:28:34 +09:00) https://woboq.com/blog/utf-8-processing-using-simd.html

[128] Parse application/x-www-form-urlencoded using UTF-8 only (annevk著, 2017-01-17 19:11:02 +09:00) https://github.com/whatwg/url/commit/3fe969679f78c92c353047661b0c4b6797f099f6

[129] Thunderbird/SeaMonkey の既定のテキストエンコーディングを UTF-8 に変更する · Issue #63 · mozilla-japan/gecko-l10n (2017-02-04 13:50:20 +09:00) https://github.com/mozilla-japan/gecko-l10n/issues/63

[130] Use Encoding's "UTF-8 encode" hook. (mkruisselbrink著, 2017-02-04 09:05:21 +09:00) https://github.com/w3c/FileAPI/commit/64c346deb9132a8cefc1ce79050256cfc64fcc72

[131] RFC 8160 - IUTF8 Terminal Mode in Secure Shell (SSH) (2017-04-20 17:11:00 +09:00) https://tools.ietf.org/html/rfc8160

[115] Upwork API Reference (2017-03-13 18:51:45 +09:00) https://developers.upwork.com/?lang=python#getting-started_encoding

UTF-8 is the default encoding for XML and since 2010, it has become the dominant character set on the Web.

[133] XLIFF Version 2.0 (2014-08-06 01:00:00 +09:00) http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#d0e15952

[134] For your information: Illegal UTF-8 proposal · Issue #112 · whatwg/encoding (2017-05-16 13:19:52 +09:00) https://github.com/whatwg/encoding/issues/112

[135] Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 (2017-05-16 00:37:21 +09:00) http://unicode.org/pipermail/unicode/2017-May/005389.html

[136] Learderboard - Feature Phone - Mobage Developers Documentation Center (2017-05-18 01:16:25 +09:00) https://docs.mobage.com/display/JPFP/Leaderboard_FP

format
Specifies the response format.
Value | Description | Remarks
json | "application/json; charset=utf8" | optional; default value

[137] 19938 – Number of decoder errors emitted by the UTF-8 decoder for incomplete/invalid sequences (2017-05-20 13:17:18 +09:00) https://www.w3.org/Bugs/Public/show_bug.cgi?id=19938

[138] How Many REPLACEMENT CHARACTERs? (Henri Sivonen著, 2017-05-31 21:26:02 +09:00) https://hsivonen.fi/broken-utf-8/

[139] 過去に投稿された記事で、一部の「絵文字」が文字化けする不具合がありました - はてなブログ開発ブログ (2017-07-11 17:44:37 +09:00) http://staff.hatenablog.com/entry/2017/07/11/142000

On Wednesday, June 7, 2017, there was a defect in which "emoji" in some articles previously posted to Hatena Blog turned into ? (question marks).

[141] Use Infra for JSON parsing (annevk著, 2017-09-29 21:18:40 +09:00) https://github.com/whatwg/fetch/commit/d095af0ebb3343d294c37fab5c124b1a2534b6a6

[142] Use Infra for JSON parsing by annevk · Pull Request #610 · whatwg/fetch (2017-10-06 15:01:48 +09:00) https://github.com/whatwg/fetch/pull/610

[143] Use Infra for JSON parsing (annevk著, 2017-09-29 21:22:12 +09:00) https://github.com/whatwg/xhr/commit/ed83926c8236d14cc8720f023e09658d8bdd00d3

[144] Require UTF-8 (sideshowbarker著, 2017-10-06 19:09:17 +09:00) https://github.com/whatwg/html/commit/fae77e3c558b9f083dfb9086752863a4789268f5

[145] Require utf-8 when specifying character encoding by sideshowbarker · Pull Request #3091 · whatwg/html (2017-11-03 19:52:38 +09:00) https://github.com/whatwg/html/pull/3091

[167] Windows 10のInsider PreviewでシステムロケールをUTF-8にするオプションが追加される | スラド (2017-11-14 18:22:51 +09:00) https://srad.jp/story/17/11/14/0640253/

[168] Should UTF-8 'as specified in' point to the Encoding spec? · Issue #253 · w3c/imsc (2017-11-21 11:58:08 +09:00) https://github.com/w3c/imsc/issues/253

[169] Timed Text Working Group Teleconference -- 09 Nov 2017 (2017-11-11 10:30:00 +09:00) https://www.w3.org/2017/11/09-tt-minutes.html#item08

[170] A Branchless UTF-8 Decoder « null program (2017-12-22 10:57:17 +09:00) http://nullprogram.com/blog/2017/10/06/

[171] UTF-8 history (2009-06-30 20:00:24 +09:00) http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

[172] Editorial: uppercase UTF-8 (annevk著, 2018-04-17 18:41:45 +09:00) https://github.com/whatwg/xhr/commit/aeaa4432bc39ab171d6fede3790bfb4ee3255990

[173] Editorial: uppercase UTF-8 (and other encodings, if any) · Issue #196 · whatwg/xhr (2018-04-18 13:26:11 +09:00) https://github.com/whatwg/xhr/issues/196

[174] Editorial: rewrite send()'s body/content-type processing (domenic著, 2018-04-24 00:09:11 +09:00) https://github.com/whatwg/xhr/commit/f47bbab42dabe1f52e5e9f1ed1fa6df06a6eb310

[175] Meta: UTF-8 decode without BOM or fail is used (annevk著, 2018-04-25 18:33:26 +09:00) https://github.com/whatwg/encoding/commit/387b0c08430a27e99c036f64abac9b3dfb46dcd0

[176] Meta: TF-8 decode without BOM or fail is used at least once by annevk · Pull Request #124 · whatwg/encoding (2018-04-26 15:27:04 +09:00) https://github.com/whatwg/encoding/pull/124

[177] Do not use percent decode on strings by annevk · Pull Request #3111 · whatwg/html (2018-05-25 01:08:35 +09:00) https://github.com/whatwg/html/pull/3111

[178] Editorial: replace UTF-8 encode with isomorphic encode (annevk著, 2018-05-28 21:03:01 +09:00) https://github.com/whatwg/fetch/commit/ffbaefb5c4f68b9d619e9db6491fd665a30a2ffb

[179] Give clearer advice on hooks for standards (annevk著, 2018-04-25 21:22:45 +09:00) https://github.com/whatwg/encoding/commit/b579018b406d7752f8b7a3aa9c2bc800519c6f1a

[180] Do not use percent decode on strings (annevk著, 2017-10-10 21:54:23 +09:00) https://github.com/whatwg/html/commit/ce8404fa5d8c2c91725c5262fd69d0d45c227ec8

[181] Do not use percent decode on strings by annevk · Pull Request #3111 · whatwg/html (2019-02-25 17:54:38 +09:00) https://github.com/whatwg/html/pull/3111

[182] Limit Content-Type overrides to when charset isn't already UTF-8 (annevk著, 2018-10-31 00:23:33 +09:00) https://github.com/whatwg/xhr/commit/721f3c9f3d64aa1ae528efb78468f8c4c7213f91

[183] UTF-8のコードポイントはどうやって高速に数えるか - Qiita (2019-04-06 11:42:20 +09:00) https://qiita.com/saka1_p/items/ff49d981cfd56f3588cc

[184] Require UTF-8 for accept-charset (annevk著, 2018-11-23 22:52:22 +09:00) https://github.com/whatwg/html/commit/840e22fe5d9be9c3c8c712150c0b98c7a4c62933

[185] Consider restricting <form accept-charset> to utf-8 · Issue #3097 · whatwg/html (2019-06-20 12:36:39 +09:00) https://github.com/whatwg/html/issues/3097

[186] Require UTF-8 for accept-charset by annevk · Pull Request #4195 · whatwg/html (2019-06-20 12:39:27 +09:00) https://github.com/whatwg/html/pull/4195

[188] VRML97, ISO/IEC 14772-1:1997 -- 4 Concepts (2014-01-31 07:20:50 +09:00) https://www.web3d.org/documents/specifications/14772/V2.0/part1/concepts.html#4.2.2

The <encoding type> is either "utf8" or any other authorized values defined in other parts of ISO/IEC 14772. The identifier "utf8" indicates a clear text encoding that allows for international characters to be displayed in ISO/IEC 14772 using the UTF-8 encoding defined in ISO/IEC 10646-1 (otherwise known as Unicode); see 2.[UTF8].

[189] VRML97, ISO/IEC 14772-1:1997 -- 4 Concepts (2014-01-31 07:20:50 +09:00) https://www.web3d.org/documents/specifications/14772/V2.0/part1/concepts.html#4.3

[190] RFC 8030 - Generic Event Delivery Using HTTP Push (2020-03-09 00:13:33 +09:00) https://tools.ietf.org/html/rfc8030#section-5

Content-Type: text/plain;charset=utf8

[192] 662822 - Incomplete page load on abnamro.nl (2020-05-12 14:58:15 +09:00) https://bugs.chromium.org/p/chromium/issues/detail?id=662822

[193] Tcl Improvement Proposals: TIP 587: Default utf-8 for source command (2020-11-17T00:58:43.000Z) https://core.tcl-lang.org/tips/doc/trunk/tip/587.md

[194] Should Body.formData() always strip the BOM? · Issue #650 · whatwg/fetch (2021-03-06T02:54:59.000Z) https://github.com/whatwg/fetch/issues/650

[198] rfc3862 (2021-06-11T05:09:42.000Z) https://datatracker.ietf.org/doc/html/rfc3862#page-15

utf-8-unix

Specifications