The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.
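The "compatible subsets" remark can be checked mechanically. The following is a minimal sketch, assuming Python's built-in `iso-8859-1` and `iso-8859-2` codecs; the helper name `shared_bytes` is hypothetical and not part of the quoted source. It lists the byte values that decode to the same character under both encodings.

```python
def shared_bytes(enc_a: str = "iso-8859-1", enc_b: str = "iso-8859-2") -> list[int]:
    """Return the byte values that decode to the same character in both encodings."""
    return [
        b for b in range(256)
        if bytes([b]).decode(enc_a) == bytes([b]).decode(enc_b)
    ]


shared = shared_bytes()

# 0xE9 is "é" in both encodings, so text using it is safe either way.
print(0xE9 in shared)
# 0xF5 is "õ" in ISO Latin-1 but "ő" in ISO Latin-2, so it is not shared.
print(0xF5 in shared)
```

Text that stays within the shared byte values displays identically under either label, which is presumably why viewing such files as ISO Latin-2 works in most cases.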

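[46] For mixing of ISO-2022-JP and Shift JIS, see ISO-2022-JP.

[47] For mixing of ASCII and 7-bit codes, see font-dependent encoding.

[28] For cases that are clearly erroneous, or that are difficult to repair completely, see character encoding repair.

Artificial usage examples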

[3] 2025-11-08T13:15:53.800Z https://zsigri.tripod.com/fontboard/cjk/gbhzgbk.html

EUC-CN + HZ

[4] 2025-11-08T13:16:06.400Z https://zsigri.tripod.com/fontboard/cjk/jis.html

SJIS + Japanese EUC + JIS code

[5] 2025-11-08T13:16:25.800Z https://zsigri.tripod.com/fontboard/cjk/ksc.html

ISO-2022-KR + EUC-KR

[27] null, 2022-12-09T09:05:45.000Z, 2025-11-20T15:43:39.007Z https://www.regata.yar.ru/error/contact.html.var

Various encodings are mixed.

[32] Russification of Macintosh: Testing, 2025-11-26T05:28:26.000Z https://web.archive.org/web/20001120020200fw_/http://www.friends-partners.org/partners/rusmac/test.html

Various encodings are mixed.

[19] Slovni standardi, 2025-11-13T07:43:34.000Z, 1997-10-21T15:13:26.819Z https://web.archive.org/web/19971021150753/http://www.hina.hr/hina/standardi.html

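[20] >>19 Strictly speaking this is not "mixing", but the page includes a quotation from a netscape.ini of that time. Firefox detects the page as Windows-1252, and Chrome detects it as Shift_JIS. The author was presumably working in a Windows-1250 environment at the time.

Read as Windows-1252: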

    [INTL]
    Font0=iso-8859-1,HRHelvetica,10,Courier New,10
    Font1=x-sjis,‚l‚r –¾’©,10,‚l‚r –¾’©,10

Read as Shift_JIS:

    [INTL]
    Font0=iso-8859-1,HRHelvetica,10,Courier New,10
    Font1=x-sjis,ＭＳ 明朝,10,ＭＳ 明朝,10

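In other words, this is a per-encoding font setting, and the Japanese font name was simply written there as-is, even though it is unreadable outside a Japanese environment.

The author most likely copied and pasted it without being able to read it.

Everything other than this quoted part consists of ASCII characters.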
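The two readings of the font name in the Font1 line can be reproduced with a short sketch. The byte values below are written out by hand from the quoted text (an assumption, since the original file is not available here), and `read_as` is a hypothetical helper name.

```python
# Bytes of the font name in the Font1 line, reconstructed by hand
# from the Windows-1252 reading quoted above (assumed, not read from the file).
FONT_NAME_BYTES = bytes([0x82, 0x6C, 0x82, 0x72, 0x20, 0x96, 0xBE, 0x92, 0xA9])


def read_as(data: bytes, encoding: str) -> str:
    """Decode the same bytes under the given encoding label."""
    return data.decode(encoding)


as_cp1252 = read_as(FONT_NAME_BYTES, "cp1252")    # what Firefox's guess shows
as_sjis = read_as(FONT_NAME_BYTES, "shift_jis")   # what Chrome's guess shows

print(as_cp1252)
print(as_sjis)
```

The bytes are identical in both cases; only the label applied by the reader changes what is displayed.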

Implementation

[29] compact_enc_det/compact_enc_det/compact_enc_det.cc at master · google/compact_enc_det · GitHub, 2025-11-24T07:32:16.000Z https://github.com/google/compact_enc_det/blob/master/compact_enc_det/compact_enc_det.cc#L232

// We try here to avoid having title text dominate the encoding detection,
// for the not-infrequent error case of title in encoding1, body in encoding2:
// we want to bias toward encoding2 winning.

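[30] >>29 is a comment in ced, an implementation of character encoding detection. It says that cases where the title and the body use different encodings occur fairly often, and that the body is given priority.

[31] Indeed, one often sees pages where the UI part and the body part are in different encodings and come out garbled, apparently because data from a body-text database is not handled properly when it is inserted into HTML generated by the delivery system.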
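The title/body mismatch described in [30] and [31] can be illustrated with a rough sketch. This is not ced's algorithm: it assumes a hypothetical candidate list and uses strict decodability as a crude stand-in for real detection, and all function names are invented for illustration. It guesses the title and the rest of the document separately and lets the body's guess win.

```python
import re

# Hypothetical candidate list; order matters because the first strict match wins.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp"]

TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def first_decodable(data: bytes, candidates=CANDIDATES):
    """Return the first candidate that decodes data without error, or None."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


def title_and_body_encodings(html: bytes):
    """Guess encodings for the <title> content and for everything else."""
    m = TITLE_RE.search(html)
    if m is None:
        return None, first_decodable(html)
    body = html[:m.start()] + html[m.end():]
    return first_decodable(m.group(1)), first_decodable(body)


def pick_document_encoding(html: bytes):
    """Prefer the body's guess; fall back to the title's guess."""
    title_enc, body_enc = title_and_body_encodings(html)
    return body_enc or title_enc


html = (
    b"<html><head><title>" + "日本語".encode("utf-8") + b"</title></head><body>"
    + "本文です".encode("shift_jis") + b"</body></html>"
)
print(title_and_body_encodings(html))
print(pick_document_encoding(html))
```

A real detector scores byte statistics rather than relying on decode errors, so this only shows the shape of the bias toward the body, not a usable detector.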

Mixing of character encodings
