原典

[32] 当時は正常に表示されていたはずなのに現在 >>31 は文字化けしています。なんという皮肉でしょうか。現在 Content-Type: が text/html; charset=UTF-8 とされていますが、実際には windows-1252 で符号化されているようです。 2025-05-17T08:52:58.800Z

実装

[17] 本家 Mozilla の C++ の実装の他、各言語への移植版が存在しています。

[33] universalchardet 本家
- [43] gecko-dev/extensions/universalchardet at 67d9eda26a48c9024492de5931478fe14daa2cd1 · mozilla/gecko-dev · GitHub, 2025-05-17T09:24:08.000Z https://github.com/mozilla/gecko-dev/tree/67d9eda26a48c9024492de5931478fe14daa2cd1/extensions/universalchardet
- [79] #115114 autodetect universal detects french as Central European (ISO-… · mozilla/gecko-dev@3f7e5bd · GitHub, 2025-05-18T05:11:26.000Z https://github.com/mozilla/gecko-dev/commit/3f7e5bdeed3308eed1bb85a2f417e86ad8bab417
- [84] 183354 patch by alexey@optus.net (Alexey Chernyak) r=ftang sr=blizzar… · mozilla/gecko-dev@68332f7 · GitHub, 2025-10-18T09:51:28.000Z https://github.com/mozilla/gecko-dev/commit/68332f717f14e8f2467ca4f2c521ed8fe6eff71d
- [45] 平成23(2011)年に UTF-32BE, UTF-32LE, X-ISO-10646-UCS-4-3412, X-ISO-10646-UCS-4-2143 が削除されるなど、 HTML Standard / Encoding Standard 方面の動きと連動した機能削減が行われている。
- [21] mozilla-central: files (2012-05-26 11:36:03 +09:00 版) http://hg.mozilla.org/mozilla-central/file/tip/extensions/universalchardet/
  - [44] 現在では新実装 (>>24) への置き換えのため中身がなくなっている 2025-05-17T09:24:56.800Z
- [2] ~~seamonkey/extensions/universalchardet/src/base/ (2007-04-24 00:11:24 +09:00 版)http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/~~
- [37] >>36 の置き換え
[3] Universal Encoding Detector: character encoding auto-detection in Python (2007-11-18 20:19:09 +09:00 版) http://chardet.feedparser.org/
[4] Universalchardet - やる気向上作戦 (2009-02-11 14:40:44 +09:00 版) http://www.void.in/wiki/Universalchardet
- [52] 消滅確認 2025-05-18T03:27:01.600Z
- [53] やる気向上作戦: universalchardet, 2025-05-18T03:25:57.000Z, 2006-05-05T13:36:00.469Z https://web.archive.org/web/20060505133528/http://www.void.in/wiki/universalchardet
- [54] >>33 の派生
[5] RubyForge: Universal Encoding Detector: Project Info (Bruce Williams (http://codefluency.com), for Ruby Central (http://rubycentral.org) 著, 2009-03-15 11:44:11 +09:00 版) http://rubyforge.org/projects/chardet
- [55] 消滅確認 2025-05-18T03:36:03.200Z
- [56] RubyForge: Universal Encoding Detector: Project Filelist, Bruce Williams (http://codefluency.com), for Ruby Central (http://rubycentral.org), 2025-05-18T03:34:46.000Z, 2007-12-24T18:21:12.435Z https://web.archive.org/web/20071224181754/http://rubyforge.org/frs/?group_id=1506&release_id=4726
[7] 自娱自乐的Emacser: 查看 universalchardet 高频字表的 perl 程序 (2009-02-25 12:16:07 +09:00 版) http://wenbinhome.blogspot.com/2006/12/universalchardet-perl.html
[11] juniversalchardet - Java port of universalchardet - Google Project Hosting ( (2012-05-26 11:25:24 +09:00 版)) http://code.google.com/p/juniversalchardet/
- [38] Google Code Archive - Long-term storage for Google Code Project Hosting., 2025-05-17T08:59:51.000Z https://code.google.com/archive/p/juniversalchardet/
- [39] >>4 の派生
- [50] GitHub - seratch/juniversalchardet: Automatically exported from code.google.com/p/juniversalchardet, 2025-05-18T03:23:09.000Z https://github.com/seratch/juniversalchardet/tree/master
  - [51] >>11 そのまま
- [48] GitHub - albfernandez/juniversalchardet: Originally exported from code.google.com/p/juniversalchardet, 2025-05-18T03:13:10.000Z https://github.com/albfernandez/juniversalchardet
  - [49] >>11 の派生版
  - [65] Support for US-ASCII · albfernandez/juniversalchardet@fcf1898 · GitHub, 2025-05-18T04:29:38.000Z https://github.com/albfernandez/juniversalchardet/commit/fcf1898ab7bae3f7308d991e225c3016e460b352
[13] uchardet - universalchardet - Google Project Hosting ( (2012-05-26 11:26:23 +09:00 版)) http://code.google.com/p/uchardet/
- [57] GitHub - BYVoid/uchardet: An encoding detector library ported from Mozilla, 2025-05-18T04:17:34.000Z https://github.com/BYVoid/uchardet
  - [58] >>13 から移転
  - [86] Single Byte charsets: high ctrl character ratio lowers confidence. · BYVoid/uchardet@55b4f23 · GitHub, 2025-10-19T13:38:53.000Z https://github.com/BYVoid/uchardet/commit/55b4f23971db61
- [59] uchardet, 2022-12-08T21:31:14.000Z, 2025-05-18T04:18:19.511Z https://www.freedesktop.org/wiki/Software/uchardet/
  - [60] >>57 から移転
  - [62] uchardet / uchardet · GitLab, 2025-05-18T04:21:17.000Z https://gitlab.freedesktop.org/uchardet/uchardet
- [61] 独自の改良多数
- [66] GitHub - PyYoshi/uchardet: uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible., 2025-05-18T04:31:18.000Z https://github.com/PyYoshi/uchardet
  - [67] >>59 旧所在からの派生
[46] GitHub - errepi/ude: A C# port of Mozilla Universal Charset Detector., 2025-05-18T03:04:53.000Z https://github.com/errepi/ude
- [47] >>11 の移植
[82] GitHub - CharsetDetector/UTF-unknown: Character set detector build in C# - .NET 5+, .NET Core 2+, .NET standard 1+ & .NET 4+, 2025-05-19T12:49:25.000Z https://github.com/CharsetDetector/UTF-unknown
- [83] >>46, >>59 からの派生
[14] nuniversalchardet - C# Port of UniversalCharDet - Google Project Hosting ( (2012-05-26 11:26:34 +09:00 版)) http://code.google.com/p/nuniversalchardet/
[22] perl-web-encodings/lib/Web/Encoding/UnivCharDet.pod at master · manakai/perl-web-encodings · GitHub (2013-05-03 09:26:53 +09:00 版) https://github.com/manakai/perl-web-encodings/blob/master/lib/Web/Encoding/UnivCharDet.pod
- [40] >>33 の移植
[12] GitHub - dcramer/chardet: Forked version of chardet, 2025-05-17T04:07:06.000Z https://github.com/dcramer/chardet
[23] https://github.com/sigmavirus24/charade
- [15] GitHub - sv24-archive/charade: NO LONGER MAINTAINED. USE chardet/chardet. Fork of chardet to support Python 2 and 3 in one code base., 2025-05-17T04:07:18.000Z https://github.com/sv24-archive/charade
[16] GitHub - chardet/chardet: Python character encoding detector, 2025-05-17T04:07:28.000Z https://github.com/chardet/chardet
- [85] Support CP949, fixes #10 by puzzlet · Pull Request #13 · sv24-archive/charade · GitHub, 2025-10-19T10:20:29.000Z https://github.com/sv24-archive/charade/pull/13/files
- [76] Added a prober for MacRoman encoding. · chardet/chardet@c292b52 · GitHub, 2025-05-18T05:07:29.000Z https://github.com/chardet/chardet/commit/c292b52a97e57c95429ef559af36845019b88b33
- [77] adding capital letter sharp S and ISO-8859-15 (#222) · chardet/chardet@bba08ba · GitHub, 2025-05-18T05:08:03.000Z https://github.com/chardet/chardet/commit/bba08ba5d656e38282a365af44e477b6f5ee74c7
- [78] add prober for Johab Korean (#207) · chardet/chardet@6282d49 · GitHub, 2025-05-18T05:09:14.000Z https://github.com/chardet/chardet/commit/6282d499bf6400ba1845119265a3d5e57c8b1fa1
- [26] Detection of windows-1250 · Issue #87 · chardet/chardet, 2025-05-17T04:15:34.000Z https://github.com/chardet/chardet/issues/87
- [80] 1 sentence utf-8 detected as Windows-1252 · Issue #185 · chardet/chardet, 2025-05-18T05:12:12.000Z https://github.com/chardet/chardet/issues/185
- [27] [WIP] Retrain SBCS Models and some refactoring by dan-blanchard · Pull Request #99 · chardet/chardet · GitHub, 2025-05-17T04:16:05.000Z https://github.com/chardet/chardet/pull/99
  - [81] >>57 の影響
[8] GitHub - aadsm/jschardet: Character encoding auto-detection in JavaScript (port of python's chardet), 2025-05-16T10:08:45.000Z https://github.com/aadsm/jschardet
[63] CRAN: Package uchardet, 2025-05-18T04:02:20.000Z, 2025-05-18T04:28:17.163Z https://cran.r-project.org/web/packages/uchardet/index.html
[70] Google Code Archive - Long-term storage for Google Code Project Hosting., 2025-05-18T04:55:39.000Z https://code.google.com/archive/p/nuniversalchardet/
[72] GitHub - thuleqaid/rust-chardet: rust version of chardet, 2025-05-18T04:59:30.000Z https://github.com/thuleqaid/rust-chardet
[75] Encode::Detect::Detector - Detects the encoding of data - metacpan.org, 2025-05-18T05:05:23.000Z https://metacpan.org/release/JGMYERS/Encode-Detect-1.01/view/Detector.pm
[73] GitHub - Joungkyun/libchardet: libchardet - Mozilla's Universal Charset Detector C/C++ API, 2025-05-18T05:02:38.000Z https://github.com/Joungkyun/libchardet
- [74] >>33, >>75, >>57 の派生

[24] GitHub - hsivonen/chardetng: A character encoding detector for legacy Web content., 2025-05-17T04:08:29.000Z https://github.com/hsivonen/chardetng
- [64] >>33 の置き換え

[20] 次に示すのは独立した実装ではなく、他の実装のラッパーとして機能するものです。

標準化

[10] UNIVCHARDET 自体は1実装に過ぎず、何らかの標準でも、標準によって義務付けられた実装でもありませんが、 Web Applications 1.0 は文字符号化の決定算法の中で出現頻度分析に基づく推定の利用を認めており、その具体例として >>1 を挙げています。

判定算法で失敗する例

[9] HTML documents misinterpreted by charset sniffer を参照してください。

メモ

[25] >>24 は諸々改善されていることがドキュメントにあるが、近年の Webブラウザーの、旧来文字コードを切り捨てていく方向に則ってるのが不安要素だなあ。

[28] >>27 は1バイト符号化の判定に大きな変更を加えている (従来未対応の文字コードへの対応を含む。) が、マージされずに長年放置されている。

[87] ISO-8859-2 / Windows-1250 の判定問題 >>79 >>26 文字コードの判定

A composite approach to language/encoding detection

A composite approach to language/encoding detection

原典

実装

標準化

判定算法で失敗する例

メモ