[18] [DFN[universalchardet]] は、[[文字]]の出現頻度の統計データを元に[[バイト列]]の[[文字コード]]を[[判定][文字コード判定]]する手法とその実装です。

* 原典

[REFS[
- [1]
[DFN[[CITE[A composite approach to language/encoding detection]]]] ([[Shanjian Li]] 著, [TIME[2007-09-08 17:38:48 +09:00]] 版) <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>
-- [29] 
移転確認 [TIME[2019-11-11T09:04:10.300Z]]
-- [30] 
[CITE[A composite approach to language/encoding detection]], [[Shanjian Li]], [TIME[2019-11-11 18:03:59 +09:00]] <https://www-archive.mozilla.org/projects/intl/universalcharsetdetection>
--- [31] 
[CITE[A composite approach to language/encoding detection]], [[Shanjian Li]], [TIME[2025-05-17T08:51:04.000Z]] <https://www-archive.mozilla.org/projects/intl/universalcharsetdetection>
- [41] 
[CITE[How to build standalone universal charset detector from Mozilla source]], [[Frank Yung-Fong Tang]], [TIME[2025-05-17T09:02:46.000Z]] <https://www-archive.mozilla.org/projects/intl/detectorsrc>


]REFS]


[32] 
当時は正常に表示されていたはずなのに現在 >>31 は[[文字化け]]しています。なんという皮肉でしょうか。
現在 [CODE[Content-Type:]] が [CODE[text/html; charset=UTF-8]]
とされていますが、実際には [CODE[windows-1252]] で[[符号化]]されているようです。
[TIME[2025-05-17T08:52:58.800Z]]


* 実装

[17] 本家 [[Mozilla]] の [[C++]] の実装の他、各言語への移植版が存在しています。

[REFS[

- [36] chardet 本家
-- [42] 
[CITE@en[gecko-dev/intl/chardet at 24978cfe7cef443392a56e0958a9d63a6cb93219 · mozilla/gecko-dev · GitHub]], [TIME[2025-05-17T09:11:30.000Z]] <https://github.com/mozilla/gecko-dev/tree/24978cfe7cef443392a56e0958a9d63a6cb93219/intl/chardet>
- [34] [CITE[Java port of Mozilla's Automatic Charset Detection algorithm.]], [TIME[2010-11-24T20:42:37.000Z]], [TIME[2025-05-17T08:55:03.700Z]] <https://jchardet.sourceforge.net/>
-- [35] >>36 の移植
]REFS]

[REFS[
- [33] universalchardet 本家
-- [43] 
[CITE@en[gecko-dev/extensions/universalchardet at 67d9eda26a48c9024492de5931478fe14daa2cd1 · mozilla/gecko-dev · GitHub]], [TIME[2025-05-17T09:24:08.000Z]] <https://github.com/mozilla/gecko-dev/tree/67d9eda26a48c9024492de5931478fe14daa2cd1/extensions/universalchardet>
-- [79] 
[CITE@en[#115114 autodetect universal detects french as Central European (ISO-… · mozilla/gecko-dev@3f7e5bd · GitHub]], [TIME[2025-05-18T05:11:26.000Z]] <https://github.com/mozilla/gecko-dev/commit/3f7e5bdeed3308eed1bb85a2f417e86ad8bab417>
-- [84] 
[CITE@en[183354 patch by alexey@optus.net (Alexey Chernyak) r=ftang sr=blizzar… · mozilla/gecko-dev@68332f7 · GitHub]], [TIME[2025-10-18T09:51:28.000Z]] <https://github.com/mozilla/gecko-dev/commit/68332f717f14e8f2467ca4f2c521ed8fe6eff71d>
-- [45] 
[TIME[平成23(2011)年][2011]]に
[CODE[UTF-32BE]],
[CODE[UTF-32LE]],
[CODE[X-ISO-10646-UCS-4-3412]],
[CODE[X-ISO-10646-UCS-4-2143]]
が削除されるなど、
[CITE[HTML Standard]] / [CITE[Encoding Standard]]
方面の動きと連動した機能削減が行われている。
-- [21] [CITE@en-US[mozilla-central: files]] ([TIME[2012-05-26 11:36:03 +09:00]] 版) <http://hg.mozilla.org/mozilla-central/file/tip/extensions/universalchardet/>
--- [44] 現在では新実装 (>>24) への置き換えのため中身がなくなっている
[TIME[2025-05-17T09:24:56.800Z]]
-- [2] [DEL[[CITE[seamonkey/extensions/universalchardet/src/base/]] ([TIME[2007-04-24 00:11:24 +09:00]] 版)<http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>]]
-- [37] >>36 の置き換え
-[3]
[CITE@en[Universal Encoding Detector: character encoding auto-detection in Python]] ([TIME[2007-11-18 20:19:09 +09:00]] 版) <http://chardet.feedparser.org/>
-[4]
[CITE@ja[Universalchardet - やる気向上作戦]] ([TIME[2009-02-11 14:40:44 +09:00]] 版) <http://www.void.in/wiki/Universalchardet>
-- [52] 消滅確認 [TIME[2025-05-18T03:27:01.600Z]]
-- [53] 
[CITE[やる気向上作戦: universalchardet]], [TIME[2025-05-18T03:25:57.000Z]], [TIME[2006-05-05T13:36:00.469Z]] <https://web.archive.org/web/20060505133528/http://www.void.in/wiki/universalchardet>
-- [54] >>33 の派生
-[5]
[CITE@ja[RubyForge: Universal Encoding Detector: Project Info]] ([[Bruce Williams (http://codefluency.com), for Ruby Central (http://rubycentral.org)]] 著, [TIME[2009-03-15 11:44:11 +09:00]] 版) <http://rubyforge.org/projects/chardet>
-- [55] 消滅確認 [TIME[2025-05-18T03:36:03.200Z]]
-- [56] 
[CITE@en[RubyForge: Universal Encoding Detector: Project Filelist]], [[Bruce Williams (http://codefluency.com), for Ruby Central (http://rubycentral.org)]], [TIME[2025-05-18T03:34:46.000Z]], [TIME[2007-12-24T18:21:12.435Z]] <https://web.archive.org/web/20071224181754/http://rubyforge.org/frs/?group_id=1506&release_id=4726>
-[7]
[CITE[自娱自乐的Emacser: 查看 universalchardet 高频字表的 perl 程序]] ([TIME[2009-02-25 12:16:07 +09:00]] 版) <http://wenbinhome.blogspot.com/2006/12/universalchardet-perl.html>
-[11] [CITE[juniversalchardet - Java port of universalchardet - Google Project Hosting]]
( ([TIME[2012-05-26 11:25:24 +09:00]] 版))
<http://code.google.com/p/juniversalchardet/>
-- [38] [CITE@en[Google Code Archive - Long-term storage for Google Code Project Hosting.]], [TIME[2025-05-17T08:59:51.000Z]] <https://code.google.com/archive/p/juniversalchardet/>
-- [39] >>4 の派生
-- [50] 
[CITE@en[GitHub - seratch/juniversalchardet: Automatically exported from code.google.com/p/juniversalchardet]], [TIME[2025-05-18T03:23:09.000Z]] <https://github.com/seratch/juniversalchardet/tree/master>
--- [51] >>11 そのまま
-- [48] 
[CITE@en[GitHub - albfernandez/juniversalchardet: Originally exported from code.google.com/p/juniversalchardet]], [TIME[2025-05-18T03:13:10.000Z]] <https://github.com/albfernandez/juniversalchardet>
--- [49] >>11 の派生版
--- [65] 
[CITE@en[Support for US-ASCII · albfernandez/juniversalchardet@fcf1898 · GitHub]], [TIME[2025-05-18T04:29:38.000Z]] <https://github.com/albfernandez/juniversalchardet/commit/fcf1898ab7bae3f7308d991e225c3016e460b352>
-[13] [CITE[uchardet - universalchardet - Google Project Hosting]]
( ([TIME[2012-05-26 11:26:23 +09:00]] 版))
<http://code.google.com/p/uchardet/>
-- [57] [CITE@en[GitHub - BYVoid/uchardet: An encoding detector library ported from Mozilla]], [TIME[2025-05-18T04:17:34.000Z]] <https://github.com/BYVoid/uchardet>
--- [58] >>13 から移転
--- [86] 
[CITE@en[Single Byte charsets: high ctrl character ratio lowers confidence. · BYVoid/uchardet@55b4f23 · GitHub]], [TIME[2025-10-19T13:38:53.000Z]] <https://github.com/BYVoid/uchardet/commit/55b4f23971db61>
-- [59] [CITE[uchardet]], [TIME[2022-12-08T21:31:14.000Z]], [TIME[2025-05-18T04:18:19.511Z]] <https://www.freedesktop.org/wiki/Software/uchardet/>
--- [60] >>57 から移転
--- [62] 
[CITE@ja[uchardet / uchardet · GitLab]], [TIME[2025-05-18T04:21:17.000Z]] <https://gitlab.freedesktop.org/uchardet/uchardet>
-- [61] 独自の改良多数
-- [66] 
[CITE@en[GitHub - PyYoshi/uchardet: uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible.]], [TIME[2025-05-18T04:31:18.000Z]] <https://github.com/PyYoshi/uchardet>
--- [67] >>59 旧所在からの派生
- [46] 
[CITE@en[GitHub - errepi/ude: A C# port of Mozilla Universal Charset Detector.]], [TIME[2025-05-18T03:04:53.000Z]] <https://github.com/errepi/ude>
-- [47] >>11 の移植
- [82] [CITE@en[GitHub - CharsetDetector/UTF-unknown: Character set detector build in C# - .NET 5+, .NET Core 2+, .NET standard 1+ & .NET 4+]], [TIME[2025-05-19T12:49:25.000Z]] <https://github.com/CharsetDetector/UTF-unknown>
-- [83] >>46, >>59 からの派生
-[14] [CITE[nuniversalchardet - C# Port of UniversalCharDet - Google Project Hosting]]
( ([TIME[2012-05-26 11:26:34 +09:00]] 版))
<http://code.google.com/p/nuniversalchardet/>
- [22] [CITE[perl-web-encodings/lib/Web/Encoding/UnivCharDet.pod at master · manakai/perl-web-encodings · GitHub]] ([TIME[2013-05-03 09:26:53 +09:00]] 版) <https://github.com/manakai/perl-web-encodings/blob/master/lib/Web/Encoding/UnivCharDet.pod>
-- [40] >>33 の移植
- [12] 
[CITE@en[GitHub - dcramer/chardet: Forked version of chardet]], [TIME[2025-05-17T04:07:06.000Z]] <https://github.com/dcramer/chardet>
- [23] <https://github.com/sigmavirus24/charade>
-- [15] 
[CITE@en[GitHub - sv24-archive/charade: NO LONGER MAINTAINED. USE chardet/chardet. Fork of chardet to support Python 2 and 3 in one code base.]], [TIME[2025-05-17T04:07:18.000Z]] <https://github.com/sv24-archive/charade>
- [16] 
[CITE@en[GitHub - chardet/chardet: Python character encoding detector]], [TIME[2025-05-17T04:07:28.000Z]] <https://github.com/chardet/chardet>
-- [85] 
[CITE@en[Support CP949, fixes #10 by puzzlet · Pull Request #13 · sv24-archive/charade · GitHub]], [TIME[2025-10-19T10:20:29.000Z]] <https://github.com/sv24-archive/charade/pull/13/files>
-- [76] 
[CITE@en[Added a prober for MacRoman encoding. · chardet/chardet@c292b52 · GitHub]], [TIME[2025-05-18T05:07:29.000Z]] <https://github.com/chardet/chardet/commit/c292b52a97e57c95429ef559af36845019b88b33>
-- [77] 
[CITE@en[adding capital letter sharp S and ISO-8859-15 (#222) · chardet/chardet@bba08ba · GitHub]], [TIME[2025-05-18T05:08:03.000Z]] <https://github.com/chardet/chardet/commit/bba08ba5d656e38282a365af44e477b6f5ee74c7>
-- [78] 
[CITE@en[add prober for Johab Korean (#207) · chardet/chardet@6282d49 · GitHub]], [TIME[2025-05-18T05:09:14.000Z]] <https://github.com/chardet/chardet/commit/6282d499bf6400ba1845119265a3d5e57c8b1fa1>
-- [26] 
[CITE@en[Detection of windows-1250 · Issue #87 · chardet/chardet]], [TIME[2025-05-17T04:15:34.000Z]] <https://github.com/chardet/chardet/issues/87>
-- [80] 
[CITE@en[1 sentence utf-8 detected as Windows-1252 · Issue #185 · chardet/chardet]], [TIME[2025-05-18T05:12:12.000Z]] <https://github.com/chardet/chardet/issues/185>
-- [27] 
[CITE@en['''['''WIP''']''' Retrain SBCS Models and some refactoring by dan-blanchard · Pull Request #99 · chardet/chardet · GitHub]], [TIME[2025-05-17T04:16:05.000Z]] <https://github.com/chardet/chardet/pull/99>
--- [81] >>57 の影響
- [8] 
[CITE@en[GitHub - aadsm/jschardet: Character encoding auto-detection in JavaScript (port of python's chardet)]], [TIME[2025-05-16T10:08:45.000Z]] <https://github.com/aadsm/jschardet>
- [63] 
[CITE[CRAN: Package uchardet]], [TIME[2025-05-18T04:02:20.000Z]], [TIME[2025-05-18T04:28:17.163Z]] <https://cran.r-project.org/web/packages/uchardet/index.html>
- [70] 
[CITE@en[Google Code Archive - Long-term storage for Google Code Project Hosting.]], [TIME[2025-05-18T04:55:39.000Z]] <https://code.google.com/archive/p/nuniversalchardet/>
- [72] 
[CITE@en[GitHub - thuleqaid/rust-chardet: rust version of chardet]], [TIME[2025-05-18T04:59:30.000Z]] <https://github.com/thuleqaid/rust-chardet>
-
[75] 
[CITE@en-US[Encode::Detect::Detector - Detects the encoding of data - metacpan.org]], [TIME[2025-05-18T05:05:23.000Z]] <https://metacpan.org/release/JGMYERS/Encode-Detect-1.01/view/Detector.pm>
-
[73] 
[CITE@en[GitHub - Joungkyun/libchardet: libchardet - Mozilla's Universal Charset Detector C/C++ API]], [TIME[2025-05-18T05:02:38.000Z]] <https://github.com/Joungkyun/libchardet>
-- [74] >>33, >>75, >>57 の派生
]REFS]

[REFS[
- [24] 
[CITE@en[GitHub - hsivonen/chardetng: A character encoding detector for legacy Web content.]], [TIME[2025-05-17T04:08:29.000Z]] <https://github.com/hsivonen/chardetng>
-- [64] >>33 の置き換え

]REFS]

[20] 次に示すのは独立した実装ではなく、他の実装のラッパーとして機能するものです。

[REFS[
-[6]
[CITE@en[Whatpm — Perl Modules for Web Hypertext Application Technologies (beta)]] ([TIME[2009-01-12 11:36:16 +09:00]] 版) <http://suika.fam.cx/www/markup/html/whatpm/readme#whatpm-charset-universalchardet>
- [19] [CITE[xyzzy.lisp.universalchardet/site-lisp/uchardet at master · southly/xyzzy.lisp.universalchardet · GitHub]] ([TIME[2012-05-26 11:32:30 +09:00]] 版) <https://github.com/southly/xyzzy.lisp.universalchardet/tree/master/site-lisp/uchardet>
- [68] [CITE@en[GitHub - PyYoshi/cChardet: universal character encoding detector]], [TIME[2025-05-18T04:33:26.000Z]] <https://github.com/PyYoshi/cChardet>
-- [69] >>66 用
-
[71] 
[CITE@en[GitHub - emk/rust-uchardet: Rust wrapper for the uchardet character encoding detection library]], [TIME[2025-05-18T04:58:52.000Z]] <https://github.com/emk/rust-uchardet>


]REFS]

* 標準化

[10] UNIVCHARDET 自体は1実装に過ぎず、何らかの標準でも、標準によって義務付けられた実装でもありませんが、
[[Web Applications 1.0]] は[[文字符号化]]の決定算法の中で出現頻度分析に基づく推定の利用を認めており、
その具体例として >>1 を挙げています。

* 判定算法で失敗する例

[9] 
[[HTML documents misinterpreted by charset sniffer]] を参照してください。


* メモ


[25] 
>>24 は諸々改善されていることがドキュメントにあるが、近年の [[Webブラウザー]]の、
旧来[[文字コード]]を切り捨てていく方向に則ってるのが不安要素だなあ。


[28] 
>>27 は1バイト符号化の判定に大きな変更を加えている (従来未対応の文字コードへの対応を含む。)
が、マージされずに長年放置されている。



[87] 
[[ISO-8859-2]] / [[Windows-1250]] の判定問題
>>79 >>26
[SEE[ [[文字コードの判定]] ]]