This document describes how to validate files with respect to their encoding in an implementation that decodes a file for the purpose of conformance checking.
An example of such an implementation is perl-web-encodings <https://github.com/manakai/perl-web-encodings>.
This document depends on the Infra Standard.
The terms byte, byte string, code point, and string are defined by the Infra Standard.
The terms encoding, encode, decode, decoder, fatal, error, GB18030, GBK, Big5, Shift_JIS, EUC-JP, ISO-2022-JP, ISO-2022-JP decoder output state, ASCII, and EUC-KR are defined by the Encoding Standard.
The term BOM is defined by the Unicode Standard.
The term encoding string, where encoding is an encoding, represents a byte string which is intended to be decoded by encoding's decoder.
Bytes for code point in encoding, where code point is a code point and encoding is an encoding, are the result of encoding code point in encoding.
A code point char is encodable in an encoding encoding if encoding the string char in encoding with error mode fatal would not result in an error.
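The definition of encodable can be sketched in Python as follows. This is only an illustration: the function name is_encodable is hypothetical, and Python's built-in codecs (for example "shift_jis") are assumed here as stand-ins even though their tables are not identical to the encodings of the Encoding Standard.

```python
def is_encodable(char: str, encoding: str) -> bool:
    """Return True if `char` can be encoded in `encoding` without error.

    Python's errors="strict" is used as an approximation of error mode fatal.
    """
    try:
        char.encode(encoding, errors="strict")
    except UnicodeEncodeError:
        return False
    return True
```

For example, is_encodable("あ", "shift_jis") is expected to be true, while a code point outside the codec's repertoire, such as U+1F600, is not encodable.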
An encoding string string is in error if decoding string with the encoding encoding and error mode fatal would result in an error.
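A minimal sketch of the "in error" test, under the same assumption that Python's strict decoding approximates the Encoding Standard's decoder with error mode fatal. The name is_in_error is hypothetical, and Python's codec tables may accept or reject slightly different byte sequences than the standard's decoders.

```python
def is_in_error(data: bytes, encoding: str) -> bool:
    """Return True if strictly decoding `data` as `encoding` fails."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return True
    return False
```

A conformance checker built this way would report any input for which is_in_error returns true as violating the MUST NOT requirement.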
An encoding string MUST NOT be in error.
A BOM SHOULD NOT be used.
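One possible way to detect a leading BOM is sketched below. The function name detect_bom and the returned labels are illustrative assumptions; only the UTF-8, UTF-16BE, and UTF-16LE byte order marks are considered.

```python
from typing import Optional

# Byte sequences of U+FEFF in the encodings considered by this sketch.
_BOMS = (
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
)


def detect_bom(data: bytes) -> Optional[str]:
    """Return a label for the BOM at the start of `data`, or None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

A checker could emit a warning rather than an error when detect_bom returns a label, since the requirement level is SHOULD NOT.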
A GB18030 or GBK string is discouraged from containing the byte 0x80 or the bytes 0xA3 0xA0 where they are decoded to the code point U+20AC or U+3000, respectively, in GB18030 or GBK.
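The following sketch scans a byte string for these two discouraged sequences. It is a simplified tokenizer, not the Encoding Standard's decoder: it assumes the input is already known not to be in error, and the function name find_discouraged_gb_bytes is hypothetical. Sequence boundaries are tracked so that 0x80 or 0xA3 0xA0 appearing inside another multi-byte sequence is not reported.

```python
from typing import List, Tuple


def find_discouraged_gb_bytes(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, bytes) pairs for a lone 0x80 and for 0xA3 0xA0."""
    found = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b == 0x80:
            # A single byte 0x80 is decoded to U+20AC.
            found.append((i, b"\x80"))
            i += 1
        elif 0x81 <= b <= 0xFE and i + 1 < n:
            nxt = data[i + 1]
            if 0x30 <= nxt <= 0x39:
                # Four-byte sequence (assumed well-formed): skip it whole.
                i += 4
            else:
                # Two-byte sequence; 0xA3 0xA0 is decoded to U+3000.
                if b == 0xA3 and nxt == 0xA0:
                    found.append((i, b"\xa3\xa0"))
                i += 2
        else:
            # ASCII byte, or a stray byte at the end of the input.
            i += 1
    return found
```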
A Big5 string is discouraged from containing bytes for a code point in Big5 that is not encodable in Big5.
A Big5 string is discouraged from containing bytes bytes that are decoded to a code point char in Big5 if the bytes for char in Big5 are not equal to bytes.
A Shift_JIS string is discouraged from containing bytes for a code point in Shift_JIS that is not encodable in Shift_JIS.
A Shift_JIS string is discouraged from containing bytes bytes that are decoded to a code point char in Shift_JIS if the bytes for char in Shift_JIS are not equal to bytes.
An EUC-JP string is discouraged from containing bytes for a code point in EUC-JP that is not encodable in EUC-JP.
An EUC-JP string is discouraged from containing bytes bytes that are decoded to a code point char in EUC-JP if the bytes for char in EUC-JP are not equal to bytes.
An ISO-2022-JP string is discouraged from containing bytes bytes that are decoded to a code point char in ISO-2022-JP if the bytes for char in ISO-2022-JP are not equal to bytes.
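The round-trip rules for Big5, Shift_JIS, EUC-JP, and ISO-2022-JP can be approximated by decoding a byte string and encoding the result again. The sketch below compares the whole string rather than each byte sequence individually, which is a simplification of the per-code-point rule; the name round_trips is hypothetical, and Python's codecs are assumed as stand-ins for the Encoding Standard's encodings.

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """Return True if decoding then re-encoding reproduces `data` exactly.

    `data` is assumed not to be in error. A code point that decodes but
    cannot be encoded again is also reported as not round-tripping.
    """
    text = data.decode(encoding, errors="strict")
    try:
        return text.encode(encoding, errors="strict") == data
    except UnicodeEncodeError:
        return False
```

For a stateful encoding such as ISO-2022-JP, this whole-string comparison also flags redundant escape sequences, for example an initial 0x1B 0x28 0x42 before ASCII text.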
An ISO-2022-JP string MUST NOT be a byte string for which the final value of the ISO-2022-JP decoder output state is not ASCII when decoded as ISO-2022-JP.
An ISO-2022-JP string MUST NOT contain bytes 0x1B 0x24 0x40.
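The two ISO-2022-JP requirements can be checked with a scan over escape sequences, as in the sketch below. It assumes the input is not in error, so that the byte 0x1B only ever starts one of the five escape sequences recognized by the decoder; the function names and the state labels used as return values are illustrative choices, not an API of any library.

```python
import re

# Escape sequences recognized by the ISO-2022-JP decoder, mapped to the
# state they select (labels chosen for this sketch).
_ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "Roman",
    b"\x1b(I": "katakana",
    b"\x1b$@": "lead byte",
    b"\x1b$B": "lead byte",
}
_ESC_RE = re.compile(rb"\x1b(?:\([BJI]|\$[@B])")


def final_output_state(data: bytes) -> str:
    """Return the state selected by the last escape sequence in `data`."""
    state = "ASCII"  # the decoder starts in the ASCII state
    for match in _ESC_RE.finditer(data):
        state = _ESCAPES[match.group()]
    return state


def contains_esc_dollar_at(data: bytes) -> bool:
    """Return True if `data` contains the bytes 0x1B 0x24 0x40."""
    return b"\x1b$@" in data
```

A string would violate the requirements if final_output_state returns anything other than "ASCII" or if contains_esc_dollar_at returns true.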