Encoding Validation

Introduction

This document describes how to validate a file with respect to its encoding in an implementation that decodes the file for the purpose of conformance checking.

An example of such an implementation is perl-web-encodings <https://github.com/manakai/perl-web-encodings>.

Terminology

This document depends on the Infra Standard.

The terms byte, byte string, code point, and string are defined by the Infra Standard.

The terms encoding, encode, decode, decoder, fatal, error, GB18030, GBK, Big5, Shift_JIS, EUC-JP, ISO-2022-JP, ISO-2022-JP decoder output state, ASCII, and EUC-KR are defined by the Encoding Standard.

The term BOM is defined by the Unicode Standard.

The term encoding string, where encoding is an encoding, represents a byte string which is intended to be decoded by encoding's decoder.

Bytes for code point in encoding, where code point is a code point and encoding is an encoding, are the result of encoding code point in encoding.

A code point char is encodable in encoding encoding if encoding the string consisting only of char in encoding with error mode fatal would not result in error.

An encoding string string is in error if decoding string with encoding encoding and error mode fatal would result in error.
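The two definitions above can be approximated with a short Python sketch. The function names `is_in_error` and `is_encodable` are hypothetical, and Python's "strict" error handler is used here as a stand-in for error mode fatal. Python's built-in codecs are assumed to be close to, but are not identical to, the encodings defined by the Encoding Standard, so results can differ for some byte sequences.

```python
def is_in_error(data: bytes, encoding: str) -> bool:
    """Return True if decoding data with a strict (fatal-like) handler fails."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


def is_encodable(char: str, encoding: str) -> bool:
    """Return True if the single code point char can be encoded strictly."""
    try:
        char.encode(encoding, errors="strict")
    except UnicodeEncodeError:
        return False
    return True
```

A conformance checker would need decoders and encoders that follow the Encoding Standard exactly; this sketch only illustrates the shape of the two tests.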

General rules

An encoding string MUST NOT be in error.

A BOM SHOULD NOT be used.

A BOM is not roundtrippable, and it causes any encoding metadata to be ignored.
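A checker might flag a leading BOM along the following lines. This is a minimal sketch: the name `detect_bom` is hypothetical, and the set of signatures tested (UTF-8, UTF-16BE, UTF-16LE) is an assumption based on the BOMs that the Encoding Standard's BOM sniffing recognizes.

```python
import codecs
from typing import Optional

# Byte signatures to look for at the start of the file.
BOMS = (
    ("UTF-8", codecs.BOM_UTF8),         # EF BB BF
    ("UTF-16BE", codecs.BOM_UTF16_BE),  # FE FF
    ("UTF-16LE", codecs.BOM_UTF16_LE),  # FF FE
)


def detect_bom(data: bytes) -> Optional[str]:
    """Return the name of the BOM at the start of data, or None."""
    for name, bom in BOMS:
        if data.startswith(bom):
            return name
    return None
```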

GB18030

A GB18030 or GBK string is discouraged from containing the byte 0x80 or the bytes 0xA3 0xA0, which are decoded to code points U+20AC and U+3000, respectively, in GB18030 or GBK.
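A plain substring search is not enough for this check, because 0x80 is also a valid trail byte of a two-byte sequence. The sketch below walks the byte string one sequence at a time. The function name `find_discouraged_gb18030` is hypothetical, and the code assumes that the input is not in error, so it does not validate lead or trail bytes.

```python
from typing import List, Tuple


def find_discouraged_gb18030(data: bytes) -> List[Tuple[int, str]]:
    """Return (offset, description) pairs for discouraged byte sequences."""
    findings = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII byte.
            i += 1
        elif b == 0x80:
            # Single byte 0x80 in lead position (decoded to U+20AC).
            findings.append((i, "0x80"))
            i += 1
        elif i + 1 < n and 0x30 <= data[i + 1] <= 0x39:
            # Second byte 0x30-0x39 marks a four-byte sequence.
            i += 4
        else:
            # Two-byte sequence.
            if data[i:i + 2] == b"\xa3\xa0":
                findings.append((i, "0xA3 0xA0"))
            i += 2
    return findings
```

For example, `find_discouraged_gb18030(b"\x81\x80")` reports nothing, because 0x80 is a trail byte there rather than a standalone byte.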

Big5

A Big5 string is discouraged from containing bytes that are decoded in Big5 to a code point which is not encodable in Big5.

In other words, use of HKSCS extensions is discouraged, as they are not roundtrippable.

A Big5 string is discouraged from containing bytes bytes that are decoded to code point char in Big5 if the bytes for char in Big5 are not equal to bytes.

In other words, when there are multiple byte representations for a code point, non-canonical representations are discouraged, as they are not roundtrippable.
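Both Big5 rules above amount to a roundtrip test: decode, encode again, and compare with the original bytes. The following is a minimal whole-string sketch with a hypothetical function name, `roundtrips`. It uses Python's built-in codecs, which are only assumed to approximate the Encoding Standard's definitions (Python's `big5` codec, for instance, does not cover the same repertoire), so it demonstrates the technique rather than the exact conformance result.

```python
def roundtrips(data: bytes, encoding: str) -> bool:
    """Return True if decoding then re-encoding data reproduces it exactly."""
    try:
        text = data.decode(encoding, errors="strict")
        return text.encode(encoding, errors="strict") == data
    except UnicodeError:
        # Either the bytes are in error or a decoded code point
        # is not encodable; in both cases the string does not roundtrip.
        return False
```

The same function applies to the analogous rules for Shift_JIS, EUC-JP, and ISO-2022-JP. A real checker would compare per character, so that it can report the offset of each offending byte sequence.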

Shift_JIS

A Shift_JIS string is discouraged from containing bytes that are decoded in Shift_JIS to a code point which is not encodable in Shift_JIS.

In other words, use of EUDCs is discouraged, as they are not roundtrippable and in fact not interoperable at all.

A Shift_JIS string is discouraged from containing bytes bytes that are decoded to code point char in Shift_JIS if the bytes for char in Shift_JIS are not equal to bytes.
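EUDC byte pairs can be recognized from the bytes alone. The sketch below follows one reading of the Shift_JIS decoder in the Encoding Standard, in which pointers 8836 to 10715 are mapped to private-use code points starting at U+E000; treat the constants as an assumption to be checked against the standard. The function name `shift_jis_eudc_code_point` is hypothetical.

```python
from typing import Optional


def shift_jis_eudc_code_point(lead: int, trail: int) -> Optional[int]:
    """Return the private-use code point for an EUDC byte pair, else None."""
    # Lead bytes are 0x81-0x9F and 0xE0-0xFC.
    if not (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC):
        return None
    # Trail bytes are 0x40-0x7E and 0x80-0xFC.
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        return None
    offset = 0x40 if trail < 0x7F else 0x41
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    pointer = (lead - lead_offset) * 188 + trail - offset
    # Pointers in this range are end-user-defined characters.
    if 8836 <= pointer <= 10715:
        return 0xE000 - 8836 + pointer
    return None
```

Under these assumptions the EUDC range corresponds to lead bytes 0xF0 to 0xF9, so 0xF0 0x40 maps to U+E000 while an ordinary pair such as 0x82 0xA0 yields None.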

In other words, when there are multiple byte representations for a code point, non-canonical representations are discouraged, as they are not roundtrippable.

EUC-JP

An EUC-JP string is discouraged from containing bytes that are decoded in EUC-JP to a code point which is not encodable in EUC-JP.

In other words, use of JIS X 0212 characters is discouraged, as they are not roundtrippable.

An EUC-JP string is discouraged from containing bytes bytes that are decoded to code point char in EUC-JP if the bytes for char in EUC-JP are not equal to bytes.
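In EUC-JP, JIS X 0212 characters are three-byte sequences introduced by the byte 0x8F, and 0x8F cannot occur as a trail byte. A sketch of the check can therefore simply look for that byte. The function name `find_jis_x_0212` is hypothetical, and the code assumes that the input is not in error.

```python
from typing import List


def find_jis_x_0212(data: bytes) -> List[int]:
    """Return offsets of 0x8F lead bytes, each starting a JIS X 0212 sequence."""
    return [i for i, b in enumerate(data) if b == 0x8F]
```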

In other words, when there are multiple byte representations for a code point, non-canonical representations are discouraged, as they are not roundtrippable.

ISO-2022-JP

An ISO-2022-JP string is discouraged from containing bytes bytes that are decoded to code point char in ISO-2022-JP if the bytes for char in ISO-2022-JP are not equal to bytes.

In other words, when there are multiple byte representations for a code point, non-canonical representations are discouraged, as they are not roundtrippable.

An ISO-2022-JP string MUST NOT be a byte string for which the final value of ISO-2022-JP decoder output state is not ASCII when decoded as ISO-2022-JP.

An ISO-2022-JP string MUST NOT contain bytes 0x1B 0x24 0x40.

These bytes form the escape sequence that designates an obsolete standard, JIS C 6226-1978.
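The two MUST NOT requirements above can both be checked by scanning the escape sequences of the byte string. The following is a minimal sketch, assuming the input is not in error, so that every 0x1B byte begins one of the escape sequences recognized by the decoder. The name `scan_iso2022jp` and the state labels are hypothetical; the labels only loosely follow the state names of the Encoding Standard.

```python
from typing import List, Tuple

# Two bytes following ESC (0x1B) and the state they select.
STATES = {
    b"(B": "ASCII",
    b"(J": "Roman",
    b"(I": "katakana",
    b"$@": "lead byte",  # JIS C 6226-1978 (obsolete)
    b"$B": "lead byte",  # JIS X 0208
}


def scan_iso2022jp(data: bytes) -> Tuple[str, List[int]]:
    """Return (final state, offsets of ESC $ @ sequences)."""
    state = "ASCII"  # initial state
    obsolete = []
    i = data.find(b"\x1b")
    while i != -1:
        seq = data[i + 1:i + 3]
        if seq in STATES:
            state = STATES[seq]
        if seq == b"$@":
            obsolete.append(i)
        i = data.find(b"\x1b", i + 1)
    return state, obsolete
```

A string passes both checks when the returned state is "ASCII" and the list of offsets is empty.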