* Introduction

This document describes how to validate files with respect to its [[encoding]]
in an implementation that decodes a file for the purpose of conformance checking.

[EG[
An example of such implementation is perl-web-encodings
<https://github.com/manakai/perl-web-encodings>.
]EG]

* Terminology

This document depends on the [CITE[Infra Standard]].

The terms
[DFN[byte]],
[DFN[byte string]],
[DFN[code point]], and
[DFN[string]]
are defined by the [CITE[Infra Standard]].

The terms
[DFN[encoding]],
[DFN[encode]],
[DFN[decode]],
[DFN[decoder]],
[DFN[[CODE[fatal]]]],
[DFN[error]],
[DFN[GB18030]],
[DFN[GBK]],
[DFN[Big5]],
[DFN[Shift_JIS]],
[DFN[EUC-JP]],
[DFN[ISO-2022-JP]],
[DFN[ISO-2022-JP decoder output state]],
[DFN[ASCII]],
and 
[DFN[EUC-KR]]
are defined by the [CITE[Encoding Standard]].

The term [DFN[BOM]] is defined by the [CITE[Unicode Standard]].

The term [DFN[[VAR[encoding]] string]], where [VAR[encoding]] is an [[encoding]],
represents a [[byte string]] which is intended to be [[decoded][decode]]
by [VAR[encoding]]'s [[decoder]].

[DFN[Bytes for [VAR[code point]] in [VAR[encoding]]]],
where [VAR[code point]] is a [[code point]] and [VAR[encoding]] is an [[encoding]],
are the result of [[encoding][encode]] [VAR[code point]] in [VAR[encoding]].

A [[code point]] [VAR[char]] is [DFN[encodable]] in [[encoding]] [VAR[encoding]]
if [[encoding][encode]] a [[string]] [VAR[char]] in [VAR[encoding]] with
[VAR[error mode]] [CODE[fatal]] would not result in [[error]].

An [VAR[encoding]] string [VAR[string]] is [DFN[in error]] if
[[decoding][decode]] [VAR[string]] with [[encoding]] [VAR[encoding]] and
[VAR[error mode]] [CODE[fatal]] would result in [[error]].

* General rules

An [VAR[encoding]] string [MUST[MUST NOT]] be [[in error]].

[[BOM]] [SHOULD[SHOULD NOT]] be used.

;; It is not roundtrippable and it makes any encoding metadata ignored.

* GB18030

A [[GB18030]] or [[GBK]] string is discouraged to contain [[bytes][byte]] which
is equal to 0x80 or 0xA3 0xA0 and
is [[decoded][decode]] to [[code point]] U+20AC or U+3000 in [[GB18030]] or [[GBK]].

* Big5

A [[Big5]] string is discouraged to contain bytes for a [[code point]]
in [[Big5]] which is not [[encodable]] in [[Big5]].

;; In other words, use of HKSCS extensions are discouraged, as they are not
roundtrippable.

A [[Big5]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[Big5]]
if bytes for [VAR[char]] in [[Big5]] is not equal to [VAR[bytes]].

;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.

* Shift_JIS

A [[Shift_JIS]] string is discouraged to contain bytes for a [[code point]]
in [[Shift_JIS]] which is not [[encodable]] in [[Shift_JIS]].

;; In other words, use of EUDCs are discouraged, as they are not
roundtrippable and in fact not interoperable at all.

A [[Shift_JIS]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[Shift_JIS]]
if bytes for [VAR[char]] in [[Shift_JIS]] is not equal to [VAR[bytes]].

;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.

* EUC-JP

An [[EUC-JP]] string is discouraged to contain bytes for a [[code point]]
in [[EUC-JP]] which is not [[encodable]] in [[EUC-JP]].

;; In other words, use of JIS X 0212 characters are discouraged, as they are not
roundtrippable.

An [[EUC-JP]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[EUC-JP]]
if bytes for [VAR[char]] in [[EUC-JP]] is not equal to [VAR[bytes]].

;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.

* ISO-2022-JP

An [[ISO-2022-JP]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[ISO-2022-JP]]
if bytes for [VAR[char]] in [[ISO-2022-JP]] is not equal to [VAR[bytes]].

;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.

An [[ISO-2022-JP]] string [MUST[MUST NOT]] be a [[byte string]] 
the final value of [[ISO-2022-JP decoder output state]] is not [[ASCII]]
when [[decoded][decode]] as [[ISO-2022-JP]].

An [[ISO-2022-JP]] string [MUST[MUST NOT]] contain bytes 0x1B 0x24 0x40.

;; It designates an obsolete standard.