RFC 3986//2

2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.

URI の構文はおそらくは資源の識別のために使われるのであろうデータを文字列として符号化する方法を提供しています。その URI の各文字は輸送や表現のためにオクテットとして符号化されることがよくあります。この仕様では URI の文字と蓄積や転送の際に使用するオクテットとの写像のための特定の文字符号化を強制することはしません。 URI がプロトコル要素に現れる時は、そのプロトコルによって文字符号化が定義されます。定義がないときには、 URI は周りの文章と同じ文字符号化であると仮定します。

The ABNF notation defines its terminal values to be non-negative integers (codepoints) based on the US-ASCII coded character set [ASCII]. Because a URI is a sequence of characters, we must invert that relation in order to understand the URI syntax. Therefore, the integer values used by the ABNF must be mapped back to their corresponding characters via US-ASCII in order to complete the syntax rules.

ABNF 記法は終端値を US-ASCII 符号化文字集合に基づく非負整数 (符号位置) であると定義しています。 URI は文字列ですから、 URI 構文を理解するためにはその関係を逆転させなければなりません。従って、 ABNF で使われている整数値を US-ASCII により対応する文字に逆写像させることで完全な構文規則が得られます。

A URI is composed from a limited set of characters consisting of digits, letters, and a few graphic symbols. A reserved subset of those characters may be used to delimit syntax components within a URI while the remaining characters, including both the unreserved set and those reserved characters not acting as delimiters, define each component's identifying data.

URI は限られた文字 (letter) 、数字、いくつかの図形記号の集合から作ります。その文字のうち、予約集合は URI の中の構文部品を区切るために使用し、残った非予約集合と区切子のはたらきをしていない予約文字の集合は各構成部品の識別するデータを定義します。

2.1. Percent-Encoding

A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value. For example, "%20" is the percent-encoding for the binary octet "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space character (SP). Section 2.4 describes when percent-encoding and decoding is applied.

百分率符号化は部品中で認められている文字の集合に含まれていない文字や部品の内外で区切子として使われている文字について、その文字に対応するデータ・オクテットを表現する仕組みです。百分率符号化オクテットは、百分率記号 % の後にオクテットの数値を表現する2つの十六進数字をつけた 3文字の組で符号化します。例えば、 %20 はバイナリ・オクテット 001000000 (ABNF: %x20) の百分率符号化であり、 US-ASCII では間隔文字 (SP) に対応します。百分率符号化をいつ符号化したり復号したりするかは 2.4節で説明します。

pct-encoded = "%" HEXDIG HEXDIG

The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f', respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.

大文字の十六進数字 A〜F と小文字の数字 a〜f はそれぞれ同値です。 2つの URI が百分率符号化オクテットで使われている十六進数字の大文字・小文字の差以外は同じであるなら、この2つの URI は同値です。一貫性のため、 URI 生成器と URI 正規化器はすべての百分率符号化において大文字の十六進数字を使用するべきです。

2.2. Reserved Characters

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

URI は予約集合の文字によって区切られる部品や部分部品を含みます。これらの文字が予約と呼ばれるのは、 URI の参照を解く方法の一般構文や scheme 規定の構文や実装規定の構文で区切子として定義されるかもしれません (しされないかもしれません) からです。 URI 部品のデータが予約文字の区切子としての目的と衝突するのであれば、衝突しているデータは URI を作る前に百分率符号化しなければなりません。

reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

予約文字の目的は、 URI の他のデータと区別できる区切子文字の集合を提供することです。ある URI とその中の予約文字を対応する百分率符号化オクテットに置き換えた URI は同値ではありません。予約文字を百分率符号化したり、予約文字に対応する百分率符号化オクテットを復号したりすると、その URI のほとんどの応用による解釈のされ方が変わってしまいます。ですから、予約集合の文字は正規化から護られており、従って URI 中のデータ部品を区切るための scheme 規定の方法や生産者規定の方法で使っても安全なのです。

A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3. A component's ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are "reserved" for use as subcomponent delimiters within the component. Only the most common subcomponents are defined by this specification; other subcomponents may be defined by a URI scheme's specification, or by the implementation-specific syntax of a URI's dereferencing algorithm, provided that such subcomponents are delimited by characters in the reserved set allowed within that component.

予約文字の部分集合 gen-delims は3章の一般 URI 部品の区切子として使われています。部品の ABNF 構文規則は規則名 reserved や規則名 gen-delims を直接使いません。その代わりに、各構文規則が部品内で認める (その部品を区切らない) 文字を列挙しています。そしてその中で予約集合にもある文字はその部品の中で部分部品区切子として使うために予約されています。この仕様書では本当によく使われる部分部品だけを定義しています。他の部分部品は URI scheme の仕様書で定義されるかもしれませんし、実装規定の URI の参照を解く算法の構文で定義されるかもしれません。その部分部品はその部品の中で認められている予約集合の文字で区切ります。

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.

URI 生成応用は、 URI scheme が特に認めていない限り、予約集合の文字に対応するデータ・オクテットを部品の中で表現するにあたって百分率符号化するべきです。予約文字が URI 部品の中に見つかり、その文字に区切子としての役割が知られていなければ、 US-ASCII でその文字に対応するデータ・オクテットを表現していると解釈しなければなりません。

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

URI で認められておりかつ予約目的を持たない文字を、非予約と呼びます。非予約には大文字・小文字, 十進数字, ハイフン, 句点, 下線, チルダが含まれます。

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

ある URI と非予約文字を対応する百分率符号化 US-ASCII オクテットで置き換えた URI は等価です。両者は同じ資源を識別します。しかし、 URI 比較実装は比較の前に必ずしも正規化を行いません。一貫性のため、 ALPHA (%41〜%5A および %61〜%7A), DIGIT (%30〜%39), ハイフン (%2D), 句点 (%2E), 下線 (%5F), チルダ (%7E) の百分率符号化オクテットを URI 生成器は作るべきではなく、 URI 正規化器はこれらが URI 中に見つかった時は対応する非予約文字に復号するべきです。

2.4. When to Encode or Decode

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

普通の状況では、 URI 中のオクテットが百分率符号化されるのは構成する部品から URI を生成する過程においてのみです。この時が実装がどの予約文字を部分部品の区切子として使用し、どの予約文字を安全にデータとして使用できるかを決定する時です。一度 URI を生成してしまったら、常に URI は百分率符号化形となります。

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

URI の参照を解く時、 scheme 規定の参照を解く過程で重要な部品や部分部品は (あれば) 部品内の百分率符号化オクテットを復号するより先に構文解析して分離しておかなければなりません。そうしなければデータが部品区切子と誤解されてしまうかもしれませんから、安全に復号することはできません。唯一の例外は非予約集合の文字に対応する百分率符号化オクテットで、こちらはいつでも復号できます。例えば、チルダ (~) 文字に対応するオクテットはしばしば古い URI 処理実装によって %7E と符号化されます。 %7E はその解釈を変えてしまうことなく ~ に置換できます。

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

百分率記号 (%) は百分率符号化オクテットの標識として機能しますから、 URI 中のデータとして使用する時には必ず %25 と百分率符号化しなければなりません。同じ文字列を百分率符号化を行ったり復号したりを何度も繰り返すと、百分率データ・オクテットを百分率符号化の始まりと誤認してしまったり、逆に既に百分率符号化された文字列を百分率符号化することが起こってしまうかもしれませんから、実装はそのようなことを行ってはなりません。

2.5. Identifying Data

URI characters provide identifying data for each of the URI components, serving as an external interface for identification between systems. Although the presence and nature of the URI production interface is hidden from clients that use its URIs (and is thus beyond the scope of the interoperability requirements defined by this specification), it is a frequent source of confusion and errors in the interpretation of URI character issues. Implementers have to be aware that there are multiple character encodings involved in the production and transmission of URIs: local name and data encoding, public interface encoding, URI character encoding, data format encoding, and protocol encoding.

URI の文字は URI 部品それぞれの識別データを提供し、システム間の識別のための外部界面として働きます。 URI の生成の界面の存在と性質は URI を使うクライアントから隠されています (しそれ故にこの仕様で定義される相互運用性要件の適用範囲外です) が、 URI の文字の問題の解釈はよくある混乱と誤りの原因です。実装は URI の生成や転送に局所名・局所データの符号化、公開界面符号化、 URI 文字符号化、データ書式符号化、プロトコル符号化といくつもの文字符号化が関係していることに注意しなければなりません。

Local names, such as file system names, are stored with a local character encoding. URI producing applications (e.g., origin servers) will typically use the local encoding as the basis for producing meaningful names. The URI producer will transform the local encoding to one that is suitable for a public interface and then transform the public interface encoding into the restricted set of URI characters (reserved, unreserved, and percent-encodings). Those characters are, in turn, encoded as octets to be used as a reference within a data format (e.g., a document charset), and such data formats are often subsequently encoded for transmission over Internet protocols.

ファイル・システム名のような局所名は局所文字符号化で蓄積されています。 URI 生成応用 (例えば起点鯖) は普通局所符号化を意味のある名前の生成の基礎とします。 URI 生成器は局所符号化を公開界面に適当な符号化に変形し、公開界面符号化を限られた URI の文字の集合 (予約文字、非予約文字、百分率符号化) に変形します。 URI 文字は次にデータ書式の中で参照として使うためにオクテット列に符号化され、そのデータ書式はインターネットのプロトコルによる転送のために更に符号化されることもよくあります。

For most systems, an unreserved character appearing within a URI component is interpreted as representing the data octet corresponding to that character's encoding in US-ASCII. Consumers of URIs assume that the letter "X" corresponds to the octet "01011000", and even when that assumption is incorrect, there is no harm in making it. A system that internally provides identifiers in the form of a different character encoding, such as EBCDIC, will generally perform character translation of textual identifiers to UTF-8 [STD63] (or some other superset of the US-ASCII character encoding) at an internal interface, thereby providing more meaningful identifiers than those resulting from simply percent-encoding the original octets.

ほとんどのシステムでは URI 部品に現れる非予約文字は US-ASCII によるその文字の符号化に対応するデータ・オクテットを表現していると解釈します。 URI の消費者は文字 X がオクテット 01011000 に対応していると仮定しており、しかもこの仮定が間違っていてもそう仮定することに何の害もありません。内部的に識別子を EBCDIC など異なる文字符号化の形で提供しているシステムは通常内部界面で識別子を UTF-8 (やその他の US-ASCII 文字符号化の超集合) に文字を翻訳し、単なる元のオクテットの百分率符号化よりもずっと意味のある識別子を提供しています。

For example, consider an information service that provides data, stored locally using an EBCDIC-based file system, to clients on the Internet through an HTTP server. When an author creates a file with the name "Laguna Beach" on that file system, the "http" URI corresponding to that resource is expected to contain the meaningful string "Laguna%20Beach". If, however, that server produces URIs by using an overly simplistic raw octet mapping, then the result would be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An internal transcoding interface fixes this problem by transcoding the local name to a superset of US-ASCII prior to producing the URI. Naturally, proper interpretation of an incoming URI on such an interface requires that percent-encoded octets be decoded (e.g., "%20" to SP) before the reverse transcoding is applied to obtain the local name.

例えば、局所的に EBCDIC を基にしたファイル・システムを使って蓄積されたデータを提供する情報サービスが HTTP 鯖を通じてインターネットのクライアント向けに提供されていると考えましょう。著者がファイルを Laguna Beach という名前で作成すると、その資源に対応する http URI は意味のある文字列 Laguna%20Beach を含んでいることが期待されます。しかし、鯖が URI を単純に生のオクテットを写像するだけで URI を生成すると、 %D3%81%87%A4%95%81@%C2%85%81%83%88 が含まれる URI になります。この問題は内部の転符号化界面が局所名を URI の生成の前に US-ASCII の超集合に転符号化することで修正できます。当然、このような界面に来る URI を適切に解釈するためには局所名を得るための逆の転符号化を行う前に百分率符号化オクテットを復号する (例えば %20 を SP にする) 必要があります。

In some cases, the internal interface between a URI component and the identifying data that it has been crafted to represent is much less direct than a character encoding translation. For example, portions of a URI might reflect a query on non-ASCII data, or numeric coordinates on a map. Likewise, a URI scheme may define components with additional encoding requirements that are applied prior to forming the component and producing the URI.

場合によっては URI 部品と技巧的に表現された識別データの内部界面が文字符号化翻訳以上に非直接的なこともあります。例えば、 URI の一部分が非 ASCII データの照会を亜hんえいしていたり、地図の数値座標だったりするかもしれません。同様に、 URI scheme は部品を形成したて URI を生成する前に更に符号化を適用するように要求した部品を定義しても構いません。

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".

新しい URI scheme が普遍文字集合の文字からなる文章的なデータを表現する部品を定義する時には、データはまず UTF-8 文字符号化に従ってオクテットとして符号化するべきです。その後非予約集合に対応する文字のないオクテットだけを百分率符号化するべきです。例えば、文字 A は A と表現し、]] 文字 LATIN CAPITAL LETTER A WITH GRAVE は %C3%80 と表現し、 KATAKANA KETTER A は %E3%82%A2 と表現します。

License

RFCのライセンス