URLにおける文字

[3] URL ではいろいろな文字が使えます。

文字クラス

[6] URL の文字クラス

パーセント符号化

パーセント符号化, %u

URL query

ドメイン名

[4] ドメイン名を使うときは、そのまま UTF-8 でパーセント符号化することもできますし、 Punycode にすることもできます。 authority, IDN, mailto:

文字コード

[7] 歴史的にはともかく、現行仕様では URL query の部分にのみ、文字コードが関係します。 URL query それ以外の部分では常に UTF-8 と解釈されます。

[8] パーセント符号化すれば、任意の文字コードやバイナリーデータを埋め込むことはできます。

セキュリティー

文字のセキュリティー, URL

URL scheme ごとの制限

[5] URL scheme によっていろいろな制限があります。 URL scheme

歴史

初期 URL 仕様

[1] RFC 1738 (URL) 2.2. URL Character Encoding Issues

URLs are sequences of characters, i.e., letters, digits, and special characters. A URLs may be represented in a variety of ways: e.g., ink on paper, or a sequence of octets in a coded character set. The interpretation of a URL depends only on the identity of the characters used.

URL は、文字、すなわち letter, 数字, 特殊文字の連続体です。 URL は、例えば紙の上のインクや、符号化文字集合によるオクテットの連続体など、種々の方法で表現することができます。 URL の解釈は、私用されている文字の識別にのみ依存します。

In most URL schemes, the sequences of characters in different parts of a URL are used to represent sequences of octets used in Internet protocols. For example, in the ftp scheme, the host name, directory name and file names are such sequences of octets, represented by parts of the URL. Within those parts, an octet may be represented by the chararacter which has that octet as its code within the US-ASCII [20] coded character set.

ほとんどの URL scheme では、 URL の異なる部分の文字列はインターネットのプロトコルで使うオクテットの列を表現するのに使います。例えば、 ftp scheme では、ホスト名やディレクトリ名やファイル名がそのようなオクテットの列で、 URL の部分として表現します。これらの部分の中においては、オクテットはそのオクテットを US-ASCII 符号化文字集合中でその符号として持つ文字で表現するかもしれません。

In addition, octets may be encoded by a character triplet consisting of the character "%" followed by the two hexadecimal digits (from "0123456789ABCDEF") which forming the hexadecimal value of the octet. (The characters "abcdef" may also be used in hexadecimal encodings.)

加えて、オクテットは % という文字の後に 2つの16進数字 (0123456789ABCDEF から。) が続く、3文字列によって符号化することができます。 2つの数字がそのオクテットの16進値を表します。 (16進符号化では文字 abcdef も使っても構いません。)

Octets must be encoded if they have no corresponding graphic character within the US-ASCII coded character set, if the use of the corresponding character is unsafe, or if the corresponding character is reserved for some other interpretation within the particular URL scheme.

オクテットは、それに対応する図形文字が US‐ASCII 符号化文字集合中にない場合や、対応する文字が非安全である場合、又は対応する文字が特定の URL scheme 中で他の解釈のために予約されている場合には符号化しなければなりません。

以下略

[2] RFC 2396 (URI) 2. URI Characters and Escape Sequences

URI consist of a restricted set of characters, primarily chosen to aid transcribability and usability both in computer systems and in non-computer communications. Characters used conventionally as delimiters around URI were excluded. The restricted set of characters consists of digits, letters, and a few graphic symbols were chosen from those common to most of the character encodings and input facilities available to Internet users.

URI は、主として計算機システム中及び非計算機意思疎通の双方における転記性・可用性を考慮して選んだ限られた文字の集合で構成します。 URI のまわりの区切子として使われる文字は除外しました。限られた文字の集合には、数字, letter, いくつかの図形記号が含まれますが、これはほとんどの文字符号化及びインターネット利用者に利用可能な入力機能で共通なものから選びました。

uric = reserved | unreserved | escaped

Within a URI, characters are either used as delimiters, or to represent strings of data (octets) within the delimited portions. Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding. This representation is elaborated below.

URI 中では、文字は区切子として使うか、又は区切られた部分内のデータの列 (オクテット) を表現するのに使います。オクテットは直接文字で表現する (そのオクテットの US‐ASCII 文字を使う) か、又は escape 符号化を使います。

2.1 URI and non-ASCII characters

The relationship between URI and characters has been a source of confusion for characters that are not part of US-ASCII. To describe the relationship, it is useful to distinguish between a "character" (as a distinguishable semantic entity) and an "octet" (an 8-bit byte). There are two mappings, one from URI characters to octets, and a second from octets to original characters:

URI と文字との関係は US‐ASCII の一部ではない文字についての混乱の元になっています。関係を記述するために、「文字」 (相互に区別可能な意味的実体) と「オクテット」 (8ビット・バイト) を区別すると便利です。 URI 文字からオクテットへと、オクテットから元の文字への2つの写像があります。

URI character sequence->octet sequence->original character sequence

URI 文字列→オクテット列→元の文字列

A URI is represented as a sequence of characters, not as a sequence of octets. That is because URI might be "transported" by means that are not through a computer network, e.g., printed on paper, read over the radio, etc.

URI は、オクテットの列ではなく、文字の列として表現されます。どうしてかというと、 URI は計算機網以外の、例えば紙に印刷したり、ラジオで読み上げたりする方法で「伝送」されるかもしれないからです。

A URI scheme may define a mapping from URI characters to octets; whether this is done depends on the scheme. Commonly, within a delimited component of a URI, a sequence of characters may be used to represent a sequence of octets. For example, the character "a" represents the octet 97 (decimal), while the character sequence "%", "0", "a" represents the octet 10 (decimal).

URI scheme は、URI 文字からオクテットへの写像を定義するかもしれません。そうするかどうかは scheme によります。一般的には、 URI の区切られた部品中では、文字の列はオクテットの列を表現するのに使われるかもしれません。例えば、文字 a はオクテット 97 (10進) を表現し、文字列 % 0 a はオクテット 10 (10進) を表現します。

There is a second translation for some resources: the sequence of octets defined by a component of the URI is subsequently used to represent a sequence of characters. A 'charset' defines this mapping. There are many charsets in use in Internet protocols. For example, UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences of characters in the repertoire of ISO 10646.

資源によっては2回目の転写があります。 URI の部品として定義されたオクテットの列を今度は文字の列を表現するのに使います。「charset」がこの写像を定義します。インターネットのプロトコルで使われる charset は沢山あります。例えば、 UTF-8 はオクテットの列から ISO10646 のレパートリ中の文字の列への写像を定義しています。

In the simplest case, the original character sequence contains only characters that are defined in US-ASCII, and the two levels of mapping are simple and easily invertible: each 'original character' is represented as the octet for the US-ASCII code for it, which is, in turn, represented as either the US-ASCII character, or else the "%" escape sequence for that octet.

最も単純な場合には、元の文字列は US‐ASCII で定義された文字だけを含んでいて、 2つの水準の写像は単純で簡単に逆にすることができます。それぞれの「元の文字」はその US‐ASCII 符号のオクテットで表現されていて、これは今度は US‐ASCII 文字として表現されているか、又はそのオクテットの % エスケープ・シーケンスで表現されているかです。

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

しかし、元の文字の列が非 ASCII 文字を含んでいる場合には、状況はより難しくなります。文字列を表現することを目的としてオクテット列を転送するインターネットのプロトコルは、複数の charset を使うことができるのであれば、使用している charset を識別するための何らかの方法が提供されることが期待されます。しかし、この識別を実現するための準備は現在一般 URI 構文にはありません。個々の URI scheme は、単一の charset を要求しても構いませんし、既定の charset を定義しても構いませんし、あるいは使用している charset を示す方法を提供しても構いません。

It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

将来のこの仕様書の修正において、 URI 中での文字符号化の体系的な扱いが開発されることが期待されます。

URLにおける文字

URLにおける文字

文字クラス

パーセント符号化

URL query

ドメイン名

文字コード

セキュリティー

URL scheme ごとの制限

関連

歴史

初期 URL 仕様

2.1 URI and non-ASCII characters

2.2. Reserved Characters

2.3. Unreserved Characters

2.4. Escape Sequences

IRI

メモ