Robots Exclusion Protocol

robots.txt

[23] Webサーバーの /robots.txt は、当該Webサーバーのクロールに関する検索エンジンのロボットへの指示を記述するファイルです。

仕様書

[2] robotstxt.org http://www.robotstxt.org/
- [ROBOTS94] A Standard for Robot Exclusion http://www.robotstxt.org/wc/norobots.html
- [ROBOTS97] A Standard for Robot Exclusion http://www.robotstxt.org/wc/norobots-rfc.html
HTML 4
- The robots.txt file IW:HTML4:"appendix/notes.html#h-B.4.1.1"

[ROBOTS94] が1994年の合意で、その後1997年に Internet Draft [ROBOTS97] が書かれましたが、未完成のままです。 HTML 4 は附属書 B (参考) の中で解説していますが、規定はしていません。

[3] HTML 4.0 の解説には間違いが沢山ありました。 HTML 4.01 では修正されています。

HTML 4.01 A.1.2 Errors that were corrected IW:HTML4:"appendix/changes.html#h-A.1.2"

[4] sitemaps.org - Protocol (2007-04-11 20:52:23 +09:00 版) http://www.sitemaps.org/protocol.html#submit_robots (名無しさん 2007-04-12 11:00:38 +00:00)

プロトコル

Crawl-Delay:

実装

[21] wget は複数の URL を保存する時標準設定で robots.txt に従いますが、設定を変更して無効化することもできます。

[22] robots.txt は検索エンジン向けで運用されている現実があるため、 wget のような異なる目的のロボットがこれに従う必要があるのかどうか、疑問はあります。

歴史

[1] robotはぢきについて http://c-moon.jp/robots.shtml

[6] 自分のサイトを更新チェックされたくない - はてなアンテナのヘルプ (2011-09-09 12:18:55 +09:00 版) http://hatenaantenna.g.hatena.ne.jp/keyword/%E8%87%AA%E5%88%86%E3%81%AE%E3%82%B5%E3%82%A4%E3%83%88%E3%82%92%E6%9B%B4%E6%96%B0%E3%83%81%E3%82%A7%E3%83%83%E3%82%AF%E3%81%95%E3%82%8C%E3%81%9F%E3%81%8F%E3%81%AA%E3%81%84?kid=19#robots

[8] WWW::RobotsRules - search.cpan.org ( (2013-03-10 05:21:28 +09:00 版)) http://search.cpan.org/dist/lcwa/lib/lwp/lib/WWW/RobotRules.pm

[9] WWW::RobotRules::Parser - search.cpan.org ( (2013-03-10 05:22:34 +09:00 版)) http://search.cpan.org/dist/WWW-RobotRules-Parser/lib/WWW/RobotRules/Parser.pm

[10] WWW::RobotRules - search.cpan.org ( (2013-03-10 05:23:35 +09:00 版)) http://search.cpan.org/dist/WWW-RobotRules/lib/WWW/RobotRules.pm

[11] WWW::RobotRules::Extended - search.cpan.org ( (2013-03-10 05:28:09 +09:00 版)) http://search.cpan.org/dist/WWW-RobotRules-Extended/lib/WWW/RobotRules/Extended.pm

[12] robots.txtにおけるAllowとDisallowとSitemapの優先順位 - 45式::雑記 ( (渡辺四ん五(4n5) 著, 2012-02-29 17:58:01 +09:00 版)) http://www.45shiki.net/blog/2009/12/b000924.htm

[13] Robots exclusion standard - Wikipedia, the free encyclopedia ( (2013-03-10 00:00:25 +09:00 版)) http://en.wikipedia.org/wiki/Robots_exclusion_standard

[14] Official Google Webmaster Central Blog: Improving on Robots Exclusion Protocol ( (2014-03-19 08:31:54 +09:00 版)) http://googlewebmastercentral.blogspot.jp/2008/06/improving-on-robots-exclusion-protocol.html

[15] The Web Robots Pages ( (2013-12-03 09:21:42 +09:00 版)) http://www.robotstxt.org/faq/future.html

[16] Robots.txt Specifications - Webmasters — Google Developers ( (2012-08-02 09:24:38 +09:00 版)) https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

[17] How to Create a Robots.txt File - Bing Webmaster Tools ( (2014-03-20 07:09:51 +09:00 版)) http://www.bing.com/webmaster/help/how-to-create-a-robots-txt-file-cb7c31ec

[18] 著作権法施行規則 ( (2014-10-09 01:08:42 +09:00 版)) http://law.e-gov.go.jp/htmldata/S45/S45F03501000026.html#1000000000007000000000000000000000000000000000000000000000000000000000000000000

[26] Applebot について - Apple サポート ( (2015-06-02 11:01:40 +09:00)) https://support.apple.com/ja-jp/HT204683

Applebot は、Apple の Web クローラーです。Siri や Spotlight 検索候補などの製品で使用されています。慣習的な robots.txt の規則と robots meta タグを尊重します。アクセス元は 17.0.0.0 ネットワークブロックです。
User-agent 文字列には、“Applebot” と補足のエージェント情報が記載されます。下記は、その例です。
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1)
robots の制御指示で Applebot には言及していなくても Googlebot について指定されている場合、Apple のロボットは Googlebot に対する指示に従います。

[27] 開発サーバーなどまったくクロールされたくない場合は、

User-agent: *
Disallow: /

... という robots.txt を返すべきです。

[5] Robots exclusion standard - Wikipedia (2016-11-01 14:15:50 +09:00) https://en.wikipedia.org/wiki/Robots_exclusion_standard

[19] An Analysis of the World's Leading robots.txt Files (Ben Frederickson著, 2017-10-20 22:51:02 +09:00) http://www.benfrederickson.com/robots-txt-analysis/

[20] トップ100万ウェブサイトのrobots.txtを解析した人とその結果 | 秋元@サイボウズラボ・プログラマー・ブログ (2018-01-04 13:08:43 +09:00) http://developer.cybozu.co.jp/akky/2017/11/one-million-robots-txt-analyzed/

[24] robots.txt ファイルについて - Search Console ヘルプ (2018-09-23 17:04:59 +09:00) https://support.google.com/webmasters/answer/6062608#robotted-but-indexed

Google では、robots.txt でブロックされているコンテンツをクロールしたりインデックスに登録したりすることはありませんが、ブロック対象の URL がウェブ上の他の場所からリンクされている場合、その URL を検出してインデックスに登録する可能性はあります。

[25] Wayback Machineがrobots.txtを無視するようになるかも? | 海外SEO情報ブログ (2018-11-22 13:17:16 +09:00) https://www.suzukikenichi.com/blog/wayback-machine-planning-to-ignore-robotstxt/

[28] Official Google Webmaster Central Blog: Formalizing the Robots Exclusion Protocol Specification (2019-07-01 23:33:39 +09:00) https://webmasters.googleblog.com/2019/07/rep-id.html

[29] Official Google Webmaster Central Blog: Google's robots.txt parser is now open source (2019-07-01 23:33:39 +09:00) https://webmasters.googleblog.com/2019/07/repp-oss.html

[30] google/robotstxt: The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11). (2019-07-02 09:52:10 +09:00) https://github.com/google/robotstxt

[31] draft-rep-wg-topic-00 - Robots Exclusion Protocol (2019-07-02 03:55:27 +09:00) https://tools.ietf.org/html/draft-rep-wg-topic-00

[32] GNU Wget 1.20 Manual (2020-10-01T07:32:00.000Z) https://www.gnu.org/software/wget/manual/wget.html#index-wgetrc-commands

[33] Robots.txt meant for search engines don’t work well for web archives - Internet Archive Blogs (2021-05-20T09:10:02.000Z) https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/