

Today's spotlight: what traffic spikes do short bursts of intensive bot visits cause, and how can they be resolved?

Source: CSDN  Time: 2023-03-31 08:01:11

Early on a weekend morning I received an alert email. My first guess was that the site was under attack, or that it was a cache/log/memory problem. A quick look at access.log showed that, during the affected period, a large wave of bots (bot: a computer program that performs a particular task again and again many times) had been hitting my site.

http://ltx71.com  http://mj12bot.com http://www.bing.com/bingbot.htm http://ahrefs.com/robot/ http://yandex.com/bots



website.com (AWS) - Monitor is Down

Down since Mar 25, 2017 1:38:58 AM CET

Site Monitored: website.com (AWS)

Resolved IP: 54.171.32.xx

Reason: Service Unavailable.

Monitor Group: XX Applications

Outage Details

Location: London - UK (5.77.35.xx)
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
Headers:
HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
Content-Length: 0
Connection: keep-alive
GET / HTTP/1.1
Cache-Control: no-cache
Accept: */*
Connection: Keep-Alive
Accept-Encoding: gzip
User-Agent: Site24x7
Host: xxx

Location: Seattle - US (104.140.20.xx)
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
Headers:
HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
Content-Length: 0
Connection: keep-alive
GET / HTTP/1.1
Cache-Control: no-cache
Accept: */*
Connection: Keep-Alive
Accept-Encoding: gzip
User-Agent: Site24x7
Host: xxx

A quick web search showed that many webmasters have run into the same problem: a traffic spike caused by a short burst of intensive bot visits leaves the site unable to serve other clients. From the analysis in this article, we can see several ways to block these web bots.

1. robots.txt

Many crawlers request robots.txt first, as in the following log entries (the 404 responses show this site did not have one at the time):

"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"162.210.196.98" - - [25/Mar/2017:01:39:18 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"

Many bot operators also document how to opt out of being crawled. Take MJ12bot as an example:

How can I block MJ12bot?

MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from crawling your website, then add the following text to your robots.txt:

User-agent: MJ12bot

Disallow: /

Please do not waste your time trying to block the bot via IP in htaccess - we do not use any consecutive IP blocks so your efforts will be in vain. Also please make sure the bot can actually retrieve robots.txt itself - if it can't, then it will assume (this is the industry practice) that it's okay to crawl your site.

If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: bot@majestic12.co.uk. Please provide the URL to your website and log entries showing the bot trying to retrieve pages that it was not supposed to.

How can I slow down MJ12bot?

You can easily slow down bot by adding the following to your robots.txt file:

User-Agent: MJ12bot

Crawl-Delay:   5

Crawl-Delay should be an integer number and it signifies the number of seconds to wait between requests. MJ12bot will delay up to 20 seconds between requests to your site - note, however, that while it is unlikely, it is still possible that your site may have been crawled from multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. This Crawl-Delay parameter will also be honoured if it was set for the * wildcard.

If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
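
As a small sketch of how the two robots.txt mechanisms combine: the wildcard Crawl-Delay below throttles any crawler that honours the directive, and per the FAQ above MJ12bot will also slow down even though it is not named explicitly. The 10-second value is an arbitrary illustration, not a value suggested by MJ12bot.

User-agent: *
Crawl-Delay: 10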

With that in mind, to block these spiders outright we can write a robots.txt like the following:

User-agent: YisouSpider

Disallow: /

User-agent: EasouSpider

Disallow: /

User-agent: EtaoSpider

Disallow: /

User-agent: MJ12bot

Disallow: /

In addition, given that many bots also hit these paths:

/wp-login.php

/wp-admin/

/trackback/

/?replytocom=

Many WordPress sites genuinely rely on these directories, so how can we adjust robots.txt without breaking functionality?

robots.txt before the change:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-includes
Disallow: /?s=

robots.txt after the change:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-*
Allow: /wp-content/uploads/
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /comments
Disallow: /wp-includes
Disallow: /*/trackback
Disallow: /*?replytocom*
Disallow: /?p=*&preview=true
Disallow: /?s=

That said, many crawlers simply ignore robots.txt. In the example below, the crawler never requested robots.txt at all:

"10.70.8.30, 163.172.65.40" - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/js/utils.js HTTP/1.1" 200 5345 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/css/home.css HTTP/1.1" 200 8511 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"

In that case we have to try the other approaches.

2. .htaccess

The idea is URL rewriting: whenever a request is found to come from one of these user agents, deny it. The article by "~吉尔伽美什" covers many uses of .htaccess; the relevant sections are excerpted below.

5. Blocking users by IP

order allow,deny
deny from 123.45.6.7
deny from 12.34.5. (an entire class C block)
allow from all

6. Blocking users/sites by referrer (requires mod_rewrite)

Example 1. Block a single referrer: badsite.com

RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite\.com [NC]
RewriteRule .* - [F]

Example 2. Block multiple referrers: badsite1.com, badsite2.com

RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite1\.com [NC,OR]
RewriteCond %{HTTP_REFERER} badsite2\.com
RewriteRule .* - [F]

[NC] - case-insensitive
[F] - 403 Forbidden

Note that the "Options +FollowSymlinks" statement above is commented out. If the server has not already enabled FollowSymLinks in the corresponding section of httpd.conf, you need to add that line back, otherwise you will get a "500 Internal Server Error".

7. Blocking bad bots and site rippers (aka offline browsers) (requires mod_rewrite)

Bad bots? For example crawlers that harvest e-mail addresses for spam, and crawlers that do not obey robots.txt (such as baidu?). They can be identified by HTTP_USER_AGENT. (There are even more shameless ones, such as "中搜 zhongsou.com", that set their agent string to "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"; against those, nothing can be done.)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

[F] - 403 Forbidden
[L] - Last rule (stop processing further rewrite rules)

8. Change your default directory page

DirectoryIndex index.html index.php index.cgi index.pl

9. Redirects

A single file:

Redirect /old_dir/old_file.html http://yoursite.com/new_dir/new_file.html

An entire directory:

Redirect /old_dir http://yoursite.com/new_dir

Effect: the same as moving the directory:

http://yoursite.com/old_dir -> http://yoursite.com/new_dir
http://yoursite.com/old_dir/dir1/test.html -> http://yoursite.com/new_dir/dir1/test.html

Tip: when Apache's default per-user directories are in use, Redirect may not work as expected. With http://mysite.com/~windix, if you want to redirect http://mysite.com/~windix/jump, the following Redirect does not work:

Redirect /jump http://www.google.com

The correct form is:

Redirect /~windix/jump http://www.google.com

(source: .htaccess Redirect in "Sites" not redirecting: why?)

10. Prevent viewing of .htaccess file

order allow,deny
deny from all
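
Applying the user-agent approach from section 7 above to the bots that actually showed up in this incident's access.log, a minimal .htaccess sketch could look like the following. The bot names are taken from the logs and URLs earlier in this article and are illustrative only; [NC] makes the substring match case-insensitive, and search-engine crawlers such as bingbot are deliberately left out, since blocking them would also remove the site from those search engines.

RewriteEngine On
# Deny any request whose User-Agent contains one of these bot names
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|ltx71) [NC]
RewriteRule ^.* - [F,L]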

3. Denying access by IP

You can specify in the Apache configuration file httpd.conf that requests from certain IPs are to be refused:

Order allow,deny

Allow from all

Deny from 5.9.26.210

Deny from 162.243.213.131

However, since these requests often do not come from fixed IP addresses, this approach is not very practical; moreover, changes to httpd.conf only take effect after restarting Apache, so modifying .htaccess is the recommended option.
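
To round this off, here is a sketch of what the same IP denies look like when moved into .htaccess as recommended above. It assumes the Apache 2.2 Order/Allow/Deny syntax used in the httpd.conf example; on Apache 2.4 the mod_authz_core Require syntax replaces it, and in either case AllowOverride must permit Limit for the directives to work in .htaccess.

# Apache 2.2 style, same directives as the httpd.conf example
Order allow,deny
Allow from all
Deny from 5.9.26.210
Deny from 162.243.213.131

# Apache 2.4 equivalent using mod_authz_core
<RequireAll>
    Require all granted
    Require not ip 5.9.26.210
    Require not ip 162.243.213.131
</RequireAll>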
