System: Blocking Unwanted Spiders and Scrapers
If you're concerned about bandwidth or server resources, or are just trying to protect your content from automated scrapers, then you should realise that it's not a fight that can be won outright. Having said that, here's a case study on how to recognise and block unwanted user agents from accessing your website, using mod_rewrite and an .htaccess file.
Recognising the Enemy
Let's start by scanning the logs to find out which IP addresses are making the most requests.
awk '{print $1}' combined_log | sort | uniq -c | sort -n | tail -40
Note: Replace combined_log with the location of your actual combined-log file, or use *combined_log to process all at once.
This returns a list of the top 40 IP addresses in terms of the number of requests. We'll stick to the top 5 for now:
824 XX.173.68.90
956 XX.184.192.199
983 XXX.9.3.10
1068 XXX.231.187.166
1098 XXX.235.117.192
The number on the right is the IP address, and the number on the left is the total number of requests made from that address. We can make this a bit more informative by displaying the domain associated with each address:
awk '{print $1}' combined_log | sort | uniq -c | sort -n | tail -40 \
| awk '{print $2,$2,$1}' | logresolve | awk '{printf "%6d %s (%s)\n",$3,$1,$2}'
Note: logresolve is a program that ships with Apache. You'll need to know where it's installed on your server, and be able to call it.
This returns the same list, but with the IP addresses also being resolved into domain names:
824 example.rr.com (XX.173.68.90)
957 example.optonline.net (XX.184.192.199)
983 example.centie.net.au (XXX.9.3.10)
1068 example.gulftel.com (XXX.231.187.166)
1098 example.westnet.com.au (XXX.235.117.192)
Now you might recognise some of the heavy-hitters. Some of them might be you, or a server script that runs periodically over the site (checking links or building a search index for example) or a legitimate search engine spider (Googlebot, msnbot, Slurp, ...). You should be able to safely ignore them.
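One way to keep the known crawlers out of the ranking is to filter on user agent before counting. The following is only a minimal sketch: it assumes the Combined Log Format (so the user agent is the sixth field when a line is split on double quotes), and the function name top_ips_excluding_bots and the list of bot names are ours, chosen for illustration. User agents are easily forged, so treat this as a first-pass filter rather than proof of identity.

```shell
#!/bin/bash
# Rank client IPs by request count, skipping requests whose user agent
# matches a list of known crawlers (illustrative list, adjust to taste).
# Assumes Combined Log Format: splitting on '"' puts the user agent in $6.
# Usage: top_ips_excluding_bots LOGFILE [LIMIT]
top_ips_excluding_bots() {
    local logfile=$1 limit=${2:-40}
    awk -F'"' '$6 !~ /Googlebot|msnbot|Slurp/ {
            split($1, a, " ")   # first quote-delimited chunk starts with the IP
            print a[1]
        }' "$logfile" \
        | sort | uniq -c | sort -n | tail -n "$limit"
}
```

Called as top_ips_excluding_bots combined_log 40, it should give the same style of list as before, minus any requests that identify themselves as one of the listed spiders.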
With the others, let's see what they're up to.
Divide and Conquer
Firstly, let's confirm the number of requests, and whether they're accessing a single or multiple sites. We start with the IP address with the most requests.
grep -cF 'XXX.235.117.192' *combined_log | grep -v ':0$'
Note: This is only going to be useful if you're scanning multiple logfiles.
In our case, the top IP address was only accessing a single site. That probably means that they have a particular interest in, and may in fact be the owner or regular user of that site. You should check this out before taking any action.
The second IP address turned out to be accessing multiple, unrelated websites. This merits further investigation to see if they're a legitimate spider or something less wholesome.
Let's see exactly what they're after:
grep XXX.231.187.166 combined_log | more
Things you should be looking for now:
- Timing of requests - regular, random, sporadic, ...
- Pages requested - small sample, random sample, systematic, pages only, images only, ...
- Server Status Codes - 200, 301, 401, 403, 404, ...
- User Agent - none, browser, search engine, spider, ...
For this IP address we see a typical spidering pattern, but no requests for robots.txt, no pause between requests and they triggered some 401 Unauthorised responses. Their user agent is always "Java/1.4.1_04".
The other three IP addresses returned similar results, the only difference being that the user agent changed slightly. We've made an 'executive decision' that they're all going to be blocked.
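Rather than paging through the raw log for each address, the checks listed above can be rolled into a quick summary. This is a rough sketch only, assuming the Combined Log Format with its usual field positions ($4 timestamp, $7 requested path, $9 status code); the function name summarise_ip is ours and not part of any standard tool.

```shell
#!/bin/bash
# Summarise the behaviour of one client in a Combined Log Format file.
# Usage: summarise_ip LOGFILE IP
summarise_ip() {
    local logfile=$1 ip=$2
    # Compare the first field exactly so 1.2.3.4 does not also match 11.2.3.45.
    awk -v ip="$ip" '
        $1 == ip {
            total++
            if (total == 1) first = substr($4, 2)  # timestamp of first request
            last = substr($4, 2)                   # timestamp of latest request
            status[$9]++                           # $9 is the HTTP status code
            if ($7 == "/robots.txt") robots++      # $7 is the requested path
            n = split($0, q, "\"")                 # q[6] is the user agent
            if (n >= 6) agent[q[6]]++
        }
        END {
            printf "requests: %d (first %s, last %s)\n", total, first, last
            printf "robots.txt requests: %d\n", robots
            for (s in status) printf "status %s: %d\n", s, status[s]
            for (a in agent)  printf "agent \"%s\": %d\n", a, agent[a]
        }' "$logfile"
}
```

Running summarise_ip combined_log XXX.231.187.166 should show at a glance whether robots.txt was ever requested, which status codes were triggered and which user agents were presented. The order of the status and agent lines is not guaranteed.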
Turning Back the Tide
The following lines added to your .htaccess file will block any requests coming from IP addresses starting with XXX.23 or XX.1 where the User Agent starts with Java:
# anonymous Java-based spiders
RewriteCond %{REMOTE_ADDR} ^XXX\.23[0-9] [OR]
RewriteCond %{REMOTE_ADDR} ^XX\.1[0-9][0-9]
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule .* - [F]
We could also have listed the IP addresses one at a time, but other tests showed a number of 'Java' user agents coming from the same ISP, or at least the same IP-block.
You need to be a little bit careful here. Sometimes an IP address belongs to a proxy server, which means that an entire organisation or thousands of subscribers to an ISP could be affected if your rules are too indiscriminate. Unless you're certain that an IP address or IP-block is only being used by malcontents, try to always add a User Agent RewriteCond so you don't end up blocking legitimate users.
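If you do prefer to list addresses one at a time, the rules can be generated from a plain list rather than typed by hand. The sketch below is an illustration only: the function name make_block_rules is ours, it assumes one IP address per line on standard input, and it matches against the mod_rewrite variable REMOTE_ADDR. It always appends a User Agent condition, in keeping with the advice above.

```shell
#!/bin/bash
# Read one IP address per line on stdin and print an .htaccess fragment
# that refuses requests from those addresses when the user agent starts
# with the given prefix.
# Usage: make_block_rules UA_PREFIX < ip_list.txt
make_block_rules() {
    local ua_prefix=$1
    awk -v ua="$ua_prefix" '
        NF {
            gsub(/\./, "\\.", $1)   # escape dots so they match literally
            ips[++n] = $1
        }
        END {
            if (n == 0) exit        # no addresses, no rules
            for (i = 1; i <= n; i++)
                printf "RewriteCond %%{REMOTE_ADDR} ^%s$%s\n", ips[i], (i < n ? " [OR]" : "")
            printf "RewriteCond %%{HTTP_USER_AGENT} ^%s\n", ua
            print  "RewriteRule .* - [F]"
        }'
}
```

For example, make_block_rules Java < ip_list.txt (where ip_list.txt is a hypothetical file of addresses) prints a block you can review before pasting it into .htaccess. As with any mod_rewrite rules, it only takes effect where RewriteEngine On has been set.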
Repairing the Breach
Having turned up a number of 'Java' agents that we decided to block, we might want to investigate other requests with similar user agents. For example, user agents starting with Java:
awk -F\" '($6 ~ /^Java/)' combined_log | awk '{print $1}' | sort | uniq -c | sort -n
This returns a list of IP addresses similar to those above. Because the list is sorted in ascending order, the addresses we've already picked up will appear at the bottom, with the less-active ones above them, giving you the option to investigate further.
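The same idea works in the other direction: instead of starting from a known user agent, you can rank all user agents by request count and look for anything unfamiliar. A minimal sketch, again assuming the Combined Log Format; the function name top_user_agents is ours.

```shell
#!/bin/bash
# List the most frequent user agents in a Combined Log Format file,
# busiest first.
# Usage: top_user_agents LOGFILE [LIMIT]
top_user_agents() {
    local logfile=$1 limit=${2:-20}
    # Splitting each line on '"' puts the user agent in field 6.
    awk -F'"' '{print $6}' "$logfile" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Anything that stands out in the output of top_user_agents combined_log can then be fed back into the awk command above by replacing /^Java/ with a pattern for that agent.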
As we mentioned at the start of this page, you're never going to be able to block all the 'bad' agents while letting in the good ones. There are simply too many possible variations. There are automated solutions for blocking IP addresses on a temporary or permanent basis based on behaviour, but that's a whole different ball-game.
Callum 14 August, 2014
Thanks for your reply to my question. Much appreciated.
Running with that example, I created a script that works for me. Not sure how elegant it is, but I'll share it in case someone else needs it:
#!/bin/bash
LOGFILE=/var/log/apache2/combined_log
TOP40=$(awk '{print $1}' $LOGFILE | sort | uniq -c | sort -rn | head -40 | awk '{print $2,$1}')
echo "$TOP40" | while read line
do
IFS=" "
set $line
IP=$1
SCORE=$2
grep $IP $LOGFILE | tail -1 | awk -v ip="$IP" -v score="$SCORE" '{print ip,ip,substr($4,2),score}' | logresolve | awk '{printf "%6d %s (%s) %s\n",$4,$1,$2,$3}'
done
Callum 13 August, 2014
How would you include the date & time of last visit in the command from Section 1? I can awk the log file to list the date & time like this:
awk '{print substr($4,2)}' combined_log
But I can't figure out how to get that included in the list of the top 40 visitors (as per your example).
It can't easily be done by extending that pipeline. The simplest approach is to run a loop over the IP addresses. This might help:
#!/bin/bash
LOGFILE=/var/log/apache2/combined_log
TOP40=$(awk '{print $1}' $LOGFILE | sort | uniq -c | sort -n | tail -40 | awk '{print $2}')
for ip in $TOP40; do
grep $ip $LOGFILE | tail -1 | awk -v var="$ip" '{print var,substr($4,2)}'
done
Nicole King 20 November, 2013
A nicer way to block crawlers is using ipset and iptables. You create a set and an iptables rule a bit like this:
ipset create crawlers hash:ip
ipset add crawlers XX.173.68.90
ipset add crawlers XX.184.192.199
ipset add crawlers XXX.9.3.10
ipset add crawlers XXX.231.187.166
ipset add crawlers XXX.235.117.192
iptables -A INPUT -p tcp --dport 80 --match set --match-set crawlers src -j REJECT
This has the advantage that it can be modified on the fly from the command line, without having to restart Apache.
Thanks. I'll have to check out 'ipset'. We mostly just use Fail2Ban.
Andrei 12 May, 2011
Nice post, but you should use
uniq -c | sort -n | tail -40
instead of
uniq -c | sort | tail -40
to do numeric sorting.
Updated now. Thanks