System: Fake Traffic from AVG
Here's another example of a large corporation messing with our log files for no good reason, affecting visitor and traffic statistics and generally annoying webmasters around the world. This time it's not Microsoft sending log spam, but AVG 'pre-checking' links that their users may or may not even follow.
Traffic from AVG LinkScanner?
So what's going on exactly? For a full description you can follow the links under References below, but I'll try to summarise.
A company called AVG provides 'virus protection' to Windows users. The most recent version of their software, AVG 8, includes a 'LinkScanner' application that fetches content from the sites listed on Google and other search engine results pages (SERPs) and checks it for suspicious code before the user clicks on any of them.
So if you have a website that appears in the top 10 or 20 search results for a common phrase, then every time an AVG 8 user searches for that phrase your website will see traffic from their (fake) user agent:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
The key part to look for is ;1813, which only appears in traffic from AVG LinkScanner.
The hit comes from the IP address of the person using their program and not from a central server, so it's impossible to block based on point of origin. The user themselves may never actually visit your website. The only function of this hit is for the AVG software to scan your site and display a 'green tick' if it passes muster.
This can result in hundreds of extra hits to your webserver (AVG has tens of millions of users), distorting traffic reports and wasting bandwidth. Our main hosting server was receiving around 1,500 hits per day from this user agent before we started blocking it (see below).
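To gauge how much of this traffic your own server receives, a short script along the following lines may help. This is only a sketch: it assumes your logs are in Apache's combined format with the user agent as the last quoted field, and the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the date part of the [dd/Mon/yyyy:hh:mm:ss zone] timestamp field.
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")

def linkscanner_hits_per_day(lines):
    """Count hits per day whose user agent contains ';1813'."""
    hits = Counter()
    for line in lines:
        fields = line.rstrip("\n").split('"')
        if len(fields) < 7:
            continue  # not a combined-format line (assumption)
        user_agent = fields[-2]  # last quoted field is the user agent
        match = DATE_RE.search(line)
        if match and ";1813" in user_agent:
            hits[match.group(1)] += 1
    return hits

# Hypothetical sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [01/Jul/2008:10:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"',
    '192.0.2.2 - - [01/Jul/2008:10:00:05 +0100] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9) Gecko/2008052906 Firefox/3.0"',
]

for day, count in linkscanner_hits_per_day(sample).items():
    print(day, count)
```

To run it against a real logfile, pass an open file instead of the sample list, for example linkscanner_hits_per_day(open("combined_log")).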
So how do I block it?
Using mod_rewrite in .htaccess, the simplest way is to check the user agent string for ;1813 and block any agent that matches. Other sites talk about returning pages with little or no content to fool the scanner, or even redirecting them back to the AVG website. We'll just block them. The rules below assume that mod_rewrite is enabled and that RewriteEngine On has already been set.
RewriteCond %{HTTP_USER_AGENT} ;1813
RewriteRule .* - [F]
Translation:
- Reject (403) all traffic from user agents containing ;1813
You can also block some variants of this program using:
RewriteCond %{HTTP_USER_AGENT} ^User-Agent
RewriteRule .* - [F]
Translation:
- Reject (403) all traffic from user agents starting with User-Agent
You can of course combine these into one instruction:
RewriteCond %{HTTP_USER_AGENT} ^User-Agent [OR]
RewriteCond %{HTTP_USER_AGENT} ;1813
RewriteRule .* - [F]
Please Note: There may be other user agents that start with User-Agent, but in practice they are usually not human users.
A better blocker
One thing you might have noticed about the AVG user agent is that it contains a semicolon followed by a non-space character. In almost all valid user agent strings a semicolon will be followed by a space.
So let's construct a rewrite rule that blocks not just the ;1813 user agent, but also other malformed user agents. Here is what we came up with:
# invalid user agent string
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*;[^\ )]
RewriteCond %{HTTP_USER_AGENT} !MSN(;|\ Optimized) [NC]
RewriteCond %{HTTP_USER_AGENT} !;(MSN|MEGAUPLOAD|EGAMES|Google\ Wireless\ Transcoder)
RewriteCond %{HTTP_USER_AGENT} ![0-9][0-9];[0-9]
RewriteRule .* - [F]
Please Note: By using the above rules you will be blocking a lot of different user agents, some of which may be valid browsers or devices. You should only do this if you are going to monitor your logfiles (see below) and make adjustments as necessary.
Translation:
- IF the user agent starts with Mozilla and contains a semicolon followed by a character that is neither a space nor a closing bracket;
- AND the user agent does not contain MSN; or MSN Optimized (in any combination of upper and lower case);
- AND no semicolon is immediately followed by MSN, MEGAUPLOAD, EGAMES or Google Wireless Transcoder (all relatively common);
- AND the user agent does not contain two digits followed by a semicolon and another digit (without spaces);
- THEN reject (403) all traffic from that user agent.
This effectively blocks user agents such as:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
Mozilla/4.0 (compatible; MSIE 7.0;Windows NT 5.1;.NET CLR 1.1.4322;.NET CLR 2.0.50727;.NET CLR 3.0.04506.30)
Mozilla/5.0(Windows;N;Win98;m18)Gecko/20010124
But, because of the extra conditions, it allows through the following, which appear from our investigations to be valid:
Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; PalmSource/Palm-D062; Blazer/4.5) 16;320x320
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;Google Wireless Transcoder;)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; MSN 9.0;MSN 9.1; MSNbQ002; MSNmen-us; MSNcOTH; MPLUS)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Sky Broadband; .NET CLR 1.1.4322; MSN Optimized;GB; MSN Optimized;GB)
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.14;EGAMES 1.0
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9) Gecko/2008052906 Firefox/2.0.0.14;MEGAUPLOAD 1.0
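If you want to try the rules against a user agent string before deploying them, the four conditions can be approximated outside Apache. The following is a rough sketch in Python, not a replacement for testing on the server: the function name would_block is made up, and it assumes Python's re module behaves like Apache's regex engine for these simple patterns.

```python
import re

# Python equivalents of the four RewriteCond patterns. The "\ " escapes are
# only needed in Apache configuration, so plain spaces are used here.
MALFORMED = re.compile(r"^Mozilla.*;[^ )]")
MSN_TOKEN = re.compile(r"MSN(;| Optimized)", re.IGNORECASE)  # [NC] flag
KNOWN_SUFFIX = re.compile(r";(MSN|MEGAUPLOAD|EGAMES|Google Wireless Transcoder)")
DIGITS = re.compile(r"[0-9][0-9];[0-9]")

def would_block(user_agent):
    """Return True if all four conditions hold, i.e. the rule would send a 403."""
    return bool(
        MALFORMED.search(user_agent)
        and not MSN_TOKEN.search(user_agent)
        and not KNOWN_SUFFIX.search(user_agent)
        and not DIGITS.search(user_agent)
    )

# User agents taken from the examples in this article.
examples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)",
    "Mozilla/5.0(Windows;N;Win98;m18)Gecko/20010124",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; PalmSource/Palm-D062; Blazer/4.5) 16;320x320",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.14;EGAMES 1.0",
]

for ua in examples:
    print("BLOCK" if would_block(ua) else "ALLOW", ua)
```

Conditions in a RewriteCond chain are ANDed by default, which is why the function combines them with and.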
To make sure you're not blocking anything that should not be blocked you can identify the affected user agents in your logfile using the following command:
awk '($9 == 403)' combined_log \
| awk -F\" '($(NF-1) ~ /^Mozilla.*;[^ )]/){print $(NF-1)}' \
| sort | uniq -c
This will list the user agents that have been blocked (403) and that match the same regular expression as the first RewriteCond above, with a count for each. You can then check whether the 'exceptions' are sufficient. Remember also that a user agent that would otherwise have been let through can still be blocked by other rules.
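If awk is not available, the same check can be sketched in Python. This is an illustrative equivalent rather than a tested tool: it assumes the combined log format, reads the status code from the text straight after the quoted request, and uses a made-up function name and sample data.

```python
import re
from collections import Counter

# Same pattern as the first RewriteCond, written with a plain space.
MALFORMED = re.compile(r"^Mozilla.*;[^ )]")

def blocked_agents(lines):
    """Count user agents on 403 lines that match the malformed pattern."""
    counts = Counter()
    for line in lines:
        quoted = line.rstrip("\n").split('"')
        if len(quoted) < 7:
            continue  # not a combined-format line (assumption)
        status = quoted[2].split()[:1]  # status code follows the request
        user_agent = quoted[-2]         # last quoted field
        if status == ["403"] and MALFORMED.search(user_agent):
            counts[user_agent] += 1
    return counts

# Hypothetical sample lines, for illustration only.
AVG_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
sample = [
    '192.0.2.1 - - [01/Jul/2008:10:00:00 +0100] "GET / HTTP/1.1" 403 199 "-" "%s"' % AVG_UA,
    '192.0.2.3 - - [01/Jul/2008:10:02:00 +0100] "GET /a HTTP/1.1" 403 199 "-" "%s"' % AVG_UA,
    '192.0.2.4 - - [01/Jul/2008:10:03:00 +0100] "GET / HTTP/1.1" 200 512 "-" "%s"' % AVG_UA,
]

for agent, count in blocked_agents(sample).most_common():
    print(count, agent)
```

For a real logfile, pass an open file in place of the sample list, for example blocked_agents(open("combined_log")).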
As always, your feedback is welcome. Use the Feedback link below.
Privacy concerns
If you are a user of AVG you should be aware that there are also implications for you - it's not just a headache for webmasters. Every time you do a search in Google or another major search engine you will see some indication as to whether AVG thinks the sites on the results page are secure, but at the same time every one of those sites will receive a hit from your IP address.
So even if you're a very privacy-aware user who doesn't click on SERP links, but instead copies and pastes the links into a new browser window, it's too late. Your IP address will already have been recorded by every site in the list. They can use this information to deduce more or less what you were searching for. Your ISP will also see traffic from your computer to all those sites - even if you never visit them yourself.
So all round it's a lose-lose-lose situation. AVG gives itself a bad name, its users expose their browsing habits, and webmasters and website owners get their stats distorted. Clearly something has to change!
References