System: Referer Spam from Live Search
Occasionally something a bit 'odd' shows up in our logfiles that we can't identify or find a satisfactory explanation for on the web. Here's the latest that seems to originate from inside Redmond (our old friend Microsoft Corporation).
Description of the event
This is a strange pattern - it looks like log- or referer-spam except that it comes from the Microsoft corporate network. Similar logfile entries have been reported on WebmasterWorld, but apart from a cryptic message from msndude describing it as 'part of a quality check we run on selected pages', no one has come up with a sensible explanation.
This is how it all began:
Date Recorded: | 3 April 2007 |
---|---|
IP Address: | 131.107.0.96 (tide526.microsoft.com) |
User Agent: | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; Win64; x64; SV1; .NET CLR 2.0.50727) |
The agent appears to be a normal Internet Explorer web browser - because it downloads CSS, JavaScript and even MP3 files that are included from the webpage - but with image loading disabled. The referer strings however are not valid searches - at least not on the public version of Live Search. They seem to be targeting 'spammy' keywords, but the websites receiving the hits (and each table row is a separate website) don't mention those keywords and are not even in related industries.
The same pattern goes back at least a couple of weeks and probably longer. Anyone with a theory is welcome to get in touch. It's put a bee in my bonnet because our search engine traffic reports are showing these 'spammy' keywords to our clients.
Referer |
---|
http://search.live.com/result.aspx?q=Dodge&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=hold+em&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=buspar&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=tramadol&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=clomid&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=nokia&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=upskirt&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=cell+phone&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=pontiac&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=airfare&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=volkswagen&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=diazepam&mrt=en-us&FORM=LVSP |
http://search.live.com/result.aspx?q=gay&mrt=en-us&FORM=LVSP |
This pattern has changed since this article was written. The referrals have changed from LVSP to LIVSOP and the keywords are no longer so offensive or spammy. More on this, and a means of blocking these referrals, below.
Checking your log files
To display logfile entries of this type you can use the command:
grep LVSP combined_log
And the following awk command will show you the keywords being passed in the HTTP Referer string:
awk -F\" '($4 ~ /LVSP$/){print $4}' combined_log | awk -F[=\&] '{print $2}'
Just replace combined_log with a reference to one or more combined log files on your server.
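To see which keywords turn up most often, the same extraction can be piped through sort and uniq. This is only a sketch, assuming the standard Apache combined log format; the sample_log file, its entries and the function name lvsp_keyword_counts are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data in Apache combined log format (illustration only).
cat > sample_log <<'EOF'
131.107.0.96 - - [03/Apr/2007:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/result.aspx?q=nokia&mrt=en-us&FORM=LVSP" "Mozilla/4.0"
131.107.0.96 - - [03/Apr/2007:10:00:05 +0000] "GET /about HTTP/1.1" 200 2345 "http://search.live.com/result.aspx?q=airfare&mrt=en-us&FORM=LVSP" "Mozilla/4.0"
131.107.0.95 - - [03/Apr/2007:10:01:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/result.aspx?q=nokia&mrt=en-us&FORM=LVSP" "Mozilla/4.0"
192.0.2.10 - - [03/Apr/2007:10:02:00 +0000] "GET / HTTP/1.1" 200 1234 "http://www.google.com/search?q=css+tutorial" "Mozilla/5.0"
EOF

# Print each LVSP keyword with the number of times it was seen,
# most frequent first. $1 is the path to a combined log file.
lvsp_keyword_counts() {
    awk -F'"' '$4 ~ /LVSP$/ {print $4}' "$1" \
        | awk -F'[=&]' '{print $2}' \
        | sort | uniq -c | sort -rn
}

lvsp_keyword_counts sample_log
```

Run against a real log, a handful of keywords repeated across many hits would be a fair sign that you are looking at the same pattern.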
More hijinks from Live Search LIVSOP
Thankfully the stream of inappropriate search terms from Microsoft's network seems to have stopped for now. They have however been replaced by an almost identical series of 'fake' search referrals flagged as 'LIVSOP' which obviously relates to 'LVSP' in some way.
The new referrals are coming from a range of IP addresses inside Microsoft Corporation. In the last 30 hours our server has received 180 requests from this source from 85 IP addresses in the block 65.55.165.0/25 (65.55.165.0 - 65.55.165.127). The user agent in each case is:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
You can find these entries in your log file using:
awk -F\" '($4 ~ /LIVSOP$/){print $4}' combined_log | awk -F[=\&] '{print $2}'
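The request and address counts quoted above can be reproduced with something like the following. It is a rough sketch rather than the exact command we used: the sample_livsop_log file, the shape of the sample referer URLs and the function name livsop_summary are all assumptions for the sake of the example.

```shell
#!/bin/sh
# Hypothetical sample entries (illustration only).
# 192.0.2.10 stands in for a genuine visitor.
cat > sample_livsop_log <<'EOF'
65.55.165.12 - - [10/Apr/2007:09:00:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/results.aspx?q=cookies&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.12 - - [10/Apr/2007:09:00:02 +0000] "GET /style.css HTTP/1.1" 200 512 "http://search.live.com/results.aspx?q=button&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
65.55.165.77 - - [10/Apr/2007:09:05:00 +0000] "GET /css/ HTTP/1.1" 200 4321 "http://search.live.com/results.aspx?q=border&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
192.0.2.10 - - [10/Apr/2007:09:06:00 +0000] "GET / HTTP/1.1" 200 1234 "http://www.google.com/search?q=css+borders" "Mozilla/5.0"
EOF

# Count LIVSOP requests and the number of distinct client addresses.
# $1 is the path to a combined log file.
livsop_summary() {
    awk -F'"' '$4 ~ /LIVSOP$/ {
        split($1, head, " ")      # head[1] is the client IP address
        requests++
        if (!(head[1] in seen)) { seen[head[1]] = 1; ips++ }
    }
    END { printf "%d requests from %d addresses\n", requests, ips }' "$1"
}

livsop_summary sample_livsop_log
```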
In each case the search term referred to is a single word, one that would never bring up that particular website or web page, but that does seem to always appear in the TITLE tag of the target page.
And there are plenty of (rightly, in my opinion) pissed off webmasters as you can see from the links below. Given that log files are the most accurate record of the performance of a website it's difficult to see how Microsoft can justify inserting fake search referrals.
The excuse put forward by msndude is that this is some kind of 'quality check', but surely they could do this without passing a referrer so similar to 'real' search referrals that it pollutes web traffic reports and gives a false impression that Live Search is being used to find a website?
On this website (The Art of Web) we have had over 12,000 search referrals from Google in the last month, compared to just 110 from Live Search. Looking deeper however shows that only about 20% (one in five!) of the referrals from Live Search are real. In other words they are sending us four times as many fake referrals as real ones!!
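If you want to work out the same split for your own site, something along these lines should do it. Treat it as a sketch: it assumes that any Live Search referer ending in FORM=LVSP or FORM=LIVSOP is fake and everything else from search.live.com is real, and the sample_split_log file, its entries (including the FORM=EXAMPLE code) and the function name live_referral_split are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample entries (illustration only).
cat > sample_split_log <<'EOF'
65.55.165.20 - - [10/Apr/2007:09:00:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/results.aspx?q=button&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
65.55.165.21 - - [10/Apr/2007:09:01:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/results.aspx?q=cookies&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
65.55.165.22 - - [10/Apr/2007:09:02:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/results.aspx?q=server&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
65.55.165.23 - - [10/Apr/2007:09:03:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/results.aspx?q=random&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
192.0.2.10 - - [10/Apr/2007:09:04:00 +0000] "GET /css/ HTTP/1.1" 200 4321 "http://search.live.com/results.aspx?q=css+rounded+corners&FORM=EXAMPLE" "Mozilla/5.0"
192.0.2.11 - - [10/Apr/2007:09:05:00 +0000] "GET / HTTP/1.1" 200 1234 "http://www.google.com/search?q=apache+log+format" "Mozilla/5.0"
EOF

# Report how many Live Search referrals carry one of the fake FORM codes.
# $1 is the path to a combined log file.
live_referral_split() {
    awk -F'"' '$4 ~ /search\.live\.com/ {
        total++
        if ($4 ~ /FORM=(LVSP|LIVSOP)$/) fake++
    }
    END {
        if (total > 0)
            printf "%d of %d Live Search referrals look fake (%d%%)\n", fake, total, 100 * fake / total
    }' "$1"
}

live_referral_split sample_split_log
```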
Here are the kinds of search terms we're talking about:
agents, array, border, browser, browsers, button, class, client, codes, collapse, colours, combined, command, content, cookies, credit, definition, download, email, example, examples, function, green, input, javascript, media, number, october, parent, password, random, report, request, rewrite, robots, search, september, server, shell, startdate, system, using, validator, value, warning, world
If you do want to block these referrals without blocking all traffic from the same network, you can use mod_rewrite as follows:
RewriteCond %{REMOTE_ADDR} ^65\.55\.165
RewriteCond %{HTTP_REFERER} FORM=LIVSOP$
RewriteRule .* - [F]
Translation:
- IF the IP address starts with 65.55.165;
- AND the referer ends with FORM=LIVSOP;
- THEN refuse access to the requested resource.
To use this code you will need to be able to edit the httpd.conf or .htaccess file for your website and have mod_rewrite enabled.
This will not stop the referrals from showing up in your log files (they will appear as 403 - Forbidden), but it will prevent the agent from loading related files (CSS, JavaScript, MP3, etc.). That solves the problem of this agent spoiling your AdSense statistics, which is a common complaint from webmasters.
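Once the rule is in place, one way to confirm it is working is to count how many LIVSOP requests were answered with a 403. Again this is only a sketch, assuming the combined log format where the status code follows the quoted request line; the sample_blocked_log file, its entries and the function name blocked_livsop_count are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample entries (illustration only). The third entry comes
# from an address outside 65.55.165.* so the rule would not apply to it.
cat > sample_blocked_log <<'EOF'
65.55.165.12 - - [11/Apr/2007:09:00:00 +0000] "GET / HTTP/1.1" 403 199 "http://search.live.com/results.aspx?q=cookies&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
65.55.165.40 - - [11/Apr/2007:09:01:00 +0000] "GET /css/ HTTP/1.1" 403 199 "http://search.live.com/results.aspx?q=border&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
198.51.100.7 - - [11/Apr/2007:09:02:00 +0000] "GET / HTTP/1.1" 200 1234 "http://search.live.com/results.aspx?q=server&mrt=en-us&FORM=LIVSOP" "Mozilla/4.0"
192.0.2.10 - - [11/Apr/2007:09:03:00 +0000] "GET / HTTP/1.1" 200 1234 "http://www.google.com/search?q=css+borders" "Mozilla/5.0"
EOF

# Count LIVSOP requests that received a 403 response.
# $1 is the path to a combined log file.
blocked_livsop_count() {
    awk -F'"' '$4 ~ /FORM=LIVSOP$/ {
        split($3, resp, " ")      # resp[1] is the HTTP status code
        if (resp[1] == 403) blocked++
    }
    END { print blocked + 0 }' "$1"
}

blocked_livsop_count sample_blocked_log
```

LIVSOP entries that still show a 200 status are worth a closer look, as they probably come from an address the rule doesn't cover yet.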
More Pharmaceutical Referrals
It seems that I spoke too soon regarding the ending of sexually explicit and pharmaceutical search terms from Live Search. There are still a few coming from a different IP block - namely the addresses 131.107.0.95 and 131.107.0.96.
From those addresses (also inside Microsoft Corporation) we're seeing search terms including:
adult, codeine, diflucan, nextel, nokia, phendimetrazine, sex
So if you want to block them as well the modified code is as follows:
RewriteCond %{REMOTE_ADDR} ^65\.55\.165 [OR]
RewriteCond %{REMOTE_ADDR} ^131\.107\.0\.9[56]
RewriteCond %{HTTP_REFERER} FORM=LIVSOP$
RewriteRule .* - [F]
Translation:
- IF the IP address starts with 65.55.165;
- OR the IP address is 131.107.0.95 or 131.107.0.96;
- AND the referer ends with FORM=LIVSOP;
- THEN refuse access to the requested resource.
New IPs to Block
Today the LIVSOP hits have started coming from a new IP block, so let's update the filter:
# block spurious referrals from microsoft LIVSOP
RewriteCond %{REMOTE_ADDR} ^65\.55\.165 [OR]
RewriteCond %{REMOTE_ADDR} ^65\.55\.232 [OR]
RewriteCond %{REMOTE_ADDR} ^131\.107\.0\.9[56]
RewriteCond %{HTTP_REFERER} FORM=LIVSOP$
RewriteRule .* - [F]
As much as I'd like our sites to appear in the search results for single generic keywords, it's really not feasible. Here are just some of the search terms for which we're seeing dud referrals from Microsoft:
about, achievement, amsterdam, apology, argenton, august, australia, backpacker, bridgewater, canberra, chicken, corowa, council, courtesy, darwin, detention, einasleigh, emerald, facials, fields, foster, functions, geographic, government, guantanamo, hicks, hotel, information, judicial, justice, kylie, ladies, legal, melbourne, military, motel, north, northern, nullarbor, photo, prahran, railway, region, restaurant, rugby, search, semester, sentencing, simulation, society, south, spatial, submissions, sydney, systems, taxation, terrorism, title, wollongong, zealand
What do Microsoft have to say about this?
"We have now optimized the tool to use only keywords that are relevant to your website"
What a joke. How is the word 'about' or 'region' relevant to any website?!? Sure they're in a tight spot playing catch-up with Google, but that's no excuse for spamming our websites.
Follow the links below for more in-depth reporting of the problem, including (finally) some response from Microsoft.
A rose by any other name ... QBHP
They just don't give up, do they? It's June 2008 and suddenly we're getting Microsoft referer spam using the code QBHP instead of LIVSOP. Maybe too many people were blocking the old name?
They're using a new range of IP addresses:
- 65.55.109.97
- 65.55.109.82
- 65.55.110.43
- 65.55.110.51
- 65.55.110.111
- 65.55.110.115
- 65.55.110.206
AND they've discovered lower-case so FORM is now form:
http://search.live.com/results.aspx?q=search&form=QBHP
So here are the new mod_rewrite instructions to block them:
# block spurious referrals from microsoft LIVSOP or QBHP
RewriteCond %{REMOTE_ADDR} ^65\.55\.(109|110|165|232) [OR]
RewriteCond %{REMOTE_ADDR} ^131\.107\.0\.9[56]
RewriteCond %{HTTP_REFERER} FORM=(LIVSOP|QBHP)$ [NC]
RewriteRule .* - [F]
Note: The [NC] after the final RewriteCond indicates that the match is case-insensitive.
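You can get a feel for what the referer pattern will and won't catch by trying it out with grep before touching the server config. This is only an approximation: mod_rewrite uses a different regex engine from grep -E (though a pattern this simple should behave the same way in both), the IP address conditions are not modelled, and the function name matches_form_rule and the third sample referer are made up for the test.

```shell
#!/bin/sh
# Succeeds if the referer in $1 ends with FORM=LIVSOP or FORM=QBHP,
# ignoring case (the -i flag plays the role of [NC]).
matches_form_rule() {
    printf '%s\n' "$1" | grep -Eiq 'FORM=(LIVSOP|QBHP)$'
}

for referer in \
    'http://search.live.com/results.aspx?q=search&form=QBHP' \
    'http://search.live.com/results.aspx?q=search&FORM=LIVSOP' \
    'http://search.live.com/results.aspx?q=search&form=OTHER'
do
    if matches_form_rule "$referer"; then
        echo "blocked: $referer"
    else
        echo "allowed: $referer"
    fi
done
```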
There is (was) a vague reference to QBHP on the Microsoft Privacy website. Something about being able to do searches and go to a site without passing the search string. WTF?!? Is that in case you searched for your own credit card number or some other private/personal information?!? Or because you're too stupid to copy and paste or re-type the link yourself? Sheesh.
Dear Microsoft, Just because you can't build an operating system that protects your users, and can't assume that they have even basic common sense, please don't mess with the web developers and webmasters who rely on log analysis every day to build good websites. Just don't!
They just won't stop!
I guess I shouldn't be surprised that Microsoft have now introduced yet another format for their log spamming user agent, but I am. This time they've stripped off all the extra parameters and are just passing the query string - to make it hard to identify I guess.
http://search.live.com/results.aspx?q=photo
http://search.live.com/results.aspx?q=domain
http://search.live.com/results.aspx?q=public
http://search.live.com/results.aspx?q=minister
http://search.live.com/results.aspx?q=child
http://search.live.com/results.aspx?q=council
http://search.live.com/results.aspx?q=stomp
http://search.live.com/results.aspx?q=refugee
http://search.live.com/results.aspx?q=david
http://search.live.com/results.aspx?q=august
http://search.live.com/results.aspx?q=simulation
http://search.live.com/results.aspx?q=party
This traffic started on 13 August 2008 and has been seen on our server coming from the following IP addresses:
- 65.55.232.35
- 65.55.232.36
- 65.55.232.37
- 65.55.232.38
- 65.55.232.39
- 65.55.232.40
- 65.55.232.41
- 65.55.232.43
- 65.55.232.44
- 65.55.232.45
- 65.55.232.48
- 65.55.232.49
Again, this can only be classified as log spam and it's unbelievable that an organisation as large as Microsoft would be so stupid as to think it's ok to spam the millions of websites in their index - on whatever pretext.
The robot follows msnbot as before, but uses the user agent Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322). It 'pretends' to have done a search to find your website and also downloads .js and .css files - including those from external sites and of course without caching. It doesn't appear to check robots.txt on those sites.
We spotted it after search traffic from Live to one of our sites shot up again from a typical 1-2 search referrals per day to closer to 20. Why do you think they want to inflate your search numbers from Live Search?
I can already hear you asking, how do we block them now?
# more spurious referrals from microsoft
RewriteCond %{REMOTE_ADDR} ^65\.55\.232
RewriteCond %{HTTP_REFERER} search\.live\.com
RewriteCond %{HTTP_REFERER} !\&
RewriteRule .* - [F]
At the time of writing the 'fake' traffic in this format has slowed down and we haven't had any hits for a couple of hours, but we're monitoring the logs to be sure...
Yep, they're still at it, but now being blocked (403) by our new rewrite rule.
These hits are now coming from other IP blocks so the blocking script needs to be changed as follows:
# more spurious referrals from microsoft
RewriteCond %{REMOTE_ADDR} ^65\.55\.(109|110|165|232)
RewriteCond %{HTTP_REFERER} search\.live\.com
RewriteCond %{HTTP_REFERER} !\&
RewriteRule .* - [F]
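Because the third condition is negated it's easy to misread, so here is the same logic spelled out as a shell function. It's a sketch for reasoning about the rule, not a substitute for testing on the server: the function name would_block and the sample addresses and referers in the loop are assumptions, and grep stands in for mod_rewrite's own regex matching.

```shell
#!/bin/sh
# Succeeds if a request from address $1 with referer $2 would satisfy
# all three conditions of the rule above.
would_block() {
    ip=$1
    referer=$2
    # Condition 1: the address is in one of the listed 65.55.x blocks.
    printf '%s\n' "$ip" | grep -Eq '^65\.55\.(109|110|165|232)' || return 1
    # Condition 2: the referer claims to come from Live Search.
    printf '%s\n' "$referer" | grep -Eq 'search\.live\.com' || return 1
    # Condition 3 is negated: block only when there is NO ampersand,
    # i.e. the referer carries nothing but the q= parameter.
    if printf '%s\n' "$referer" | grep -q '&'; then return 1; fi
    return 0
}

while read -r ip referer; do
    if would_block "$ip" "$referer"; then
        echo "403 $ip $referer"
    else
        echo "ok  $ip $referer"
    fi
done <<'EOF'
65.55.232.35 http://search.live.com/results.aspx?q=photo
65.55.232.35 http://search.live.com/results.aspx?q=photo&FORM=EXAMPLE
192.0.2.10 http://search.live.com/results.aspx?q=photo
EOF
```

Note the trade-off: a genuine Live Search referral that happens to carry only the q= parameter would also be refused if it came from one of these address blocks, which is why the address condition matters.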
The spamming campaign has been expanded yet again and now includes addresses in the range 65.55.107.* in addition to those listed above.
References
- Microsoft is lying and intentionally screwing up your log files
- Microsoft Search Live - Strange Referrer Activity
- Possible Bot or Spammer?
- Live.com Sexually Explicit Search Terms
- Odd stats from "microsoft.com"
- Live Search Webmaster Center Blog : Live Search and Cloaking Detection
School Games 10 March, 2011
I had the same problem but I didn't go into such details; just like Penguin Pete here, I blocked the traffic from them. It is good to know that I'm not the only one. At some point I was happy to see hits from them but I realized that they were no good.
Penguin Pete 20 May, 2009
Most excellent write-up! I run a Linux-centric site anyway, so when I was dealing with this, I just plain blocked all traffic with a live.com in it and I'm done in one step.
MSN search never sent me a single actual hit anyway. My conclusion is they're doing this on purpose to push their search engine, at which point I'm treating them like any other spammer. Lord knows, they have the same ethics as one.
John K 17 June, 2008
I hate MS just as much as the rest but give them credit. They appear to be looking for a better way to search and rank web content. (that is different than google) If that messes up our "canned" web stat programs, then we have to change with the times. To stay on top, we as web page content providers and web admins have to adjust. Way back when Google sent out bots to understand the web, if we all blocked them then we would be out in the cold now. Re-write your stats software, we just did; or risk loss of traffic in the future. Monitor, adjust repeat; that's the art of the web.
Sorry, but I can't give them credit for what they haven't done. And I don't think sending referer spam, which is what they're doing, is going to get them any closer to having a half-decent search engine.
As for 'loss of traffic', this website receives 98% of its search traffic from Google, 1% from Yahoo! and just 0.1% from Live Search. That's close to 1,000 searches a day from Google compared to 1 or 2 from Live Search, and those ratios are repeated on most of the hundreds of other websites we're hosting.
We're not blocking the msnbot spider, and not suggesting that people do that, just the log spamming component as it's absolutely ridiculous and pointless what they're doing. And if we can detect and block it so effectively, then it's going to be useless in combatting spammers - if that was ever the intention.
Jeff Walker 12 June, 2008
I just wanted to thank you for your comprehensive documentation of this issue. I too have been plagued by these Microsoft bots across several websites; they are particularly annoying on the e-commerce websites.
I have watched and studied this new "QBHP" bot on one of my websites. First the regular msnbot turns up and crawls a page, then the "QBHP" bot turns up and crawls the same page. The time difference between the two visits is random, anything from a second to a few minutes, but whenever I see msnbot on a page, I know that "QBHP" will not be far behind it.
Microsoft hints that the "QBHP" bot is something to do with its IE phishing filter, but the pattern I have observed doesn't appear to me to tally with this explanation. It is definitely linked to msnbot.
That's the same pattern we're seeing on all our sites. First "msnbot/1.1" makes a request, with no referer string, and apparently with a valid If-Modified-Since request-header as they receive a 304 response. That's followed by another request with a fake referer (QBHP) and the user agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)", which we block (403) using the above rewrite rule.
I don't bother myself any more over what they're trying to achieve as they've clearly lost the plot somewhere