System: Apache mod_pagespeed issues
This article follows on from our discussion of mod_pagespeed settings where we talk about the installation process and introduce the various core and non-core filters and other features. In this article we're going to focus more on the impact of mod_pagespeed and its evolution over time.
Since its release in late 2010 there have been a number of bug fixes and enhancements. As an interested party, I have been active in reporting bugs as they arise and have also submitted one or two feature requests which may bear fruit and make our lives easier. Some of these are discussed below.
Using GET parameters
This is not well documented, but to disable mod_pagespeed for a single page load you can add the query parameter ModPagespeed=off to the URL. This disables all configuration settings and shows how the page would appear if mod_pagespeed were not installed.
You can also specify exactly which filters to enable by passing the query parameter ModPagespeedFilters=list,of,filters. Only filters included in the list will be active; the others are disabled.
These GET parameters don't have to be visible in the Location bar, but can also be added using mod_rewrite rules. There are also now options for .htaccess to override filter settings.
All the filters should be specifiable with these query parameters, but not the other settings that you would put into pagespeed.conf or .htaccess.
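As a sketch of the mod_rewrite approach mentioned above (untested; adjust the conditions to suit your setup), the following would serve unoptimised pages to a hypothetical testing user agent without the parameter appearing in the location bar:

# Hypothetical example: add ModPagespeed=off for a particular user agent.
# The QUERY_STRING check prevents the rule from matching repeatedly.
RewriteEngine On
RewriteCond %{QUERY_STRING} !ModPagespeed=
RewriteCond %{HTTP_USER_AGENT} ExampleTester [NC]
RewriteRule .* %{REQUEST_URI}?ModPagespeed=off [QSA,PT]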
Speed testing mod_pagespeed
As described above, for any page the filtering can be disabled by passing a GET parameter ModPagespeed=off. Using this you can feed your pages both with and without mod_pagespeed enabled to your favourite speed checker to make sure it is having a positive effect.
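Before reaching for the online tools, a quick comparison can be made from the command line; curl will report the transfer size and total time for each variant (the URL here is a placeholder):

# Page with mod_pagespeed filters applied
curl -so /dev/null -w "%{size_download} bytes in %{time_total}s\n" "http://www.example.com/"
# The same page with filtering disabled
curl -so /dev/null -w "%{size_download} bytes in %{time_total}s\n" "http://www.example.com/?ModPagespeed=off"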
Some of the (free) tools we like to use for this include GTmetrix, REDbot and the Cacheability Query. This kind of third-party tool is generally more useful than collecting statistics internally.
A member of the Google mod_pagespeed team has also recommended WebPageTest for A/B testing. Use a high repeat count (10 is the maximum).
Feature requests
Control mod_pagespeed using environment variables (Issue 201)
A lot of problems could be avoided if we were able to customise which filters are applied according to the User-Agent or other environment variables.
For example, we could enable remove_comments for most users, but disable it for certain spiders that read instructions in HTML comments.
Or we could disable extend_cache for certain users so they can more easily see new images and CSS changes.
A model to use would be the Special Purpose Environment Variables already available for mod_deflate and other Apache modules:
SetEnvIfNoCase Request_URI \.js$ no-gzip dont-vary
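Purely as an illustration of the feature being requested (the ModPagespeedDisableFiltersIfEnv directive below is imaginary; mod_pagespeed has no such option at the time of writing), the spider example above might then be handled like this:

# Set an environment variable for a hypothetical comment-reading spider
BrowserMatchNoCase examplespider keep-comments
# IMAGINARY directive: skip a filter whenever the variable is set
ModPagespeedDisableFiltersIfEnv keep-comments remove_comments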
Data-ification of small background images
The rewrite_images filter will 'inline' small images in the HTML by using a data: string. As yet, this is not being done for CSS background images - neither in external CSS files nor when using inline styles.
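For illustration, the transformation being requested would turn a rule like the first one below into the second (base64 payload truncated):

/* Before: the bullet image costs a separate HTTP request */
.bullet { background: url(images/bullet.gif) no-repeat 0 50%; }

/* After (hypothetical output): the image data is embedded in the CSS */
.bullet { background: url(data:image/gif;base64,R0lGODlhCgAK...) no-repeat 0 50%; }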
Bug reports
Not recognising CSS3 syntax (Issue 108)
The following (real-world) CSS statements cause the parser to bail out, leaving all CSS in the same file or code block unminified:
Update: As of 0.9.16.9 many CSS3 styles are now recognised by the CSS parser. Of the examples below, only the border-radius settings using the / shorthand still fail.
<style>
.border8 {
  -moz-border-radius: 36px / 12px;
  border-radius: 36px / 12px;
}
fieldset {
  background: -webkit-gradient(linear, 0 80%, 0 100%, from(white), to(#eee));
  background: -moz-linear-gradient(top, white 80%, #eee);
}
input[type=submit] {
  background: #f7fafc -webkit-gradient(linear, left top, left bottom, from(#fff), to(#dae6f1));
  background: #f7fafc -moz-linear-gradient(top, #fff, #dae6f1);
}
input::-webkit-input-placeholder {
  color: #ababab;
}
</style>
Local control of filters (Issue 75)
It would be really nice, and almost essential in some cases, to be able to enable/disable some or all filters both on a per-vhost basis (without having to duplicate all the settings or reload the server) and/or based on the requested filename/directory.
This could probably best be implemented using the existing <Files>, <FilesMatch> or <Directory> directives in .htaccess. A number of people have requested this, so let's wait and see.
Update: This was added in 0.9.11.3 using wildcards in ModPagespeed config, <Directory> directives and/or .htaccess.
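For example, something along these lines should now be possible (directive names as per the mod_pagespeed documentation; check them against your installed version):

# Switch the module off completely for an admin area
<Directory /var/www/example.com/admin>
  ModPagespeed off
</Directory>

# Or, in that directory's .htaccess, drop individual filters
ModPagespeedDisableFilters remove_comments,extend_cache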
If-Modified-Since conditional requests (Issue 80)
The Extend Cache filter, which creates cached versions of various files, doesn't seem to support If-Modified-Since conditional requests. Instead of returning a 304 Not Modified response, the whole file is transferred again.
For example, results from Cacheability Query with mod_pagespeed disabled:
Request:        http://www.the-art-of-web.com/common.css
Expires:        4 weeks 2 days from now (Wed, 22 Dec 2010 14:53:19 GMT)
Cache-Control:  -
Last-Modified:  2 hr 9 min ago (Mon, 22 Nov 2010 12:43:31 GMT) validated
and with mod_pagespeed enabled:
Request:        http://www.the-art-of-web.com/ce.63990ae65645d7972bafaa151d9bd53f.common,s.css
Expires:        4 weeks 2 days from now (Wed, 22 Dec 2010 14:53:19 GMT)
Cache-Control:  max-age=31536000, public
Last-Modified:  46 min 20 sec ago (Mon Nov 22 14:07:09 2010 GMT) validation returned same object
So when a user agent makes an If-Modified-Since request, it receives the entire file rather than a 304 Not Modified response. That could increase rather than reduce bandwidth usage in some situations.
Also the Expires header can probably be removed from the rewritten file because if/when the file does change, the hash, and therefore the URL, will change as well.
Update: This was addressed in 0.9.11.3 with cached resources now returning 'Not Modified' in response to If-Modified-Since requests - which is valid because a change in the source file will result in a new URL.
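This is easy to verify from the command line with a conditional request against the rewritten resource from the tables above; with the fix in place it should answer 304:

curl -s -o /dev/null -w "%{http_code}\n" \
  -H "If-Modified-Since: Mon, 22 Nov 2010 12:43:31 GMT" \
  "http://www.the-art-of-web.com/ce.63990ae65645d7972bafaa151d9bd53f.common,s.css"
# Expect 304 rather than a 200 with the full file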
Logfile error for CSS files (Issue 113)
When a CSS file is encountered, the [error] message 'Failed to load resource ...' appears in the logs. This means either that the file needs another pass before it can be made cacheable, or that it contains CSS statements not recognised by the parser - the @import rule or some CSS3 syntax, for example.
Update: This has been fixed in 0.9.11.3.
facebookexternalhit 404 errors
Not specifically related to mod_pagespeed, but the facebookexternalhit spider is mangling image requests by replacing the "," characters in the URL with "%2C". This then generates a "Fetch failed for ..." [error] as mod_pagespeed tries to load a cached image.
Update: The cache filename format has changed in 0.9.11.3, which should fix this - once Facebook catches up.
Logfile warnings
Every single file request from the internal Serf/0.3.1 spider results in a [warn] message in the error log: 'Someone is already fetching ...'. This is regardless of the presence of any *.lock files.
...
[Tue Nov 23 07:39:22 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
[Tue Nov 23 08:43:36 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
[Tue Nov 23 09:50:05 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
[Tue Nov 23 09:50:06 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
[Tue Nov 23 09:50:06 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
[Tue Nov 23 10:55:35 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
[Tue Nov 23 11:58:12 2010] [warn] [mod_pagespeed 0.9.8.1-215] /var/mod_pagespeed/files/0b342e9914eba294481cc3ddc961c3b5.lock:0: Someone is already fetching http://www.the-art-of-web.com/images/head_02.jpg
...
Update: This has been fixed in 0.9.10.1.
Broken images and crashes
The Serf spider has an occasional fit and, for a few seconds, drops the last directory from requested URL file paths, resulting in 404 errors and mod_pagespeed.so [alert] messages.
The [alert] message seems to come from the line:
CHECK(string_size == storage_->size() - kStorageOverhead);
At least for images, an invalid src link is created, and each time it's requested a process is aborted and the server 'unexpectedly drops the connection'. This can continue for some hours, or until the cache files are deleted and the Apache configuration is reloaded.
Here is an extract from the access logs showing the problem starting around 1am and ending only around 4.30am for a particular file:
...
212.19.215.138 - - [22/Nov/2010:21:34:05 +0100] "GET /images/logo-transparent.png HTTP/1.1" 200 14747 "-" "Serf/0.3.1"
212.19.215.138 - - [22/Nov/2010:22:49:30 +0100] "GET /images/logo-transparent.png HTTP/1.1" 200 14747 "-" "Serf/0.3.1"
212.19.215.138 - - [22/Nov/2010:23:58:23 +0100] "GET /images/logo-transparent.png HTTP/1.1" 200 14747 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:00:59:08 +0100] "GET /logo-transparent.png HTTP/1.1" 404 2056 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:00:59:23 +0100] "GET /logo-transparent.png HTTP/1.1" 404 2056 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:00:59:24 +0100] "GET /logo-transparent.png HTTP/1.1" 404 2056 "-" "Serf/0.3.1"
...
212.19.215.138 - - [23/Nov/2010:04:28:08 +0100] "GET /logo-transparent.png HTTP/1.1" 404 2056 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:04:28:09 +0100] "GET /logo-transparent.png HTTP/1.1" 404 2056 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:04:31:17 +0100] "GET /logo-transparent.png HTTP/1.1" 404 2056 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:04:36:15 +0100] "GET /images/logo-transparent.png HTTP/1.1" 200 14747 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:05:37:32 +0100] "GET /images/logo-transparent.png HTTP/1.1" 200 14747 "-" "Serf/0.3.1"
212.19.215.138 - - [23/Nov/2010:06:37:32 +0100] "GET /images/logo-transparent.png HTTP/1.1" 200 14747 "-" "Serf/0.3.1"
...
Note that while the valid (200) requests take place only every hour or so (the variable ModPagespeedFileCacheCleanIntervalMs is set to 3600000, so the cache clears every hour), the invalid (404) requests can occur every few seconds.
Update: This has been fixed in 0.9.10.1.
SEO and other concerns
Having taken a number of steps to optimise our servers and websites before mod_pagespeed became available, it has been interesting to note the interaction between those settings and mod_pagespeed and how some of our practices have changed.
Page Last-Modified headers
When using PHP to serve content, it's common for pages to have no Last-Modified date attached to them. We had put in place a system to insert a Last-Modified header for news items and other date-dependent pages based on the publication date (stored in the database).
This meant that an article from 2007 would announce itself in the response headers as being last modified in 2007, allowing a cache to 'assign its own freshness lifetime':
Last-Modified: Wed, 12 Dec 2007 13:00:00 GMT
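In PHP, generating that header comes down to a couple of lines (a minimal sketch; the $published timestamp is assumed to come from the database):

<?php
// Announce the publication date as the page's last-modified time
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $published) . ' GMT');
?>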
Working on the assumption that search engine spiders are intelligent, this would allow them to fetch old articles less often and give priority to new content - saving our bandwidth and giving better search results.
With mod_pagespeed active, however, the Last-Modified header is removed completely, and replaced with the following:
X-Mod-Pagespeed: 0.9.16
Cache-Control: max-age=0, no-cache, no-store
There is a good reason why HTML content is made uncacheable. When any linked content (images, CSS or JavaScript) is changed, the generated URL also changes, requiring the HTML to be updated. If the HTML file is cached, it will keep pointing to outdated addresses.
But it's a shame to see all those custom Last-Modified timestamps being wasted, and if we could vary settings based on the user agent this is something we might change, at least for Googlebot.
Image Caching (mod_expires)
Using mod_expires we had set a number of file types to have far-future expiry dates to encourage caching. In our case that meant a month in the future for images as well as CSS and JavaScript files:
<IfModule mod_expires.c>
  ExpiresActive On
  ...
  ExpiresByType image/gif "access plus 1 month"
  ExpiresByType image/jpeg "access plus 1 month"
  ExpiresByType image/png "access plus 1 month"
  ...
</IfModule>
Now it turns out that while mod_pagespeed actually extends the expiration date even further, to one year in the future, the internal spider still relies on our Apache settings.
If the spider (Serf) doesn't check regularly for changes to the image file then there is no simple way for users to see updated images when updating content. Even a forced reload/refresh or emptying your browser cache won't work as it's still the mod_pagespeed cached image being referenced in the HTML.
Recognising this we have set the expiry time for images back to a reasonable period of 5-10 minutes. This allows Serf to check images more frequently and update the HTML with new generated links.
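The revised section of our mod_expires configuration now looks something like this (10 minutes shown; adjust to taste):

<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType image/gif "access plus 10 minutes"
  ExpiresByType image/jpeg "access plus 10 minutes"
  ExpiresByType image/png "access plus 10 minutes"
</IfModule>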
The drawback of this is that images not processed by mod_pagespeed (e.g. where ModPagespeed is disabled, where background images are specified using inline CSS, or where images are loaded using JavaScript) are being served with the shorter expiry time.
When content is being updated using a CMS it would be nice to be able to turn off mod_pagespeed for that user using a cookie or other $_SERVER variable.
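Until something like that is supported natively, the mod_rewrite trick from the GET parameters section could approximate it (a sketch, assuming the CMS sets a hypothetical 'editing' cookie at login):

# Disable mod_pagespeed whenever the (hypothetical) CMS editing cookie is present
RewriteCond %{HTTP_COOKIE} editing=1
RewriteCond %{QUERY_STRING} !ModPagespeed=
RewriteRule .* %{REQUEST_URI}?ModPagespeed=off [QSA,PT]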
Image Search Engines
Using rewrite_images the code seen by search spiders is now moz-border-radius.gif.pagespeed.ce.HASH.gif instead of just moz-border-radius.gif.
If you get traffic from image search engines, this might have some impact on your rankings. And when small images are 'inlined' they may no longer be searchable at all. This would be another incentive for disabling certain filters for search engine spiders.
Also for images the response headers have changed from (for example):
Last-Modified: Thu, 16 Apr 2009 08:52:19 GMT
Cache-Control: max-age=600
Expires: Sun, 20 Mar 2011 10:19:33 GMT
to:
Last-Modified: Sun, 20 Mar 2011 09:33:51 GMT
Cache-Control: max-age=31536000
Expires: Mon, 19 Mar 2012 09:33:51 GMT
For images, unlike HTML, you can see that there is still a Last-Modified value, but it's the timestamp (I think) of the cached version rather than the actual image file.
Combining CSS and JavaScript files
These features have superseded our efforts at combining, minifying and caching multiple included files on-the-fly. Now we just try to place all the CSS in the <HEAD> (though even that's not necessary with move_css_to_head) and JavaScript at the bottom of the page, and let mod_pagespeed do the rest.
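In pagespeed.conf this comes down to enabling the relevant filters (filter names as per the mod_pagespeed documentation; availability varies by version):

ModPagespeedEnableFilters combine_css,rewrite_css
ModPagespeedEnableFilters rewrite_javascript
ModPagespeedEnableFilters move_css_to_head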
What we haven't changed
For images, we are still stripping out extra data, optimising and resizing JPEGs during the upload process, and presenting them with a matching HTML width and height. This makes much of rewrite_images redundant, but it's the sensible thing to do.
We have also upgraded all Google Analytics code to Async and would not rely on a filter for that.
Conclusion
If you're running your own server, with your own content and want to try it out, go for it. Check which filters you want to enable and disable the ones you don't need. If you enable rewrite_javascript you will need to check that any locally hosted JavaScript files are still working.
A number of bugs and stability issues seem to have been addressed in the 0.9.10.1 update, making it less likely that there will be errors. New features and improvements were also added in 0.9.16.9.
On our hosted websites there is an average 2-3% gain in YSlow and 3-4% in Page Speed scores - mostly due to JavaScript minification, combining style sheets and inlining small images. Otherwise, our websites were already pretty well optimised. These scores could improve further once the parser recognises all CSS3 statements, as unrecognised syntax still prevents some CSS files from being minified.
Another big boost would come from the compression of background images referenced in inline styles, and the 'data-ification' of background images in both external CSS files and inline styles to reduce the number of requests.
In summary, it's important that you understand how each filter works, and under what conditions, to take full advantage. There are situations where mod_pagespeed can work against you, depending on your server capacity and how your websites are constructed. Just ask GoDaddy.
References
- Google Code: modpagespeed issues list
- mod_pagespeed — why so hasty just yet?
- Stack Overflow: mod_pagespeed magento
Joshua Marantz 19 March, 2011
At least one of the issues you've brought up, handling of CSS background images, has been addressed by 0.9.16.9. You specifically reference "data-ification", which we still don't do for CSS background images, but we do now at least optimize and cache-extend CSS background images.
There are numerous other features now supported.
And least obvious, but perhaps most important, we've made a lot of improvements in our underlying infrastructure to reduce the latency from mod_pagespeed and improve stability under heavy load, as well as compatibility with mod_speling, mod_include, and mod_rewrite.
Joshua Marantz 25 November, 2010
Thanks for the report!
The if-modified-since issue has been recorded as code.google.com/p/modpagespeed/issues/detail?id=112
Feel free to enter any other issues you find in that issue tracker as well. That will get our more direct attention.
Thanks. I have already submitted a couple of other issues there. Good to see you're being pro-active.