mod_rewrite: Search Engine Friendly URLs
The evolution of the World Wide Web (WWW) has seen URLs go from a very simple format to ever more complicated ones with the advent of database-driven sites, and now back to simple. We look at how best to achieve search engine friendly URLs using mod_rewrite.
Background
In the early days of the WWW, websites were built by hand with each page saved as an individual HTML file. This allowed them to be easily indexed, copied and moved around as each element of the site was a unique file.
The next development was CGI scripts to serve dynamic content. This led to URIs such as:
http://www.example.net/cgi-bin/page.cgi
http://www.example.net/cgi-bin/script.pl?param1=val1&param2=val2
The addresses could be cleaned up a bit, with the GET parameters being incorporated into the URI:
http://www.example.net/cgi-bin/script/param1/param2
Then along came various scripting languages that allowed pages to be dynamic without having to sit in a specific directory:
http://www.example.net/script.php?param1=val1&param2=val2
http://www.example.net/script/param1/param2
The second example would normally rely on Content Negotiation (MultiViews in Apache) to recognise the address as a call to the script, with the remainder of the address (/param1/param2) accessible to the script as the PATH_INFO environment variable.
It was around this time that search engines such as Google emerged and suddenly everything had to be 'indexable' by search engine spiders with names such as Googlebot and Slurp (and now msnbot). The 'holy grail' was to make the site look as if it was made of hand-crafted HTML files. In other words, we've come full circle.
The examples below describe the process of achieving this goal of simple URLs for complex dynamic content.
Search Engine Friendly URLs
A typical component of a website is a 'latest news' database. Individual news items would be accessed as:
http://www.example.net/news.php?id=20050701
(a call to the news.php script passing a single GET parameter which identifies which item to display)
Now we introduce the RewriteRule:
RewriteRule ^news/([0-9]+) /news.php?id=$1
Translation:
- IF the request starts with news/ followed by one or more digits;
- THEN call the news.php script with the id parameter set to those digits.
The URI then becomes:
http://www.example.net/news/20050701
Or, because the regular expression has no closing anchor ($) and so ignores anything after the digits, we can also use:
http://www.example.net/news/20050701.html
There is a slight problem here in that, if Content Negotiation is enabled, this URI could be taken as a call to the news.php script with the rest of the URL (/20050701.html) being unused. This is because there is no file called /news/20050701.html, and no file called /news, but /news.php does exist and Content Negotiation is all about finding that out. In that situation the correct news item won't be displayed.
The solution? We could re-name the script, change the format of the URI, or turn off Content Negotiation (which you might want to do in any case), but there's a simpler option:
RewriteRule ^news/([0-9]+) /scripts/news.php?id=$1
By moving the script into a sub-directory, which we can do now because it's no longer called 'in place', we avoid any chance of conflict. The /scripts/ directory can, and should, now be secured to avoid direct access.
One of the major benefits of using rewrite rules and 'hiding' the script is that ONLY requests matching the regular expression can access it. In this case it's not possible for someone to pass a non-numeric parameter, and any attempt would result, rightly, in a 404 Not Found response.
An even better rule in this case could be:
RewriteRule ^news/([0-9]{8}) /scripts/news.php?id=$1
or, if you want to be really strict and enforce the .html extension:
RewriteRule ^news/([0-9]{8})\.html$ /scripts/news.php?id=$1
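To see how the three versions of the rule differ in strictness, here is a rough sketch in Python. It only imitates the pattern matching step, assuming the path is tested without its leading slash (as in an .htaccess file) and that Python's re module behaves like Apache's regular expressions for these simple patterns. The names RULES and matching_rules are made up for the illustration.

```python
import re

# The three patterns from the rules above, from loosest to strictest.
# The labels are hypothetical and exist only for this sketch.
RULES = {
    "open": re.compile(r"^news/([0-9]+)"),
    "eight": re.compile(r"^news/([0-9]{8})"),
    "strict": re.compile(r"^news/([0-9]{8})\.html$"),
}

def matching_rules(path):
    """Return the labels of the patterns that accept this request path."""
    return [name for name, pattern in RULES.items() if pattern.search(path)]

for path in ("news/2005", "news/20050701", "news/20050701.html",
             "news/200507011.html", "news/latest"):
    print(path, matching_rules(path))
```

Note that the middle rule still accepts a nine-digit value such as news/200507011.html, because nothing anchors the end of the pattern; only the first eight digits would be passed on as the id. The strictest rule rejects it.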
Next we look at what to do if other sites or search engines are already linking to your dynamic pages.
Converting Dynamic to Search Engine Friendly URLs
If you follow the example in the previous section, you might see a lot of 404s in your logs because addresses that used to work have now been deprecated.
Wouldn't it be great if we could set up a PURL (persistent URL) to handle this?
It's actually quite simple:
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^news\.php /news/%1.html? [R=301,L]
Note: regular expression matches from a RewriteCond are referenced using % whereas those in a RewriteRule are referenced using $.
Translation:
- IF the query string starts with id=(one or more digits)
- AND the request is for /news.php
- THEN redirect (301) to the search-engine-friendly URL with no query string
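The logic of the condition and rule can be sketched in Python as follows. This is a simplified model rather than what Apache does internally: it assumes the path arrives without its leading slash, returns only the path part of the redirect target (a real Location header would carry the full URL), and uses a made-up function name, legacy_redirect.

```python
import re

PATH_RULE = re.compile(r"^news\.php")      # the RewriteRule pattern
QUERY_COND = re.compile(r"^id=([0-9]+)")   # the RewriteCond pattern

def legacy_redirect(path, query_string):
    """Return (status, location) for an old-style request, or None."""
    # mod_rewrite tests the rule pattern first, then its conditions.
    if PATH_RULE.search(path) is None:
        return None
    cond = QUERY_COND.search(query_string)
    if cond is None:
        return None
    # %1 is the first group captured by the condition. The trailing "?"
    # in the rule discards the original query string.
    return (301, "/news/" + cond.group(1) + ".html")

print(legacy_redirect("news.php", "id=20050701"))
print(legacy_redirect("news.php", "id=20050701&ref=rss"))
print(legacy_redirect("scripts/news.php", "id=20050701"))
```

The first two calls give the same redirect target, because everything after the id is dropped. The third returns None: the relocated script does not match ^news\.php, so rewritten requests are not redirected again.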
The reason for having a 301 Permanent redirect is that search engines such as Google will take that to mean that the previously indexed page now exists at the new location, and pass on any PageRank accumulated at the old address.
You also don't want there to be multiple ways to access the same content as that can trigger a duplicate content penalty with the search engines.
If you change your URL structure more than once over time, you might end up with a chain of 301 redirects leading from the oldest to the newest format, so it's a good idea to map everything out on paper or on a development server before going live.
More mod_rewrite Examples
By popular demand, here are some more advanced examples to help you on your way.
RewriteRule with two parameters
RewriteRule ^([A-Z]{2})/([\+a-z\ ]+)\.html$ /scripts/showtown.php?state=$1&town=$2
This allows you to use URLs such as /CA/san+francisco.html or /IL/chicago.html which will call the showtown.php script with GET parameters for $state and $town. If you want it to be case-insensitive just add an [NC] flag at the end.
Again, the beauty of mod_rewrite is that people can't insert random or malicious values into the GET parameters as they will not match the regular expression and the script won't be called.
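As a quick check of which addresses the two-parameter pattern lets through, here is a small Python sketch. It assumes the path is matched without its leading slash and uses re.IGNORECASE as a stand-in for the [NC] flag. The helper name showtown_params is invented for the example.

```python
import re

TOWN_RULE = re.compile(r"^([A-Z]{2})/([\+a-z\ ]+)\.html$")
# Stand-in for the same rule with the [NC] flag added.
TOWN_RULE_NC = re.compile(TOWN_RULE.pattern, re.IGNORECASE)

def showtown_params(path, pattern=TOWN_RULE):
    """Return the values that would be passed to the script, or None."""
    match = pattern.search(path)
    if match is None:
        return None
    return {"state": match.group(1), "town": match.group(2)}

print(showtown_params("CA/san+francisco.html"))
print(showtown_params("il/chicago.html"))                # None without NC
print(showtown_params("il/chicago.html", TOWN_RULE_NC))
print(showtown_params("CA/../etc/passwd.html"))          # None: "." and "/" are not allowed
```

Anything outside the allowed characters, such as dots, slashes or digits in the town name, simply fails to match, so the script is never reached with those values.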
Having the article or blog title in the URL
I've had two separate queries recently from people wanting to know how they can have search engine friendly URLs that include the blog title. For example, if we wanted to use URLs something like:
http://www.the-art-of-web.com/system/search-engine-friendly-urls.html
There are two approaches to this. The first would be to store for each article or blog entry a unique string that matches (more or less) the title. You can't generally use the title itself as there are many characters that are not valid or need to be encoded in a URL so you end up with a bit of a mess.
In this case the rewrite rule is the same as if you were just using a database id, just with letters instead of numbers:
RewriteRule ^(images|scripts)/ - [L]
RewriteRule ^([a-z]+)/([-a-z]+)\.html /scripts/showarticle.php?section=$1&article=$2
You'll notice that we've introduced a bit more complexity here. The first group in the second RewriteRule will match any lower-case directory name, including real ones, which can cause serious problems. To get around that, the first RewriteRule catches requests for /images/ or /scripts/, leaves them unchanged (the - target) and stops further rule processing (the [L] flag).
Now the showarticle.php script just needs to do a database lookup based on the value of $section and $article and display the relevant content.
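The unique string stored for each article could be derived from its title along these lines. This is only one possible approach, written in Python for illustration; the function name make_slug is an assumption, and the output is limited to lower-case letters and hyphens so that it fits the [-a-z]+ part of the rule above.

```python
import re

def make_slug(title):
    """Reduce a title to lower-case letters separated by single hyphens."""
    slug = title.lower()
    # Every run of characters other than a-z becomes one hyphen.
    slug = re.sub(r"[^a-z]+", "-", slug)
    # Drop hyphens left over at either end.
    return slug.strip("-")

print(make_slug("Search Engine Friendly URLs"))
print(make_slug("mod_rewrite: 10 Tips & Tricks!"))
```

Digits and accented letters are discarded by this version, and two different titles can produce the same string (or an empty one), so the result would still need to be checked for uniqueness before being saved.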
If you think that all sounds a bit complicated, I agree, and there's a simple workaround that many of the major sites are now using. The trick is to use just an id value to reference the articles, but then allow for more text to be added to the URL.
RewriteRule ^article/([0-9]+).*\.html /scripts/showarticle.php?id=$1
This means that after the id number you can fill in the rest of the URL however you want. So any of the following could be used to reference an article with an id value of 12:
/article/12.html
/article/12/search-engine-friendly-urls.html
/article/12/search-engine-friendly-urls-are-really-cool.html
Remember that it's important to use only one address for a given page, or at least that only one is delivered to search engines to avoid Duplicate Content penalties.
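A short Python sketch confirms that all three addresses resolve to the same id under the rule above. It assumes the path is matched without its leading slash; article_id is a hypothetical helper name.

```python
import re

ARTICLE_RULE = re.compile(r"^article/([0-9]+).*\.html")

def article_id(path):
    """Return the id the rule would pass to the script, or None."""
    match = ARTICLE_RULE.search(path)
    return match.group(1) if match else None

for path in ("article/12.html",
             "article/12/search-engine-friendly-urls.html",
             "article/12/search-engine-friendly-urls-are-really-cool.html"):
    print(path, "->", article_id(path))   # each prints id 12
```

Since any text between the id and .html is accepted, the script itself has no way of knowing from the id alone which variant was requested.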
Common Mistakes
Trying to match a query string in the RewriteRule
The query string is never visible to the rewrite rule - it only sees the address portion of the request. As shown above you need to use a RewriteCond on %{QUERY_STRING} before your RewriteRule.
Appending QUERY_STRING to the rewrite target
This probably seemed like a great solution at the time:
RewriteRule ^books/([0-9]+)\.html /scripts/book.html?id=$1&%{QUERY_STRING}
but you're much better off using the built-in QSA flag:
RewriteRule ^books/([0-9]+)\.html /scripts/book.html?id=$1 [QSA]
Note: QSA stands for Query String Append.
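One visible difference between the two approaches shows up when the request has no query string at all. The Python sketch below is a simplified model of the two behaviours, not Apache's own code; both function names are made up.

```python
def naive_append(target, original_query):
    """Mimic a target ending in &%{QUERY_STRING}: always adds the '&'."""
    return target + "&" + original_query

def qsa_append(target, original_query):
    """Mimic [QSA]: append the original query only if there is one."""
    if not original_query:
        return target
    return target + "&" + original_query

target = "/scripts/book.html?id=7"
print(naive_append(target, "page=2"))  # /scripts/book.html?id=7&page=2
print(qsa_append(target, "page=2"))    # /scripts/book.html?id=7&page=2
print(naive_append(target, ""))        # /scripts/book.html?id=7&
print(qsa_append(target, ""))          # /scripts/book.html?id=7
```

With the hand-rolled version a request without a query string ends up with a dangling & on the rewritten address.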
Trying to match or redirect to page anchors
Page anchors - addresses ending in #anchor - are handled entirely on the client-side and never passed to or from the server. This means they cannot be used in rewrite rules or conditions. This makes sense if you think about it as you can go from one anchor to another in your browser without the page reloading.
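To illustrate, the sketch below uses Python's standard urllib.parse module to split an address the way a browser does before sending a request. The function name request_target is invented for the example.

```python
from urllib.parse import urlsplit

def request_target(url):
    """Split a URL into what is sent to the server and what stays behind."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The fragment is kept by the browser and never sent.
    return target, parts.fragment

print(request_target("http://www.example.net/news/20050701.html#comments"))
# ('/news/20050701.html', 'comments')
```

Only the first value ever reaches the server, so that is all mod_rewrite has to work with.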
Missing images and style-sheets
After a rewrite such as:
RewriteRule ^books/([0-9]+)\.html /showbook.php?id=$1
relative paths to images will no longer work. That's because as far as the web browser knows, you're currently in a directory called /books/. The fix is to use absolute paths when referencing images and other resources.
Some typical examples:
<link rel="stylesheet" type="text/css" href="/style.css">
<a href="/index.html">Homepage</a>
<img src="/images/cover.jpg" alt="">
The most commonly affected items are images, style sheets, Flash and Java files and internal links.
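You can see how a browser resolves relative and absolute paths from a rewritten address using urljoin from Python's standard library. The page address and file names below are examples only.

```python
from urllib.parse import urljoin

# The address the browser believes it is viewing.
PAGE = "http://www.example.net/books/12.html"

def resolve(reference):
    """Resolve a link or image reference against the current page."""
    return urljoin(PAGE, reference)

print(resolve("images/cover.jpg"))   # http://www.example.net/books/images/cover.jpg
print(resolve("/images/cover.jpg"))  # http://www.example.net/images/cover.jpg
print(resolve("style.css"))          # http://www.example.net/books/style.css
```

The relative references end up under /books/, which only exists as a rewritten address, while the reference starting with a slash still points at the real location.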
Daevid Vincent 20 July, 2011
I've searched for HOURS today trying to figure out how to get my search FORM results to be RESTful and finally found this Godsend of a page with a perfect example of exactly what I needed in #3.
GoldenGnu 8 February, 2011
Thank you for this! It's made of gold...
Fixed my little problem, without much work
SOLUTION:
Change: http://example.net/?page=pagename to: http://example.net/pagename
Use:
RewriteCond %{QUERY_STRING} ^page=(.+)
RewriteRule ^$ /%1? [R=301,L]
reichard 21 April, 2007
Thanks for the [QSA] tip on modrewrite. I had spent hours today trying to figure out why my htaccess wouldn't work.
Many thanks!