System: Analyzing Apache Log Files
There are many different packages that allow you to generate reports on who's visiting your site and what they're doing. The most popular at this time appear to be "Analog", "The Webalizer" and "AWStats" which are installed by default on many shared servers.
While such programs generate attractive reports, they only scratch the surface of what the log files can tell you. In this section we look at ways you can delve more deeply - focussing on the use of simple command line tools, particularly grep, awk and sed.
Combined log format
The following assumes an Apache HTTP Server combined log format where each entry in the log file contains the following information:
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
where:
%h = IP address of the client (remote host) which made the request
%l = RFC 1413 identity of the client
%u = userid of the person requesting the document
%t = Time that the request was received
%r = Request line from the client in double quotes
%>s = Status code that the server sends back to the client
%b = Size of the object returned to the client
The final two items, Referer and User-agent, give details on where the request originated and what type of agent made the request.
Sample log entries:
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
Note: The robots.txt file gives instructions to robots as to which parts of your site they are allowed to index. A request for / is a request for the default index page, normally index.html.
Using awk
The principal use of awk is to break up each line of a file into 'fields' or 'columns' using a pre-defined separator. Because each line of the log file is based on the standard format we can do many things quite easily.
Using the default separator which is any white-space (spaces or tabs) we get the following:
awk '{print $1}' combined_log # ip address (%h)
awk '{print $2}' combined_log # RFC 1413 identity (%l)
awk '{print $3}' combined_log # userid (%u)
awk '{print $4,$5}' combined_log # date/time (%t)
awk '{print $9}' combined_log # status code (%>s)
awk '{print $10}' combined_log # size (%b)
You might notice that we've missed out some items. To get to them we need to set the delimiter to the " character which changes the way the lines are 'exploded' and allows the following:
awk -F\" '{print $2}' combined_log # request line (%r)
awk -F\" '{print $4}' combined_log # referer
awk -F\" '{print $6}' combined_log # user agent
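To see both separators at work on a single entry, here is a minimal sketch that writes the first sample entry to a temporary file and pulls out fields each way. The temporary file and the variable names are only illustrative.

```shell
# Write one sample entry to a throwaway file (illustrative only).
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
EOF

# Default whitespace separator: client address and status code.
ip_status=$(awk '{print $1, $9}' "$log")

# Double-quote separator: request line and user agent.
req_agent=$(awk -F\" '{print $2 " | " $6}' "$log")

echo "$ip_status"
echo "$req_agent"
rm -f "$log"
```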
Now that you understand the basics of breaking up the log file and identifying different elements, we can move on to more practical examples.
Examples
You want to list all user agents ordered by the number of times they appear (descending order):
awk -F\" '{print $6}' combined_log | sort | uniq -c | sort -rn
All we're doing here is extracting the user agent field from the log file and 'piping' it through some other commands. The first sort is to enable uniq to properly identify and count unique user agents. The final sort orders the result numerically by count, highest first.
The result will look similar to a user agents report generated by one of the above-mentioned packages. The difference is that you can generate this ANY time from ANY log file or files.
If you're not particularly interested in which operating system the visitor is using, or what browser extensions they have, then you can use something like the following:
awk -F\" '{print $6}' combined_log \
| sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/' \
| sort | uniq -c | sort -rn
Note: The \ at the end of a line simply indicates that the command will continue on the next line.
This will strip out the third and subsequent values in the 'bracketed' component of the user agent string. For example:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR)
becomes:
Mozilla/4.0 (compatible; MSIE 6.0)
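The \+ in the sed expression above is a GNU extension to basic regular expressions. On systems where sed does not support it, an extended-regex version (sed -E, accepted by GNU and BSD sed) should give the same result. A small sketch using the example string above, with illustrative variable names:

```shell
ua='Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR)'

# Extended-regex form of the substitution: \( and \) are literal
# brackets, while ( ) captures the first two ;-separated values.
short=$(printf '%s\n' "$ua" | sed -E 's/\(([^;]+; [^;]+)[^)]*\)/(\1)/')

# An agent string without a bracketed part passes through unchanged.
plain=$(printf '%s\n' 'Googlebot/2.1' | sed -E 's/\(([^;]+; [^;]+)[^)]*\)/(\1)/')

echo "$short"
echo "$plain"
```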
The next step is to start filtering the output so you can narrow down on a certain page or referer. Would you like to know which pages Google has been requesting from your site?
awk -F\" '($6 ~ /Googlebot/){print $2}' combined_log | awk '{print $2}'
Or who's been looking at your guestbook?
awk -F\" '($2 ~ /guestbook\.html/){print $6}' combined_log
It's just too easy, isn't it?
Using just the examples above you can already generate your own reports to back up any kind of automated reporting your ISP provides. You could even write your own log analysis program.
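As one possible starting point for such a program, the Googlebot example can be folded into a single awk call that also counts each page. This is a sketch on made-up sample data (the third and fourth entries are invented for illustration); the temporary file and variable names are assumptions.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:09:02 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
192.0.2.7 - - [18/Sep/2004:11:10:15 +1000] "GET /guestbook.html HTTP/1.1" 200 2210 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"
EOF

# Split the request line ($2) on spaces: req[1] is the method,
# req[2] the path. Count each path requested by Googlebot.
pages=$(awk -F\" '$6 ~ /Googlebot/ { split($2, req, " "); hits[req[2]]++ }
  END { for (p in hits) print hits[p], p }' "$log" | sort -rn)

echo "$pages"
rm -f "$log"
```

Counting inside awk avoids the sort | uniq -c step, at the cost of holding one counter per distinct page in memory.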
Using log files to identify problems with your site
The steps outlined below will let you identify problems with your site by identifying the different server responses and the requests that caused them:
awk '{print $9}' combined_log | sort | uniq -c | sort
The output shows how many of each type of response your site is returning. A 'normal' request results in a 200 code, which means a page or file has been requested and delivered, but there are many other possibilities.
The most common responses are:
200 - OK
206 - Partial Content
301 - Moved Permanently
302 - Found
304 - Not Modified
401 - Unauthorised (password required)
403 - Forbidden
404 - Not Found
Note: For more on Status Codes you can read the article HTTP Server Status Codes.
A 301 or 302 code means that the request has been redirected. What you'd like to see, if you're concerned about bandwidth usage, is a lot of 304 responses - meaning that the file didn't have to be delivered because the client already had a cached version.
A 404 code may indicate that you have a problem - a broken internal link or someone linking to a page that no longer exists. You might need to fix the link, contact the site with the broken link, or set up a PURL so that the link can work again.
The next step is to identify which pages/files are generating the different codes. The following command will summarise the 404 ("Not Found") requests:
# list all 404 requests
awk '($9 ~ /404/)' combined_log
# summarise 404 requests
awk '($9 ~ /404/)' combined_log | awk '{print $9,$7}' | sort
Or, you can use an inverted regular expression to summarise the requests that didn't return 200 ("OK"):
awk '($9 !~ /200/)' combined_log | awk '{print $9,$7}' | sort | uniq
Or, you can exclude several responses at once, in this case requests that returned 200 ("OK") or 304 ("Not Modified"):
awk '($9 !~ /200|304/)' combined_log | awk '{print $9,$7}' | sort | uniq
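If you also want to know how often each status/URL pair occurs, an awk array can do the counting in one pass. The following is a sketch on made-up sample data; the addresses, paths and variable names are illustrative, and it still assumes the status code is field 9.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "ExampleAgent/1.0"
192.0.2.2 - - [18/Sep/2004:11:08:01 +1000] "GET /old.html HTTP/1.0" 404 210 "-" "ExampleAgent/1.0"
192.0.2.3 - - [18/Sep/2004:11:08:09 +1000] "GET /old.html HTTP/1.0" 404 210 "-" "ExampleAgent/1.0"
192.0.2.4 - - [18/Sep/2004:11:08:30 +1000] "GET /logo.gif HTTP/1.0" 304 - "-" "ExampleAgent/1.0"
192.0.2.5 - - [18/Sep/2004:11:09:12 +1000] "GET /private/ HTTP/1.0" 403 199 "-" "ExampleAgent/1.0"
EOF

# Skip 200 and 304, then count each "status path" pair.
summary=$(awk '$9 != 200 && $9 != 304 { n[$9 " " $7]++ }
  END { for (k in n) print n[k], k }' "$log" | sort -rn)

echo "$summary"
rm -f "$log"
```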
Suppose you've identified a link that's generating a lot of 404 errors. Let's see where the requests are coming from:
awk -F\" '($2 ~ /^GET \/path\/to\/brokenlink\.html/){print $4,$6}' combined_log
Now you can see not just the referer, but the user-agent making the request. You should be able to identify whether there is a broken link within your site, on an external site, or if a search engine or similar agent has an invalid address.
If you can't fix the link, you should look at using Apache mod_rewrite or a similar scheme to redirect (301) the requests to the most appropriate page on your site. By using a 301 instead of a normal (302) redirect you are indicating to search engines and other intelligent agents that they need to update their link as the content has 'Moved Permanently'.
Who's 'hotlinking' my images?
Something that really annoys some people is when their bandwidth is being used by their images being linked directly on other websites.
Here's how you can see who's doing this to your site. Just change www.example.net to your domain, and combined_log to your combined log file.
awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.example\.net/){print $4}' combined_log \
| sort | uniq -c | sort
Translation:
- explode each row using the " character as the separator;
- the request line (%r) must contain ".jpg" or ".gif";
- the referer must not start with your website address (www.example.net in this example);
- display the referer and summarise.
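The translation above can be tried out on a few made-up entries. This sketch adds one assumption of its own: it also skips requests with no referer ("-"), which are usually direct requests rather than hotlinks. The sample hosts and variable names are illustrative.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [18/Sep/2004:11:07:48 +1000] "GET /images/photo.jpg HTTP/1.1" 200 5120 "http://www.example.net/gallery.html" "ExampleAgent/1.0"
192.0.2.2 - - [18/Sep/2004:11:08:01 +1000] "GET /images/photo.jpg HTTP/1.1" 200 5120 "http://forum.example.org/thread/42" "ExampleAgent/1.0"
192.0.2.3 - - [18/Sep/2004:11:08:09 +1000] "GET /images/logo.gif HTTP/1.1" 200 900 "http://forum.example.org/thread/42" "ExampleAgent/1.0"
192.0.2.4 - - [18/Sep/2004:11:08:30 +1000] "GET /index.html HTTP/1.1" 200 6433 "http://forum.example.org/thread/42" "ExampleAgent/1.0"
192.0.2.5 - - [18/Sep/2004:11:09:12 +1000] "GET /images/logo.gif HTTP/1.1" 200 900 "-" "ExampleAgent/1.0"
EOF

# Image requests whose referer is neither empty ("-") nor our own site.
hot=$(awk -F\" '$2 ~ /\.(jpg|gif)/ && $4 != "-" && $4 !~ /^http:\/\/www\.example\.net/ {print $4}' "$log" \
  | sort | uniq -c | sort -rn)

echo "$hot"
rm -f "$log"
```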
You can block hot-linking using mod_rewrite but that can also result in blocking various search engine result pages, caches and online translation software. To see if this is happening, we look for 403 ("Forbidden") errors in the image requests:
# list image requests that returned 403 Forbidden
awk '($9 ~ /403/)' combined_log \
| awk -F\" '($2 ~ /\.(jpg|gif)/){print $4}' \
| sort | uniq -c | sort
Translation:
- the status code (%>s) is 403 Forbidden;
- the request line (%r) contains ".jpg" or ".gif";
- display the referer and summarise.
You might notice that the above command is simply a combination of the previous one and one presented earlier. Here awk is called twice because the 'referer' field is most easily reached after the separator is set to \", whereas the 'status code' is available directly with the default separator.
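If you would rather avoid the second call, one possible single-pass variant (along the lines suggested in the comments below) keeps " as the separator and splits the third field, which holds the status code and size. A sketch on made-up sample data, with illustrative hosts and variable names:

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [18/Sep/2004:11:07:48 +1000] "GET /images/photo.jpg HTTP/1.1" 403 199 "http://translate.example.com/page" "ExampleAgent/1.0"
192.0.2.2 - - [18/Sep/2004:11:08:01 +1000] "GET /images/photo.jpg HTTP/1.1" 200 5120 "http://www.example.net/gallery.html" "ExampleAgent/1.0"
192.0.2.3 - - [18/Sep/2004:11:08:09 +1000] "GET /admin/ HTTP/1.1" 403 199 "-" "ExampleAgent/1.0"
EOF

# With " as the separator, $3 looks like " 403 199 ".
# split() on a single space yields st[1] = status, st[2] = size.
blocked=$(awk -F\" '{ split($3, st, " ") }
  st[1] == "403" && $2 ~ /\.(jpg|gif)/ { print $4 }' "$log")

echo "$blocked"
rm -f "$log"
```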
Blank User Agents
A 'blank' user agent is typically an indication that the request is from an automated script or someone who really values their privacy. The following command will give you a list of ip addresses for those user agents so you can decide if any need to be blocked:
awk -F\" '($6 ~ /^-?$/)' combined_log | awk '{print $1}' | sort | uniq
A further pipe through logresolve will give you the hostnames of those addresses.
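When deciding what to block, it can help to know how many requests each of those addresses made. A minimal sketch, again on made-up entries with illustrative names, that counts requests per address for blank user agents:

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.9 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "-"
192.0.2.9 - - [18/Sep/2004:11:07:49 +1000] "GET /guestbook.html HTTP/1.0" 200 2210 "-" "-"
192.0.2.10 - - [18/Sep/2004:11:08:30 +1000] "GET / HTTP/1.0" 200 6433 "-" ""
192.0.2.11 - - [18/Sep/2004:11:09:12 +1000] "GET / HTTP/1.0" 200 6433 "-" "ExampleAgent/1.0"
EOF

# The address is the first space-separated word of $1.
blank=$(awk -F\" '$6 ~ /^-?$/ { split($1, f, " "); n[f[1]]++ }
  END { for (ip in n) print n[ip], ip }' "$log" | sort -rn)

echo "$blank"
rm -f "$log"
```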
vinoth 17 July, 2018
How do I separate bot and human requests in the server access log with awk, cat, shell or any other command in the terminal?
Mokhtar Ebrahim 27 March, 2018
Very useful resource for analyzing apache logs with AWK.
Hungry for more! Especially for MySQL.
Best Regards,
Jim 20 February, 2016
Great Article
How to extract the request url and a status code. They are separate in examples.
Thanks
Watael 24 January, 2015
no need to pipe awk to itself, use " as separator, and split third field into an array, then test against error value you want to show:
awk -F\" '{split($3,ar," "); if(ar[1] == "403")print $0}' combined.log
Kent Haase 7 October, 2013
Thanks for the page on analyzing apache logs, Duncan.
I've modified some of the commands to get the access log via an sftp mount, use args passed via the command line, etc.
The (re) introduction to awk has also spurred me to think about performing some deeper filtering and analysis.
Your page gave me better than what I was originally looking for (Give a man a fish and he'll eat for a day. Teach a man to fish etc).
Daniel 18 June, 2013
I recently made a lightweight solution for quickly getting some statistics on an apache or nginx log when neither awk nor server-side analytics are available. It's a web-based app, and using the FileAPI, it all runs in the browser. If interested, you can find it at: serverlogstats.com/
Jan-Willem 11 June, 2013
Can you recommend any other pieces of software to analyse enormous logfiles (+50GB)?
Ben Carpenter 17 February, 2012
Just a quick note to say many thanks for the overview; it's a really valuable introduction and there's lots that can be gleaned from using it.
However, your awk scripts assume that $9 is the status code and this may not always be the case. I've written up a more extended script that caters for a number of the common issues I've experienced making that assumption. Hopefully this will help those who need to probe a bit more.