System: Bash script to generate broken links report
One of the most effective ways to find broken links is by analyzing Apache log files for errors. This is faster and less resource-intensive than spidering your whole website. It also has the advantage of showing broken inbound links.
The script presented below can be run from the command line or using CRON to scan a single logfile and generate a basic report.
The Bash script v0.9
The script requires a single argument: the name of a logfile inside the LOGPATH directory (/var/log/apache2 by default). Commonly this might be something like access.log, combined.log or sitename-combined_log.
The listed options can be used to add details to the report and to specify an email recipient and sender. Without further ado, here's the script:
/usr/local/bin/broken-links-report:
#!/bin/bash
## Original shell script by Chirp Internet: chirpinternet.eu
## Please acknowledge use of this code by including this header.
##
## Usage: broken-links-report [options] LOGFILE
##
## Options:
## -n site name for report heading
## -d domain name
## -r recipient email
## -s sender/from address
##
LOGPATH=/var/log/apache2
EMAILREGEX="^[^ ]+@[^ ]+$"
SITENAME=
DOMAIN=
TARGET=
SENDER=
while getopts 'n:d:r:s:' OPTION
do
  case $OPTION in
    n)
      SITENAME="$OPTARG"
      ;;
    d)
      DOMAIN="$OPTARG"
      ;;
    r)
      if [[ "$OPTARG" =~ $EMAILREGEX ]]
      then
        TARGET="$OPTARG"
      else
        echo "Invalid email: $OPTARG"
        exit 2
      fi
      ;;
    s)
      if [[ "$OPTARG" =~ $EMAILREGEX ]]
      then
        SENDER="$OPTARG"
      else
        echo "Invalid email: $OPTARG"
        exit 2
      fi
      ;;
    ?)
      printf "Usage: %s [-n name] [-d domain] [-r target] [-s sender] logfile\n" $(basename $0) >&2
      exit 2
      ;;
  esac
done
shift $(($OPTIND - 1))
if [ ! "$#" -eq 1 ];
then
  printf "Usage: %s [-n name] [-d domain] [-r target] [-s sender] logfile\n" $(basename $0) >&2
  exit 2
fi
LOGNAME=$1
LOGFILE="${LOGPATH}/${LOGNAME}"
if [ ! -r "$LOGFILE" ];
then
  echo "File not found: $LOGFILE"
  exit 1
fi
# scan logfile for broken links
RETVAL=`awk '($6 ~ /GET/ && $9 !~ /200|204|206|301|302|304|401|403|-/){print $7}' $LOGFILE \
  | sort | uniq -c | awk '($1 > 1){print $2}'`
if [ "$RETVAL" ];
then
  SUBJECT="Missing (404) and deleted (410) URL report"
  if [ "$SITENAME" ];
  then
    SUBJECT="${SUBJECT} for ${SITENAME}"
  fi
  body="*** Apache log file format: http://httpd.apache.org/docs/current/logs.html\n\n"
  for i in $RETVAL
  do
    if [ "$DOMAIN" ];
    then
      body="${body}URL: http://${DOMAIN}${i}\n\n"
    else
      body="${body}URI: ${i}\n\n"
    fi
    OUTPUT=`grep -F "GET $i " $LOGFILE`;
    body="${body}$OUTPUT\n\n";
  done;
  if [ "$TARGET" ];
  then
    if [ "$SENDER" ];
    then
      echo -e "${body}" | mail ${TARGET} -s "${SUBJECT}" -- -r "${SENDER}"
    else
      echo -e "${body}" | mail ${TARGET} -s "${SUBJECT}"
    fi
  else
    echo "$SUBJECT"
    echo
    echo -e "${body}"
  fi
else
  echo "No broken links found in $LOGFILE"
fi
The script will only report broken links that appear more than once in the logfile: this is the ($1 > 1) condition in the final awk command. The condition could be removed, or changed to a larger number, according to your traffic patterns.
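As a rough sketch of how that threshold could be made adjustable, the counting stage can be wrapped in a function that takes the minimum hit count as a parameter. The function name frequent_uris and its argument are illustrative only and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): reads request URIs on
# stdin, one per line, and prints those seen at least MINHITS times.
frequent_uris() {
  local minhits="${1:-2}"   # default of 2 matches the script's ($1 > 1) test
  sort | uniq -c | awk -v min="$minhits" '($1 >= min){print $2}'
}

# Example input: /missing.html is requested twice, /once.html only once
printf '%s\n' /missing.html /once.html /missing.html | frequent_uris 2
```

Passing 1 would report every failing URI, while a larger value suits busier sites where one-off typos are common.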
You can find a new and improved version below.
Calling the script from CRON
In our case we want to run the script daily and to scan a full 24 hours, so the script is triggered from the apache logrotate configuration file. It could just as easily be called from a stand-alone cron or crontab file using a similar command.
/etc/logrotate.d/apache2:
/var/log/apache2/*log {
  ...
  daily
  ...
  compress
  delaycompress
  firstaction
    /usr/local/bin/broken-links-report -n 'Example Site' -d www.example.net -s do-not-reply@example.net -r webmaster@example.net example-combined_log
    ...
  endscript
  lastaction
    ...
  endscript
}
By calling the script as part of firstaction it will be called before any logs are rotated. If you place it in lastaction or postrotate then you will need to feed it the rolled over (.1) logfile.
Future improvements
Filtering the output
You may want to filter the output to ignore certain URLs or requests. Any established website will build up some broken inbound links over time and it doesn't make sense to set up a Redirect (301) in all cases.
There are also requests by various user agents for 'unnecessary' files, such as /sitemap.xml and /apple-touch-icon-precomposed.png, which you might want to leave out of the report, along with requests from bad robots for admin.php, register.php and other common exploit attempts.
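One possible approach, sketched here under our own assumptions, is to pass the list of URIs through grep -Ev with an ignore pattern before it is counted. The names IGNOREREGEX and filter_uris, and the pattern itself, are hypothetical and would need adapting to your site.

```shell
#!/bin/bash
# Hypothetical ignore list: an extended regex of request paths to leave out.
# Paths with a query string (e.g. /admin.php?x=1) are NOT matched by this pattern.
IGNOREREGEX='^/(sitemap\.xml|apple-touch-icon[^/]*\.png|admin\.php|register\.php)$'

# Reads URIs on stdin and drops any that match IGNOREREGEX.
filter_uris() {
  # grep exits with status 1 when every line was filtered out; treat that as success
  grep -Ev "$IGNOREREGEX" || true
}

# Example: only /old-page.html survives the filter
printf '%s\n' /sitemap.xml /old-page.html /admin.php | filter_uris
```

In the script, a stage like this could sit between the first awk command and sort, so ignored URIs never reach the report.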
Convert to shorthand
The code could be made a lot shorter by using shorthand for if/then and case statements.
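To illustrate what such shorthand looks like, here is a small comparison of the two forms. The function names are made up for this example. One caveat worth keeping in mind: in A && B || C, the C branch also runs if B itself fails, so the short form is only equivalent to if/else when B cannot fail.

```shell
#!/bin/bash
# Long form: explicit if/then/else
describe_long() {
  if [ "$1" ]; then
    printf "URL: http://%s/\n" "$1"
  else
    printf "no domain\n"
  fi
}

# Shorthand: test && action || fallback
# (equivalent here because the first printf is not expected to fail)
describe_short() {
  [ "$1" ] && printf "URL: http://%s/\n" "$1" || printf "no domain\n"
}

describe_long www.example.net
describe_short www.example.net
describe_short ""
```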
Input validation
The script has some basic validation of input, but will still let you specify files outside the LOGPATH directory for scanning, which is far from ideal.
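As a sketch of one way to tighten this, the logfile name can be checked against a whitelist of characters that excludes the slash entirely. The FILEREGEX test used in v1.0 further down only restricts the first character, so a name such as sub/../../syslog would still pass it. The names SAFENAMEREGEX and is_safe_logname below are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical stricter check: letters, digits, dot, underscore and hyphen only,
# starting with a letter or digit, so "/" and a leading "." can never appear.
SAFENAMEREGEX='^[A-Za-z0-9][A-Za-z0-9._-]*$'

# Succeeds (exit status 0) only for names that match the whitelist.
is_safe_logname() {
  [[ "$1" =~ $SAFENAMEREGEX ]]
}

# Example names, two acceptable and two attempting to leave the log directory
for name in access.log sitename-combined_log ../auth.log sub/../../syslog; do
  if is_safe_logname "$name"; then
    printf "accepted: %s\n" "$name"
  else
    printf "rejected: %s\n" "$name"
  fi
done
```

A whitelist like this is deliberately restrictive; logfiles kept in subdirectories of the log path would need a different rule.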
Improved code v1.0
We've now addressed some of the above issues as well as cleaning up and compressing the code to make it more bash-like:
/usr/local/bin/broken-links-report:
#!/bin/bash
## Original shell script by Chirp Internet: chirpinternet.eu
## Please acknowledge use of this code by including this header.
##
## Usage: broken-links-report [options] LOGFILE
##
## Options:
## -d domain name
## -n site name for report heading
## -r recipient email
## -s sender/from address
##
LOGPATH="/var/log/apache2"
EMAILREGEX="^[^ ]+@[^ ]+$"
FILEREGEX="^[^. /][^ ]+$"
SUBJECT="Missing (404) and deleted (410) URL report"
USAGE=$(
  printf "Usage: %s [-n name] [-d domain] [-r target] [-s sender] logfile" $(basename $0)
)
DOMAIN=
SITENAME=
TARGET=
SENDER=
while getopts 'n:d:r:s:' OPTION
do
  case $OPTION in
    d) DOMAIN="$OPTARG" ;;
    n) SITENAME="$OPTARG" ;;
    r) [[ "$OPTARG" =~ $EMAILREGEX ]] && TARGET="$OPTARG" || { printf "Invalid email: %s\n" "$OPTARG" 1>&2; exit 2; } ;;
    s) [[ "$OPTARG" =~ $EMAILREGEX ]] && SENDER="$OPTARG" || { printf "Invalid email: %s\n" "$OPTARG" 1>&2; exit 2; } ;;
    ?) printf "$USAGE\n" 1>&2; exit 2 ;;
  esac
done
shift $(($OPTIND - 1))

# all error messages go to stderr
[ "$#" -eq 1 ] || { printf "$USAGE\n" 1>&2; exit 2; }
LOGNAME=$1
[[ "$LOGNAME" =~ $FILEREGEX ]] || { printf "Invalid filename: %s\n" "$LOGNAME" 1>&2; exit 1; }
LOGFILE="${LOGPATH}/${LOGNAME}"
[ -r "$LOGFILE" ] || { printf "File not found or not readable: %s\n" "$LOGFILE" 1>&2; exit 1; }

# scan logfile for URIs that failed more than once
BROKENLINKS=$(
  awk '($6 ~ /GET/ && $9 !~ /200|204|206|301|302|304|401|403|-/){print $7}' "$LOGFILE" | sort | uniq -c | awk '($1 > 1){print $2}'
)
[ "$BROKENLINKS" ] || { printf "No broken links found in %s\n" "$LOGFILE"; exit 0; }
[ "$SITENAME" ] && SUBJECT="${SUBJECT} for ${SITENAME}"

# build the report: each URI followed by its matching log lines
REPORT=$(
  printf "*** Apache log file format: http://httpd.apache.org/docs/current/logs.html\n\n"
  for i in $BROKENLINKS; do
    [ "$DOMAIN" ] && printf "URL: http://${DOMAIN}%s\n\n" "$i" || printf "URI: %s\n\n" "$i"
    OUTPUT=$(grep -F "GET $i " "$LOGFILE")
    printf "%s\n\n" "$OUTPUT"
  done
)

# no recipient given: print the report to stdout
[ "$TARGET" ] || { printf "%s\n\n%s\n\n" "$SUBJECT" "$REPORT"; exit 0; }

# REPORT already contains real newlines, so print it verbatim; echo -e would
# also expand backslash sequences that appear in the logged requests
if [ "$SENDER" ]; then
  printf "%s\n" "$REPORT" | mail "$TARGET" -s "$SUBJECT" -- -r "$SENDER"
else
  printf "%s\n" "$REPORT" | mail "$TARGET" -s "$SUBJECT"
fi
Please let us know using the Feedback form below if you find this script useful or want to suggest bug fixes or improvements.
Dave 19 January, 2015
Hi,
Thanks for this script. It works perfectly!
But it also shows the sitemap.xml and apple-touch-icon-precomposed URLs in the output. Can you be more precise about how to filter them so they are NOT in the outgoing email?
A quick solution would be to add a line after:
for i in $BROKENLINKS; do
where you issue a continue command if $i matches one of those file names.
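A minimal sketch of that idea follows, using a case statement with glob patterns to decide which URIs to skip. The function wrapper report_uris and the example URIs are only there to make the snippet runnable on its own; in the script, the case block would go directly after the for line.

```shell
#!/bin/bash
# Hypothetical stand-in for the report loop: prints each URI unless it
# matches one of the patterns we want to ignore.
report_uris() {
  local i
  for i in "$@"; do
    # skip unwanted files before anything is added to the report
    case "$i" in
      /sitemap.xml|/apple-touch-icon*.png) continue ;;
    esac
    printf "URI: %s\n" "$i"
  done
}

# Example list of broken links; only /old-page.html is reported
report_uris /sitemap.xml /old-page.html /apple-touch-icon-precomposed.png
```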