SQL: Using a PostgreSQL foreign data wrapper to analyze log files
When we've talked before about analyzing log files we've mosting been looking at the command line scripting tools such as awk, grep and sed to manipulate data, and then sort and uniq to tally the results.
While this works fine in most cases, it would be nice sometimes to have access to the full power of an SQL engine.
Foreign data wrappers to the rescue
A foreign data wrapper in PostgreSQL allows you to directly query the contents of a local text file as if it was an SQL table. Actually, it can go further than that and connect to remote servers and databases, but for now a text file should be enough.
The first step is to have your data in a text file that can be read by PostgreSQL. By default a tab-delimited text file is expected, but you can also change the settings to recognise CSV and other formats.
To enable foreign data wrappers in your database:
postgres=# CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER localfile FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
So far, so good. We now have a new foreign server using the file_fdw foreign-data wrapper. This is a native PostgreSQL module so should be available on all systems.
See under References below for a list of available FDWs for parsing anything from RSS feeds to VCF gene sequences. There are even some for reading log files directly.
Preparing mail.log data for PostgreSQL
In this example we're just looking at something very simple. Extracting the relay and status values from the Sendmail mail.log into a tab-delimited text file:
# sed -n 's/.*relay=\([^ ]\+\).*stat=\(.\+\)/\1\t\2/p' /var/log/mail.log \
| sed 's/\(Sent\|Deferred\|Host unknown\).*/\1/' \
| grep -v 127.0.0.1 > /tmp/mail-relay-status.txt
This command takes the values following relay= and stat= in the log file from lines where those values exist and then does some tidying and pruning. The results are piped into a text file in the local /tmp/ directory.
An example of the generated text file:
...
gmail-smtp-in.l.google.com. Sent
aspmx.l.google.com. Sent
hotmail-com.olc.protection.outlook.com. Sent
relay.example.net. Deferred
relay.example.net. Deferred
relay.example.net. Sent
extmail.bigpond.com. Sent
extmail.bpbb.bigpond.com. Sent
gmail-smtp-in.l.google.com. Sent
...
Importing into PostgreSQL
This is not so much a matter of importing the data into a TABLE as we've done many times in the past, but of creating a live linkage from the database to the specified text file.
postgres=# CREATE FOREIGN TABLE mailstat (
relay character varying,
status character varying
)
SERVER localfile
OPTIONS (
filename '/tmp/mail-relay-status.txt'
);
CREATE FOREIGN TABLE
If we've done everything right we can now immediately start querying the collected data. Only if the file does not exist, there will be an error:
postgres=# SELECT * FROM mailstat;
ERROR: could not open file "/tmp/mail-relay-status.txt" for reading: No such file or directory
HINT: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.
In this case we've only extracted and imported two columns from the mail log. A more advanced version might also include the timestamp, From and To addresses and the Message Id so you can identify individual emails.
If you're looking to connect to CSV or other file formats, the OPTIONS clause is the place to specify the delimiter and other characteristics.
Example SQL queries
The whole point of this exercise is that we now have the ability to query our collected data as if it was a normal PostgreSQL table:
postgres=# SELECT DISTINCT status, COUNT(*) FROM mailstat GROUP BY status ORDER BY count DESC;
status | count
---------------------+-------
Sent | ....
Deferred | ..
User unknown | ..
Host unknown | .
Service unavailable | .
(5 rows)
postgres=# SELECT DISTINCT relay, COUNT(*) FROM mailstat GROUP BY relay ORDER BY COUNT(*) DESC LIMIT 5;
relay | count
-----------------------------------------+-------
... | ....
postgres=# SELECT DISTINCT relay, status, COUNT(*) FROM mailstat WHERE status != 'Sent' GROUP BY relay, status ORDER BY status, COUNT(*) DESC;
relay | status | count
-----------------------------------------+---------------------+-------
... | ... | ...
Now just imagine what you could do with access to all the log file information as separate columns, in real-time, using a specialised foreign data wrapper (FDW). Or even just with a CRON to regularly update the referenced text file.
You could go on to create VIEWs, and online reports that updated in real time, with a search function for filtering and analysing the data and never have to look at an actual log file again.
References
Related Articles - Log Files
- SQL Using a PostgreSQL foreign data wrapper to analyze log files
- System Controlling what logs where with rsyslog.conf
- System Logging sFTP activity for chrooted users
- System Analyzing Apache Log Files
- System Bash script to generate broken links report
- System Blocking Unwanted Spiders and Scrapers
- System Referer Spam from Live Search
- System Referer Spam from Microsoft Bing
- System Fake Traffic from AVG