skip to content

SQL: Using a PostgreSQL foreign data wrapper to analyze log files

When we've talked before about analyzing log files we've mosting been looking at the command line scripting tools such as awk, grep and sed to manipulate data, and then sort and uniq to tally the results.

While this works fine in most cases, it would be nice sometimes to have access to the full power of an SQL engine.

Foreign data wrappers to the rescue

A foreign data wrapper in PostgreSQL allows you to directly query the contents of a local text file as if it was an SQL table. Actually, it can go further than that and connect to remote servers and databases, but for now a text file should be enough.

The first step is to have your data in a text file that can be read by PostgreSQL. By default a tab-delimited text file is expected, but you can also change the settings to recognise CSV and other formats.

To enable foreign data wrappers in your database:

postgres=# CREATE EXTENSION IF NOT EXISTS file_fdw; CREATE EXTENSION postgres=# CREATE SERVER localfile FOREIGN DATA WRAPPER file_fdw; CREATE SERVER

So far, so good. We now have a new foreign server using the file_fdw foreign-data wrapper. This is a native PostgreSQL module so should be available on all systems.

See under References below for a list of available FDWs for parsing anything from RSS feeds to VCF gene sequences. There are even some for reading log files directly.

Preparing mail.log data for PostgreSQL

In this example we're just looking at something very simple. Extracting the relay and status values from the Sendmail mail.log into a tab-delimited text file:

# sed -n 's/.*relay=\([^ ]\+\).*stat=\(.\+\)/\1\t\2/p' /var/log/mail.log \ | sed 's/\(Sent\|Deferred\|Host unknown\).*/\1/' \ | grep -v 127.0.0.1 > /tmp/mail-relay-status.txt

This command takes the values following relay= and stat= in the log file from lines where those values exist and then does some tidying and pruning. The results are piped into a text file in the local /tmp/ directory.

An example of the generated text file:

... gmail-smtp-in.l.google.com. Sent aspmx.l.google.com. Sent hotmail-com.olc.protection.outlook.com. Sent relay.example.net. Deferred relay.example.net. Deferred relay.example.net. Sent extmail.bigpond.com. Sent extmail.bpbb.bigpond.com. Sent gmail-smtp-in.l.google.com. Sent ...

Importing into PostgreSQL

This is not so much a matter of importing the data into a TABLE as we've done many times in the past, but of creating a live linkage from the database to the specified text file.

postgres=# CREATE FOREIGN TABLE mailstat ( relay character varying, status character varying ) SERVER localfile OPTIONS ( filename '/tmp/mail-relay-status.txt' ); CREATE FOREIGN TABLE

If we've done everything right we can now immediately start querying the collected data. Only if the file does not exist, there will be an error:

postgres=# SELECT * FROM mailstat; ERROR: could not open file "/tmp/mail-relay-status.txt" for reading: No such file or directory HINT: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.

In this case we've only extracted and imported two columns from the mail log. A more advanced version might also include the timestamp, From and To addresses and the Message Id so you can identify individual emails.

If you're looking to connect to CSV or other file formats, the OPTIONS clause is the place to specify the delimiter and other characteristics.

Example SQL queries

The whole point of this exercise is that we now have the ability to query our collected data as if it was a normal PostgreSQL table:

postgres=# SELECT DISTINCT status, COUNT(*) FROM mailstat GROUP BY status ORDER BY count DESC; status | count ---------------------+------- Sent | .... Deferred | .. User unknown | .. Host unknown | . Service unavailable | . (5 rows) postgres=# SELECT DISTINCT relay, COUNT(*) FROM mailstat GROUP BY relay ORDER BY COUNT(*) DESC LIMIT 5; relay | count -----------------------------------------+------- ... | .... postgres=# SELECT DISTINCT relay, status, COUNT(*) FROM mailstat WHERE status != 'Sent' GROUP BY relay, status ORDER BY status, COUNT(*) DESC; relay | status | count -----------------------------------------+---------------------+------- ... | ... | ...

Now just imagine what you could do with access to all the log file information as separate columns, in real-time, using a specialised foreign data wrapper (FDW). Or even just with a CRON to regularly update the referenced text file.

You could go on to create VIEWs, and online reports that updated in real time, with a search function for filtering and analysing the data and never have to look at an actual log file again.

References

< SQL

Post your comment or question
top