
PHP: Stripping invalid Unicode for pdfTeX

An eternal problem when generating PDF files using pdfLaTeX or pdfTeX is that everything fails as soon as an unknown Unicode character is encountered.

Package inputenc Error: Unicode char XXX not set up for use with LaTeX.

Often the first approach is to create a black-list of disallowed characters, but this is rather impractical given the range of possible Unicode characters. There must be a better approach.

Where are the allowed characters defined?

Rather than a black-list, the trick is to find out which characters are supported in your system, and then strip out any that are not.

In our case, the relevant files can be found in the directory /usr/share/texlive/texmf-dist/tex/latex/base/ with a .dfu extension:

/usr/share/texlive/texmf-dist/tex/latex/base/lcyenc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ly1enc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/omsenc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ot1enc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ot2enc.dfu
...

Within those files, Unicode characters are declared as follows:

...
\DeclareUnicodeCharacter{2013}{\textendash}
\DeclareUnicodeCharacter{2014}{\textemdash}
\DeclareUnicodeCharacter{2018}{\textquoteleft}
\DeclareUnicodeCharacter{2019}{\textquoteright}
...

So all we need is to harvest those codes for our white-list. Simple.
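If you would rather stay inside PHP than shell out, the same harvesting step can be sketched with a regular expression. This is only an illustration: extract_declared_codes() is a hypothetical helper that is not part of the class shown later, and the sample input simply reuses the declarations quoted above.

```php
<?php
// Hypothetical helper (an assumption, not from the original article):
// pull the hex code points out of the text of a .dfu file.
function extract_declared_codes(string $dfu_source): array
{
    // Match \DeclareUnicodeCharacter{XXXX} and capture the hex digits.
    preg_match_all(
        '/\\\\DeclareUnicodeCharacter\{([0-9A-Fa-f]+)\}/',
        $dfu_source,
        $matches
    );

    // Normalise case, drop duplicates, then sort numerically by code point.
    $codes = array_unique(array_map('strtoupper', $matches[1]));
    usort($codes, function (string $a, string $b): int {
        return hexdec($a) <=> hexdec($b);
    });

    return $codes;
}

// Sample input: the declarations quoted above, out of order and with a duplicate.
$sample = <<<'TEX'
\DeclareUnicodeCharacter{2014}{\textemdash}
\DeclareUnicodeCharacter{2013}{\textendash}
\DeclareUnicodeCharacter{2018}{\textquoteleft}
\DeclareUnicodeCharacter{2013}{\textendash}
TEX;

print_r(extract_declared_codes($sample));
```

In a real script the input would come from reading each .dfu file, for example with glob() and file_get_contents().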

Building the white-list

Extracting the defined Unicode character codes is easy enough using command-line utilities:

$ grep -h DeclareUnicodeCharacter /usr/share/texlive/texmf-dist/tex/latex/base/*.dfu \
  | awk -F [\{\}] '{print $2}' \
  | sort | uniq
00A0
00A1
00A2
00A3
00A4
00A5
...

On our system, after removing duplicates, this still returns some 500+ distinct codes. But as you can see, a lot of the codes are sequential, which means we can reduce them into ranges.

Optimising the white-list

The end goal here is to be able to use preg_replace in PHP to strip out any non-white-listed Unicode characters from text input to pdfLaTeX, using this syntax:

$content = preg_replace("/[^\x{00A0}\x{00A1}\x{00A2}...]/u", "?", $content);

Which we can optimise by combining adjacent characters into ranges as follows:

$content = preg_replace("/[^\x{00A0}-\x{0125}\x{0128}-\x{0137}...]/u", "?", $content);

This will replace any unsupported character with a question mark (?).
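To see the effect on a concrete string, here is a minimal sketch. The white-list in it is an assumption made up for the demonstration: plain ASCII plus the four dash and quote characters quoted earlier, not a list harvested from a real system.

```php
<?php
// Illustrative white-list only: ASCII (00-7F) plus U+2013-U+2014 and
// U+2018-U+2019. A real list would be built from the .dfu files.
$pattern = '/[^\x{00}-\x{7F}\x{2013}-\x{2014}\x{2018}-\x{2019}]/u';

// "Cafe" with e-acute, an en dash, a euro sign and an emoji.
$input = "Caf\u{00E9} \u{2013} 5\u{20AC} \u{1F600}";

// The /u modifier makes PCRE treat the subject as UTF-8, so each
// unsupported code point is replaced by exactly one question mark.
$output = preg_replace($pattern, '?', $input);

echo $output, PHP_EOL; // prints: Caf? – 5? ?
```

The en dash survives because it is on the list, while the accented letter, the euro sign and the emoji are each replaced.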

Putting it all together

The following static PHP class will combine the above steps to generate and apply a compact regex to any supplied string:

<?PHP
namespace Chirp;

// Original PHP code by Chirp Internet: www.chirpinternet.eu
// Please acknowledge use of this code by including this header.

class PDFTeXHelper
{
  public static $defined_chars = [];

  public static function load_defined_chars() : void
  {
    // reuse the cached white-list on subsequent calls
    if(self::$defined_chars) {
      return;
    }

    $retval = [
      "\x{00}-\x{FF}", // include ASCII and Latin-1 (00-FF) by default
    ];

    // list every code declared in the .dfu files, sorted and de-duplicated
    $command = "/usr/bin/grep -h DeclareUnicodeCharacter /usr/share/texlive/texmf-dist/tex/latex/base/*.dfu | /usr/bin/awk -F [\{\}] '{print $2}' | /usr/bin/sort | /usr/bin/uniq";

    $allowed_chars = [];
    $ret = NULL;
    exec($command, $allowed_chars, $ret);
    if($ret !== 0) {
      die(__METHOD__ . " return value: {$ret}");
    }

    $start = $end = NULL;
    $lastdec = $lasthex = NULL;
    foreach($allowed_chars as $hex) {
      $dec = hexdec($hex);
      if(($lastdec !== NULL) && ($dec == $lastdec + 1)) {
        // consecutive code: start or extend the current range
        if($start === NULL) {
          $start = $lasthex;
        }
        $end = $hex;
      } else {
        if($start !== NULL) {
          // the run has ended: emit it as a range
          $retval[] = "\x{{$start}}-\x{{$end}}";
          $start = NULL;
        } elseif($lasthex !== NULL) {
          // the previous code was isolated: emit it on its own
          $retval[] = "\x{{$lasthex}}";
        }
      }
      $lasthex = $hex;
      $lastdec = $dec;
    } // for each defined unicode character

    // flush whatever is still pending: an open range or a final isolated code
    if($start !== NULL) {
      $retval[] = "\x{{$start}}-\x{{$end}}";
    } elseif($lasthex !== NULL) {
      $retval[] = "\x{{$lasthex}}";
    }

    self::$defined_chars = $retval;
  } // ::load_defined_chars

  public static function strip_undefined_chars(string $text, string $replace = "?") : string
  {
    self::load_defined_chars();
    $regex = '/[^' . implode("", self::$defined_chars) . ']/u';
    $retval = preg_replace($regex, $replace, $text);
    if($retval === NULL) {
      // preg_replace returns NULL on failure, e.g. if $text is not valid UTF-8
      die(__METHOD__ . " preg error: " . preg_last_error());
    }
    return $retval;
  } // ::strip_undefined_chars
}


Without going into detail, we are simply looping through the earlier list of defined Unicode characters and, where possible, merging them into character ranges.
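For anyone who does want the detail, the merging step can be isolated into a small pure function and tried on a short list. This is a sketch rather than the class's own code: codes_to_ranges() and format_run() are hypothetical names, and the input is assumed to be sorted and free of duplicates.

```php
<?php
// Hypothetical standalone helper: collapse sorted hex code points into
// PCRE character-class fragments such as \x{00A0}-\x{00A2}.
function codes_to_ranges(array $hex_codes): array
{
    $fragments = [];
    $start = $prev = null;

    foreach ($hex_codes as $hex) {
        if ($prev !== null && hexdec($hex) === hexdec($prev) + 1) {
            $prev = $hex; // still inside a run: extend it
            continue;
        }
        if ($start !== null) {
            $fragments[] = format_run($start, $prev); // close the previous run
        }
        $start = $prev = $hex; // open a new run
    }
    if ($start !== null) {
        $fragments[] = format_run($start, $prev); // close the final run
    }

    return $fragments;
}

// A run of one code is written as a single character, otherwise as a range.
function format_run(string $start, string $end): string
{
    return $start === $end
        ? "\\x{{$start}}"
        : "\\x{{$start}}-\\x{{$end}}";
}

print_r(codes_to_ranges(['00A0', '00A1', '00A2', '00A5', '2013', '2014']));
```

The six codes collapse into three fragments: a range, an isolated code, and a second range that runs up to the very last entry in the list.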

You will notice one addition to the earlier discussion: by default we include the range 00-FF, which covers ASCII (00-7F) and the Latin-1 Supplement (80-FF), to avoid stripping out regular characters. You may want to adjust this range depending on the character set you are working with.

Usage is very simple:

$content = \Chirp\PDFTeXHelper::strip_undefined_chars($content);

And on subsequent calls from the same script, a cached version of the white-list is used so that we're not constantly querying the system files.

You could also capture and store the regular expression on your system as a static file, as long as you remember to refresh it if/when new pdfTeX libraries are installed.
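One way the static-file idea might look is sketched below. Everything in it is an assumption for illustration: load_cached_regex() is a hypothetical function, the cache path is arbitrary, and $build stands in for whatever code produces the finished regex string.

```php
<?php
// Sketch only: return the regex stored in $cache_file, or call $build
// (a hypothetical callable returning the regex string) and store its result.
function load_cached_regex(string $cache_file, callable $build): string
{
    if (is_readable($cache_file)) {
        $regex = file_get_contents($cache_file);
        if ($regex !== false && $regex !== '') {
            return $regex; // cache hit: no need to query the system files
        }
    }

    $regex = $build();
    file_put_contents($cache_file, $regex, LOCK_EX); // write under a lock
    return $regex;
}

// Demonstration with a throwaway cache file and a stand-in builder.
$cache_file = sys_get_temp_dir() . '/pdftex-whitelist-demo.regex';
@unlink($cache_file); // start from an empty cache for the demo

$calls = 0;
$build = function () use (&$calls): string {
    $calls++; // count how often the expensive step runs
    return '/[^\x{00}-\x{FF}\x{2013}-\x{2014}]/u';
};

$first  = load_cached_regex($cache_file, $build);
$second = load_cached_regex($cache_file, $build);
```

Here the builder runs only for the first call and the second call reads the file. Deleting the cache file after a TeX update forces the list to be rebuilt.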
