PHP: Stripping invalid Unicode for pdfTeX
An eternal problem when generating PDF files using pdfLaTeX or pdfTeX is that everything fails as soon as an unknown Unicode character is encountered.
Package inputenc Error: Unicode char XXX not set up for use with LaTeX.
Often the first approach is to create a black-list of disallowed characters, but this is impractical given the sheer range of possible Unicode characters. There must be a better approach.
Where are the allowed characters defined?
Rather than a black-list, the trick is to find out which characters are supported on your system, and then strip out any that are not.
In our case, the relevant files can be found in the directory /usr/share/texlive/texmf-dist/tex/latex/base/ with a .dfu extension:
/usr/share/texlive/texmf-dist/tex/latex/base/lcyenc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ly1enc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/omsenc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ot1enc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ot2enc.dfu
...
Within those files, the supported Unicode characters are declared as follows:
...
\DeclareUnicodeCharacter{2013}{\textendash}
\DeclareUnicodeCharacter{2014}{\textemdash}
\DeclareUnicodeCharacter{2018}{\textquoteleft}
\DeclareUnicodeCharacter{2019}{\textquoteright}
...
So all we need is to harvest those codes for our white-list. Simple.
Building the white-list
Extracting the defined Unicode character codes is easy enough using command-line utilities:
$ grep -h DeclareUnicodeCharacter /usr/share/texlive/texmf-dist/tex/latex/base/*.dfu \
| awk -F '[{}]' '{print $2}' \
| sort | uniq
00A0
00A1
00A2
00A3
00A4
00A5
...
On our system, after removing duplicates, this still returns more than 500 distinct codes. But as you can see, many of the codes are sequential, which means we can collapse them into ranges.
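If shelling out is not an option, the same harvesting step could probably be done in PHP itself. The following is only a sketch: the function name harvest_unicode_codes is made up for illustration, and it assumes the .dfu files are readable by the PHP process and declare characters in the form shown earlier.

```php
<?php
// Hypothetical helper (not part of the class presented in this article):
// collect the hex codes declared via \DeclareUnicodeCharacter in every
// .dfu file found in a directory.
function harvest_unicode_codes(string $dfu_dir) : array
{
    $codes = [];
    $files = glob(rtrim($dfu_dir, '/') . '/*.dfu') ?: [];
    foreach($files as $file) {
        $contents = file_get_contents($file);
        if($contents === false) {
            continue; // skip unreadable files
        }
        // capture the first argument, e.g. "2013" from \DeclareUnicodeCharacter{2013}{...}
        if(preg_match_all('/\\\\DeclareUnicodeCharacter\{([0-9A-Fa-f]+)\}/', $contents, $matches)) {
            foreach($matches[1] as $hex) {
                $codes[] = strtoupper($hex);
            }
        }
    }
    // remove duplicates, then order by numeric value rather than as text
    $codes = array_values(array_unique($codes));
    usort($codes, function($a, $b) {
        return hexdec($a) <=> hexdec($b);
    });
    return $codes;
}

// Path taken from the article; adjust for your TeX installation.
print_r(harvest_unicode_codes('/usr/share/texlive/texmf-dist/tex/latex/base'));
```

Sorting by numeric value rather than as text keeps the list in code point order even if some declarations use more than four hex digits.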
Optimising the white-list
The end goal here is to use preg_replace in PHP to strip any non-white-listed Unicode characters from text passed to pdfLaTeX, using this syntax:
$content = preg_replace("/[^\x{00A0}\x{00A1}\x{00A2}...]/u", "?", $content);
We can optimise this by combining adjacent characters into ranges as follows:
$content = preg_replace("/[^\x{00A0}-\x{0125}\x{0128}-\x{0137}...]/u", "?", $content);
This will replace any unsupported character with a question mark (?).
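To see the effect on a concrete string, here is a small self-contained sketch. The white-list in it is an assumption made up for the demonstration (ASCII plus the four characters from the .dfu excerpt above), not the list harvested from a real system.

```php
<?php
// Assumed white-list for this demo only: ASCII, en/em dash, curly single quotes.
$whitelist = '\x{00}-\x{7F}\x{2013}-\x{2014}\x{2018}-\x{2019}';

// A right single quote (U+2019), an en dash (U+2013) and a snowman (U+2603),
// written as code point escapes (PHP 7+) so the file encoding does not matter.
$input = "It\u{2019}s 10\u{2013}20 \u{2603}";

// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$output = preg_replace('/[^' . $whitelist . ']/u', '?', $input);

echo $output, "\n"; // the snowman is replaced by "?"; the quote and the dash are kept
```

Note that the pattern is built in a single-quoted string here, so the \x{...} sequences reach the regex engine untouched.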
Putting it all together
The following static PHP class combines the above steps to generate a compact regex and apply it to any supplied string:
<?PHP
namespace Chirp;

// Original PHP code by Chirp Internet: www.chirpinternet.eu
// Please acknowledge use of this code by including this header.

class PDFTeXHelper
{
    // cached regex fragments (single characters and ranges) making up the white-list
    public static $defined_chars = [];

    public static function load_defined_chars() : void
    {
        if(self::$defined_chars) {
            return;
        }

        $retval = [
            "\x{00}-\x{FF}", // include ASCII and Latin-1 (00-FF) by default
        ];

        $command = "/usr/bin/grep -h DeclareUnicodeCharacter /usr/share/texlive/texmf-dist/tex/latex/base/*.dfu | /usr/bin/awk -F '[{}]' '{print $2}' | /usr/bin/sort | /usr/bin/uniq";
        $allowed_chars = [];
        $ret = NULL;
        exec($command, $allowed_chars, $ret);
        if($ret !== 0) {
            die(__METHOD__ . " return value: {$ret}");
        }

        $start = $end = NULL;
        $lastdec = $lasthex = NULL;

        foreach($allowed_chars as $hex) {
            $dec = hexdec($hex);
            if(($lastdec !== NULL) && ($dec == $lastdec + 1)) {
                // consecutive code: start or extend the current range
                if($start === NULL) {
                    $start = $lasthex;
                }
                $end = $hex;
            } else {
                if($start !== NULL) {
                    // gap after a range: emit the range
                    $retval[] = "\x{{$start}}-\x{{$end}}";
                    $start = NULL;
                } elseif($lasthex !== NULL) {
                    // gap after an isolated code: emit it on its own
                    $retval[] = "\x{{$lasthex}}";
                }
            }
            $lasthex = $hex;
            $lastdec = $dec;
        } // for each defined unicode character

        // flush what is still pending: the open range if there is one, else the last code
        if($start !== NULL) {
            $retval[] = "\x{{$start}}-\x{{$end}}";
        } elseif($lasthex !== NULL) {
            $retval[] = "\x{{$lasthex}}";
        }

        self::$defined_chars = $retval;
    } // ::load_defined_chars

    public static function strip_undefined_chars(string $text, string $replace = "?") : string
    {
        self::load_defined_chars();
        $regex = '/[^' . implode("", self::$defined_chars) . ']/u';
        return preg_replace($regex, $replace, $text);
    } // ::strip_undefined_chars
}
Without going into detail, we are simply looping through the earlier list of defined Unicode characters and, where possible, merging them into character ranges.
You will notice one addition from the earlier discussion: the range 00-FF is included by default to avoid stripping out regular characters. Strictly speaking this is more than ASCII, which ends at 7F; it also takes in the Latin-1 Supplement block (80-FF). You may want to adjust this range depending on the character set you are working with.
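The merging step is easier to follow in isolation. The sketch below is a standalone variant, not the code used in the class: the function names merge_hex_codes and format_run are invented for illustration, and it works on integers rather than on the hex strings directly.

```php
<?php
// Hypothetical helper: collapse a list of hex code points into PCRE
// character-class fragments, merging consecutive codes into ranges.
function merge_hex_codes(array $hex_codes) : array
{
    // convert to integers, drop duplicates and sort numerically
    $codes = array_unique(array_map('hexdec', $hex_codes));
    sort($codes);

    $fragments = [];
    $start = $prev = NULL;
    foreach($codes as $code) {
        if(($prev !== NULL) && ($code === $prev + 1)) {
            $prev = $code; // still inside the current run
            continue;
        }
        if($start !== NULL) {
            $fragments[] = format_run($start, $prev); // close the previous run
        }
        $start = $prev = $code; // open a new run
    }
    if($start !== NULL) {
        $fragments[] = format_run($start, $prev); // close the final run
    }
    return $fragments;
}

// Hypothetical helper: format a run as \x{XXXX} or \x{XXXX}-\x{YYYY}.
function format_run(int $start, int $end) : string
{
    $from = sprintf('\x{%04X}', $start);
    return ($start === $end) ? $from : $from . sprintf('-\x{%04X}', $end);
}

print_r(merge_hex_codes(['00A0', '00A1', '00A2', '00B5', '2013', '2014']));
// fragments: \x{00A0}-\x{00A2}, \x{00B5}, \x{2013}-\x{2014}
```

Closing the final run after the loop is the step that is easiest to forget; without it the last range in the list would be lost.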
Usage is very simple:
$content = \Chirp\PDFTeXHelper::strip_undefined_chars($content);
On subsequent calls from the same script, a cached version of the white-list is used, so the system files are only queried once.
You could also capture the regular expression and store it on your system as a static file, as long as you remember to refresh it whenever the pdfTeX libraries are updated.
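One way to approach the static file idea is sketched below. Everything in it is an assumption for illustration: the function name load_cached_regex, the cache location in the system temp directory, and the stand-in builder, which in practice would assemble the pattern from the white-list generated by the class.

```php
<?php
// Hypothetical helper: return the regex stored in $cache_file, or build it
// with $build (any callable returning the regex string) and save it.
function load_cached_regex(string $cache_file, callable $build) : string
{
    if(is_readable($cache_file)) {
        $regex = file_get_contents($cache_file);
        if(($regex !== false) && ($regex !== '')) {
            return $regex; // cache hit
        }
    }
    $regex = $build();
    file_put_contents($cache_file, $regex, LOCK_EX); // lock to avoid partial writes
    return $regex;
}

// Example with a stand-in builder and an assumed cache location.
$cache_file = sys_get_temp_dir() . '/pdftex_whitelist_regex.txt';
$regex = load_cached_regex($cache_file, function() {
    return '/[^\x{00}-\x{FF}\x{2013}-\x{2014}]/u';
});
echo preg_replace($regex, '?', "caf\u{E9} \u{2013} \u{2603}"), "\n";
```

Deleting the cache file after a TeX update forces the pattern to be rebuilt on the next call.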