PHP: Stripping invalid Unicode for pdfTeX


An eternal problem when generating PDF files using pdfLaTeX or pdfTeX is that everything fails as soon as an unknown Unicode character is encountered.

Package inputenc Error: Unicode char XXX not set up for use with LaTeX.

Often the first approach is to create a blacklist of disallowed characters, but this is impractical given the sheer range of possible Unicode characters. There must be a better approach.

Where are the allowed characters defined?

Rather than a blacklist, the trick is to find out which characters are supported on your system, and then strip out any that are not.

In our case, the relevant files can be found in the directory /usr/share/texlive/texmf-dist/tex/latex/base/ with a .dfu extension:

/usr/share/texlive/texmf-dist/tex/latex/base/lcyenc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ly1enc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/omsenc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ot1enc.dfu
/usr/share/texlive/texmf-dist/tex/latex/base/ot2enc.dfu
...

Within those files, the supported Unicode characters are declared as follows:

...
\DeclareUnicodeCharacter{2013}{\textendash}
\DeclareUnicodeCharacter{2014}{\textemdash}
\DeclareUnicodeCharacter{2018}{\textquoteleft}
\DeclareUnicodeCharacter{2019}{\textquoteright}
...

So all we need is to harvest those codes for our white-list. Simple.

Building the white-list

Extracting the defined Unicode character codes is easy enough using command-line utilities:

$ grep -h DeclareUnicodeCharacter /usr/share/texlive/texmf-dist/tex/latex/base/*.dfu \
    | awk -F '[{}]' '{print $2}' \
    | sort | uniq
00A0
00A1
00A2
00A3
00A4
00A5
...

On our system, after removing duplicates, this still returns more than 500 distinct codes. As you can see, many of them are sequential, which means we can collapse them into ranges.
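
The same harvesting step can be sketched in PHP without shelling out. This is only an illustration: extract_declared_codes() is a hypothetical helper, not part of the class shown later, and it assumes every declaration follows the \DeclareUnicodeCharacter{XXXX}{...} form seen above. The arrow function needs PHP 7.4 or later.

```php
<?php
// Sketch: collect the code points declared in .dfu-style text.
// extract_declared_codes() is a hypothetical helper name.
function extract_declared_codes(string $dfu_source): array
{
    // Capture the hex code inside the first pair of braces.
    preg_match_all(
        '/\\\\DeclareUnicodeCharacter\{([0-9A-Fa-f]+)\}/',
        $dfu_source,
        $matches
    );

    // Normalise case and drop duplicates.
    $codes = array_unique(array_map('strtoupper', $matches[1]));

    // Sort by numeric value rather than as strings.
    usort($codes, fn($a, $b) => hexdec($a) <=> hexdec($b));

    return $codes;
}

$sample = <<<'TEX'
\DeclareUnicodeCharacter{2014}{\textemdash}
\DeclareUnicodeCharacter{2013}{\textendash}
\DeclareUnicodeCharacter{2013}{\textendash}
TEX;

print_r(extract_declared_codes($sample));
```

To run it against the installed files, the contents of each path returned by glob() for the .dfu directory could be read with file_get_contents() and passed in.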

Optimising the white-list

The end goal is to use preg_replace in PHP to strip any non-white-listed Unicode characters from text passed to pdfLaTeX, using this syntax:

$content = preg_replace("/[^\x{00A0}\x{00A1}\x{00A2}...]/u", "?", $content);

Which we can optimise by combining adjacent characters into ranges as follows:

$content = preg_replace("/[^\x{00A0}-\x{0125}\x{0128}-\x{0137}...]/u", "?", $content);

This will replace any unsupported character with a question mark (?).
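
As a quick illustration of the effect, here is a minimal sketch with a made-up white-list (ASCII, the U+00A0 to U+00FF block, and the en and em dashes). The list is an assumption chosen for the demo, not the one generated from the .dfu files. The /u modifier makes PCRE work on whole code points, so each unsupported character becomes a single question mark rather than one per byte.

```php
<?php
// Illustrative white-list only: ASCII, U+00A0 to U+00FF, en dash, em dash.
$regex = '/[^\x{00}-\x{7F}\x{00A0}-\x{00FF}\x{2013}\x{2014}]/u';

// "Café", an en dash, then two CJK characters that are not white-listed.
$content = "Caf\u{00E9} \u{2013} \u{4E2D}\u{6587}";

$clean = preg_replace($regex, "?", $content);

echo $clean, "\n"; // Café – ??
```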

Putting it all together

The following static PHP class will combine the above steps to generate and apply a compact regex to any supplied string:

<?PHP
  namespace Chirp;

  // Original PHP code by Chirp Internet: www.chirpinternet.eu
  // Please acknowledge use of this code by including this header.

  class PDFTeXHelper
  {

    public static $defined_chars = [];

    public static function load_defined_chars() : void
    {

      if(self::$defined_chars) {
        return;
      }

      $retval = [
        "\x{00}-\x{FF}", // include ASCII and Latin-1 (U+0000 to U+00FF) by default
      ];

      $command = "/usr/bin/grep -h DeclareUnicodeCharacter /usr/share/texlive/texmf-dist/tex/latex/base/*.dfu | /usr/bin/awk -F [\{\}] '{print $2}' | /usr/bin/sort | /usr/bin/uniq";
      $allowed_chars = [];
      $ret = NULL;

      exec($command, $allowed_chars, $ret);

      if($ret !== 0) {
        die(__METHOD__ . " return value: {$ret}");
      }

      $start = $end = NULL;
      $lastdec = $lasthex = NULL;

      foreach($allowed_chars as $hex) {

        $dec = hexdec($hex);

        if($lastdec && ($dec == $lastdec + 1)) {

          if(!$start) {
            $start = $lasthex;
          }

          $end = $hex;

        } else {

          if($start) {
            $retval[] = "\x{{$start}}-\x{{$end}}";
            $start = NULL;
          } elseif($lasthex && ($lasthex !== $end)) { // strict: codes like "01E2" are numeric strings
            $retval[] = "\x{{$lasthex}}";
          }

        }

        $lasthex = $hex;
        $lastdec = $dec;

      } // for each defined unicode character

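      // flush whatever is still pending: an open range, or a final single character
      if($start) {
        $retval[] = "\x{{$start}}-\x{{$end}}";
      } elseif($lasthex) {
        $retval[] = "\x{{$lasthex}}";
      }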

      self::$defined_chars = $retval;

    } // ::load_defined_chars

    public static function strip_undefined_chars(string $text, string $replace = "?") : string
    {

      self::load_defined_chars();

      $regex = '/[^' . implode("", self::$defined_chars) . ']/u';

      return preg_replace($regex, $replace, $text);

    } // ::strip_undefined_chars

  }


Without going into detail, we are simply looping through the earlier list of defined Unicode characters and, where possible, merging them into character ranges.
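
For readers who do want the detail, the merging step can be sketched on its own. This is a standalone re-implementation for illustration, not the class method itself: merge_into_ranges() and format_run() are hypothetical names, and the input is assumed to be sorted, de-duplicated hex codes.

```php
<?php
// Sketch: collapse sorted hex code points into PCRE character-class fragments.
// merge_into_ranges() and format_run() are hypothetical names.
function merge_into_ranges(array $hex_codes): array
{
    $fragments = [];
    $start = $end = null;

    foreach ($hex_codes as $hex) {
        if ($end !== null && hexdec($hex) === hexdec($end) + 1) {
            $end = $hex; // extends the current run
            continue;
        }
        if ($start !== null) {
            $fragments[] = format_run($start, $end); // previous run is complete
        }
        $start = $end = $hex; // begin a new run
    }

    if ($start !== null) {
        $fragments[] = format_run($start, $end); // flush the final run
    }

    return $fragments;
}

// A run of one code point becomes \x{A}; a longer run becomes \x{A}-\x{B}.
function format_run(string $start, string $end): string
{
    return ($start === $end)
        ? sprintf('\x{%s}', $start)
        : sprintf('\x{%s}-\x{%s}', $start, $end);
}

print_r(merge_into_ranges(["00A0", "00A1", "00A2", "00B0", "2013", "2014"]));
```

The final flush after the loop matters: without it, a list that ends in the middle of a run would lose that run.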

You will notice one addition from the earlier discussion: by default we include the range 00-FF, which covers ASCII (00-7F) plus the Latin-1 Supplement block (80-FF), to avoid stripping out regular characters. You may want to narrow or expand this range depending on the character set you are working with.

Usage is very simple:

$content = \Chirp\PDFTeXHelper::strip_undefined_chars($content);

And on subsequent calls from the same script, a cached version of the white-list is used so that we're not constantly querying the system files.

You could also store the regular expression on your system as a static file, as long as you remember to refresh it whenever new pdfTeX libraries are installed.
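
One possible shape for such a file cache is sketched below. The function name cached_regex(), the builder callback, and the cache file path are all assumptions made for this example. Refreshing the cache is left as simply deleting the file.

```php
<?php
// Sketch: keep the generated pattern in a file so it is built only once.
// cached_regex() and the $build callback are hypothetical names.
function cached_regex(string $cache_file, callable $build): string
{
    if (is_readable($cache_file)) {
        $cached = file_get_contents($cache_file);
        if ($cached !== false && $cached !== '') {
            return $cached; // reuse the stored pattern
        }
    }

    $regex = $build(); // the expensive step, e.g. scanning the .dfu files
    file_put_contents($cache_file, $regex, LOCK_EX);

    return $regex;
}

// Example use with a stand-in builder and an assumed cache location.
$cache_file = sys_get_temp_dir() . "/pdftex-whitelist.regex";
$regex = cached_regex($cache_file, function () {
    return '/[^\x{00}-\x{7F}\x{2013}\x{2014}]/u';
});
```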
