Rank: Newbie Groups: Member
Joined: 4/23/2015 Posts: 1
|
We have encountered an issue with the font encoding used during conversion of Lotus Notes documents to PDF/A using the HTML Converter (EO.PDF for .NET). After conversion, we are running regular expressions (REGEX) against these PDFs.
While a space or hyphen in the converted PDF appears valid visually (it renders as a space or hyphen), it has been assigned a different character code which our default REGEX scan does not recognize. Spaces are converted to code 120; hyphens are converted to code 162.
After conversion, all fonts used within the document appear to be of type TrueType (CID) with encoding Identity-H. This has caused our regular expressions to miss certain characters, as described above.
Are there any plans for the Essential Objects product to allow clients to change the encoding used in the conversion process?
We are running regular expressions to scan these documents for known string formats such as credit card numbers, SSNs, etc. To ensure our REGEX works in all cases, we are updating our code to account for these variations of spaces and hyphens. We are concerned there may be other characters for which we should implement this workaround.
Is anyone aware of any mapping document between commonly seen characters (hyphen, comma, whitespace etc.) and their Identity-H counterparts?
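For illustration, the workaround described above could look roughly like the following Python sketch. It assumes the text extractor returns the remapped separators as characters with the code points reported in this post (120 and 162); the pattern and the names `SSN_PATTERN` and `find_ssn_like` are hypothetical and not part of EO.PDF.

```python
import re

# Codes reported in this post: space -> 120, hyphen -> 162.
# Assumption: the extractor surfaces them as characters with those code points.
ALT_SPACE = chr(120)
ALT_HYPHEN = chr(162)

# Character class accepting the normal and the remapped separators.
# The literal hyphen goes last so it is not read as a range.
SEPARATOR = f"[ {ALT_SPACE}{ALT_HYPHEN}-]"

# Hypothetical SSN-like pattern: 3 digits, separator, 2 digits, separator, 4 digits.
SSN_PATTERN = re.compile(r"\d{3}" + SEPARATOR + r"\d{2}" + SEPARATOR + r"\d{4}")


def find_ssn_like(text):
    """Return every SSN-like substring, whichever separator variant is used."""
    return [match.group(0) for match in SSN_PATTERN.finditer(text)]
```

Note that code point 120 is also the ordinary letter "x" in Unicode text, so widening the separator class this way can produce false positives in documents that mix both encodings.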
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,421
|
Hi,
We do not have any plans to support other encodings. The encoding is mostly determined by the font. Windows gives us TrueType font data, and TrueType fonts always use Identity-H, so no other encoding is appropriate. As such, changing the encoding on our end is not the proper way to address your issue. You will need to look into the TrueType font data in order to find the character values.
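As a rough sketch of what looking into the font data can mean in practice: a TrueType font carries a cmap table that maps Unicode code points to glyph indices, and with Identity-H the codes in the PDF are typically those glyph indices. Inverting the table gives glyph index to character. The Python below assumes the cmap has already been read into a dictionary (for example with a font-parsing library); `SAMPLE_CMAP` is a made-up excerpt for illustration, seeded with the two codes from the question, and is not taken from any real font.

```python
def invert_cmap(cmap):
    """Build a glyph index -> character table from a code point -> glyph index table."""
    glyph_to_char = {}
    for code_point, glyph_id in sorted(cmap.items()):
        # Several code points can share one glyph; keep the lowest code point.
        glyph_to_char.setdefault(glyph_id, chr(code_point))
    return glyph_to_char


def decode_glyphs(glyph_ids, glyph_to_char):
    """Turn a sequence of glyph indices into text; unknown glyphs become U+FFFD."""
    return "".join(glyph_to_char.get(glyph_id, "\ufffd") for glyph_id in glyph_ids)


# Hypothetical cmap excerpt (Unicode code point -> glyph index).
# Only 120 (space) and 162 (hyphen) come from the question; the rest is invented.
SAMPLE_CMAP = {0x20: 120, 0xA0: 120, 0x2D: 162, 0x31: 20, 0x32: 21}

GLYPH_TO_CHAR = invert_cmap(SAMPLE_CMAP)
```

With such a table, `decode_glyphs` can convert extracted codes back to ordinary text before any regular expression runs, so the patterns themselves do not need to change. If the PDF includes a ToUnicode map, a text extractor that honors it may make this step unnecessary.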
Thanks!
|