Problems with ligatures
George Harpur
Posted: Wednesday, April 11, 2018 12:37:47 PM
Rank: Newbie
Groups: Member

Joined: 3/19/2018
Posts: 6
I'm converting HTML to PDF with Eo.Pdf, and subsequently need to extract the text from the generated PDF. I'm finding that certain pairs of characters (such as 'ft', 'ff', 'tt') are being converted to ligatures, as would be expected in a modern HTML renderer, but I'm then struggling to get those characters back from the PDF, even when using techniques which claim to be 'ligature-aware'.

The easiest way to solve this would probably be to prevent the PDF from containing ligatures in the first place. I'm guessing this would be possible by injecting a piece of CSS (setting the 'font-variant-ligatures' CSS property to 'none' would hopefully do this), but I'm unsure of the easiest way to implement that.

Even so, this probably wouldn't be a universal solution (since it's possible that CSS in the converted file would override the setting), so understanding why the ligatures in the PDF are not being written in a way that's accessible to PDF text extractors would also be useful. Alternatively, a way to globally disable ligature creation would be great.

Thanks in advance for any help you can provide!
George Harpur
Posted: Thursday, April 12, 2018 9:50:00 AM
A bit more information on this after inspecting the PDF: when looking at the glyph mappings in the PDF for a document containing the ligatures 'ff', 'ft' and 'tt', only the fl ligature has a ToUnicode mapping. The other two ligatures in the font are just mapped to U+0000.

This is a serious issue (which for example means that the PDF is not PDF/A compliant) because while the document displays correctly, it is impossible for any screen reader or other text extraction software to ever correctly retrieve the content from the PDF in this case.
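
One way to spot the problem before a PDF reaches downstream tools is to scan the text an extractor returns for characters that suggest an unmapped glyph. This is only a minimal sketch: it assumes the extractor passes U+0000 through (or substitutes U+FFFD) when a glyph has no usable ToUnicode entry, and extractors differ in what they actually emit. The class and method names are hypothetical and not part of EO.Pdf or any PDF library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of EO.Pdf or any PDF library.
public static class ExtractedTextCheck
{
    // Returns the indexes of characters that suggest a glyph had no usable
    // Unicode mapping: U+0000 (the value seen in the glyph mappings) and
    // U+FFFD (the replacement character some extractors may emit instead).
    public static List<int> FindUnmappedPositions(string extractedText)
    {
        var positions = new List<int>();
        if (extractedText == null) return positions;

        for (int i = 0; i < extractedText.Length; i++)
        {
            char c = extractedText[i];
            if (c == '\u0000' || c == '\uFFFD')
                positions.Add(i);
        }
        return positions;
    }
}
```

An empty result does not prove the text is complete, since an extractor may also drop unmapped glyphs silently.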

(I don't believe I can add attachments here, so I'll email a sample document to support referencing this topic).
eo_support
Posted: Thursday, April 12, 2018 3:08:13 PM
Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,083
Hi,

Thanks for bringing this to our attention. This does appear to be a problem. It's normal for ligatures to have no corresponding Unicode mapping because a ligature is just a glyph for display/printing purposes, not a character.

The CSS declaration font-variant-ligatures:none does turn off ligatures and produces the right result. You can apply this style without modifying your source HTML file with the following code:

Code: C#
//Use an HtmlToPdfSession object because we need to have access
//to the underlying EO.WebBrowser.WebView object through the
//HtmlToPdfSession object's RunWebViewCallback function
using (HtmlToPdfSession session = HtmlToPdfSession.Create())
{
    session.RunWebViewCallback((webView, obj) =>
    {
        //Load the Url to be converted
        webView.LoadUrlAndWait(url);

        //Use JavaScript to create a style node that is the equivalent of
        //the following CSS block
        //<style>
        // * { font-variant-ligatures:none; }
        //</style>
        webView.EvalScript(@"
(function()
{
var style = document.createElement('style');
style.appendChild(document.createTextNode('* { font-variant-ligatures:none; }'));
document.head.appendChild(style);
})();
");

        return null;
    }, null);

    //Now perform the conversion
    session.RenderAsPDF(result_pdf_file);
}


As you noticed, this can still be overridden by the end user. We could change our code to turn it off permanently, but that could affect other users who do want it to be on. So we would rather leave it as is.
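
To make the injected rule harder for page CSS to override, the style text can carry "!important". The following is a minimal sketch under that assumption; LigatureCss and BuildInjectionScript are hypothetical names, and the helper only builds the JavaScript string that would be passed to EvalScript in the callback above.

```csharp
using System;

// Hypothetical helper; it only builds a script string and calls no EO.Pdf API.
public static class LigatureCss
{
    // "!important" lets this rule win over normal page declarations, but a
    // page rule that is itself "!important" can still override it.
    public const string NoLigaturesCss =
        "* { font-variant-ligatures: none !important; }";

    // Builds a script that appends a <style> element containing the CSS.
    public static string BuildInjectionScript(string css)
    {
        if (css == null) throw new ArgumentNullException(nameof(css));

        // Escape the CSS so it is safe inside a single-quoted JS string.
        string literal = css
            .Replace("\\", "\\\\")
            .Replace("'", "\\'")
            .Replace("\r", "")
            .Replace("\n", " ");

        return "(function(){"
             + "var style=document.createElement('style');"
             + "style.appendChild(document.createTextNode('" + literal + "'));"
             + "document.head.appendChild(style);"
             + "})();";
    }
}
```

Inside the callback, the call would then read webView.EvalScript(LigatureCss.BuildInjectionScript(LigatureCss.NoLigaturesCss)). This is still not a guarantee: a page can re-enable ligatures with its own "!important" rule or through other font properties.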

Thanks!
George Harpur
Posted: Thursday, April 12, 2018 3:54:34 PM
That works well - thanks! I'd been working along similar lines by trying to set a UserStyleSheet in the BrowserOptions, which might be a cleaner workaround, but I'm guessing there's no way to get to the WebView or Engine prior to creation?

This is good enough for now, but I do think the lack of a Unicode mapping for ligatures is a serious flaw that needs looking at. There are Unicode characters for all common ligatures, and the PDF/A standard does require them to be present (see https://www.pdfa.org/improved-pdfa-1b/ and search for 'ligatures'). It's strange that the mapping is there already for one of the ligatures but not the others, so this must already be partially implemented.
eo_support
Posted: Thursday, April 12, 2018 6:05:34 PM
Hi,

UserStyleSheet is available on the BrowserOptions object that can be passed in when the WebView is created, but the PDF engine reuses WebView objects for different conversions, which makes it difficult to expose the underlying UserStyleSheet to the HTML to PDF interface.

A ligature can have a Unicode mapping, but that won't really resolve the copy-and-paste issue. For example, even if the combined "fi" gets its own Unicode value, when it's copied over it won't be the same as "f" followed by "i".

Thanks!
George Harpur
Posted: Thursday, April 12, 2018 6:15:03 PM
Understood. To be clear, I was using copy and paste as a simple technique to see what's happening. In practice, we're using tools that are aware of Unicode ligatures and will break them back into their component characters, but if no mapping is provided this becomes impossible.
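
For context, the kind of post-processing meant here can be sketched as follows. This is not the tool we actually use, just a minimal illustration that assumes the extractor returns the Unicode ligature characters U+FB00 to U+FB06; LigatureText and Expand are hypothetical names. It relies on .NET's string.Normalize with compatibility normalization (NFKC) to split each ligature into its letters.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class LigatureText
{
    // Replaces the Latin ligature characters U+FB00..U+FB06
    // (ff, fi, fl, ffi, ffl, long s-t, st) with their component letters.
    // All other characters are copied unchanged.
    public static string Expand(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length + 8);
        foreach (char c in text)
        {
            if (c >= '\uFB00' && c <= '\uFB06')
                // NFKC applies the compatibility decomposition, e.g. U+FB01 -> "fi".
                sb.Append(c.ToString().Normalize(NormalizationForm.FormKC));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Restricting the normalization to this range avoids rewriting other compatibility characters elsewhere in the text. Note that this only helps when the PDF maps the glyph to a ligature code point in the first place. As far as I know, Unicode has no dedicated code points for 'ft' or 'tt', so for those the mapping would need to list the component letters.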
eo_support
Posted: Thursday, April 12, 2018 6:23:47 PM
This may not be possible. We rely on Chromium's rendering engine to render the PDF file, and if that does not write the ligature code, then it won't be there. Moreover, even if the code wants to write the Unicode value, it may not be able to do so because that would depend on the font. Specifically, inside the font there is a "glyph" to "unicode" map. Typically this map contains all "normal" characters, but it may not contain values for ligatures. There are also cases where a font has its own fancy ligatures that are not widely recognized. In those cases you definitely won't have a Unicode value. So the safest way is probably to turn ligatures off.

