Bug 106755

Summary: tifinagh rendering problems
Product: LibreOffice Reporter: martin_hosken
Component: LinguisticAssignee: ⁨خالد حسني⁩ <khaled>
Status: RESOLVED FIXED    
Severity: normal CC: caolan.mcnamara, erack, khaled, vsfoote
Priority: medium    
Version: Inherited From OOo   
Hardware: All   
OS: All   
Whiteboard: target:5.4.0 target:5.3.3
Crash report or crash signature: Regression By:
Bug Depends on:    
Bug Blocks: 71732    
Attachments: test text data
Tifinagh test font

Description martin_hosken 2017-03-24 16:58:47 UTC
Created attachment 132129 [details]
test text data

Tifinagh rendering problems. The enclosed font and text render happily in firefox, and shape correctly with hb-shape (producing appropriate looking ligatures of the diacritic over the base) for either OpenType or Graphite. But the rendering is wrong (diacritics to the right of the base, with no ligatures) in libo 5.3. Not sure why
Comment 1 martin_hosken 2017-03-24 17:00:28 UTC
Created attachment 132130 [details]
Tifinagh test font
Comment 2 ⁨خالد حسني⁩ 2017-03-28 15:55:45 UTC
Seems like a bug in the layer above VCL; we are asked to layout the bases and the combining marks separately. If I added something like this early in CommonSalLayout::LayoutText():

SAL_DEBUG(rArgs.mrStr << " => " << rArgs.mrStr.copy(rArgs.mnMinCharPos, rArgs.mnEndCharPos - rArgs.mnMinCharPos));

I get this for the Tifinagh text:

debug:27777:1: ⴰ̂ => ⴰ
debug:27777:1: ⴰ̂ => ̂
debug:27777:1: ⴰ̆ => ⴰ
debug:27777:1: ⴰ̆ => ̆
debug:27777:1: ⵓ̂ => ⵓ
debug:27777:1: ⵓ̂ => ̂
debug:27777:1: ⵢ̂ => ⵢ
debug:27777:1: ⵢ̂ => ̂
debug:27777:1: ⵢ̣ => ⵢ
debug:27777:1: ⵢ̣ => ̣
debug:27777:1: ⵧ̂ => ⵧ
debug:27777:1: ⵧ̂ => ̂

I’m guessing that the code that itemizes text into simple/complex/asian scripts does not handle combining marks correctly and considers them a separate script from Tifinagh.
Comment 3 ⁨خالد حسني⁩ 2017-03-28 17:43:07 UTC
After a bit of debugging I think I found the root of this. Basically characters from Combining Diacritical Marks block are classified as ScriptType::LATIN when they should have been ScriptType::WEAK. This comes from i18npool/source/breakiterator/breakiteratorImpl.cxx:

BreakIteratorImpl::getScriptClass() which calls getCompatibilityScriptClassByBlock() which in turn checks the scriptList array that assigns all blocks from UBLOCK_BASIC_LATIN to UBLOCK_ARMENIAN as ScriptType::LATIN and this includes UBLOCK_COMBINING_DIACRITICAL_MARKS.

This comes from https://gerrit.libreoffice.org/gitweb?p=core.git;a=commitdiff;h=bf355b97ee4b53e38975f0e1847eda5b3e05f920 to fix bug 38095. I don’t know what old classification was and whether it indeed classified combining marks as Latin, but either way it does not make terrible since.

If I modify scriptList to skip UBLOCK_COMBINING_DIACRITICAL_MARKS, then I get correct rendering here.
Comment 4 ⁨خالد حسني⁩ 2017-03-28 17:55:43 UTC
Actually this have been broken for as long as this code have been around, https://gerrit.libreoffice.org/gitweb?p=core.git;a=commitdiff;h=c5cd37730e145b86129c6503ae5661820ae042e7 (look for UnicodeScript_kCombiningDiacritical). I say we should just fix it.
Comment 5 ⁨خالد حسني⁩ 2017-03-28 18:41:50 UTC
https://gerrit.libreoffice.org/#/c/35811/
Comment 6 martin_hosken 2017-03-28 19:04:26 UTC
Yes, definitely. Perhaps as we do that, we could use uscript_getScript() and get a more accurate categorisation. It'll mean a new list of script values, but that's not hard. I would be tempted to do a simple array of script code to category, which would be quicker and only 174 bytes. It would start out something like:
#define W ScriptType::WEAK
#define C ScriptType::COMPLEX
#define A ScriptType::ASIAN
#define L ScriptType::LATIN
{W, W, C, L, W, A, L, L,
 L, C, C, C, L, L, L, C,
 C, A, A, C, A, C, A, C,
 C, L, C, C, C, L, L, C,
 C, C, C, C, C, C, C, C,
 L, A, C, C, C, C, L, C,
 C, C, C, C, C, C, A, C,
 C, C, C, C, C, C, C, C,
 C, C, C, C, L, C, C, C,
 C, A, A, C, C, C, C, C,
 L, L, C, C, C, C, C, C,
 C, C, C, C, C, C, C, C,
 C, C, C, C, C, C, W, W,
 C, A, C, C, C, C, C, C,
 C, C, C, C, C, C, C, A,
 C, C, C, C, C, C, C, C,
 C, L, C, C, C, C, C, C,
 C, C, C, C, C, C, C, C,
 C, C, C, C, C, C, C, C,
 C, C, A, C, C, C, C, C,
 C, C, C, C, C, C, C, C,
 C, C, C, C, A, A, L}

But this needs some review, especially later on in the list.
Comment 7 ⁨خالد حسني⁩ 2017-03-28 19:37:34 UTC
(In reply to martin_hosken from comment #6)
> Yes, definitely. Perhaps as we do that, we could use uscript_getScript() and
> get a more accurate categorisation.

This was done before and lead to bug bug 38095, and was then fixed by first doing the old, by block, categorization that OpenOffice/LibreOffice did before and falling back uscript_getScript() for blocks not covered by the old method.

But may be we should go again the full uscript_getScript() way, and special case a few ranges of characters that are causing issues.
Comment 8 Commit Notification 2017-03-28 20:04:10 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=55ddbfc610d2a00e565ca7bcb0277da33bb90947

tdf#106755: Fix script type for combining marks

It will be available in 5.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 9 Commit Notification 2017-03-29 08:11:30 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "libreoffice-5-3":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=1c5d8ea51f9da89d76b2b2d3bb896225a6ac9dca&h=libreoffice-5-3

tdf#106755: Fix script type for combining marks

It will be available in 5.3.3.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.