Bug 125596

Summary: DOCX: Writer misidentifies text language (and the appropriate font) in MS Word file (MSO2019)
Product: LibreOffice
Reporter: Ratchanan Srirattanamet <peat>
Component: Writer
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW
Severity: normal
CC: aron.budea, himajin100000, os, xiscofauli
Priority: medium
Keywords: filter:docx
Version: 6.0.7.3 release
Hardware: All
OS: All
Whiteboard:
Crash report or crash signature:
Regression By:
Bug Depends on:
Bug Blocks: 104520
Attachments: The DOCX file which has the problem
All screenshots from MS Office 2019 and LibreOffice 6.4.2.4

Description Ratchanan Srirattanamet 2019-05-30 17:08:45 UTC
Created attachment 151787 [details]
The DOCX file which has the problem

Steps to reproduce:
1. Download the font "TH Sarabun New" from [1]. The font is licensed under GPL 2.0 + font exception.
2. Open the attached DOCX document. The text is configured to use "TH Sarabun New" as the complex (Thai) font, and "Liberation Sans" as the western font. Both of them are 16 pt.

Expectation: The Thai text (the word "ไทย") and most of the dots (".") are displayed using "TH Sarabun New", while the English text (the word "English"), the pipes ("|"), and the dots between the pipes are displayed using "Liberation Sans". The whole text fits on one line. MS Word 2019 shows this expected behavior. (See the screenshots.)

Actual result: The Thai text is displayed using "TH Sarabun New", while the English text, all dots, and the pipes are displayed using "Liberation Sans". The whole text does not fit on one line.

The problem is reproducible on:
- LO 6.0.7-0ubuntu0.18.04.6 from Ubuntu 18.04.
- LO 6.2.4.2 on Ubuntu 18.04, Snap and AppImage.
- LO 6.2.4.2 on Windows 10 version 1903 (build 18326.86)

This is important because most Thai fonts use different font metrics than Western fonts. For historical reasons [2], Thai fonts treat the point size as the line height. Because Thai script places marks above and below the base character, Thai fonts are usually about 30% smaller than Western fonts at the same point size. [3]

Adding to this problem, MS Word determines the language of text from the keyboard layout that is active when the text is typed, not from the characters themselves. For example, typing a dot (".") with a Thai keyboard layout makes that dot Thai, while typing a dot with an English keyboard layout makes that dot English. MS Word seems to record this information in the file, and LO seems to be unable to read it. So, when LO opens the file, it displays the text using the wrong font with different font metrics, causing the document's layout to change.
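
To make the difference concrete, here is a minimal C++ sketch of the two behaviors described above. It is only an illustration, not code from Word or LO: the `Run` struct, the `typedWith` field, and the sample text are all hypothetical stand-ins, and the sample is only loosely modelled on the attached document. Thai characters and ASCII letters are classified by content; a dot or pipe takes the script of the keyboard layout it was typed with only when that tag is honoured.

```cpp
#include <cassert>   // only needed for the checks in the usage note
#include <cstddef>
#include <string>
#include <vector>

// Simplified script classes; a real implementation has more.
enum class Script { Western, Complex };

// A run of text plus the script implied by the keyboard layout it was typed with.
// "typedWith" is a hypothetical stand-in for whatever Word stores in the file.
struct Run {
    std::u32string text;
    Script typedWith;
};

// The Thai Unicode block is U+0E00..U+0E7F.
bool isThai(char32_t c) { return c >= 0x0E00 && c <= 0x0E7F; }

bool isAsciiLetter(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
}

// Thai characters and ASCII letters are decided by content. Everything else
// (dots, pipes) is ambiguous: it follows the layout tag when honourTag is true,
// and falls back to Western otherwise (the result observed in LO).
Script resolve(char32_t c, Script layoutTag, bool honourTag) {
    if (isThai(c)) return Script::Complex;
    if (isAsciiLetter(c)) return Script::Western;
    return honourTag ? layoutTag : Script::Western;
}

// Count the dots that would be rendered with the complex (Thai) font.
std::size_t dotsInThaiFont(const std::vector<Run>& runs, bool honourTag) {
    std::size_t n = 0;
    for (const Run& r : runs)
        for (char32_t c : r.text)
            if (c == U'.' && resolve(c, r.typedWith, honourTag) == Script::Complex)
                ++n;
    return n;
}

// Hypothetical sample text, not the exact content of the attachment.
std::vector<Run> sampleRuns() {
    return {
        {U"\u0E44\u0E17\u0E22", Script::Complex},  // the word "Thai" in Thai script
        {U"....", Script::Complex},                // dots typed with a Thai layout
        {U"|.|", Script::Western},                 // typed with an English layout
        {U"English", Script::Western},
    };
}
```

With this sample, `dotsInThaiFont(sampleRuns(), true)` counts the four dots typed with the Thai layout, while `dotsInThaiFont(sampleRuns(), false)` counts none, which matches the expected and actual results described above.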

[1] http://mdresearch.kku.ac.th/files/font/THSarabunNew.zip
[2] http://thep.blogspot.com/2016/02/thai-font-metrics.html (In Thai)
[3] However, some Thai fonts, mostly those from the Thai Linux Working Group (TLWG), now use the new metric, which treats the point size as the character size. This gives those fonts the same size as Western fonts. See [2].
Comment 1 Ratchanan Srirattanamet 2019-05-30 17:09:49 UTC
Created attachment 151788 [details]
All screenshots from MS Office 2019 and LibreOffice 6.4.2.4
Comment 2 Usama 2019-06-16 03:26:31 UTC
Hello Ratchanan,

Thank you for reporting the bug. I can confirm that the bug is present in master.

Version: 6.3.0.0.alpha1+
Build ID: 77ae0abe21f672cf4b7d2e069f1d40d20edc49a7
CPU threads: 4; OS: Linux 4.9; UI render: default; VCL: gtk3; 
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2019-05-31_15:33:33
Locale: en-GB (en_GB.utf8); UI-Language: en-US
Calc: threaded
Comment 3 Xisco Faulí 2019-06-28 15:27:42 UTC
I remember seeing the same document in another report.
Where did you get it from ?
Comment 4 Ratchanan Srirattanamet 2019-07-04 17:39:02 UTC
(In reply to Xisco Faulí from comment #3)
> I remember seeing the same document in another report.
> Where did you get it from ?

I didn't take it from anywhere. I created it by myself.
Comment 5 Aron Budea 2020-11-23 04:56:42 UTC
I'm assuming the keyword bibisectRequest was added by mistake; if not, please re-add it with an explanation.
Comment 6 Justin L 2022-06-22 16:16:11 UTC
Reproduced in 7.5+; the same is true for the DOC format.

writerfilter/source/dmapper/DomainMapper.cxx:
        case NS_ooxml::LN_CT_Fonts_hint :
            /*  assigns script type to ambiguous characters, values can be:
                NS_ooxml::LN_Value_ST_Hint_default
                NS_ooxml::LN_Value_ST_Hint_eastAsia
                NS_ooxml::LN_Value_ST_Hint_cs
             */
            //TODO: unsupported?
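
For illustration only, a standalone C++17 sketch of what honouring the hint could look like. This is not LibreOffice code and not a proposed patch: `FontHint`, `FontSlot`, `parseHint`, and `slotForAmbiguous` are hypothetical names, and the attribute strings "default", "eastAsia", and "cs" are assumed from the three enum values listed in the comment above.

```cpp
#include <cassert>   // only needed for the checks in the usage note
#include <optional>
#include <string_view>

// The three hint values listed in the DomainMapper.cxx comment.
enum class FontHint { Default, EastAsia, ComplexScript };

// Font slots a character can be rendered with (hypothetical names).
enum class FontSlot { Ascii, EastAsia, ComplexScript };

// Parse the attribute text; returns std::nullopt for anything unrecognised.
std::optional<FontHint> parseHint(std::string_view value) {
    if (value == "default")  return FontHint::Default;
    if (value == "eastAsia") return FontHint::EastAsia;
    if (value == "cs")       return FontHint::ComplexScript;
    return std::nullopt;
}

// Assumed policy: a character whose script is ambiguous (such as a dot)
// follows the hint; the default hint keeps the Western/ASCII font.
FontSlot slotForAmbiguous(FontHint hint) {
    switch (hint) {
        case FontHint::EastAsia:      return FontSlot::EastAsia;
        case FontHint::ComplexScript: return FontSlot::ComplexScript;
        case FontHint::Default:       break;
    }
    return FontSlot::Ascii;
}
```

Under these assumptions, a dot in a run whose hint is "cs" would get the complex-script font ("TH Sarabun New" in the attached document), and a run with no hint or with "default" would keep the Western font.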