Summary: | PDF: Arabic text gets deformed when creating a PDF in LibreOffice Writer (Linux-only) | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | vaaydayaasra |
Component: | Printing and PDF export | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | RESOLVED NOTOURBUG | ||
Severity: | normal | CC: | ilmari.lauhakangas, khaled |
Priority: | medium | ||
Version: | 5.4.6.2 release | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Crash report or crash signature: | Regression By: | ||
Bug Depends on: | |||
Bug Blocks: | 103378 | ||
Attachments: | PDF created with LO 5.4.6.2 where textual content is garbled; Test PDF with various fonts | ||
Description
vaaydayaasra
2018-08-30 12:25:49 UTC
Created attachment 144554 [details]
PDF created with LO 5.4.6.2 where textual content is garbled
Repro. Searching in the PDF only succeeds with individual glyphs.

Arch Linux 64-bit
Version: 6.2.0.0.alpha0+
Build ID: 8b1501d80dc9d3f42c351c6e026fa737e116cae5
CPU threads: 8; OS: Linux 4.18; UI render: default; VCL: gtk3_kde5;
Locale: fi-FI (fi_FI.UTF-8); Calc: threaded
Built on 23 September 2018

Dear vaaydayaasra,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT
  Update the version field
  Reply via email (please reply directly on the bug tracker)
  Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Still reproducible on:

Version: 6.3.2.2
Build ID: libreoffice-6.3.2.2-snap1
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3;
Locale: fi-FI (fi_FI.UTF-8); UI-Language: en-US
Calc: threaded

pdftotext's output is again different from my initial report, but it's still garbled:

أ ه ن ا م ه ت ي ر ا اشْت و ن اشترى بالل خمسة آالف كتاب َ

This time the beginning of the sentence (found on the last line of the output) is already quite good, though ل and ا in the ligature لا are reversed. Thus in Evince a search for بالل matches بلال. The end of the sentence, where there are diacritical vowel marks, is worse than in my initial report.

The problem seems to have been resolved on LO 7.3.0.3 on Windows 10. To test PDF output this time, I used Adobe Acrobat DC 2021.011.20039 64-bit. I haven't tested on Linux, where the problem initially appeared.

Version: 7.3.0.3 (x64) / LibreOffice Community
Build ID: 0f246aa12d0eee4a0f7adcefbf7c878fc2238db3
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: fr-FR (fr_FR); UI: fr-FR
Calc: CL

Unfortunately still reproduced on Linux.

Arch Linux 64-bit
Version: 7.4.0.0.alpha0+ / LibreOffice Community
Build ID: 8f2b1b1cb84e1ae3139eb90b8efdf61e608adbad
CPU threads: 8; OS: Linux 5.16; UI render: default; VCL: kf5 (cairo+xcb)
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
Calc: threaded Jumbo
Built on 24 February 2022

This depends heavily on the font and the PDF viewer used, and on the limitations of the PDF format. We are doing our best with what the PDF format gives us: we output a ToUnicode mapping when applicable and ActualText tagging when not. We try to limit the scope of ActualText spans so that individual characters and words can be selected and highlighted. Otherwise we could tag full paragraphs with ActualText, which would give the most fidelity in preserving the textual content, but then PDF viewers would treat the paragraph text as a black box and could no longer associate the text with the rendered glyphs (so search results can’t be highlighted, parts of the paragraph can’t be selected, and so on).

PDF is not an archival format, no matter how hard Adobe wants to sell this idea; it is first and foremost a print format, a glorified paper so to speak.

We are crippled by several issues here:
* Text in PDF is output in visual order (i.e. from left to right), while the text content is stored in logical order (the first character comes first in memory, regardless of the direction). This means any tool extracting text from a PDF needs to reverse the logical-to-visual reordering, and this process is lossy and not always reliable.
* PDF stores glyphs, not characters, so we need to handle all the complex glyph-to-character relationships; that is why the result depends on the font.
* Not all PDF viewers support ActualText tagging, and the ToUnicode mechanism can’t capture all the possible relations above.
* PDF viewers will often try to guess where the spaces are, since many PDF-producing tools don’t output space characters at all (they just position the glyphs so that they are separated visually by blank space), so sometimes kerning can be misrepresented as word spaces.

Overall I don’t think there is anything that can be done here, but if someone can attach a PDF that does better, I can have a look and see if we can learn some trick from it.

Lastly, none of this is platform dependent; if you are getting different results on different platforms, it will be because of the different fonts or PDF viewers used.

Created attachment 182459 [details]
Test PDF with various fonts
Here is a test PDF and here is the extracted text:
Adobe Acrobat Reader DC:
اش ترى بلال خمسة آلاف كتاب وَ أَنَا اشْ تَر يَْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَ نَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب و أََ ناَ اشْترَيَتْهُاَ مِنهُْ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اش ترى بلال خمسة آلاف كتاب وَأَ نَا اشْ تَر يَْ تُه اَ مِنْهُ
نْ هُ هَا مِ
تُْ رَ ي
شْ تَ ا لف ت ك ا ب أََ و نَ ا ا
آ
مسة
خ
ا ش ترى ب ا لل
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
نْهُ ا مِ نَا اشْ تَر يَْتُهَ
اش ترى بلال خمسة آلاف كتاب وَ أَ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْترََيتُْهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
ا ش تر ى بلال خ مس ة آلا ف ك تا ب وَ أَ نَا ا شْ تَر يَْ تُهَا مِ نْهُ
Apple’s Preview:
ا ش تَر ى ب لا ل خ م س ة آ لا ف ك ت ا ب َو أَ َ ن َ ا ا ْش َتَر ْي ُت َه ا ِم ن ْ ُه َ
اشترىبلالخمسةآلافكتاب َوَأََناا ْشَتَرْيُت َها ِمْنُه
اشترى بلال خمسة آلاف كتاب َو َأَ َنا ا ْش َت َر ْي ُت َها ِم ْن ُه
اشترى بلال خمسة آلاف كتاب َوأَنَا ا ْشتَ َر ْي ُت َها ِم ْن ُه
اشترى بلال خمسة آلاف كتاب وَأَنَا ا ْشتَرَيْتُهَا مِنْهُ
اشترىبلالخمسةآلافكتاب َوأََنَاا ْشتَرَيْتُ َها ِمنْ ُه
اشتَرىبلالخمسةأَلافكتاب َوأََنَااْشََتَرْيُتُهَاَ ِمْنُه خ َََََُُْْْ
شت شت ي ا رىياللمسهاالفكنابواياا رتهاِمنه
اشترى بلال خمسة آلاف كتاب َوأَنَا ا ْش َت َريْ ُت َها ِم ْن ُه
اشترىبلالخمسةآلافكتاب َوَأََنااْشَتَرْيُتَها ِمْنُه
اشترىبلالخمسةآلافكتاب َوَأَنااْشَتََرْيُتَها ِمْنُه َََََُُْْْ
اشتَرى بلال خمسة آلاف كتاب وأَنا اشتَريتها ِمنه
اشترى بلال خمسة آلاف كتاب َو َأَ َنا ا ْش َت َر ْي ُت َها ِم ْن ُه
اشترى بلال خمسة آلاف كتاب َوَأََنا ا ْشَتَرْيُت َها ِمْن ُه
اشترى بلال خمسة آلاف كتاب وَأََنَا ا ْشتَ َريْتُهَا ِمنْ ُه
اشترى بلال خمسة آلاف كتاب َوَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشتَرى يلال خمسه أَلا ف كنا ب و َأَيا ا ْشتَرينها ِمن ُه ََ َََُْ ْ
Firefox PDF viewer:
َيْتُهَا مِنْهُ َ تَر ْ نَا اشَ أَ َ ى بلال خمسة آلاف كتاب وتَر اش
ُ
نَ ا اشْ تَ رَ يْ تُ هَ ا مِ نْ ه َأَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَا اشْ تَرَيْتُهَ ا مِنْهَأَ َ اشترى بلال خمسة آلاف كتاب و
ُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْ تَرَيْتُهَا مِنْه
ُ
نَا اشْ تَرَيْتُهَا مِنْه أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَا اشْ تَرَيْتُهَ ا مِنْهَ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
َا مِنْه ُ تُه ْ َي َتَر ْ اش َ نَا أَ َ لاف كتاب و أَ ى بلال خمسةتَر اش
ُ
ْ ه ن ُِ هَ ا مت ْ ي ََ رت ْ شَ ا ا ي
َ ا َو بانك فال ا همس خ الل ي رىت شا
ُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْ تَرَيْتُهَا مِنْه
ُ
نَا اشْ تَرَيْتُهَا مِ نْهَ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
اشترى بلال خمسة آلاف كتاب وَ أَنَ ا اشْ تَرَ يْتُهَ ا مِنْه
ُ
َ يْتُهَ ا مِ نْه َ تَر ْ نَا اشَ أَ َ ى بلال خمسة آلاف كتاب وتَر اش
ُ
نَا اشْ تَرَيْتُهَا مِ نْهَ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَ ا اشْ تَ رَ يْ تُ هَ ا مِ نْ ه َ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَا اشْ تَرَيْتُهَا مِنْهَأَ َ اشترى بلال خمسة آلاف كتاب و
ُ
اشْتَرَيْتُهَا مِنْه َان َ أَ َ و كتاب آلاف خمسة بلال اشترى
ُ
ْهن ِ ُهَ ا منْ يَ َتَر ْ شَا ايَ أَ َ و بانك فلاأَ همس خ لالي ىتَر شا
Chrome PDF viewer:
ْيُت َه ِ ا منْ ُه
نَ ْ ا اشرَتَ
اشرت َ ى بالل خمسة آالف كتاب وَأ
َه ِ ا مْنُه
ُْت
ي
َ
َن ْ ا اشَتر
َ اشترى بالل خمسة آالف كتاب وَأ
ْ ُتَه ْ ا مِنُه
ي
َ
َ اشترى بالل خمسة آالف كتاب وَأَن ْ ا اشَتر
ه
تَه ْ ا مِنُ
ُْ
ي
َ
ْ نَا اشتَر
َ
َ اشترى بالل خمسة آالف كتاب وأ
اشترى بلال خمسة آلاف كتاب وََأنَا اشْت َرَيْتُه َا مِنْهُ
اشترى بالل خمسة آالف كتاب وََأنَا اشْ تَرَيْتُهَ ا مِ نْهُ
نُْه
ْهُتَا مِ
َأاَن ْ اشرَتَي
َ
اشرتى بالل مخسة الف كتاب و
ُ
ه
ْ
ن
َ ا ِم
ه
ُ
ْت
ي
َ
ر
َ
ت
ْ
ش
َ ا ا
ن
َأ
َ
اب و
كت
ف
لا
مس
خ
شت رى بلال
ا
ُْت َه ْ ا مِنُه
ي
َ
نَ ْ ا اش َتر
َ
أ
َ
اشترى بالل خمسة آالف كتاب و
ْ ُتَه ْ ا مِنُه
ي
َ
َن ْ ا اشَتر
َأ
َ اشترى بالل خمسة آالف كتاب و
َن ْ ا اشَتَرْيُت َه ْ ا مِ نُه
َ اشترى بالل خمسة آالف كتاب وأَ
ُ
ه
ْ
ن
ِ
َا م
ه
ُ
ت
ْ
ي
َ
ْ رَت
َا اش
ن
َأ
َ
اشرتى بالل خمسة آالف كتاب و
ُْت َه ْ ا مِ نُه
َري
َ اشترى بالل خمسة آالف كتاب وَأَن ْ ا اشَت
َ اشترى بالل خمسة آالف كتاب وَأ ْ نَ ا اش َتَ رْي َتُ ه ُ ا مِ نْ ه
ه
ُْ
هَ ِ ا من
ُ
ت
ْ
َري
نَ ْ ا اشتَ
َأ
اشترى بالل خمسة آالف كتاب وَ
ُ
ه
ْ
هَا مِن
ُ
ْت
تَرَي
َأنَا اشْ
َ اشترى بالل خمسة آالف كتاب و
ه
ُ
نْن
نَها مِ
ُ ْ
ي
نَنا اش ْ رَتَ
اشرَتى بالل خمسه الف كناب َ وَأ
As you can see in comment 9, results vary widely across fonts and PDF viewers. Adobe’s viewer gives the best results, but some fonts still give broken results.
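
For anyone wondering how an extractor can end up with بالل where the document says بلال, here is a minimal Python sketch of the visual-order issue described earlier in this thread. It is a toy model, not how LibreOffice, pdftotext or any viewer is actually implemented: the glyph list, the function names and the idea that each glyph carries a ToUnicode-style string are assumptions made for illustration. It assumes the word is drawn with three glyphs (beh, a lam-alef ligature, lam) that arrive left to right, and compares reversing the extracted text per character with reversing it per glyph.

```python
# Toy model of right-to-left text extraction from visually ordered glyphs.
# All names below are hypothetical; no real PDF library is involved.

LAM, ALEF, BEH = "\u0644", "\u0627", "\u0628"

# The word as stored in the document, in logical order: beh, lam, alef, lam.
expected = BEH + LAM + ALEF + LAM

# Assumed glyph run as drawn left to right (visual order). Each entry is the
# string that glyph maps to; the ligature glyph maps to two characters.
visual_glyphs = [LAM, LAM + ALEF, BEH]


def extract_per_character(glyphs):
    """Concatenate the glyph strings, then reverse character by character.

    This also reverses the characters inside the ligature's mapping.
    """
    visual_text = "".join(glyphs)
    return visual_text[::-1]


def extract_per_glyph(glyphs):
    """Reverse the order of the glyphs but keep each glyph's string intact."""
    return "".join(reversed(glyphs))


naive = extract_per_character(visual_glyphs)
careful = extract_per_glyph(visual_glyphs)

print("per character:", naive, naive == expected)
print("per glyph:    ", careful, careful == expected)
```

In this model the per-character reversal yields beh, alef, lam, lam, which is the same swapped-ligature string reported for the pdftotext and Evince results above, while the per-glyph reversal recovers the original word. Real extractors also have to deal with diacritics, mixed-direction runs and guessed spaces, which this sketch ignores.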