Bug 70423 - FILEOPEN: Unexpected Addition Of Windows Line Breaks to Linux Text File
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 4.1.2.3 release (earliest affected)
Hardware: Other Windows (All)
Importance: low minor
Assignee: Not Assigned
URL:
Whiteboard: reviewed:2022 target:7.6.0
Keywords: difficultyBeginner, easyHack, skillCpp
Duplicates: 150574
Depends on:
Blocks: Character
Reported: 2013-10-13 11:35 UTC by John B
Modified: 2024-04-06 09:22 UTC (History)
CC List: 8 users


Attachments
Sample file with 10k characters as contents (9.77 KB, text/plain)
2021-08-29 00:16 UTC, Hossein

Description John B 2013-10-13 11:35:31 UTC
Problem description: 

When loading a large Linux text file (where end-of-line characters are represented in hexadecimal by 0a) into the Windows version of Writer, Writer will spontaneously add a Windows end-of-line character (represented in hexadecimal by 0d 0a) approximately every 9900 characters in the text.  It is "approximately" every 9900 characters because Writer seems to deliberately put an end-of-line character at the first white space after the 9900-character mark.  In the file I was working with, the extra character appeared at character 9904, then again 9905 characters later, then again 9901, 9900, 9905, 9900, and 9903 characters later.  (I stopped counting at this point.)  Re-saving the file as a text file will save these extra bytes into the saved text file, and a binary compare will reveal this.

This means that lines of text are broken up weirdly in the middle of sentences.  There is no conversion of any kind between the Linux end-of-line and Windows end-of-line.  Extra characters are merely added.

If I convert all Linux end-of-line characters into Windows end-of-line characters BEFORE loading the text file into Writer, Writer does not appear to alter the file unexpectedly.
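The pre-conversion mentioned above can be done with any tool that rewrites line endings. As an illustration only, here is a minimal Python sketch; the function name `lf_to_crlf` is hypothetical and not part of the original report. It works on raw bytes and leaves any existing 0d 0a pairs untouched.

```python
import re


def lf_to_crlf(data: bytes) -> bytes:
    """Turn every bare LF (0a) into CR LF (0d 0a).

    The negative lookbehind skips LFs that already follow a CR,
    so running the function twice does not produce 0d 0d 0a.
    """
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)
```

To use it on a file, read the bytes in binary mode, pass them through `lf_to_crlf`, and write the result to a new file so that the original Linux copy stays available for comparison.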

Steps to reproduce:
1. Get a sufficiently large text file with no Windows end-of-line characters (hexadecimal representation 0d 0a).  A few hundred kilobytes should do, although I first noticed it on an 81 MB file.
2. Copy the text file so you have an original "Linux copy" and a "Windows copy" that you can play with in Writer.  (Eventually, you will perform a binary compare against the two.)
3. Load the Windows copy of the file into the Windows version of Writer.  **The changes are visible at this point if you know where to look in the text file.**  I will continue on so that you can easily see where the changes are made.  (Side Note: I used the portable version of Writer.  I did not test the Linux version.)
4. Re-save the file.  You'll probably have to "alter the file" by deleting one character and then typing that exact character back in.  (Do not use "undo".)  Save the file as a text file.
5. Do a binary compare between the Linux file and Windows file to find the exact places where Writer has altered the file.  Even a text compare should yield the problem spots.  I used Frhed for looking at the file in binary and WinMerge for my file compares.  Frhed helped me figure out that the error was occurring approximately every 9900 characters / bytes.
6. Note how the extra characters occur at the white spaces and not necessarily next to the Linux end-of-line characters.
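
The comparison in steps 5 and 6 can also be scripted instead of using a hex editor. The following is a minimal Python sketch, not part of the original report; the function names `crlf_offsets` and `gaps` are hypothetical. It assumes the original file contains no 0d 0a sequences at all, so every 0d 0a found in the re-saved copy was added by Writer.

```python
def crlf_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every CR LF (0d 0a) pair in data."""
    offsets = []
    pos = data.find(b"\r\n")
    while pos != -1:
        offsets.append(pos)
        # Continue searching after the pair just found.
        pos = data.find(b"\r\n", pos + 2)
    return offsets


def gaps(offsets: list[int]) -> list[int]:
    """Return the distances between consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]
```

Reading the re-saved Windows copy in binary mode and passing its bytes to `crlf_offsets` lists where the extra bytes sit; `gaps` then shows the spacing between them, which in the report above was roughly 9900 bytes.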

Current behavior:

Extra end-of-line characters are added to the file next to white spaces approximately every 9900 characters / bytes.  This can even be seen after the Linux text file is loaded for the first time but before the file is re-saved.

Expected behavior:

No addition of end-of-line characters even in a file as large as 81 MB.  One possible option is to convert all Linux end-of-line characters into Windows end-of-line characters.  This would require an option so the user can decide how to output end-of-line characters during the save process.

Special Note 1: This bug was reproduced not only with the 81 MB file, but a 250 kB file as well.

Special Note 2: Although unnecessary to reproduce the bug, the 81 MB file I was using came from here: http://www.imdb.com/interfaces .  Click one of the FTP sites under Plain Text Data Files.  I used the trivia.list file.
Operating System: Windows 8
Version: 4.1.2.3 rc
Comment 1 Urmas 2013-10-13 16:32:38 UTC
There are indeed paragraph breaks added when opening such files with "Text file" filter. If the proper filter is used, they are saved fine.
Comment 2 John B 2013-10-14 11:51:13 UTC
Hi Urmas.  Thanks for writing back, but the bug occurs when the file is opened, not saved.  (Saving it means the changes Writer has made are made permanent within the file.)  As far as the filter goes: Writer appears to figure out which filter to use on opening the file.  (I am given no choice as to which filter to use.)  When I change the file name to ".txt" and open it, it seems to use the same filter as when it opens it as a ".list" file.  Hope that helps.  Thanks, -- John
Comment 3 Urmas 2013-10-14 23:10:39 UTC
The 'Encoded text' filter allows specifying line breaks.
Comment 4 John B 2013-10-18 06:03:06 UTC
Urmas, it took me a while to figure out how to respond to this.  What we are really looking at is an edge case that has poor documentation, is not intuitive to the end user, and an unfortunate programmer probably had to make a judgment call.  I still consider this a bug (because Writer is silently adding paragraph marks approximately every 9900 characters without any prompts at all).  You or others on the LibreOffice team may see this differently and I accept that judgment.  In other words, I leave this up to you as to whether to close this “bug” or look into it more deeply.  If you don’t consider this a bug, then be aware there is an unintuitive limitation in Writer that you are purposely leaving in the program that changes the content of the document without letting the user know.

*For everyone else*:  I’m going to explain what happened in as close to plain English as possible hoping that it may help someone else.  Disclaimers: 1) I’m not an expert in the inner workings of LibreOffice, but I am a programmer by profession.  2) Although I am dealing with Linux files, I’m writing this explanation from a Windows perspective since I usually do my work in the Windows world.

There are two things which triggered my “bug”.

*First*: From Windows Explorer, I double clicked the file I wanted to open.  Another way to do this is to open up LibreOffice, then click on File --> Open in the menus and leave the file type set to “All Files (*.*)”.

LibreOffice Writer guesses which file type I wanted to use.  Although it guessed the file type correctly, it chose the wrong filter.  I’ll elaborate on file types and filters.

The file you work on within Writer may be saved as an open document file (this is the native format of Writer with the “ODT File” Type) or as a Word document file (with the “DOCX File” Type) or as another type of file.  To achieve this, it appears that Writer “filters” every file it saves.  It also “filters” when it opens a file as well.  This filter is an internal mechanism that you usually don’t have to concern yourself with.  It is there merely to help LibreOffice understand a file it is opening and saving.  In some form or fashion, all programs must do this.  Almost always, it can be transparently done without ever asking you questions or prompting you for input.  With word processors (like LibreOffice Writer or Microsoft Word), the filters have to be given to you sometimes, but the programmers try to make it as transparent as possible so you don’t have to answer a hundred questions every time you open a file.  On occasion and under an unfortunate set of circumstances, Writer makes the wrong choice.  Sometimes, it can’t be helped.  This is what happened in my case.

My file type was a “Text Document” and text documents can be written many different ways.  That means text documents have many different filters.  Linux computers and Windows computers save text files differently so a different filter must be used when opening each of these kinds of text files.  Different languages (English, German, etc.) write out different characters within those text files and this affects the filter used as well.

If you let LibreOffice choose the file type, it chooses the usual “Windows” filter for text files.  Generally, that is the correct choice, but in my case it is incorrect since my file came from a Linux computer.  I need to choose the “Text Encoded (*.txt)” file type.  Only then will Writer ask me what kind of filter should be used.  In this case, it is called the ASCII Filter.  I can only choose the file type when opening a file through the menus:  File --> Open.

You’ll have to figure out what kind of encoding you need.  Unfortunately, LibreOffice help is a bit sparse at this time: https://help.libreoffice.org/Common/ASCII_Filter_Options .  If the correct encoding is not used, LibreOffice will not know where to place the paragraph marks properly.

In my case, it did not properly interpret the Linux paragraph marks (or enter key characters) and the filter thought the entire file was one giant paragraph.  This is the first part as to why the “bug” takes place.

*Second*: When paragraphs are too long (specifically, sometime after 9900 characters), LibreOffice Writer silently adds paragraph marks for a reason unknown to me.  I suspect there is some kind of internal limit within Writer that forces this kind of behavior.  I suspect (although I have not tried it out) that LibreOffice cannot handle paragraphs greater than 10000 characters.  9900 characters is about 3 - 5 pages worth of material and the programmer probably didn’t think anyone would ever write such a long paragraph in a word processor.

Every program has trade offs.  A sophisticated program like LibreOffice Writer has a lot of trade offs, but the programmers have done a fantastic job to hide them from you (and me).  Unfortunately, personal experience in my programming world has shown me there are some things that are very difficult to code around.  If there is a 10000 character limit in Writer, that is probably due to some trade off the programmer made so he / she could give you better performance or so Writer could be given to you in a reasonable amount of time.  Can this one part be fixed?  Probably.  Should it be fixed?  Eventually, I think it needs to be addressed.  Is it easy to change?  Probably not.

The 9900 character limit probably seemed like a good trade off at the time it was written.  If this was a deliberate choice, then this is not truly a bug.  This is the reason why I’m willing to let this be decided by someone (like Urmas) who is better informed than I am without more of a fight.  My wish then becomes that it is better documented.

I wrote all of this here because something like this is easy enough to work around for an expert (like me), but pretty frustrating for the casual user.  My suggestion to you is to play around with those filter settings like Urmas mentioned and I explained.  Read up on what those filter settings mean by doing Internet searches.

I hope this helps someone.

-- John
Comment 6 John B 2015-04-02 13:22:31 UTC
Bug is still present.  No indication of changes in bug behavior.

O.S.: Windows 8.1
LibreOffice: 4.4.1.2
Comment 9 Mike Kaganski 2020-05-17 09:54:03 UTC
This is caused by three things:

1. Opening a text file with an incorrect paragraph break specification. Opening a text file by default uses the Text filter (not Text (encoded)) with system line endings; thus opening a Linux text file (with LF line endings) on Windows would *not* treat LFs as paragraph terminators (they would be imported as line breaks), and the whole file would become a single large paragraph;
2. Writer's ASCII filter has a hard, arbitrary limit of 10 000 characters per paragraph. It splits paragraphs about 100 characters before that boundary.
3. Writing back using the same Text filter again uses system breaks, CRLF on Windows; so all *new* ~10000-char paragraphs get separated by those CRLFs.

The only thing to solve here IMO is removing the arbitrary limit, since Writer is already able to handle 2G character long paragraphs.

Code pointer: MAX_ASCII_PARA defined in sw/inc/shellio.hxx, and used in sw/source/filter/ascii/parasc.cxx.
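
The interaction of the three points above can be illustrated with a small model. This is a hedged Python sketch written only from the description in this comment; it is not the code in parasc.cxx. `MAX_ASCII_PARA` is the name given in the code pointer, while `SPLIT_SLACK`, `split_long_paragraph` and `round_trip` are hypothetical names. The sketch assumes the forced split happens right after the first whitespace character found in the last 100 characters before the limit, or at the limit itself if there is none.

```python
MAX_ASCII_PARA = 10_000  # limit named in the code pointer above
SPLIT_SLACK = 100        # hypothetical name for the "about 100 characters" margin


def split_long_paragraph(para: str) -> list[str]:
    """Model of the forced split: break after the first whitespace found
    in the last SPLIT_SLACK characters before the limit, else at the limit."""
    soft = MAX_ASCII_PARA - SPLIT_SLACK
    pieces = []
    while len(para) > soft:
        window = para[soft:MAX_ASCII_PARA]
        ws = next((i for i, ch in enumerate(window) if ch.isspace()), None)
        cut = soft + ws + 1 if ws is not None else MAX_ASCII_PARA
        if cut >= len(para):
            break  # nothing left after the cut, so no new paragraph
        pieces.append(para[:cut])
        para = para[cut:]
    pieces.append(para)
    return pieces


def round_trip(text: str, system_eol: str = "\r\n") -> str:
    """Import text using system_eol as the paragraph break (point 1),
    split over-long paragraphs (point 2), export with system_eol (point 3)."""
    paragraphs = []
    for para in text.split(system_eol):
        paragraphs.extend(split_long_paragraph(para))
    return system_eol.join(paragraphs)
```

In this model, an LF-only text passed through `round_trip` with the default `"\r\n"` comes back with CRLF pairs inserted roughly every 9900 characters, while the same text with `system_eol="\n"` comes back unchanged, which matches the behavior reported in this bug.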
Comment 10 John B 2020-05-19 03:26:05 UTC
Hi Mike,

It's been a long time since I've looked at this item. Although I haven't looked at the code or tried to force this item in a couple of years, your explanation is succinct and looks correct.

Thank you for looking at it. :)

John
Comment 11 Madhav Gupta 2020-10-22 11:39:43 UTC
I will try to patch this as my first patch for LibreOffice
Comment 12 Xisco Faulí 2021-02-09 14:15:53 UTC
Dear Madhav Gupta,
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assign it back to yourself if you're still working on this.
Comment 13 Radhey Parekh 2021-03-01 05:36:25 UTC
As I see this task as still "NEW", I would like to take it as my first contribution in libreoffice!
Comment 14 Hossein 2021-08-29 00:08:47 UTC
(In reply to Radhey Parekh from comment #13)
> As I see this task as still "NEW", I would like to take it as my first
> contribution in libreoffice!

Dear Radhey
This bug has been in ASSIGNED status for a long time without any activity. I can help you to fix it. Do you still want to work on this issue?
Comment 15 Hossein 2021-08-29 00:16:41 UTC
Created attachment 174597 [details]
Sample file with 10k characters as contents

Opening this file in LibreOffice Writer, the last 0 is shown in a new line.
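
A comparable test file can be generated locally. The following Python sketch is an illustration only: the helper names `make_single_paragraph` and `write_sample`, and the choice of repeating digits as content, are assumptions and do not claim to reproduce the exact bytes of the attachment. The point is simply to produce a single run of characters with no whitespace and no line endings.

```python
from pathlib import Path


def make_single_paragraph(n_chars: int) -> str:
    """Return n_chars digits with no whitespace or line endings."""
    digits = "1234567890"
    repeats, rest = divmod(n_chars, len(digits))
    return digits * repeats + digits[:rest]


def write_sample(path: str, n_chars: int) -> None:
    """Write the sample as raw ASCII bytes so no line ending is added."""
    Path(path).write_bytes(make_single_paragraph(n_chars).encode("ascii"))
```

For example, `write_sample("sample.txt", 10001)` (file name hypothetical) creates a file slightly longer than the 10 000-character limit, which can then be opened in Writer to check whether the text is still split.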
Comment 16 Radhey Parekh 2021-08-29 17:20:10 UTC
(In reply to Hossein from comment #14)
> (In reply to Radhey Parekh from comment #13)
> > As I see this task as still "NEW", I would like to take it as my first
> > contribution in libreoffice!
> 
> Dear Radhey
> This bug has been in ASSIGNED status for a long time without any activity. I
> can help you to fix it. Do you still want to work on this issue?

Sure sir! Even my patch was building perfectly on Windows but it was failing on Linux. I am ready to work again on this issue. Thanks :)
Comment 17 Telesto 2021-12-31 11:12:50 UTC
Just some words of caution. The 10k limit might be an arbitrary (low) value these days, but I'm not sure that fully removing the limit is a wise decision.

LibreOffice can't handle very, very large single paragraphs. Typing an additional line of text into a file with more than 40k characters already causes lag (of, say, 1 second). [example at bug 122952]

So as long as nobody intends to solve that, I'd prefer to keep some arbitrary limit until it's safe. [But maybe I'm a little too late; the limit has been removed in multiple places already]

My 2 cents
Comment 18 Xisco Faulí 2022-05-02 14:44:51 UTC
Dear Radhey Parekh,
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assign it back to yourself if you're still working on this.
Comment 19 Hossein 2022-06-29 23:07:30 UTC
Re-evaluating the EasyHack in 2022

This issue is still relevant. Whoever wants to work on this should also take a look at the request in bug 146323, as the code that is about to be removed might be useful for that other issue.
Comment 20 Radhey Parekh 2022-08-11 16:37:04 UTC
I've uploaded a new patchset with some changes in the unit test. Kindly have a look at it. Thanks!
Comment 21 Mike Kaganski 2022-08-24 05:05:02 UTC
*** Bug 150574 has been marked as a duplicate of this bug. ***
Comment 22 Commit Notification 2023-01-02 07:36:14 UTC
Radhey Parekh committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/745898eb2af2686ffbdfdc0e44984db67b172a59

tdf#70423 Remove txtimport break in 10k chars line

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 23 Commit Notification 2023-03-06 09:39:16 UTC
László Németh committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/13e34393e4564ef67d990c6dbe1991a0a6b288dd

tdf#154000 tdf#70423 sw: fix crash/freezing with huge text files

It will be available in 7.6.0.

Comment 24 NISZ LibreOffice Team 2023-03-27 10:46:01 UTC
VERIFIED IN:
Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 67bb7f71b785d3d831ffaa47262b6cbd84e71c42
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: hu-HU (hu_HU); UI: hu-HU
Calc: CL threaded