Bug 136013

Summary: FILEOPEN Importing tsv/csv with no string delimiter causes whitespace only trailing column to corrupt
Product: LibreOffice
Reporter: Andrew Crowe <andrew>
Component: Calc
Assignee: Not Assigned <libreoffice-bugs>
Status: CLOSED DUPLICATE
Severity: normal
CC: documentfoundation
Priority: low
Keywords: bibisected, regression
Version: 4.0.0.3 release   
Hardware: All   
OS: All   
Whiteboard:
Crash report or crash signature:
Regression By:
Bug Depends on:    
Bug Blocks: 109236    
Attachments: CSV file that triggers issue
Screenshot of initial import dialog display
Screenshot of import dialog after changing settings
Screenshot after file opens in calc

Description Andrew Crowe 2020-08-22 12:33:21 UTC
Description:
When importing a TSV or CSV file without string delimiters, if the final column consists only of whitespace, corrupt data is added to that column.

If the final column is empty, the row loads correctly. The file also loads correctly if the string delimiter is set to anything, even a character that does not appear in the document.

One interesting behavior: the CSV import dialog initially shows no corruption in the preview, but the corruption appears as soon as you change any option.

Reproduced on versions 5.4, 6.4 and 7.0.

Steps to Reproduce:
1. Have a CSV/TSV file without string delimiters and with a trailing column consisting only of whitespace
2. Turn off string delimiters in the import dialog box
3. Click OK
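
The precondition in step 1 can be sketched as a small script. This is only an illustration: the contents of the attached file are not reproduced in this report, so the rows, the output file name and the helper names make_sample_tsv and last_fields are all hypothetical.

```python
# Hypothetical sample data; the real attachment may look different.
ROWS = [
    ["id", "name", "notes"],
    ["1", "alpha", "   "],  # trailing column: three spaces
    ["2", "beta", " "],     # trailing column: one space
]


def make_sample_tsv(rows=ROWS) -> str:
    """Join fields with tabs and rows with newlines, without any quoting."""
    return "".join("\t".join(row) + "\n" for row in rows)


def last_fields(text: str) -> list[str]:
    """Return the last tab-separated field of every line, unmodified."""
    return [line.split("\t")[-1] for line in text.splitlines()]


if __name__ == "__main__":
    # Write the file that would then be opened in Calc (name is an assumption).
    with open("trailing_whitespace.tsv", "w", newline="") as fh:
        fh.write(make_sample_tsv())
```

Opening the resulting file with the string delimiter box emptied corresponds to steps 2 and 3.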

Actual Results:
Right hand column contains corrupt data

Expected Results:
Right hand column blank


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Version: 7.0.0.3 (x64)
Build ID: 8061b3e9204bef6b321a21033174034a5e2ea88e
CPU threads: 24; OS: Windows 10.0 Build 19041; UI render: Skia/Vulkan; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: CL
Comment 1 Andrew Crowe 2020-08-22 12:36:47 UTC
Created attachment 164559
CSV file that triggers issue
Comment 2 Andrew Crowe 2020-08-22 12:38:20 UTC
Created attachment 164560
Screenshot of initial import dialog display
Comment 3 Andrew Crowe 2020-08-22 12:38:55 UTC
Created attachment 164561
Screenshot of import dialog after changing settings
Comment 4 Andrew Crowe 2020-08-22 12:39:27 UTC
Created attachment 164562
Screenshot after file opens in calc
Comment 5 Justin L 2020-12-15 11:43:29 UTC
Confirmed. The key is to erase the double-quote in the string-delimiter box.

Seems to have worked in LO 3.6.
Bibisected with bibisect-linux-43all to get the range https://cgit.freedesktop.org/libreoffice/core/log/?qt=range&q=a1ac2538e9b287444500618ab4d2f0f06c25cf34..19f4ebd8a54da0ae03b9cc8481613e5cd20ee1e7

Nothing clearly obvious in this range, but there are various suspicious commits involving ICU and libexttextcat.

Bad _bibisect 43all commit_ a67b874d60de1f1a44bef57a53a7b8a84db0ba58.
Comment 6 xpusostomos 2021-03-15 10:07:07 UTC
I think it's worth adding this comment here rather than opening a new bug...

If you choose tab delimited, and string quote character double quote ( " ), then the following makes it choke

f1\tf2\t"f3",xxx\tf4

What happens is that everything after f3, right to the very end of the file (no matter how many lines and fields that includes), gets dumped into one cell. Now one might argue that the above is badly formatted (should quotes end right at field end?), but this is not the right way to handle it.
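
For comparison, here is a minimal sketch of how one other parser treats that line: Python's standard csv module in its default, non-strict mode. This is not LibreOffice's import code and says nothing about how Calc should behave. It only shows one lenient recovery strategy, and the helper name parse_line is hypothetical.

```python
import csv


def parse_line(line: str) -> list[str]:
    """Parse a single tab-delimited line, using double quote as the quote character."""
    reader = csv.reader([line], delimiter="\t", quotechar='"')
    return next(reader)


if __name__ == "__main__":
    # The malformed example from this comment: text follows the closing quote.
    print(parse_line('f1\tf2\t"f3",xxx\tf4'))
```

In this mode the text after the closing quote is appended to the same field, so the line still yields four fields and the damage stays within one cell of one row.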

Another thing, it wasn't obvious to me in the gui that the string delimited dropdown list was editable. I think a dropdown list here is pointless and distracting. Everyone uses either double quote or nothing. I would argue that as soon as you select tab delimited, this field should default to blank, because as far as I can tell, the whole internet is agreed that TSV files don't have a string quote character.
Comment 7 Eike Rathke 2021-08-29 20:42:52 UTC
(In reply to xpusostomos from comment #6)
> Another thing, it wasn't obvious to me in the gui that the string delimited
> dropdown list was editable. I think a dropdown list here is pointless and
> distracting. Everyone uses either double quote or nothing.
You certainly know everyone and every usage and can be sure no one, absolutely no one, uses anything else.

> I would argue
> that as soon as you select tab delimited, this field should default to
> blank, because as far as I can tell, the whole internet is agreed that TSV
> files don't have a string quote character.
Oh yes? Is it? Could you point out such agreement? So you'd argue that embedded tabs and embedded line feeds are not possible at all in a TSV file?
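
The point about embedded tabs and line feeds can be illustrated with a short sketch. It assumes Python's standard csv module as a stand-in for any TSV producer and consumer; the helper name roundtrip and the sample row are hypothetical.

```python
import csv
import io


def roundtrip(rows):
    """Write rows as tab-delimited text with minimal quoting, then read them back."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", quotechar='"', lineterminator="\n")
    writer.writerows(rows)
    text = buf.getvalue()
    # newline="" keeps embedded line feeds intact for the reader.
    source = io.StringIO(text, newline="")
    parsed = list(csv.reader(source, delimiter="\t", quotechar='"'))
    return text, parsed


if __name__ == "__main__":
    text, parsed = roundtrip([["a\tb", "line1\nline2", "plain"]])
    print(repr(text))
    print(parsed)
```

Only the fields that contain a tab or a line feed are quoted on output, and the quote character is what lets the reader recover them as single fields.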
Comment 8 Eike Rathke 2021-08-29 20:57:12 UTC
Reproduced with 7.1.4
Appears to be fixed since 7.1.5, most likely with bug 142395.

*** This bug has been marked as a duplicate of bug 142395 ***
Comment 9 Mike Kaganski 2021-08-30 04:24:10 UTC
(In reply to Eike Rathke from comment #7)

I enjoyed comment 6 very much. It made me recall playing with MySQL's "SELECT INTO OUTFILE" [1], which writes even null bytes (and any other bytes that may appear in BLOBs), with configurable FIELDS ENCLOSED BY, LINES TERMINATED BY, and even absolutely inconsistent FIELDS ESCAPED BY, and which needed a home-grown parser [2], because they obviously didn't know what xpusostomos knew ;)

[1] https://dev.mysql.com/doc/refman/8.0/en/select-into.html
[2] https://mikekaganski.wordpress.com/2021/02/18/reading-from-mysql-data-with-blobs-dumped-to-csv/