Bug 124511

Summary: handle import of multiple gzip csv streams
Product: LibreOffice
Reporter: Jacek Pliszka <jacek.pliszka>
Component: Calc
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW ---    
Severity: enhancement
CC: erack, jacek.pliszka, vsfoote, xiscofauli
Priority: medium    
Version: 6.1.5.2 release   
Hardware: x86-64 (AMD64)   
OS: All   
Whiteboard:
Crash report or crash signature:
Regression By:
Bug Depends on:    
Bug Blocks: 109236    
Attachments: Simple example - 2 csv.gz
Simple example - 4 csv.gz
Simple example - 4 csv.gz - 2 empty

Description Jacek Pliszka 2019-04-02 20:29:09 UTC
Gzipped CSV files that pandas writes in several steps (one initial write followed by appends) do not open correctly in LibreOffice.
Python code:

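import pandas as pd

a = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
b = pd.DataFrame({'a': [3, 4], 'b': [5, 6]})

# Write the first frame with a header row, creating the file.
a.to_csv('a.csv.gz', compression='gzip', header=True, mode='w')
# Append the second frame without a header to the same file.
b.to_csv('a.csv.gz', compression='gzip', header=False, mode='a')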

The output file a.csv.gz gunzips correctly and displays correctly in less, but LibreOffice shows only the header and the first 2 data rows when opening it:


libreoffice a.csv.gz
Comment 1 Jacek Pliszka 2019-04-02 20:37:25 UTC
OK, it looks like gunzip and less can handle 2 concatenated gzip files and LibreOffice cannot. Another example:

$ gunzip -c z1.csv.gz 
,a,b
0,1,3
1,2,4
$ gunzip -c z2.csv.gz 
0,3,5
1,4,6
$ cat z1.csv.gz z2.csv.gz  > z3.csv.gz
$ gunzip -c z3.csv.gz 
,a,b
0,1,3
1,2,4
0,3,5
1,4,6

whereas LibreOffice, when it opens z3.csv.gz, shows only the contents of z1.
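The same behaviour can be reproduced without pandas or a shell. The sketch below builds two gzip members in memory, concatenates them the way `cat z1.csv.gz z2.csv.gz > z3.csv.gz` does, and compares a reader that handles all members (Python's `gzip.decompress`) with a single `zlib` decompressor, which stops at the end of the first member. That LibreOffice stops for the same reason is an assumption based on the symptom and is not confirmed in this report.

```python
import gzip
import zlib

# Two independent gzip members, mirroring z1.csv.gz and z2.csv.gz above.
part1 = gzip.compress(b",a,b\n0,1,3\n1,2,4\n")
part2 = gzip.compress(b"0,3,5\n1,4,6\n")

# Byte-wise concatenation, equivalent to: cat z1.csv.gz z2.csv.gz > z3.csv.gz
combined = part1 + part2

# gzip.decompress reads every member, like gunzip -c.
full = gzip.decompress(combined)

# A single zlib decompressor in gzip mode stops after the first member;
# the bytes of the second member are left over in unused_data.
single = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_only = single.decompress(combined)
leftover = single.unused_data

print(full.decode())
print(first_only.decode())
```

A reader that ignores the leftover bytes would show exactly the z1 contents described above.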
Comment 2 V Stuart Foote 2019-04-03 14:13:04 UTC
So the question is whether it works without the gzip/gunzip steps. Or is the issue in the concatenation?

Does a concatenated CSV get fully parsed into the Calc CSV input dialog? If not, is there an EOF or some other control character being inserted during the append?
Comment 3 Jacek Pliszka 2019-04-03 14:31:35 UTC
The issue occurs only when gzipped files are concatenated. Everything is OK when the files are not gzipped, or are gzipped after concatenation.

The only case that does not work is when the files are first gzipped and then concatenated.

pandas, gzip, less, etc. handle the case where a file consists of several gzip streams concatenated together, and it would be good if Calc could too.
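For illustration, here is a minimal sketch of what reading such a file involves: decompress one member, then restart on whatever bytes follow it, until the input is exhausted. The helper name `decompress_all_members` is hypothetical, and the sample stream (four members, two of them empty) is made up for this example rather than taken from the attachments. It is not how any particular tool implements this.

```python
import gzip
import zlib


def decompress_all_members(data: bytes) -> bytes:
    """Decompress every gzip member in `data` and join the payloads."""
    chunks = []
    while data:
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(member.decompress(data))
        if not member.eof:
            raise ValueError("truncated gzip member")
        # Bytes after this member's trailer belong to the next member, if any.
        data = member.unused_data
    return b"".join(chunks)


# Illustrative stream: four members, two of which have an empty payload.
members = [b",a,b\n0,1,3\n", b"", b"1,2,4\n", b""]
stream = b"".join(gzip.compress(m) for m in members)
result = decompress_all_members(stream)
print(result.decode())
```

Empty members need no special handling here: each one decompresses to zero bytes and the loop moves on to the next.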
Comment 4 V Stuart Foote 2019-04-03 17:23:48 UTC
OK then, I guess we are doing the correct thing, and this would be an enhancement to handle a stream of multiple gzip'd documents. The onus would be on the user to ensure the format is correct: a header on the first document, and only data in the subsequent ones, in a matching layout.

Please post a couple of sample concatenated gzip .csv streams.

But this is kind of a specialized workflow, so I am not sure it belongs as a core feature of the Calc import dialog.

@Eike?
Comment 5 Jacek Pliszka 2019-04-03 17:49:02 UTC
Created attachment 150513 [details]
Simple example - 2 csv.gz
Comment 6 Jacek Pliszka 2019-04-03 17:49:29 UTC
Created attachment 150514 [details]
Simple example - 4 csv.gz
Comment 7 Jacek Pliszka 2019-04-03 17:49:57 UTC
Created attachment 150515 [details]
Simple example - 4 csv.gz - 2 empty
Comment 8 Xisco Faulí 2019-05-02 10:53:06 UTC
Makes sense. Moving to NEW...