Bug 147914 - File over-read parsing XLS with mixed wide- and narrow-character strings
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:xls
Depends on:
Blocks:
 
Reported: 2022-03-10 23:33 UTC by rennie.degraaf
Modified: 2022-12-21 11:50 UTC

See Also:
Crash report or crash signature:


Attachments
File that reproduces the bug (107.50 KB, application/vnd.ms-excel)
2022-03-10 23:35 UTC, rennie.degraaf
XLS file with mixed string, corrected length (107.50 KB, application/vnd.ms-excel)
2022-03-10 23:36 UTC, rennie.degraaf
String block 4 header (11.05 KB, image/png)
2022-03-10 23:44 UTC, rennie.degraaf
String block 4 end (16.76 KB, image/png)
2022-03-10 23:45 UTC, rennie.degraaf
Bug in Calc (17.56 KB, image/png)
2022-03-10 23:57 UTC, rennie.degraaf

Description rennie.degraaf 2022-03-10 23:33:13 UTC
Description:
The XLS format has a maximum record length of 8224 bytes.  The maximum string length is 32767 characters (a character whose UTF-16 representation requires a surrogate pair counts as two characters).  Consequently, long strings must be split across multiple records using "continue records" (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/999fae21-d3d9-42e8-8290-639782460c67).
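
As a rough check of the arithmetic above, the following sketch computes a lower bound on the number of records a maximum-length string needs. It uses only the two limits quoted in this report and ignores per-record overhead (headers, flag bytes), so real files may need more; the function name `min_records` is made up for illustration.

```python
import math

MAX_RECORD_BYTES = 8224   # record size limit quoted in this report
MAX_STRING_CHARS = 32767  # string length limit quoted in this report


def min_records(char_count: int, bytes_per_char: int) -> int:
    """Lower bound on the records needed to hold char_count characters.

    Ignores per-record overhead, so this is an estimate, not the exact count.
    """
    return math.ceil(char_count * bytes_per_char / MAX_RECORD_BYTES)


narrow_records = min_records(MAX_STRING_CHARS, 1)  # 8-bit characters
wide_records = min_records(MAX_STRING_CHARS, 2)    # UTF-16LE code units
print(narrow_records, wide_records)
```

The narrow-character bound agrees with the four records observed in the test file described below.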

Strings are represented as "XLUnicodeRichExtendedString" objects (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/173d9f51-e5d3-43da-8de2-be7f22e119b9).  They may use either narrow (8-bit) or wide (UTF-16LE) characters; which is used by a particular string is indicated by a flag.  For whatever reason (blame some nameless dev in the 1990s), the flag is repeated in each continue record.  Consequently, it is valid for a string to start off using narrow characters and be continued by a wide character block.  Yes, this is perverse.
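
To make the per-record flag concrete, here is a minimal sketch of a decoder that honours a separate narrow/wide flag for each chunk of one string. It is a simplified model, not the real BIFF record layout: each chunk is just a hypothetical `(flag, payload)` pair, and narrow bytes are assumed to be the low bytes of UTF-16 code units (decoded here as Latin-1).

```python
def decode_chunks(chunks):
    """Join (flag, payload) chunks, decoding each chunk by its own flag.

    flag bit 0 set   -> payload is UTF-16LE (wide characters)
    flag bit 0 clear -> payload is one byte per character (narrow)
    """
    parts = []
    for flag, payload in chunks:
        if flag & 0x01:
            parts.append(payload.decode("utf-16-le"))
        else:
            # Assumption: narrow characters are code points 0x00..0xFF.
            parts.append(payload.decode("latin-1"))
    return "".join(parts)


# One logical string that starts narrow, continues wide, then narrow again.
chunks = [
    (0x00, b"abc"),
    (0x01, "d\u00e9f".encode("utf-16-le")),
    (0x00, b"ghi"),
]
text = decode_chunks(chunks)
print(text)
```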

In order to test some other software that parses XLS, I used Excel to create an XLS with a 32767-character narrow-character string ("aaa....aaa"), then opened it up using an OLE compound document hex editor ("Compound File Explorer", though the tool that you use should not matter).  My string was split across four records, as expected (in the "Workbook" OLE stream).  I changed the narrow/wide character flag byte to 0x01 (indicating wide character data) on the 2nd and 4th blocks. Since XLS uses UTF-16 for wide characters, this changes the string to "aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡".
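
The effect of flipping the flag can be reproduced on a few bytes. This sketch reinterprets the same run of 0x61 bytes both ways; decoding narrow data as Latin-1 is an assumption made for illustration.

```python
raw = b"a" * 8  # eight narrow 'a' characters, one byte each (0x61)

as_narrow = raw.decode("latin-1")   # flag 0x00: one character per byte
as_wide = raw.decode("utf-16-le")   # flag 0x01: one character per two bytes

# Each pair of 0x61 bytes becomes the single code unit U+6161.
print(as_narrow, len(as_narrow))
print(as_wide, len(as_wide))
```

The wide interpretation yields half as many characters from the same bytes, which is why the declared string length becomes too large once the flag is changed.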

However, I did *not* update the string length.  Since those two blocks are now wide characters but I did not add any additional data, the string should be shorter.  This makes the document invalid.  Excel goes into recovery mode when trying to load it.  However, Calc loads the following string:

aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡一浡ե?慖畬ť?ɡ?慡愀慡愀慡ա?慡慡ୡ?敄捳楲瑰潩੮?桓牯⁴慮敭	䰀湯⁧慮敭䄀瑬牥慮整搠獥牣灩楴湯?潓敭桴湩

Copying the extraneous data into a text file, saving it as UTF-16LE and opening it in a hex editor reveals 0x76 bytes of file data following the end of the last string block:

04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01 3F 00 61 02 3F 00 61 61 03 00 00 61 61 61 04 00 00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00 44 65 73 63 72 69 70 74 69 6F 6E 0A 3F 00 53 68 6F 72 74 20 6E 61 6D 65 09 00 00 4C 6F 6E 67 20 6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65 20 64 65 73 63 72 69 70 74 69 6F 6E 3F 00 53 6F 6D 65 74 68 69 6E
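
To see what the leaked bytes contain, this sketch parses the hex dump above and pulls out runs of four or more letters or spaces, much like the `strings` utility. The readable runs look like other text stored in the file, although this report does not establish which structure they come from. The `3F` bytes are probably `?` characters introduced by the copy-and-paste round trip.

```python
import re

# The 0x76 bytes quoted in the hex dump above.
LEAKED_HEX = (
    "04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01 3F 00 61 02 3F 00 61 61 "
    "03 00 00 61 61 61 04 00 00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00 "
    "44 65 73 63 72 69 70 74 69 6F 6E 0A 3F 00 53 68 6F 72 74 20 6E 61 6D 65 "
    "09 00 00 4C 6F 6E 67 20 6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65 "
    "20 64 65 73 63 72 69 70 74 69 6F 6E 3F 00 53 6F 6D 65 74 68 69 6E"
)

leaked = bytes.fromhex(LEAKED_HEX)

# Runs of at least four ASCII letters or spaces.
runs = [m.decode("ascii") for m in re.findall(rb"[A-Za-z ]{4,}", leaked)]
print(len(leaked), runs)
```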

I didn't try debugging into Calc to see where/how it got this data.  There might be security implications depending on how/where the over-read occurs.

I created a second version of the XLS file in which I corrected the string length.  Calc appeared to handle that file correctly.
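
The corrected length follows from counting each wide block as half as many characters as it has bytes. The sketch below shows that bookkeeping and the kind of consistency check a parser could apply before trusting a declared length. The block sizes are hypothetical (the report does not list the real ones), and `chars_available` and `length_shortfall` are illustrative names, not functions from Calc.

```python
def chars_available(chunks):
    """Count the characters held by (flag, byte_count) chunks."""
    total = 0
    for flag, byte_count in chunks:
        # Wide chunks use two bytes per character, narrow chunks one.
        total += byte_count // 2 if flag & 0x01 else byte_count
    return total


def length_shortfall(declared_chars, chunks):
    """Characters the declared length claims beyond what the chunks hold."""
    return max(0, declared_chars - chars_available(chunks))


DECLARED = 32767

# Hypothetical block sizes that add up to 32767 narrow characters.
original = [(0, 8193), (0, 8192), (0, 8192), (0, 8190)]
# Same bytes, with the flag flipped to wide on the 2nd and 4th blocks.
edited = [(0, 8193), (1, 8192), (0, 8192), (1, 8190)]

corrected_length = chars_available(edited)
shortfall = length_shortfall(DECLARED, edited)
print(corrected_length, shortfall)
```

A non-zero shortfall means a reader that trusts the declared length would try to fetch characters the string data does not contain. How far Calc actually reads in that situation is not determined here.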

I tested this using release 7.3.1.3 on Windows 10 amd64.  I expect that the same will occur on other platforms and versions since XLS is a rather old format.

Steps to Reproduce:
1. Create a malformed XLS file as described above
2. Open in Calc

Actual Results:
Over-read file data is displayed in the document as described above

Expected Results:
No over-read file data should appear.


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 7.3.1.3 (x64) / LibreOffice Community
Build ID: a69ca51ded25f3eefd52d7bf9a5fad8c90b87951
CPU threads: 2; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded
Comment 1 rennie.degraaf 2022-03-10 23:35:33 UTC
Created attachment 178786
File that reproduces the bug
Comment 2 rennie.degraaf 2022-03-10 23:36:23 UTC
Created attachment 178787
XLS file with mixed string, corrected length
Comment 3 rennie.degraaf 2022-03-10 23:37:41 UTC
Use attachment 178786 to reproduce the bug.  Attachment 178787 is a version of the file with the string length corrected; Calc appears to handle it correctly.
Comment 4 rennie.degraaf 2022-03-10 23:44:13 UTC
Created attachment 178788
String block 4 header

This screen capture from my OLE hex editor shows the beginning of string block 4.  The selected byte is the narrow/wide character flag.  0 indicates narrow character data, 1 indicates wide.
Comment 5 rennie.degraaf 2022-03-10 23:45:30 UTC
Created attachment 178789
String block 4 end

This screen capture from my OLE hex editor shows the end of string block 4 with the additional file data that Calc loads as part of the string.
Comment 6 rennie.degraaf 2022-03-10 23:57:51 UTC
Created attachment 178790
Bug in Calc

This screen capture of Calc shows the end of the string that it loads with the extraneous data.
Comment 7 rennie.degraaf 2022-03-12 05:10:02 UTC
Also confirmed on 
Version: 6.4.7.2
Build ID: 1:6.4.7-0ubuntu0.20.04.2
CPU threads: 2; OS: Linux 5.4; UI render: default; VCL: kf5; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

For comparison, Gnumeric 1.12.46 loads the file without displaying an error to the user, but appears to fail to load the file's string table and prints a couple of warning messages to the console.
Comment 8 rennie.degraaf 2022-03-14 20:26:41 UTC
Also confirmed on the oldest release that I had installed on an old VM:
Version: 5.1.6.2.0+
Build ID: 5.1.6.2-8.fc24
CPU Threads: 1; OS Version: Linux 4.11; UI Render: default; Local: en-US (en_US.UTF-8); Calc: group

Apache OpenOffice 4.1.11 has the same problem.  This bug is probably very old.
Comment 9 Buovjaga 2022-12-21 11:50:28 UTC
Confirmed

Arch Linux 64-bit, X11
Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 8389048cb41291917449e87b2901d6133bce3373
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: kf5 (cairo+xcb)
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
Calc: threaded Jumbo
Built on 21 December 2022