Bug 160081 - Performance: VLOOKUP is very slow even with 41670 rows of data
Summary: Performance: VLOOKUP is very slow even with 41670 rows of data
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
24.8.0.0 alpha0+ Master
Hardware: All Linux (All)
: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: perf
Depends on:
Blocks:
 
Reported: 2024-03-07 07:41 UTC by Kevin Suo
Modified: 2024-03-07 13:41 UTC (History)
0 users

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Suo 2024-03-07 07:41:15 UTC
Steps to Reproduce:

1. Open the test ods file in attachment 147932 [details].

The file is 145 KB in size, contains only one column with 41670 records, so it is not too big dataset.

2. Insert new Sheet2. Copy column A of sheet 1 to Sheet2.

3. Set vlookup formula in Sheet2.B2, e.g.
=VLOOKUP(A2,$工作表1.A:A,1,0)

4. Double-click the cell handle in B2 to fill formulas down.

All the 8 cpu of my Thinkpad X1 Carbon is running at 99% for this and it took 6 minutes 23 seconds to run. By comparison, the same operation takes 85 seconds in MS Excel even with 2 CPU cores in a virtualbox guest OS.

Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 98c42f7e961e77d7f1c02d53862e4e78ecd07653
CPU threads: 8; OS: Linux 6.7; UI render: default; VCL: gtk3
Locale: zh-CN (zh_CN.UTF-8); UI: zh-CN
Calc: threaded
Fedora 39 Gnome Wayland
Comment 1 Kevin Suo 2024-03-07 08:02:44 UTC
There are two areas to optimize here:

First:

The slowness seems to be related to the settings in "Options > LibreOffice Calc > Calculation > Enable wildcards in formulas". When this is disabled, it only takes 11 seconds to run. So, why is this enabled by default? The help page seems to explain that "Enable wildcards in formulas for spreadsheets that need to be interoperable with Microsoft Excel", I agree we need interoperability, but should we reconsider this:

For formula:
=VLOOKUP(A2,$Sheet1.A:A,1,0)
and any cells in column A may (or may not) a wildcard, then does the user really intended to do vlookup with wildcard? I don't think so. If the first parameter of this function is a cell reference, we can assume that we simply want to find the exact math in Sheet1.A:A. Sheet2 column A may contain millions of cells, it makes no sense to the user to include wildcards in so many cells for a vlookup.

However, for a formula:
=VLOOKUP("Foo*",$Sheet1.A:A,1,0)
We can assume that the user intended to use wildcard in the vlookup in case "Enable wildcards in formulas" is enabled. This is because, when fill the formula down to other rows, the first paramter to this function is always "Foo*", and it is not a cell reference - it is something typed in manually by the user.

Second:

Even if we still use wildcards by default, there are rooms to optimize the vlookup related codes, given that Excel only uses 85s with 2 CPUs compared with out 6min with 8 cpus.
Comment 2 ady 2024-03-07 09:05:27 UTC
First, the "A:A" entire column notation is usually a source of slowness for Calc. Try changing it to use a delimited range of rows wherever you can.

Second, in LO 7.4 some changes involving lookup functions, among other functions, were introduced in order to obtain some potential performance gain. Since then, the changes were (partially) disabled because of incorrect results. ATM, in LO 24.3.1 and 24.8 alpha, we are still awaiting to fully solve the incorrect results, including in functions such as COUNTIFS().

So, if I may, before attempting another potential performance change, as a user I would like to be able to expect correct results (and I am not talking about the typical accuracy issues). Considering that ATM, after more than a year and a half, we still can't really trust Calc, I hope first the known (already reported) problems with these functions get resolved, and some (automated) tests get introduced, before attempting new potential performance gains.

In the meantime, I would suggest trying with delimiting the range (instead of full column notation), and perhaps some other ways for you (the user) to optimize formulas, if such edition is available (which is not always the case).
Comment 3 Kevin Suo 2024-03-07 09:17:51 UTC
(In reply to ady from comment #2)

> First, the "A:A" entire column notation is usually a source of slowness for Calc. Try changing it to use a delimited range of rows wherever you can.

No, that notation does not make a difference for this performance issue. Calc will automatically determine the last allocated cell in a column if we use A:A notation. Changing the formula to:
=VLOOKUP(A2,$工作表1.A$2:A$41670,1,0)
takes the same amount of time as compared with A:A.
Comment 4 ady 2024-03-07 10:41:13 UTC
(In reply to Kevin Suo from comment #0)
> Steps to Reproduce:
> 
> 1. Open the test ods file in attachment 147932 [details].
> 
> The file is 145 KB in size, contains only one column with 41670 records, so
> it is not too big dataset.
> 
> 2. Insert new Sheet2. Copy column A of sheet 1 to Sheet2.


When reading the whole set of steps, the “copy” part in step 2 seems unneeded. Alternatively, the use of the worksheet name in the formula in step 3 seems not needed.

> 
> 3. Set vlookup formula in Sheet2.B2, e.g.
> =VLOOKUP(A2,$工作表1.A:A,1,0)
> 
> 4. Double-click the cell handle in B2 to fill formulas down.
> 
> All the 8 cpu of my Thinkpad X1 Carbon is running at 99% for this and it
> took 6 minutes 23 seconds to run. By comparison, the same operation takes 85
> seconds in MS Excel even with 2 CPU cores in a virtualbox guest OS.

Since the default setting is to recognize asterisks as wildcards, but there is such a difference in performance when comparing with Excel, I must ask…

Is the resulting data in column B in Excel exactly the same (and in the same location/order) as in Calc?

In particular, we need to know whether the wildcard in Excel is considered for the formula only, or instead, it is also considered when evaluating the content of column A, when the formula uses a reference to it (instead of using a string that includes wildcards itself).

For example, when the content of a cell in column A is:
GZZN-AJLIA-5.6/5UV-5*55

and another is:
GZZN-AJLIA-5.6/5UV-5*555

then Calc considers the asterisk "*" as a wildcard, even when the asterisk is not used directly in the formula itself.

So, is Excel considering those asterisks in column A as wildcards too? Or, instead, the asterisk must be used in the formula itself for Excel to consider it a wildcard?

As for efficiency in general, a lookup on sorted data using the sort argument should be faster (but again, that's generally speaking and without counting for current known bugs).
Comment 5 Kevin Suo 2024-03-07 11:17:58 UTC
> When reading the whole set of steps, the “copy” part in step 2 seems unneeded. Alternatively, the use of the worksheet name in the formula in step 3 seems not needed.

There are plenty of methods to reproduce this. Those steps are just one of the to reproduce the slowness. Whether the worksheet name is needed depends on in which tab you are setting the formulas. 

The point is, do you reproduce the slowness? If it takes more than 85s for you following my steps, while it takes less than 85 in Excel or other spreadsheet applications, then simply mark this as NEW and wait someone interested to improve.

> we need to know whether the wildcard in Excel is considered for the formula only, 
> or instead, it is also considered when evaluating the content of column A

The current behaviour is the same as in Excel, i.e. both Calc and Excel also consider the wildcards in the contents referenced by the formula.
Comment 6 ady 2024-03-07 13:41:42 UTC
(In reply to Kevin Suo from comment #5)

> The point is, do you reproduce the slowness? If it takes more than 85s for
> you following my steps, while it takes less than 85 in Excel or other
> spreadsheet applications, then simply mark this as NEW and wait someone
> interested to improve.

I'll mark this as NEW, but my comment 2 is very much relevant, IMO above and before this report. Solving those issues (before this one) might have an impact on performance too (maybe for the better).