Bug 155946

Summary: Guess separator for the text import dialog
Product: LibreOffice
Reporter: Eyal Rozenberg <eyalroz1>
Component: Calc
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW
Severity: enhancement
CC: erack
Priority: medium    
Version: unspecified   
Hardware: All   
OS: All   
See Also: https://bugs.documentfoundation.org/show_bug.cgi?id=152336
Whiteboard:
Crash report or crash signature:
Regression By:
Bug Depends on:    
Bug Blocks: 109239    

Description Eyal Rozenberg 2023-06-20 10:18:58 UTC
When we paste multi-line text, the Text Import dialog springs up. At the moment, it offers us a default choice of text field separator - separate by Tab.

But if we are already parsing the text to look for newlines, why not look for common separators as well?

* If the line has no tabs, definitely don't offer tabs as the default
* Ditto for spaces and commas
* Out of the remaining possible separators - apply some simple heuristic for the choice, e.g. most commonly appearing except for at start and end of line.

The specific heuristic is a matter for bikeshedding, but even "first separator encountered" is better than what we have now.
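
To make the "most commonly appearing except for at start and end of line" idea concrete, here is a minimal sketch in Python. It is not LibreOffice code; the function name guess_separator and the candidate set (Tab, comma, semicolon, space) are assumptions made only for illustration.

```python
from collections import Counter

# Assumed candidate set; their order also breaks ties (earlier wins).
CANDIDATES = ("\t", ",", ";", " ")


def guess_separator(text, candidates=CANDIDATES):
    """Return the candidate that occurs most often inside lines, or None.

    Occurrences at the very start or end of a line are ignored, and a
    candidate that never occurs is never offered.
    """
    counts = Counter()
    for line in text.splitlines():
        for sep in candidates:
            # strip(sep) drops leading/trailing runs of this separator only
            counts[sep] += line.strip(sep).count(sep)
    present = [sep for sep in candidates if counts[sep] > 0]
    if not present:
        return None
    # max() keeps the first maximal item, so candidate order breaks ties
    return max(present, key=lambda sep: counts[sep])


print(repr(guess_separator("a;b;c\n1;2;3")))
```

A None result would mean "no evidence in the text", in which case the dialog could simply keep whatever default it uses today.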
Comment 1 Eike Rathke 2023-06-20 11:14:22 UTC
Not necessarily. If standard text is pasted from the system clipboard then offering Tab is actually a good choice because any text can contain all other separators without them being actually separators, specifically comma. Furthermore, if cells are copied to clipboard and pasted as text-only they will be separated by Tab, so at least in that case it's the only sensible choice. Also note that the last choice is remembered, so whether you actually get Tab offered depends on your previous action. For the first time of dialog usage we even already try to determine a separator in the context of ending a quoted field. "apply some simple heuristic" is wishful thinking, but what exactly should that "simple" be? The "if it has no [...separator...] then don't offer it" doesn't help either, because a checked separator that isn't used in data has no effect on the import, so not offering it is just cosmetic.
Comment 2 Eyal Rozenberg 2023-06-20 20:46:10 UTC
(In reply to Eike Rathke from comment #1)
> Not necessarily. 

Not necessarily what?

> If standard text is pasted from the system clipboard then
> offering Tab is actually a good choice because any text can contain all
> other separators without them being actually separators, specifically comma.

If there are no tabs, then offering a tab is obviously not a good choice. But other than that, and like I said - any reasonable heuristic is fine by me.

> Furthermore, if cells are copied to clipboard and pasted as text-only they
> will be separated by Tab, so at least in that case it's the only sensible
> choice. 

But that's a case where the pasted text already has tabs. The point of this issue is not to assume that this is always the case, which it isn't.

> Also note that the last choice is remembered, so whether you
> actually get Tab offered depends on your previous action.

Well, yes, but the memory becomes irrelevant if the pasted text doesn't use that separator.

> For the first time
> of dialog usage we even already try to determine a separator in the context
> of ending a quoted field. "apply some simple heuristic" is wishful thinking,
> but what exactly should that "simple" be? The "if it has no
> [...separator...] then don't offer it" doesn't help either, because a
> checked separator that isn't used in data has no effect on the import, so
> not offering it is just cosmetic.

I'm not sure I follow. Of course it helps if the default choice in the dialog is of a separator that actually appears in the text rather than one which doesn't. 

Given the memory we have of the user's last choice, the simple heuristic might be: "User's last choice, unless the text doesn't have that separator (or even - unless the first line doesn't have it), in which case the first separator which appears on the first line."

That's pretty simple. Feel free to suggest something else.
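
A rough sketch of that rule in Python, for illustration only. The name default_separator, the candidate set, and the choice to fall back to the remembered separator when nothing is found are all assumptions, not existing LibreOffice behaviour.

```python
def default_separator(text, last_choice, candidates=("\t", ",", ";", " ")):
    """Pick the separator to pre-select in the import dialog.

    last_choice is the single-character separator remembered from the
    previous import, or None if there is none.
    """
    # Keep the remembered choice as long as the text actually contains it.
    if last_choice is not None and last_choice in text:
        return last_choice
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Otherwise take whichever candidate appears earliest on the first line.
    found = [(first_line.find(sep), sep) for sep in candidates if sep in first_line]
    if found:
        return min(found)[1]
    # Nothing to go on: fall back to the remembered choice (assumption).
    return last_choice


print(repr(default_separator("a;b\nc;d", "\t")))
```

The stricter variant mentioned in parentheses would test `last_choice in first_line` instead of `last_choice in text`.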
Comment 3 Eike Rathke 2023-06-21 12:10:42 UTC
(In reply to Eyal Rozenberg from comment #2)
> (In reply to Eike Rathke from comment #1)
> > Not necessarily. 
> 
> Not necessarily what?
Not necessarily this:
>> but even "first separator encountered" is better than what we have now.


> Given the memory we have of the user's last choice, the simple heuristic
> might be: "User's last choice, unless the text doesn't have that separator
> (or even - unless the first line doesn't have it), in which case the first
> separator which appears on the first line."
With that we're back to "what is considered to be a separator".
a) the arbitrary comma encountered in a sentence?
b) or only if there's not a blank following?
c) if the first comma is at the end of line, does it constitute a separator?

I'd say no to a) and yes to b) and c).
Can that be generalized also for Tab and semicolon? Probably yes.
Can it for Space? No, because it would split a sentence into fields.
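
One possible reading of the answers to a), b) and c), sketched in Python. This is an illustration under assumptions, not the existing import code, and separator_positions is a made-up name: a character counts as a separator only if no blank follows it, a trailing one at end of line still counts, and Space is never counted.

```python
def separator_positions(line, sep):
    """Return the indices in line where sep plausibly acts as a separator."""
    if sep == " ":
        # Counting spaces would split an ordinary sentence into fields.
        return []
    positions = []
    for i, ch in enumerate(line):
        if ch != sep:
            continue
        at_end = i == len(line) - 1
        # a)/b): a comma followed by a blank looks like prose punctuation;
        # c): a separator at the very end of the line still counts.
        if at_end or line[i + 1] != " ":
            positions.append(i)
    return positions


print(separator_positions("Hello, world", ","))
print(separator_positions("a,b,", ","))
```

The same function works unchanged for Tab and semicolon, which matches the "Probably yes" above.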