Stricter string escape #619

markuspf · 2016-02-16T10:30:29Z

This pull request does essentially three things (it also contains #612, because I branched it off of that PR):

refactor parsing of escaped characters so that we do not have duplicate code in GetStr and GetChar
make it a syntax error if anything follows a \ in a string that is not allowed as per documented escape sequences
clean up some of the fallout from that. Those will probably be mostly actual bugs, because things like "\!" would just have been parsed as "!" anyway.
I am a slight bit concerned about operations such as "\+", "\-", "\*", "\/". They would have been parsed just as "+", "-", "*", "/" anyway, but there was a convention of backslashing them for some reason.

WARNING: This breaks packages, in particular GAPDoc

With this patch it is possible to give character values by using the common '\xHH' escape where H is a hexadecimal digit.

While we're here, we also make the scanning of escape sequences stricter.

This was before parsed as "+", so there should not be a functional change.

ChrisJefferson · 2016-02-16T10:38:21Z

I guess writing "\+" comes from the fact that you have to write \+ in GAP code to access these operators -- however, I did assume (incorrectly) that the strings "+" and "\+" would be in some way different (which it turns out they aren't).

frankluebeck · 2016-02-16T13:33:53Z

@markuspf wrote:

WARNING: This breaks packages, in particular GAPDoc

No problem for GAPDoc, I have removed the four \s which are not needed.

I have commented on a change which I think is wrong.

It seems you have missed two cases: In lines 291 and 292 in small/smlinfo.gi the $\Phi_{ should be changed to $\\Phi_{.

Note that the escapes \< and \> are not documented because I found it very difficult to describe what they really do (they are hints for the automatic line breaking mechanism). But you have kept them as valid escape sequences, so no problem.

Otherwise it looks ok to me.

fingolfin · 2016-02-16T16:01:46Z

@frankluebeck Hmm, what do you mean "no problem for GAPDoc" -- quite clearly, with this patch, GAPDoc 1.5.1 triggers syntax errors. Or did you mean that these are easy to fix, and that in the next GAPDoc version, it won't be a problem? (Still looking forward to that new release, by the way ;-)

fingolfin · 2016-02-16T16:05:49Z

There are errors in qaos and ctbllib triggered by this.

Hmm, could we perhaps do this: Keep accepting those "invalid" sequences (treating them as before), but printing a syntax warning. Then, merge that. After that, urge authors of affected packages to fix that (might be a problem for qaos, as that was last updated in 2008). Then eventually, turn it into a syntax error.

fingolfin · 2016-02-16T16:15:20Z

Some more potentially affected packages / files (from some grepping through my pkg directory; this is probably incomplete):

CAP/examples/VectorSpaces.gi:1009:# theorem_string := "\alpha:Mor, \beta:Mor ~|~ \IsMonomorphism( \alpha ) \vdash \IsMonomorphism( \ProjectionInFactorOfFiberProduct( [ \alpha, \beta ], 2 ) )";

Convex/gap/Cone.gd:292:DeclareOperation( "\*",

LinearAlgebraForCAP/gap/MatrixCategoryMorphism.gd:71:DeclareOperation( "\*",

ModulePresentationsForCAP/gap/ModulePresentationMorphism.gd:90:DeclareOperation( "\*",

ToricVarieties/gap/ToricDivisors.gd:377:DeclareOperation( "\+",

guava-3.12/lib/decoders.gi:422:   Error("\N This method only applies to GRS codes.\n");

hecke/gap/specht.gd:131:DeclareOperation("\+",[IsAlgebraObjModule,IsAlgebraObjModule]);

frankluebeck · 2016-02-17T10:58:36Z

Rethinking this, with the discussed examples in mind, I have changed my opinion.

I have looked into the documentation of some other languages with backslash escapes. In all cases it was not documented how a not explicitly mentioned backslash-something sequence is interpreted.

In contrast: GAP did document its behaviour since 30 years and as we have seen this definition is actually used in various places. So, why should we introduce a non backward compatible change for this (for which some code that is around for decades needs to be changed)?

In some examples above it is even nice for aesthetic reasons that one can write strings in the same way as identifiers:

DeclareOperation( "\+", ...

instead of the equivalent

DeclareOperation( "+", ...

because the corresponding identifier must be escaped anyway:

\+(1,2);

PS: I still agree that the advantage of being able to specify characters in hex notation justifies the slight backward incompatibility with respect to the rarely, if ever, used \x sequence.

ChrisJefferson · 2016-02-17T11:06:12Z

@frankluebeck how about (as a middle group) we should just forbid \ before a letter? Have any cases of this turned up which were not actually mistakes?

markuspf · 2016-02-17T11:11:34Z

One reason I see why we should maybe at least warn is the bugs that were found by making this change, i.e. instances of escaping that wasn't actually escaping.

That said, I do agree that the current behaviour is documented and that I got a bit worried about cases such as "\+", though one could argue that the correct thing to use here is "\\+" (but then this becomes another breaking change I think).

I am not sure about how much more complicated we'd want to make the rules for escaping.

bh11 · 2016-02-17T11:41:59Z

I like the idea of reserving just ascii letters. The '\N' above is probably a typo, meant to be '\n'.

Alternatively, we might use \0x instead of \x for hex strings, thus allowing a previously forbidden escape sequence.

I also wouldn't mind \u or \0u meaning UTF-8 encoding (whether this makes sense in the given context or not). This is no better or worse than octal and hex characters. Would we want four or five (hex) digit unicode codes?

frankluebeck · 2016-02-17T12:36:25Z

I thought about proposing to disallow only non-documented backslash escapes with letters for later extensions. But then decided against it because I cannot see a big enough benefit to justify the change of a decades old documented behaviour.

On the other hand I agree that this "middle version" is likely to reveal more bugs than cause annoyance.

With respect to unicode I agree with a comment by @stevelinton that we should first discuss a proper concept of strings, text objects, encodings, and so on, before extending the parser.

markuspf · 2016-02-18T10:50:00Z

I will close this PR, because it is subsumed in #612.

I am of course also not opposed to even deactivating the SyntaxWarning, but that is a trivial change in #612 now.

markuspf added 8 commits February 15, 2016 19:35

Add parsing of hex literals in strings

9037dc8

With this patch it is possible to give character values by using the common '\xHH' escape where H is a hexadecimal digit.

Document hexadecimal character entry

26387d5

Test hexadecimal string entries.

8c6286d

Refactor scanning of escape sequences

823cadd

While we're here, we also make the scanning of escape sequences stricter.

Remove escape from "\+"

ab9cea7

This was before parsed as "+", so there should not be a functional change.

Correct wrong octal character escape \238 to \240

6748649

Remove \! from helpt2t.gi

58b2ce4

Replace "\in" by "\\in"

5b16fc2

markuspf added kind: new feature do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) labels Feb 16, 2016

markuspf self-assigned this Feb 16, 2016

Replace "\*" by "*"

66a6b61

markuspf force-pushed the string-stricter-escape branch from 1d0e1c5 to 66a6b61 Compare February 16, 2016 14:59

Turn SyntaxError into SyntaxWarning for escape sequences

8ad536d

markuspf mentioned this pull request Feb 17, 2016

Add parsing of hex literals in strings #612

Merged

markuspf closed this Feb 18, 2016

markuspf deleted the string-stricter-escape branch February 5, 2017 12:36

fingolfin mentioned this pull request Mar 16, 2017

Add a "strict" mode #1191

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stricter string escape #619

Stricter string escape #619

markuspf commented Feb 16, 2016

ChrisJefferson commented Feb 16, 2016

frankluebeck commented Feb 16, 2016

fingolfin commented Feb 16, 2016

fingolfin commented Feb 16, 2016

fingolfin commented Feb 16, 2016

frankluebeck commented Feb 17, 2016

ChrisJefferson commented Feb 17, 2016

markuspf commented Feb 17, 2016

bh11 commented Feb 17, 2016

frankluebeck commented Feb 17, 2016

markuspf commented Feb 18, 2016

Stricter string escape #619

Stricter string escape #619

Conversation

markuspf commented Feb 16, 2016

ChrisJefferson commented Feb 16, 2016

frankluebeck commented Feb 16, 2016

fingolfin commented Feb 16, 2016

fingolfin commented Feb 16, 2016

fingolfin commented Feb 16, 2016

frankluebeck commented Feb 17, 2016

ChrisJefferson commented Feb 17, 2016

markuspf commented Feb 17, 2016

bh11 commented Feb 17, 2016

frankluebeck commented Feb 17, 2016

markuspf commented Feb 18, 2016