Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stricter string escape #619

Closed
wants to merge 10 commits into from

Conversation

markuspf
Copy link
Member

This pull request does essentially three things (it also contains #612, because I branched it off of that PR):

  • refactor parsing of escaped characters so that we do not have duplicate code in GetStr and GetChar
  • make it a syntax error if anything follows a \ in a string that is not allowed as per documented escape sequences
  • clean up some of the fallout from that. Those will probably be mostly actual bugs, because things like "\!" would just have been parsed as "!" anyway.
  • I am a slight bit concerned about operations such as "\+", "\-", "\*", "\/". They would have been parsed just as "+", "-", "*", "/" anyway, but there was a convention of backslashing them for some reason.

WARNING: This breaks packages, in particular GAPDoc

With this patch it is possible to give character values by using
the common '\xHH' escape where H is a hexadecimal digit.
While we're here, we also make the scanning of escape sequences stricter.
This was before parsed as "+", so there should not be a functional change.
@markuspf markuspf added kind: new feature do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) labels Feb 16, 2016
@markuspf markuspf self-assigned this Feb 16, 2016
@ChrisJefferson
Copy link
Contributor

I guess writing "\+" comes from the fact that you have to write \+ in GAP code to access these operators -- however, I did assume (incorrectly) that the strings "+" and "\+" would be in some way different (which it turns out they aren't).

@frankluebeck
Copy link
Member

@markuspf wrote:

WARNING: This breaks packages, in particular GAPDoc

No problem for GAPDoc, I have removed the four \s which are not needed.

I have commented on a change which I think is wrong.

It seems you have missed two cases: In lines 291 and 292 in small/smlinfo.gi the $\Phi_{ should be changed to $\\Phi_{.

Note that the escapes \< and \> are not documented because I found it very difficult to describe what they really do (they are hints for the automatic line breaking mechanism). But you have kept them as valid escape sequences, so no problem.

Otherwise it looks ok to me.

@markuspf markuspf force-pushed the string-stricter-escape branch from 1d0e1c5 to 66a6b61 Compare February 16, 2016 14:59
@fingolfin
Copy link
Member

@frankluebeck Hmm, what do you mean "no problem for GAPDoc" -- quite clearly, with this patch, GAPDoc 1.5.1 triggers syntax errors. Or did you mean that these are easy to fix, and that in the next GAPDoc version, it won't be a problem? (Still looking forward to that new release, by the way ;-)

@fingolfin
Copy link
Member

There are errors in qaos and ctbllib triggered by this.

Hmm, could we perhaps do this: Keep accepting those "invalid" sequences (treating them as before), but printing a syntax warning. Then, merge that. After that, urge authors of affected packages to fix that (might be a problem for qaos, as that was last updated in 2008). Then eventually, turn it into a syntax error.

@fingolfin
Copy link
Member

Some more potentially affected packages / files (from some grepping through my pkg directory; this is probably incomplete):

CAP/examples/VectorSpaces.gi:1009:# theorem_string := "\alpha:Mor, \beta:Mor ~|~ \IsMonomorphism( \alpha ) \vdash \IsMonomorphism( \ProjectionInFactorOfFiberProduct( [ \alpha, \beta ], 2 ) )";

Convex/gap/Cone.gd:292:DeclareOperation( "\*",

LinearAlgebraForCAP/gap/MatrixCategoryMorphism.gd:71:DeclareOperation( "\*",

ModulePresentationsForCAP/gap/ModulePresentationMorphism.gd:90:DeclareOperation( "\*",

ToricVarieties/gap/ToricDivisors.gd:377:DeclareOperation( "\+",

guava-3.12/lib/decoders.gi:422:   Error("\N This method only applies to GRS codes.\n");

hecke/gap/specht.gd:131:DeclareOperation("\+",[IsAlgebraObjModule,IsAlgebraObjModule]);

@frankluebeck
Copy link
Member

Rethinking this, with the discussed examples in mind, I have changed my opinion.

I have looked into the documentation of some other languages with backslash escapes. In all cases it was not documented how a not explicitly mentioned backslash-something sequence is interpreted.

In contrast: GAP did document its behaviour since 30 years and as we have seen this definition is actually used in various places. So, why should we introduce a non backward compatible change for this (for which some code that is around for decades needs to be changed)?

In some examples above it is even nice for aesthetic reasons that one can write strings in the same way as identifiers:

DeclareOperation( "\+", ...

instead of the equivalent

DeclareOperation( "+", ...

because the corresponding identifier must be escaped anyway:

\+(1,2);

PS: I still agree that the advantage of being able to specify characters in hex notation justifies the slight backward incompatibility with respect to the rarely, if ever, used \x sequence.

@ChrisJefferson
Copy link
Contributor

@frankluebeck how about (as a middle group) we should just forbid \ before a letter? Have any cases of this turned up which were not actually mistakes?

@markuspf
Copy link
Member Author

One reason I see why we should maybe at least warn is the bugs that were found by making this change, i.e. instances of escaping that wasn't actually escaping.

That said, I do agree that the current behaviour is documented and that I got a bit worried about cases such as "\+", though one could argue that the correct thing to use here is "\\+" (but then this becomes another breaking change I think).

I am not sure about how much more complicated we'd want to make the rules for escaping.

@bh11
Copy link
Contributor

bh11 commented Feb 17, 2016

I like the idea of reserving just ascii letters. The '\N' above is probably a typo, meant to be '\n'.

Alternatively, we might use \0x instead of \x for hex strings, thus allowing a previously forbidden escape sequence.

I also wouldn't mind \u or \0u meaning UTF-8 encoding (whether this makes sense in the given context or not). This is no better or worse than octal and hex characters. Would we want four or five (hex) digit unicode codes?

@frankluebeck
Copy link
Member

I thought about proposing to disallow only non-documented backslash escapes with letters for later extensions. But then decided against it because I cannot see a big enough benefit to justify the change of a decades old documented behaviour.

On the other hand I agree that this "middle version" is likely to reveal more bugs than cause annoyance.

With respect to unicode I agree with a comment by @stevelinton that we should first discuss a proper concept of strings, text objects, encodings, and so on, before extending the parser.

@markuspf
Copy link
Member Author

I will close this PR, because it is subsumed in #612.

I am of course also not opposed to even deactivating the SyntaxWarning, but that is a trivial change in #612 now.

@markuspf markuspf closed this Feb 18, 2016
@markuspf markuspf deleted the string-stricter-escape branch February 5, 2017 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge PRs which are not yet ready to be merged (e.g. submitted for discussion, or test results) kind: new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants