-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parsing of hex literals in strings #612
Conversation
Apparently gcc 4.x doesn't like inline function definitions that are in headers and not guarded. Well, I'll move the function to |
Not sure what you mean by "not guarded", but may the issue simply be that the function needs to be "static inline" instead of just "inline"? |
64d1e9d
to
8e67616
Compare
m( of course. Thanks for pointing that out. |
8e67616
to
7c7b050
Compare
Happy with this. It would be nice to see a comprehensive approach to dealing with Unicode strings (for instance one which allowed them to be viewed as a list of unicode characters) and then it might make sense to add some syntactic support for them, but I think it would be good to have the grand design first. |
This PR also is missing documentation changes and tests. I will add those
soon.
|
096cad7
to
a6a420f
Compare
While it doesn't directly fit in this patch, is there any reason not to make '\a' for all characters 'a' which do not have a special meaning an error, rather than just making the '' get ignored. |
I wondered that too. It should be easy to add either as a separate PR if people think it's sensible. |
I don't remember why hex characters in string literals were not introduced long ago. At least now I cannot see a reason against it. I would consider the slight incompatibility of That the GAP strings are really sequences of bytes. Introducing
The |
@frankluebeck great. We should add some cross-references from the strings section of the reference manual to the GAPDoc manual. Is there are a case for splitting GAPDoc into several packages? XML, Unicode, help viewers, help compiler? Do we need kernel support for (for instance) reading large UTF-8 files into a Unicode string? |
@stevelinton These are all interesting points, but perhaps for the mailing list, not on this PR, which will (hopefully) soon be merged, and then the discussion is "lost". |
such digits. The meaning is given in the following list | ||
such digits. | ||
If it is the character <C>x</C>, then there must be two hexadecimal | ||
digits. The meaning is given in the following list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds weird, in particular the bit saying "then there must be two hexadecimal such digits", which seems to be missing something. Finally, the sentence "They consist of two characters." was already before this patch quite misleading. How about this:
There are a number of <E>special character sequences</E> that can be used
between the singlequotes of a character literal or between the
doublequotes of a string literal to specify characters.
They consists of a backslash <C>\</C>, followed by a second character indicating the type of
special character sequence, and, depending on this type, possibly some more characters.
The following special character sequencs are currently defined. Any other sequence starting with
a backslash results in an error.
I think this is a good idea, thanks. Just have some minor quibbles with the documentation. |
e7624c1
to
8c6286d
Compare
Well this is fun: On Sun, Feb 14, 2016 at 11:57:38PM -0800, Christopher Jefferson wrote:
I tried patching this in and it turns out that in the library and packages there I'll make a separate PR for this... |
I should probably pull the refactoring from #619 into this PR before we merge it. One small issue I see is that the error message is currently less specific. That could be addressed though if people are worried about it. |
8c6286d
to
75fb611
Compare
I updated this PR in the following way
|
I have fixed |
Looks mostly fine to me. Using \0x?? for the hex characters is similar to the syntax in other languages. And it has the advantage that there is no backward compatibility problem. Of course, the refactoring is also sensible. But as discussed in the thread on "Stricter string escape" I'm not happy with the warning if a backslash is used with non-special characters (and the corresponding change of the documentation). This change seems purely cosmetic and it breaks without real need a behaviour that is documented since 30 years (and makes changes in several places necessary). |
Is there any preference regarding whether we just revert to the old behaviour with respect to I'd quite like to move this PR forward. |
Mild preference for the warning. Steve |
Now updated to only warn if a backslash is followed by a letter. Tests fail because of minor errors in |
64138bc
to
72ec6bb
Compare
What is the status of this? It seems like it was basically done (though travis failed), and nobody really had complaints left? @markuspf Perhaps you can rebase it, so that codecov gets run; I'd be happy to review this once more afterwards, too. |
IIRC, the new version of qaos contained a fix for the problem addresses by this PR, but ctbllib will fail the
|
So, how about completly disabling the warning for now (leave it in, but commented out)? We can re-evaluate enabling it in the future, when/if |
72ec6bb
to
5a0301b
Compare
@finglfin there you go. I disabled the syntax warning for now, though I have to admit I am uncomfortable with bugwards-compatibility hacks. |
Current coverage is 48.70% (diff: 74.07%)@@ master #612 diff @@
==========================================
Files 424 424
Lines 222109 222119 +10
Methods 3426 3429 +3
Messages 0 0
Branches 0 0
==========================================
+ Hits 108166 108193 +27
+ Misses 113943 113926 -17
Partials 0 0
|
I have not changed my opinion compared to my comment on Feb 26. The example from ctblib mentioned above even demonstrates that GAPs (documented!) behaviour can be useful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me. So now we need to decide whether we want to break backwards compatibility and documented behavior for the sake of hypothetical future enhancements or not.
Right now, there seems no urgent reason to do so; so perhaps postpone it, but still ask package authors to adjust their packages, and revisit this in the future?
This is translated to the character corresponding to the number | ||
<C>X * 64 + Y * 8 + Z modulo 256</C>. | ||
This can be used to specify and store arbitrary binary data as a string | ||
in &GAP;. | ||
</Item> | ||
<Mark> | ||
<Index><C>\0xYZ</C></Index> | ||
<Index>hexadecimal character codes</Index> | ||
<C>\xYZ</C></Mark> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still says \xYZ
, should be \0xYZ
.
<Index>escaping non-special characters</Index> | ||
other</Mark> | ||
<Item> | ||
For any other character the backslash is simply ignored. | ||
For any other character the backslash is ignored. If the character | ||
is a letter, that is one of <C>a..zA..Z</C>, then a warning is displayed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: If we can't agree on changing the documented and actual behaviour, then this last sentence is controversial. In that case, we should drop it, but can keep the rest of this change.
They consist of a backslash <C>\</C> followed by a second character | ||
indicating the type of special character sequence, and possibly more characters. | ||
The following special character sequences are currently defined. Any other sequence | ||
starting with a backslash results in an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can't agree on changing the documented and actual behaviour, then this last sentence is controversial. In that case, we should drop it, but can keep the rest of this change.
@@ -416,7 +416,7 @@ InstallGlobalFunction(SIMPLE_STRING, function(str) | |||
"efghijklmnopqrstuvwxyz[\000]^_\000abcdefghijklmnopqrstuvwxyz{ }~", | |||
"\177\200\201\202", | |||
"\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225", | |||
"\226\227\230\231\232\233\234\235\236\237\238", | |||
"\226\227\230\231\232\233\234\235\236\237\240", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\238 is not a valid octal escape, whereas \240 is (for the same number, if one were to allow overflows in the octal digits).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhhh of course! facepalm
GET_CHAR(); | ||
c += GetOctalDigits(); | ||
} else { | ||
/* This warning is currently disabled for backwards compatibility */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps elaborate a bit, and/or reference this PR. E.g.:
"This warning is currently disabled for backwards compatibility: It turns out there are some packages which still rely on this."
** '0..9', 'A..F', or 'a..f' and 0 otherwise. | ||
*/ | ||
#define IsHexDigit(ch) (isxdigit((unsigned int)ch)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Technically, there might be systems out there which don't have isxdigit
, but we'll just deal with them if we ever encounter them. So this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isxdigit
is in ISO C90, I thought we're doing C99 even? Am I mixing something up?
} else { | ||
return (ch - '0'); | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used in one place... so unless we forsee use for it, why not just move it to scanner.c
? If something else ever needs it, we can still move it to a header (and then contemplate whether it needs extra safe guards or not etc.)
gap> x:='\0xFF'; | ||
'\377' | ||
gap> x:='\0xab'; | ||
'\253' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test cases only cover positives, but not malformed inputs, which we should also test. E.g. \0yAB
, \0X12
, \090
, \0a0
, \0x0
, \009
, \00a
, \00x
, \0x1g
, ..
I adapted the PR according to @fingolfin's comments. I should probably do some history rearrangement before we merge this (If everything is ok). |
Looks good to me now, thanks. If you want to cleanup the history, go ahead. Other than that, i think it can be merged now. |
This commit allows hexadecimal escapes of the form `0xHH` in string and character literals where `H` is a hexadecimal digit. The common codepaths for parsing the escape sequences in GetStr and GetChar are factored out into a common function GetEscapedChar. If an invalid escape sequence is read, a SyntaxWarning is displayed, but the old and documented behaviour to just ignore the backslash is preserved.
Code to display a warning is left commented out.
22adfad
to
0d7a403
Compare
With this patch it is possible to give character values by using
the common '\xHH' escape where H is a hexadecimal digit.
If this is accepted, I wonder whether there is also an interest for python's
\u
? It might not be as easy as this one as\u
parses to more than one byte.I think this might break existing code though, because