/******************************************************************************
* This file is part of The Unicode Tools Of Rexx (TUTOR) *
* See https://rexx.epbcn.com/tutor/ *
* and /~https://github.com/JosepMariaBlasco/TUTOR *
* Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com> *
* License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) *
******************************************************************************/
This is a BIG release, with many changes and additions, and with a lot of preliminary documentation.
The most prominent feature in this release is the addition of Unicode-enabled input/output stream built-in functions (BIFs). Here is the documentation for the stream BIFs.
A stream is said to be Unicode-enabled when an ENCODING is specified in the STREAM OPEN command:
Call Stream filename, "Command", "Open read ENCODING UTF-8"
Stream I/O BIFs recognize that the stream is Unicode-enabled, and change their behaviour accordingly:
- The contents of each line are automatically decoded and converted to Unicode (i.e., to a UTF-8 presentation).
- Both LINEIN and CHARIN return strings of type TEXT, composed of extended grapheme clusters.
- When you call CHARIN and specify the length parameter, the appropriate number of characters (grapheme clusters) is read and returned (see the sketch after this list).
- Each encoding can specify its own set of end-of-line characters. For example, the IBM-1047 encoding (a variant of EBCDIC) specifies that "15"X, the NL character, is to be used as end-of-line. Both LINEIN and LINEOUT honor this requirement, i.e., when reading lines, a line will be ended by "15"X, and when writing lines, they will be ended by "15"X too, instead of the usual LF or CRLF combination.
- When using Unicode semantics, some operations can become very expensive to implement. For example, a simple direct-access character substitution in a file is trivial to implement for ASCII streams, but it can become prohibitive when using a variable-length encoding. These operations have been restricted in the current release.
- Similarly, when the Unicode-enabled stream has a string target of TEXT (the default), some operations can become prohibitive too: a TEXT "character" is, indeed, a grapheme cluster, and a grapheme cluster can have an arbitrary length. Direct-access character substitutions become too expensive to implement.
Note: We should start a discussion about which of the features we are used to, such as direct-access character substitution, make sense and should be implemented for Unicode-enabled streams.
When using a Unicode-enabled stream, encoding and decoding errors can occur. By default, ill-formed characters are replaced by the Unicode Replacement Character (U+FFFD). You can explicitly request this behaviour by specifying the REPLACE option in the ENCODING of your stream:
Call Stream filename, "Command", "Open read ENCODING UTF-8 REPLACE"
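For example, the following sketch (using a hypothetical file "bad.txt") first writes an ill-formed byte to a plain byte stream and then reads it back through a Unicode-enabled stream; the ill-formed byte comes back as U+FFFD:

Call Stream "bad.txt", "Command", "Open write replace"      -- a plain (non Unicode-enabled) stream
Call CharOut "bad.txt", "41FF42"X                           -- "A", an ill-formed byte, "B"
Call Stream "bad.txt", "Command", "Close"
Call Stream "bad.txt", "Command", "Open read ENCODING UTF-8 REPLACE"
Say LineIn("bad.txt")                                       -- "A", then U+FFFD ("EFBFBD"X in UTF-8), then "B"
Call Stream "bad.txt", "Command", "Close"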
REPLACE is the default option for error handling. You can also specify SYNTAX as an error handling option:
Call Stream filename, "Command", "Open read ENCODING UTF-8 SYNTAX"
Finding ill-formed characters will then raise a Syntax error. If the Syntax condition is trapped, you will be able to access the undecoded or unencoded offending line or character sequence by using the "QUERY ENCODING LASTERROR" STREAM command:
Call Stream filename, "Command", "Open read ENCODING UTF-8 SYNTAX"
...
Signal On Syntax
...
var = LineIn(filename) -- May raise a Syntax error
-- Do something with "var"
...
Syntax:
offendingLine = Stream(filename, "Command", "Query Encoding Lasterror")
-- Do something with "offendingLine"
...
By default, Unicode-enabled streams return strings of type TEXT, composed of grapheme clusters. On some occasions, you may prefer to receive CODEPOINTS strings. You can specify the target type in the ENCODING section of your STREAM OPEN command:
Call Stream filename, "Command", "Open read ENCODING UTF-8 TEXT"
When you specify TEXT (the default), returned strings are of type TEXT. When you specify CODEPOINTS, returned strings are of type CODEPOINTS.
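A short sketch of the difference, assuming a hypothetical file "accents.txt" whose only line is "e" followed by the combining acute accent (U+0301), and assuming that LENGTH counts grapheme clusters for TEXT strings and codepoints for CODEPOINTS strings:

Call Stream "accents.txt", "Command", "Open read ENCODING UTF-8 TEXT"
Say Length( LineIn("accents.txt") )       -- 1: a single grapheme cluster
Call Stream "accents.txt", "Command", "Close"
Call Stream "accents.txt", "Command", "Open read ENCODING UTF-8 CODEPOINTS"
Say Length( LineIn("accents.txt") )       -- 2: two codepoints, "e" and U+0301
Call Stream "accents.txt", "Command", "Close"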
Note: Some operations that are easy to implement for a CODEPOINTS target type can become impractical when switching to a TEXT type. For example, UTF-32 is a fixed-length encoding, so that with a CODEPOINTS target type, direct-access character positioning and substitution is trivial to implement. On the other hand, if the target type is TEXT, these operations become very difficult to implement.
The STREAM BIF has been extended to support Unicode-enabled streams:
Call Stream filename, "Command", "Open read ENCODING IMB1047 CODEPOINTS SYNTAX" -- Now "filename" refers to a Unicode-enabled stream
Say Stream(filename, "Command", "Query Encoding Name") -- "IBM1047"
Say Stream(filename, "Command", "Query Encoding Target") -- "CODEPOINTS", the name of the target type
Say Stream(filename, "Command", "Query Encoding Error") -- "SYNTAX", the name of the error handling option
Say Stream(filename, "Command", "Query Encoding LastError") -- "", the offending line or character sequence
Say Stream(filename, "Command", "Query Encoding") -- "IBM1047 CODEPOINTS SYNTAX"
Although Unicode-enabled streams are simple and convenient to use, in some cases you may want to resort to manual encoding and decoding operations. For maximum control, you can use the new BIFs, ENCODE and DECODE (defined in Unicode.cls).
DECODE can be used as an encoding validator:
wellFormed = DECODE(string, encoding)
will return a boolean value indicating whether string can be decoded without errors by using the specified encoding (i.e., 1 when the decoding will succeed, and 0 otherwise).
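For example (a minimal sketch):

Say DECODE("4A6F7365"X, "UTF-8")      -- 1: "Jose" is well-formed UTF-8
Say DECODE("4A6F73E9"X, "UTF-8")      -- 0: a lone "E9"X byte is ill-formed in UTF-8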
You can also use DECODE to decode a string, by specifying a target format (currently, only UTF-8 and UTF-32 are supported):
decoded = DECODE(string, encoding, "UTF-8")
In this case, the function will return the null string if string cannot be decoded without errors with the specified encoding, and the decoded version of its first argument if no ill-formed character combinations are found.
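For instance, assuming the IBM1047 encoding shown above is available:

Say DECODE("C1C2C3"X, "IBM1047", "UTF-8")   -- "ABC": the bytes decode without errors
Say DECODE("4A6F73E9"X, "UTF-8", "UTF-8")   -- "": the lone "E9"X byte is ill-formed, so the null string is returned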
Since encoding and decoding are considered to be low-level operations, the results of ENCODE and DECODE are always BYTES strings. If you need more features for the returned strings, you can always promote the results to higher types by using the CODEPOINTS and TEXT BIFs.
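For example (a sketch, continuing the hypothetical IBM1047 example above):

utf8 = DECODE("C1C2C3"X, "IBM1047", "UTF-8")   -- a BYTES string containing "ABC"
text = TEXT(utf8)                              -- promote the result to a TEXT string
code = CODEPOINTS(utf8)                        -- ...or to a CODEPOINTS string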
A fourth argument to the DECODE BIF determines the way in which ill-formed character sequences are handled:
decoded = DECODE(string, encoding, "UTF-8", "REPLACE")
When the fourth argument is omitted, or is specified as "" or "NULL" (the default), a null string is returned if any ill-formed sequence is found.
When the fourth argument is "REPLACE"
, any ill-formed character is replaced with the Unicode Replacement Character (U+FFFD). When the fourth
argument if "SYNTAX"
, a Syntax error is raised in the event that an ill-formed sequence is found.
I have started to document the programs using ooRexxDoc. This is a work in progress.
To the rxu Rexx Preprocessor for Unicode:
- Recognize BIFs in CALL instructions.
- Remove support for OPTIONS CONVERSIONS (wanted to rethink the feature).
- Change "C" suffix for classic strings to "Y", as per Rony's suggestion.
- "U" strings are now BYTES strings.
- Implement DATATYPE(string, "C") (syntax checks uniCode strings).
- Implement LINEIN, LINEOUT, CHARIN, CHAROUT, CHARS and LINES.
To the main Unicode class, Unicode.cls:
- Rename P2U to C2U, and create a new U2C BIF. Complete symmetry with C2X, X2C and DATATYPE("X").
A new encoding subdirectory has been created. The main encoding class is Encoding.cls. Concrete encodings are subclasses of Encoding.cls, and are automatically recognized when they are added to the encoding subdirectory.
Note: the encoding interface is likely to change in the following releases.
Numerous sample programs have been added to the samples directory. Most of these programs test the behaviour of the enhanced BIFs.