Skip to content

Latest commit

 

History

History

multi-stage-table

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 

The MultiStageTable class

/******************************************************************************
 * This file is part of The Unicode Tools Of Rexx (TUTOR)                     *
 * See https://rexx.epbcn.com/tutor/                                          *
 *     and /~https://github.com/JosepMariaBlasco/TUTOR                          *
 * Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com>    *
 * License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)  *
 ******************************************************************************/

This class, defined in the /components/utilities/MultiStageTable.cls package, specializes in producing two-stage tables, three-stage tables, or, in general multi-stage tables.

Multi-stage tables are recommended in The Unicode Standard 15.0, section 5.1, Data Structures for Character Conversion, "Multistage Tables", pp. 196–7.

This is not a general implementation of multi-stage tables, but a custom, tailored one, specific to Unicode.

The indexes for these tables run from 0 to the size of the buffer. Negative values will raise a syntax error, and indexes greater than the size of the buffer will return width copies of "00"X.

compress (Class method)

Diagram for the MultiStageTable compress method

The compress method compresses a buffer and returns two smaller, compressed, tables, "chunks" and "offsets" (see below). The chunk_size argument, if specified, should be a power of two and a divisor of the length of the buffer; chunk_size defaults to 256.

Buffer is a n-byte string, where n is a multiple of chunk_size.

The compression technique works as follows: the buffer source string is supposed to be compressible, i.e., it is supposed to contain different segments which are identical (for example, because they are copies of "00"X). The buffer string will be broken in a series of substrings of size chunk_size, and, instead of storing the substring itself, we will store a reference to the substring. Thus, when two identical substrings of buffer are found, the first copy is stored, only once, in an indexed "chunks" table, and its offset (divided by chunk_size) is stored in a separate "offsets" table. A reference is supposed to be much smaller than the subarray itself, and this makes for the desired compression.

Note: To allow for maximum compression, we are supposing that the quantity of different sub-arrays does not exceed 256. This allows to store the references to the sub-arrays in a single byte.

[]

Diagram for the MultiStageTable [] method

Returns the n-th element of the multi-stage table, when 0 < n <= table_size, or a string containing width copies of "00"X, when n > table_size. Negative or non-numeric values of n will raise a Syntax error.

new

Diagram for the MultiStageTable new class method

Creates a new multi-stage table. This will be a two-stage table when big_values is not specified or has the .Nil value (the default), and a three-stage table when big_values is specified as a non-nil value.

The offset and chunks tables should have been created by the compress class method; chunk_size, when specified, should be the same value used in the compress call, and a power of two. Table_size should be the size of the buffer used in the compress call; it defaults to 217.

Width and big_values are optional. When width is specified for a two-stage table, a string of width bytes will be returned by the "[]" call; when width is specified for a three-stage table, the 1-byte value obtained from offset and chunks is multiplied by width and used as an index into big_values, to return a substring of width bytes.