Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Implement Int.from_bytes() and Int.as_bytes() #3795

Closed
wants to merge 31 commits into from
Closed
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
fc608a7
Implement Int.from_bytes() function
msaelices Nov 12, 2024
cd91096
Add entry to the changelog about Int.from_bytes()
msaelices Nov 12, 2024
3eca741
Make the Int.from_bytes() accept an span of bytes instead of a list
msaelices Nov 12, 2024
b9f189d
[stdlib] Move the Int.from_bytes() big_endian arg to a comptime param
msaelices Nov 13, 2024
cc97ffe
[stdlib] Implement Int.as_bytes()
msaelices Nov 21, 2024
b580bc9
[stdlib] Update changelog to include Int.as_bytes()
msaelices Nov 21, 2024
569e6ff
[stdlib] Simplify is_big_endian import
msaelices Nov 22, 2024
239bba6
Optimize copying bytes into the array as we know it's a trivial type
msaelices Nov 22, 2024
21384ec
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Nov 24, 2024
bc22870
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Nov 26, 2024
63fa5b4
Infer the Span type based on DType
msaelices Nov 27, 2024
5b41e1f
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Nov 28, 2024
b5e09c5
[stdlib] Revert part of the changes here as compiler does not support…
msaelices Nov 28, 2024
7766ea3
[stdlib] Move the Bytes alias in the test closer to the scope we use it
msaelices Nov 28, 2024
30027df
[stdlib] Less usage of UnsafePointer in the Int.from_bytes|as_bytes
msaelices Nov 28, 2024
eed9463
More meaningful test function name
msaelices Nov 29, 2024
7f142a8
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Dec 4, 2024
3b93cd6
Fix issue in the previous merge conflict, as we need the Span
msaelices Dec 4, 2024
b0bc485
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Dec 16, 2024
776b4df
Move the logic to SIMD so we can call it with whatever scalar we want
msaelices Dec 16, 2024
265b648
Move the Int.as_bytes() logic to SIMD, so we can call it with any sca…
msaelices Dec 16, 2024
0e5d293
Add reference ti new SIMD methods to the changelog
msaelices Dec 16, 2024
6e144d5
Tests for SIMD.from_bytes and SIMD.as_bytes
msaelices Dec 16, 2024
3211d24
Remove uneeded import sentences in int.mojo
msaelices Dec 16, 2024
8caad41
Try to not allocating memory by using InlineArray instead of List
msaelices Dec 17, 2024
dd29814
Fix the call expansion failed issue. Adapt unit tests
msaelices Dec 18, 2024
694693a
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Dec 18, 2024
40b9fc0
Remove the raises declaration from Int.from_bytes() and SIMD.from_byt…
msaelices Dec 21, 2024
98c29fc
Remove the DType parameter from Int.from_bytes() and Int.as_bytes()
msaelices Dec 21, 2024
8db9a83
Merge branch 'nightly' into int-from-and-to-bytes
msaelices Dec 22, 2024
3af72c5
No need for the Span here
msaelices Dec 22, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,10 @@ what we publish.

### ⭐️ New

- New `Int.from_bytes()` and `Int.as_bytes()` functions to convert a list of
bytes to an integer and vice versa, accepting the endianess as an argument.
Similar to Python `int.from_bytes()` and `int.to_bytes()` functions.

- `StringRef` is now representable so `repr(StringRef("hello"))` will return
`StringRef('hello')`.

Expand Down
65 changes: 62 additions & 3 deletions stdlib/src/builtin/int.mojo
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ These are Mojo built-ins, so you don't need to import them.

from collections import KeyElement

from bit import byte_swap
from math import Ceilable, CeilDivable, Floorable, Truncable
from hashlib.hash import _hash_simd
from hashlib._hasher import _HashableWithHasher, _Hasher
Expand All @@ -27,12 +28,12 @@ from collections.string import (
)
from python import Python, PythonObject
from python._cpython import Py_ssize_t
from memory import UnsafePointer
from memory import memcpy, UnsafePointer

from utils import Writable, Writer
from utils import Span, Writable, Writer
from utils._visualizers import lldb_formatter_wrapping_type
from utils._select import _select_register_value as select
from sys import is_nvidia_gpu, bitwidthof
from sys import is_big_endian, is_nvidia_gpu, bitwidthof

# ===----------------------------------------------------------------------=== #
# Indexer
Expand Down Expand Up @@ -1194,6 +1195,64 @@ struct Int(

writer.write(self)

@staticmethod
fn from_bytes[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think my comment before was ever properly addressed. I think this method should be on SIMD, not Int. That way the DType makes more sense. We can think more about what it would look like on Int later, and doing int(UInt64.from_bytes(span)) doesn't seem like too much of a hit to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed it. Done:
msaelices@776b4df
msaelices@265b648

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all the change requests, but to be clear I am comfortable merging these changes on SIMD, but not on Int. I think the SIMD API makes a ton of sense, but the int API would need some more thought since taking a DType feels wrong.

Copy link
Contributor

@soraros soraros Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think we should/could land the uncontroversial part (from/to_byte on SIMD) first and iterate. WDYT @msaelices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, actually the whole point of this PR was to help Python developers who need to migrate Python code using int.from_bytes() and int.to_bytes(). So, to me, it's easier if we have it on the Int struct too, and iterate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with having it on the Int struct too because we would either:

  • Break code users relied on
  • Never get around to revising it and have a substandard API

I'm also not compelled by the idea that the goal is to mirror a Python API considering your earlier argument in favor of breaking away from the Python API using a parameter. As I mentioned before, I think int(UInt64.from_bytes(my_span)) is fine for now

type: DType, big_endian: Bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of adding a from_bytes and to_bytes method, but I'm not a fan of using a boolean to represent the endianness here. If we wanted to be Pythonic, we could use StringLiteral in the parameter and constrain it to be "little" or "big", or we could make an Endian wrapper struct which we can parameterize other methods on in the future. WDYT @JoeLoser?

I also don't think we want to add a DType parameter here. We can just get the sizeof[Int](), or perhaps add a helper to get the DType equivalent to Int.

Copy link
Contributor Author

@msaelices msaelices Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was adding these methods because they are very convenient for IO or networking protocols. For example I am needing these for a pure Mojo implementation of websockets.

If we don't have the DType parameter, and we get an array of 60 bytes from some stream, how can we do the following cases:

  1. I want to get 10 UInt64 integers (10x6=60)
  2. I want to get 10 Int64 integers
  3. I want to get 30 UInt16 integers (30x2=60)
  4. I want to get 30 Int16 integers

Python implementation of int.as_bytes() have the length and byteorder arguments, and int.from_bytes() assumes we always have a 4 bytes integers and only have the byteorder one, but as arguments (run-time) which is slower and in Mojo we can do it comptime.

If we want to implement something IO-related, like WebSockets, we already know the endianness of the target we want to read from or write to, so it's faster if we use parameters.

Copy link
Contributor

@lsh lsh Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything, these comments make me think that from_bytes and to_bytes belongs on SIMD, not Int.

Python implementation of int.as_bytes() have the length and byteorder arguments, and int.from_bytes() assumes we always have a 4 bytes integers and only have the byteorder one, but as arguments (run-time) which is slower and in Mojo we can do it comptime.

I agree :) that's why I still said it should be a parameter, but my comments were questioning what the parameter should look like, because I'm not sold on Bool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if we follow Python's lead, bytes is binary Byte and can be encoded in UInt16 etc. We could also hold off on doing it like Python and let this function be inferred and have bytes: Span[Scalar[D]].

I think type shouldn't be used as prolifically as it is for parameters since it's the name of Python's type builtin function that I think we'll eventually have in Mojo as well.

Suggested change
type: DType, big_endian: Bool = False
D: DType, big_endian: Bool = False

questioning what the parameter should look like, because I'm not sold on Bool.

As for this, I personally never liked sys.byteorder returning "big" or "little" because there is no third option, so a boolean makes sense to me 🤷‍♂️. But compatibility may be a solid argument in favor of it.

Copy link
Contributor Author

@msaelices msaelices Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree with @martinvuyk . We are already diverging from the Python int.from_bytes() and int.to_bytes(). As a Python developer, I personally don't mind making a slight adaptation (e.g., "big" -> True and "small" -> False). The real issue would be if there is no equivalent implementation in Mojo's stdlib; you would need to implement the logic yourself.

](bytes: Span[Byte]) raises -> Self:
"""Converts a byte array to an integer.

Args:
bytes: The byte array to convert.

Parameters:
type: The type of the integer.
big_endian: Whether the byte array is big-endian.

Returns:
The integer value.
"""
if type.sizeof() != len(bytes):
raise Error("Byte array size does not match the integer size.")
var ptr: UnsafePointer[Byte] = UnsafePointer.address_of(bytes[0])
var type_ptr: UnsafePointer[Scalar[type]] = ptr.bitcast[Scalar[type]]()
var value = type_ptr[]

@parameter
if is_big_endian() and not big_endian:
value = byte_swap(value)
elif not is_big_endian() and big_endian:
value = byte_swap(value)
return int(value)
Copy link
Contributor

@martinvuyk martinvuyk Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var ptr: UnsafePointer[Byte] = UnsafePointer.address_of(bytes[0])
var type_ptr: UnsafePointer[Scalar[type]] = ptr.bitcast[Scalar[type]]()
var value = type_ptr[]
@parameter
if is_big_endian() and not big_endian:
value = byte_swap(value)
elif not is_big_endian() and big_endian:
value = byte_swap(value)
return int(value)
var ptr = bytes.unsafe_ptr().bitcast[Scalar[type]]()
@parameter
if is_big_endian() and not big_endian:
return int(byte_swap(ptr[]))
elif not is_big_endian() and big_endian:
return int(byte_swap(ptr[]))
else:
return int(ptr[])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if is_big_endian() == big_endian:
  ...
else:
  ...


fn as_bytes[type: DType, big_endian: Bool = False](self) -> List[Byte]:
"""Convert the integer to a byte array.

Parameters:
type: The type of the integer.
big_endian: Whether the byte array should be big-endian.

Returns:
The byte array.
"""
alias type_len = type.sizeof()
var value = Scalar[type](self)

@parameter
if is_big_endian() and not big_endian:
value = byte_swap(value)
elif not is_big_endian() and big_endian:
value = byte_swap(value)

var ptr: UnsafePointer[Scalar[type]] = UnsafePointer.address_of(value)
var byte_ptr: UnsafePointer[Byte] = ptr.bitcast[Byte]()
var list = List[Byte](capacity=type_len)

# TODO: Maybe this can be a List.extend(ptr, count) method
memcpy(list.unsafe_ptr(), byte_ptr, type_len)
list.size = type_len

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add more UnsafePointer APIs than absolutely necessary

Suggested change
var ptr: UnsafePointer[Scalar[type]] = UnsafePointer.address_of(value)
var byte_ptr: UnsafePointer[Byte] = ptr.bitcast[Byte]()
var list = List[Byte](capacity=type_len)
# TODO: Maybe this can be a List.extend(ptr, count) method
memcpy(list.unsafe_ptr(), byte_ptr, type_len)
list.size = type_len
var ptr = UnsafePointer.address_of(value)
var list = List[Byte](capacity=type_len)
memcpy(list.unsafe_ptr(), ptr.bitcast[Byte](), type_len)
list.size = type_len

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Changed here: msaelices@30027df

return list^

@always_inline("nodebug")
fn __mlir_index__(self) -> __mlir_type.index:
"""Convert to index.
Expand Down
53 changes: 53 additions & 0 deletions stdlib/test/builtin/test_int.mojo
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ from testing import assert_equal, assert_true, assert_false, assert_raises
from python import PythonObject
from memory import UnsafePointer

alias Bytes = List[Byte]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This definition is not decided over yet. could you take this into the scope of your test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Done: msaelices@7766ea3



def test_properties():
assert_equal(Int.MAX, (1 << bitwidthof[DType.index]() - 1) - 1)
Expand Down Expand Up @@ -246,6 +248,56 @@ def test_conversion_from_python():
assert_equal(Int.try_from_python(PythonObject(-1)), -1)


def test_from_bytes():
assert_equal(Int.from_bytes[DType.int16, big_endian=True](Bytes(0, 16)), 16)
assert_equal(
Int.from_bytes[DType.int16, big_endian=False](Bytes(0, 16)), 4096
)
assert_equal(
Int.from_bytes[DType.int16, big_endian=True](Bytes(252, 0)), -1024
)
assert_equal(
Int.from_bytes[DType.uint16, big_endian=True](Bytes(252, 0)), 64512
)
assert_equal(
Int.from_bytes[DType.int16, big_endian=False](Bytes(252, 0)), 252
)
assert_equal(
Int.from_bytes[DType.int32, big_endian=True](Bytes(0, 0, 0, 1)), 1
)
assert_equal(
Int.from_bytes[DType.int32, big_endian=False](Bytes(0, 0, 0, 1)),
16777216,
)
assert_equal(
Int.from_bytes[DType.int32, big_endian=True](Bytes(1, 0, 0, 0)),
16777216,
)
assert_equal(
Int.from_bytes[DType.int32, big_endian=True](Bytes(1, 0, 0, 1)),
16777217,
)
assert_equal(
Int.from_bytes[DType.int32, big_endian=False](Bytes(1, 0, 0, 1)),
16777217,
)
assert_equal(
Int.from_bytes[DType.int32, big_endian=True](Bytes(255, 0, 0, 0)),
-16777216,
)
for x_ref in List[Int](10, 100, -12, 0, 1, -1, 1000, -1000):
x = x_ref[]

@parameter
for b in range(2):
assert_equal(
Int.from_bytes[DType.int16, big_endian=b](
Int(x).as_bytes[DType.int16, big_endian=b]()
),
x,
)


def main():
test_properties()
test_add()
Expand All @@ -269,3 +321,4 @@ def main():
test_int_uint()
test_float_conversion()
test_conversion_from_python()
test_from_bytes()