Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZonedDateTime not isbits #271

Open
Keno opened this issue Jul 8, 2020 · 10 comments · May be fixed by #287 or #323
Open

ZonedDateTime not isbits #271

Keno opened this issue Jul 8, 2020 · 10 comments · May be fixed by #287 or #323
Labels
performance Changes that impact code performance

Comments

@Keno
Copy link

Keno commented Jul 8, 2020

@oxinabox and I were discussing some performance problems in the presence of large numbers of ZonedDateTime objects and I noted that them not being isbits is a problem if you put lots of them in a vector, because the GC time taken to mark such vectors will be O(number of elements) not O(number of vectors). Now, arguably this is a prime use case for a partially pooled vector that just pools the TimeZone objects, but keeps the raw DateTime inline, but I'm also wondering if ZonedDateTime itself couldn't be made to be isbits and not have this issue in the first place. In particular, it seems like there's probably only ever gonna be a small handful of TimeZones globally in the system, so we could just globally intern them and only refer to them by small integer identifiers (reconstituting them into the full object on access). Does that sound like a feasible thing to do? I think it would dramatically increase the performance of ZonedDateTime.

@omus omus changed the title ZonedDataTime not isbits ZonedDateTime not isbits Jul 9, 2020
@omus
Copy link
Member

omus commented Jul 9, 2020

I have briefly looked into this and I think it's a feasible thing to do. My main concern is how this will work with serialization specifically if a user had defined a custom time zone instead of a predefined time zone.

@oxinabox
Copy link
Contributor

Could we fix that by defining custom serialization, so that it still serializes the same as it does now, even though it just stores a integery identifier both pre-serialization and after deserialization ?

@kcajf
Copy link

kcajf commented Jul 31, 2020

/~https://github.com/kcajf/TimeZones.jl
I have a branch that was working towards this - i.e. generate code for all the timezones, each one taking up just 16 bits (VariableTimeZoneF). I then made VariableTimeZone parametric in the timezone type. It's a bit hacky, and I never had time to fully finish it, but the proof of concept seemed to work.
See #182 for more discussion too

kcajf@4caacd3#diff-5c5b3616b976774dc6e6c77d00227ba3
This is the commit where I added most of the code generation. When the package is built, it'll write a zones.jl file with the type definitions and a few methods (each implemented as a big if/else block switching on the timezone).

@omus
Copy link
Member

omus commented Aug 29, 2020

I'll be working on this next week.

@omus omus linked a pull request Sep 4, 2020 that will close this issue
@kcajf
Copy link

kcajf commented Sep 17, 2020

@omus what did you think of the idea (discussed in #182) of making ZonedDateTime parametric in the timezone type? Too much change? It would allow, for example, ZonedDateTime{UTC} to be only 8 bytes

@omus
Copy link
Member

omus commented Sep 18, 2020

Interface-wise I like the idea of a parametric ZonedDateTime. Adding the type parameter is definitely breaking which we can do but need to do the experimentation to be sure it's with it. The change in #287 doesn't break the external API so it's a good alternative implementation to compare to.

@omus omus added the performance Changes that impact code performance label Sep 23, 2020
@oxinabox
Copy link
Contributor

oxinabox commented Apr 9, 2021

I thought worth an update on some more thoughts i have had on using ShortStrings.jl
Particularly since making overloaded serialization is not as easy as it at first seems (see #323 (comment))

I did some checked, and if we represented timezone name not as a single (short) string,
but as a tuple/as three fields, after splittign the name on / then we have options.
That would be Region, Locality1, Locality2 (where the later two are optional and can be replaced with "" if not present).
In particular, all TimeZones name parts would be 15 characters or less.
Which means they would fit into UInt128, Which is much faster than larger types.

So updates: since: #287 (comment)

I've been doing a bunch of experimentation and I'll provide an overview of some of bits of knowledge I've discovered:

Would not be a problem as UInt128 is part of Base, it doesn't need BitIntegers.jl
But that bug is now fixed

This has now been resolved.

  • ShortStrings.jl require constructing a String again for most operations including hashing which results worse performance when combined with storing the fields externally via a Dict/Vector

This has now been resolved, pretty much all operations now don't require constructing a String anymore.
Though there is a open bug in hashing on 32bit. JuliaString/MurmurHash3.jl#12

  • Attempting to make ZonedDateTime an isbitstype by making all subfields isbitstype is challenging.

    • A FixedTimeZone can be made isbitstype by using ShortStrings.jl but VariableTimeZone requires using ShortString and StaticArrays.jl (to replace the Vector{Transition}) which performed very poorly (never actually finished).
    • Updating VariableTimeZone to no longer precompute transitions (thereby eliminating Vector{Transition}) and instead use zones/rules from the tzsource may be feasible but would require a tuple of zones/rules as well as require a reference to an function

Nothing has changed with this AFAIK.

So ShortString is more viable than before, but still might not be viable.

Though it might be worth considering, is making FixedTimeZone isbits itself going to get a lot of the way to solving the practical problems with bitstypes.
That would mean if we made ZonedDateTime parameteric on the type of its timezone field then any one with a FixedTimeZone would be isbits.
And any VariableTimeZone wouldn't be isbits but the vector of transitions that it carries with in it would be a vector of isbits which are much nicers on the GC.
(niceness to the GC is the primary problem with not being isbits)

@oxinabox
Copy link
Contributor

Yet another idea for this, from @iamed2 is to just solve it for UTC.
Since a lot of data is stored in UTC (even if it doesn't originate there).
So we introduce a new type UTCDateTime which just has behind it the DateTime object.
and it would act just like a ZonedDateTime with the UTC timezone, but it wouldn't store any timezone at all.
and it would subtype AbstractDateTime or possibly we introduce a new AbstractZonedDateTime to hold it.
This would be a nonbreaking change, (unlike anything that depends on #332)
however for anyone wanting to take advantage of this they would need to widen their type-signatures to AbstractZonedDateTime.

It doesn't solve the whole problem but it does for one special and important case.

@kcajf
Copy link

kcajf commented Apr 19, 2021

Making ZonedDateTime parametric certainly opens up the design space - making the breaking/deprecating change is the hard part. @oxinabox if we're talking about adding a new type just for UTC, it wouldn't seem much of a stretch to instead introduce an entire new, general/parametric type that can live alongside ZonedDateTime, and slowly mark ZonedDateTime as deprecated. I think any new type will cause a lot of confusion, so may as well get the most value out of it!

If the change is made, I still believe that the Rust-inspired enum approach that I demonstrated in my branch is a good way to go. The timezone database is limited in size and is almost static, so storing it in the DateTime objects themselves feels wasteful. As a reminder, here is what the code generated in the build step looks like:

struct FixedTimeZoneF <: TimeZone
    local_minus_utc::Int32
    function FixedTimeZoneF(secs::Int)
        if (-86_400 < secs) && (secs < 86_400)
           new(secs) 
        else
            error("Invalid offset")
        end
    end 
end

FixedTimeZoneF(s::Second) = FixedTimeZoneF(s.value)

function name(tz::FixedTimeZoneF)
    offset = tz.local_minus_utc

    # TODO: could use offset_string in utcoffsets.jl with some adaptation
    if offset < 0
        sig = '-'
        offset = -offset
    else
        sig = '+'
    end

    hour, rem = divrem(offset, 3600)
    minute, second = divrem(rem, 60)
    
    if hour == 0 && minute == 0 && second == 0
        name = "UTC"
    elseif second == 0
        name = @sprintf("UTC%c%02d:%02d", sig, hour, minute)
    else
        name = @sprintf("UTC%c%02d:%02d:%02d", sig, hour, minute, second)
    end
    return name
end

primitive type VariableTimeZoneF <: TimeZone 16 end

struct TransitionSet
    utc_datetimes::Vector{DateTime}
    utc_offsets::Vector{Second}
    dst_offsets::Vector{Second}
    names::Vector{String}
end

const Africa__Abidjan = reinterpret(VariableTimeZoneF, UInt16(1))
const Africa__Accra = reinterpret(VariableTimeZoneF, UInt16(2))
const Africa__Addis_Ababa = reinterpret(VariableTimeZoneF, UInt16(3))
const Africa__Algiers = reinterpret(VariableTimeZoneF, UInt16(4))
const Africa__Asmara = reinterpret(VariableTimeZoneF, UInt16(5))
const Africa__Asmera = reinterpret(VariableTimeZoneF, UInt16(6))
const Africa__Bamako = reinterpret(VariableTimeZoneF, UInt16(7))
const Africa__Bangui = reinterpret(VariableTimeZoneF, UInt16(8))
#...
const US__PacificNew = reinterpret(VariableTimeZoneF, UInt16(544))
const US__Samoa = reinterpret(VariableTimeZoneF, UInt16(545))
const WSU = reinterpret(VariableTimeZoneF, UInt16(546))
const WET = reinterpret(VariableTimeZoneF, UInt16(547))

function name(tz::VariableTimeZoneF)
    if tz == Africa__Abidjan; "Africa/Abidjan"
    elseif tz == Africa__Accra; "Africa/Accra"
    elseif tz == Africa__Addis_Ababa; "Africa/Addis_Ababa"
    elseif tz == Africa__Algiers; "Africa/Algiers"
    elseif tz == Africa__Asmara; "Africa/Asmara"
    elseif tz == Africa__Asmera; "Africa/Asmera"
    elseif tz == Africa__Bamako; "Africa/Bamako"
    elseif tz == Africa__Bangui; "Africa/Bangui"
    elseif tz == Africa__Banjul; "Africa/Banjul"
    elseif tz == Africa__Bissau; "Africa/Bissau"
    elseif tz == Africa__Blantyre; "Africa/Blantyre"
    elseif tz == Africa__Brazzaville; "Africa/Brazzaville"
    elseif tz == Africa__Bujumbura; "Africa/Bujumbura"
    ...
    

function cutoff(tz::VariableTimeZoneF)
    if tz == Africa__Abidjan; nothing
    elseif tz == Africa__Accra; nothing
    elseif tz == Africa__Addis_Ababa; nothing
    elseif tz == Africa__Algiers; nothing
    elseif tz == Africa__Asmara; nothing
    elseif tz == Africa__Asmera; nothing
    elseif tz == Africa__Bamako; nothing
    elseif tz == Africa__Bangui; nothing
    elseif tz == Africa__Banjul; nothing
    elseif tz == Africa__Bissau; nothing
    elseif tz == Africa__Blantyre; nothing
    ...



const Africa__Abidjan__transitions = TransitionSet(
    [DateTime(-146138511, 1, 1, 0, 0, 0), DateTime(1911, 12, 31, 23, 43, 52), ],
    [Second(968), Second(0), ],
    [Second(0), Second(0), ],
    ["LMT", "GMT", ]
)

const Africa__Accra__transitions = TransitionSet(
    [DateTime(-146138511, 1, 1, 0, 0, 0), DateTime(1917, 12, 31, 23, 59, 8), DateTime(1920, 9, 1, 0, 0, 0), DateTime(1920, 12, 30, 23, 40, 0), DateTime(1921, 9, 1, 0, 0, 0), DateTime(1921, 12, 30, 23, 40, 0), ... ],
    [Second(52), Second(0), Second(0), Second(0), Second(0), .,.. ],
    [Second(0), Second(0), Second(1200), ...],
    ["LMT", "GMT", "", "GMT", "", "GMT", "", ... ]
)

function transitions(tz::VariableTimeZoneF)
    if tz == Africa__Abidjan; Africa__Abidjan__transitions
    elseif tz == Africa__Accra; Africa__Accra__transitions
    elseif tz == Africa__Addis_Ababa; Africa__Addis_Ababa__transitions
    elseif tz == Africa__Algiers; Africa__Algiers__transitions
    elseif tz == Africa__Asmara; Africa__Asmara__transitions
    elseif tz == Africa__Asmera; Africa__Asmera__transitions
    elseif tz == Africa__Bamako; Africa__Bamako__transitions
    elseif tz == Africa__Bangui; Africa__Bangui__transitions
    elseif tz == Africa__Banjul; Africa__Banjul__transitions
    elseif tz == Africa__Bissau; Africa__Bissau__transitions
    elseif tz == Africa__Blantyre; Africa__Blantyre__transitions
    elseif tz == Africa__Brazzaville; Africa__Brazzaville__transitions
    elseif tz == Africa__Bujumbura; Africa__Bujumbura__transitions
    elseif tz == Africa__Cairo; Africa__Cairo__transitions
    elseif tz == Africa__Casablanca; Africa__Casablanca__transitions
    elseif tz == Africa__Ceuta; Africa__Ceuta__transitions
    elseif tz == Africa__Conakry; Africa__Conakry__transitions
    ....

The rest of the logic is implemented in terms of these functions and lookup tables. The code compiles down to fast jump tables. It is entirely allocation-free, and the types themselves take up very little memory (they basically just contain a 16bit enum, which is pretty much as good as it can get).

@Wynand
Copy link
Contributor

Wynand commented Oct 14, 2021

For reference, this made FixedTimeZones and Transitions isbits: #354

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Changes that impact code performance
Projects
None yet
5 participants