Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add simple actor telemetry #6294

Merged

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Dec 6, 2022

Changes

Fixes #6293

This is an initial implementation designed to provide basic actor start / stop / restart telemetry for keeping track of the number of alive actor instances in a given Akka.NET process. We have plans to use this in both Petabridge.Cmd and Phobos to improve observability experiences for users.

This design relies on publishing simple, lightweight events via the EventStream to listeners that subscriber to those events in-process. These metrics are not enabled by default and must be explicitly enabled via the following HOCON setting:

akka.actor.telemetry.enabled = true

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

Latest dev Benchmarks

Need to run actor spawn benchmarks with this setting enabled and not enabled

This PR's Benchmarks

Include data from after this change here.

@Aaronontheweb Aaronontheweb added akka-actor enhancement akka.net v1.4 Issues affecting Akka.NET v1.4 labels Dec 6, 2022
@Arkatufus
Copy link
Contributor

Arkatufus commented Dec 6, 2022

Would it be better to add a timestamp to the event messages? Maybe a DateTime.UtcNow.Ticks() or DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(), so that it can be reconstructed later on by analytics.

@Aaronontheweb
Copy link
Member Author

Aaronontheweb commented Dec 6, 2022

Would it be better to add a timestamp to the event messages? Maybe a DateTime.UtcNow.Ticks() or DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(), so that it can be reconstructed later on by analytics.

Since this is just a metric emission I wanted to keep it utterly bare minimum - a downstream consumer can convert these events into timed log entries if they wish.

}

// Create ActorTelemetryEvent messages for the following events: starting an actor, stopping an actor, restarting an actor
public readonly struct ActorStarted : IActorTelemetryEvent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth using Structs here if they are getting boxed anyway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll benchmark this and find it - what I'd like to do is use structs and get rid of the boxing but that'll require a non-binary-compatible change to the EventStream class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that change would probably be an Akka.NET v1.5 thing.

Comment on lines 1513 to 1518
public sealed class Reason
{
public Reason(string type, string message = null) { }
public string Message { get; }
public string Type { get; }
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public guid Error property is missing.
It would make it possible to tag and serialize/log exceptions into elasticsearch, seq, opentelemetry, Redis or simply into a S3 bucket and only transmit this Reason to remote nodes.
If needed the receiving remote system could try restore/deserialize the exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of Guid Error a property string Url would be better.
It will allow to have many/mix of error service types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a log entry for this? These are just meant to be simple metric updates

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we should just include the full Exception payload in here instead and get rid of the Reason. We don't / can't use Reason for stops (no way to tell why we're terminating) and I mentioned in the spec here that 100% of these events are meant to be consumed in-process via the EventStream. You can always transform / project them into something else later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Reason is need to transfer unknown Exception types between nodes transparently.
Has not much to to with the IActorTelemetryEvent them self.

Actor.Remote creates the assumption that all remoting is transparent.
But if a node transmit a unknown exception type to a node it fails horrible.

The general idea is to put the Èxception into memory cache or distributed storage/cache
then can then be loaded when really needed,
to get from Reason => Exception again

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By design, these events are not designed to be transmitted over the wire - local aggregation only. We don't want this data coupled to remoting, serialization, or persistence. "Simple metrics" is the goal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am asking only to make the Reason type multi purpose that it can be in general used instead of an exception.
The remote transparency has this intrinsic conflict with the Exception dotnet type.
The MS guys for system.runtime.remoting in framework 1.1 times already discovered it.

Comment on lines 1519 to 1522
public class static ReasonExtensions
{
public static Akka.Actor.Reason ToReason(this System.Exception ex) { }
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extensions method should be on the ActorContext or ActorSystem,
to make it possible to retrieve a new AkkaErrorService from Extensions list.
The Default implementation can simply convert the Exception to a Reason with Error = Guid.Empty

Copy link
Member Author

@Aaronontheweb Aaronontheweb Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These metrics are only produced within the ActorCell and it's performed automatically by Akka.NET - meaning there shouldn't be a reason to produce these from the outside.

In fact, that makes me wonder if I should be marking the constructors as internal....

Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detailed my own changes - still need to run benchmarks so we can get data on struct vs. class for these types.

/// <remarks>
/// Not intended to be sent across network boundaries - should only be processed locally via the <see cref="EventStream"/>.
/// </remarks>
public interface IActorTelemetryEvent : INoSerializationVerificationNeeded, INotInfluenceReceiveTimeout
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are meant to be metric deltas / increments that can be aggregated by a subscriber on the EventStream - not used for tracing or logs. Specific to the IActorRef lifecycle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling matters like measuring message processing rates this way would be a disaster from a performance and memory management point of view - the overhead of measuring message processing rates would be greater than or equal to the overhead of processing the messages themselves. Use Phobos for that use case.

/// <code>
/// akka.actor.telemetry.enabled = on
/// </code>
public bool EmitActorTelemetry { get; }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting that controls the emission of telemetry events. Off by default.

telemetry{
# Enables telemetry emission from actors - actors will emit events
# whenever they are started, stopped, or restarted.
enabled = false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relevant HOCON setting.

@@ -348,7 +351,9 @@ private void FinishRecreate(Exception cause, ActorBase failedActor)

if (System.Settings.DebugLifecycle)
Publish(new Debug(_self.Path.ToString(), freshActor.GetType(), "Restarted (" + freshActor + ")"));

if(System.Settings.EmitActorTelemetry)
System.EventStream.Publish(new ActorRestarted(Self, Props.Type, cause.ToReason()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emit ActorRestarted events here automatically when an actor is being rebooted. Include details about the exception that caused it to restart (so that data can be aggregated downstream.)

@@ -449,6 +449,8 @@ private void Create(Exception failure)
CheckReceiveTimeout();
if (System.Settings.DebugLifecycle)
Publish(new Debug(Self.Path.ToString(), created.GetType(), "Started (" + created + ")"));
if(System.Settings.EmitActorTelemetry)
System.EventStream.Publish(new ActorStarted(Self, Props.Type));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emit actor started events here - we validate via our specs that these aren't duplicated when an actor restarts.


// create a unit test where a second ActorSystem connects to Sys and receives an IActorRef from Sys and subscribes to Telemetry events
[Fact]
public async Task RemoteActorRefs_should_not_produce_telemetry()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any IActorRef without an ActorCell will not produce metrics (by design) - so FutureActorRefs from Ask<T> and RemoteActorRefs don't influence these totals.

/// Pool routers should have their start / stop / restarts counted too
/// </summary>
[Fact]
public async Task ActorTelemetry_must_be_accurate_for_pool_router()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Router actors have an ActorCell even though they don't have a typical Mailbox, therefore they will also emit the same events are other local actors do for starting / stopping / restarting.

public IActorRef Subject { get; }
public Type ActorType { get; }

public Reason Reason { get; }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These messages are meant to be consumed in a local process - why not just include the Exception here so the consumer can decide how to report?

/// </summary>
public readonly struct ActorRestarted : IActorTelemetryEvent
{
public ActorRestarted(IActorRef subject, Type actorType, Reason reason)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark all constructors as internal so these events can be emitted from user code? If I'm building reporting tools that are supposed to accurately report on what current load in a large Akka.Cluster looks like, someone else emitting their own versions of these metrics into the EventStream is going to throw this off and the other developers on their own team probably won't know that.

@Aaronontheweb
Copy link
Member Author

SpawnActor Benchmark Results

We ran the Akka.Benchmarks.Actor.SpawnActorBenchmarks on .NET 7.0.#6294

readonly struct Implementation

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  Job-ZEAKWQ : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

InvocationCount=1  IterationCount=10  RunStrategy=Throughput  
UnrollFactor=1  WarmupCount=5  
Method ActorCount EnableTelemetry Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Actor_spawn 100000 False 3.592 s 0.2078 s 0.1237 s 169000.0000 48000.0000 2000.0000 986 MB
Actor_spawn 100000 True 3.620 s 0.1311 s 0.0867 s 172000.0000 50000.0000 4000.0000 993 MB

sealed class Implementation

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  Job-HEBSLY : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT

InvocationCount=1  IterationCount=10  RunStrategy=Throughput  
UnrollFactor=1  WarmupCount=5  
Method ActorCount EnableTelemetry Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Actor_spawn 100000 False 3.543 s 0.2174 s 0.1438 s 170000.0000 49000.0000 3000.0000 985 MB
Actor_spawn 100000 True 3.544 s 0.1591 s 0.1052 s 170000.0000 49000.0000 2000.0000 993 MB

Looks to me like the impact of boxing is higher than the impact of having another class to GC. We should go with the class implementation.

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@Aaronontheweb
Copy link
Member Author

In terms of website documentation - I think we're going to need to expand our real-world deployment area of the site to include some discussion points on telemetry.

@Aaronontheweb Aaronontheweb merged commit 7f68c48 into akkadotnet:v1.4 Dec 8, 2022
@Aaronontheweb Aaronontheweb deleted the implement-telemetry-6293 branch December 8, 2022 21:50
@Aaronontheweb
Copy link
Member Author

@Arkatufus we're going to need to port this to v1.5 (the dev branch)

Arkatufus pushed a commit to Arkatufus/akka.net that referenced this pull request Dec 9, 2022
* added initial actor telemetry for akkadotnet#6293

* added basic telemetry tests for local actors

* added spec to validate that `RemoteActorRef` doesn't influence counters

* updated `SpawnActorBenchmarks` to include telemetry impact

* converted telemetry events into `sealed class`es with `internal` constructors

* removed `Reason`

(cherry picked from commit 7f68c48)
Aaronontheweb added a commit that referenced this pull request Dec 9, 2022
* add simple actor telemetry (#6294)

* added initial actor telemetry for #6293

* added basic telemetry tests for local actors

* added spec to validate that `RemoteActorRef` doesn't influence counters

* updated `SpawnActorBenchmarks` to include telemetry impact

* converted telemetry events into `sealed class`es with `internal` constructors

* removed `Reason`

(cherry picked from commit 7f68c48)

* Update API Verify list

* Fix API verify list

* Fix API verify list

* fix API Verify list

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
akka.net v1.4 Issues affecting Akka.NET v1.4 akka-actor enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants