-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add simple actor telemetry #6294
add simple actor telemetry #6294
Conversation
Would it be better to add a timestamp to the event messages? Maybe a |
Since this is just a metric emission I wanted to keep it utterly bare minimum - a downstream consumer can convert these events into timed log entries if they wish. |
} | ||
|
||
// Create ActorTelemetryEvent messages for the following events: starting an actor, stopping an actor, restarting an actor | ||
public readonly struct ActorStarted : IActorTelemetryEvent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth using Structs here if they are getting boxed anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll benchmark this and find it - what I'd like to do is use structs and get rid of the boxing but that'll require a non-binary-compatible change to the EventStream
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that change would probably be an Akka.NET v1.5 thing.
public sealed class Reason | ||
{ | ||
public Reason(string type, string message = null) { } | ||
public string Message { get; } | ||
public string Type { get; } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public guid Error
property is missing.
It would make it possible to tag and serialize/log exceptions into elasticsearch, seq, opentelemetry, Redis or simply into a S3 bucket and only transmit this Reason
to remote nodes.
If needed the receiving remote system could try restore/deserialize the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of Guid Error
a property string Url
would be better.
It will allow to have many/mix of error service types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a log entry for this? These are just meant to be simple metric updates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we should just include the full Exception
payload in here instead and get rid of the Reason
. We don't / can't use Reason
for stops (no way to tell why we're terminating) and I mentioned in the spec here that 100% of these events are meant to be consumed in-process via the EventStream
. You can always transform / project them into something else later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Reason
is need to transfer unknown Exception types between nodes transparently.
Has not much to to with the IActorTelemetryEvent
them self.
Actor.Remote
creates the assumption that all remoting is transparent.
But if a node transmit a unknown exception type to a node it fails horrible.
The general idea is to put the Èxception
into memory cache or distributed storage/cache
then can then be loaded when really needed,
to get from Reason => Exception again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By design, these events are not designed to be transmitted over the wire - local aggregation only. We don't want this data coupled to remoting, serialization, or persistence. "Simple metrics" is the goal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am asking only to make the Reason
type multi purpose that it can be in general used instead of an exception.
The remote transparency has this intrinsic conflict with the Exception
dotnet type.
The MS guys for system.runtime.remoting
in framework 1.1 times already discovered it.
public class static ReasonExtensions | ||
{ | ||
public static Akka.Actor.Reason ToReason(this System.Exception ex) { } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This extensions method should be on the ActorContext or ActorSystem,
to make it possible to retrieve a new AkkaErrorService from Extensions list.
The Default implementation can simply convert the Exception
to a Reason
with Error = Guid.Empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metrics are only produced within the ActorCell
and it's performed automatically by Akka.NET - meaning there shouldn't be a reason to produce these from the outside.
In fact, that makes me wonder if I should be marking the constructors as internal
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detailed my own changes - still need to run benchmarks so we can get data on struct
vs. class
for these types.
/// <remarks> | ||
/// Not intended to be sent across network boundaries - should only be processed locally via the <see cref="EventStream"/>. | ||
/// </remarks> | ||
public interface IActorTelemetryEvent : INoSerializationVerificationNeeded, INotInfluenceReceiveTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are meant to be metric deltas / increments that can be aggregated by a subscriber on the EventStream
- not used for tracing or logs. Specific to the IActorRef
lifecycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling matters like measuring message processing rates this way would be a disaster from a performance and memory management point of view - the overhead of measuring message processing rates would be greater than or equal to the overhead of processing the messages themselves. Use Phobos for that use case.
/// <code> | ||
/// akka.actor.telemetry.enabled = on | ||
/// </code> | ||
public bool EmitActorTelemetry { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting that controls the emission of telemetry events. Off by default.
telemetry{ | ||
# Enables telemetry emission from actors - actors will emit events | ||
# whenever they are started, stopped, or restarted. | ||
enabled = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relevant HOCON setting.
@@ -348,7 +351,9 @@ private void FinishRecreate(Exception cause, ActorBase failedActor) | |||
|
|||
if (System.Settings.DebugLifecycle) | |||
Publish(new Debug(_self.Path.ToString(), freshActor.GetType(), "Restarted (" + freshActor + ")")); | |||
|
|||
if(System.Settings.EmitActorTelemetry) | |||
System.EventStream.Publish(new ActorRestarted(Self, Props.Type, cause.ToReason())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emit ActorRestarted
events here automatically when an actor is being rebooted. Include details about the exception that caused it to restart (so that data can be aggregated downstream.)
@@ -449,6 +449,8 @@ private void Create(Exception failure) | |||
CheckReceiveTimeout(); | |||
if (System.Settings.DebugLifecycle) | |||
Publish(new Debug(Self.Path.ToString(), created.GetType(), "Started (" + created + ")")); | |||
if(System.Settings.EmitActorTelemetry) | |||
System.EventStream.Publish(new ActorStarted(Self, Props.Type)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emit actor started events here - we validate via our specs that these aren't duplicated when an actor restarts.
|
||
// create a unit test where a second ActorSystem connects to Sys and receives an IActorRef from Sys and subscribes to Telemetry events | ||
[Fact] | ||
public async Task RemoteActorRefs_should_not_produce_telemetry() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any IActorRef
without an ActorCell
will not produce metrics (by design) - so FutureActorRef
s from Ask<T>
and RemoteActorRef
s don't influence these totals.
/// Pool routers should have their start / stop / restarts counted too | ||
/// </summary> | ||
[Fact] | ||
public async Task ActorTelemetry_must_be_accurate_for_pool_router() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Router actors have an ActorCell
even though they don't have a typical Mailbox
, therefore they will also emit the same events are other local actors do for starting / stopping / restarting.
public IActorRef Subject { get; } | ||
public Type ActorType { get; } | ||
|
||
public Reason Reason { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These messages are meant to be consumed in a local process - why not just include the Exception
here so the consumer can decide how to report?
/// </summary> | ||
public readonly struct ActorRestarted : IActorTelemetryEvent | ||
{ | ||
public ActorRestarted(IActorRef subject, Type actorType, Reason reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark all constructors as internal
so these events can be emitted from user code? If I'm building reporting tools that are supposed to accurately report on what current load in a large Akka.Cluster looks like, someone else emitting their own versions of these metrics into the EventStream
is going to throw this off and the other developers on their own team probably won't know that.
|
Method | ActorCount | EnableTelemetry | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
Actor_spawn | 100000 | False | 3.592 s | 0.2078 s | 0.1237 s | 169000.0000 | 48000.0000 | 2000.0000 | 986 MB |
Actor_spawn | 100000 | True | 3.620 s | 0.1311 s | 0.0867 s | 172000.0000 | 50000.0000 | 4000.0000 | 993 MB |
sealed class
Implementation
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.2251 (21H2)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
Job-HEBSLY : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
InvocationCount=1 IterationCount=10 RunStrategy=Throughput
UnrollFactor=1 WarmupCount=5
Method | ActorCount | EnableTelemetry | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
Actor_spawn | 100000 | False | 3.543 s | 0.2174 s | 0.1438 s | 170000.0000 | 49000.0000 | 3000.0000 | 985 MB |
Actor_spawn | 100000 | True | 3.544 s | 0.1591 s | 0.1052 s | 170000.0000 | 49000.0000 | 2000.0000 | 993 MB |
Looks to me like the impact of boxing is higher than the impact of having another class to GC. We should go with the class
implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
In terms of website documentation - I think we're going to need to expand our real-world deployment area of the site to include some discussion points on telemetry. |
@Arkatufus we're going to need to port this to v1.5 (the |
* added initial actor telemetry for akkadotnet#6293 * added basic telemetry tests for local actors * added spec to validate that `RemoteActorRef` doesn't influence counters * updated `SpawnActorBenchmarks` to include telemetry impact * converted telemetry events into `sealed class`es with `internal` constructors * removed `Reason` (cherry picked from commit 7f68c48)
* add simple actor telemetry (#6294) * added initial actor telemetry for #6293 * added basic telemetry tests for local actors * added spec to validate that `RemoteActorRef` doesn't influence counters * updated `SpawnActorBenchmarks` to include telemetry impact * converted telemetry events into `sealed class`es with `internal` constructors * removed `Reason` (cherry picked from commit 7f68c48) * Update API Verify list * Fix API verify list * Fix API verify list * fix API Verify list Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Changes
Fixes #6293
This is an initial implementation designed to provide basic actor start / stop / restart telemetry for keeping track of the number of alive actor instances in a given Akka.NET process. We have plans to use this in both Petabridge.Cmd and Phobos to improve observability experiences for users.
This design relies on publishing simple, lightweight events via the
EventStream
to listeners that subscriber to those events in-process. These metrics are not enabled by default and must be explicitly enabled via the following HOCON setting:Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):
Latest
dev
BenchmarksNeed to run actor spawn benchmarks with this setting enabled and not enabled
This PR's Benchmarks
Include data from after this change here.