Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log record receipt timestamp #1875

Closed
pmm-sumo opened this issue Aug 18, 2021 · 17 comments · Fixed by #2184
Closed

Log record receipt timestamp #1875

pmm-sumo opened this issue Aug 18, 2021 · 17 comments · Fixed by #2184
Labels
help wanted Extra attention is needed release:required-logdatamodel-ga Required for declaring log data model stable spec:logs Related to the specification/logs directory

Comments

@pmm-sumo
Copy link
Contributor

pmm-sumo commented Aug 18, 2021

What are you trying to achieve?

Currently, logs data model describes a single (optional) Timestamp field which describes when the event occurred

Unlike for e.g. spans, there are at least three ways a log record can have timestamp associated:

  1. It can be set by a library supporting OTLP natively
  2. It can be parsed out from the log Body or Attributes (e.g. using timestamp operator for filelogreceiver)
  3. It can be set to receipt timestamp (which is a good fallback when the others are not available or fail)

While the first two can be considered mutually exclusive (if the record had timestamp already assigned, it does not make much sense to attempt parsing log body for it), this is not true with the last one. However, there's only one field where all of them can fit.

This brings several problems, such as:

  • Inability to fallback to receipt timestamp when parsing yielded incorrect value (also, no easy ability to find records with potential timestamp parsing errors)
  • No way to to tell using what method the timestamp was assigned to the record
  • Lack of data to measure delay between log occurrence and log being collected

E.g. currently filelogreceiver sets Timestamp to receipt time - by default, or parsed timestamp - when timestamp operator is used. This makes it impossible to tell later if the log timestamp was parsed or not.

Proposed solution

Perhaps we could add another field to the log record which would store the receipt timestamp (ReceiptTimestamp). It would be filled in the record on the first receiver that retrieved it and then passed through.

Timestamp field use would be limited only to storing the result of timestamp parsing or the value set explicitly in an OTLP-compatible library.

Alternatives

  • the receipt timestamp could be instead stored as an optional attribute
  • we could keep single Timestamp but add flags that would describe how it was created (also, attributes describing which parsing rule/template/regex was picked)
@pmm-sumo pmm-sumo added the spec:logs Related to the specification/logs directory label Aug 18, 2021
@pmm-sumo
Copy link
Contributor Author

pmm-sumo commented Aug 24, 2021

@bogdandrutu, following discussion during the specs SIG, here's a bunch of cases

  1. Timestamp parsing mistakes

Consider following log:

2021-08-19T11:55:15.498-0600	info	service/collector.go:242	Loading configuration..

While this is a common format, lets assume that the parsing rule missed the timezone information and the timestamp was parsed as: 2021-08-19 11:55:15.498 UTC (while it should have been 2021-08-19 17:55:15.498 UTC).

If we would have receipt timestamp, its value could have been e.g. 2021-08-19 17:55:16.200 UTC. Having such data, the user could run a query to find records where both values differ significantly and would notice there must be some issue with the parsing rules, making it easier to spot such errors and fix them.

  1. Logs without date/time

It is possible to get into even bigger timestamp parsing errors in some edge cases. E.g. when the raw log does not include real date/time but the regex could still catch it and attempt parsing it or when the date present in message does not describe the event timestamp but something else. This can be blamed to some extent on the ill-configuration of parsing rules, but having an ability to match the parsed timestamp against the receipt timestamp could be helpful in detecting such issues

Last successful login at Aug 12th, 8:45:11
  1. Understanding how the timestamp was assigned to know if it could be overwritten with (re)parsing

Additionally, in certain cases the user might prefer to leave the parsing to the backend (vendor). This is currently not well supported since we do not have information if the timestamp associated with the record is either:

  • a receipt timestamp (set as default and acting as a fallback)
  • a parsed timestamp
  • timestamp coming from a first-party application (which should be never overwritten)

This might be solved by the second of the alternatives I listed (storing info/flag how the timestamp was associated)

@tigrannajaryan
Copy link
Member

Sorry for posting this late. We discussed this in last SIG meeting but the discussion was not captured anywhere.

Proposed solution

Perhaps we could add another field to the log record which would store the receipt timestamp (ReceiptTimestamp). It would be filled in the record on the first receiver that retrieved it and then passed through.

Timestamp field use would be limited only to storing the result of timestamp parsing or the value set explicitly in an OTLP-compatible library.

I think adding a second timestamp field can be justified if we have a strong evidence that it will be populated in significant portion of cases. I am not certain that is going to be the case. Also, it is possible to imagine more than 2 types of timestamps. Are we going to add all those possible types of timestamps as fields and have them almost always unspecified?

Alternatives

  • the receipt timestamp could be instead stored as an optional attribute

This does sound interesting. We could define any number of semantic conventions for interesting timestamps and they could be set as optional attributes on the log record.

  • we could keep single Timestamp but add flags that would describe how it was created (also, attributes describing which parsing rule/template/regex was picked)

Would this flag be an enumeration, a choice from a fixed list or something more complicated? How would a parsing rule be described?

@pmm-sumo
Copy link
Contributor Author

Proposed solution
(...)
Timestamp field use would be limited only to storing the result of timestamp parsing or the value set explicitly in an OTLP-compatible library.

I think adding a second timestamp field can be justified if we have a strong evidence that it will be populated in significant portion of cases. I am not certain that is going to be the case. Also, it is possible to imagine more than 2 types of timestamps. Are we going to add all those possible types of timestamps as fields and have them almost always unspecified?

I believe that it depends on the log source. I have prepared a summary below. ReceiptTimestamp would come largely from file receiver whenever there's a chance that timestamp might be not accurately determined (the parsed value could be empty or incorrect)

Timestamp ReceiptTimestamp
System Logs via File parsed yes
System Logs via wire (e.g. Windows Event Logs) set maybe
Infra Logs (e.g. Kubernetes Events) set maybe
3rd-party App Logs via File/Stdout parsed yes
3rd-party App Logs via structured format (e.g. Windows Event Log) set maybe
Legacy 1st-party App Logs via File/Stdout parsed yes
Legacy 1st-party App Logs via Direct to Collector set no
New First-Party Application Logs set no

Alternatives

  • the receipt timestamp could be instead stored as an optional attribute

This does sound interesting. We could define any number of semantic conventions for interesting timestamps and they could be set as optional attributes on the log record.

Yes, that's serving the same goal as the proposed solution. It's just that in certain workloads all logs might have this field set so perhaps it could be more efficient to keep it as a field rather than attribute (especially since it would be present for each record).

  • we could keep single Timestamp but add flags that would describe how it was created (also, attributes describing which parsing rule/template/regex was picked)

Would this flag be an enumeration, a choice from a fixed list or something more complicated? How would a parsing rule be described?

I think it would be an enumeration. Some possible values:

  • Set (e.g. for new first-party log source or transforming from other structured format)
  • Parsed (e.g. when the value was parsed by filelogreceiver)
  • Receipt (e.g. when timestamp parsing didn't match anything)

@djaglowski
Copy link
Member

I am in favor of adding ReceiptTimestamp as a top level field on the log data model, but I propose the following:

  • OTel components should generally consider both Timestamp and ReceiptTimestamp to be immutable.
  • ReceiptTimestamp is required. It should represent the first time that an OTel component interacted with the log.
  • Timestamp is optional. It may be initialized by a first-party source, or by any component in the pipeline that can extract a meaningful timestamp from the log.

@tigrannajaryan tigrannajaryan added the help wanted Extra attention is needed label Oct 13, 2021
@tigrannajaryan
Copy link
Member

• Inability to fallback to receipt timestamp when parsing yielded incorrect value (also, no easy ability to find records with potential timestamp parsing errors)

@pmm-sumo It is not clear why this is so. If the parsing failed with an error then we will not set the timestamp field, so it will be empty. If the parsing failed silently and produced some value how will we know it is incorrect even if it is recorded in a different field? What is “incorrect” in this context?

@pmm-sumo
Copy link
Contributor Author

• Inability to fallback to receipt timestamp when parsing yielded incorrect value (also, no easy ability to find records with potential timestamp parsing errors)

@pmm-sumo It is not clear why this is so. If the parsing failed with an error then we will not set the timestamp field, so it will be empty. If the parsing failed silently and produced some value how will we know it is incorrect even if it is recorded in a different field? What is “incorrect” in this context?

Let's consider today's date, which I would write down as 10/11/21. In some countries it could be written as 11/10/21. If the parser is set incorrectly, it might yield October 11th rather than November 10th. This will not produce error but would store incorrect date otherwise.

Having the receipt timestamp (and observing it's so far away from the parsed date) would make it much easier to spot such incorrect parsing instances

@tigrannajaryan
Copy link
Member

A slightly different approach is to re-define the current Timestamp field to mean "the time when Otel first observed this event". For Otel SDK-based sources this will be the generation time. For other source this will be the receipt/collection time by Collector. This is similar to what @djaglowski describes as "It should represent the first time that an OTel component interacted with the log".

This Timestamp field will be always present in any OTLP data, in any log records coming out of Otel SDK and any log records coming out of Otel Collector. With such re-definition we can also entertain the idea to make this field mandatory. Essentially anyone who creates an OTLP record MUST populate this field to some value.

One possible downside of this approach is that it is not possible to tell if the timestamp was natively generated at the source or was "assigned" by the Otel component that first observed the event, which can be much later than the original generation time). Question: does it matter? And if it matters is it not solved by having the ParsedTimestamp (see next paragraph)?

The other timestamp field can be called ParsedTimestamp and will be optional. It can be set by the Collector when reading from filelog or by anyone who sends OTLP if they happen to do parsing. So, this is a field that is specifically there for better support of legacy log formats, which is an explicit goal that we have, so it sounds justified to me.

Question: is there any other timestamp kind other than these 2 that we want?

--

The reason I suggest this alternate approach is because otherwise I am having a hard time defining the semantics of the Timestamp field precisely and in a concise manner. We should not expect to provide a table like the one @pmm-sumo posted above so that we can have a precise of definition of what Timestamp field represents.

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Nov 10, 2021

Side note: we have flags field in OTLP which has a lot of reserved bits, so if we wanted to record any indicators about the timestamps that would be fairly cheap to do. For example if we need to know whether the Timestamp is generated by Otel SDK or populated by Collector when it saw the event the first time that would be a single bit.

@pmm-sumo
Copy link
Contributor Author

I like the alternative approach. I think it is an elegant solution. I don't have any other common timestamp fields on my mind. We can ask for those during the SIG. I those would be present, they could be also made part of semantic conventions.

@tigrannajaryan
Copy link
Member

From Log SIG today:

We probably want 2 fields: ObservedTimestamp and SourceTimestamp (names preliminary). ObservedTimestamp is the time when event was first observed in Otel (generated by Otel SDK or collected by Otel Collector). SourceTimestamp is what we could extract from the source (e.g. parse from file, copy from another protocol, etc).

For backends the best timestamp to use would be:

  • SourceTimestamp if present.
  • ObservedTimestamp otherwise.

If SourceTimestamp and ObservedTimestamp differ significantly it can be an indication that something went wrong (lagging collection, incorrect parsing, etc).

(Perhaps we keep the current name of the field Timestamp and it is the ObservedTimestamp described above).

@djaglowski
Copy link
Member

In addition to what Tigran captured, I think it's important to note that first party applications would directly populate the SourceTimestamp as the log is created.

Perhaps we keep the current name of the field Timestamp and it is the ObservedTimestamp described above

If we are to keep the current name of the Timestamp field, I think it should represent the time the log/event was emitted, to the best of our knowledge. (e.g. it should mean SourceTimestamp). My rationale is that the SourceTimestamp is fundamentally associated with the event itself, but that the ObservedTimestamp is only context about how the event was processed.

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Dec 2, 2021

@pmm-sumo @djaglowski should we move forward with this? I think we are in virtually in agreement, except perhaps the naming of the fields. If we want to change the name and clarify the meaning of the Timestamp field I believe we should do it before we declare the data model stable.

@pmm-sumo
Copy link
Contributor Author

pmm-sumo commented Dec 2, 2021

@tigrannajaryan this solves my original problem I mentioned in the issue so green light from my end.

I don't have strong opinion on names. Would be nice to keep Timestamp, but I am not sure which of the two is a better fit to be honest.

BTW, I am looking at ECS and I see three kinds of timestamp are listed there. They seem to follow model used for file log collection (which simplifies the options):

  • @timestamp - (...) typically representing when the event was generated by the source. If the event source has no original timestamp, this value is typically populated by the first time the event was received by the pipeline
  • event.created - (...) when the event was first read by an agent, or by your pipeline (...) This can be used to monitor your agent’s or pipeline’s ability to keep up with your event source.
  • event.ingested - when an event arrived in the central data store

Assuming no tampering, the timestamps should chronologically look like this:
@timestamp < event.created < event.ingested.

@tigrannajaryan
Copy link
Member

What do you think about these definitions for our data model:

  • Timestamp - Time when the event occurred measured by the origin clock, i.e. the time at the source. This field is optional, it may be missing if the source timestamp is unknown.
  • ObservedTimestamp - Time when the event was observed by the collection system. For events that originate in OpenTelemetry this timestamp is typically set at the generation time and is equal to Timestamp. For events originating externally and collected by OpenTelemetry (e.g. using Collector) this is the time when any of OpenTelemetry's code observed the event measure by the clock of the OpenTelemetry code. This field SHOULD be set once the event is observed by OpenTelemetry.

We can also provide guidance to systems which need to receive OpenTelemetry logs but support only one timestamp (also when translating to a non-OTLP format):

  • Use Timestamp if it is present, otherwise use ObservedTimestamp.

@djaglowski
Copy link
Member

@tigrannajaryan, I agree we should resolve this before the data model is declared stable, and I support your latest proposed design.

@tigrannajaryan
Copy link
Member

@pmm-sumo what do you think?

@pmm-sumo
Copy link
Contributor Author

pmm-sumo commented Dec 2, 2021

@pmm-sumo what do you think?

Plus one on the recent proposal. I would be actually good with any variation that solves the above issue. I think it fits the use cases we're having

@tigrannajaryan tigrannajaryan added the release:required-logdatamodel-ga Required for declaring log data model stable label Dec 2, 2021
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2021
Contributes to open-telemetry#1875

This is part 1 of the change. Part 2 will add the observed timestamp field. See the issue
for the discussion and the description of the source vs observed timestamps.
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2021
Contributes to open-telemetry#1875

This is part 1 of the change. Part 2 will add the observed timestamp field. See the issue
for the discussion and the description of the source vs observed timestamps.
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2021
Resolves open-telemetry#1875

See the issue for the discussion and the description of the source vs observed timestamps.
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2021
Resolves open-telemetry#1875

See the issue for the discussion and the description of the source vs observed timestamps.
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2021
Resolves open-telemetry#1875

See the issue for the discussion and the description of the source vs observed timestamps.
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 10, 2021
Resolves open-telemetry#1875

See the issue for the discussion and the description of the source vs observed timestamps.
jmacd added a commit that referenced this issue Dec 10, 2021
* Add ObservedTimestamp to the Log Data Model

Resolves #1875

See the issue for the discussion and the description of the source vs observed timestamps.

* Fixed based on PR comments

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
tigrannajaryan added a commit that referenced this issue Dec 10, 2021
Contributes to #1875

This is part 1 of the change. Part 2 will add the observed timestamp field. See the issue
for the discussion and the description of the source vs observed timestamps.
joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024
* Add ObservedTimestamp to the Log Data Model

Resolves open-telemetry/opentelemetry-specification#1875

See the issue for the discussion and the description of the source vs observed timestamps.

* Fixed based on PR comments

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024
Contributes to open-telemetry/opentelemetry-specification#1875

This is part 1 of the change. Part 2 will add the observed timestamp field. See the issue
for the discussion and the description of the source vs observed timestamps.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed release:required-logdatamodel-ga Required for declaring log data model stable spec:logs Related to the specification/logs directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants