Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore data sync errors and allow custom should_capture via Delegator pattern #160

Merged
merged 15 commits into from
Oct 20, 2020

Conversation

ChrisBAshton
Copy link
Contributor

In order to prevent thousands of transient errors being reported to Sentry during the data sync window (errors that occur only because databases are copied/provisioned overnight and are sometimes unavailable), we want to configure the should_capture logic in govuk_app_config in the same way as Content Publisher and Email Alert API, so that the fix is applied centrally. This involves setting the should_capture behaviour to ignore any errors of type PG::Error, if we are in the data sync window (10pm to 8am, according to Puppet).

However, we want to allow downstream apps to be able to define their own should_capture conditions too. Therefore, this PR:

  • Changes the object apps interact with when they call GovukError.configure - it now yields to a RavenDelegator, which uses the Delegator pattern as suggested by @kevindew. The pattern delegates methods to the Raven.configuration object unless they've been explicitly overridden.
  • ^ this enables us to define our own default should_capture=(callback), inside DefaultConfiguration - but also allows downstream apps to provide their own should_capture=(callback) callbacks too. In our implementation, both the default callback and the custom callback must return true for the overall should_capture callback to return true.
  • The default should_capture callback examines the error type and determines whether or not we are in the data sync window to decide if the error should be captured. It does this by looking at a global GOVUK_DATA_SYNC_PERIOD ENV variable of the form 22:00-08:00. PG::Error errors that occur during this time frame will be ignored.
  • The ENV variable doesn't exist yet: I will raise a separate PR in govuk-puppet for that.

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
Fixes #152. Supersedes #159 (the code has been largely recreated by hand here, as the commit history was meandering and no longer particularly valuable).

Copy link
Member

@kevindew kevindew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris, lots of great improvements here and this is shaping up great. There's a few areas where I think the suggestions I've made didn't come out too clearly so I did some hacking myself so I could try point to what I meant rather than just try describe it in words - hopefully that provides greater clarity.

I cobbled my bits together here: 81c78c5

So these are the bits that perhaps I think our wires ended up a bit crossed on.

I think we should make the data sync error side distinct from the configuration. I've tried applying that like so: 81c78c5#diff-e455739649a523940d6e5fc5353656a0R3-R36 - that also reflects a difference on should capture, I'd expect an application defining should_capture will replace a previously defined version, but always keep the data sync check.

81c78c5#diff-60bf2f76f5886144ddb9f524d7a5a8edR49-R51 represents how I'd expect that errors were configured, so they rather transparently fit into configuration rather than containing some of the logic. I also felt it didn't feel quite right to be setting our default configuration in and amongst a class, seems to raise questions about what to test or not.

I did have a quick pass at matching to the exception class by converting the names to strings here: 81c78c5#diff-e455739649a523940d6e5fc5353656a0R30-R32 to be honest I don't know if that is sufficient or not as Raven seems to have used a more complex constant resolving process. It seemed to work in my test though. I guess a thing to_s might not cover is two constants pointing to the same object, but that seemed pretty edge case.

Hope this helps and happy to have a chat if some of my suggestions are ambiguous, there are a variety of subtle things in my hacking that I haven't explained.

govuk_app_config.gemspec Outdated Show resolved Hide resolved
govuk_app_config.gemspec Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/govuk_data_sync.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/govuk_data_sync.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/raven_delegator.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure_defaults.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure_defaults.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure_defaults.rb Outdated Show resolved Hide resolved
@ChrisBAshton ChrisBAshton force-pushed the datasync-errors-new branch 2 times, most recently from 9c80568 to bd8759c Compare October 9, 2020 09:10
@ChrisBAshton
Copy link
Contributor Author

Thanks again for the review, @kevindew. Apologies it's taken a while to address your comments - other things have taken priority this week.

I've administered your suggestions in the relevant commits, and then added a final commit for moving the data sync logic into the RavenDelegator.

I'm aware you've opted for a configure.rb for the default configuration and a configuration.rb for the should_capture logic. I've left as-is for now (configure_defaults.rb and raven_delegator.rb respectively) as I think this is more scannable, but I'm happy to be talked out of it. (I'm personally of the view that 'configuration' implies where the default config lives, and I wouldn't know what 'configure' is supposed to do in that context).

Copy link
Member

@kevindew kevindew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris, this is looking really close now. Nice job.

Regarding the areas you referenced in your previous comment.

ConfigureDefaults class, to me this isn't a class in a traditional sense all we're looking for is vessel for the configuration code, and it doesn't really behave like a class since we just use it as a function to apply side effects to the argument.

I put it in a govuk_error/configure file to reference that's it's just to configure that module. It could well go inside the block of /~https://github.com/alphagov/govuk_app_config/pull/160/files#diff-de4bdf6f6685d2a23ac3a9ea5c54b985R10-R12 instead - yeah my concerns are mostly about this being a class.

With the naming switch from GovukError::RavenDelegator to GovukError::Configuration, my logic for that was that the object is a configuration object for GovukError, whereas it does more than be a delegator to Raven. That naming thought more like it was describing the implementation than the purpose.

lib/govuk_app_config/govuk_error/raven_delegator.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/govuk_data_sync.rb Outdated Show resolved Hide resolved
spec/govuk_error/raven_delegator_spec.rb Outdated Show resolved Hide resolved
spec/govuk_error/raven_delegator_spec.rb Outdated Show resolved Hide resolved
spec/govuk_error/raven_delegator_spec.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure_defaults.rb Outdated Show resolved Hide resolved
@ChrisBAshton ChrisBAshton force-pushed the datasync-errors-new branch 2 times, most recently from 26acc13 to d9dcdcd Compare October 12, 2020 12:16
@ChrisBAshton
Copy link
Contributor Author

Thanks @kevindew - some excellent suggestions there, have applied them all.

@ChrisBAshton ChrisBAshton requested a review from kevindew October 12, 2020 12:17
Copy link
Member

@kevindew kevindew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thanks Chris - I think we're through all the meat of this now, just a few minor things.

Have you given it a try out on an app? You could try install with Content Publisher I did that myself for a quick spin earlier in the process.

spec/govuk_error/configuration_spec.rb Outdated Show resolved Hide resolved
spec/integration/configuring_spec.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error.rb Outdated Show resolved Hide resolved
spec/lib/govuk_error_spec.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure.rb Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure.rb Outdated Show resolved Hide resolved
@ChrisBAshton ChrisBAshton requested a review from kevindew October 14, 2020 17:34
@ChrisBAshton ChrisBAshton force-pushed the datasync-errors-new branch 2 times, most recently from 661ce7c to 92a83fa Compare October 14, 2020 17:44
Copy link
Member

@kevindew kevindew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work Chris 💯 🏅 . I've got two very minor comments but this shouldn't need an extra review.

README.md Outdated Show resolved Hide resolved
lib/govuk_app_config/govuk_error/configure.rb Outdated Show resolved Hide resolved
describe ".in_progress?" do
it "returns false if we are outside of the time range" do
data_sync = GovukError::GovukDataSync.new("22:30-8:30")
at(hour: 21) { expect(data_sync.in_progress?).to eq(false) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice 👍

ChrisBAshton added a commit to alphagov/govuk-puppet that referenced this pull request Oct 16, 2020
This is to enable alphagov/govuk_app_config#160,
which ignores certain exceptions if they occur during the data
sync period. We want to set this via Puppet so that it is DRY and
configured in only one place.

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/govuk-puppet that referenced this pull request Oct 16, 2020
This is to enable alphagov/govuk_app_config#160,
which ignores certain exceptions if they occur during the data
sync period. We want to set this via Puppet so that it is DRY and
configured in only one place.

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/govuk-puppet that referenced this pull request Oct 16, 2020
This is to enable alphagov/govuk_app_config#160,
which ignores certain exceptions if they occur during the data
sync period. We want to set this via Puppet so that it is
configured in only one repo.

Whilst it would be nice to read the start and end times from
govuk_data_sync_in_progress, there is precedent for duplicating
the values - see
/~https://github.com/alphagov/govuk-puppet/blob/5b1dc13ea3c7180318da284fe1dbe90a14460234/modules/monitoring/manifests/contacts.pp#L57-L59

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/govuk-puppet that referenced this pull request Oct 16, 2020
This is to enable alphagov/govuk_app_config#160,
which ignores certain exceptions if they occur during the data
sync period. We want to set this via Puppet so that it is
configured in only one repo.

Whilst it would be nice to read the start and end times from
govuk_data_sync_in_progress, there is precedent for duplicating
the values - see
/~https://github.com/alphagov/govuk-puppet/blob/5b1dc13ea3c7180318da284fe1dbe90a14460234/modules/monitoring/manifests/contacts.pp#L57-L59

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/govuk-puppet that referenced this pull request Oct 19, 2020
This is to enable alphagov/govuk_app_config#160,
which ignores certain exceptions if they occur during the data
sync period. We want to set this via Puppet so that it is
configured in only one repo. In the next commit I will DRY up the
values so that they're only set in one place, but for now, there
is precedent for duplicating the values - see
/~https://github.com/alphagov/govuk-puppet/blob/5b1dc13ea3c7180318da284fe1dbe90a14460234/modules/monitoring/manifests/contacts.pp#L57-L59

We only want to define GOVUK_DATA_SYNC_PERIOD on Integration and
Staging. If we hit an error on Production during the data sync
period, we still want to log it to Sentry.

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
Useful utility for devs, should be included by default.
ChrisBAshton and others added 14 commits October 20, 2020 13:04
In the next few commits, we're going to be making some changes to
the logic inside the Raven configuration, so we need a class we
can write tests against.
Switching to the [Delegator pattern], rather than allowing
apps to work with the `Raven.configuration` object directly,
gives us an opportunity to perform some custom logic prior
to passing the configuration to Raven.

We anticipate using this to allow apps to specify
`should_capture` logic without overwriting the default logic
in govuk_app_config, by defining our own `should_capture=`
method in ConfigureDefaults which knows how to merge the
two callbacks.

[Delegator pattern]: https://ruby-doc.org/stdlib-2.5.1/libdoc/delegate/rdoc/SimpleDelegator.html
This is based on the should_capture callback defined in
[email-alert-api][], but differs in that it doesn't refer to
the `PG::Error` constant, as that would raise errors in apps
that don't use PostgreSQL. Instead, it uses introspection to
compare the exception's class to a known string ("PG::Error").

We currently have a hard-coded period of 10pm - 8am
([according to puppet][puppet]) in which to ignore PostgreSQL
errors in all environments. In subsequent commits I will have
this read from an environment variable instead.

We're using ActiveSupport's TimeHelpers utility as per
https://docs.publishing.service.gov.uk/manual/conventions-for-rails-applications.html#testing-utilities

Finally, I had to explicitly set the `current_environment=`
to the ENV or the configuring_spec.rb would fail (environment
would be "default"). I could probably change the test itself but
I think it's more useful/self-documenting to have the assignment
in code.

[email-alert-api]: /~https://github.com/alphagov/email-alert-api/blob/d77a94be840a9995e53279b896c5634a7e24f9a6/config/initializers/govuk_error.rb
[puppet]: /~https://github.com/alphagov/govuk-puppet/blob/4bcf837e0dc4f48be59716921841307dfa291ebe/modules/govuk_data_sync_in_progress/manifests/init.pp#L20-L24
Previous approach only looked 1 level deep. On @kevindew's suggestion,
we can reuse Raven code to look infinitely deep:
#159 (comment)
We will do some work in govuk-puppet to create this ENV variable
based on the time ranges that are already defined in puppet:
/~https://github.com/alphagov/govuk-puppet/blob/4bcf837e0dc4f48be59716921841307dfa291ebe/modules/govuk_data_sync_in_progress/manifests/init.pp#L20-L24

I have encapsulated the time parsing in its own class:
GovukDataSync. This sticks to the sole responsibility principle.

NB: As this is a lambda, we cannot use a `return` statement to
exit the `should_capture` lambda early. We have to process all
the logic up front and then use ruby's implicit return.
https://www.reddit.com/r/ruby/comments/4z86sg/another_lambda_vs_proc_vs_blocks_article/
Apps can hook into `should_capture` behaviour by specifying
their own `should_capture=(lambda)` hook. This gets merged
with the default `should_capture` behaviour. An app can
call `should_capture=(lambda)` multiple times if it wishes.

All of the lambdas will be evaluated, and if they all return
`true`, then `should_capture` will return `true`. If any
of them return `false`, then it will return `false`.

This seems a sensible behaviour to roll with for now, as
this is all somewhat hypothetical. The only apps making use of
custom `should_capture` callbacks are [content-publisher][] and
[email-alert-api][], and both are currently only using it to do
what we are defining as the default anyway, so will be removed
soon.

In the future we may want to consider supporting different
lambda merge strategies, so that for example `should_capture`
returns `true` if ANY of the lambdas returns `true` (as opposed
to having to ALL return `true`). The framework is in place
now, so it should be fairly simple to cross that bridge when
we come to it.

[content-publisher]: /~https://github.com/alphagov/content-publisher/blob/6b756d8e52c858abf81eff10966ba39070c9f229/config/initializers/govuk_error.rb
[email-alert-api]: /~https://github.com/alphagov/email-alert-api/blob/afa38dd9087daec1f849f27dbc3941653a5d503b/config/initializers/govuk_error.rb
Prior to this commit, if an app wanted to ignore a particular
type of exception that occurs during the data sync window, they'd
have the facility to do so via the `should_capture=` callback but
would need to recreate the GovukDataSync logic and keep that in
sync with govuk-puppet.

Now, they can simply append their exception to the newly added
`data_sync_excluded_exceptions` property.
Rather than defining the default `should_capture` callback in the
default configuration, which is a little different to the simple
strings, arrays and booleans that exist in this file (although not
that far removed from the `proc`s), it is cleaner to encapsulate
this logic in the RavenDelegator class where we define the
`should_capture=` merging strategy.

We've also made the decision that a `should_capture=` call will
overwrite any previous calls, except for the data sync callback.

I've moved/removed some of the tests as appropriate.

Co-authored-by: Kevin Dew <Kevin.Dew@digital.cabinet-office.gov.uk>
There was previously a lot of emphasis on PG::Error errors, as
these were initially hard-coded. Now that it is more config
driven, we should make the tests a little more abstract.

I've also rearranged the tests to start with the simplest case
(capture errors) and finish with the most complex (ignoring
errors whose original cause was an error we've configured to
ignore).
Previous class 'ConfigureDefaults' doesn't really behave like a class as
all it does is use a function to apply side effects to the argument.
It is cleaner to move the default configuration here (alternatively it
could have been inline in govuk_app_config.rb itself but that felt like
we were burdening the file with too much responsibility.)

It does become very difficult to test the config (and would rely
heavily upon mocks and stubbing), so I've sacrificed the
'silence_ready' tests, which were the only ones that were arguably
useful. Ideally our default config should be as hard-coded and
predictable as possible, so that tests aren't even necessary.
The object is a configuration object for GovukError, and does more
than be a delegator to Raven. The previous naming referenced the
implementation rather than the purpose. For that reason,
'Configuration' seems a better class name.

Co-authored-by: Kevin Dew <Kevin.Dew@digital.cabinet-office.gov.uk>
This is to prevent any surprising behaviour should the downstream
app decide to call 'GovukError.configure' twice. Prior to this
commit, calling twice would keep all the Raven configuration but
lose any custom configuration.

I then needed to edit the 'govuk_error_spec' configure test to
prevent `double("Configuration")` from leaking between tests. I
also found that the change would cause configuring_spec.rb to
fail, as the initialised GovukError module would no longer be
looking for the SENTRY_CURRENT_ENV variable at the time the test
would run.

Once I changed this test to accommodate the singleton pattern, we
no longer needed the `config.current_environment=` line, which was
only really there to satisfy configuring_spec.rb. The line is
unnecessary as the default behaviour is to delegate that to Raven,
which is already being set via the ENV variable, so we don't need
to create that ourselves. We also no longer need configuring_spec
itself - its behaviour is now tested in govuk_error_spec.
@ChrisBAshton
Copy link
Contributor Author

(@kevindew I've just pushed a 'bump to 2.5.0' commit, which involved a rebase as the version had changed while this PR was open. There are no other new changes).

@ChrisBAshton ChrisBAshton merged commit f763d43 into master Oct 20, 2020
@ChrisBAshton ChrisBAshton deleted the datasync-errors-new branch October 20, 2020 12:08
ChrisBAshton added a commit to alphagov/email-alert-api that referenced this pull request Oct 20, 2020
As of govuk_app_config 2.5.0, excluding errors of type `PG::Error`
that occur during the data sync on integration/staging now happens
automatically. See alphagov/govuk_app_config#160

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/content-publisher that referenced this pull request Oct 20, 2020
As of govuk_app_config 2.5.0, excluding errors of type `PG::Error`
that occur during the data sync on integration/staging now happens
automatically. See alphagov/govuk_app_config#160

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/content-publisher that referenced this pull request Oct 20, 2020
As of govuk_app_config 2.5.0, excluding errors of type `PG::Error`
that occur during the data sync on integration/staging now happens
automatically. See alphagov/govuk_app_config#160

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit to alphagov/email-alert-api that referenced this pull request Oct 20, 2020
As of govuk_app_config 2.5.0, excluding errors of type `PG::Error`
that occur during the data sync on integration/staging now happens
automatically. See alphagov/govuk_app_config#160

Trello: https://trello.com/c/rAlvGlSp/2123-5-ignore-data-sync-errors-in-govukappconfig
ChrisBAshton added a commit that referenced this pull request Oct 27, 2020
We sometimes spin up temporary environments (e.g. 'test-pink-aws')
while developing applications. If there are errors in those apps,
we can accidentally send many thousands of errors to Sentry, which
can mean we hit rate limits on our other, production,
environments, as we have a single shared Sentry account.

This commit introduces a `active_sentry_environments` config
property, which is checked against the SENTRY_CURRENT_ENV ENV
variable. If an error occurs in an environment we care about,
it is reported to Sentry, otherwise it gets ignored.

There is prior art for this - in #160 we added a similar check,
such that if a particular error occurs during the nightly data
sync, it gets ignored. I've made sure this new functionality is
compatible with what's already been added.

Trello: https://trello.com/c/GfFwJG0K/2180-investigate-disabling-sentry-on-environments-we-dont-care-about-timebox-2-days
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal for a generalised approach for data sync error suppression
2 participants