Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-7996 Trial: build in an evaluation period license #23893

Merged
merged 7 commits into from
Oct 30, 2024

Conversation

pgellert
Copy link
Contributor

@pgellert pgellert commented Oct 23, 2024

Create a builtin fallback license that allows customers to trial
Enterprise features for a trial period starting at cluster creation.

Fixes https://redpandadata.atlassian.net/browse/CORE-7996

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Features

  • After the cluster is first formed, a trial license is automatically loaded to provide an evaluation period of enterprise features.

@pgellert pgellert requested a review from a team October 23, 2024 17:02
@pgellert pgellert self-assigned this Oct 23, 2024
@pgellert pgellert requested review from BenPope and removed request for a team October 23, 2024 17:02
@pgellert
Copy link
Contributor Author

/ci-repeat

Comment on lines 721 to 722
if (ss::this_shard_id() == 0) {
vlog(featureslog.debug, "Initializing builtin trial license.");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this conditional just for the log message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to only log on shard 0 and not all cores

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe log the timestamp as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe log the timestamp as well

Yeah, and then make_builtin_trial_license wouldn't need to log, which it does on every core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -2407,7 +2407,7 @@ void admin_server::register_features_routes() {
ss::httpd::features_json::license_response res;
res.loaded = false;
const auto& ft = _controller->get_feature_table().local();
const auto& license = ft.get_license();
const auto& license = ft.get_license(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do want to return the evaluation license via the get_license endpoint

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort of related to this - the evaluation license has to supersede all functional effects of license enforcement, but

  1. should the response from GET /v1/features/enterprise look like there's a real license installed? or
  2. should we suppress the license nag during the trial period?
  3. should the has_valid_license bool in phone-home telemetry be an enum? or maybe we add an additional field for "evaluation_period"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO yes, the evaluation period should work as if a real trial license had been installed (i.e., don't nag until expiry, show that there is a trial license installed, expiry time metric > 0).

% rpk cluster license info
LICENSE INFORMATION
===================
License status:     valid
License violation:  false
Organization:       Redpanda Built-In Evaluation Period
Type:               free_trial
Expires:            Dec 6 2024

should the has_valid_license bool in phone-home telemetry be an enum? or maybe we add an additional field for "evaluation_period"?

Probably easiest to add a new field for "evaluation_period". I guess this could be derived from the license hash (id_hash == 0) but probably best to add it explicitly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, totally missed the free trial type on the license!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request from product is to have it return "trial" just like if it's a generated trial license. We should be able to tell that it's the evaluation by:

  • The organization, and
  • The empty SHA256 checksum

@@ -312,10 +313,14 @@ ss::future<> metrics_reporter::try_initialize_cluster_info() {

auto& first_cfg = batches.front();

_cluster_info.creation_timestamp = first_cfg.header().first_timestamp;
co_await _feature_table.invoke_on_all([&](features::feature_table& ft) {
ft.set_builtin_trial_license(_cluster_info.creation_timestamp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we do if creation timestamp is bogus (e.g. ntp isn't running and clocks aren't accurate)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. Specifically, when only the creation timestamp is bogus, but now the time is accurate, right? Surely, all bets are off if the time is bogus throughout the cluster's runtime (e.g., license checks against the timestamp in the license would break equally). Do you have any ideas? The only improvement I can think of is sanity-checking that the creation timestamp is in the past, but it's hard to determine the cluster creation time retrospectively because the prefix of the controller log is snapshotted away after ~60s.

@pgellert
Copy link
Contributor Author

force-push:

  • Introduced a should_sanction method that returns false until the cluster birth-based trial license is set. This avoids the need for the node-start-based cluster license and its ignoring parameter in get_license.
  • Renamed the trial license org to "Redpanda Built-In Evaluation Period"

@pgellert
Copy link
Contributor Author

force-push:

  • fix a bug (negated boolean ahead of should_sanction)
  • add a unit test to demonstrate the expected feature_table::should_sanction behaviour
  • have revoke_license revoke the builtin license as well

@pgellert
Copy link
Contributor Author

/ci-repeat

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Oct 24, 2024

Comment on lines +738 to +755
if (_license) {
return _license->is_expired();
} else if (_builtin_trial_license) {
return _builtin_trial_license->is_expired();
}

// While we are yet to initialize _builtin_trial_license on cluster
// creation, be permissive
return _builtin_trial_license_initialized;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (_license) {
return _license->is_expired();
} else if (_builtin_trial_license) {
return _builtin_trial_license->is_expired();
}
// While we are yet to initialize _builtin_trial_license on cluster
// creation, be permissive
return _builtin_trial_license_initialized;
auto license = get_license();
if (!license) {
// ...
return _builtin_trial_license_initialized;
}
return license->is_expired();

are these 2 identical? Or am i missing an edge case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are. I'm happy to switch to that if you think that's simpler.

Comment on lines 723 to 720
_builtin_trial_license_initialized = true;
_builtin_trial_license = make_builtin_trial_license(
model::to_time_point(cluster_creation_timestamp));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: while make_builtin_trial_license doesn't seem to be able to throw, we should probably set the bool after it is initialized, mainly for sanity reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@oleiman
Copy link
Member

oleiman commented Oct 26, 2024

FYI - tripped over

if license is None or license['loaded'] is not True:
while working on DT for the config stuff

That check is no longer meaningful w/ a trial license. could be

            if license is None or not license['loaded'] or \
                        license['license']['type'] == 'free_trial':

or something like that.

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

Comment on lines 721 to 722
if (ss::this_shard_id() == 0) {
vlog(featureslog.debug, "Initializing builtin trial license.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe log the timestamp as well

Yeah, and then make_builtin_trial_license wouldn't need to log, which it does on every core.

src/v/model/timestamp.h Outdated Show resolved Hide resolved
Create a builtin fallback license that allows customers to trial
Enterprise features for a trial period starting at cluster creation.

This commit implements the `get_license` interface that will be used in
following commits to start using this built in trial period license when
no license has been configured yet.
Use the built in evaluation period trial license in the following admin
API endpoints:
* `GET /v1/features/license`
* `PUT /v1/features/license`
* `GET /v1/features/enterprise`

This is to ensure that the trial license and its expiry time is visible
to rpk and the customer, both for the `rpk cluster license info/set`
commands and the rpk-side license nag.
This ensures that we allow major version upgrades during the evaluation
period.
This ensures that the expiry of the built in trial license can be
monitored.
We do not want to emit the license nag during the evaluation period.
Helper to tell whether to act on an expired license or evaluation period
by restricting enterprise feature usage
Have `revoke_license` revoke the builtin trial license as well for more
convenient testing.
@pgellert
Copy link
Contributor Author

pgellert commented Oct 28, 2024

force-push:

  • fixed the unit/fixture/dt test failures detected in CI
  • addressed code review feedback
  • introduced feature_table::get_configured_license() which never returns the evaluation period license (used for taking controller snapshots and for cloud storage-based whole cluster recovery)

@pgellert pgellert marked this pull request as ready for review October 28, 2024 10:34
@pgellert pgellert changed the title CORE-7996 Trial License POC CORE-7996 Trial: build in an evaluation period license Oct 28, 2024
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#57223

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all_with_consumer_group

Copy link
Contributor

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few non-blocking comments, looks good

src/v/redpanda/admin/server.cc Outdated Show resolved Hide resolved
src/v/redpanda/admin/server.cc Outdated Show resolved Hide resolved
src/v/features/feature_table.cc Outdated Show resolved Hide resolved
tests/rptest/services/admin.py Show resolved Hide resolved
Comment on lines +93 to +96
def _disable_evaluation_period(self):
self.redpanda.set_environment(
{'__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE': True})
self.redpanda.restart_nodes(self.redpanda.nodes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: maybe move this to redpanda.py, but this isn't a blocker

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the restart is only needed if the brokers have already been started. Tests that manually start the brokers (empty setUp method, eg. the tests in license_enforcement_test.py) can set the env var before the first broker is started and the brokers will pick it up on startup.

In this test case, I chose to restart the nodes instead of switching to manually start them because that seemed easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is all getting ripped up in my PR anyway. i think "get the test passing and move on" is the right course of action.

src/v/cluster/feature_manager.cc Outdated Show resolved Hide resolved
Comment on lines +49 to +50
self.redpanda.set_environment(
{'__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE': True})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: maybe this be a variable passed to the redpanda constructor or a method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add it in the constructor because I only wanted it to apply to the first two tests but not test_enterprise_cluster_bootstrap. start and restart don't currently take extra environment variables; only extra configs (at the moment).

I guess I could factor out the first two tests into a separate test class, and then I could pass it into the constructor.

])
def test_get_enterprise(self, with_license):
@matrix(with_license=[True, False], with_evaluation_period=[True, False])
def test_get_enterprise(self, with_license, with_evaluation_period):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might consider just marking these ok to fail temporarily and otherwise leaving exactly as is. your changes look fine but it'll be a pain to integrate with #23899

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, probably not worth delaying merge. i can deal with it.

Comment on lines +747 to +751
if (_license) {
return _license->is_expired();
} else if (_builtin_trial_license) {
return _builtin_trial_license->is_expired();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it the case that you are creating a special carve-out for the trial license? it would seem that an alternative would have been to generate an actual license that would expire as expected, and re-use all of the existing machinery. is that because you can't generate a license without the private key? if so, would it have been an option to expand the semantics of the existing license abstraction as opposed to expressing a trial license explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are able to create a license in the broker without the private key because internally the license is stored unencrypted, only licenses passed in from outside through the admin API are signed and validated.

We store the real license and the eval license separately (ie. have the eval license carved out) because we need a way to tell them apart at some points of the code. For example, we only recover the user-configured license during whole cluster recovery but not the evaluation period license, and we only write the user-configured license into the controller snapshot. We could represent this information differently (eg. introduce security::license_type::evaluation_period or add a new source_type field to security::license), but storing it as a separate field seemed simplest to me because it guarantees by design that we can't overwrite a real license with an evaluation period license. Storing the license in memory instead of writing it into the controller log also makes it easy to change the behaviour of the trial license later if the product asks change (eg. change the evaluation period length).

(fyi I've drafted up the alternative you proposed; it would look something like this: #23945)

Let me know what you think with the above in mind. I'm open to changing it, potentially even to this alternative implementation, but the reasons above are why the evaluation period license is currently carved out into a separate field.

Copy link
Member

@dotnwat dotnwat Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the detailed response, it's awesome! i think what we are doing in this PR is ok. actually, it's great--not generalizing until we get a few examples (currently N=2) makes a lot of sense. but maybe you can see where it could break down in the future: product asks for a new kind of license, and then we are asked for carve-outs for different features, etc... pretty soon the carve-out approach becomes unwieldy.

Copy link
Contributor Author

@pgellert pgellert Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah makes sense. I would expect that product asks would mostly just change the way the evaluation period behaves and we would not need to introduce a new kind of license. But if we do, we can certainly think about ways of simplifying things at N=3.

@pgellert pgellert mentioned this pull request Oct 29, 2024
7 tasks
@pgellert
Copy link
Contributor Author

/cdt

@pgellert pgellert mentioned this pull request Oct 29, 2024
7 tasks
@pgellert
Copy link
Contributor Author

pgellert commented Oct 30, 2024

CDT test failures:

FIPS CDT test failures:

@michael-redpanda michael-redpanda merged commit cc888d0 into redpanda-data:dev Oct 30, 2024
27 of 30 checks passed
auto expiry = std::chrono::duration_cast<std::chrono::seconds>(
expiry_time.time_since_epoch());

if (std::getenv("__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE") != nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the double prefix underscore is odd. we have so many notations we should unify.

RPK has env variables for example as well that do not have this. @twmb - we should have a developer experience here unify these. the product flags feel disjointed.

Copy link
Contributor

@michael-redpanda michael-redpanda Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intended use for this environmental variable is for testing and not for end-user consumption. This pattern matches other environmental variables here, here, and here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i pinged @mattschumpert on this too to make sure all the vars are consistent. this should have product input for consistency. i only care about consistency, the prefix,postfix,dot-notation,etc doesn't matter as much to me as consistency across all product lines.

Copy link

@mattschumpert mattschumpert Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ENV vars specifically are so internal facing and no user ever sees them unless their browsing the code that it only matters for redpanda devs I think.

I've never once looked at any of these (only a couple of documented external env vars where we don't prefix).

the '__REDPANDA' seems to be our convention for internal, and seems consistent so it makes sense to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, internal consumption only. Should never be in documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants