CORE-7996 Trial: build in an evaluation period license #23893

pgellert · 2024-10-23T17:02:36Z

Create a builtin fallback license that allows customers to trial
Enterprise features for a trial period starting at cluster creation.

Fixes https://redpandadata.atlassian.net/browse/CORE-7996

Backports Required

Release Notes

Features

After the cluster is first formed, a trial license is automatically loaded to provide an evaluation period of enterprise features.

pgellert · 2024-10-23T17:03:09Z

/ci-repeat

src/v/features/feature_table.cc

michael-redpanda · 2024-10-23T18:11:11Z

src/v/features/feature_table.cc

+    if (ss::this_shard_id() == 0) {
+        vlog(featureslog.debug, "Initializing builtin trial license.");
+    }


Is this conditional just for the log message?

Yes, to only log on shard 0 and not all cores

Maybe log the timestamp as well

Yeah, and then make_builtin_trial_license wouldn't need to log, which it does on every core.

michael-redpanda · 2024-10-23T18:14:09Z

src/v/redpanda/admin/server.cc

@@ -2407,7 +2407,7 @@ void admin_server::register_features_routes() {
          ss::httpd::features_json::license_response res;
          res.loaded = false;
          const auto& ft = _controller->get_feature_table().local();
-          const auto& license = ft.get_license();
+          const auto& license = ft.get_license(false);


we do want to return the evaluation license via the get_license endpoint

sort of related to this - the evaluation license has to supersede all functional effects of license enforcement, but

should the response from GET /v1/features/enterprise look like there's a real license installed? or

should we suppress the license nag during the trial period?

should the has_valid_license bool in phone-home telemetry be an enum? or maybe we add an additional field for "evaluation_period"?

IMO yes, the evaluation period should work as if a real trial license had been installed (i.e., don't nag until expiry, show that there is a trial license installed, expiry time metric > 0).

% rpk cluster license info
LICENSE INFORMATION
===================
License status:    valid
License violation: false
Organization:      Redpanda Built-In Evaluation Period
Type:              free_trial
Expires:           Dec 6 2024

should the has_valid_license bool in phone-home telemetry be an enum? or maybe we add an additional field for "evaluation_period"?

Probably easiest to add a new field for "evaluation_period". I guess this could be derived from the license hash (id_hash == 0) but probably best to add it explicitly.

Aha, totally missed the free trial type on the license!

The request from product is to have it return "trial" just like if it's a generated trial license. We should be able to tell that it's the evaluation by:

The organization, and

The empty SHA256 checksum
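
A client could combine the two signals above roughly as sketched here. This assumes the license response nests the license under `license` (as the ducktape check in `redpanda.py` does) and exposes the organization and checksum under keys named `org` and `sha256`. Those two key names and the helper name are assumptions for illustration, not something this PR confirms.

```python
# Hypothetical helper: tell the built-in evaluation license apart from a
# generated trial license. The key names `org` and `sha256` are assumptions.
EVALUATION_ORG = "Redpanda Built-In Evaluation Period"


def is_evaluation_period(response):
    """Return True if the response looks like the built-in evaluation license."""
    if not response or not response.get("loaded"):
        return False
    lic = response.get("license") or {}
    # Signal 1: the organization name used by the built-in license.
    org_matches = lic.get("org") == EVALUATION_ORG
    # Signal 2: the built-in license reports an empty checksum.
    checksum_empty = not lic.get("sha256")
    return org_matches and checksum_empty
```

Requiring both signals keeps a generated trial license, which also reports type `free_trial`, from being mistaken for the evaluation period.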

src/v/features/feature_table.h

dotnwat · 2024-10-24T00:26:05Z

src/v/cluster/metrics_reporter.cc

@@ -312,10 +313,14 @@ ss::future<> metrics_reporter::try_initialize_cluster_info() {

    auto& first_cfg = batches.front();

+    _cluster_info.creation_timestamp = first_cfg.header().first_timestamp;
+    co_await _feature_table.invoke_on_all([&](features::feature_table& ft) {
+        ft.set_builtin_trial_license(_cluster_info.creation_timestamp);


what do we do if creation timestamp is bogus (e.g. ntp isn't running and clocks aren't accurate)?

Yeah, that's a good point. Specifically, when only the creation timestamp is bogus, but now the time is accurate, right? Surely, all bets are off if the time is bogus throughout the cluster's runtime (e.g., license checks against the timestamp in the license would break equally). Do you have any ideas? The only improvement I can think of is sanity-checking that the creation timestamp is in the past, but it's hard to determine the cluster creation time retrospectively because the prefix of the controller log is snapshotted away after ~60s.
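
The sanity check floated above could look roughly like this. It is only a sketch: the 30-day trial length, the function name, and the choice to clamp a future creation time to the current time are all assumptions made for illustration, not what the PR implements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical trial length; the PR text does not state the real value.
TRIAL_PERIOD = timedelta(days=30)


def trial_expiry(creation_ms, now=None):
    """Compute a trial expiry from a creation timestamp in epoch milliseconds."""
    now = now or datetime.now(timezone.utc)
    created = datetime.fromtimestamp(creation_ms / 1000, tz=timezone.utc)
    # Sanity check: a creation time in the future means the clock was wrong
    # when the cluster was formed, so clamp it to the current time.
    if created > now:
        created = now
    return created + TRIAL_PERIOD
```

This only catches creation timestamps that lie in the future. A timestamp that is wrong but still in the past would pass unchanged.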

src/v/features/feature_table.cc

pgellert · 2024-10-24T14:48:06Z

force-push:

Introduced a should_sanction method that returns false until the cluster birth-based trial license is set. This avoids the need for the node-start-based cluster license and its ignoring parameter in get_license.
Renamed the trial license org to "Redpanda Built-In Evaluation Period"

pgellert · 2024-10-24T16:07:49Z

force-push:

fix a bug (negated boolean ahead of should_sanction)
add a unit test to demonstrate the expected feature_table::should_sanction behaviour
have revoke_license revoke the builtin license as well

pgellert · 2024-10-24T16:15:27Z

/ci-repeat

src/v/features/feature_table.h

vbotbuildovich · 2024-10-24T18:59:51Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57127#0192bf96-5477-437e-b9d1-cef05cec4b52
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57127#0192bfae-8978-4c60-9209-9c26b3caef88

IoannisRP · 2024-10-25T10:07:19Z

src/v/features/feature_table.cc

+    if (_license) {
+        return _license->is_expired();
+    } else if (_builtin_trial_license) {
+        return _builtin_trial_license->is_expired();
+    }
+
+    // While we are yet to initialize _builtin_trial_license on cluster
+    // creation, be permissive
+    return _builtin_trial_license_initialized;


Suggested change

-    if (_license) {
-        return _license->is_expired();
-    } else if (_builtin_trial_license) {
-        return _builtin_trial_license->is_expired();
-    }
-
-    // While we are yet to initialize _builtin_trial_license on cluster
-    // creation, be permissive
-    return _builtin_trial_license_initialized;
+    auto license = get_license();
+    if (!license) {
+        // ...
+        return _builtin_trial_license_initialized;
+    }
+    return license->is_expired();

are these 2 identical? Or am i missing an edge case?

Yes, they are. I'm happy to switch to that if you think that's simpler.
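
To back up the claim that the two forms agree, here is a small Python model that enumerates every combination of state. It assumes `get_license()` returns the configured license when one is present and the built-in trial license otherwise. The class and function names are hypothetical stand-ins for the C++ members.

```python
from itertools import product


class License:
    """Minimal stand-in for a license: only knows whether it has expired."""

    def __init__(self, expired):
        self.expired = expired

    def is_expired(self):
        return self.expired


def should_sanction_original(configured, builtin, initialized):
    if configured:
        return configured.is_expired()
    elif builtin:
        return builtin.is_expired()
    # Be permissive until the built-in trial license has been initialized.
    return initialized


def should_sanction_suggested(configured, builtin, initialized):
    # Assumed get_license(): configured license first, then the built-in one.
    effective = configured or builtin
    if not effective:
        return initialized
    return effective.is_expired()


def forms_agree():
    """Compare both forms over all 3 x 3 x 2 state combinations."""
    options = [None, License(expired=False), License(expired=True)]
    return all(
        should_sanction_original(c, b, i) == should_sanction_suggested(c, b, i)
        for c, b, i in product(options, options, [False, True])
    )
```

Under that assumption about `get_license()`, `forms_agree()` holds for every combination, including the case where neither license is set.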

IoannisRP · 2024-10-25T10:09:03Z

src/v/features/feature_table.cc

+    _builtin_trial_license_initialized = true;
+    _builtin_trial_license = make_builtin_trial_license(
+      model::to_time_point(cluster_creation_timestamp));


nit: while make_builtin_trial_license doesn't seem to be able to throw, we should probably set the bool after it is initialized, mainly for sanity reasons.

oleiman · 2024-10-26T02:57:10Z

FYI - tripped over

redpanda/tests/rptest/services/redpanda.py

Line 5358 in fb388a0

if license is None or license['loaded'] is not True:

while working on DT for the config stuff

That check is no longer meaningful w/ a trial license. could be

            if license is None or not license['loaded'] or \
                    license['license']['type'] == 'free_trial':

or something like that.
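
Spelled out as a standalone helper, the adjusted check might read as follows. The function name is hypothetical; the dictionary shape (`loaded`, `license.type`) is taken from the snippet above.

```python
def has_real_license(license):
    """Return True only when a non-trial license is loaded."""
    if license is None or not license.get('loaded'):
        return False
    # The built-in evaluation license reports type 'free_trial'. A generated
    # trial license reports the same type, so it is excluded here as well.
    return license['license']['type'] != 'free_trial'
```

The condition in the snippet above is the negation of this helper.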

BenPope

Looking good.

BenPope · 2024-10-28T09:46:34Z

src/v/features/feature_table.cc

+    if (ss::this_shard_id() == 0) {
+        vlog(featureslog.debug, "Initializing builtin trial license.");
+    }


Maybe log the timestamp as well

Yeah, and then make_builtin_trial_license wouldn't need to log, which it does on every core.

src/v/model/timestamp.h

Create a builtin fallback license that allows customers to trial Enterprise features for a trial period starting at cluster creation. This commit implements the `get_license` interface that will be used in following commits to start using this built in trial period license when no license has been configured yet.

Use the built in evaluation period trial license in the following admin API endpoints:
* `GET /v1/features/license`
* `PUT /v1/features/license`
* `GET /v1/features/enterprise`

This is to ensure that the trial license and its expiry time are visible to rpk and the customer, both for the `rpk cluster license info/set` commands and the rpk-side license nag.

pgellert · 2024-10-28T10:32:33Z

force-push:

fixed the unit/fixture/dt test failures detected in CI
addressed code review feedback
introduced feature_table::get_configured_license() which never returns the evaluation period license (used for taking controller snapshots and for cloud storage-based whole cluster recovery)

vbotbuildovich · 2024-10-28T13:52:29Z

Retry command for Build#57223

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all_with_consumer_group

michael-redpanda

Just a few non-blocking comments, looks good

src/v/redpanda/admin/server.cc

src/v/features/feature_table.cc

tests/rptest/services/admin.py

michael-redpanda · 2024-10-28T14:19:03Z

tests/rptest/tests/enterprise_features_license_test.py

+    def _disable_evaluation_period(self):
+        self.redpanda.set_environment(
+            {'__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE': True})
+        self.redpanda.restart_nodes(self.redpanda.nodes)


suggestion: maybe move this to redpanda.py, but this isn't a blocker

FWIW the restart is only needed if the brokers have already been started. Tests that manually start the brokers (empty setUp method, eg. the tests in license_enforcement_test.py) can set the env var before the first broker is started and the brokers will pick it up on startup.

In this test case, I chose to restart the nodes instead of switching to manually start them because that seemed easier.

this is all getting ripped up in my PR anyway. i think "get the test passing and move on" is the right course of action.

src/v/cluster/feature_manager.cc

michael-redpanda · 2024-10-28T14:24:39Z

tests/rptest/tests/license_enforcement_test.py

+        self.redpanda.set_environment(
+            {'__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE': True})


suggestion: maybe this could be a variable passed to the redpanda constructor or a method?

I didn't add it in the constructor because I only wanted it to apply to the first two tests but not test_enterprise_cluster_bootstrap. start and restart don't currently take extra environment variables; only extra configs (at the moment).

I guess I could factor out the first two tests into a separate test class, and then I could pass it into the constructor.

tests/rptest/tests/cluster_features_test.py

oleiman · 2024-10-28T17:19:52Z

tests/rptest/tests/enterprise_features_license_test.py

-    ])
-    def test_get_enterprise(self, with_license):
+    @matrix(with_license=[True, False], with_evaluation_period=[True, False])
+    def test_get_enterprise(self, with_license, with_evaluation_period):


might consider just marking these ok to fail temporarily and otherwise leaving exactly as is. your changes look fine but it'll be a pain to integrate with #23899

eh, probably not worth delaying merge. i can deal with it.

dotnwat · 2024-10-28T17:22:47Z

src/v/features/feature_table.cc

+    if (_license) {
+        return _license->is_expired();
+    } else if (_builtin_trial_license) {
+        return _builtin_trial_license->is_expired();
+    }


is it the case that you are creating a special carve-out for the trial license? it would seem that an alternative would have been to generate an actual license that would expire as expected, and re-use all of the existing machinery. is that because you can't generate a license without the private key? if so, would it have been an option to expand the semantics of the existing license abstraction as opposed to expressing a trial license explicitly?

We are able to create a license in the broker without the private key because internally the license is stored unencrypted, only licenses passed in from outside through the admin API are signed and validated.

We store the real license and the eval license separately (ie. have the eval license carved out) because we need a way to tell them apart at some points of the code. For example, we only recover the user-configured license during whole cluster recovery but not the evaluation period license, and we only write the user-configured license into the controller snapshot. We could represent this information differently (eg. introduce security::license_type::evaluation_period or add a new source_type field to security::license), but storing it as a separate field seemed simplest to me because it guarantees by design that we can't overwrite a real license with an evaluation period license. Storing the license in memory instead of writing it into the controller log also makes it easy to change the behaviour of the trial license later if the product asks change (eg. change the evaluation period length).

(fyi I've drafted up the alternative you proposed; it would look something like this: #23945)

Let me know what you think with the above in mind. I'm open to changing it, potentially even to this alternative implementation, but the reasons above are why the evaluation period license is currently carved out into a separate field.

thanks for the detailed response, it's awesome! i think what we are doing in this PR is ok. actually, it's great--not generalizing until we get a few examples (currently N=2) makes a lot of sense. but maybe you can see where it could break down in the future: product asks for a new kind of license, and then we are asked for carve-outs for different features, etc... pretty soon the carve-out approach becomes unwieldy.

yeah makes sense. I would expect that product asks would mostly just change the way the evaluation period behaves and we would not need to introduce a new kind of license. But if we do, we can certainly think about ways of simplifying things at N=3.
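
The separation discussed in this thread can be modeled in a few lines. This is a Python sketch of the idea, not the C++ implementation. The class name is hypothetical, and the method names mirror `get_license`, `get_configured_license`, `set_builtin_trial_license`, and `revoke_license` from the PR.

```python
class FeatureTableModel:
    """Toy model: the configured and built-in licenses live in separate fields."""

    def __init__(self):
        self._license = None                # user-configured license
        self._builtin_trial_license = None  # in-memory evaluation license

    def set_license(self, lic):
        self._license = lic

    def set_builtin_trial_license(self, lic):
        # Writes a different field, so it can never clobber a real license.
        self._builtin_trial_license = lic

    def get_license(self):
        # Effective license: the configured one wins, the trial is a fallback.
        if self._license is not None:
            return self._license
        return self._builtin_trial_license

    def get_configured_license(self):
        # For snapshots and whole cluster recovery: never the trial license.
        return self._license

    def revoke_license(self):
        # Per the PR, revoking drops the built-in trial license too.
        self._license = None
        self._builtin_trial_license = None
```

Because the trial license never enters `_license`, anything built on `get_configured_license()` cannot persist or recover it by accident.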

pgellert · 2024-10-29T15:21:24Z

/cdt

pgellert · 2024-10-30T12:26:47Z

CDT test failures:

IcebergRESTCatalogSmokeTest.test_basic -- https://redpandadata.atlassian.net/browse/CORE-8097
LargeMessagesTest.test_large_messages_throughput -- https://redpandadata.atlassian.net/browse/DEVPROD-2066
TestReadReplicaService.test_writes_forbidden -- https://redpandadata.atlassian.net/browse/CORE-7867
NodeWiseRecoveryTest.test_node_wise_recovery -- https://redpandadata.atlassian.net/browse/CORE-7977
PartitionBalancerTest.test_transfer_controller_leadership -- https://redpandadata.atlassian.net/browse/CORE-7799 (different test, same symptom as this one and a few others)
AlterConfigMixedNodeTest.test_alter_config_shadow_indexing_mixed_node -- seems unrelated, maybe this could be flaky on line 409 reaching redpanda.remote.readreplica instead of read?
DataMigrationsApiTest.* -- https://redpandadata.atlassian.net/browse/CORE-8064
DescribeTopicsTest.test_describe_topics_with_documentation_and_types -- seems unrelated (looks like the test needs updating with the new topic property topic_property_iceberg_translation_interval_ms)

FIPS CDT test failures:

IcebergRESTCatalogSmokeTest.test_basic -- https://redpandadata.atlassian.net/browse/CORE-8097
LargeMessagesTest.test_large_messages_throughput -- https://redpandadata.atlassian.net/browse/DEVPROD-2066
EndToEndShadowIndexingTest.test_write -- https://redpandadata.atlassian.net/browse/CORE-7936
LicenseEnforcementTest.test_escape_hatch_license_variable -- unrelated, created https://redpandadata.atlassian.net/browse/CORE-8098 to track it

emaxerrno · 2024-10-31T18:35:17Z

src/v/features/feature_table.cc

+    auto expiry = std::chrono::duration_cast<std::chrono::seconds>(
+      expiry_time.time_since_epoch());
+
+    if (std::getenv("__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE") != nullptr) {


the double prefix underscore is odd. we have so many notations we should unify.

RPK has env variables, for example, that do not have this. @twmb - we should have a developer experience effort here to unify these. the product flags feel disjointed.

The intended use for this environmental variable is for testing and not for end-user consumption. This pattern matches other environmental variables here, here, and here

i pinged @mattschumpert on this too to make sure all the vars are consistent. this should have product input for consistency. i only care about consistency, the prefix,postfix,dot-notation,etc doesn't matter as much to me as consistency across all product lines.

The ENV vars specifically are so internal facing, and no user ever sees them unless they're browsing the code, that it only matters for redpanda devs I think.

I've never once looked at any of these (only a couple of documented external env vars where we don't prefix).

the '__REDPANDA' seems to be our convention for internal, and seems consistent so it makes sense to me.

Yeh, internal consumption only. Should never be in documentation.
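
For anyone setting this variable in tests: the C++ check in the diff at the top of this thread only tests whether the variable is present (`std::getenv(...) != nullptr`), not what it is set to. A Python rendering of that rule, with a hypothetical function name:

```python
import os

DISABLE_VAR = "__REDPANDA_DISABLE_BUILTIN_TRIAL_LICENSE"


def builtin_trial_disabled(environ=None):
    """Mirror `std::getenv(name) != nullptr`: presence matters, the value does not."""
    env = os.environ if environ is None else environ
    return DISABLE_VAR in env
```

Under this rule, setting the variable to `False`, `0`, or an empty string still disables the trial license. Only leaving it unset keeps the license enabled.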

pgellert requested a review from a team October 23, 2024 17:02

pgellert self-assigned this Oct 23, 2024

pgellert requested review from BenPope and removed request for a team October 23, 2024 17:02

github-actions bot added the area/redpanda label Oct 23, 2024

pgellert requested review from oleiman, michael-redpanda and IoannisRP October 23, 2024 17:02

oleiman reviewed Oct 23, 2024

src/v/features/feature_table.cc

oleiman reviewed Oct 23, 2024

src/v/features/feature_table.cc (outdated)

oleiman reviewed Oct 23, 2024

src/v/features/feature_table.cc (outdated)

michael-redpanda reviewed Oct 23, 2024

src/v/features/feature_table.h (outdated)

dotnwat reviewed Oct 24, 2024

oleiman mentioned this pull request Oct 24, 2024

[CORE-7997] Suppress enterprise configs when license is missing or expired #23899

Merged

9 tasks

BenPope reviewed Oct 24, 2024

src/v/features/feature_table.cc (outdated)

pgellert force-pushed the trial/poc branch from 46c4f79 to 72f67bc Compare October 24, 2024 14:45

pgellert force-pushed the trial/poc branch from 72f67bc to c3eb6f9 Compare October 24, 2024 16:05

github-actions bot added the area/build label Oct 24, 2024

pgellert requested review from michael-redpanda, oleiman, BenPope and dotnwat October 24, 2024 16:15

BenPope mentioned this pull request Oct 24, 2024

[CORE-7998] PoC: Tiered Storage Sanctioning #23895

Closed

7 tasks

michael-redpanda reviewed Oct 24, 2024

src/v/features/feature_table.h

IoannisRP reviewed Oct 25, 2024

BenPope reviewed Oct 28, 2024

pgellert added 7 commits October 28, 2024 10:27

cluster: use evaluation period in upgrade blocking

ab46d1e

This ensures that we allow major version upgrades during the evaluation period.

features: use evaluation period in expiry metric

2d65d5c

This ensures that the expiry of the built in trial license can be monitored.

features: use evaluation period in license nag

ffd8138

We do not want to emit the license nag during the evaluation period.

features: implement should_sanction method

cad7c7d

Helper to tell whether to act on an expired license or evaluation period by restricting enterprise feature usage

features: revoke_license also revoke trial license

7702ed1

Have `revoke_license` revoke the builtin trial license as well for more convenient testing.

pgellert force-pushed the trial/poc branch from c3eb6f9 to 7702ed1 Compare October 28, 2024 10:32

pgellert marked this pull request as ready for review October 28, 2024 10:34

pgellert requested review from BenPope, michael-redpanda and IoannisRP October 28, 2024 10:34

pgellert changed the title ~~CORE-7996 Trial License POC~~ CORE-7996 Trial: build in an evaluation period license Oct 28, 2024

michael-redpanda approved these changes Oct 28, 2024

oleiman reviewed Oct 28, 2024

tests/rptest/tests/cluster_features_test.py

oleiman reviewed Oct 28, 2024

dotnwat reviewed Oct 28, 2024

BenPope approved these changes Oct 29, 2024

pgellert mentioned this pull request Oct 29, 2024

CORE-7996 Trial/poc 2.0 #23945

Closed

7 tasks

pgellert mentioned this pull request Oct 29, 2024

[DNM] Trial License Test FIPS CDT #23949

Closed

7 tasks

michael-redpanda merged commit cc888d0 into redpanda-data:dev Oct 30, 2024
27 of 30 checks passed

emaxerrno reviewed Oct 31, 2024
