dbt self-managed state #3159

ncolomer · 2021-03-11T16:57:31Z

Describe the feature

Thanks to dbt's state capabilities, one can use dbt as a state machine by making it always refer to its latest state.

The problem is that it's up to the user to manage storage of the state files + the logic to provide them back to dbt.

We implemented our own version of this mechanism in our CI/CD layer using the following logic:

remote_state_uri = "cloud://{remote_state_bucket}/{contextual_prefix}"
local_state_path = "{dbt_target_path}/state"
dbt_state_opt    = ""

# pull the previous manifest, if any, and point dbt at it
if remote_state_enabled and exists("{remote_state_uri}/manifest.json"):
  pull(from="{remote_state_uri}/manifest.json", to=local_state_path)
  dbt_state_opt = "--state {local_state_path}"

execute("dbt run {dbt_state_opt} ...")

# persist the manifest produced by this run as the new state
if remote_state_enabled:
  push(from="{dbt_target_path}/manifest.json", to=remote_state_uri)

Notes:

in our case, remote_state_uri points to cloud storage, but it could equally be a local or network file system
contextual_prefix value is computed from contextual information like the environment or the customer
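
For readers who want something runnable, here is a minimal Python sketch of the same pull / run / push logic. It assumes the "remote" is simply a directory on a local or network file system; the function name `run_with_state` and the injectable `runner` parameter are illustrative and not part of dbt. Only the `--state` flag and the `manifest.json` file name come from the logic above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _subprocess_runner(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_state(
    remote_dir: Path,
    target_dir: Path,
    dbt_args: Sequence[str] = ("run",),
    runner: Runner = _subprocess_runner,
    enabled: bool = True,
) -> int:
    """Pull the previous manifest, run dbt against it, push the new manifest.

    Assumption: remote_dir is a plain directory standing in for the
    cloud bucket used in the original snippet.
    """
    local_state = target_dir / "state"
    remote_manifest = remote_dir / "manifest.json"
    cmd = ["dbt", *dbt_args]

    # Pre-execution: fetch the previous state only if one exists.
    if enabled and remote_manifest.exists():
        local_state.mkdir(parents=True, exist_ok=True)
        shutil.copy2(remote_manifest, local_state / "manifest.json")
        cmd += ["--state", str(local_state)]

    exit_code = runner(cmd)

    # Post-execution: like the original snippet, push regardless of the
    # exit code, but skip the push if dbt wrote no manifest at all.
    new_manifest = target_dir / "manifest.json"
    if enabled and new_manifest.exists():
        remote_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(new_manifest, remote_dir / "manifest.json")

    return exit_code
```

On the first invocation there is no remote manifest, so dbt runs without `--state`; later invocations pick up whatever the previous one pushed.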

Describe alternatives you've considered

It would be great if those operations could be natively managed by dbt. This would be very close in spirit to Terraform's state feature, and more specifically to terraform_remote_state.

In terms of implementation, it'd probably mean (naive speculation):

add an option into dbt_project.yml to enable the feature and set the state uri
handle syncing the state in a pre-execution phase
persist back the new state on success in a post-execution phase
possibly add a flag to the dbt CLI to temporarily disable the feature
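
To make the first and last points concrete, here is a hedged Python sketch of how a project-level setting and a CLI opt-out could be combined. Everything named here is hypothetical: dbt has no `remote-state` key in dbt_project.yml and no `--no-remote-state` flag; both are invented for illustration, and the project config is represented as an already-parsed dict.

```python
import argparse
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Sequence


@dataclass(frozen=True)
class RemoteState:
    enabled: bool
    uri: Optional[str]


def resolve_remote_state(
    project: Mapping[str, Any], argv: Sequence[str]
) -> RemoteState:
    """Combine the (hypothetical) project setting with a CLI override."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # Hypothetical flag: temporarily disables syncing for one invocation.
    parser.add_argument("--no-remote-state", action="store_true")
    flags, _other_args = parser.parse_known_args(list(argv))

    # Hypothetical config section, e.g.
    #   remote-state: {enabled: true, uri: "cloud://bucket/prefix"}
    section = project.get("remote-state") or {}
    uri = section.get("uri")

    enabled = (
        bool(section.get("enabled", False))
        and uri is not None
        and not flags.no_remote_state
    )
    return RemoteState(enabled=enabled, uri=uri)
```

The CLI flag can only turn the feature off, never on, so the project file remains the single place where a state URI is declared.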

Additional context

This feature request is agnostic to the underlying backend.

Who will this benefit?

Any dbt user who currently uses the state: selector or the deferral features and has to store the point-in-time state files themselves.

Are you interested in contributing this feature?

We can try to help, but let's first check if this makes sense to owners/community!

jtcohen6 · 2021-03-12T15:05:56Z

Thanks for the detailed and thoughtful proposal @ncolomer! I have a few reservations, which I'll share below.

It makes plenty of sense to me that a person or team using dbt's "stateful" features might establish a naming convention for a long-lived state directory that always stores their previous run artifacts. Instead of having to pass --state my-stateful-folder/ as a CLI arg every time, when it's always the same path, I see value in adding a project-level config to dbt_project.yml:

state-paths: ["my-stateful-folder"]

The piece I'm really not sure of is the automated behind-the-scenes syncing and persisting of state with a cloud storage provider.

"This feature request is agnostic to the underlying backend."

Am I right in thinking that dbt would need to store the underlying logic for connecting/authenticating/pulling/pushing to that backend? And that we would need to implement this logic differently for S3, Azure Blob, GCS, HDFS, ...? The link to terraform_remote_state is interesting: integrating with different cloud providers is essential to what Terraform does; offering a straightforward and abstracted approach makes sense. But dbt doesn't know anything about the way it's hosted or deployed in a cloud provider—it only knows about the local file system where it's invoked, and the database/query engine it's connecting to.
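
To illustrate the concern, here is a rough Python sketch of the abstraction that backend-agnostic syncing would seem to imply. The `StateBackend` interface and `LocalDirBackend` class are hypothetical names, not dbt code. Only a local-directory backend is shown; each of S3, Azure Blob, GCS, HDFS and so on would need its own implementation of the same three methods, plus its own authentication.

```python
import shutil
from pathlib import Path
from typing import Protocol


class StateBackend(Protocol):
    """Hypothetical interface every storage backend would implement."""

    def exists(self, name: str) -> bool: ...
    def pull(self, name: str, dest: Path) -> None: ...
    def push(self, src: Path, name: str) -> None: ...


class LocalDirBackend:
    """The simplest case: state kept in a local or network directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def exists(self, name: str) -> bool:
        return (self.root / name).is_file()

    def pull(self, name: str, dest: Path) -> None:
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(self.root / name, dest)

    def push(self, src: Path, name: str) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, self.root / name)


def sync_in(backend: StateBackend, state_dir: Path) -> bool:
    """Fetch manifest.json into state_dir; return False if none exists."""
    if not backend.exists("manifest.json"):
        return False
    backend.pull("manifest.json", state_dir / "manifest.json")
    return True
```

The calling code stays the same across backends, but the per-provider classes are exactly the part dbt would have to own and maintain.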

I do think the right place for this code is in:

The deployment/orchestration tool (like the snippet you included above), which can know about the exact cloud provider and remote storage you're using. This could be Airflow, Dagster, dbt Cloud, Gitlab CI/CD, GitHub Actions, ...
Some lightweight shell scripts, or even text editor plugins, that are shared by your team in local development and wrap around aws/azure-cli/gcloud/etc

From one organization to another, I don't actually think much of this code would be repeated boilerplate. Maybe the logic codifying when dbt ought to pull remote artifacts (at the beginning) vs. push its local artifacts to a remote endpoint (at the end), but even that is potentially controversial. Do you want to push artifacts from invocations that failed to compile? Failed to execute? From every invocation in a multi-step workflow, or just the last step? dbt is an opinionated tool for modular data transformation; I don't see it as having an opinionated approach for file management to the same degree.
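
Those open questions can be written down as an explicit policy, which shows how many independent choices are involved. The Python sketch below is only an illustration: `Outcome`, `PushPolicy` and `should_push` are hypothetical names, and the defaults are one arbitrary set of answers, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPILE_ERROR = "compile_error"
    RUN_ERROR = "run_error"
    SUCCESS = "success"


@dataclass(frozen=True)
class PushPolicy:
    """Each field answers one of the questions above (hypothetical)."""

    push_on_compile_error: bool = False
    push_on_run_error: bool = False
    last_step_only: bool = True


def should_push(
    policy: PushPolicy, outcome: Outcome, step: int, total_steps: int
) -> bool:
    """Decide whether this invocation's artifacts become the new state."""
    if outcome is Outcome.COMPILE_ERROR and not policy.push_on_compile_error:
        return False
    if outcome is Outcome.RUN_ERROR and not policy.push_on_run_error:
        return False
    # Steps are numbered from 1; only the final step pushes when
    # last_step_only is set.
    if policy.last_step_only and step != total_steps:
        return False
    return True
```

Three booleans already give eight possible behaviors, and an orchestrator can pick the one that matches its own workflow.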

It's worth saying that dbt does do some file system operations today. All invocations write logs, and compiled files to a target directory; deps performs git cloning; clean deletes those scratch files. I think it's fair to say that even these basic file operations are some of the trickiest code to maintain in dbt, in terms of debugging and exception handling. A lot of that code needs to be written differently based on the operating system you're using (POSIX vs. Windows), and it still encounters lots of little surprising bugs (e.g. #3035).

Instead, I see a narrow role for dbt in state management: Making its JSON artifacts as rich, usable, and consistent as possible. When and how to write those artifacts feels very much like a matter for dbt—and, for features that require --state, how to read them. For the questions of where, when, and how to move around those artifacts once written, dbt feels less well positioned to furnish good answers.

Of course, I welcome disagreement on any of the above :)

github-actions · 2021-10-14T01:49:02Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

ncolomer added the enhancement and triage labels on Mar 11, 2021

jtcohen6 added the state label and removed the triage label on Mar 12, 2021

jtcohen6 mentioned this issue on May 6, 2021, in "Support for deference to models imported from a package. #3309" (closed)

github-actions bot added the stale Issues that have gone stale label Oct 14, 2021

github-actions bot closed this as completed Oct 21, 2021
