dbt self-managed state #3159

Closed
ncolomer opened this issue Mar 11, 2021 · 2 comments
Labels
enhancement New feature or request stale Issues that have gone stale state Stateful selection (state:modified, defer)


@ncolomer

Describe the feature

Thanks to dbt's state capabilities, one can use dbt as a state machine by making it always refer to its latest state.

The problem is that it's up to the user to manage both the storage of the state files and the logic to provide them back to dbt.

We implemented our own version of this mechanism in our CI/CD layer using the following logic:

remote_state_uri = "cloud://{remote_state_bucket}/{contextual_prefix}"
local_state_path = "{dbt_target_path}/state"
dbt_state_opt    = ""

if remote_state_enabled and exists("{remote_state_uri}/manifest.json"):
  pull(from="{remote_state_uri}/manifest.json", to=local_state_path)
  dbt_state_opt = "--state {local_state_path}"

execute("dbt run {dbt_state_opt} ...")

if remote_state_enabled:
  push(from="{dbt_target_path}/manifest.json", to=remote_state_uri)

Notes:

  • in our case, remote_state_uri points to cloud storage, but it could just as well be a local or network file system
  • contextual_prefix value is computed from contextual information like the environment or the customer
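For readers who want something runnable, the pull / run / push logic above can be sketched in Python. This is only an illustration under stated assumptions: the "remote" is a plain directory (the local or network file system case from the notes), the function name run_with_state and its runner parameter are hypothetical, and the new manifest is pushed only when dbt exits with code 0, which is a choice made for this sketch rather than something the pseudocode specifies.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence


def run_with_state(
    remote_state_dir: Path,
    target_path: Path,
    dbt_args: Sequence[str],
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> int:
    """Pull the previous manifest, run dbt, then push the new manifest."""
    if runner is None:
        # Default: shell out to the dbt CLI and return its exit code.
        runner = lambda cmd: subprocess.run(list(cmd)).returncode

    remote_manifest = remote_state_dir / "manifest.json"
    local_state = target_path / "state"
    cmd = ["dbt", *dbt_args]

    # Pull: only pass --state when a previous manifest actually exists.
    if remote_manifest.exists():
        local_state.mkdir(parents=True, exist_ok=True)
        shutil.copy2(remote_manifest, local_state / "manifest.json")
        cmd += ["--state", str(local_state)]

    exit_code = runner(cmd)

    # Push: persist the manifest dbt just wrote.
    # Assumption of this sketch: only successful runs update the state.
    new_manifest = target_path / "manifest.json"
    if exit_code == 0 and new_manifest.exists():
        remote_state_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(new_manifest, remote_state_dir / "manifest.json")
    return exit_code
```

A call such as run_with_state(Path("/mnt/dbt-state/prod"), Path("target"), ["run"]) would then replace the bare dbt run; the paths here are placeholders. A cloud bucket would need the two copy steps swapped for the provider's own client.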

Describe alternatives you've considered

It would be great if those operations could be natively managed by dbt. This would actually be very close in spirit to Terraform's state feature, and more specifically terraform_remote_state.

In terms of implementation, it'd probably mean (naive speculation):

  • add an option into dbt_project.yml to enable the feature and set the state uri
  • handle syncing the state in a pre-execution phase
  • persist back the new state on success in a post-execution phase
  • possibly add a flag to the dbt CLI to temporarily disable the feature
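To make the four bullets above concrete, here is a rough Python sketch of the pre-execution sync, the post-execution persist on success, and a switch to disable the feature. Nothing in it exists in dbt: StateBackend, DirectoryBackend, and managed_state are hypothetical names, and the only backend shown is a plain directory. A cloud backend would implement the same two methods with its own client.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional, Protocol


class StateBackend(Protocol):
    """Hypothetical interface a storage backend would implement."""

    def pull(self, dest: Path) -> bool: ...
    def push(self, src: Path) -> None: ...


class DirectoryBackend:
    """Stores manifest.json in a local or network directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def pull(self, dest: Path) -> bool:
        src = self.root / "manifest.json"
        if not src.exists():
            return False  # first run: no previous state yet
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest / "manifest.json")
        return True

    def push(self, src: Path) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, self.root / "manifest.json")


@contextmanager
def managed_state(
    backend: StateBackend, target_path: Path, enabled: bool = True
) -> Iterator[Optional[Path]]:
    """Yield the local state directory to use, or None if there is none."""
    if not enabled:
        # The "temporarily disable" flag: no pull and no push.
        yield None
        return
    state_dir = target_path / "state"
    # Pre-execution phase: sync the previous state locally.
    yield state_dir if backend.pull(state_dir) else None
    # Post-execution phase: reached only if the body did not raise.
    new_manifest = target_path / "manifest.json"
    if new_manifest.exists():
        backend.push(new_manifest)
```

The body of the with block is where the dbt invocation would go; if it raises, the stored state is left untouched.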

Additional context

This feature request is agnostic to the underlying backend.

Who will this benefit?

Any dbt user who currently uses the state: selector or the deferral features and has to store the point-in-time state files themselves.

Are you interested in contributing this feature?

We can try to help, but let's first check if this makes sense to owners/community!

@ncolomer ncolomer added enhancement New feature or request triage labels Mar 11, 2021
@jtcohen6 jtcohen6 added state Stateful selection (state:modified, defer) and removed triage labels Mar 12, 2021
@jtcohen6
Contributor

Thanks for the detailed and thoughtful proposal @ncolomer! I have a few reservations, which I'll share below.

It makes plenty of sense to me that a person or team using dbt's "stateful" features might establish a naming convention for a long-lived state directory that always stores their previous run artifacts. Instead of having to pass --state my-stateful-folder/ as a CLI arg every time, when it's always the same path, I see value in adding a project-level config to dbt_project.yml:

state-paths: ["my-stateful-folder"]
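Until something like that config exists, a wrapper script can approximate it. The sketch below is an assumption-heavy illustration, not dbt behavior: state-paths is only the key proposed above, with_default_state is a hypothetical helper, the project config is passed in as an already-parsed dict, and taking the first entry of the list is an arbitrary choice since the proposal doesn't say how several paths would be handled.

```python
from typing import Any, List, Mapping, Sequence


def with_default_state(
    argv: Sequence[str], project_config: Mapping[str, Any]
) -> List[str]:
    """Append --state from the (hypothetical) state-paths key if absent."""
    args = list(argv)
    # An explicit --state on the command line always wins.
    if any(a == "--state" or a.startswith("--state=") for a in args):
        return args
    paths = project_config.get("state-paths") or []
    if not paths:
        return args
    # Assumption: use the first configured path.
    return [*args, "--state", str(paths[0])]
```

For example, with_default_state(["run"], {"state-paths": ["my-stateful-folder"]}) yields the arguments for dbt run --state my-stateful-folder, while an explicit --state passed by the user is left alone.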

The piece I'm really not sure of is the automated behind-the-scenes syncing and persisting of state with a cloud storage provider.

> This feature request is agnostic to the underlying backend.

Am I right in thinking that dbt would need to store the underlying logic for connecting/authenticating/pulling/pushing to that backend? And that we would need to implement this logic differently for S3, Azure Blob, GCS, HDFS, ...? The link to terraform_remote_state is interesting: integrating with different cloud providers is essential to what Terraform does; offering a straightforward and abstracted approach makes sense. But dbt doesn't know anything about the way it's hosted or deployed in a cloud provider—it only knows about the local file system where it's invoked, and the database/query engine it's connecting to.

I do think the right place for this code is in:

  • The deployment/orchestration tool (like the snippet you included above), which can know about the exact cloud provider and remote storage you're using. This could be Airflow, Dagster, dbt Cloud, Gitlab CI/CD, GitHub Actions, ...
  • Some lightweight shell scripts, or even text editor plugins, that are shared by your team in local development and wrap around aws/azure-cli/gcloud/etc

From one organization to another, I don't actually think much of this code would be repeated boilerplate. Maybe the logic codifying when dbt ought to pull remote artifacts (at the beginning) vs. push its local artifacts to a remote endpoint (at the end), but even that is potentially controversial. Do you want to push artifacts from invocations that failed to compile? Failed to execute? From every invocation in a multi-step workflow, or just the last step? dbt is an opinionated tool for modular data transformation; I don't see it as having an opinionated approach for file management to the same degree.
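One way to see why those questions have no single answer is to write the decision down as an explicit policy. The sketch below is purely illustrative: Outcome and PushPolicy are hypothetical names, and the defaults (push only after a successful final step) are one team's possible choice, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPILE_FAILED = "compile_failed"
    RUN_FAILED = "run_failed"
    SUCCESS = "success"


@dataclass(frozen=True)
class PushPolicy:
    """Each field answers one of the questions raised above."""

    push_on_compile_failure: bool = False
    push_on_run_failure: bool = False
    last_step_only: bool = True

    def should_push(self, outcome: Outcome, step: int, total_steps: int) -> bool:
        # In a multi-step workflow, optionally wait for the final step.
        if self.last_step_only and step != total_steps:
            return False
        if outcome is Outcome.COMPILE_FAILED:
            return self.push_on_compile_failure
        if outcome is Outcome.RUN_FAILED:
            return self.push_on_run_failure
        return True
```

Another organization could just as reasonably construct PushPolicy(push_on_run_failure=True, last_step_only=False), which is the point: the knobs belong to whoever owns the deployment.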

It's worth saying that dbt does do some file system operations today. All invocations write logs and compiled files to a target directory; deps performs git cloning; clean deletes those scratch files. I think it's fair to say that even these basic file operations are some of the trickiest code to maintain in dbt, in terms of debugging and exception handling. A lot of that code needs to be written differently based on the operating system you're using (POSIX vs. Windows), and it still encounters lots of surprising little bugs (e.g. #3035).

Instead, I see a narrow role for dbt in state management: Making its JSON artifacts as rich, usable, and consistent as possible. When and how to write those artifacts feels very much like a matter for dbt—and, for features that require --state, how to read them. For the questions of where, when, and how to move around those artifacts once written, dbt feels less well positioned to furnish good answers.

Of course, I welcome disagreement on any of the above :)

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Oct 14, 2021