dbt self-managed state #3159
Thanks for the detailed and thoughtful proposal @ncolomer! I have a few reservations, which I'll share below. It makes plenty of sense to me that a person or team using dbt's "stateful" features might establish a naming convention for a long-lived state directory that always stores their previous run artifacts, instead of having to pass state-paths: ["my-stateful-folder"] on every invocation. The piece I'm really not sure of is the automated, behind-the-scenes syncing and persisting of state with a cloud storage provider.
Am I right in thinking that dbt would need to store the underlying logic for connecting, authenticating, pulling, and pushing to that backend? And that we would need to implement this logic differently for S3, Azure Blob, GCS, HDFS, ...? I do think the right place for this code is in: […]
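To make that concern concrete, here is a sketch of the kind of per-backend dispatch dbt would have to own and maintain. Everything here is illustrative, not an actual or proposed dbt API; each backend would in practice pull in its own dependency and authentication story.

```python
from urllib.parse import urlparse

# Illustrative only: dbt has no such module today. Each scheme would need its
# own client library (boto3, azure-storage-blob, google-cloud-storage, ...)
# and credentials handling; the strings below stand in for that code.

def choose_backend(remote_state_uri: str) -> str:
    """Pick a (placeholder) backend implementation based on the URI scheme."""
    scheme = urlparse(remote_state_uri).scheme
    handlers = {
        "s3": "pull via boto3",
        "gs": "pull via google-cloud-storage",
        "azure": "pull via azure-storage-blob",
        "hdfs": "pull via a WebHDFS client",
        "": "copy from a local/network file system",
    }
    try:
        return handlers[scheme]
    except KeyError:
        raise ValueError(f"no state backend registered for scheme {scheme!r}")

print(choose_backend("s3://my-bucket/dbt-state"))
# pull via boto3
```

Even this toy dispatcher shows the maintenance surface: every new backend means a new branch, a new dependency, and new failure modes inside dbt itself.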
From one organization to another, I don't actually think much of this code would be repeated boilerplate. Maybe the logic codifying when dbt ought to pull remote artifacts (at the beginning) vs. push its local artifacts to a remote endpoint (at the end), but even that is potentially controversial. Do you want to push artifacts from invocations that failed to compile? Failed to execute? From every invocation in a multi-step workflow, or just the last step? dbt is an opinionated tool for modular data transformation; I don't see it as having an equally opinionated approach to file management.
It's worth saying that dbt does do some file system operations today: all invocations write logs and compiled files to a target directory. Instead, I see a narrow role for dbt in state management: making its JSON artifacts as rich, usable, and consistent as possible. When and how to write those artifacts feels very much like a matter for dbt, and, for features that require […]
Of course, I welcome disagreement on any of the above :)
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Describe the feature
Thanks to dbt's state capabilities, one can use dbt as a state machine by making it always refer to its latest state.
The problem is that it's up to the user to manage both the storage of the state files and the logic to provide them back to dbt.
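Today that wiring is entirely manual. As a sketch (the state directory name is illustrative; --state, --defer, and the state:modified selector method are existing dbt CLI features), the user has to assemble an invocation like:

```python
def dbt_stateful_command(state_dir: str, selector: str = "state:modified+") -> list[str]:
    """Build a dbt invocation that compares against previously stored artifacts.

    The caller is responsible for having placed the previous run's artifacts
    (manifest.json, etc.) into `state_dir` beforehand -- dbt does not fetch them.
    """
    return ["dbt", "run", "--select", selector, "--defer", "--state", state_dir]

print(" ".join(dbt_stateful_command("prod-artifacts")))
# dbt run --select state:modified+ --defer --state prod-artifacts
```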
We implemented our own version of this mechanism in our CI/CD layer using the following logic:
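The original step list did not survive extraction. As a hedged reconstruction, a pull/run/push flow of the kind described might look like the sketch below; the use of aws s3 sync is an assumption (any copy tool would do), and remote_state_uri and contextual_prefix are the names used in the notes that follow.

```python
import subprocess

def plan_stateful_run(remote_state_uri: str, contextual_prefix: str,
                      local_state_dir: str = "previous-state") -> list[list[str]]:
    """Return the commands a CI/CD job would run, in order (reconstruction)."""
    remote = f"{remote_state_uri.rstrip('/')}/{contextual_prefix}"
    return [
        # 1. pull the previous run's artifacts from the remote location
        ["aws", "s3", "sync", remote, local_state_dir],
        # 2. run dbt, deferring to that state
        ["dbt", "run", "--select", "state:modified+", "--defer", "--state", local_state_dir],
        # 3. push the freshly produced artifacts back for the next run
        ["aws", "s3", "sync", "target", remote],
    ]

def run_plan(plan: list[list[str]]) -> None:
    for cmd in plan:
        subprocess.run(cmd, check=True)  # fail fast on any step
```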
Notes:
- remote_state_uri points to cloud storage, but it could also be a local or network file system
- the contextual_prefix value is computed from contextual information like the environment or the customer
Describe alternatives you've considered
It would be great if those operations could be natively managed by dbt. This would actually be very close in spirit to Terraform's state feature, and more specifically terraform_remote_state.
In terms of implementation, it'd probably mean (naive speculation):
- a setting in dbt_project.yml to enable the feature and set the state uri
- a flag/option on the dbt cli to temporarily disable the feature
Additional context
This feature request is agnostic to the underlying backend.
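To illustrate, a hypothetical resolver for the proposed setting could stay backend-agnostic by only interpreting a URI. The remote-state key, its sub-keys, and the CLI override are all invented for this sketch and do not exist in dbt.

```python
def resolve_remote_state(project_config: dict, cli_disable: bool = False):
    """Return the remote state URI if the feature is enabled, else None.

    `remote-state` / `enabled` / `uri` are hypothetical dbt_project.yml keys;
    `cli_disable` stands in for a hypothetical flag like --no-remote-state.
    The URI scheme (s3://, gs://, a plain path, ...) is opaque at this layer,
    which is what keeps the feature backend-agnostic.
    """
    cfg = project_config.get("remote-state", {})
    if cli_disable or not cfg.get("enabled", False):
        return None
    return cfg.get("uri")

config = {"remote-state": {"enabled": True, "uri": "s3://my-bucket/dbt-state/prod"}}
print(resolve_remote_state(config))                    # s3://my-bucket/dbt-state/prod
print(resolve_remote_state(config, cli_disable=True))  # None
```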
Who will this benefit?
Any dbt user who currently uses the state-related selector (state:) or deferral features and manages storage of the point-in-time state files themselves.
Are you interested in contributing this feature?
We can try to help, but let's first check if this makes sense to owners/community!