
Allow configurable include_policy for snowflake #1603

Closed

alanmcruickshank opened this issue Jul 11, 2019 · 15 comments

@alanmcruickshank
Contributor

Feature

Feature description

Copied from a discussion between @alanmcruickshank and @drewbanin on slack

Question

We've got an issue with deploying our dbt models in Snowflake. In short, we build in one database (a staging database) and then atomically swap it with the live database (ALTER DATABASE ... SWAP WITH). Because the swap is atomic, users never see an interruption in the things they're querying.

The problem we have is that the SWAP WITH command is not swapping everything tidily. All the tables (materialised views) swap fine (👍 🙂) but the views still refer to the staging database (👎 😞). I thought this was a problem with Snowflake so I've been talking to their support, but I think it might actually be about how dbt executes build commands on Snowflake.

Example 1 (of one way of doing what dbt is doing, and getting the result I was hoping for)

create database test_d1;
create schema test_d1.s1;
create database test_d2;
create schema test_d2.s1;
create table test_d1.s1.t1 AS SELECT column1 as a, column2 as b FROM VALUES (1, 2), (3, 4);
create table test_d2.s1.t1 AS SELECT column1 as a, column2 as b FROM VALUES (5, 6), (7, 8);
use database test_d1;
create view s1.v1 AS SELECT * FROM s1.t1;
use database test_d2;
create view s1.v1 AS SELECT * FROM s1.t1;
select * from test_d1.s1.v1; -- Returns 1,2,3,4
ALTER DATABASE test_d1 SWAP WITH test_d2;
select * from test_d1.s1.v1; -- Returns 5,6,7,8 (WHICH IS GOOD!)
select * from test_d1.s1.t1; -- Returns 5,6,7,8
drop database test_d1;
drop database test_d2;

Example 2 (of the other way of doing what dbt is doing, and getting the result I'm actually getting)

create database test_d1;
create schema test_d1.s1;
create database test_d2;
create schema test_d2.s1;
create table test_d1.s1.t1 AS SELECT column1 as a, column2 as b FROM VALUES (1, 2), (3, 4);
create table test_d2.s1.t1 AS SELECT column1 as a, column2 as b FROM VALUES (5, 6), (7, 8);
create view test_d1.s1.v1 AS SELECT * FROM test_d1.s1.t1;
create view test_d2.s1.v1 AS SELECT * FROM test_d2.s1.t1;
select * from test_d1.s1.v1; -- Returns 1,2,3,4
ALTER DATABASE test_d1 SWAP WITH test_d2;
select * from test_d1.s1.v1; -- Returns 1,2,3,4 (WHICH IS BAD!)
select * from test_d1.s1.t1; -- Returns 5,6,7,8
drop database test_d1;
drop database test_d2;

From the results I'm getting it looks like dbt is doing the latter, which makes total sense if you're potentially deploying to multiple databases and want to be specific. It makes life really hard, though, in the deployment pattern we're using, where we'd actively like it not to specify the database at build time.

I had a look at the source, and I wonder if this could be an option for the snowflake connector. I got lost in the source code around include_policy, which looks like it might be relevant.

Has anyone else solved this problem? ... or if not, @drewbanin, I'm happy to do some leg work to get this feature in, but I might need some pointers on exactly how the object names are resolved.

Suggestion

You’re totally right to identify that

  1. dbt is interpolating database names when it references views
  2. the include_policy is probably the right place to look in the codebase

Somebody, somewhere, asked about this exact use case very recently but I can’t seem to find it 😞 Will follow up here when I do!

We made a change to dbt in the 0.13.x releases which began interpolating database names when models were referenced. This is indeed controlled by the include_policy, and is intended to make the destination database for models configurable. So, in recent versions of dbt, you should be able to run:

{{ config(database='some_special_database') }}
select ..

This will configure the model to be built into the specified database. Further, models that reference this model will look for it in the right place!

I’d like to continue to support this use case, but I also totally buy the merits of the use case that you’re describing. So, maybe instead of removing the database from the include_policy, we should expose a configuration similar to what we do for quoting. Unsure if you’re familiar, but the quoting: config in dbt_project.yml overrides the default quote_policy for relations. I think that by setting include_policy.database to False, dbt should do pretty much exactly what you need here. There might be tiny other changes to make, but this is going to be the place to start I think!
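
For reference, the existing quoting config looks like this in dbt_project.yml; an analogous (currently hypothetical) include_policy block might sit right alongside it:

# dbt_project.yml
# existing config: overrides the default quote_policy for relations
quoting:
  database: false
  schema: true
  identifier: true

# hypothetical analogue sketched here -- not an existing dbt config
include_policy:
  database: false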

If you don’t mind creating an issue, I’d be happy to follow up with some pointers to the relevant lines of code 🙂

Who will this benefit?

This will benefit anyone doing blue/green style in snowflake. In particular people with large deployments with a mix of tables and views.

@alanmcruickshank
Contributor Author

@drewbanin - I'm happy to have a crack at this, I've been meaning to contribute for a while. Is it going to be easier to do this specifically for snowflake, or something common across adapters?

In either case, can you point me at where the meat of the configurable quote_policy logic is done, and I'll try to copy that as part of configurable include_policy.

@drewbanin
Contributor

Cool @alanmcruickshank - that sounds awesome! I started digging into the code here -- it certainly does look like you'll want to change the include_policy handling to work like the quote_policy. With that said... this is a lot less self-contained than I thought it would be :/

I'm afraid I don't even have a good place to point you -- this logic is currently baked way into a lot of different places in the dbt codebase. Happy to investigate, but also wanted to give you something helpful before the end of the week:

One terrible, but possibly interesting approach could be to make your own wrapper around ref()? The ref function returns a Relation which points to another model in the graph. Just as an example, you can do something like:

{% macro relative_ref(model_name) %}
    {% set rel = ref(model_name) %}
    {% if rel is string %}
        {# This is an ephemeral model #}
        {% do return(rel) %}
    {% else %}
        {% set new_rel = rel.include(database=False) %}
        {% do return(new_rel) %}
    {% endif %}
{% endmacro %}

The following example shows that relative_ref throws away the database name when interpolating a ref:

select * from {{ relative_ref('stg_snowplow_events') }}
union all
select * from {{ ref('stg_snowplow_events') }}

-- compiles to:

select * from analytics.stg_snowplow_events
union all
select * from analytics.analytics.stg_snowplow_events

So, I really want to be clear that this isn't a good idea, but it's one idea that came to mind in thinking about how this could work.


In playing around with this code, I recognized a distinction that I wasn't considering before. The quoting: config in dbt_project.yml controls:

  1. how the relation is quoted when it is created
  2. how the relation is quoted when it is referenced

In this case, it doesn't sound like there's any problem including the database name when the relations are created, right? create table database.schema.table as (...) should be fine? Including the database name here makes it possible for dbt to render relations into different logical databases, which is a pretty neat feature worth preserving IMO!

Your use-case requires that the database is excluded when models are referenced. Since all model references go through the ref function (sources shouldn't be relevant here, right?) maybe there's an opportunity for a simple implementation which just makes the ref function more natively hackable.

I can think of other types of cool things we could do here, like hacking refs to point to production datasets for certain types of models... or similar. I'm not sure that's a good idea, but it's a new thought for me and one worth exploring! Let me know what you think :)

@alanmcruickshank
Contributor Author

@drewbanin - I love the ideas!

I'd love to have confidence that whatever I actually do is having the effect that I'd like. I notice that there are no unit tests in place that directly test the ref function. I assume there's no objection to me making some?

I'm assuming they'd go in the unit tests folder in their own new file (probably called test_reference.py). If you reckon they should actually be part of one of the other existing test files then just shout. :)

@drewbanin
Contributor

I think that's reasonable! What are you actually planning on implementing? I wrote this up on a Friday afternoon, so I'd also love to get @beckjake's thoughts on the topic too :)

@beckjake
Contributor

I am always in favor of more unit tests! We do test it pretty heavily implicitly via our integration suite, but that's never enough.

My instinct is that the right approach here is to alter how Relation objects handle their include policies and make them match the quote policies. You could probably get away with not including it in the dbt.context.common.RelationProxy code that currently handles quotes, but it's probably going to be easiest to validate by just perfectly mirroring quote_policy behavior.

The big problem is that we don't use {{ this }} like I'd imagine in materializations - that would be cool, because then you could just override dbt.context.common.get_this_relation. Instead, most materializations construct a new Relation via the DBWrapper (so via the RelationProxy...) and use that. But they also use that to reference things that could potentially be in a different database. So you have to be all clever and do something like this in RelationProxy:

class RelationProxy:
    def __init__(self, adapter, model):
        self.adapter = adapter
        self.model = model

    def create(self, *args, **kwargs):
        # merge the project-level quoting config into any per-call policy
        kwargs['quote_policy'] = dbt.utils.merge(
            self.quoting_config,
            kwargs.pop('quote_policy', {})
        )
        result = self.relation_type.create(*args, **kwargs)
        if result.identifier == self.model.alias:
            # handle adapters whose defaults set include_policy['database'] to False
            result = result.include(**result.DEFAULTS['include_policy'])
        return result

Which, ugh, I mean it'll work but you also have to pass model in to the DBWrapper, and that sure feels like a lot of special-ness!

@alanmcruickshank
Contributor Author

@beckjake @drewbanin

What are you actually planning on implementing?

Great question.

I've got two trains of thought, and I think I prefer the first one over the second one - but open to implementing the other (or both while I'm at it if you prefer).

  1. Inspired by the relative_ref idea, I'd love to implement that, in particular where the level
    of referencing depends on the location of both models. If they're in the same schema, then
    only the identifier is referenced; if they're in different schemas but the same database, then
    the schema and identifier are referenced but not the database. The feature is turned off/on by:
    • Project level config in dbt_project.yml:
      references:
          relative: True
      
    • Schema level config in dbt_project.yml:
      models:
        my_project:
          reporting:
            schema: reporting
            materialized: view
            relative_ref: True
      
    • Model level config:
      {{ config(relative_ref = True) }}
      ...
      
    • Relative ref overrides on particular refs {{ ref('source_model', relative=True) }}
  2. More similar to the quote_policy, we include a configurable dict in the same places
    as the quote policy to configure the includes. These would only apply when rendering the
    SQL.

I'm leaning much more toward the first idea - what do you guys think? Not sure if all the config levels are necessary. Maybe just the schema level config, with a ref override is enough?

@beckjake
Contributor

I am amenable to the idea of a relative_ref config. It's nice and explicit about what's going on, which is always a good thing. The hard part is that you'll have to implement it in the providers, and probably the ParserUtils ref behavior - that code is a little finicky and prone to breaking integration tests in frustrating ways.

I definitely would rather not have it be available in the root of the project config. If you look at how dbt handles quoting, pushing that stuff up from the base project level is pretty convoluted and painful and involves changing a lot. But it's really easy to do it from the schema level, I think you'd just add it to the SourceConfig class and change dbt.context.runtime.RefResolver (or something like that) to check the config value.

I think ref(..., relative=True) would be a reasonable alternative, but I wouldn't suggest doing both because then it makes it hard to decide who "wins" if the ref call says relative=True but the model says relative_ref=False. Without having spent any time on implementation, I think it's a bit trickier than the config setting. Unless you mean that the config value would change how all ref calls TO the node work? That seems tricky!

So when choosing between those two, the question really is: who knows that the ref should be relative to the call, the referrer or the referent? I think it's probably the referrer, but I would defer to your expertise/needs on this!

@alanmcruickshank
Contributor Author

I want to try and get a rendering context so that I can test that rendering in the unit tests. The function I think I want to test is dbt.compilation.Compiler.compile_node, but I think to get at that I need a Compiler, which in turn requires a Profile and that requires credentials. That starts feeling a lot like an integration test.

@beckjake - is there a way of getting a rendering context so that I can test rendering without requiring an actual database connection (or a way of mocking enough things to achieve the same result)? Or am I barking up the wrong tree and should I just resign myself to having to test this as an integration test?

@beckjake
Contributor

I don't know if there's a good way; this is why we end up with a lot of integration tests, unfortunately. You could use mock.MagicMock for your config object and try to set a lot of .side_effect and .return_value attributes, but it gets pretty tedious.

A better path, though still a lot of work, might be to use mock.patch() to patch out dbt.context.runtime.generate and set the .return_value attribute on the patched mock to a dictionary that matches the context you'd like to make, probably with a lot of mock.MagicMocks in it.
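
As a rough sketch of that second approach (the patch target and the context keys below are assumptions about dbt's internals at the time, not a confirmed recipe):

import unittest.mock as mock

# Patch the context factory so compile_node never needs a real database
# connection; the context dict below is deliberately minimal.
with mock.patch('dbt.context.runtime.generate') as mock_generate:
    mock_generate.return_value = {
        'ref': mock.MagicMock(return_value='analytics.my_model'),
        'config': mock.MagicMock(),
    }
    # ... build the Compiler with a mocked config and call compile_node(...)
    # here, then assert on the compiled SQL.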

But also, if that seems hard or you get stuck, don't feel bad at all about making it an integration test.

@alanmcruickshank
Contributor Author

Hey guys - I've finally got the compiler into a test harness so that I can make sure I don't break anything in the process. 😃

I've also got some more detail on the plan of what I'll implement.

On the schema config, there will be a config option called relative-ref which is boolean and defaults to False (so that default behaviour is the same). If set to true, any models in this schema will only qualify names as far as they need to: if two models are in the same schema, only the identifier will be specified; if they're in the same database but different schemas, then only the schema and the identifier will be specified.

Within each model, the ref function itself is also configurable. A setting here will always override any other config, the idea being that this can make case-by-case exceptions to the broader rule for that schema. That would look like {{ ref('some_model', relative=True) }}, and introduces the idea of keyword arguments to the ref function. This could also be used to explicitly not be relative, even if the schema is set to default to being relative.
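
To illustrate the proposal (the model and schema names here are hypothetical), a relative ref between two models in the same database would compile like this:

-- referrer lives in analytics.reporting; stg_events lives in analytics.staging
select * from {{ ref('stg_events', relative=True) }}

-- proposed compiled output:
select * from staging.stg_events

-- whereas a plain ref compiles to:
select * from analytics.staging.stg_events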

This solves my use case, but is it worth having something more controllable? My original suggestion had the idea of being able to specify an include_policy just like a quoting_policy, but I'm not sure I need to anymore. Can you guys think of any other use cases where that might be required?

@alanmcruickshank
Contributor Author

alanmcruickshank commented Jul 24, 2019

Continuing the discussion from #1624, which is now just the unit tests.

@drewbanin - I've taken a look at the macro framework, and I've hit a roadblock - which I'm not sure how to get past.

Patching ref using a macro is easy...

{% macro ref(model) %}
    {{ log("Patched Ref!", True) }}
    {{ ref(model) }}
{% endmacro %}

But that inner ref function ends up calling the macro itself again, and we get never-ending recursion. We could fix this problem either by:

  1. Introducing the underlying ref function with the name _ref and then always having a default ref macro which could be overwritten by plugins, and in normal circumstances just redirects to _ref.
  2. Using some kind of namespacing, so that within the new ref macro, we could qualify the name of the underlying ref function within the macro so that it knows that it's not referencing itself. That would mean there isn't always a ref macro, but one could be introduced in a plugin if people wanted to.

Both feel a bit hacky; I'm leaning slightly toward the first one because I don't know how we'd achieve the latter. Any ideas?

@drewbanin
Contributor

Yeah - I think we'll need to preserve the dbt builtin macros in the context. I think this is as easy as adding a line like:

context['builtins'] = context

somewhere around here

So, that solves the masking issue - the ref macro you showed here could use builtins.ref directly.
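
So the recursive macro above could become something like this (a sketch, assuming builtins is wired up as described):

{% macro ref(model) %}
    {{ log("Patched Ref!", True) }}
    {% do return(builtins.ref(model)) %}
{% endmacro %}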

The bigger problem exists around packages: The macros that a package can access are limited to:

  1. the macros in that package
  2. the macros in packages which that package imports

So, we'd in fact need to make ref (and probably source) somehow special. This would be similar to our implementation of generate_schema_name(). If there is a generate_schema_name macro present in the root project, then dbt will use that macro to generate the schema name for all models in the project (regardless of the package which provides them).

I'm not exactly certain where that code should live (in context/common.py? somewhere else?) but this feels like a reasonable approach to me.

It's conceivable that users wouldn't actually implement ref, but instead would implement something like process_ref which accepts arguments including (but not limited to) the nodes for the referrer and the referent. We could provide a default implementation for ref (as we do for generate_schema_name) which users can then override.
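
The default implementation could then be close to a pass-through; a sketch (process_ref and its argument list are hypothetical here):

{% macro process_ref(relation, referrer_node, referent_node) %}
    {# default behaviour: hand the Relation back unchanged #}
    {% do return(relation) %}
{% endmacro %}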

This feels kind of roundabout, but I think it's a more general solution to this class of problem. We have some other issues like #1612 and #1592 which could also be addressable using the approach outlined here.

You buy this?

@alanmcruickshank
Contributor Author

alanmcruickshank commented Jul 24, 2019

Yep I buy this.

I think you put it quite nicely in the details for #1592.

should we encourage users to override the source macro? While flexible, sounds like it could lead to really subtle and hard to identify bugs!

I'm totally bought into being able to extend the ref and source functions using macros, and being able to implement plugins to override this functionality. I think we need to avoid the hard-to-debug issue, though, and perhaps also the problem which could arise if people want to have two different smart ways of processing the ref function.

I wonder if there's a different way of looking at this problem? Rather than override the ref function, we actually just want something which mutates the output of the ref function, potentially in multiple stages.

Let's take two of the use cases we've been discussing. 1) relative refs, and 2) the functionality from #1612 where we might point to a different root schema in development.

If we wanted to do both, and we just overrode the ref function then only one would work and torpedo the other (perhaps in an unpredictable way too! 😄 ). What we'd do in python is to chain functions. e.g.

actual_ref = dev_references(relative_references(ref))

In specifying it like that, we're clear that first we apply the relative ref, and then override that with the dev references if appropriate. We are explicit about overrides. I think there could be a macro equivalent of this...
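
A macro equivalent might look like this sketch (the helper macro names are hypothetical, and builtins.ref assumes the masking fix discussed above):

{% macro ref(model_name) %}
    {# apply the relative logic first, then let dev overrides win #}
    {% do return(dev_references(relative_references(builtins.ref(model_name)))) %}
{% endmacro %}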

That also fits nicely with your ideas, I think: a package or plugin cannot itself override ref, but it could provide helper functions which the user could then incorporate into their own root ref macro.

@drewbanin
Contributor

Sure, I think that overriding the ref method and adding new methods which post-process its output are both good ways to accomplish this goal!

Unsure about the macro chaining solution though: I don't quite understand how dev_references and relative_references would play off of each other here. Certainly one would need to "win", right? If these operations were more additive in nature, then I think this approach would be a good one (see #1096). As I understand it, the relative_references approach would truncate the database (+schema?) from the Relation, whereas the dev_references approach would conditionally supply production database/schema names in the Relation!

So, I think one macro is probably appropriate here -- users can always make multiple helper macros, then chain them together themselves as appropriate.

Let's not try to tackle all of these things at once: I'm happy to start by solving the problem stated here (make it possible to render relation names without a database/schema). We can build on top of this in future iterations for sure.

I think a good way to proceed is:

  1. Add builtins to the context (including at least source and ref, but probably the entire context itself?)
  2. Update the _add_macros method to treat ref (and source?) as "override" macros (as opposed to global/local). In that method, if package_name == config.project_name (you'll need to pass in the config argument) then the macro is supplied in the root project. We'll want to clobber the macros in the context with this "override" macro definition.
  3. Supply a ref macro which implements the "relative" behavior described here (see the sketch just below)
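
A minimal sketch of that step-3 macro, assuming the builtins from step 1, the Relation.include API shown earlier, and the standard target context variable:

{% macro ref(model_name) %}
    {% set rel = builtins.ref(model_name) %}
    {% if rel is string %}
        {# ephemeral models resolve to strings; pass them through #}
        {% do return(rel) %}
    {% elif rel.database == target.database %}
        {# same database: drop the database qualifier #}
        {% do return(rel.include(database=False)) %}
    {% else %}
        {% do return(rel) %}
    {% endif %}
{% endmacro %}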

I just did barely enough work here to assess the feasibility of this plan, so please let me know if I missed anything glaring!

@drewbanin
Contributor

Closed by #2028.
