Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding query-diff cli command #6514

Merged
merged 36 commits into from
Aug 23, 2023
Merged

adding query-diff cli command #6514

merged 36 commits into from
Aug 23, 2023

Conversation

jycor
Copy link
Contributor

@jycor jycor commented Aug 16, 2023

@macneale4
Copy link
Contributor

This seems like a lot of heavy lifting on the CLI side. Is there a reason we aren't doing the work in a table function? Then it would be available to SQL only users. Adding a CLI wrapper around it would be easy.

@jycor
Copy link
Contributor Author

jycor commented Aug 16, 2023

I agree, I'm currently exploring ways to implement this as a table function or a stored procedure.
One challenge I'm facing is how to parse, analyze, and execute the queries from within a table function/stored procedure.

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

James and I talked about this yesterday, I think this is fine I just am worried about how limited it is in the broad search space of query diff. As is this will only return results for two queries that are identical and have the same underlying table, but different data versions. So like a slightly expanded form of diff that lets us do aggregation and join diffs.

Some of the other things in the space of query diff:

  • queries are the same, schemas are different
  • queries are the same, but aggregation or join precludes using dolt diff
  • queries are different (schemas are different is arguably a subset of this)

It is almost like we are introducing a third diff dimension: schema->query plan->data. We have to diff the two query plans after diffing the schema in the general case to determine whether there are smarter ways to diff the data. Even if the queries are the same, we still have to walk the AST or plan to see whether materializing nodes prevent using dolt diff on underlying tables.

And re: bisection/temporary table materialization, both of those require reading the entire tables, which makes them more expensive than the naive algo.

Other thoughts:

  • None of the current logic is really Dolt-specific. We could put all of this into GMS, have the schema comparison only test column shape+types (ignore table name), and have this be a generic query differ.
  • The Dolt side could re-use our schema merge logic to go through the left/right schemas, match columns on name/type or tag, and then interpolate new query (strings or nodes) that will be schema-comparable. This would let us finesse queries to handle schema diff, with the generic query_diff living and testing independently in GMS.

I am not against merging what we have. It is just unclear to me what customers expect from this feature.

return nil, nil, fmt.Errorf("query must be a string, not %T", q)
}
qStr = strings.TrimSpace(qStr)
if !strings.HasPrefix(strings.ToLower(qStr), "select") { // TODO: allow "with?"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could parse.Parse, see whether the AST is a SELECT

Comment on lines +139 to +142
if !expr.Resolved() {
return nil, ErrInvalidNonLiteralArgument.New(tf.Name(), expr.String())
}
// prepared statements resolve functions beforehand, so above check fails
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this new or copied? I think expressions are going to be resolved by default now, even if not literal

Comment on lines +297 to +299
if !tf.Resolved() {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this should be possible?

if !tf.Resolved() {
return nil
}
if tf.sqlSch == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also seems like we should have errored upstream to reach this

Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a resurrection of the old query diff in a SQL context. We'll try to review options with people when you get back, or when a customer asks us for more.

@max-hoffman max-hoffman merged commit 89d79f7 into main Aug 23, 2023
@max-hoffman max-hoffman deleted the james/diff branch August 23, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants