
Add S3ToRedshift example dag and system test #8877

Merged
merged 1 commit into apache:master from test/s3_to_redshift_system on Jun 10, 2020

Conversation

@feluelle (Member) commented May 15, 2020

  • add howto docs for S3ToRedshift example dag
  • add terraform which runs terraform CLI commands in an isolated docker container

NOTE: This system test uses terraform to provide the infrastructure needed to run this example dag.


Make sure to mark the boxes below before creating the PR:

  • Description above provides context for the change
  • Unit test coverage for changes (not needed for documentation changes)
  • Target GitHub issue referenced in the description, if one exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated, including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of a fundamental code change, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards-incompatible changes, please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@feluelle requested review from ashb and potiuk on May 15, 2020 16:01
@boring-cyborg (bot) added the provider:amazon (AWS/Amazon-related issues) label on May 15, 2020
@feluelle force-pushed the test/s3_to_redshift_system branch from cdbdff7 to 3abe736 on May 16, 2020 17:17
@feluelle changed the title from "[WIP] Add S3ToRedshift example dag and system test" to "Add S3ToRedshift example dag and system test" on May 16, 2020
@feluelle force-pushed the test/s3_to_redshift_system branch from 3abe736 to 1e977e1 on May 16, 2020 17:23

@feluelle (Member Author) commented:

Ready for review. PTAL @ashb @potiuk

Maybe @xinbinhuang @mustafagok @baolsen if you have time :)


# [START howto_operator_s3_to_redshift_env_variables]
S3_BUCKET = getenv("S3_BUCKET", "bucket")
S3_KEY = getenv("S3_KEY", "key")

Member:

Why are we relying on env vars, rather than Airflow Variables + templates? {{ var.s3_key }}

Member Author (@feluelle), May 16, 2020:

We have this in all of our system tests, so that we can easily pass values to the example DAGs. See /~https://github.com/apache/airflow/blob/master/TESTING.rst#environment-for-system-tests

Member Author (@feluelle):

I think it makes sense to switch to Airflow Variables + templates, like Ash said, and set them via environment variables: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#storing-variables-in-environment-variables.

So we only need to prefix all env vars with AIRFLOW_VAR_.
For example: set AIRFLOW_VAR_S3_KEY=key in variables.env and access it via {{ var.s3_key }} here. WDYT @potiuk @turbaszek?
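
A rough sketch of the mechanism being proposed, assuming the standard AIRFLOW_VAR_ prefix and Jinja's var.value accessor; the key names are only illustrative:

import os

from airflow.models import Variable

# With AIRFLOW_VAR_S3_KEY exported (e.g. from variables.env), Airflow resolves the
# Variable "s3_key" from the environment (per the docs linked above) instead of
# querying the metadata database.
os.environ["AIRFLOW_VAR_S3_KEY"] = "key"

print(Variable.get("s3_key", default_var="key"))  # -> "key"

# In a templated operator field the same value is referenced with Jinja as:
templated_s3_key = "{{ var.value.s3_key }}"

Variable.get also accepts a default_var argument, which is one way to handle the question about defaults raised below.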

Member Author (@feluelle):

But how do we then define defaults in case these variables are not set? Or should we make all of them required then?

Member:

I think this is fine, and when we automate it we should define the ENV variables in dedicated files.

Using Airflow Variables in example DAGs is a very bad idea. People use these as, well, examples, and using Airflow Variables in DAGs this way is very bad from a database-load point of view. Until we avoid having every component parse the DAGs frequently, we should not use Variables like this (we even have a best practice about it): every time the DAG is parsed, database communication occurs and an extra connection is made. That is quite bad for scalability and not obvious at all, so if people take the examples literally (which they have in the past), they will implement bad practices.
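
For reference, a small sketch of the parse-time pattern that this best-practices concern is about, next to what the example DAG currently does; the key names are illustrative:

from os import getenv

from airflow.models import Variable

# Discouraged in example DAGs: a top-level Variable.get() runs every time the DAG file
# is parsed, which in general means a metadata-database round trip per parse.
S3_KEY_FROM_VARIABLE = Variable.get("s3_key", default_var="key")

# What the example DAG does instead: read a plain environment variable at parse time,
# which is cheap and needs no database connection.
S3_KEY = getenv("S3_KEY", "key")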

@feluelle force-pushed the test/s3_to_redshift_system branch 2 times, most recently from 398fa82 to ad3a6fc on May 18, 2020 13:22
@feluelle force-pushed the test/s3_to_redshift_system branch 2 times, most recently from 138645f to 4fd0a69 on June 5, 2020 18:10

@feluelle (Member Author) left a comment:

First try (this works 🎉) - I switched to terraform for providing the infrastructure. WDYT? @potiuk @ashb @mik-laj @turbaszek @kaxil

Resolved review threads (outdated): Dockerfile.ci, scripts/ci/docker-compose/local-prod.yml, scripts/ci/docker-compose/local.yml
Comment on lines 21 to 34
class Terraform(SystemTest):
    TERRAFORM_DIR: str

    def setUp(self) -> None:
        self.execute_cmd(["terraform", "init", "-input=false", self.TERRAFORM_DIR])
        self.execute_cmd(["terraform", "plan", "-input=false", self.TERRAFORM_DIR])
        self.execute_cmd(["terraform", "apply", "-input=false", "-auto-approve", self.TERRAFORM_DIR])

    def get_tf_output(self, name):
        output = self.check_output(["terraform", "output", name]).decode('utf-8').replace("\r\n", "")
        self.log.info(output)
        return output

    def tearDown(self) -> None:
        self.execute_cmd(["terraform", "plan", "-destroy", "-input=false", self.TERRAFORM_DIR])
        self.execute_cmd(["terraform", "destroy", "-input=false", "-auto-approve", self.TERRAFORM_DIR])

Member Author (@feluelle):

I am not quite happy with it. This could probably be structured better.

Member:

We could consider using, or at least taking a look at, /~https://github.com/beelit94/python-terraform

Member Author (@feluelle):

I checked it out, but I didn't find it really useful to pull in an extra package only to wrap bash inside Python. It also does not seem to be compatible with the latest terraform 0.12.x releases.

Member:

Then I would suggest adding init, plan and apply as methods/functions, to limit the amount of duplicated self.execute_cmd code. WDYT?
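
A rough sketch of that suggestion, shown here as a standalone unittest with subprocess, whereas in the PR the commands would go through SystemTest.execute_cmd; class and method names are illustrative:

import subprocess
import unittest


class TerraformTestCase(unittest.TestCase):
    TERRAFORM_DIR: str  # set by concrete system tests

    def _terraform(self, *args: str) -> None:
        # Single helper that builds and runs a terraform command, replacing the
        # repeated execute_cmd(["terraform", ...]) calls in setUp and tearDown.
        subprocess.run(["terraform", *args, self.TERRAFORM_DIR], check=True)

    def setUp(self) -> None:
        self._terraform("init", "-input=false")
        self._terraform("plan", "-input=false")
        self._terraform("apply", "-input=false", "-auto-approve")

    def tearDown(self) -> None:
        self._terraform("plan", "-destroy", "-input=false")
        self._terraform("destroy", "-input=false", "-auto-approve")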

Member Author (@feluelle):

I could do that.

The real issue I am having with this is that super().setUp() and super().tearDown() are not explicit enough: Terraform is very much hidden behind them. You can easily forget to call them when you override these methods, which in the case of super().tearDown() could be dramatic. But maybe we could add a static check for that?!
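
One possible way to make the teardown harder to lose, sketched with plain unittest and subprocess rather than the PR's SystemTest helper: register the destroy step as a cleanup, which the framework itself runs regardless of whether a subclass overrides tearDown():

import subprocess
import unittest


class TerraformTestCase(unittest.TestCase):
    TERRAFORM_DIR: str  # set by concrete system tests

    def setUp(self) -> None:
        subprocess.run(["terraform", "init", "-input=false", self.TERRAFORM_DIR], check=True)
        # Cleanups are invoked by unittest after tearDown(), even if a subclass
        # overrides tearDown() and forgets to call super().
        self.addCleanup(
            subprocess.run,
            ["terraform", "destroy", "-input=false", "-auto-approve", self.TERRAFORM_DIR],
            check=True,
        )
        subprocess.run(
            ["terraform", "apply", "-input=false", "-auto-approve", self.TERRAFORM_DIR],
            check=True,
        )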

@feluelle force-pushed the test/s3_to_redshift_system branch from 4fd0a69 to 16bb5c7 on June 5, 2020 18:16

@feluelle (Member Author) commented Jun 5, 2020

The variables I used for testing (not committed) are:

variables.env

S3_BUCKET=test-airflow-bucket-${AIRFLOW_BREEZE_UNIQUE_SUFFIX}
TF_VAR_s3_bucket=${S3_BUCKET}
TF_VAR_redshift_cluster_identifier=test-airflow-cluster-${AIRFLOW_BREEZE_UNIQUE_SUFFIX}
HOST_AIRFLOW_SOURCES=/Users/feluelle/PycharmProjects/incubator-airflow

and terraform.tfvars

aws_region = "eu-central-1"
redshift_database_name = "dev"
redshift_master_username = "master"
redshift_master_password = "1Two3Four"

I added terraform.tfvars to .gitignore, because I was not sure whether we should commit it, since it contains "sensitive information" even though we only use it for testing.

The full example DAG takes about 7 minutes to run.

@feluelle force-pushed the test/s3_to_redshift_system branch 2 times, most recently from ea8c668 to c9406b0 on June 9, 2020 12:40
- add howto docs for S3ToRedshift example dag
- add terraform which runs terraform CLI commands in an isolated docker container

NOTE: This system test uses terraform to provide the infrastructure needed to run this example dag.
@feluelle force-pushed the test/s3_to_redshift_system branch from c9406b0 to 3c93c08 on June 9, 2020 13:13

@potiuk merged commit a69b031 into apache:master on Jun 10, 2020
@feluelle mentioned this pull request on Jun 19, 2020
@feluelle deleted the test/s3_to_redshift_system branch on July 29, 2020 09:03
Labels: provider:amazon (AWS/Amazon-related issues)
6 participants