Add S3ToRedshift example dag and system test #8877
Conversation
Force-pushed cdbdff7 to 3abe736
Force-pushed 3abe736 to 1e977e1
Ready for review. PTAL @ashb @potiuk Maybe @xinbinhuang @mustafagok @baolsen if you have time :)
```python
# [START howto_operator_s3_to_redshift_env_variables]
S3_BUCKET = getenv("S3_BUCKET", "bucket")
S3_KEY = getenv("S3_KEY", "key")
```
Why are we relying on env vars, rather than Airflow Variables + templates? `{{ var.s3_key }}`
We have this in all of our system tests so that we can easily pass values to the example dags. See /~https://github.com/apache/airflow/blob/master/TESTING.rst#environment-for-system-tests
I think it makes sense to switch to Airflow Variables + templates like Ash said and set them via env variables: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#storing-variables-in-environment-variables. So we only need to prefix all env vars with `AIRFLOW_VAR_`. For example: set `AIRFLOW_VAR_S3_KEY=key` in `variables.env` and access it via `{{ var.s3_key }}` here. WDYT @potiuk @turbaszek?
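For reference, a minimal sketch of that proposal, not code from this PR. It assumes the provider import path implied by this PR's file list and that `s3_bucket`/`s3_key` are templated fields of the operator; note also that the actual Jinja accessor for Airflow Variables is `{{ var.value.<name> }}`, not `{{ var.<name> }}`:

```python
# Sketch of the Variables-via-env-vars approach (assumptions noted above).
from airflow import DAG
from airflow.providers.amazon.aws.operators.s3_to_redshift import S3ToRedshiftOperator
from airflow.utils.dates import days_ago

# Exported before Airflow starts, e.g. in variables.env:
#   AIRFLOW_VAR_S3_BUCKET=bucket
#   AIRFLOW_VAR_S3_KEY=key

with DAG("example_s3_to_redshift", start_date=days_ago(1), schedule_interval=None) as dag:
    transfer = S3ToRedshiftOperator(
        task_id="transfer_s3_to_redshift",
        s3_bucket="{{ var.value.s3_bucket }}",  # rendered at task run time, not at parse time
        s3_key="{{ var.value.s3_key }}",
        schema="PUBLIC",
        table="test_table",
        copy_options=["csv"],
    )
```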
But how do we then define defaults in case these `var`s are not set? Or should we make all vars required then?
I think this is fine, and when we automate it, we should define files where we define ENV variables.
Using Airflow Variables in example DAGs is a very bad idea. People use these as, well, examples, and using Airflow Variables in DAGs this way is very bad from a database-load point of view. Until we avoid all entities parsing the DAGs frequently, we should not use Variables this way (we even have a best practice about it): every time a DAG is parsed, DB communication occurs and an extra connection is made. That is quite bad for scalability and not obvious at all, so if people take these examples literally (which they did in the past), they will implement bad practices.
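To make the database-load point concrete, a short sketch contrasting the anti-pattern with the `getenv` pattern this PR uses (variable names are illustrative):

```python
from os import getenv

from airflow.models import Variable

# Anti-pattern: this line runs on EVERY parse of the DAG file, so the
# scheduler and webserver open a metadata-DB connection just to build
# the DAG object.
S3_KEY_BAD = Variable.get("s3_key", default_var="key")

# Pattern used in this PR: a plain environment lookup with a default.
# No DB round-trip at parse time, and system tests can inject real values.
S3_KEY = getenv("S3_KEY", "key")
```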
tests/providers/amazon/aws/operators/test_s3_to_redshift_system.py (outdated review thread, resolved)
Force-pushed 398fa82 to ad3a6fc
Force-pushed 138645f to 4fd0a69
First try (this works 🎉 ) - I switched to Terraform for provisioning the infrastructure. WDYT? @potiuk @ashb @mik-laj @turbaszek @kaxil
airflow/providers/amazon/aws/example_dags/example_s3_to_redshift.py (outdated review thread, resolved)
tests/test_utils/terraform.py (outdated)

```python
class Terraform(SystemTest):
    # Directory containing the *.tf files; must be set by subclasses.
    TERRAFORM_DIR: str

    def setUp(self) -> None:
        # Provision the test infrastructure before the system test runs.
        self.execute_cmd(["terraform", "init", "-input=false", self.TERRAFORM_DIR])
        self.execute_cmd(["terraform", "plan", "-input=false", self.TERRAFORM_DIR])
        self.execute_cmd(["terraform", "apply", "-input=false", "-auto-approve", self.TERRAFORM_DIR])

    def get_tf_output(self, name):
        # Read a single Terraform output value, stripping Windows line endings.
        output = self.check_output(["terraform", "output", name]).decode('utf-8').replace("\r\n", "")
        self.log.info(output)
        return output

    def tearDown(self) -> None:
        # Tear the provisioned infrastructure down again after the test.
        self.execute_cmd(["terraform", "plan", "-destroy", "-input=false", self.TERRAFORM_DIR])
        self.execute_cmd(["terraform", "destroy", "-input=false", "-auto-approve", self.TERRAFORM_DIR])
```
I am not quite happy with it. This can probably be structured better.
We can consider using/taking a look at /~https://github.com/beelit94/python-terraform
I checked this out, but didn't find it really useful to pull in an extra package only to wrap Bash inside Python. It also seems not to be compatible with the latest Terraform 0.12.x releases.
Then I would suggest adding `init`, `plan`, and `apply` as methods/functions to reduce the amount of duplicated `self.execute_cmd` code. WDYT?
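A sketch of that refactor, under the assumption that the wrappers simply forward to `self.execute_cmd` (method names follow the suggestion above, not code from this PR):

```python
# Sketch: one thin wrapper per Terraform sub-command, so that
# setUp()/tearDown() read as intent rather than shell plumbing.
class Terraform(SystemTest):
    TERRAFORM_DIR: str

    def _terraform(self, *args: str) -> None:
        self.execute_cmd(["terraform", *args, self.TERRAFORM_DIR])

    def init(self) -> None:
        self._terraform("init", "-input=false")

    def plan(self, destroy: bool = False) -> None:
        self._terraform("plan", *(["-destroy"] if destroy else []), "-input=false")

    def apply(self) -> None:
        self._terraform("apply", "-input=false", "-auto-approve")

    def destroy(self) -> None:
        self._terraform("destroy", "-input=false", "-auto-approve")

    def setUp(self) -> None:
        self.init()
        self.plan()
        self.apply()

    def tearDown(self) -> None:
        self.plan(destroy=True)
        self.destroy()
```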
I could do that.
The real issue I am having with this is that `super().setUp()` and `super().tearDown()` are not explicit enough. Terraform is very much hidden behind them. You can easily forget to call them when you override these methods, which in the case of `super().tearDown()` could be dramatic, since the infrastructure would never be destroyed. But maybe we could add a static check for that?!
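Such a static check could plausibly be done with a small AST walk; a hypothetical, intentionally simplistic sketch (not part of this PR):

```python
# Hypothetical static check: flag Terraform subclasses that override
# setUp/tearDown without calling super().
import ast
import sys


def missing_super_calls(source: str):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        if not any(getattr(base, "id", None) == "Terraform" for base in node.bases):
            continue
        for func in node.body:
            if isinstance(func, ast.FunctionDef) and func.name in ("setUp", "tearDown"):
                calls_super = any(
                    isinstance(n, ast.Call)
                    and isinstance(n.func, ast.Attribute)
                    and n.func.attr == func.name
                    for n in ast.walk(func)
                )
                if not calls_super:
                    yield f"{node.name}.{func.name} does not call super().{func.name}()"


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as source_file:
            for problem in missing_super_calls(source_file.read()):
                print(f"{path}: {problem}")
```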
Force-pushed 4fd0a69 to 16bb5c7
The variables I used for testing (not committed) are in `variables.env` and `terraform.tfvars`. I added `terraform.tfvars` to `.gitignore` because I was not sure if we should commit it, since it contains "sensitive information", even though we only use it for testing. The full example dag takes about 7 mins to run.
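Since these files are not committed, the following contents are purely illustrative of what they might hold; the values are invented for this example, and `master_username`/`master_password` are standard inputs of Terraform's `aws_redshift_cluster` resource:

```
# variables.env -- values invented for illustration
S3_BUCKET=s3-to-redshift-system-test
S3_KEY=data.csv
```

```
# terraform.tfvars -- kept out of git because it contains credentials
master_username = "awsuser"
master_password = "<redacted>"
```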
Force-pushed ea8c668 to c9406b0
- add howto docs for S3ToRedshift example dag
- add terraform which runs terraform CLI commands in an isolated docker container

NOTE: This system test uses terraform to provide the infrastructure needed to run this example dag.
Force-pushed c9406b0 to 3c93c08
NOTE: This system test uses terraform to provide the infrastructure needed to run this example dag.
Make sure to mark the boxes below before creating PR: [x]
- In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
- In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
- In case of backwards incompatible changes please leave a note in UPDATING.md.

Read the Pull Request Guidelines for more information.