
Make Redshift to S3 authentication mechanisms mutually exclusive #291

Closed
JoshRosen wants to merge 11 commits into master from credential-mechanism-enforcement

Conversation

JoshRosen
Contributor

This patch makes a breaking change to how Redshift to S3 communication is authenticated. Previously, the implicit default behavior was to forward Spark's S3 credentials to Redshift, and this default would be used unless `aws_iam_role` or the `temporary_aws_*` options were set. This behavior was slightly dodgy because it meant that a typo in the IAM settings (e.g. using the parameter `redshift_iam_role` instead of the correct `aws_iam_role`) would silently cause the default authentication mechanism to be used instead.

To fix that gap, this patch changes this library so that Spark's S3 credentials will only be forwarded to Redshift if `forward_spark_s3_credentials` is set to `true`. This option is mutually-exclusive with the `aws_iam_role` and `temporary_aws_*` options and is set to `false` by default. The net effect of this change is that users who were already using `aws_iam_role` or the `temporary_aws_*` options will be unaffected, while users relying on the old default behavior will need to set `forward_spark_s3_credentials` to `true` in order to continue using that authentication scheme.

I have updated the README with a new section explaining the different connections involved in using this library and the different authentication mechanisms available for them. I also added a new section describing how to enable encryption of data in transit and at rest.

Because of the backwards-incompatible nature of this change, I'm bumping the version number to 3.0.0-preview1-SNAPSHOT.
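
For illustration, a minimal PySpark read that opts in to the old behavior might look like the sketch below (it assumes the `com.databricks.spark.redshift` data source name; the JDBC URL, table name, and `tempdir` values are placeholders):

```python
# Minimal sketch of opting in to credential forwarding after this change.
# The URL, table, and tempdir values are placeholders.
df = (
    sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://examplecluster:5439/dev?user=USER&password=PASS")
    .option("dbtable", "my_table")
    .option("tempdir", "s3n://my-bucket/tmp/")
    # Must now be set explicitly; combining it with aws_iam_role or the
    # temporary_aws_* options is rejected as mutually exclusive.
    .option("forward_spark_s3_credentials", "true")
    .load()
)
```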

@codecov-io

codecov-io commented Oct 26, 2016

Current coverage is 88.64% (diff: 94.44%)

Merging #291 into master will increase coverage by 0.04%

```
@@             master       #291   diff @@
==========================================
  Files            15         15
  Lines           754        766    +12
  Methods         611        615     +4
  Messages          0          0
  Branches        143        151     +8
==========================================
+ Hits            668        679    +11
- Misses           86         87     +1
  Partials          0          0
```

Powered by Codecov. Last update 9ed18a0...ee47736

@yhuai

yhuai commented Oct 26, 2016

lgtm!

(`ACCESSKEY`, `SECRETKEY`).

Due to [Hadoop limitations](https://issues.apache.org/jira/browse/HADOOP-3733), this
approach will not work for secret keys which contain forward slash (`/`) characters.

I assume most users are familiar with URL-encoding S3 URLs as the usual workaround for this type of issue; since that doesn't work here, it might be worth mentioning explicitly that URL-encoding won't fix this (for people like me who didn't open the link on the first read-through :D)
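
For readers skimming past the link, the approach in question is embedding the keys directly in the `tempdir` URI, roughly as in the hypothetical sketch below; if the secret key contains a `/`, this form breaks, and URL-encoding the slash does not work around HADOOP-3733:

```python
# Hypothetical illustration only; ACCESSKEY and SECRETKEY are placeholders.
# A "/" inside SECRETKEY breaks this form, and URL-encoding it does not help.
tempdir = "s3n://ACCESSKEY:SECRETKEY@my-bucket/spark-redshift-tmp/"
```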

break or change in the future:

```python
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
```

Is there a non-internal method, even if it's not as pretty, or is this the best we can do?

For example: for PySpark > N, use a supported method; for PySpark <= N, use this workaround.

Contributor Author

Alas, there's still not a better method than this :(
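
One partial alternative is to set the same Hadoop properties through Spark's `spark.hadoop.*` configuration prefix before the SparkContext is created, which avoids touching the internal `_jsc` accessor. This is a general Spark mechanism rather than anything specific to this library; a sketch:

```python
from pyspark import SparkConf, SparkContext

# Sketch: Spark copies any "spark.hadoop.*" entries into the Hadoop
# Configuration, so the s3n credentials can be supplied without using _jsc.
# The key values are placeholders.
conf = (
    SparkConf()
    .set("spark.hadoop.fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
)
sc = SparkContext(conf=conf)
```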

will be passed to Redshift; otherwise, AWS keys will be passed. These credentials are
sent as part of the JDBC query, so it is **strongly recommended** to enable SSL
encryption of the JDBC connection when using this authentication method.
3. **Use Security Token Service (STS) credentials**: You may configure the

These are also passed from the driver to Redshift, right? If so, shouldn't we also be encouraging TLS?

If they're passed through some mechanism other than JDBC, maybe we can document that as well?

Contributor Author

Yeah, we should probably recommend TLS for these as well. Even though these keys are time-limited, it's still necessary to guard against eavesdropping.

Contributor Author

Done.
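
For reference, enabling SSL on the JDBC side can be as simple as adding the driver's SSL parameter to the connection URL; a hedged sketch (the `ssl=true` parameter follows the Redshift/PostgreSQL JDBC driver conventions, and the host, database, and credential values are placeholders):

```python
# Sketch: request an encrypted JDBC connection so that forwarded S3 or STS
# credentials are not sent to Redshift in plaintext. Values are placeholders.
jdbc_url = (
    "jdbc:redshift://examplecluster:5439/dev"
    "?user=USER&password=PASS&ssl=true"
)
```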

requires AWS credentials with read and write access to an S3 bucket (specified using the `tempdir`
configuration parameter).

> **:warning: Note**: This library does not clean up the temporary files that it creates in S3.

Do these temporary files also need to be encrypted & authenticated and thus touched on in the encryption section?

Contributor Author

These are the temporary files referred to by the *Encrypting UNLOAD data stored in S3 (data stored when reading from Redshift)* and *Encrypting COPY data stored in S3 (data stored when writing to Redshift)* headings under the Encryption section.
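
As one hedged example of what protecting those temporary files can involve on the S3 side, server-side encryption can be requested through the S3 connector's Hadoop configuration (the property below is a hadoop-aws `s3a` setting and assumes an `s3a://` tempdir; see the README's Encryption section for this library's recommended setup):

```python
# Sketch, assuming an s3a:// tempdir: ask the hadoop-aws connector to write
# objects (including the temporary UNLOAD/COPY files) with SSE-S3 encryption.
sc._jsc.hadoopConfiguration().set(
    "fs.s3a.server-side-encryption-algorithm", "AES256"
)
```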

@thomasdesr

Lgtm

@JoshRosen
Contributor Author

I'm going to merge this and will begin packaging a preview release. Thanks for the reviews!

JoshRosen added a commit that referenced this pull request Nov 1, 2016

Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #291 from JoshRosen/credential-mechanism-enforcement.
JoshRosen closed this Nov 1, 2016
JoshRosen deleted the `credential-mechanism-enforcement` branch November 1, 2016 00:12
@steveloughran

One thing to note here is that s3a in Hadoop 2.8 supports STS via the property `fs.s3a.session.token`; to be ready for that Hadoop version, you should really be syncing that too.

See: hadoop-aws docs
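
For Hadoop 2.8+, keeping s3a in sync with the STS credentials handed to Redshift would look roughly like the sketch below (the property names and the temporary-credentials provider class come from the hadoop-aws documentation; the key values are placeholders):

```python
# Sketch for Hadoop 2.8+ s3a: use the same temporary STS credentials for
# Spark's S3 access that are passed to Redshift. Values are placeholders.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "TEMP_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3a.secret.key", "TEMP_SECRET_ACCESS_KEY")
hadoop_conf.set("fs.s3a.session.token", "TEMP_SESSION_TOKEN")
```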

@JoshRosen
Contributor Author

Good catch @steveloughran; I've filed #296 to make sure that I don't forget to fix this before the next release.
