
Make Redshift to S3 authentication mechanisms mutually exclusive #291

Closed
JoshRosen wants to merge 11 commits into master from credential-mechanism-enforcement

Conversation

JoshRosen
Contributor

This patch makes a breaking change to how Redshift to S3 communication is authenticated. Previously, the implicit default behavior was to forward Spark's S3 credentials to Redshift, and this default would be used unless `aws_iam_role` or the `temporary_aws_*` options were set. This behavior was slightly dodgy because it meant that a typo in the IAM settings (e.g. using the parameter `redshift_iam_role` instead of the correct `aws_iam_role`) would silently cause the default authentication mechanism to be used instead.

To fix that gap, this patch changes this library so that Spark's S3 credentials will only be forwarded to Redshift if `forward_spark_s3_credentials` is set to `true`. This option is mutually-exclusive with the `aws_iam_role` and `temporary_aws_*` options and is set to `false` by default. The net effect of this change is that users who were already using `aws_iam_role` or the `temporary_aws_*` options will be unaffected, while users relying on the old default behavior will need to set `forward_spark_s3_credentials` to `true` in order to continue using that authentication scheme.

I have updated the README with a new section explaining the different connections involved in using this library and the different authentication mechanisms available for them. I also added a new section describing how to enable encryption of data in transit and at rest.

Because of the backwards-incompatible nature of this change, I'm bumping the version number to 3.0.0-preview1-SNAPSHOT.
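
For illustration, a minimal PySpark read that opts in to the old behavior might look like the sketch below (it assumes the `com.databricks.spark.redshift` data source name; the JDBC URL, table name, and `tempdir` values are placeholders):

```python
# Minimal sketch of opting in to credential forwarding after this change.
# The URL, table, and tempdir values are placeholders.
df = (
    sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://examplecluster:5439/dev?user=USER&password=PASS")
    .option("dbtable", "my_table")
    .option("tempdir", "s3n://my-bucket/tmp/")
    # Must now be set explicitly; combining it with aws_iam_role or the
    # temporary_aws_* options is rejected as mutually exclusive.
    .option("forward_spark_s3_credentials", "true")
    .load()
)
```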

@codecov-io

codecov-io commented Oct 26, 2016

Current coverage is 88.64% (diff: 94.44%)

Merging #291 into master will increase coverage by 0.04%

```
@@             master       #291   diff @@
==========================================
  Files            15         15
  Lines           754        766    +12
  Methods         611        615     +4
  Messages          0          0
  Branches        143        151     +8
==========================================
+ Hits            668        679    +11
- Misses           86         87     +1
  Partials          0          0
```

Powered by Codecov. Last update 9ed18a0...ee47736

@yhuai

yhuai commented Oct 26, 2016

lgtm!

(`ACCESSKEY`, `SECRETKEY`).

Due to [Hadoop limitations](https://issues.apache.org/jira/browse/HADOOP-3733), this
approach will not work for secret keys which contain forward slash (`/`) characters.

I assume most users are familiar with URL-encoding S3 URLs as the usual workaround for this type of issue; since that doesn't work here, it might be worth mentioning explicitly that URL-encoding won't fix this (for people like me who didn't open the link on the first read-through :D)
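
For readers skimming past the link, the approach in question is embedding the keys directly in the `tempdir` URI, roughly as in the hypothetical sketch below; if the secret key contains a `/`, this form breaks, and URL-encoding the slash does not work around HADOOP-3733:

```python
# Hypothetical illustration only; ACCESSKEY and SECRETKEY are placeholders.
# A "/" inside SECRETKEY breaks this form, and URL-encoding it does not help.
tempdir = "s3n://ACCESSKEY:SECRETKEY@my-bucket/spark-redshift-tmp/"
```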

break or change in the future:

```python
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
```

Is there a non-internal method, even if it's not as pretty, or is this the best we can do?

For example: for PySpark > N, use a supported method; for PySpark <= N, use this workaround.

Contributor Author

Alas, there's still not a better method than this :(
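
One partial alternative is to set the same Hadoop properties through Spark's `spark.hadoop.*` configuration prefix before the SparkContext is created, which avoids touching the internal `_jsc` accessor. This is a general Spark mechanism rather than anything specific to this library; a sketch:

```python
from pyspark import SparkConf, SparkContext

# Sketch: Spark copies any "spark.hadoop.*" entries into the Hadoop
# Configuration, so the s3n credentials can be supplied without using _jsc.
# The key values are placeholders.
conf = (
    SparkConf()
    .set("spark.hadoop.fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
)
sc = SparkContext(conf=conf)
```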

will be passed to Redshift; otherwise, AWS keys will be passed. These credentials are
sent as part of the JDBC query, so it is **strongly recommended** to enable SSL
encryption of the JDBC connection when using this authentication method.
3. **Use Security Token Service (STS) credentials**: You may configure the

These are also passed from the driver to Redshift, right? If so, shouldn't we also be encouraging TLS?

If they're passed through some mechanism other than JDBC, maybe we can document that as well?

Contributor Author

Yeah, we should probably recommend TLS for these as well. Even though these keys are time-limited, it's still necessary to guard against eavesdropping.

Contributor Author

Done.
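
For reference, enabling SSL on the JDBC side can be as simple as adding the driver's SSL parameter to the connection URL; a hedged sketch (the `ssl=true` parameter follows the Redshift/PostgreSQL JDBC driver conventions, and the host, database, and credential values are placeholders):

```python
# Sketch: request an encrypted JDBC connection so that forwarded S3 or STS
# credentials are not sent to Redshift in plaintext. Values are placeholders.
jdbc_url = (
    "jdbc:redshift://examplecluster:5439/dev"
    "?user=USER&password=PASS&ssl=true"
)
```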

requires AWS credentials with read and write access to an S3 bucket (specified using the `tempdir`
configuration parameter).

> **:warning: Note**: This library does not clean up the temporary files that it creates in S3.

Do these temporary files also need to be encrypted & authenticated and thus touched on in the encryption section?

Contributor Author

These are the temporary files referred to by the *Encrypting UNLOAD data stored in S3 (data stored when reading from Redshift)* and *Encrypting COPY data stored in S3 (data stored when writing to Redshift)* headings under the Encryption section.
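
As one hedged example of what protecting those temporary files can involve on the S3 side, server-side encryption can be requested through the S3 connector's Hadoop configuration (the property below is a hadoop-aws `s3a` setting and assumes an `s3a://` tempdir; see the README's Encryption section for this library's recommended setup):

```python
# Sketch, assuming an s3a:// tempdir: ask the hadoop-aws connector to write
# objects (including the temporary UNLOAD/COPY files) with SSE-S3 encryption.
sc._jsc.hadoopConfiguration().set(
    "fs.s3a.server-side-encryption-algorithm", "AES256"
)
```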

@thomasdesr

Lgtm

@JoshRosen
Contributor Author

I'm going to merge this and will begin packaging a preview release. Thanks for the reviews!

JoshRosen added a commit that referenced this pull request Nov 1, 2016

Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #291 from JoshRosen/credential-mechanism-enforcement.
JoshRosen closed this Nov 1, 2016
JoshRosen deleted the `credential-mechanism-enforcement` branch November 1, 2016 00:12
@steveloughran

One thing to note here is that s3a in Hadoop 2.8 supports STS via the property `fs.s3a.session.token`; to be ready for that Hadoop version, you should really be syncing that too.

See: hadoop-aws docs
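
For Hadoop 2.8+, keeping s3a in sync with the STS credentials handed to Redshift would look roughly like the sketch below (the property names and the temporary-credentials provider class come from the hadoop-aws documentation; the key values are placeholders):

```python
# Sketch for Hadoop 2.8+ s3a: use the same temporary STS credentials for
# Spark's S3 access that are passed to Redshift. Values are placeholders.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "TEMP_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3a.secret.key", "TEMP_SECRET_ACCESS_KEY")
hadoop_conf.set("fs.s3a.session.token", "TEMP_SESSION_TOKEN")
```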

@JoshRosen
Contributor Author

Good catch @steveloughran; I've filed #296 to make sure that I don't forget to fix this before the next release.
