Make Redshift to S3 authentication mechanisms mutually exclusive #291
Conversation
Current coverage is 88.64% (diff: 94.44%)

```diff
@@             master    #291    diff @@
==========================================
  Files            15      15
  Lines           754     766     +12
  Methods         611     615      +4
  Messages          0       0
  Branches        143     151      +8
==========================================
+ Hits            668     679     +11
- Misses           86      87      +1
  Partials          0       0
```
lgtm
> (`ACCESSKEY`, `SECRETKEY`).
>
> Due to [Hadoop limitations](https://issues.apache.org/jira/browse/HADOOP-3733), this
> approach will not work for secret keys which contain forward slash (`/`) characters.
I assume most users are familiar with the need to urlencode S3 URLs as the workaround to this type of issue; since that doesn't work here, maybe mention that urlencoding won't fix this issue (for people like me who didn't open the link on the first read-through :D)
> break or change in the future:
>
> ```python
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
> ```
Is there a non-internal method that may not be as pretty, or is this the best we can do?
i.e., for pyspark > N use a supported method; for pyspark <= N use this workaround.
Alas, there's still not a better method than this :(
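For context, a minimal sketch of the workaround under discussion, assuming an existing SparkSession and the stock Hadoop `fs.s3n` property names (the secret-key line is an assumption mirroring the access-key line shown in the diff):

```python
# Minimal sketch of the PySpark workaround discussed above. sc._jsc reaches
# into Spark's internal JavaSparkContext, which is why a "non-internal"
# alternative was asked about; per the reply above, no public equivalent
# exists yet.
sc = spark.sparkContext  # assumes an existing SparkSession named `spark`
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
```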
> will be passed to Redshift; otherwise, AWS keys will be passed. These credentials are
> sent as part of the JDBC query, so it is **strongly recommended** to enable SSL
> encryption of the JDBC connection when using this authentication method.
>
> 3. **Use Security Token Service (STS) credentials**: You may configure the
These are also passed from the driver to Redshift, right? If so, shouldn't we also be encouraging TLS?
If they're passed through some mechanism other than JDBC, maybe we can document that as well?
Yeah, we should recommend TLS for these as well. Even though these keys are time-expiring, it's still necessary to guard against eavesdropping.
Done.
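For readers following along, a hedged sketch of what the combined recommendation looks like, assuming the Redshift JDBC driver's `ssl=true` URL parameter and this library's `temporary_aws_*` option names; all connection values are placeholders:

```python
# Hedged sketch: STS credentials supplied via the temporary_aws_* options,
# sent over an SSL-encrypted JDBC connection (ssl=true in the URL). The
# host, database, table, bucket, and credential values are placeholders.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=u&password=p&ssl=true")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-bucket/tmp")
      .option("temporary_aws_access_key_id", "YOUR_TEMP_KEY_ID")
      .option("temporary_aws_secret_access_key", "YOUR_TEMP_SECRET_KEY")
      .option("temporary_aws_session_token", "YOUR_SESSION_TOKEN")
      .load())
```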
> requires AWS credentials with read and write access to an S3 bucket (specified using the `tempdir`
> configuration parameter).
>
> > **:warning: Note**: This library does not clean up the temporary files that it creates in S3.
Do these temporary files also need to be encrypted & authenticated and thus touched on in the encryption section?
These are the temporary files being referred to under the *Encrypting UNLOAD data stored in S3* (data stored when reading from Redshift) and *Encrypting COPY data stored in S3* (data stored when writing to Redshift) headings in the Encryption section.
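Since the question above asks whether the temp files need encryption, here is a minimal sketch of one way to cover them at rest, assuming the stock Hadoop S3A server-side-encryption property rather than any setting specific to this library:

```python
# Hedged sketch: request AES256 server-side encryption for files written
# under tempdir. This uses the standard Hadoop S3A property and therefore
# applies when tempdir points at an s3a:// URI; it is an assumption about
# one workable approach, not this library's documented mechanism.
sc._jsc.hadoopConfiguration().set(
    "fs.s3a.server-side-encryption-algorithm", "AES256")
```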
Lgtm
I'm going to merge this and will begin packaging a preview release. Thanks for the reviews!
This patch makes a breaking change to how Redshift to S3 communication is authenticated. Previously, the implicit default behavior was to forward Spark's S3 credentials to Redshift, and this default would be used unless `aws_iam_role` or the `temporary_aws_*` options were set. This behavior is slightly dodgy because it meant that a typo in the IAM settings (i.e. using the parameter `redshift_iam_role` instead of the correct `aws_iam_role`) would cause a default authentication mechanism to be used instead.

To fix that gap, this patch changes this library so that Spark's S3 credentials will only be forwarded to Redshift if `forward_spark_s3_credentials` is set to `true`. This option is mutually exclusive with the `aws_iam_role` and `temporary_aws_*` options and is set to `false` by default. The net effect of this change is that users who were already using `aws_iam_role` or the `temporary_aws_*` options will be unaffected, while users relying on the old default behavior will need to set `forward_spark_s3_credentials` to `true` in order to continue using that authentication scheme.

I have updated the README with a new section explaining the different connections involved in using this library and the different authentication mechanisms available for them. I also added a new section describing how to enable encryption of data in transit and at rest.

Because of the backwards-incompatible nature of this change, I'm bumping the version number to `3.0.0-preview1-SNAPSHOT`.

Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #291 from JoshRosen/credential-mechanism-enforcement.
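A hedged migration sketch for users who relied on the old implicit default; the option name and semantics come from the commit message above, while the connection values are placeholders:

```python
# After this patch, forwarding Spark's S3 credentials to Redshift must be
# requested explicitly; leaving it unset (and setting no other auth option)
# is now an error rather than a silent fallback. Setting it together with
# aws_iam_role or the temporary_aws_* options is also rejected, since the
# mechanisms are mutually exclusive.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=u&password=p")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-bucket/tmp")
      .option("forward_spark_s3_credentials", "true")  # new, explicit opt-in
      .load())
```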
One thing to note here is that s3a in Hadoop 2.8 supports STS via a configuration property. See: hadoop-aws docs
Good catch @steveloughran; I've filed #296 to make sure that I don't forget to fix this before the next release.