Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column names cannot contain spaces when saving back to Redshift #84

Open
JoshRosen opened this issue Sep 12, 2015 · 3 comments
Open

Column names cannot contain spaces when saving back to Redshift #84

JoshRosen opened this issue Sep 12, 2015 · 3 comments

Comments

@JoshRosen
Copy link
Contributor

Because spark-redshift uses Avro in its write path, it inherits the limitations of Avro's schema validation: we cannot create tables with columns that contain non-letter-or-_ characters, such as spaces or quotes.

In order to work around these limitations, I think that we'd have to use a different set of column names when writing to Avro, then use a JSONPaths file to map those columns back to the original column names.

@JoshRosen JoshRosen added the bug label Sep 12, 2015
JoshRosen added a commit that referenced this issue Sep 14, 2015
This patch modifies `JDBCWrapper.schemaString` to wrap column names in quotes, which is necessary in order to allow us to create tables with columns whose names are reserved words or which contain spaces. This fixes #80.

Note that, by itself, this patch does not enable full support for creating Redshift tables with column names that contain spaces; we are currently constrained by Avro's schema validation rules (see #84).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #85 from JoshRosen/column-name-escaping.
JoshRosen added a commit that referenced this issue Oct 25, 2016
This patch adds new options to allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow. This patch is based on #165 by emlyn and incorporates changes from me in order to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
@l15k4
Copy link

l15k4 commented Dec 9, 2016

Hey, colons and dashes too, I have fields like this c-geo:c3 and getting :

[info]   org.apache.avro.SchemaParseException: Illegal character in: c-geo:c3
[info]   at org.apache.avro.Schema.validateName(Schema.java:1083)

because neither : nor - is supported ...

Spec https://avro.apache.org/docs/current/spec.html#names doesn't say anything about this. Wdyt ?

@JoshRosen
Copy link
Contributor Author

JoshRosen commented Dec 9, 2016

The Avro spec actually does say something about this (emphasis mine to make things clearer):

Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace. Equality of names is defined on the fullname.

The name portion of a fullname, record field names, and enum symbols must:

  • start with [A-Za-z_]
  • subsequently contain only [A-Za-z0-9_]

My PR description also notes this:

we cannot create tables with columns that contain non-letter-or-_ characters, such as spaces or quotes.

Starting in version 3.0.0-preview1, you can use the csv tempformat during writes and that format has none of these restrictions on names. In that version, there's even a helpful error message which detects when you've hit this case using AVRO serialization and suggests the CSV tempformat as a workaround:

I'll mark this issue as "documentation" to remind me to add a section to the README documenting this workaround.

@JamieL22
Copy link

JamieL22 commented May 5, 2022

Is this fixed yet? (white space in column names)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants