
[SPARK-10978] [SQL] Allow data sources to eliminate filters #9399

Closed

Conversation

liancheng
Contributor

This PR adds a new method unhandledFilters to BaseRelation. Data sources which implement this method properly may avoid the overhead of defensive filtering done by Spark SQL.
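
For illustration, here is a minimal sketch of how a data source might use the new method, assuming the Spark 1.6 data sources API; the relation name and filter choices below are hypothetical and not part of this PR:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical relation, sketched only to illustrate the new API.
class KeyValueRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(
    StructField("key", IntegerType) ::
    StructField("value", StringType) :: Nil)

  // New in this PR: report the filters this source can NOT evaluate itself.
  // Spark SQL then only adds a defensive Filter operator for these.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("key", _)     => true  // evaluated inside buildScan
      case GreaterThan("key", _) => true  // evaluated inside buildScan
      case _                     => false // everything else is left to Spark SQL
    }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Read the data, applying the handled filters while scanning (details omitted).
    ???
  }
}
```

With this in place, Spark SQL only wraps the scan in a defensive Filter operator for the predicates returned by unhandledFilters, instead of re-evaluating every pushed-down predicate.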

@SparkQA

SparkQA commented Nov 1, 2015

Test build #44768 has finished for PR 9399 at commit 16f3ca3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

relation: LogicalRelation,
projects: Seq[NamedExpression],
filterPredicates: Seq[Expression],
scanBuilder: (Seq[Attribute], Seq[Expression], Seq[Filter]) => RDD[InternalRow]) = {
Contributor

It is not obvious that we need both Seq[Expression] and Seq[Filter]. Can you add comments to explain what these are?

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44776 has finished for PR 9399 at commit fec7d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor

One more consideration for this improvement: we probably need to optimize filters by constant-folding expressions, since partition keys are effectively constant values during execution, so simply adding unhandledFilters may not work for partition-based data sources. I am therefore wondering whether we could leave unhandledFilters and handledFilters to the data source implementation itself, and provide utilities or a default implementation for the common operations within buildScan.

@yhuai
Contributor

yhuai commented Nov 2, 2015

@chenghao-intel Can you give an example showing unhandledFilters is insufficient? Also, regarding "So I am wondering if we can leave the unhandledFilters and handledFilters to data source implementation itself, we can provide the utilities or the default implementation for the common operations within the buildScan", can you explain it?

@chenghao-intel
Contributor

What I mean is that computing unhandledFilters may give us trouble if we plan to optimize cases where partition keys appear in the filter: the partition key is a constant value during execution for EACH PARTITION, so we may not be able to make such a filter fully resolvable at the planning stage, and at the very least the code becomes more complicated. Besides, I don't see much benefit in exposing unhandledFilters to DataSourceStrategy. So I am suggesting we leave the unhandled-filter handling to BaseRelation.buildScan, treat unhandledFilters as a private/protected method of BaseRelation, or provide some common helper functions to simplify the implementation for new data source developers.

Sorry if I missed something.

@liancheng
Contributor Author

@chenghao-intel Could you please give an example?

@chenghao-intel
Contributor

Oh, for example: say we have a table src (key, value) partitioned by (p1), and a query like "SELECT value FROM src WHERE key > p1".

Assume the candidate values of p1 are 10 and 100, and the range of key is (0, 50).
-- unhandledFilters = Array.empty
This probably fails for key > 10 (p1 = 10): we may not be able to filter records during the scan before materializing all of them, or else buildScan has to add an extra filter operation on the RDD[Row].
-- unhandledFilters = key > p1
We lose the optimization for the partition p1 = 100: the concrete filter there is key > 100, so we could simply return an empty RDD[Row], given that the range of key is (0, 50).

My point is that it will be confusing for new data source developers to decide how to define unhandledFilters, because the partition key is not treated like a normal attribute; it takes extra work at the planning stage to obtain the concrete value and generate different filters for different partition keys. What is unhandledFilters supposed to return in that case?

On the other hand, I am not sure it is really necessary to expose unhandledFilters at all. It would be a new API that data source developers have to be aware of for optimization purposes, yet we already pass the filters down via def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] and its variants. Splitting the filter expressions into two parts that are executed by different operators (DataSourceStrategy and the data source implementation) seems to make things more complicated; even if the splitting would still happen inside the data source implementation, it is probably not wise to expose it externally.

@chenghao-intel
Contributor

Sorry for challenging this, but it is an API change, which will be difficult to revert once released, so we'd better think it through further, including the partition-key cases.

@liancheng
Contributor Author

@chenghao-intel Thanks for the comment. That's a good point and I didn't consider this situation when writing this PR. However, fortunately we don't even try to push down predicates that reference any partition columns (see here). In general, when implementing a data source, developers shouldn't worry about partitioning. A HadoopFsRelation data source is only responsible for returning data within a single partition, the query planner does the rest including partition pruning. So I think this is fine?
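
To make that planner-side behavior concrete, here is a rough sketch of the idea (the helper name is hypothetical, not the actual DataSourceStrategy code): predicates that touch any partition column stay on the Spark SQL side, so unhandledFilters and buildScan only ever see data-column predicates.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}

// Hypothetical helper, for illustration only.
def splitByPartitionColumns(
    allFilters: Seq[Expression],
    partitionColumns: AttributeSet): (Seq[Expression], Seq[Expression]) = {
  // Any predicate referencing a partition column is kept by Spark SQL
  // (partition pruning / defensive Filter); only the remaining predicates
  // are candidates for push-down into the data source.
  allFilters.partition(f => f.references.exists(partitionColumns.contains))
}
```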

@liancheng liancheng force-pushed the spark-10978.unhandled-filters branch from fec7d25 to b658aaa Compare November 2, 2015 13:03
@SparkQA

SparkQA commented Nov 2, 2015

Test build #44811 has finished for PR 9399 at commit b658aaa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44814 has finished for PR 9399 at commit 7c17dd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor

OK, actually I was planning to optimize expressions that involve partition keys, which would rely on ConstantFolding, since a partition key is a constant value at runtime.

I know that, for a data source developer, a HadoopFsRelation only returns the data of a single partition as an RDD. That was exactly my question: how should unhandledFilters be defined, and will a filter involving a partition key always be part of unhandledFilters?

@yhuai
Contributor

yhuai commented Nov 3, 2015

unhandledFilters will not see filters using partitioning columns.

@chenghao-intel
Contributor

OK, thanks for the explanation.

// 2. A `Seq[Expression]`, containing all gathered Catalyst filter expressions, used by
// `CatalystScan`.
// 3. A `Seq[Filter]`, containing all data source `Filter`s that are converted from (possibly a
// subset of) Catalyst filter expressions and can be handled by `relation`.
Contributor

So, Seq[Expression] is used for data sources that understand Catalyst expressions, and Seq[Filter] is used for data sources that only understand the Filter API? If so, can we make it clear that 2 and 3 will not be used together?

Contributor Author

The first Seq[Expression] argument is only used to handle CatalystScan, which is kept only for experimental purposes; no built-in concrete data source implements CatalystScan now. The second Seq[Filter] argument is used to handle all other relation traits that support filter push-down, e.g. PrunedFilteredScan and HadoopFsRelation. Added comments to explain this.
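
For context, the two push-down paths correspond roughly to the following trait signatures in the data sources API (reproduced here for illustration; check the actual Spark sources for the authoritative definitions):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.sources.Filter

// Experimental path: receives raw Catalyst expressions (the Seq[Expression] argument).
trait CatalystScan {
  def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
}

// Stable path: receives the public, simplified Filter API (the Seq[Filter] argument).
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
```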


def testPushDown(sqlString: String, expectedCount: Int): Unit = {
def testPushDown(
Contributor

Let's also add checks to make sure the Filter operator added by Spark SQL only contains unhandled predicates, unconvertible predicates, and predicates involving partition columns.

Contributor Author

SimpleFilteredScan doesn't support partitioning. Updated SimpleTextRelation to make it support column pruning and filter push-down, and implemented unhandledFilters there to add these tests.
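
A rough sketch of the kind of plan check being discussed, assuming the 1.6-era physical Filter operator (org.apache.spark.sql.execution.Filter); the helper name is hypothetical and not the actual FilteredScanSuite code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.Filter // physical filter operator in 1.6-era Spark

// Hypothetical test helper, for illustration only.
def assertResidualFilter(df: DataFrame, expectResidualFilter: Boolean): Unit = {
  // Predicates the data source reported as handled should not reappear in a
  // defensive Filter operator in the physical plan.
  val residualFilters = df.queryExecution.executedPlan.collect { case f: Filter => f }
  if (expectResidualFilter) {
    assert(residualFilters.nonEmpty,
      "expected a Filter operator for unhandled, unconvertible, or partition-column predicates")
  } else {
    assert(residualFilters.isEmpty,
      "expected no Filter operator because the data source handles all pushed-down filters")
  }
}
```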

@yhuai
Contributor

yhuai commented Nov 3, 2015

Overall LGTM. Once we update the FilteredScanSuite, we are good to go.

@liancheng liancheng force-pushed the spark-10978.unhandled-filters branch from ddac7ac to 326ea24 Compare November 3, 2015 14:52
@liancheng
Contributor Author

retest this please

@yhuai
Contributor

yhuai commented Nov 3, 2015

test this please

@SparkQA

SparkQA commented Nov 3, 2015

Test build #44928 has finished for PR 9399 at commit 326ea24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2015

Test build #44927 has finished for PR 9399 at commit ddac7ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Nov 3, 2015

I will merge it once it passes jenkins. Let's have a test to make sure those handled filters will not show up in the Filter operator.

@SparkQA

SparkQA commented Nov 3, 2015

Test build #44933 has finished for PR 9399 at commit 92dfc55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Nov 3, 2015

Thanks! Merging!

@asfgit asfgit closed this in ebf8b0b Nov 3, 2015
@yhuai
Contributor

yhuai commented Nov 3, 2015

Let's also have some test cases where a column is used in handled filters as well as in unhandled/unconvertible filters.

@SparkQA

SparkQA commented Nov 3, 2015

Test build #44934 has finished for PR 9399 at commit 92dfc55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng deleted the spark-10978.unhandled-filters branch November 4, 2015 00:54
liancheng added a commit to liancheng/spark that referenced this pull request Nov 4, 2015
asfgit pushed a commit that referenced this pull request Nov 6, 2015
This PR adds test cases that test various column pruning and filter push-down cases.

Author: Cheng Lian <lian@databricks.com>

Closes #9468 from liancheng/spark-10978.follow-up.

(cherry picked from commit c048929)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 6, 2015
This PR adds test cases that test various column pruning and filter push-down cases.

Author: Cheng Lian <lian@databricks.com>

Closes #9468 from liancheng/spark-10978.follow-up.
JoshRosen added a commit to databricks/spark-redshift that referenced this pull request Nov 24, 2015
Spark 1.6 extends `BaseRelation` with a new API which allows data sources to tell Spark which filters they handle, allowing Spark to eliminate its own defensive filtering for filters that are handled by the data source (see apache/spark#9399 for more details).

This patch implements this new API in `spark-redshift` and adds tests.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #128 from JoshRosen/support-filter-skipping-in-spark-1.6.