
Commit

Merge branch 'develop'
kupferk committed Oct 13, 2021
2 parents d9b6215 + 4540007 commit 27de909
Showing 465 changed files with 30,800 additions and 6,211 deletions.
14 changes: 14 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
@@ -62,6 +62,7 @@ build-default:

build-hadoop2.6-spark2.4:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.4 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-hadoop2.6-spark2.4"
@@ -71,6 +72,7 @@ build-hadoop2.6-spark2.4:

build-hadoop2.7-spark2.4:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.7 -Pspark-2.4 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-hadoop2.9-spark2.4"
@@ -80,6 +82,7 @@ build-hadoop2.7-spark2.4:

build-hadoop3.1-spark2.4:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-3.1 -Pspark-2.4 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-hadoop3.1-spark2.4"
@@ -125,9 +128,20 @@ build-hadoop3.2-spark3.1:

build-cdh6.3:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-6.3 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-cdh6.3"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days

build-cdp7.1:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDP-7.1 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-cdp7.1"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days
4 changes: 4 additions & 0 deletions .travis.yml
@@ -46,3 +46,7 @@ jobs:
- name: CDH 6.3
jdk: openjdk8
script: mvn clean install -PCDH-6.3 -Ddockerfile.skip

- name: CDP 7.1
jdk: openjdk8
script: mvn clean install -PCDP-7.1 -Ddockerfile.skip
61 changes: 49 additions & 12 deletions BUILDING.md
@@ -6,17 +6,16 @@ is installed on the build machine.
## Prerequisites

You need the following tools installed on your machine:
* JDK 1.8 or later. If you build a variant with Scala 2.11, you have to use JDK 1.8 (and not anything newer like
Java 11). This mainly affects builds with Spark 2.x
* JDK 1.8 or later - but not too new (Java 16 is currently not supported)
* Apache Maven (install via package manager or download from https://maven.apache.org/download.cgi)
* npm (install via package manager or download from https://www.npmjs.com/get-npm)
* Windows users also need Hadoop winutils installed. Those can be retrieved from /~https://github.com/cdarlint/winutils.
See some additional details for building on Windows below.


# Build with Maven

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
Building Flowman with the default settings (i.e. the newest supported Spark and Hadoop versions) is as easy as

```shell
mvn clean install
@@ -28,9 +27,29 @@ The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz
version of Flowman for direct installation in cases where Docker is not available or when you want to run Flowman
in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`

## Skip Tests

In case you don't want to run tests, you can simply append `-DskipTests`

```shell
mvn clean install -DskipTests
```

## Skip Docker Image

In case you don't want to build the Docker image (for example when the build itself is done within a Docker container),
you can simply append `-Ddockerfile.skip`

```shell
mvn clean install -Ddockerfile.skip
```


# Custom Builds

Flowman supports various versions of Spark and Hadoop to match your requirements and your environment. By providing
appropriate build profiles, you can easily create a custom build.

## Build on Windows

Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
@@ -47,12 +66,7 @@ value "core.autocrlf" to "input"
git config --global core.autocrlf input
```

You might also want to skip unit tests (the HBase plugin is currently failing under Windows)

```shell
mvn clean install -DskipTests
```


It may well be the case that some unit tests fail on Windows - don't panic; we focus on Linux systems and ensure that
the `master` branch builds cleanly with all unit tests passing on Linux.

@@ -86,9 +100,22 @@ using the correct version. The following profiles are available:
* hadoop-3.1
* hadoop-3.2
* CDH-6.3
* CDP-7.1

With these profiles it is easy to build Flowman to match your environment.


## Building for a specific Java Version

If nothing else is set on the command line, Flowman will now build for Java 11 (except when building the profile
CDH-6.3, where Java 1.8 is used). If you are still stuck on Java 1.8, you can simply override the Java version by
specifying the property `java.version`

```shell
mvn install -Djava.version=1.8
```


## Building for Open Source Hadoop and Spark

### Spark 2.4 and Hadoop 2.6:
@@ -135,17 +162,27 @@ mvn clean install -Pspark-3.1 -Phadoop-3.2

## Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3 and for Cloudera CDP 7.1.

```shell
mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
mvn clean install -PCDH-6.3 -DskipTests
```

```shell
mvn clean install -PCDP-7.1 -DskipTests
```


# Coverage Analysis

Flowman now also supports creating a coverage analysis via the scoverage Maven plugin. It is not part of the default
build and has to be triggered explicitly:

```shell
mvn scoverage:report
```


# Building Documentation

Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.
31 changes: 31 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,34 @@
# Version 0.18.0 - 2021-10-13

* Improve automatic schema migration for Hive and JDBC relations
* Improve support of CHAR(n) and VARCHAR(n) types. Those types will now be propagated to Hive with newer Spark versions
* Support writing to dynamic partitions for file relations, Hive tables, JDBC relations and Delta tables
* Fix the name of some config variables (floman.* => flowman.*)
* Added new config variables `flowman.default.relation.migrationPolicy` and `flowman.default.relation.migrationStrategy`
* Add plugin for supporting DeltaLake (https://delta.io), which provides `deltaTable` and `deltaFile` relation types
* Fix non-deterministic column order in `schema` mapping, `values` mapping and `values` relation
* Mark Hive dependencies as 'provided', which reduces the size of dist packages
* Significantly reduce size of AWS dependencies in AWS plugin
* Add new build profile for Cloudera CDP-7.1
* Improve Spark configuration of `LocalSparkSession` and `TestRunner`
* Update Spark 3.0 build profile to Spark 3.0.3
* Upgrade Impala JDBC driver from 2.6.17.1020 to 2.6.23.1028
* Upgrade MySQL JDBC driver from 8.0.20 to 8.0.25
* Upgrade MariaDB JDBC driver from 2.2.4 to 2.7.3
* Upgrade several Maven plugins to latest versions
* Add new config option `flowman.workaround.analyze_partition` to work around CDP 7.1 issues
* Fix migrating Hive views to tables and vice-versa
* Add new option "-j <n>" to allow running multiple job instances in parallel
* Add new option "-j <n>" to allow running multiple tests in parallel
* Add new `uniqueKey` assertion
* Add new `schema` assertion
* Update Swagger libraries for `swagger` schema
* Implement new `openapi` plugin to support OpenAPI 3.0 schemas
* Add new `readHive` mapping
* Add new `simpleReport` and `report` hooks
* Implement new templates


# Version 0.17.1 - 2021-06-18

* Bump CDH version to 6.3.4
3 changes: 1 addition & 2 deletions INSTALLING.md
@@ -215,8 +215,7 @@ config:
- datanucleus.rdbms.datastoreAdapterClassName=org.datanucleus.store.rdbms.adapter.DerbyAdapter
plugins:
- flowman-example
- flowman-hbase
- flowman-delta
- flowman-aws
- flowman-azure
- flowman-kafka
5 changes: 4 additions & 1 deletion README.md
@@ -4,6 +4,8 @@
[![Build Status](https://travis-ci.org/dimajix/flowman.svg?branch=develop)](https://travis-ci.org/dimajix/flowman)
[![Documentation](https://readthedocs.org/projects/flowman/badge/?version=latest)](https://flowman.readthedocs.io/en/latest/)

[Flowman.io](https://flowman.io)

Flowman is a Spark based ETL program that simplifies the act of writing data transformations.
The main idea is that users write so called *specifications* in purely declarative YAML files
instead of writing Spark jobs in Scala or Python. The main advantage of this approach is that
@@ -27,7 +29,8 @@ and schema information) in a single place managed by a single program.

## Documentation

You can find comprehensive documentation at [Read the Docs](https://flowman.readthedocs.io/en/latest/).
You can find the official homepage at [Flowman.io](https://flowman.io)
and comprehensive documentation at [Read the Docs](https://flowman.readthedocs.io/en/latest/).


# Installation
3 changes: 2 additions & 1 deletion build-release.sh
@@ -9,7 +9,7 @@ build_profile() {
do
profiles="$profiles -P$p"
done
mvn clean install $profiles -DskipTests
mvn clean install $profiles -DskipTests -Ddockerfile.skip
cp flowman-dist/target/flowman-dist-*.tar.gz release
}

@@ -20,3 +20,4 @@ build_profile hadoop-3.2 spark-3.0
build_profile hadoop-2.7 spark-3.1
build_profile hadoop-3.2 spark-3.1
build_profile CDH-6.3
build_profile CDP-7.1
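As an aside, the argument-to-profile expansion inside `build_profile` can be sketched in isolation (plain POSIX shell, no Maven required; the profile names are just examples taken from the script above):

```shell
# Sketch: how build_profile turns its arguments into Maven -P flags
profiles=""
for p in hadoop-3.2 spark-3.1; do
    # each positional argument becomes one -P<profile> flag
    profiles="$profiles -P$p"
done
echo "mvn clean install$profiles -DskipTests -Ddockerfile.skip"
# prints: mvn clean install -Phadoop-3.2 -Pspark-3.1 -DskipTests -Ddockerfile.skip
```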
3 changes: 3 additions & 0 deletions docker/conf/default-namespace.yml
@@ -26,8 +26,11 @@ store:
plugins:
- flowman-aws
- flowman-azure
- flowman-delta
- flowman-kafka
- flowman-mariadb
- flowman-mysql
- flowman-mssqlserver
- flowman-swagger
- flowman-openapi
- flowman-json
8 changes: 7 additions & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.17.1</version>
<version>0.18.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

@@ -27,6 +27,12 @@
<docker.base-image.version>2.4.5</docker.base-image.version>
</properties>
</profile>
<profile>
<id>CDP-7.1</id>
<properties>
<docker.base-image.version>2.4.5</docker.base-image.version>
</properties>
</profile>
</profiles>

<build>
46 changes: 44 additions & 2 deletions docs/building.md
@@ -32,6 +32,23 @@ Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as

mvn clean install

### Skip Tests

In case you don't want to run tests, you can simply append `-DskipTests`

```shell
mvn clean install -DskipTests
```

### Skip Docker Image

In case you don't want to build the Docker image (for example when the build itself is done within a Docker container),
you can simply append `-Ddockerfile.skip`

```shell
mvn clean install -Ddockerfile.skip
```

## Main Artifacts

The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a runnable
@@ -41,6 +58,9 @@ in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`

## Custom Builds

Flowman supports various versions of Spark and Hadoop to match your requirements and your environment. By providing
appropriate build profiles, you can easily create a custom build.

### Build on Windows

Although you can normally build Flowman on Windows, you will need the Hadoop WinUtils installed. You can download
@@ -86,6 +106,18 @@ using the correct version. The following profiles are available:

With these profiles it is easy to build Flowman to match your environment.


### Building for a specific Java Version

If nothing else is set on the command line, Flowman will now build for Java 11 (except when building the profile
CDH-6.3, where Java 1.8 is used). If you are still stuck on Java 1.8, you can simply override the Java version by
specifying the property `java.version`

```shell
mvn install -Djava.version=1.8
```


### Building for Open Source Hadoop and Spark

Spark 2.4 and Hadoop 2.6:
@@ -119,11 +151,21 @@ Spark 3.1 and Hadoop 3.2

### Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3 and for CDP 7.1.

mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
mvn clean install -PCDH-6.3 -DskipTests
mvn clean install -PCDP-7.1 -DskipTests


## Coverage Analysis

Flowman now also supports creating a coverage analysis via the scoverage Maven plugin. It is not part of the default
build and has to be triggered explicitly:

```shell
mvn scoverage:report
```

## Building Documentation

Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.
14 changes: 13 additions & 1 deletion docs/cli/flowexec.md
@@ -47,10 +47,22 @@ This will execute the whole job by executing the desired lifecycle for the `main` job.
* `-nl` or `--no-lifecycle` only executes the specified lifecycle phase, without all preceding phases. For example
the whole lifecycle for `verify` includes the phases `create` and `build` and these phases would be executed before
`verify`. If this is not what you want, then use the option `-nl`
* `-j <n>` runs multiple job instances in parallel. This is very useful for running a job for a whole range of dates.
* `-t <target>` only executes the given target(s), which are specified as a RegEx.

The following example will only execute the `BUILD` phase of the job `daily`, which defines a parameter
`processing_datetime` of type datetime. The job will be executed for the whole date range from 2021-06-01 until
2021-08-10 with a step size of one day. Flowman will execute up to four jobs in parallel (`-j 4`).

```
flowexec job build daily processing_datetime:start=2021-06-01T00:00 processing_datetime:end=2021-08-10T00:00 processing_datetime:step=P1D --target parquet_lineitem --no-lifecycle -j 4
```
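To make the range semantics concrete, here is a small shell sketch (illustrative only, not Flowman code) of how a `start`/`end`/`step` range with step `P1D` expands into one job instance per day; GNU `date` is assumed, and the dates are shortened for brevity:

```shell
# Expand a daily date range like processing_datetime:start/end with step=P1D
start="2021-06-01"
end="2021-06-04"
d="$start"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
    # each iteration corresponds to one job instance Flowman would run
    echo "job instance: processing_datetime=$d"
    d="$(date -I -d "$d + 1 day")"
done
```

With `-j 4`, up to four of these instances would be executed concurrently instead of one after another.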


## Target Commands
It is also possible to perform actions on individual targets using the `target` command group.
It is also possible to perform actions on individual targets using the `target` command group. In most cases this is
inferior to using the `job` interface above, since typical jobs will also define appropriate environment variables
which might be required by targets.

### List Targets
```shell script
