
Commit

Merge branch 'develop'
kupferk committed Oct 13, 2021
2 parents d9b6215 + 4540007 commit 27de909
Showing 465 changed files with 30,800 additions and 6,211 deletions.
14 changes: 14 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
@@ -62,6 +62,7 @@ build-default:

build-hadoop2.6-spark2.4:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.4 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-hadoop2.6-spark2.4"
@@ -71,6 +72,7 @@ build-hadoop2.6-spark2.4:

build-hadoop2.7-spark2.4:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.7 -Pspark-2.4 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-hadoop2.9-spark2.4"
@@ -80,6 +82,7 @@ build-hadoop2.7-spark2.4:

build-hadoop3.1-spark2.4:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-3.1 -Pspark-2.4 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-hadoop3.1-spark2.4"
@@ -125,9 +128,20 @@ build-hadoop3.2-spark3.1:

build-cdh6.3:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-6.3 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-cdh6.3"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days

build-cdp7.1:
stage: build
image: dimajix/maven-npm:jdk-1.8
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDP-7.1 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-cdp7.1"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days
4 changes: 4 additions & 0 deletions .travis.yml
@@ -46,3 +46,7 @@ jobs:
- name: CDH 6.3
jdk: openjdk8
script: mvn clean install -PCDH-6.3 -Ddockerfile.skip

- name: CDP 7.1
jdk: openjdk8
script: mvn clean install -PCDP-7.1 -Ddockerfile.skip
61 changes: 49 additions & 12 deletions BUILDING.md
@@ -6,17 +6,16 @@ is installed on the build machine.
## Prerequisites

You need the following tools installed on your machine:
* JDK 1.8 or later. If you build a variant with Scala 2.11, you have to use JDK 1.8 (and not anything newer like
Java 11). This mainly affects builds with Spark 2.x
* JDK 1.8 or later - but not too new (Java 16 is currently not supported)
* Apache Maven (install via package manager or download from https://maven.apache.org/download.cgi)
* npm (install via package manager or download from https://www.npmjs.com/get-npm)
* Windows users also need Hadoop winutils installed. Those can be retrieved from /~https://github.com/cdarlint/winutils.
See some additional details for building on Windows below.


# Build with Maven

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
Building Flowman with the default settings (i.e. the newest supported Spark and Hadoop versions) is as easy as

```shell
mvn clean install
@@ -28,9 +27,29 @@ The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz
version of Flowman for direct installation in cases where Docker is not available or when you want to run Flowman
in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`

## Skip Tests

In case you don't want to run tests, you can simply append `-DskipTests`

```shell
mvn clean install -DskipTests
```

## Skip Docker Image

In case you don't want to build the Docker image (for example when the build itself is done within a Docker container),
you can simply append `-Ddockerfile.skip`

```shell
mvn clean install -Ddockerfile.skip
```


# Custom Builds

Flowman supports various versions of Spark and Hadoop to match your requirements and your environment. By providing
appropriate build profiles, you can easily create a custom build.

## Build on Windows

Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
@@ -47,12 +66,7 @@ value "core.autocrlf" to "input"
git config --global core.autocrlf input
```

You might also want to skip unit tests (the HBase plugin is currently failing under Windows)

```shell
mvn clean install -DskipTests
```


It may well be the case that some unit tests fail on Windows - don't panic; we focus on Linux systems and ensure that
the `master` branch builds cleanly with all unit tests passing on Linux.

@@ -86,9 +100,22 @@ using the correct version. The following profiles are available:
* hadoop-3.1
* hadoop-3.2
* CDH-6.3
* CDP-7.1

With these profiles it is easy to build Flowman to match your environment.


## Building for a specific Java Version

If nothing else is set on the command line, Flowman will now build for Java 11 (except when building the profile
CDH-6.3, where Java 1.8 is used). If you are still stuck on Java 1.8, you can simply override the Java version by
specifying the property `java.version`

```shell
mvn install -Djava.version=1.8
```


## Building for Open Source Hadoop and Spark

### Spark 2.4 and Hadoop 2.6:
@@ -135,17 +162,27 @@ mvn clean install -Pspark-3.1 -Phadoop-3.2

## Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3 and for Cloudera CDP 7.1.

```shell
mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
mvn clean install -PCDH-6.3 -DskipTests
```

```shell
mvn clean install -PCDP-7.1 -DskipTests
```


# Coverage Analysis

Flowman now also supports creating a coverage analysis via the scoverage Maven plugin. It is not part of the default
build and has to be triggered explicitly:

```shell
mvn scoverage:report
```


# Building Documentation

Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.
31 changes: 31 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,34 @@
# Version 0.18.0 - 2021-10-13

* Improve automatic schema migration for Hive and JDBC relations
* Improve support of CHAR(n) and VARCHAR(n) types. Those types will now be propagated to Hive with newer Spark versions
* Support writing to dynamic partitions for file relations, Hive tables, JDBC relations and Delta tables
* Fix the name of some config variables (floman.* => flowman.*)
* Added new config variables `flowman.default.relation.migrationPolicy` and `flowman.default.relation.migrationStrategy`
* Add plugin for supporting DeltaLake (https://delta.io), which provides `deltaTable` and `deltaFile` relation types
* Fix non-deterministic column order in `schema` mapping, `values` mapping and `values` relation
* Mark Hive dependencies as 'provided', which reduces the size of dist packages
* Significantly reduce size of AWS dependencies in AWS plugin
* Add new build profile for Cloudera CDP-7.1
* Improve Spark configuration of `LocalSparkSession` and `TestRunner`
* Update Spark 3.0 build profile to Spark 3.0.3
* Upgrade Impala JDBC driver from 2.6.17.1020 to 2.6.23.1028
* Upgrade MySQL JDBC driver from 8.0.20 to 8.0.25
* Upgrade MariaDB JDBC driver from 2.2.4 to 2.7.3
* Upgrade several Maven plugins to latest versions
* Add new config option `flowman.workaround.analyze_partition` to work around CDP 7.1 issues
* Fix migrating Hive views to tables and vice-versa
* Add new option "-j <n>" to allow running multiple job instances in parallel
* Add new option "-j <n>" to allow running multiple tests in parallel
* Add new `uniqueKey` assertion
* Add new `schema` assertion
* Update Swagger libraries for `swagger` schema
* Implement new `openapi` plugin to support OpenAPI 3.0 schemas
* Add new `readHive` mapping
* Add new `simpleReport` and `report` hooks
* Implement new templates


# Version 0.17.1 - 2021-06-18

* Bump CDH version to 6.3.4
3 changes: 1 addition & 2 deletions INSTALLING.md
@@ -215,8 +215,7 @@ config:
- datanucleus.rdbms.datastoreAdapterClassName=org.datanucleus.store.rdbms.adapter.DerbyAdapter
plugins:
- flowman-example
- flowman-hbase
- flowman-delta
- flowman-aws
- flowman-azure
- flowman-kafka
5 changes: 4 additions & 1 deletion README.md
@@ -4,6 +4,8 @@
[![Build Status](https://travis-ci.org/dimajix/flowman.svg?branch=develop)](https://travis-ci.org/dimajix/flowman)
[![Documentation](https://readthedocs.org/projects/flowman/badge/?version=latest)](https://flowman.readthedocs.io/en/latest/)

[Flowman.io](https://flowman.io)

Flowman is a Spark based ETL program that simplifies the act of writing data transformations.
The main idea is that users write so called *specifications* in purely declarative YAML files
instead of writing Spark jobs in Scala or Python. The main advantage of this approach is that
@@ -27,7 +29,8 @@ and schema information) in a single place managed by a single program.

## Documentation

You can find comprehensive documentation at [Read the Docs](https://flowman.readthedocs.io/en/latest/).
You can find the official homepage at [Flowman.io](https://flowman.io)
and comprehensive documentation at [Read the Docs](https://flowman.readthedocs.io/en/latest/).


# Installation
3 changes: 2 additions & 1 deletion build-release.sh
@@ -9,7 +9,7 @@ build_profile() {
do
profiles="$profiles -P$p"
done
mvn clean install $profiles -DskipTests
mvn clean install $profiles -DskipTests -Ddockerfile.skip
cp flowman-dist/target/flowman-dist-*.tar.gz release
}

@@ -20,3 +20,4 @@ build_profile hadoop-3.2 spark-3.0
build_profile hadoop-2.7 spark-3.1
build_profile hadoop-3.2 spark-3.1
build_profile CDH-6.3
build_profile CDP-7.1
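As an aside, the argument-to-profile expansion inside `build_profile` can be sketched in isolation (plain POSIX shell, no Maven required; the profile names are just examples taken from the script above):

```shell
# Sketch: how build_profile turns its arguments into Maven -P flags
profiles=""
for p in hadoop-3.2 spark-3.1; do
    # each positional argument becomes one -P<profile> flag
    profiles="$profiles -P$p"
done
echo "mvn clean install$profiles -DskipTests -Ddockerfile.skip"
# prints: mvn clean install -Phadoop-3.2 -Pspark-3.1 -DskipTests -Ddockerfile.skip
```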
3 changes: 3 additions & 0 deletions docker/conf/default-namespace.yml
@@ -26,8 +26,11 @@ store:
plugins:
- flowman-aws
- flowman-azure
- flowman-delta
- flowman-kafka
- flowman-mariadb
- flowman-mysql
- flowman-mssqlserver
- flowman-swagger
- flowman-openapi
- flowman-json
8 changes: 7 additions & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.17.1</version>
<version>0.18.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

@@ -27,6 +27,12 @@
<docker.base-image.version>2.4.5</docker.base-image.version>
</properties>
</profile>
<profile>
<id>CDP-7.1</id>
<properties>
<docker.base-image.version>2.4.5</docker.base-image.version>
</properties>
</profile>
</profiles>

<build>
46 changes: 44 additions & 2 deletions docs/building.md
@@ -32,6 +32,23 @@ Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as

mvn clean install

### Skip Tests

In case you don't want to run tests, you can simply append `-DskipTests`

```shell
mvn clean install -DskipTests
```

### Skip Docker Image

In case you don't want to build the Docker image (for example when the build itself is done within a Docker container),
you can simply append `-Ddockerfile.skip`

```shell
mvn clean install -Ddockerfile.skip
```

## Main Artifacts

The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a runnable
@@ -41,6 +58,9 @@ in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`

## Custom Builds

Flowman supports various versions of Spark and Hadoop to match your requirements and your environment. By providing
appropriate build profiles, you can easily create a custom build.

### Build on Windows

Although you can normally build Flowman on Windows, you will need the Hadoop WinUtils installed. You can download
@@ -86,6 +106,18 @@ using the correct version. The following profiles are available:

With these profiles it is easy to build Flowman to match your environment.


### Building for a specific Java Version

If nothing else is set on the command line, Flowman will now build for Java 11 (except when building the profile
CDH-6.3, where Java 1.8 is used). If you are still stuck on Java 1.8, you can simply override the Java version by
specifying the property `java.version`

```shell
mvn install -Djava.version=1.8
```


### Building for Open Source Hadoop and Spark

Spark 2.4 and Hadoop 2.6:
@@ -119,11 +151,21 @@ Spark 3.1 and Hadoop 3.2

### Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3 and for CDP 7.1.

mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
mvn clean install -PCDH-6.3 -DskipTests
mvn clean install -PCDP-7.1 -DskipTests


## Coverage Analysis

Flowman now also supports creating a coverage analysis via the scoverage Maven plugin. It is not part of the default
build and has to be triggered explicitly:

```shell
mvn scoverage:report
```

## Building Documentation

Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.
14 changes: 13 additions & 1 deletion docs/cli/flowexec.md
@@ -47,10 +47,22 @@ This will execute the whole job by executing the desired lifecycle for the `main` job.
* `-nl` or `--no-lifecycle` only executes the specified lifecycle phase, without all preceding phases. For example
the whole lifecycle for `verify` includes the phases `create` and `build` and these phases would be executed before
`verify`. If this is not what you want, then use the option `-nl`
* `-j <n>` runs multiple job instances in parallel. This is very useful for running a job for a whole range of dates.
* `-t <target>` only executes the given target(s), which are specified as a RegEx.

The following example will only execute the `BUILD` phase of the job `daily`, which defines a parameter
`processing_datetime` of type datetime. The job will be executed for the whole date range from 2021-06-01 until
2021-08-10 with a step size of one day. Flowman will execute up to four jobs in parallel (`-j 4`).

```
flowexec job build daily processing_datetime:start=2021-06-01T00:00 processing_datetime:end=2021-08-10T00:00 processing_datetime:step=P1D --target parquet_lineitem --no-lifecycle -j 4
```
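To make the range semantics concrete, here is a small shell sketch (illustrative only, not Flowman code) of how a `start`/`end`/`step` range with step `P1D` expands into one job instance per day; GNU `date` is assumed, and the dates are shortened for brevity:

```shell
# Expand a daily date range like processing_datetime:start/end with step=P1D
start="2021-06-01"
end="2021-06-04"
d="$start"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
    # each iteration corresponds to one job instance Flowman would run
    echo "job instance: processing_datetime=$d"
    d="$(date -I -d "$d + 1 day")"
done
```

With `-j 4`, up to four of these instances would be executed concurrently instead of one after another.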


## Target Commands
It is also possible to perform actions on individual targets using the `target` command group.
It is also possible to perform actions on individual targets using the `target` command group. In most cases this is
inferior to using the `job` interface above, since typical jobs will also define appropriate environment variables
which might be required by targets.

### List Targets
```shell script
