Add an option in joins to specify row order #3233

Merged
merged 14 commits into from
Dec 24, 2022
11 changes: 10 additions & 1 deletion NEWS.md
@@ -3,10 +3,19 @@
## New functionalities

* Add `Iterators.partition` support
([#3212](/~https://github.com/JuliaData/DataFrames.jl/pull/3212))
* Add `allunique` and allow transformations in `cols` argument of `describe`
and `nonunique` when working with `SubDataFrame`
([#3232](/~https://github.com/JuliaData/DataFrames.jl/pull/3232))
* Joining functions now support `order` keyword argument allowing the user
to specify the order of the rows in the produced table
([#3233](/~https://github.com/JuliaData/DataFrames.jl/pull/3233))

## Bug fixes

* Passing a very large number of data frames to `innerjoin` and `outerjoin`
no longer leads to a stack overflow
([#3233](/~https://github.com/JuliaData/DataFrames.jl/pull/3233))

# DataFrames.jl v1.4.4 Patch Release Notes

251 changes: 235 additions & 16 deletions docs/src/man/joins.md
@@ -1,6 +1,10 @@
# Database-Style Joins

## Introduction to joins

We often need to combine two or more data sets together to provide a complete
picture of the topic we are studying. For example, suppose that we have the
following two data sets:

```jldoctest joins
julia> using DataFrames
@@ -22,7 +26,8 @@ julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
2 │ 40 Doctor
```

We might want to work with a larger data set that contains both the names and
jobs for each ID. We can do this using the `innerjoin` function:

```jldoctest joins
julia> innerjoin(people, jobs, on = :ID)
@@ -34,21 +39,29 @@ julia> innerjoin(people, jobs, on = :ID)
2 │ 40 Jane Doe Doctor
```

In relational database theory, this operation is generally referred to as a
join. The columns used to determine which rows should be combined during a join
are called keys.

The following functions are provided to perform seven kinds of joins:

- `innerjoin`: the output contains rows for values of the key that exist in all
passed data frames.
- `leftjoin`: the output contains rows for values of the key that exist in the
first (left) argument, whether or not that value exists in the second (right)
argument.
- `rightjoin`: the output contains rows for values of the key that exist in the
second (right) argument, whether or not that value exists in the first (left)
argument.
- `outerjoin`: the output contains rows for values of the key that exist in any
of the passed data frames.
- `semijoin`: like an inner join, but the output is restricted to columns from
the first (left) argument.
- `antijoin`: the output contains rows for values of the key that exist in the
first (left) but not the second (right) argument. As with `semijoin`, the
output is restricted to columns from the first (left) argument.
- `crossjoin`: the output is the Cartesian product of rows from all passed data
frames.

See [the Wikipedia page on SQL joins](https://en.wikipedia.org/wiki/Join_(SQL)) for more information.

@@ -124,8 +137,10 @@ julia> crossjoin(people, jobs, makeunique = true)
4 │ 40 Jane Doe 60 Astronaut
```

## Joining on key columns with different names

In order to join data frames on keys which have different names in the left and
right tables, you may pass `left => right` pairs as the `on` argument:

```jldoctest joins
julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
@@ -198,6 +213,8 @@ julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work])
9 │ New York Doctor 5 e
```

## Handling of duplicate keys and tracking source data frame

Additionally, notice that in the last join rows 2 and 3 had the same values on
`on` variables in both joined `DataFrame`s. In such a situation `innerjoin`,
`outerjoin`, `leftjoin` and `rightjoin` will produce all combinations of
@@ -248,3 +265,205 @@ julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source)

Note that this time we also used the `validate` keyword argument and it did not
produce errors as the keys defined in both source data frames were unique.

## Renaming joined columns

Often you want to keep track of which source data frame a given column comes
from. This is supported by the `renamecols` keyword argument:

```jldoctest joins
julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
1×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String
─────┼─────────────────────────────
1 │ 20 John Lawyer
```

In the above example we added the `"_left"` suffix to the non-key columns from
the left table and the `"_right"` suffix to the non-key columns from the right
table.

Alternatively, you can pass functions that transform the column names:

```jldoctest joins
julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
1×3 DataFrame
Row │ ID name JOB
│ Int64 String String
─────┼───────────────────────
1 │ 20 John Lawyer

```

## Matching missing values in joins

By default, when you try to perform a join on a key that has `missing` values
you get an error:

```jldoctest joins
julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
3×2 DataFrame
Row │ id a
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ 3 3

julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
3×2 DataFrame
Row │ id b
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ 2 2
3 │ missing 3

julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
```

If you would prefer `missing` values to be matched as equal, pass the
`matchmissing=:equal` keyword argument:

```jldoctest joins
julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
2×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 1 1 1
2 │ missing 2 3
```

Alternatively, you might want to drop all rows with `missing` values in the key
column. In this case pass `matchmissing=:notequal`:

```jldoctest joins
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
─────┼──────────────────────
1 │ 1 1 1
```

## Specifying row order in the join result

By default, the order of rows produced by a join operation is undefined:

```jldoctest joins
julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
Row │ id left
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 4 3
4 │ 5 4

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
Row │ id right
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 1 2
3 │ 3 3
4 │ 6 4
5 │ 7 5

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
2 │ 1 1 2
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
```

If you would like the result to keep the row order of the left table, pass the
`order=:left` keyword argument:

```jldoctest joins
julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 2
2 │ 2 2 1
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
```

Note that in this case keys missing from the left table are put after the keys
present in the left table.

Similarly, `order=:right` keeps the order of the right table (and puts keys
not present in it at the end):

```jldoctest joins
julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
2 │ 1 1 2
3 │ 3 missing 3
4 │ 6 missing 4
5 │ 7 missing 5
6 │ 4 3 missing
7 │ 5 4 missing
```
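
The `order` keyword is shown above only for `outerjoin`. The following is a
minimal sketch, assuming a DataFrames.jl version that includes this change and
assuming that `innerjoin` accepts `order` in the same way (the NEWS entry says
that joining functions support it). It rebuilds the `df_left` and `df_right`
tables from above and only inspects the resulting `id` column:

```julia
using DataFrames

df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)

# Only keys 1 and 2 are present in both tables, so both results have two rows.
by_left = innerjoin(df_left, df_right, on=:id, order=:left)
by_right = innerjoin(df_left, df_right, on=:id, order=:right)

# The keys follow the row order of the table named in `order`.
by_left.id   # [1, 2], the order in df_left
by_right.id  # [2, 1], the order in df_right
```

Passing `order` explicitly is mainly useful when later code depends on row
positions, since the default order is documented as undefined.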

## In-place left join

A common operation is adding data from a reference table to some main table.
It is possible to perform such an in-place update using the `leftjoin!`
function. In this case the left table is updated in place with matching rows
from the right table.

```jldoctest joins
julia> main = DataFrame(id=1:4, main=1:4)
4×2 DataFrame
Row │ id main
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4

julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);

julia> main
4×3 DataFrame
Row │ id main info
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 1 1 missing
2 │ 2 2 a
3 │ 3 3 missing
4 │ 4 4 b
```

Note that in this case the order and number of rows in the left table are not
changed. Therefore, in particular, duplicate keys are not allowed in the right
table:

```
julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
ERROR: ArgumentError: duplicate rows found in right table
```
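
If the reference table may contain duplicate keys, one option is to check for
them and deduplicate before the in-place join. The sketch below is only an
illustration: the `ref` table is hypothetical, `allunique` is the Base function
applied to the key column, and keeping the first row for each key (which is
what `unique(ref, :id)` does) is an assumption that may not suit your data:

```julia
using DataFrames

main = DataFrame(id=1:4, main=1:4)

# Hypothetical reference table with a duplicated key (id == 2).
ref = DataFrame(id=[2, 2, 4], info=["a", "b", "c"])

# leftjoin! rejects duplicate keys in the right table, so check first.
if !allunique(ref.id)
    # Keep the first row for each key; this choice is an assumption of the sketch.
    ref = unique(ref, :id)
end

leftjoin!(main, ref, on=:id)

main.info  # [missing, "a", missing, "c"]
```

Whether silently dropping rows is acceptable depends on the data; raising an
error when `allunique` returns `false` is the more conservative choice.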
