Package

org.apache.spark.rdd

mergejoin

Permalink

package mergejoin

Visibility
  1. Public
  2. All

Type Members

  1. class MergeJoin[K, V, W, O] extends Iterator[O]

    Permalink

    :: DeveloperApi ::

    :: DeveloperApi ::

    Merge-join implementation that will create a spill-able collection for the right-side to be iterated over for each matching key on the left side. This enables joins that don't require that values for any given key to be required to fit in memory, but it *will* try to buffer as many values as possible using Spark's built-in 'ExternalSorter', and it's a private class so that's why this class is packaged here.

    There are numerous optimizations in place to try to minimize the work being done in the join:

    1) Since the Joiner returns a value Iterator for each key, we don't need to invoke join logic on each value iteration-- instead we perform join logic for each unique key. This also allows each 'Joiner' to optimize via the methods below based on the join being performed.

    2) In cases were we have a key on one side but not the other, we skip creation of the spillable collection and write the output tuples directly according to the Joiner's leftOuter/rightOuter method.

    3) In cases where there are no values to emit for a particular key, the Joiner can emit an empty Iterator, in which case we will immediately move to the next key without emitting+filtering tuples for those values.

    4) In cases where we have a key on both sides, we invoke the Joiner's inner method. The default implementation will create a spill-able collection for the right side that will buffer as many values as possible in memory before spilling to disk... so we only pay the penalty for spilling to disk on keys where it is absolutely necessary.

    Annotations
    @DeveloperApi()

Value Members

  1. object MergeJoin

    Permalink

Ungrouped