LINQ gems: Left Outer Join

Doing a left outer join in T-SQL is as straightforward as adding a single keyword to the JOIN statement. I remember that when I first tried to perform a left outer join in a LINQ query, I was honestly surprised about how confusing the syntax was. Even now, after years of using it, I sometimes hesitate while trying to remember the exact syntax.

In LINQ, the left outer join is accomplished in 2 phases:

  1. Performing a group join between the first and second collection
  2. Running the Enumerable.DefaultIfEmpty extension method on the expanded group join results

A true left outer join "experience" can only be accomplished by executing both steps. The group join by itself however produces very similar results and sometimes it may even be a better option than doing the actual left outer join.

Let's first have a look at just the group join.

Group Join

One easy way of how to accomplish a behavior similar to a left outer join is using the group join. The group join simply pairs elements from the first (left) collection with a set of matching elements from the second (right) collection, producing a correlated subset of right elements for each member of the left collection. If there are no correlated elements found for a member of the left collection, the sequence of correlated elements is empty.

Even though this is not a true left outer join (as known from the SQL syntax), the group join is often quite useful and, to be honest, many times I'd wish that this type of join was (in some form) available in T-SQL as well.

Consider this example where a collection of days in a week is joined to another collection containing delivery times for a particular day. We're interested in outputting all of the days, regardless of whether a delivery exists for that day or not. For that reason, we're using a group join here:

// Test data
var days = new[] { "Mon", "Tue", "Wed", "Thu", "Fri" };
var deliveries = new[]
{
    (day: "Mon", time: "8AM"),
    (day: "Thu", time: "10AM"),
    (day: "Thu", time: "2PM")
};

// Group Join
var query = 
    from day in days
    join delv in deliveries on day equals delv.day into dayDelvs
    select new { day, deliveries = dayDelvs.Select(dd => dd.time) };

// Output
query.ToList().ForEach(dd => 
    Console.WriteLine($"{dd.day} deliveries: { string.Join(", ", dd.deliveries) }"));

Produces the following output:

Mon deliveries: 8AM
Tue deliveries:
Wed deliveries:
Thu deliveries: 10AM, 2PM
Fri deliveries:

As you can see in the output, the days without any deliveries are preserved, so the concept of left outer join is maintained. In addition, we get a linked sub-collection containing the matching delivery times, which can then be easily processed, in our case using string.Join().

Left Outer Join

To accomplish the full behavior of a left outer join, we still start with a group join. In addition though we need to expand the results of the group join and process it further using the DefaultIfEmpty method. DefaultIfEmpty ensures that each item in the left collection is included in the result set even when that element has no matches in the right collection.

Following is a modification of the first example, this time with a call to DefaultIfEmpty included in order to produce an actual left outer join:

// Test data
var days = new[] { "Mon", "Tue", "Wed", "Thu", "Fri" };
var deliveries = new[]
{
    (day: "Mon", time: "8AM"),
    (day: "Thu", time: "10AM"),
    (day: "Thu", time: "2PM")
};

// Left Outer Join
var q = from day in days
        join delv in deliveries on day equals delv.day into dayDelvs
        from dayDelivery in dayDelvs.DefaultIfEmpty()
        select (day, dayDelivery.time);

// Output
q.ToList().ForEach(dd => 
    Console.WriteLine($"{dd.day} delivery: {dd.time}"));

Outputs:

Mon delivery: 8AM
Tue delivery:
Wed delivery:
Thu delivery: 10AM
Thu delivery: 2PM
Fri delivery:

Unlike with the group join, with left outer join we get duplicate rows for Thursday, as there are multiple deliveries scheduled for Thursday. This is the main difference, which needs to be considered when making the choice between a group join and left outer join.

Adding Indexed Select

Looking at the output of the previous example, I think it would be nice if multiple deliveries for the same day would be ranked. Which is a perfect usage scenario for the Indexed Select presented in my previous post.

// Test data
var days = new[] { "Mon", "Tue", "Wed", "Thu", "Fri" };
var deliveries = new[]
{
    (day: "Mon", time: "8AM"),
    (day: "Thu", time: "10AM"),
    (day: "Thu", time: "2PM")
};

// Left Outer Join + Indexed Select
var q = from day in days
        join delv in deliveries on day equals delv.day into dayDelvs
        let numDayDelvs = dayDelvs.Select((dd, num) => new { dd.day, dd.time, num })
        from deliveredDay in numDayDelvs.DefaultIfEmpty()
        select (day, deliveredDay?.time, deliveredDay?.num);

// Output
q.ToList().ForEach(dd =>
{
    if (dd.time != null)
        Console.WriteLine($"{dd.day} delivery #{dd.num + 1}: {dd.time}");
    else
        Console.WriteLine($"{dd.day} delivery:");
});

Gives us a ranked output:

Mon delivery #1: 8AM
Tue delivery:
Wed delivery:
Thu delivery #1: 10AM
Thu delivery #2: 2PM
Fri delivery:

We could of course further enhance the example by only showing the rank for days, which have more than one delivery assigned. I'll leave that as an exercise for the reader.

Alternative syntax

There is an alternative syntax for the left outer join. It also uses the DefaultIfEmpty method, but the syntax is a bit more readable.

Let's take the previous example and rewrite it a little bit:

// Test data
var days = new[] { "Mon", "Tue", "Wed", "Thu", "Fri" };
var deliveries = new[]
{
    (day: "Mon", time: "8AM"),
    (day: "Thu", time: "10AM"),
    (day: "Thu", time: "2PM")
};

// Left Outer Join using alternative syntax
var q = from day in days
        from dayDelivery in deliveries.Where(dd => dd.day == day).DefaultIfEmpty()
        select (day, dayDelivery.time);

// Output
q.ToList().ForEach(dd =>
    Console.WriteLine($"{dd.day} delivery: {dd.time}"));

We get exactly the same output that we got when using the standard left outer join syntax:

Mon delivery: 8AM
Tue delivery:
Wed delivery:
Thu delivery: 10AM
Thu delivery: 2PM
Fri delivery:

Now you may wonder why this alternative syntax is not the recommended way of doing a left outer join in LINQ? Well, the answer is fairly simple - for performance reasons. If you run the alternative query in a pure LINQ-To-Objects scenario (such as the scenario I use in my examples), you won't get any caching of data; the second collection will be scanned (iterated through entirely) several times. This is fine for small data sets, but for large collections, it could easily become an unnecessary performance hog.

For database-driven LINQ providers though, this approach represents an equally effective syntax for writing a left outer join. Simply because the DB provider is usually capable of recognizing the alternative syntax and translating it into a standard left outer join SQL statement. For example, when using the Entity Framework Core SQL Server Provider, the generated SQL query is the same for both the standard and alternative left outer join LINQ syntaxes.


If you find this post interesting, be sure to check out my other posts in the LINQ gems series.

No Comments Yet