JOINS: SQL Tables vs Python Dataframes.

Kolaru Gideon Mosimiloluwa
7 min readApr 11, 2024

--

JOINS! As it stands, this is the subject in data science that has taken me the most time to wrap my head around. Every time I try to learn it, my brain acts like it’s on break. Well, not anymore, obviously; that’s the reason I can write this article. What are joins?

Now, Imagine it’s your birthday party and you have three separate guest lists: one for family members, one for your friends from school and one for your friends in the neighborhood. You want to create a single “list” that includes all invited guests. This scenario can help illustrate JOINs in a real-life context.

In this scenario:

Your family members’ list represents one dataset, let’s call it “Family”, containing columns like “Name,” “Email,” “Phone Number” and “Relationship” for each family member.

Your school friends’ and neigbours have two separate lists representing datasets, say “SchoolFriends” and “Neighbours” containing similar columns like “Name,” “Email,” and “Phone Number” for each family member.

Now, you need to combine these three lists into one comprehensive guest list to send out invitations. This is where JOINs come into play.

SQL JOINS

SQL JOINS are fundamental to database management and querying, enabling the retrieval of data from multiple tables based on specified criteria. They allow us to combine related data from different tables into a single one, like in the case illustrated above.

Let’s picture a scenario where you have two tables in a database: one containing information about employees (such as employee ID, name, and department ID) and another containing details about departments (such as department ID, name and head). To retrieve information about employees along with their corresponding departments, you would use SQL JOINS.

To see what both tables look like, let’s print all the columns from both tables separately and limit it to the first 5 results. You can also do that using the code below:

SELECT *
FROM employees
LIMIT 5;
SELECT * 
FROM departments
LIMIT 5;

Here’s what the tables look like below:

A screenshot of the first 5 rows of both tables.
EMPLOYEES(TOP) AND DEPARTMENTS(BOTTOM) TABLES.

COMPONENTS OF SQL JOINS

  1. The tables to be joined. These tables are specified in the FROM clause of the SQL query.
  2. The criteria to match rows from one table to another. This condition is specified in the ON or USING clause.
  3. The type of JOIN being performed. This determines how rows are combined and what results are returned. Common types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and OUTER JOIN.
  4. Result Set: The output of the JOIN operation, which is a new table containing columns from both tables involved in the JOIN. The result set includes rows that meet the specified join condition.
  5. Additional clauses that can be used to further filter or manipulate the result set, such as the WHERE, ORDER BY clauses, etc.

TYPES OF SQL JOINS

There are several types of SQL JOINS, each serving a different purpose. Let’s discuss them one after the other.

  • INNER JOIN: This type of JOIN returns rows from both tables where there is a match based on the specified condition. It filters out rows where no match is found. In the case of the database above, suppose we want to see employees, their departments and their respective heads in one table, we can use an inner join to get that done.
SELECT employeeid, fullname, department, d.head
FROM employees e
INNER JOIN Departments d
-- Specify Columns to join on.
ON e.departmentid = d.id
LIMIT 10;

The operation above gives this result:

The joined table returns all employees because the join condition (column we joined “ON”) matches on both tables, i.e. all employees have departmental IDs; hence, it can be found in both tables.

  • LEFT JOIN: We use this when we want the query to return all rows from the left table and only rows where there is a match on the right table. Suppose you want to retrieve a list of all employees along with their respective departments, ensuring that every employee is included in the result set regardless of whether they are assigned to a department. If a department has no employees, the employee-related columns (employee_id and employee_name) will display as NULL.

This can be demonstrated using the code below:

SELECT employeeid, fullname, department, head AS department_head
FROM employees e
LEFT JOIN departments d ON e.departmentid = d.id
-- Specified a condition with a WHERE clause to see the employees without departments assigned
WHERE departmentid IS NULL;
The result showing the employees without departments.
  • RIGHT JOIN: The right join basically does the same thing the left join does but in reverse. We can achieve the same result as above by using the right join and switching the order of the joins as follows:
SELECT employeeid, fullname, department, head AS department_head
FROM departments d
RIGHT JOIN employees e ON e.departmentid = d.id
WHERE departmentid IS NULL;
  • OUTER JOIN: Suppose you want to retrieve a comprehensive list of both employees and departments, ensuring that every employee and every department are included in the result set. The employee-related columns return null in columns where there are missing values and vice versa. The code snippet below can be used to achieve this:
SELECT employeeid, fullname, department, head AS department_head
FROM employees e
FULL OUTER JOIN departments d ON e.departmentid = d.department_id;

You can try this out and see the result for yourself.

PYTHON JOINS

Joins in Python are equally as useful and effective as they are in SQL. In Python, we use joins to “merge” dataframes (python’s version of SQL tables). This is done using the .merge() method in the pandas module.

Without wasting time, let’s dive in.

First, you import your data into Python, depending on the format the data is in. For the sake of demonstration, we will be using the same data as above but in .csv format.

Import your data in three steps:

  • Import dependencies (numpy, pandas, etc…)
import numpy as np
import pandas as pd
  • Import both tables using “pd.read…” as seen below.
#import the department table
dept = pd.read_csv('department.csv')
#import the employee table
emp = pd.read_csv('employees.csv')
  • Print the first few rows to be sure they imported correctly.
First few roles of both tables printed using the .head() method.
First few roles of both tables printed using the .head() method.

Now that our data is imported correctly, let’s discuss the components of a python join operation.

COMPONENTS OF PYTHON JOINS

  1. The dataframes to be joined. The tables are read using the pd.read… method in the pandas module as seen above.
  2. The .merge() method.
  3. The criteria to match the tables on. This is specified using the on, left_on or right_onparameters. If no column is specified, results are joined on a column common to both tables.
  4. The type of JOIN to be performed. This is specified using the how parameter and the inner, outer, left, rightarguments are accepted to perform the inner outer, left and right joins, respectively. This is set to innerby default.
  5. There are other additional parameters that can be specified like left_index, right_index, suffixes, sort, validate.
  6. The result. This result can be assigned a name and can be saved to memory to be used for further analysis.

TYPES OF PYTHON JOINS

Similar to SQL, we have the INNER, OUTER, LEFT, RIGHT, JOINS. They mean the same in both languages. Here is how to get these done:

  • INNER JOIN:
#Inner Join of the emp and dept tables on the department_id column.
#Setting sort=True arranges the results by department_id
emp_dep_inner = emp.merge(dept, on = "department_id",sort=True)
print (emp_dep_inner)
  • LEFT JOIN:
emp_dep_left = emp.merge(dept, how = 'left', on = "department_id",sort=True)
print (emp_dep_left)

This will return three rows of null values for the columns from the right table because three employees have missing values on the column we are merging on.

  • RIGHT JOIN:
emp_dep_right = emp.merge(dept, how = 'right', on = "department_id",sort=True)
print (emp_dep_right)

As explained above, the right join returns all rows on the right and only the rows where the nerge criteria is matched on the left table. This operation returns 90 rows eliminating the three employees without department ids.

  • OUTER JOIN:
emp_dep_outer = emp.merge(dept, how = 'outer', on = "department_id",sort=True)
print (emp_dep_outer)

An outer join returns the matched rows as well as the unmatched rows of one of the tables, or of both tables. In this case, there are no unmatched rows in the right table, hence why we have an identical result as the left join.

SUMMARY OF COMPARISON BETWEEN SQL JOINS AND PYTHON JOINS

A table comparing joins in SQL and Python.

CONCLUSION

There’s no doubt that proficiency in joining tables stands as a foundational skill in data science and data manipulation, particularly when handling structured data. This article is just proof that regardless of the tool, you can achieve the result you want.

We have discussed the various types of joins, inner join, outer join, left outer join, and right outer join. We illustrated the execution of these operations using the JOIN keyword in SQL and the merge() function in Python. Also, we examined the differences between the JOIN operations of both languages.

Click Here to get access to the files used on my GitHub. I hope you find some value in this piece of writing. If you did, feel free to send feedback; it’d be really appreciated.

--

--

Kolaru Gideon Mosimiloluwa
Kolaru Gideon Mosimiloluwa

Written by Kolaru Gideon Mosimiloluwa

Data scientist | SQL | Python | Excel | Power BI | Tableau

No responses yet