SQL Hidden Mistakes Made by Experienced Data Analysts Part 2

E. Sultan, PhD
3 min readJul 15, 2023

In part 1 of this article, you can see that even simple queries can return misleading (wrong) results if we are not careful. Now, let's see an intermediate-level query that also can go wrong.

Let's say you have two tables, the first contains occasions to contact your clients and the second contains the status on whether you contacted them or not :

occasions
+-----------+---------------+-----------------------+
| client_id | client_name | occasion |
+-----------+---------------+-----------------------+
| 1 | Mike Johnson | New Product Launch |
| 1 | Mike Johnson | Promotional Offer |
| 2 | Sarah Williams| New Product Launch |
| 2 | Sarah Williams| Promotional Offer |
+-----------+---------------+-----------------------+
status
+-----------+-----------------------+-----------+
| client_id | occasion | contacted |
+-----------+-----------------------+-----------+
| 1 | New Product Launch | true |
| 1 | Promotional Offer | true |
| 2 | New Product Launch | false |
| 2 | Promotional Offer | true |
+-----------+---------------+-------------------+

Now say you want to get number of contacted clients in each occasion :

SELECT o.occasion, COUNT(DISTINCT s.client_id) AS num_contacted_clients
FROM occasions o
JOIN status s ON o.client_id = s.client_id
AND s.contacted = true
GROUP BY o.occasion;

Which gives the following wrong result :

+-----------------------+------------------------+
| occasion | num_contacted_clients |
+-----------------------+------------------------+
| New Product Launch | 2 |
| Promotional Offer | 2 |
+-----------------------+------------------------+

To understand what went wrong here, it is enough to see the table that results from joining the two tables. Basically, these tables has many-to-many relationship and they were supposed to be joined on multiple keys (customer_id and occasion, in this case).

To correct for this mistake, it is enough to change the above query and connect on the two keys :

SELECT o.occasion, COUNT(DISTINCT s.client_id) AS num_contacted_clients
FROM occasions o
JOIN status s ON o.client_id = s.client_id
AND o.occasion = s.occasion
WHERE s.contacted = true
GROUP BY o.occasion;

Which gives the correct result :

+-----------------------+------------------------+
| occasion | num_contacted_clients |
+-----------------------+------------------------+
| New Product Launch | 1 |
| Promotional Offer | 2 |
+-----------------------+------------------------+

Here is the whole SQL code to test it yourself on a platform like programiz :

CREATE TABLE occasions (
client_id INT,
client_name VARCHAR(255),
occasion VARCHAR(255)
);

INSERT INTO occasions (client_id, client_name, occasion)
VALUES
(1, 'Mike Johnson', 'New Product Launch'),
(1, 'Mike Johnson', 'Promotional Offer'),
(2, 'Sarah Williams', 'New Product Launch'),
(2, 'Sarah Williams', 'Promotional Offer');

CREATE TABLE status (
client_id INT,
client_name VARCHAR(255),
occasion VARCHAR(255),
contacted BOOLEAN
);

INSERT INTO status (client_id, client_name, occasion, contacted)
VALUES
(1, 'New Product Launch', true),
(1, 'Promotional Offer', true),
(2, 'New Product Launch', false),
(2, 'Promotional Offer', true);

SELECT o.occasion, COUNT(DISTINCT s.client_id) AS num_contacted_clients
FROM occasions o
JOIN status s ON o.client_id = s.client_id
WHERE s.contacted = true
GROUP BY o.occasion;

Summary :

It is not easy to avoid these types of errors specially they are hidden when they happen. But it is enough to know they exist, and ask yourself what kind of relation two tables have before joining them.

SQL Joins on Multiple Keys & when it is a must

As a general rule, tables should be joined on all columns that are common between the tables. The only time we can get away without doing that is when a column value is in one-to-one relation with another column (ex. a column that contain client’s gender should not cause an issue if it is not used in the joining).

Check also my Google interview experience. And a fast way to intermediate level in SQL here.

About me : I am a data scientist who is passionate about explaining data concepts & tools in plain English.

You can connect with me on LinkedIn, the link is in the Bio.

WRITER at MLearning.ai // Code Interpreter // Jailbreaking GPT-4!

--

--

E. Sultan, PhD

Senior Data Scientist | Creating AI apps, including backend, front end & deployment | PhD in Statistics | 📍 Paris | linkedin.com/in/eisultan