Duplicate Rows Detection in MySQL

We've been working on a Candidate Management system for JoBins, a platform that allows recruitment agents to Manage their Candidates. Sometimes, the same candidate is submitted by multiple agents, which causes redundancy and affects the overall data quality.

To tackle this, we wanted to build a system that automatically flags duplicate candidates.

For the sake of simplicity in this blog post, we’ll consider a candidate as a duplicate if their first name and last name match with another candidate in the system.

Requirements:

Our database may already contain thousands of candidates, so we need a way to compare each candidate against all others to detect duplicates.
Every time a new candidate is added, the system should check for duplicates among all existing records.
When a candidate's information is updated, we should recheck for duplicates, as the update might introduce or resolve a duplication.

Let's start solving

A good starting point for solving this problem is by using the GROUP BY clause. This allows us to group rows that share the same values in specified columns—in our case, first_name and last_name.

select first_name, last_name 
from candidates
group by first_name, last_name

Now, if you include JSON_ARRAYAGG(id) in your SELECT statement, you'll get a list of all candidate IDs that share the same name.

select first_name, last_name, json_arragg(id) as duplicates
from candidates
group by first_name, last_name

The output of this query would look like:

Image description

This data is enough to flag the row as a duplicate, with IDs [1, 2, 3] representing all matching candidates—including itself. While the current row's ID is part of the list, that's not a problem. We can easily filter it out on the frontend when displaying duplicate matches.

If you're curious about how to retrieve and display all related duplicates in the UI, feel free to drop a comment—I'm happy to help!

To support this, we simply add a new JSON column to store all duplicate IDs (including the current one).

ALTER TABLE candidates
    ADD COLUMN duplicates JSON NULL;

Populate duplicates

At the very beginning, we need to populate the duplicates column for each row. To do this, we group candidates by their first_name and last_name, aggregate their IDs using JSON_ARRAYAGG, and then join the result back to the original candidates table to update the duplicates field.

Here’s how the query looks:

update candidates
join (
  select json_arrayagg(id) as ids
  from candidates
  group by first_name, last_name
) as duplicates
on json_contains(duplicates.ids, cast(id as json))
set candidates.duplicates = duplicates.ids

The slightly better query would be:

update candidates
join (
  select json_arrayagg(id) as ids, first_name, last_name
  from candidates
  group by first_name, last_name
) as duplicates
on candidates.first_name = duplicates.first_name,
and candidates.last_name = duplicates.last_name
set candidates.duplicates = duplicates.ids

The second query is better as because it doesn't include expensive json_constains search in each row.

Congratulations!, we populated duplicates for each rows.

Add a row

In addition to a row, we should check if the row is duplicated with existing row or not, if it's duplicated we can simple update those rows which are duplicated with the current row, It can be implement as:

Example: a new row is added with first_name=Bedram and last_name=Tamang

update candidates
join (
  select json_arrayagg(id) as ids, first_name, last_name
  from candidates
+ where first_name = 'Bedram' and last_name = 'Tamang'
  group by first_name, last_name
) as duplicates
on candidates.first_name = duplicates.first_name,
and candidates.last_name = duplicates.last_name
set candidates.duplicates = duplicates.ids

Update a Row

Updating a row is a bit more complex than inserting one, because a row may already be marked as a duplicate of other records. When a candidate is updated, we not only need to re-evaluate the new data for duplicates, but also clean up the old duplicate group it previously belonged to.

For example, suppose two rows were marked as duplicates with first_name = 'Bedram' and last_name = 'Tamang'. Now, if one of them is updated to first_name = 'Adam' and last_name = 'Wathan', we need to:

Recalculate duplicates for the new name.
Remove the updated row from the old duplicate group and update that group accordingly.

So the idea is to search both the old and new groups, and update all affected rows in both.

update candidates
join (
  select json_arrayagg(id) as ids, first_name, last_name
  from candidates
+ where ((first_name = 'Bedram' and last_name = 'Tamang') # Searching for previous group
+ or (first_name = 'Adam' and last_name = 'Wathan')) # Searching for update
  group by first_name, last_name
) as duplicates
on candidates.first_name = duplicates.first_name,
and candidates.last_name = duplicates.last_name
set candidates.duplicates = duplicates.ids

Conclusion

Handling duplicate records is a common challenge in systems that rely on user-submitted data—especially in recruitment platforms like JoBins where multiple agents might submit the same candidate. In this post, we tackled this issue using MySQL by grouping candidates based on their first_name and last_name, and flagging duplicates using a JSON column.

We covered:

How to identify and group duplicate candidates.
How to initially populate duplicates for existing records.
How to handle duplicates when adding a new candidate.
How to manage updates and maintain accurate duplicate groupings.

With this setup, you can easily flag and manage duplicates directly in your database without needing additional backend logic. And since the duplicates field stores all related IDs, it becomes straightforward to visualize relationships and resolve conflicts in the UI.

If you want to extend this further, consider adding fuzzy matching for minor typos (e.g., using SOUNDEX() or Levenshtein distance), or applying more robust deduplication strategies using a scoring system.

Thanks for reading! Feel free to drop questions or suggestions in the comments—always happy to chat tech ✌️