We've been working on a Candidate Management system for JoBins, a platform that allows recruitment agents to Manage their Candidates. Sometimes, the same candidate is submitted by multiple agents, which causes redundancy and affects the overall data quality.
To tackle this, we wanted to build a system that automatically flags duplicate candidates.
For the sake of simplicity in this blog post, we’ll consider a candidate as a duplicate if their first name and last name match with another candidate in the system.
A good starting point for solving this problem is by using the GROUP BY
clause. This allows us to group rows that share the same values in specified columns—in our case, first_name
and last_name
.
select first_name, last_name
from candidates
group by first_name, last_name
Now, if you include JSON_ARRAYAGG(id)
in your SELECT
statement, you'll get a list of all candidate IDs that share the same name.
select first_name, last_name, json_arragg(id) as duplicates
from candidates
group by first_name, last_name
The output of this query would look like:
This data is enough to flag the row as a duplicate, with IDs [1, 2, 3]
representing all matching candidates—including itself. While the current row's ID is part of the list, that's not a problem. We can easily filter it out on the frontend when displaying duplicate matches.
If you're curious about how to retrieve and display all related duplicates in the UI, feel free to drop a comment—I'm happy to help!
To support this, we simply add a new JSON
column to store all duplicate IDs (including the current one).
ALTER TABLE candidates
ADD COLUMN duplicates JSON NULL;
At the very beginning, we need to populate the duplicates
column for each row. To do this, we group candidates by their first_name
and last_name
, aggregate their IDs using JSON_ARRAYAGG
, and then join the result back to the original candidates
table to update the duplicates
field.
Here’s how the query looks:
update candidates
join (
select json_arrayagg(id) as ids
from candidates
group by first_name, last_name
) as duplicates
on json_contains(duplicates.ids, cast(id as json))
set candidates.duplicates = duplicates.ids
The slightly better query would be:
update candidates
join (
select json_arrayagg(id) as ids, first_name, last_name
from candidates
group by first_name, last_name
) as duplicates
on candidates.first_name = duplicates.first_name,
and candidates.last_name = duplicates.last_name
set candidates.duplicates = duplicates.ids
The second query is better as because it doesn't include expensive json_constains
search in each row.
Congratulations!, we populated duplicates for each rows.
In addition to a row, we should check if the row is duplicated with existing row or not, if it's duplicated we can simple update those rows which are duplicated with the current row, It can be implement as:
Example: a new row is added with first_name=Bedram
and last_name=Tamang
update candidates
join (
select json_arrayagg(id) as ids, first_name, last_name
from candidates
+ where first_name = 'Bedram' and last_name = 'Tamang'
group by first_name, last_name
) as duplicates
on candidates.first_name = duplicates.first_name,
and candidates.last_name = duplicates.last_name
set candidates.duplicates = duplicates.ids
Updating a row is a bit more complex than inserting one, because a row may already be marked as a duplicate of other records. When a candidate is updated, we not only need to re-evaluate the new data for duplicates, but also clean up the old duplicate group it previously belonged to.
For example, suppose two rows were marked as duplicates with first_name = 'Bedram'
and last_name = 'Tamang'
. Now, if one of them is updated to first_name = 'Adam'
and last_name = 'Wathan'
, we need to:
So the idea is to search both the old and new groups, and update all affected rows in both.
update candidates
join (
select json_arrayagg(id) as ids, first_name, last_name
from candidates
+ where ((first_name = 'Bedram' and last_name = 'Tamang') # Searching for previous group
+ or (first_name = 'Adam' and last_name = 'Wathan')) # Searching for update
group by first_name, last_name
) as duplicates
on candidates.first_name = duplicates.first_name,
and candidates.last_name = duplicates.last_name
set candidates.duplicates = duplicates.ids
Handling duplicate records is a common challenge in systems that rely on user-submitted data—especially in recruitment platforms like JoBins where multiple agents might submit the same candidate. In this post, we tackled this issue using MySQL by grouping candidates based on their first_name
and last_name
, and flagging duplicates using a JSON
column.
We covered:
With this setup, you can easily flag and manage duplicates directly in your database without needing additional backend logic. And since the duplicates
field stores all related IDs, it becomes straightforward to visualize relationships and resolve conflicts in the UI.
If you want to extend this further, consider adding fuzzy matching for minor typos (e.g., using SOUNDEX()
or Levenshtein distance), or applying more robust deduplication strategies using a scoring system.
Thanks for reading! Feel free to drop questions or suggestions in the comments—always happy to chat tech ✌️