Migrating a 20-year-old legacy application database to a modern database
04 May, 2023 Bedram Tamang

In the previous article, I described how I rewrote a 20-year-old legacy PHP application from scratch using the Laravel framework. One of the most challenging tasks was the data migration. jobsnepal.com is the country's first online job portal, serving since 2000.

Over time, various developers worked on this project and made necessary changes to the database schemas, which resulted in a messy structure that did not align with proper database design principles. Before I joined, the company was focused solely on migrating the application logic. However, after studying the database for a month, I realized it needed a complete redesign. I proposed this idea to the company, and they agreed. This led to a comprehensive redesign of the database and the successful migration of 20 years’ worth of data into the newly designed structure.

Major changes

Streamline Users

The application had three types of users: Admin, Employer, and JobSeeker, each with its own separate table. The logic for authentication, session management, and other related functionalities was duplicated across these three user types. To streamline the system and eliminate redundancy, we decided to merge these three tables into a single table named 'user'.

There were approximately 20 system admin accounts, over 10,000 company accounts, and more than 300,000 jobseeker accounts. I allocated IDs 1-15,000 to company users, which let us keep the same primary keys for companies, IDs 15,001-15,050 to admin accounts, and the remaining IDs to jobseeker accounts.

The reason for allocating the company range first is that companies are the most important users: a company can have years of history, and preserving its primary keys eases the data validation and migration work.
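As a rough sketch of how this allocation could be applied during the merge (the CSV paths and column names here are assumptions, not the actual schema):

# users_sketch.py -- merging the three user tables (paths and columns are assumed)
import pandas as pd

ADMIN_OFFSET = 15000      # admins occupy ids 15,001-15,050
JOBSEEKER_OFFSET = 15050  # jobseekers take the remaining ids

companies = pd.read_csv("csv/employer.csv")    # companies keep ids 1-15,000
admins = pd.read_csv("csv/admin.csv")
jobseekers = pd.read_csv("csv/jobseeker.csv")

companies["role"] = "Employer"
admins["role"] = "Admin"
jobseekers["role"] = "JobSeeker"

# Shift admin and jobseeker ids into their allocated ranges
admins["id"] = admins["id"].astype(int) + ADMIN_OFFSET
jobseekers["id"] = jobseekers["id"].astype(int) + JOBSEEKER_OFFSET

users = pd.concat([companies, admins, jobseekers], ignore_index=True)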

Streamline Category

Three different category tables were serving very similar purposes, so we decided to merge them into one.

Account section redesigned

The messiest part of the database was the account section. It was extremely difficult to track history within this section, and during our review we discovered several errors, including accounts being written off without any clear reason. Due to these issues, we decided to completely redesign the account section from scratch. In the redesign, we adopted a transaction-based model: every change to an account is recorded as a transaction, so we never rewrite existing data; instead, we create a new transaction for each change.
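A minimal sketch of the idea in pandas (the table layout and column names are hypothetical):

# ledger_sketch.py -- append-only account transactions (layout is hypothetical)
import pandas as pd

# Every change is appended as a new row; existing rows are never rewritten
rows = [
    {"account_id": 42, "amount": 500, "type": "credit"},
    {"account_id": 42, "amount": -200, "type": "debit"},
]
transactions = pd.DataFrame(rows)
transactions["created_at"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")

# The current balance is derived from the ledger, never stored or overwritten
balance = transactions.loc[transactions["account_id"] == 42, "amount"].sum()
print(balance)  # 300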

Data Migration

Prior to this project, I had no experience with data migration and was not initially hired for this task. However, I had some familiarity with Python tools like Pandas, so I decided to write a migration script using Pandas and NumPy to move 20 years of data into the new architecture.

Table Merge

During the database redesign, I merged several tables, including the users table mentioned earlier. Here’s an example of how I merged two tables into one:

In the old database, there were two separate tables for categories and industries—where categories referred to job categories and industries referred to company sectors. Many items were common between these two tables.

While redesigning the database, I discovered that columns like JobAlertData and Flag were not in use, so I chose not to migrate that data to the new design. Instead, I added two new columns: flag and status. (Note: the new flag column does not reuse the old Flag data; it is an ENUM indicating whether an entry is a category or an industry.)

I then exported the table data from the old database into a CSV file and wrote a script to handle the migration.

# category.py
import pandas as pd

jobcategory = pd.read_csv(
    "csv/Job/jobcategory.csv", sep=";", header=None)

Here, I start by migrating the jobcategory data into the new category table, using Python's pandas library to read the CSV file, as shown above.

dataframe = jobcategory.rename(columns={0: "id", 1: "name"})

Here, I rename column 0 (originally JobCategoryID) to id and column 1 (JobCategoryName) to name to match the new database schema.

The new table also has flag and status columns, so I set flag to "Job" and status to "Active":

dataframe["flag"] = "Job"
dataframe["status"] = "Active"

Another interesting part is the id column. Since I had allotted IDs 1-1000 to the industry table, I had to offset each jobcategory id by 1000, so I wrote a small function:

def getId(x):
    return 1000 + int(x)

and applied it to the column:

dataframe.loc[:, "id"] = dataframe["id"].apply(lambda x: self.getId(x))

As the application was written in Laravel, I also had to add created_at and updated_at columns to the table:

from datetime import datetime

date = datetime.now()

# Set created_at and updated_at for Laravel
dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")

The complete script looks like this:

# category.py
import pandas as pd
from datetime import datetime


class JobCategory:
    def data(self):
        jobcategory = pd.read_csv(
            "csv/Job/jobcategory.csv", sep=";", header=None)
        dataframe = jobcategory.rename(columns={0: "id", 1: "name"})

        # Offset every id by 1000 (ids 1-1000 are reserved for industries)
        dataframe.loc[:, "id"] = dataframe["id"].apply(self.getId)
        dataframe["flag"] = "Job"
        dataframe["status"] = "Active"

        # Set created_at and updated_at for Laravel
        date = datetime.now()
        dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
        dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")

        return dataframe[["id", "name", "flag", "status", "created_at", "updated_at"]]

    def getId(self, x):
        """Offset the jobcategory id by 1000."""
        return 1000 + int(x)

As you can see in the script above, the formatted data is returned as output, which can be stored directly in the database with pandas' to_sql:

# Get data from the class above
data = JobCategory().data()

# Database config
DATABASE_CONFIG_URI = "mysql://username:password@127.0.0.1/database"

# Set up the database connection
engine = create_engine(DATABASE_CONFIG_URI)
conn = engine.connect()

# Insert into the database (table name assumed)
table = "categories"
data.to_sql(table, conn, if_exists="append", index=False)
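One practical note on this step: for the larger tables (the jobseeker accounts alone had over 300,000 rows), to_sql's chunksize parameter writes rows in batches rather than all at once, which can keep memory usage in check:

# Write in batches of 5,000 rows to limit memory usage
data.to_sql(table, conn, if_exists="append", index=False, chunksize=5000)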

The complete Python class would look like this:

# category.py
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine
import pymysql
pymysql.install_as_MySQLdb()


class JobCategory:
    def data(self):
        jobcategory = pd.read_csv(
            "csv/Job/jobcategory.csv", sep=";", header=None)
        dataframe = jobcategory.rename(columns={0: "id", 1: "name"})

        # Offset every id by 1000 (ids 1-1000 are reserved for industries)
        dataframe.loc[:, "id"] = dataframe["id"].apply(self.getId)
        dataframe["flag"] = "Job"
        dataframe["status"] = "Active"

        # Set created_at and updated_at for Laravel
        date = datetime.now()
        dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
        dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")

        return dataframe[["id", "name", "flag", "status", "created_at", "updated_at"]]

    def getId(self, x):
        """Offset the jobcategory id by 1000."""
        return 1000 + int(x)


if __name__ == "__main__":
    # Get data from the class above
    data = JobCategory().data()

    # Database config
    DATABASE_CONFIG_URI = "mysql://username:password@127.0.0.1/database"

    # Set up the database connection
    engine = create_engine(DATABASE_CONFIG_URI)
    conn = engine.connect()

    # Insert into the database (table name assumed)
    table = "categories"
    data.to_sql(table, conn, if_exists="append", index=False)
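The industries side of the merge would follow the same pattern; here is a hedged sketch (the CSV path and the "Industry" flag value are assumptions based on the design described above):

# industry.py -- sketch of the industries counterpart (path and columns assumed)
import pandas as pd
from datetime import datetime

industry = pd.read_csv("csv/Company/industry.csv", sep=";", header=None)
dataframe = industry.rename(columns={0: "id", 1: "name"})

# Industries keep their original ids (1-1000), so no offset is applied
dataframe["flag"] = "Industry"
dataframe["status"] = "Active"

date = datetime.now()
dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")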

Conclusion

We successfully migrated about 20 years of data on December 11, 2019. The full migration took about a day and a half on a PC with 8 GB of RAM.


