28 Aug, 2020 Root User

In the previous article, I mentioned how I rewrote a 20-year-old legacy PHP application from scratch using the Laravel framework. One of the most challenging tasks I took on at jobsnepal.com was the data migration. I believe jobsnepal.com is one of the pioneer organizations that has been doing online business since 2000 (its main source of income is job postings).


For as long as it has been running, different developers have come into this company and contributed features to the core system of its job portal, jobsnepal.com. Their task was always to finish as quickly as possible, without thinking about the architecture of the application. After eighteen years of different developers editing the same codebase (I joined in 2018), it had become one of the dirtiest codebases I have ever seen. I was always afraid to touch it: it is quite a large application, and changing one part always carried the risk of breaking things in other parts. The database layer was in the same state as the application. Different developers came into the system and edited the database structure without thinking about the database architecture. As a result, the database also became a mess, with several kinds of logical mistakes.


Before I joined, the company was thinking only about copying the database into the Laravel application, but I later redesigned the database from scratch using modern tools and techniques. After the redesign, the main challenge was to move 20 years of old data into the newly designed database in the Laravel application.


Main challenges

  • Users table: The job portal mainly has three types of users:

    1. Jobseeker: who searches for jobs
    2. Company: who provides jobs
    3. System admin: who manages jobseekers and companies

In the old application, the three types of users were stored in three different tables, and there were three different login pages to log in the three types of users. If you look at how this works, it repeats the same work in three different places, which goes against the DRY principle.


So, I decided to merge the three tables into one.


There were about 20 system admin accounts, 10,000+ company accounts, and more than 300,000 jobseeker accounts. I allotted IDs 1-15,000 for company users, 15,001-15,050 for admin accounts, and everything after that for jobseeker accounts.
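The reallocation above can be sketched with pandas. The column names and sample rows below are made up for illustration; only the offsets correspond to the id ranges I used:

```python
import pandas as pd

# Hypothetical per-type user exports from the old database
companies = pd.DataFrame({"id": [1, 2], "email": ["[email protected]", "[email protected]"]})
admins = pd.DataFrame({"id": [1, 2], "email": ["[email protected]", "[email protected]"]})
jobseekers = pd.DataFrame({"id": [1, 2], "email": ["[email protected]", "[email protected]"]})

# Shift each user type into its allotted id range:
# companies keep 1-15,000, admins move to 15,001-15,050,
# and jobseekers start after 15,050.
admins["id"] += 15000
jobseekers["id"] += 15050

# Tag the user type so a single login flow can branch on it
companies["type"] = "company"
admins["type"] = "admin"
jobseekers["type"] = "jobseeker"

# One users table instead of three
users = pd.concat([companies, admins, jobseekers], ignore_index=True)
```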


Before this, I didn't have any experience with data migration jobs; I wasn't even hired for this kind of work. But I had a little experience with a Python library called pandas, so I decided to write a script in Python using packages such as pandas and NumPy to migrate 20 years of old data into the new architecture.


  • Table merge: I merged many tables while redesigning the database. You can take the users table above as a reference. Here, I will show an example of how I merged two tables into one.

In the old database, there were two separate tables for categories and industries: category refers to a job's category and industry refers to a company's industry. There were many common items between these two tables. The schemas of the two tables were:

Industry Table:

(schema image)

JobCategory Table:

(schema image)

And my merged table structure was:

(schema image)

While redesigning the database, I found that columns such as JobAlertData and Flag were not in use, so I decided not to migrate that data into the new design. Instead, I added new flag and status columns. (Note: flag will not hold the same data as in jobcategory; it will be an ENUM datatype that stores whether the item is a category, an industry, or both.)
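To illustrate the flag column, here is a rough sketch with made-up rows showing how an item that appears in both source tables could collapse into a single row flagged Both (in the actual migration the two tables' rows were kept apart by id ranges):

```python
import pandas as pd

# Hypothetical excerpts of the two source tables
industry = pd.DataFrame({"name": ["Banking", "IT"]})
jobcategory = pd.DataFrame({"name": ["IT", "Teaching"]})

# Tag each row with its origin, matching the ENUM values
industry["flag"] = "Industry"
jobcategory["flag"] = "Job"

merged = pd.concat([industry, jobcategory], ignore_index=True)

# A name carrying more than one distinct flag appears in both tables
counts = merged.groupby("name")["flag"].nunique()

# Keep one row per name, then mark the shared ones as "Both"
merged = merged.drop_duplicates("name").copy()
merged.loc[merged["name"].isin(counts[counts > 1].index), "flag"] = "Both"
```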


So, I downloaded that table's data from the old database as a CSV file and started a script:

# category.py
import pandas as pd

jobcategory = pd.read_csv(
            "csv/Job/jobcategory.csv", sep=";", header=None)

Here, at first I am migrating the jobcategory data into category, so I use Python's pandas library to read the CSV file, as in the code above.


dataframe = pd.DataFrame(jobcategory).rename(
            columns={0: "id", 1: "name"})

Here, I rename column 0 (JobCategoryID) to id and column 1 (JobCategoryName) to name, to map onto the new database.

I had also added flag and status columns to the new table, so I set flag to Job and status to Active:


dataframe["flag"] = "Job"
dataframe["status"] = "Active"

Another interesting part is the id column. As I had allotted rows 1-1000 to the industry table, I had to offset each id in this data by 1000, so I wrote a function:


def getId(x):
    return 1000 + int(x)

And called this function from a lambda:

dataframe.loc[:, "id"] = dataframe["id"].apply(lambda x: getId(x))

As the application was written in Laravel, I also had to add created_at and updated_at columns to the table, which was done as:

from datetime import datetime

# Setting the current time
date = datetime.now()

# Setting created_at and updated_at for Laravel
dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")

The complete script looks like:

# category.py
import pandas as pd
from datetime import datetime


class JobCategory:
    def data(self):
        jobcategory = pd.read_csv(
            "csv/Job/jobcategory.csv", sep=";", header=None)
        dataframe = pd.DataFrame(jobcategory).rename(
            columns={0: "id", 1: "name"})
        dataframe.loc[:, "id"] = dataframe["id"].apply(lambda x: self.getId(x))
        dataframe["flag"] = "Job"
        dataframe["status"] = "Active"

        # Setting the current time
        date = datetime.now()

        # Setting created_at and updated_at for Laravel
        dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
        dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")

        return dataframe[["id", "name", "flag", "status", "created_at", "updated_at"]]

    def getId(self, x):
        """Offset the JobCategory id by 1000."""
        return 1000 + int(x)

As you can see in the script above, the formatted data is returned as output, and it can be stored directly into the database:

# Get data from the class above
data = JobCategory().data()

# Database config
DATABASE_CONFIG_URI = "mysql://username:[email protected]/database"

# Set up the database connection
engine = create_engine(DATABASE_CONFIG_URI)
conn = engine.connect()

# Target table in the new schema (the merged category/industry table)
table = "categories"

# Insert into the database
data.to_sql(table, conn, if_exists="append", index=False)
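For the bigger tables (the jobseekers alone were 300,000+ rows), to_sql also accepts a chunksize parameter so the rows are inserted in batches rather than as one huge statement. Here is a small sketch; an in-memory SQLite database and a made-up categories table stand in for the real MySQL connection:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the MySQL connection here
engine = create_engine("sqlite://")

# A stand-in dataset of 10,000 rows
data = pd.DataFrame({
    "id": range(1, 10001),
    "name": [f"row {i}" for i in range(1, 10001)],
})

with engine.connect() as conn:
    # Insert 1,000 rows per round trip instead of everything at once
    data.to_sql("categories", conn, if_exists="append",
                index=False, chunksize=1000)

    # Verify all rows arrived
    count = pd.read_sql("SELECT COUNT(*) AS n FROM categories", conn)["n"][0]
```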

The complete Python class would look like:

# category.py
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine
import pymysql
pymysql.install_as_MySQLdb()


class JobCategory:
    def data(self):
        jobcategory = pd.read_csv(
            "csv/Job/jobcategory.csv", sep=";", header=None)
        dataframe = pd.DataFrame(jobcategory).rename(
            columns={0: "id", 1: "name"})
        dataframe.loc[:, "id"] = dataframe["id"].apply(lambda x: self.getId(x))
        dataframe["flag"] = "Job"
        dataframe["status"] = "Active"

        # Setting the current time
        date = datetime.now()

        # Setting created_at and updated_at for Laravel
        dataframe["created_at"] = date.strftime("%Y-%m-%d %H:%M:%S")
        dataframe["updated_at"] = date.strftime("%Y-%m-%d %H:%M:%S")

        return dataframe[["id", "name", "flag", "status", "created_at", "updated_at"]]

    def getId(self, x):
        """Offset the JobCategory id by 1000."""
        return 1000 + int(x)


if __name__ == "__main__":
    # Get data from the class above
    data = JobCategory().data()

    # Database config
    DATABASE_CONFIG_URI = "mysql://username:[email protected]/database"

    # Set up the database connection
    engine = create_engine(DATABASE_CONFIG_URI)
    conn = engine.connect()

    # Target table in the new schema (the merged category/industry table)
    table = "categories"

    # Insert into the database
    data.to_sql(table, conn, if_exists="append", index=False)



Conclusion: We successfully migrated about 20 years of data on December 11, 2019. It took about one and a half days to migrate all of our data on a PC with 8 GB of RAM.





