Preserving Local Languages Digitally: Fine-Tuning OpenAI for Limbu Translation
09 Oct, 2024 Bedram Tamang

Limbu is a language spoken by the Limbu ethnic group native to eastern Nepal and parts of India. It has its own script, called the Limbu script or Kirat Sirijunga, which is used to write the language. The language holds rich cultural and historical significance for the Limbu community and is still spoken by a significant number of people in the region. Despite this, it has left very few digital traces. It is predominantly transmitted orally, although primary schools in eastern Nepal and certain regions of India have recently begun incorporating it into their curriculum.

In the evolving world of machine learning, fine-tuning pre-trained models can significantly enhance their performance on specific tasks. In this blog, I’ll walk you through the process of fine-tuning OpenAI’s GPT-4o-mini model to build an English-to-Limbu language translator. By leveraging a custom dataset and OpenAI’s fine-tuning API, we create a translator that helps bridge the language gap.

Why Fine-Tune Models?

Fine-tuning involves taking a general-purpose pre-trained model and refining it on a specific task, such as translating between languages. In this case, we used OpenAI’s GPT-4o-mini model and trained it using a dataset consisting of English and Limbu sentences. The primary advantage is that fine-tuned models can significantly outperform generic models when working with niche datasets.

Steps Involved

1. Preparing the Dataset

The first step was preparing a dataset of more than 1,000 English-Limbu translation pairs, created with the help of a linguist. The dataset contains English phrases and their Limbu-script translations, providing a solid resource for fine-tuning.

The dataset was structured in a CSV file with the following format:

English   Limbu
Hi.       ᤜᤠᤤ ॥
Run!      ᤗᤠ᤺ᤣᤰᤋᤧ᤹ ᥄
Who?      ᤜᤠ᤺᤺ᤳ ᥅
Wow!      ᤀᤠᤠᤶᤔᤤ ᥄
Fire!     ᤔᤡ ᥄

We then loaded the dataset from the CSV file containing the English phrases and their corresponding Limbu translations.

import pandas as pd
df = pd.read_csv('001-english-limbu-manual.csv')

We used Pandas to load and handle the CSV file, making it easier to manipulate and filter the data as required.
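Before building training examples, it can help to drop incomplete rows. This is an optional cleaning step not shown in the original post; the column names follow the CSV header above:

# Keep only rows that have both an English phrase and a Limbu translation.
df = df.dropna(subset=["English", "Limbu"]).reset_index(drop=True)
print(len(df), "usable translation pairs")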

2. Structuring the Training Data

To make the data compatible with OpenAI's fine-tuning API, we structured it into conversations. Each conversation contains a system message, user input, and the corresponding assistant response.

system = "You are a english to limbu or limbu to english language translator."

def prepare_example_conversation(row):
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": row['English']},
            {"role": "assistant", "content": row["Limbu"]},
        ]
    }

training_df = df.loc[0:1000]  # rows 0-1000 inclusive are used for training
training_data = training_df.apply(prepare_example_conversation, axis=1).tolist()

The dataset was divided into training and validation sets to ensure the model’s generalization ability.
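The split itself is not shown above; a minimal sketch that holds out the remaining rows for validation could look like this:

# Hypothetical split: rows after the training slice become the validation set.
validation_df = df.loc[1001:]
validation_data = validation_df.apply(prepare_example_conversation, axis=1).tolist()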

3. Creating JSONL Files

The OpenAI fine-tuning API requires the data to be in JSONL format (JSON Lines), where each line represents a single data entry. We wrote two files: one for training and another for validation.

import json

def write_jsonl(data_list: list, filename: str) -> None:
    # Write one JSON object per line, as required by the fine-tuning API.
    with open(filename, "w") as out:
        for ddict in data_list:
            jout = json.dumps(ddict) + "\n"
            out.write(jout)

write_jsonl(training_data, "english_limbu_train.jsonl")
write_jsonl(validation_data, "english_limbu_validation.jsonl")
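Each line of the resulting files holds one complete conversation. Using the first row of the table above, a training line looks roughly like this (the Limbu characters are shown unescaped for readability; json.dumps with its default ensure_ascii=True would write them as \u escape sequences, which the API accepts either way):

{"messages": [{"role": "system", "content": "You are a english to limbu or limbu to english language translator."}, {"role": "user", "content": "Hi."}, {"role": "assistant", "content": "ᤜᤠᤤ ॥"}]}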

4. Uploading Data for Fine-Tuning

Once the files were ready, we uploaded them to OpenAI using their client.files.create method. The purpose parameter is set to "fine-tune" to specify the use case.
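The upload_file helper used below is not defined in the post; a minimal sketch wrapping client.files.create might look like this (the client setup is also assumed):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def upload_file(filename: str, purpose: str) -> str:
    # Upload the JSONL file and return its file ID for the fine-tuning job.
    with open(filename, "rb") as fh:
        response = client.files.create(file=fh, purpose=purpose)
    return response.id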

training_file_id = upload_file("english_limbu_train.jsonl", "fine-tune")
validation_file_id = upload_file("english_limbu_validation.jsonl", "fine-tune")

5. Fine-Tuning the Model

With the files uploaded, we kicked off the fine-tuning process by specifying the model to fine-tune (in our case, "gpt-4o-mini-2024-07-18"), the training and validation files, and a custom suffix.


MODEL = "gpt-4o-mini-2024-07-18"

response = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model=MODEL,
    suffix="limbu-trans",
)

Once the fine-tuning job starts, we periodically check its status and monitor the events using:

job_id = response.id  # ID returned by the create call above
response = client.fine_tuning.jobs.retrieve(job_id)
print("Status:", response.status)

The job completes when the status is "succeeded".
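Fine-tuning can take a while, so in practice a simple polling loop works well. This is a sketch, reusing job_id from above; the polling interval is arbitrary:

import time

# Poll until the job reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print("Status:", job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

# Optionally inspect the most recent training events.
for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=5).data:
    print(event.message)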

6. Testing the Fine-Tuned Model

After successful fine-tuning, we retrieve the fine-tuned model ID. This model can now be used to translate between English and Limbu.
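The ID of the resulting model is exposed on the completed job object (a short sketch, reusing job_id from the monitoring step):

fine_tuned_model_id = client.fine_tuning.jobs.retrieve(job_id).fine_tuned_model
print("Fine-tuned model:", fine_tuned_model_id)

For instance, we can now send a test prompt: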

test_messages = [{"role": "system", "content": system},
                 {"role": "user", "content": "translate: hello my friend"}]

response = client.chat.completions.create(
    model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
)
print(response.choices[0].message.content)

In this example, the model generates the Limbu translation for "hello my friend."

Conclusion

By fine-tuning a GPT-4o-mini model on a small, custom dataset, we created a functional English-Limbu language translator. This process applies not just to language translation, but to any task-specific model training. The fine-tuning API simplifies the workflow, making it accessible even to those without deep expertise in machine learning.

This translator is just one example of how AI can help preserve and promote lesser-known languages.


