I need to create my own dataset based on mlabonne/orpo-dpo-mix-40k, but when I do and create a dataset for ORPO training, it gives an error:

---------------------------------------------------------------------------
UndefinedError                            Traceback (most recent call last)
<ipython-input-14-09e2a2282d7c> in <cell line: 3>()
      1 tokenizer = AutoTokenizer.from_pretrained(base_model)
      2 
----> 3 dataset = dataset.map(
      4     format_chat_template,
      5 )

8 frames
/usr/local/lib/python3.10/dist-packages/jinja2/environment.py in handle_exception(self, source)
    934         from .debug import rewrite_traceback_stack
    935 
--> 936         raise rewrite_traceback_stack(source=source)
    937 
    938     def join_path(self, template: str, parent: str) -> str:

<template> in top-level template code()

UndefinedError: 'str object' has no attribute 'role'

When I did further checking and got dataset.features from the 'mlabonne/orpo-dpo-mix-40k' dataset, it showed:

{'source': Value(dtype='string', id=None),
 'chosen': [{'content': Value(dtype='string', id=None),
   'role': Value(dtype='string', id=None)}],
 'rejected': [{'content': Value(dtype='string', id=None),
   'role': Value(dtype='string', id=None)}],
 'prompt': Value(dtype='string', id=None)}

But my dataset has these features:

{'chosen': Value(dtype='string', id=None),
 'rejected': Value(dtype='string', id=None),
 'prompt': Value(dtype='string', id=None)}

There is no role key; how do I add it? I am creating the dataset from a CSV using datasets:

from datasets import load_dataset
dataset = load_dataset('csv', data_files='my_file.csv')

This is the code that gives the error:


import os

def format_chat_template(row):
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc= os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)

This is from the "Fine-tune Llama 3 with ORPO" tutorial (huggingface.co).

I have the same issue. Does anyone know how to solve it?

I was getting the same error.

However, there are two things to know that can cause this error:

  • Firstly, you need to prepare your dataset such that your chosen column consists of string-encoded lists, where each list has at least two dictionaries.
    The first dictionary should contain the prompt that the chosen answer responds to. The second dictionary in the same list should be the chosen answer itself, with its role key set to assistant.

For a better view, your chosen column should contain such structures:

[{'content': prompt, 'role': 'user'}, {'content': chosen_answer, 'role': 'assistant'}]

You should then apply the same technique to the rejected answers and store them in similar structures.
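As a concrete sketch of that shape (the prompt and answer strings here are hypothetical example data, just to illustrate the structure):

```python
# Hypothetical example data, only to show the expected shape of each column.
prompt = "What is the capital of France?"
chosen_answer = "The capital of France is Paris."
rejected_answer = "I think it might be Lyon."

# The chosen column: user prompt first, then the preferred answer as assistant.
chosen = [
    {"content": prompt, "role": "user"},
    {"content": chosen_answer, "role": "assistant"},
]

# The rejected column: the same prompt, but paired with the worse answer.
rejected = [
    {"content": prompt, "role": "user"},
    {"content": rejected_answer, "role": "assistant"},
]
```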

  • Secondly, after you are done with the data processing and your dataset is ready, make sure you parse these lists back correctly, since this step can also produce errors. By parsing I mean converting the lists from string format back into actual lists of dictionaries (refer to the ast module for this step).
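A minimal sketch of that parsing step, assuming the CSV stored the lists with Python-literal syntax (single quotes, as e.g. pandas to_csv would write them); parse_messages is a hypothetical helper name:

```python
import ast

def parse_messages(row):
    # Convert string-encoded message lists back into real lists of dicts.
    # ast.literal_eval safely evaluates Python-literal syntax like
    # "[{'content': '...', 'role': 'user'}, ...]" without running code.
    row["chosen"] = ast.literal_eval(row["chosen"])
    row["rejected"] = ast.literal_eval(row["rejected"])
    return row

# Example round-trip: a string-encoded list becomes a usable list of dicts.
s = "[{'content': 'Hi', 'role': 'user'}, {'content': 'Hello!', 'role': 'assistant'}]"
msgs = ast.literal_eval(s)
print(msgs[0]["role"])  # user
```

You can pass parse_messages to dataset.map() before applying the chat template.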

If this method doesn't work, please provide more context, such as examples from the dataset you are using.


Yep, that error is happening because apply_chat_template() expects a list of message dicts like [{ "role": "...", "content": "..." }, ...], but your CSV has chosen and rejected as plain strings. So Jinja tries to read .role on a string and crashes.

Two easy fixes.

Fix 1: Wrap your strings into chat messages (most common)

If your CSV has:

  • prompt = user instruction

  • chosen = preferred assistant answer

  • rejected = worse assistant answer

Then convert them like this:

import os
from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_file.csv")["train"]

def format_chat_template(row):
    chosen_msgs = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["chosen"]},
    ]
    rejected_msgs = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["rejected"]},
    ]

    row["chosen"] = tokenizer.apply_chat_template(chosen_msgs, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(rejected_msgs, tokenize=False)
    return row

dataset = dataset.map(format_chat_template, num_proc=os.cpu_count())
dataset = dataset.train_test_split(test_size=0.01, seed=42)

That will make your data compatible with the mlabonne/orpo-dpo-mix-40k style schema.

Fix 2: If your CSV already stores JSON lists of messages

If your chosen column looks like a JSON string (starts with [ and has role/content), then parse it first:

import json

def format_chat_template(row):
    chosen_msgs = json.loads(row["chosen"])
    rejected_msgs = json.loads(row["rejected"])
    row["chosen"] = tokenizer.apply_chat_template(chosen_msgs, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(rejected_msgs, tokenize=False)
    return row

Quick check: print one row["chosen"] before mapping to confirm whether it is plain text or JSON.
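If you want to automate that check, here is a small sketch (the helper name is mine, not from any library):

```python
import json

def looks_like_json_messages(text):
    # Returns True if the string parses as a JSON list of
    # {"role": ..., "content": ...} dicts, False otherwise.
    try:
        msgs = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return (isinstance(msgs, list)
            and all(isinstance(m, dict) and "role" in m and "content" in m
                    for m in msgs))

print(looks_like_json_messages('[{"role": "user", "content": "Hi"}]'))  # True
print(looks_like_json_messages("just a plain answer string"))           # False
```

If it returns True, use Fix 2; otherwise use Fix 1.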

Also, are you still hitting this issue now, or did you solve it and you are trying to improve your ORPO dataset quality next?
