Instruction finetuning & Self-Instruct

Internship_LLM 2024. 1. 26. 10:54


In this post I'll briefly walk through the code flow for instruction tuning and Self-Instruct.

※ Instruction tuning

1. If the instruction dataset is a JSON file, read it in (e.g. using with open) and convert it to a DataFrame.
2. Create the tokenizer and model to use.
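For step 1, a minimal sketch of the loading step might look like the following. It assumes the JSON file holds a list of objects with instruction, input, and output keys (the Alpaca-style layout); the function name load_instruction_dataset is made up for illustration.

```python
import json

import pandas as pd


def load_instruction_dataset(path):
    """Read a JSON file holding a list of records and return a DataFrame."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    df = pd.DataFrame(records)
    # Records without an "input" field are common in instruction data; make
    # sure the column exists and holds strings so later steps can use it safely.
    if "input" not in df.columns:
        df["input"] = ""
    df["input"] = df["input"].fillna("")
    return df
```

Calling load_instruction_dataset on the dataset file then gives the df that is passed to CustomDataset later on.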

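Tokenization uses AutoTokenizer from transformers, loaded from a pretrained checkpoint.
When instruction tuning an encoder-decoder model, the model is typically created with AutoModelForSeq2SeqLM, so the checkpoint has to be an encoder-decoder one such as T5 (an encoder-only checkpoint such as BERT cannot be loaded this way):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # example model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)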

3. Split the user-supplied dataset into instruction, input, and output, build a CustomDataset class around them, and encode each field.

        # Encoding
        instruction_encoding = self.tokenizer(
            instruction_text, max_length=self.max_length, padding='max_length', truncation=True
        )
        input_encoding = self.tokenizer(
            input_text, max_length=self.max_length, padding='max_length', truncation=True
        )
        output_encoding = self.tokenizer(
            output_text, max_length=self.max_length, padding='max_length', truncation=True
        )
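The snippet above encodes the three fields separately. A seq2seq model trained with Trainer expects input_ids, attention_mask, and labels, so one common way to combine the fields is to join instruction and input into the source text and use output as the labels. This is an assumption here, not necessarily what the original CustomDataset does. A rough sketch, where encode_example and the prompt template are made up for illustration:

```python
IGNORE_INDEX = -100  # label value that the loss in transformers models skips


def encode_example(tokenizer, instruction, input_text, output_text, max_length=128):
    """Turn one (instruction, input, output) record into model inputs and labels."""
    # Assumed prompt template: instruction, a blank line, then the optional input.
    source = instruction if not input_text else f"{instruction}\n\n{input_text}"

    source_enc = tokenizer(
        source, max_length=max_length, padding="max_length", truncation=True
    )
    target_enc = tokenizer(
        output_text, max_length=max_length, padding="max_length", truncation=True
    )

    # Mask padding positions in the target so they do not contribute to the loss.
    labels = [
        tok if tok != tokenizer.pad_token_id else IGNORE_INDEX
        for tok in target_enc["input_ids"]
    ]
    return {
        "input_ids": source_enc["input_ids"],
        "attention_mask": source_enc["attention_mask"],
        "labels": labels,
    }
```

The __getitem__ method of a CustomDataset could return this dictionary for the row at the requested index.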

4. Once the encoding has been written with the tokenizer created above, set up TrainingArguments and Trainer.

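from transformers import Trainer, TrainingArguments

# Prepare the training dataset
train_dataset = CustomDataset(tokenizer, df)

# Training settings
training_args = TrainingArguments(
    output_dir='./results',          # where checkpoints are written
    num_train_epochs=3,              # passes over the training data
    per_device_train_batch_size=16,  # batch size per device
    warmup_steps=500,                # steps of learning-rate warmup
    weight_decay=0.01,               # weight decay strength
    logging_dir='./logs',            # where logs are written
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Start training
trainer.train()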

Training with the Trainer like this produces an instruction-tuned model.
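One setting worth checking is warmup_steps=500: with a small dataset the whole run can be shorter than the warmup. A quick way to estimate the number of optimizer steps, assuming a single device and no gradient accumulation (the dataset size of 1000 is a hypothetical value):

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


num_examples = 1000  # hypothetical dataset size
steps = total_training_steps(num_examples, batch_size=16, num_epochs=3)
print(steps)        # 189
print(steps > 500)  # False: training would end before warmup completes
```

In that case the learning rate never reaches its configured peak, so a smaller warmup_steps value is probably a better fit.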

※ Self-Instruct

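1. Load the model and tokenizer, and put the model in evaluation mode.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # example model name (must be an encoder-decoder checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout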

2. Implement the Self-Instruct process in two parts: one that generates an instruction and one that generates the response.

def generate_instruction(model, tokenizer, seed_sentence, max_length=50):
    """Generate a new instruction from a seed sentence."""
    inputs = tokenizer.encode(seed_sentence, return_tensors='pt')
    outputs = model.generate(inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def model_response_to_instruction(model, tokenizer, instruction, max_length=50):
    """Generate the model's response to an instruction."""
    inputs = tokenizer.encode(instruction, return_tensors='pt')
    outputs = model.generate(inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
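In the Self-Instruct paper the model is prompted with several existing tasks as few-shot examples, not with a single seed sentence. A minimal sketch of building such a prompt follows; the function name, the default of three examples, and the exact template are assumptions loosely modeled on the paper's prompt, not a reproduction of it.

```python
import random


def build_instruction_prompt(seed_instructions, num_examples=3, rng=None):
    """Format randomly sampled seed instructions as a numbered few-shot prompt."""
    rng = rng or random.Random()
    k = min(num_examples, len(seed_instructions))
    examples = rng.sample(seed_instructions, k)
    lines = ["Come up with a series of tasks:"]
    for i, text in enumerate(examples, start=1):
        lines.append(f"Task {i}: {text}")
    # Leave the next slot open so the model continues with a new task.
    lines.append(f"Task {k + 1}:")
    return "\n".join(lines)
```

The resulting string can be passed to generate_instruction in place of a single seed sentence.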

3. Load seed sentences (or a seed file), then for each iteration pick a seed at random and generate a new instruction and response from it, so that the model produces a variety of instruction types.

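import random

# Seed sentences: placeholder values, replace with your own seeds or load them from a seed file
seed_sentences = [
    "Write a short poem about the sea.",
    "Explain what a tokenizer does.",
]
for _ in range(5):
    seed = random.choice(seed_sentences)
    instruction = generate_instruction(model, tokenizer, seed)
    response = model_response_to_instruction(model, tokenizer, instruction)
    print(f"Instruction: {instruction}\nResponse: {response}\n")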
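To keep the generated instructions diverse, the Self-Instruct paper adds a new instruction to the pool only if its ROUGE-L similarity to every existing instruction is below 0.7. The loop above does not do this. The sketch below is a simplified stand-in that computes ROUGE-L F1 over lowercased whitespace tokens and does not call a ROUGE library; the function names are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate, pool, threshold=0.7):
    """Keep a candidate only if it is not too similar to anything in the pool."""
    return all(rouge_l_f1(candidate, existing) < threshold for existing in pool)
```

Inside the loop, an instruction would then be kept (and appended to the pool) only when is_novel(instruction, pool) returns True.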
