
Fine-tuning LLaMA to Recreate Eminescu's Literary Style

by editor | 23.01.2025

As artificial intelligence continues to evolve, we've embarked on a fun project that bridges the gap between classical literature and modern technology. Our goal was to develop an AI model that can generate text in the distinctive style of Mihai Eminescu, Romania's preeminent poet and a defining voice of its literature.

We used Google Colab with Python for this project.

Understanding the Challenge

Before diving into the technical details, it's important to understand what makes this project unique. Teaching an AI model to write like Eminescu isn't just about vocabulary and grammar; it's about capturing the essence of his romantic style, his philosophical depth, and his masterful use of the Romanian language. This presents several interesting challenges, from handling Romanian diacritics to understanding the complex structures of 19th-century literary Romanian.

Technical Foundation: The LLaMA Model

We chose to build upon Meta's LLaMA model for several reasons. LLaMA is particularly well-suited for fine-tuning tasks due to its efficient architecture and strong multilingual capabilities. We experimented with three versions:

  • LLaMA-3.2-3B: Our initial implementation
  • LLaMA-3.1-8B: An alternative version for comparison
  • LLaMA-3.3-70B-Instruct: Our latest iteration

The Fine-Tuning Process

Data Preparation
The first crucial step was preparing our training data and splitting it into small chunks for tokenisation. Here's how we handled the text processing:

from datasets import Dataset

def prepare_dataset(file_paths, chunk_size=1024, overlap=128):
    chunks = []

    for file_path in file_paths:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Split text into overlapping chunks for better context preservation
        words = text.split()
        for i in range(0, len(words), chunk_size - overlap):
            chunk = ' '.join(words[i:i + chunk_size])
            chunks.append(chunk)

    return Dataset.from_dict({"text": chunks})

This code breaks down Eminescu's works into manageable chunks while maintaining context through overlap. The overlap is crucial as it helps the model understand longer-range dependencies in the text.
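
For context, here's roughly how this function feeds into tokenisation. This is a minimal sketch: the file names are hypothetical placeholders for our Eminescu corpus, and it assumes a tokenizer has already been loaded with AutoTokenizer (as shown in the next section).

# Hypothetical corpus files; substitute your own Eminescu texts
dataset = prepare_dataset(["luceafarul.txt", "scrisori.txt"])

# Tokenise the chunks for causal language modelling
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)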

Memory Optimization

One of our biggest challenges was handling the large model efficiently. We implemented several memory optimization techniques:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 for calculations
    bnb_4bit_quant_type="nf4",  # NormalFloat (NF4) 4-bit quantization
    bnb_4bit_use_double_quant=False  # Skip nested quantization
)
# CPU offloading of overflow layers is configured via device_map when loading the model

This configuration allows us to run large language models on consumer-grade hardware while maintaining model quality. The 4-bit quantization significantly reduces memory usage, while offloading overflow layers to the CPU helps manage GPU memory constraints.
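
To show how this config is applied, here is a minimal loading sketch; the model ID is an assumption, so substitute whichever LLaMA variant you are fine-tuning:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"  # hypothetical; use your chosen variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the 4-bit config defined above
    device_map="auto"                # spills layers to CPU when the GPU fills up
)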

Low-Rank Adaptation (LoRA)

We used LoRA to efficiently fine-tune the model without updating all parameters:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,  # Rank of the update matrices
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Which layers to fine-tune
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",
    task_type="CAUSAL_LM"
)

This approach allows us to fine-tune the model with significantly fewer parameters, making the process more efficient while maintaining performance. The target modules focus on the attention mechanisms, which are crucial for capturing the stylistic elements of Eminescu's writing.
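
To give a sense of how the adapter is attached and trained, here is a sketch using Hugging Face's Trainer; the hyperparameters and output directory are illustrative assumptions, not our exact settings:

from peft import get_peft_model
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

training_args = TrainingArguments(
    output_dir="llama-eminescu",     # hypothetical output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # illustrative values for a single consumer GPU
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,  # the tokenised chunks from prepare_dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()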

Evaluation and Results

We developed a comprehensive evaluation system that compares the output of our fine-tuned model with that of the base LLaMA model:

def generate_text(model, tokenizer, prompt, max_length=200):
    with torch.no_grad():
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.8,  # Controls randomness
            do_sample=True,   # Enable sampling
            top_p=0.92,       # Nucleus sampling parameter
            top_k=50,         # Limit vocabulary choices
            repetition_penalty=1.1  # Avoid repetition
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

This generation function balances creativity and coherence through carefully tuned parameters. The temperature of 0.8 provides a good balance between diversity and focus, while the top_p and top_k parameters help maintain quality and relevance.
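
As a quick usage sketch, you can compare the two models side by side; base_model here stands for the unmodified checkpoint loaded separately, and the prompt is just an illustrative line:

prompt = "Scrie o strofă despre luceafărul de seară"  # "Write a stanza about the evening star"

base_output = generate_text(base_model, tokenizer, prompt)  # base_model: unmodified checkpoint
tuned_output = generate_text(model, tokenizer, prompt)      # model: our fine-tuned version

print("Base LLaMA:\n", base_output)
print("\nFine-tuned:\n", tuned_output)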

Real-World Applications

The implications of this project extend beyond just generating Eminescu-style text. This technology can be applied to:

  • Educational tools for studying Romanian literature
  • Creative writing assistance
  • Cultural heritage preservation
  • Literary style analysis and research

We're continuing to improve the model with several initiatives:

  • Expanding the training dataset with more of Eminescu's works
  • Experimenting with larger model variants
  • Developing better evaluation metrics for Romanian poetry
  • Creating interactive tools for writers and researchers

The complete codebase is available on our GitHub repository, and we've published our models on Hugging Face for the broader AI and literary communities to use and build upon.

You can watch our videos from the series on YouTube; the first episode is linked here. We hope you will replicate our steps and have as much fun as we did.

Ready to Transform Your Business with Custom AI Solutions?

At Softescu, we specialize in developing intelligent AI applications that understand your unique business needs. Our team of AI engineers and machine learning experts can help you harness the power of Large Language Models and conversational AI while ensuring seamless integration with your existing systems. Whether you're looking to automate processes, enhance customer experiences, or gain deeper business insights, reach out to us for a personalized AI solution consultation.

