Introducing Prem-1B

Prem AI introduces Prem-1B, an open-source Small Language Model built for Retrieval-Augmented Generation (RAG) tasks. Based on a decoder-only transformer architecture, it supports up to 8192 tokens. The model is available on Hugging Face under Apache 2.0.

With great enthusiasm, we unveil the Prem-1B series, an open-source, multipurpose small language model developed by Prem AI. This cutting-edge SLM offers the open community and enterprises the opportunity to harness capabilities that were once available only through closed-model APIs, empowering them to build their own advanced language models. The weights of the base model (Prem-1B base) and the finetuned chat model (Prem-1B Chat) are available on Hugging Face under the Apache 2.0 license.

🎯 Our Objective

We aim to develop a model that excels at Retrieval-Augmented Generation (RAG). While Large Language Models (LLMs) store a vast amount of information within their parameters, RAG operates differently by ingesting information during runtime. This approach suggests that for RAG applications, we may not require models of immense size. With this initiative, we aim to create a Small Language Model (SLM) with an extended context length of 8192 tokens, enabling it to handle multi-turn conversations effectively. This endeavor represents our inaugural attempt to craft an SLM tailored for RAG tasks. Read more about our hypothesis here.

💻 Infra Setup

Our training infrastructure consists of 16 H100 GPUs distributed across two nodes, each hosting 8 GPUs. To enable multi-GPU training, the nodes are connected using Ray, a distributed computing framework. We faced a few challenges while setting up the environment, which we explored in our previous blog, linked below 👇

SLM Journey Unveiled
Prem’s “SLM Journey Unveiled” details training a 1B parameter Small Language Model with 8K context length. It covers dataset challenges, Distributed Data Parallelism (DDP) with Ray, and optimization techniques for data partitioning and gradient synchronization.
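
For readers curious how two nodes of 8 GPUs each are stitched together, here is a minimal sketch (not Prem's exact launcher) of multi-node data-parallel training with Ray Train. The Ray calls below are from Ray's public API; the toy model and hyperparameters are placeholders.

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model, get_device

def train_loop_per_worker(config):
    # Ray Train has already initialized the torch.distributed process group
    # for this worker, so ordinary DDP-style code runs unchanged here.
    device = get_device()
    model = prepare_model(torch.nn.Linear(8, 8))   # stand-in for the 1B model; wrapped in DDP
    opt = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    for _ in range(10):
        loss = model(torch.randn(2, 8, device=device)).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()

# 16 workers = 2 nodes x 8 GPUs; Ray schedules one worker process per GPU.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 4e-4},
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()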

🏛️ Architecture

Prem-1B is a transformer-based, decoder-only SLM trained with a next-token prediction objective. The architecture follows Llama 2 as used by TinyLlama, with flash attention. Note that TinyLlama was trained with a context length of 2048, whereas Prem-1B supports a context length of up to 8192. Given the strong performance and benchmarks of the recently released Llama 2 and Llama 3 models, we went with this Llama-style transformer architecture. We also explored the Mamba architecture, Mixture of Experts (MoE) architectures, and the recent technical reports of H2O-Danube-1.8B, Stable LM 2 1.6B, Phi-3, and Llama 3, and concluded that performance at this scale is driven less by architecture than by diverse, high-quality data.
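
As a rough sketch of what such a model looks like in code, the following builds a TinyLlama-style Llama config with the context window extended to 8192. Only the context length is known to match Prem-1B; the remaining sizes are TinyLlama's published values and are an assumption here, not Prem-1B's confirmed hyperparameters.

from transformers import LlamaConfig, LlamaForCausalLM

# TinyLlama-style ~1.1B decoder-only configuration; only max_position_embeddings
# is known to match Prem-1B, the other sizes are assumed from TinyLlama.
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=22,
    num_attention_heads=32,
    num_key_value_heads=4,          # grouped-query attention, as in TinyLlama
    max_position_embeddings=8192,   # extended 8K context window
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")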

🏋️‍♂️ Pre-training

During the pre-training stage, we employed SlimPajama as the data corpus and adopted Llama's tokenizer to process it. In the pre-processing phase, we packed multiple data instances up to the defined context length of 8192 tokens, minimizing the need for padding. The core objective of pre-training is to ingest information and enable the model to learn sentence formation and perform text completion tasks effectively. We tried pre-training without packing the datasets, but it did not perform well: most of the available open-source datasets lack long-context data points, so without packing, most tokens end up as pad tokens and the model learns very little.
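
Below is a minimal sketch of this kind of sequence packing (an illustration, not Prem's exact pipeline): documents are tokenized, joined with EOS separators, and sliced into fixed 8192-token blocks so that almost no positions are wasted on padding.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
BLOCK_SIZE = 8192

def pack(documents):
    buffer = []
    for doc in documents:
        # Separate documents with EOS so the model can learn document boundaries.
        buffer.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
        while len(buffer) >= BLOCK_SIZE:
            yield buffer[:BLOCK_SIZE]      # one fully packed training sequence, no padding
            buffer = buffer[BLOCK_SIZE:]

# Over a real multi-document corpus this yields a stream of 8192-token blocks.
packed = pack(open(p).read() for p in ["doc1.txt", "doc2.txt"])  # placeholder files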

In preparing the packed dataset, we utilized Lightning Data, a tool designed for efficient data handling and pre-processing. As the primary purpose of this model is to perform well on English content, we filtered out any data points containing code-specific information. After the pre-processing phase, we had accumulated 600B tokens, on which we trained for two epochs, totaling 1.2T tokens. Given the research objective of developing an exceptional RAG SLM, we adopted an extended context length of 8192 tokens. We spent a total of 8500 GPU hours on pre-training.
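
A hedged sketch of what the Lightning Data (litdata) preparation step can look like: `optimize` and `StreamingDataset` are litdata's public API, but the file paths, sharding, and `tokenize_fn` below are placeholders rather than Prem's actual recipe.

from litdata import optimize, StreamingDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")

def tokenize_fn(filepath):
    # Read one raw text shard and yield its token ids; packing into 8192-token
    # blocks (as sketched above) can also happen here.
    with open(filepath) as f:
        yield tokenizer.encode(f.read())

if __name__ == "__main__":
    optimize(
        fn=tokenize_fn,
        inputs=["shard-0000.txt", "shard-0001.txt"],   # placeholder input shards
        output_dir="slimpajama-packed",
        chunk_bytes="64MB",
    )
    dataset = StreamingDataset(input_dir="slimpajama-packed")   # streamed during training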

Here is the final training config for pre-training:

model:
  model_args:
    model_name: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
    max_position_embeddings: 8192
    flash_attention: true
    dtype: bfloat16
  optimizer_args:
    lr: 0.0004
    betas: [0.9, 0.95]
    weight_decay: 0.1
  lr_scheduler_args:
    num_warmup_percentage: 0.1

data:
  train_path: "<train_data>"
  val_path: "<train_data>"
  max_seq_length: 8192
  batch_size: 2

trainer:
  accelerator: auto
  precision: bf16-mixed
  log_every_n_steps: 1
  gradient_clip_val: 1
  accumulate_grad_batches: 16
  max_epochs: 2
  val_check_interval: 92000
  limit_val_batches: 1.0
  limit_train_batches: 1.0
  reload_dataloaders_every_n_epochs: 1
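
For context, and assuming the per-device batch size above applies on each of the 16 GPUs described earlier, the effective batch per optimizer step works out as follows:

# batch_size * accumulate_grad_batches * num_gpus * max_seq_length
tokens_per_optimizer_step = 2 * 16 * 16 * 8192
print(tokens_per_optimizer_step)   # 4,194,304 -> roughly 4.2M tokens per optimizer step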

💬 Chat-Finetuning (SFT)

The pre-trained model serves as a foundation, a base model. However, base models are not designed for conversational interactions, so they are unsuitable for chat applications. To transform the base model into a capable assistant, we employ a process called chat fine-tuning. At a high level, this approach involves creating a structured prompt and ingesting it instead of raw data. The structured prompt is designed to simulate a conversation between a human and an assistant, and the model is trained to predict the assistant's response. The process of chat fine-tuning can be summarized as follows:

  1. We added a prompt template, adopting the Llama 3 chat template.
  2. We used datasets with multi-turn conversation data points. A few of the datasets had no system prompt, so we added a very generic base system prompt.
  3. The model was trained on 4 H100 GPUs for 12 hours.
  4. We did no data packing, unlike in the pre-training stage.

Following are the config/hyperparameters for the chat-finetuning stage:

model:
  model_args:
    model_name: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
    max_position_embeddings: 8192
    flash_attention: true
    dtype: bfloat16
  optimizer_args:
    lr: 0.00005
    betas: [0.9, 0.95]
    weight_decay: 0.1
  lr_scheduler_args:
    num_warmup_percentage: 0.1

data:
  train_path: "<train_dataset>"
  val_path: "<val_dataset>"
  max_seq_length: 8192
  batch_size: 2
  dataset_tokenizer: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

trainer:
  accelerator: auto
  precision: bf16-mixed
  log_every_n_steps: 1
  gradient_clip_val: 1
  accumulate_grad_batches: 16
  max_epochs: 3
  limit_val_batches: 1.0
  limit_train_batches: 1.0

While calculating the loss, we masked the whole prompt except the assistant responses. This ensures that the loss is computed only over assistant tokens. Even for multi-turn conversation data points, every assistant response is left unmasked and everything else is masked. For example, consider the following data point formatted with the template (a sketch of this masking follows the example):

<s><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|>

<|start_header_id|>user<|end_header_id|>hi<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
Hello! How can I help you today?<|eot_id|>     (Not MASKED)

<|start_header_id|>user<|end_header_id|>
who is the CEO of google?<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
The CEO of Google is Sundar Pichai<|eot_id|>   (Not MASKED)

<|start_header_id|>user<|end_header_id|>
who is the CEO of Twitter?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
...
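
A minimal sketch of how such masking is typically implemented (an illustration, not Prem's exact code): label positions outside assistant responses are set to -100, which the Transformers cross-entropy loss ignores.

import torch

IGNORE_INDEX = -100   # ignored by the cross-entropy loss in Transformers

def mask_labels(input_ids, assistant_spans):
    """assistant_spans: list of (start, end) token indices covering assistant turns."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]    # loss is computed only on these tokens
    return labels

input_ids = torch.arange(20)                 # stand-in for a tokenized conversation
labels = mask_labels(input_ids, [(8, 14)])   # only the assistant turn contributes to the loss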

We used the following datasets while finetuning. These datasets are selected based on their quality and the diverse nature of prompts:

  1. Ultrachat 200k
  2. Deita 10K V0
  3. Slim Orca
  4. WizardLM Evol Instruct V2
  5. Capybara
  6. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

🤝 DPO and Alignment

We followed SFT with Direct Preference Optimization (DPO), one of the techniques used to align the model to generate better responses. Large, unsupervised language models lack precise control over their behavior due to their unsupervised training. Existing methods like Reinforcement Learning from Human Feedback (RLHF) use complex procedures to fine-tune models so they align with human preferences. DPO is a stable and computationally efficient algorithm that solves the RLHF problem using a simple classification loss, eliminating the need for sampling or significant hyperparameter tuning. You can learn more about model alignment in this blogpost. The following datasets were used for DPO finetuning:

  1. UltraFeedback Binarized
  2. Orca DPO Pairs
  3. OASST2 DPO Pairs

This stage of training is performed using the Alignment Handbook.
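
For reference, DPO minimizes the following objective over preference triples of a prompt x, a chosen response y_w, and a rejected response y_l, where pi_theta is the policy being trained and pi_ref is the frozen SFT model. The sigmoid form corresponds to loss_type: sigmoid, and beta corresponds to beta: 0.01 in the config below.

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]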

We used the following config for DPO finetuning. You can check the parameters in DPOConfig.

bf16: true
beta: 0.01
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 4.0e-6
lr_scheduler_type: cosine
max_length: 8192
max_prompt_length: 1000
num_train_epochs: 1
optim: adamw_torch
per_device_train_batch_size: 2
seed: 42
warmup_ratio: 0.1
loss_type: sigmoid
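
As a hedged illustration of how this config maps onto TRL's DPOTrainer (the trainer the Alignment Handbook builds on): keyword names have shifted between TRL versions (for example tokenizer vs processing_class), and the output directory below is a placeholder, so treat this as a sketch rather than the exact training script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("premai-io/prem-1B-chat")
tokenizer = AutoTokenizer.from_pretrained("premai-io/prem-1B-chat")
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = DPOConfig(
    output_dir="prem-1b-dpo",          # placeholder
    beta=0.01,
    loss_type="sigmoid",
    learning_rate=4.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    bf16=True,
    max_length=8192,
    max_prompt_length=1000,
)

trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()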

🔢 Results

PREM 1B Results
| Model | Avg | Arc-c | Arc-e | Hellaswag | MMLU | Obqa | Piqa | Winogrande |
|---|---|---|---|---|---|---|---|---|
| prem-1B | 42.64 | 24.74 | 57.40 | 42.01 | 24.75 | 21.00 | 72.14 | 56.43 |
| prem-1B-chat | 41.76 | 24.48 | 53.32 | 40.28 | 25.27 | 22.20 | 70.89 | 55.88 |
| TinyLlama-1.1B-Chat-v1.0 | 46.16 | 30.03 | 61.53 | 46.56 | 24.72 | 25.80 | 74.21 | 60.29 |
| opt-1.3b | 42.94 | 23.37 | 57.44 | 41.49 | 24.86 | 23.20 | 71.49 | 58.72 |
| pythia-1b | 40.71 | 24.31 | 56.90 | 37.72 | 23.20 | 18.80 | 70.62 | 53.43 |
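
These are standard zero-/few-shot benchmarks. The blog does not state which evaluation harness produced the table, but a hypothetical reproduction with EleutherAI's lm-evaluation-harness (task names and settings below are assumptions, not Prem's published setup) would look roughly like this:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=premai-io/prem-1B-chat,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu", "openbookqa", "piqa", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)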

🔎 Future plans

  1. Improve our existing model. Our focus will likely be on adding more quality data during pre-training and finetuning.
  2. Improve model alignment. We will be exploring the model alignment techniques discussed above.
  3. We noticed a few rare cases where the model repeats its output during inference. We need to tackle this problem (a generation-time mitigation is sketched below).
  4. Models released by organizations usually have self-knowledge about the organization and its creators. This topic is still not discussed in detail in research papers, and we will be exploring this path and sharing the results.
  5. Recent open-source releases in the SLM space are around 1.6B-2B parameters. We will be exploring some architectures in that range in our next iteration.
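
For the repetition issue in point 3, a common generation-time stop-gap (independent of fixing the underlying training data) is to penalize repeated tokens and n-grams during decoding. The parameter values below are illustrative, not a tuned recommendation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("premai-io/prem-1B-chat")
model = AutoModelForCausalLM.from_pretrained("premai-io/prem-1B-chat", torch_dtype=torch.bfloat16)

inputs = tokenizer("Explain what a retrieval-augmented generation pipeline does.", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=120,
    repetition_penalty=1.2,    # discourage re-sampling recently generated tokens
    no_repeat_ngram_size=4,    # forbid exact 4-gram repeats
)
print(tokenizer.decode(out[0], skip_special_tokens=True))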

Besides Prem-1B, our research team at PremAI has worked extensively on building small, performant Text-to-SQL models. We also present Prem-1B-SQL, a 1.3B parameter model (fully fine-tuned from the DeepSeek 1.3B model) that is on par with GPT-3.5, Claude 2, and many larger open-source models (such as Llama 3 70B and Qwen 32B) on Text-to-SQL tasks.

Hitting 10K+ monthly downloads (as of November 2024)

Prem-1B-SQL is loved by the open-source community. It has passed 10K monthly downloads on Hugging Face and 8K+ downloads of the PremSQL library. Check out our Prem-1B-SQL release blog to learn more.

Prem-1B-SQL: Fully Local Performant SLM for Text to SQL
Last week, we open-sourced PremSQL, a local-first library for creating customised Text-to-SQL solutions.

🚀 Try it now!

Try it on Huggingface Chat: https://huggingface.co/premai-io/prem-1B-chat.

Or you can use the models right away with Hugging Face Transformers, either with the model and tokenizer directly or via pipelines.

With model and tokenizer:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("premai-io/prem-1B-chat")
model = AutoModelForCausalLM.from_pretrained('premai-io/prem-1B-chat', torch_dtype=torch.bfloat16)
model = model.to('cuda')

# Setup terminators
terminators = [tokenizer.eos_token_id, tokenizer.encode('<|eot_id|>', add_special_tokens=False)[0]]

# Prepare the prompt
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant. You should give concise responses to very simple questions, but provide thorough responses to more complex and open-ended questions."
    },
    {
        'role': 'user',
        'content': 'Help me understand machine learning.'
    }
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
inputs = tokenizer(prompt, return_attention_mask=False, return_tensors="pt", add_special_tokens=False)
input_ids = inputs['input_ids']
input_ids = input_ids.to(model.device)
res = model.generate(input_ids=input_ids, max_new_tokens=400, pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)
generated_text = tokenizer.decode(res[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(generated_text)

Using pipelines:

import torch
from transformers import pipeline

# Load the pipeline
pipe = pipeline("text-generation", model="premai-io/prem-1B-chat", torch_dtype=torch.bfloat16, device=0)

# Prepare prompt
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant. You should give concise responses to very simple questions, but provide thorough responses to more complex and open-ended questions."
    },
    {
        'role': 'user',
        'content': 'Help me understand machine learning.'
    }
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Setup terminators
terminators = [pipe.tokenizer.eos_token_id, pipe.tokenizer.encode('<|eot_id|>', add_special_tokens=False)[0]]

# Generate
outputs = pipe(prompt, max_new_tokens=400, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, pad_token_id=pipe.tokenizer.pad_token_id, eos_token_id=terminators)
print(outputs[0]["generated_text"][len(prompt):])

📚 References

RAG Strategies
Explores Retrieval-Augmented Generation (RAG) methods, detailing Naive, Advanced, and Modular RAG approaches, and introduces RAFT, a fine-tuning technique for optimizing LLMs for RAG tasks.
GitHub - jzhang38/TinyLlama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Meta Llama 2
Llama 2 was pretrained on publicly available online data sources. The fine-tuned model, Llama Chat, leverages publicly available instruction datasets and over 1 million human annotations.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
A selective state-space architecture that drops attention entirely, scales linearly with sequence length, and matches or exceeds similarly sized Transformers on language modeling.
LLM Mixture of Experts Explained
Explains Mixture of Experts (MoE) LLMs, their advantages and disadvantages, and how to calculate their parameter counts.
H2O-Danube-1.8B Technical Report
A series of 1.8B language models following core principles of Llama 2 and Mistral, released with supervised fine-tuned and DPO-aligned chat variants under Apache 2.0.
Stable LM 2 1.6B Technical Report
Details the data and training procedure behind the base and instruction-tuned StableLM 2 1.6B, at publication the state-of-the-art open model under 2B parameters.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Introduces phi-3-mini, a 3.8B parameter model trained on 3.3 trillion tokens of heavily filtered web and synthetic data, rivaling models such as Mixtral 8x7B and GPT-3.5 while being small enough to run on a phone.
GitHub - Lightning-AI/litdata
Transform datasets at scale and optimize them for fast AI model training.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Reparameterizes the RLHF reward model so the optimal policy can be extracted in closed form, replacing reinforcement learning with a simple classification loss while matching or exceeding PPO-based RLHF.
GitHub - huggingface/alignment-handbook
Robust recipes to align language models with human and AI preferences.