Phi 3 – Small Yet Powerful Models from Microsoft


Introduction

The Phi models from Microsoft have been at the forefront of many open-source Large Language Models. The Phi architecture has led to many of the popular small open-source models that we see today, which include Phixtral, Phi-DPO, and others. The Phi family has taken LLM architecture a step forward with the introduction of Small Language Models, showing that these are enough to achieve different tasks. Now Microsoft has finally unveiled Phi 3, the next generation of Phi models, which further improves on the previous generation. We will go through Phi 3 in this article and test it with different prompts.

Learning Objectives

  • Understand the advancements in the Phi 3 model compared to previous iterations.
  • Learn about the different variants of the Phi 3 model.
  • Explore the improvements in context length and performance achieved by Phi 3.
  • Recognize the benchmarks where Phi 3 surpasses other popular language models.
  • Understand how to download, initialize, and use the Phi 3 mini model.

This article was published as a part of the Data Science Blogathon.

Phi 3 – The Next Iteration of the Phi Family

Recently, Microsoft released Phi 3, showcasing its commitment to open source in the field of Artificial Intelligence. Microsoft has released two variants of Phi 3: one with a 4k context length and the other with a 128k context length. Both of these have the same architecture and a size of 3.8 billion parameters, and are called the Phi 3 mini. Microsoft has also announced two larger variants, a 7 billion parameter version called the Phi 3 Small and a 14 billion parameter version called the Phi 3 Medium, though they are still in the training phase. All the Phi 3 models come with an instruct version and are thus ready to be deployed in chat applications.

Unique Features

  • Extended Context Length: Phi 3 increases the context length of the Large Language Model from 2k to 128k, facilitated by LongRope technology, with the default context length doubled to 4k.
  • Training Data Size and Quality: Phi 3 is trained on 3.3 trillion tokens, featuring larger and more advanced datasets compared to Phi 2.
  • Model Variants:
    • Phi 3 Mini: Trained on 3.3 trillion tokens, with a 32k vocabulary size, leveraging the same tokenizer as Llama 2.
    • Phi 3 Small (7B Version): Default context length of 8k, a 100k vocabulary size based on the tiktoken tokenizer, and Grouped Query Attention with 4 queries sharing 1 key to reduce the memory footprint.
  • Model Architecture: Incorporates Grouped Query Attention to optimize memory usage (a rough sketch of this saving follows the list); training starts with pretraining and moves to supervised fine-tuning, aligned with Direct Preference Optimization for AI-responsible outputs.
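To get an intuition for how Grouped Query Attention reduces the memory footprint, the sketch below compares the KV-cache size of standard multi-head attention with a 4-queries-per-key grouping. The layer count, head count, and head dimension are illustrative assumptions, not the published Phi 3 Small configuration; only the 4:1 query-to-key ratio and the 8k context length come from the list above.

num_layers = 32          # assumed number of transformer layers (illustrative)
num_query_heads = 32     # assumed number of query heads (illustrative)
head_dim = 128           # assumed dimension per head (illustrative)
context_len = 8192       # Phi 3 Small's default 8k context length
bytes_per_value = 2      # bfloat16 storage

def kv_cache_bytes(num_kv_heads: int) -> int:
    # keys + values, for every layer and every cached token
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value

mha = kv_cache_bytes(num_kv_heads=num_query_heads)       # every query head has its own KV head
gqa = kv_cache_bytes(num_kv_heads=num_query_heads // 4)  # 4 query heads share 1 KV head

print(f"Multi-Head Attention KV cache : {mha / 1e9:.2f} GB")
print(f"Grouped Query Attention cache : {gqa / 1e9:.2f} GB")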

Benchmarks – Phi 3

Coming to the benchmarks, the Phi 3 mini, i.e. the 3.8 billion parameter model, has overtaken the Gemma 7B from Google. It scores 68.8 on MMLU and 76.7 on HellaSwag, which exceeds Gemma's scores of 63.6 on MMLU and 49.8 on HellaSwag, and even the Mistral 7B model's scores of 61.7 on MMLU and 58.5 on HellaSwag. Phi 3 has even surpassed the recently released Llama 3 8B model on both of these benchmarks.

It even surpasses these and other models in other popular evaluation tests like WinoGrande, TruthfulQA, HumanEval, and others. In the table below, we can compare the scores of the Phi 3 family of models with other popular open-source large language models.

[Table: benchmark scores of the Phi 3 family compared with other popular open-source large language models]

Getting Started with Phi 3

To get started with Phi 3, we need to follow certain steps. Let us dive deeper into each step.

Step 1: Downloading Libraries

Let's start by downloading the following libraries.

!pip install -q transformers huggingface_hub bitsandbytes accelerate
  • transformers – We need this library to download the Large Language Models and work with them
  • huggingface_hub – We need this to log in to HuggingFace so that we can work with the official HuggingFace model (a quick login sketch follows this list)
  • bitsandbytes – We cannot directly run the 3.8 Billion model at full precision on the free GPU instance of Colab, hence we need this library to quantize the LLM to 4-bit
  • accelerate – We need this to speed up GPU inference for the Large Language Models
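If you need to authenticate with the Hugging Face Hub (for example, for gated models), one way to log in from a notebook is sketched below; Phi 3 mini itself is publicly available, so this step may not be strictly required:

from huggingface_hub import notebook_login

# Opens an interactive prompt where you can paste your Hugging Face access token
notebook_login()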

Now, before we start downloading the model, we need to define our quantization config. This is because we cannot load the entire full-precision model within the free Google Colab GPU, and even if we could fit it, the inference would be slow. So, we will quantize our model to 4-bit precision and then work with it.
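As a rough, back-of-the-envelope estimate of what that means in memory (these per-parameter byte counts are approximate and ignore activations and the KV cache):

# Approximate weight memory for the 3.8 billion parameter Phi 3 mini
params = 3.8e9

fp32_gb = params * 4 / 1e9    # ~4 bytes per parameter at full precision
fp16_gb = params * 2 / 1e9    # ~2 bytes per parameter in float16/bfloat16
int4_gb = params * 0.5 / 1e9  # ~0.5 bytes per parameter at 4-bit precision

print(f"float32 weights : ~{fp32_gb:.1f} GB")  # ~15.2 GB, too large for a free Colab GPU
print(f"float16 weights : ~{fp16_gb:.1f} GB")  # ~7.6 GB
print(f"4-bit weights   : ~{int4_gb:.1f} GB")  # ~1.9 GB, a comfortable fit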

Step 2: Defining the Quantization Config

The configuration for this quantization can be seen below:

import torch
from transformers import BitsAndBytesConfig


config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
  • Here we start by importing torch and the BitsAndBytesConfig class from the transformers library.
  • Then we create an instance of this BitsAndBytesConfig class and save it to the variable called config
  • While creating this instance, we give it the following parameters:
  • load_in_4bit: This tells that we want to quantize our model into a 4-bit precision format. This will greatly reduce the size of the model.
  • bnb_4bit_quant_type: This tells the type of 4-bit quantization we wish to work with. Here we go with the normal float, called nf4, which is proven to give better results.
  • bnb_4bit_use_double_quant: Setting this to True will quantize the quantization constants that are internal to BitsAndBytes, which will further reduce the size of the model.
  • bnb_4bit_compute_dtype: Here we tell what datatype we will be working with when computing the forward pass through the model. On Colab, we can set it to brain float16, called bfloat16, which tends to give better results than the regular float16.

Running this code will create our quantization configuration.

Step 3: Downloading the Model

Now, we are ready to download the model and quantize it with the above quantization configuration. The code for this will be:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
    quantization_config = config
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
  • Here we start by importing AutoModelForCausalLM and AutoTokenizer from the transformers library
  • Then we call AutoModelForCausalLM.from_pretrained() and pass it the model name "microsoft/Phi-3-mini-4k-instruct" (the Phi 3 mini Instruct version), the device map, which places the model on the GPU if one is present, and the quantization config that we have just created
  • In a similar way, we create a tokenizer object with the same model name

Running this code will download the Phi 3 mini 4k context instruct LLM and then quantize it to the 4-bit level based on the configuration that we have provided. After that, the tokenizer is downloaded as well.
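As an optional sanity check (not part of the original steps), you can confirm that the 4-bit quantization took effect by printing the model's memory footprint:

# Reports the approximate memory occupied by the model's weights, in GB
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")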

Step 4: Testing Phi-3-mini

Now we will test the Phi-3-mini model. For this, the code will be:

messages = [
    {"role": "user", "content": "A clock shows 12:00 p.m. now. How many degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "The minute hand moves 360 degrees in one hour (60 minutes). Therefore, in 15 minutes, it will move (15/60) * 360 degrees = 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand move in 15 minutes?"}
]

model_inputs = tokenizer.apply_chat_template(messages, 
return_tensors="pt").to("cuda")

output = model.generate(model_inputs, 
                               max_new_tokens=1000, 
                               do_sample=True)

decoded_output = tokenizer.batch_decode(output, 
                                       skip_special_tokens=True)
print(decoded_output[0])
  • First, we create a list of messages. This is a list of dictionaries, each containing two key-value pairs, where the keys are role and content.
  • The role tells whether the message is from the user or the assistant, and the content is the actual message
  • Here we create a conversation about the angles between the hands of a clock. In the last message from the user, we ask a question about the angle made by the hour hand.
  • Then we apply a chat template to this conversation. The chat template is necessary for the model to understand, because the instruct data the model is trained on contains this chat template formatting.
  • We need the corresponding tensors for this conversation, and we move them to CUDA for faster processing.
  • Now model_inputs contains our tokens and the corresponding attention mask.
  • These model_inputs are passed to the model.generate() function, which takes these tokens along with some additional parameters like the number of new tokens to generate, which we set to 1000, and do_sample, which samples from the high-probability tokens.
  • Finally, we decode the output generated by the Large Language Model to convert the tokens back into English text.

Hence, running this code will take in the list of messages, apply the proper formatting with the chat template, convert the conversation into tokens, pass them to the generate function to produce the response, and finally decode the generated tokens back into English text.
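If you want to see exactly what the chat template turns the conversation into before tokenization, you can render it as a plain string; this is an optional inspection step and is not required for generation:

# Render the conversation as text instead of token IDs, so we can inspect
# the special tokens that the chat template inserts around each message
formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(formatted_prompt)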

Output

Running this code produced the following output.

[Output: Phi 3's step-by-step answer to the hour-hand question]

Looking at the generated output, the model has answered the question correctly. We see a very detailed approach, similar to a chain of thought. Here the model starts by describing how the minute hand moves and how the hour hand moves per hour. From there, it calculated the necessary intermediate result, and then went on to solve the actual user question.

Implementation with Another Question

Now let's try with another question.

messages = [
    {"role": "user", "content": "If a plane crashes on the border of the United States and Canada, where do they bury the survivors?"},
]

model_inputs = tokenizer.apply_chat_template(messages, 
return_tensors="pt").to("cuda")

output = model.generate(model_inputs, 
                               max_new_tokens=1000, 
                               do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                       skip_special_tokens=True)
print(decoded_output[0])
"

Here, in the above example, we asked the Phi 3 LLM a tricky question, and it was able to provide a fairly convincing answer. The LLM got to the tricky part, that is, we cannot bury the survivors, because survivors are alive, hence there are no survivors at all to bury. Let's try giving another tricky question and check the generated output.

messages = [
    {"role": "user", "content": "How many smartphones can a human eat?"},
]

model_inputs = tokenizer.apply_chat_template(messages, 
return_tensors="pt").to("cuda")

output = model.generate(model_inputs, 
                               max_new_tokens=1000, 
                               do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                       skip_special_tokens=True)
print(decoded_output[0])
[Output: Phi 3's response to the smartphone question]

Here we asked Phi-3-mini another tricky question, about how many smartphones a human can eat. This tests the Large Language Model's common-sense ability. The Phi 3 LLM was able to catch this by pointing out that the question rests on a misunderstanding. This tells us that Phi-3-mini was well trained on a quality dataset containing a good mixture of common sense, reasoning, and maths.

Conclusion

Phi 3 represents Microsoft's next generation of Phi models, bringing significant advancements over Phi 2. It boasts a drastically increased context length, reaching up to 128k tokens with minimal performance impact. Additionally, Phi 3 is trained on a much larger and more comprehensive dataset compared to its predecessor. Benchmarks indicate that Phi 3 outperforms other popular models on various tasks, demonstrating its effectiveness. With its capability to handle tricky questions and apply common-sense reasoning, Phi 3 holds great promise for various applications.

Key Takeaways

  • Phi 3 performs well in practical scenarios, handling tricky and ambiguous questions effectively
  • Model Variants: Different versions of Phi 3 include Mini (3.8B), Small (7B), and Medium (14B), providing options for various use cases.
  • Phi 3 surpasses other open-source models in key benchmarks like MMLU and HellaSwag.
  • Compared to the previous model Phi 2, the default context length of Phi 3 is doubled to 4k, and with the LongRope method, the context length is further extended to 128k with very little degradation in performance
  • Phi 3 is trained on 3.3 trillion tokens from highly curated datasets; it was supervised fine-tuned and then aligned with Direct Preference Optimization

Regularly Requested Questions

Q1. What kind of prompts can I use with Phi 3?

A. Phi 3 models are trained on data with a specific chat template format. So, it is recommended to use the same format when providing prompts or questions to the model. This template can be applied by calling apply_chat_template.

Q2. What is Phi 3 and what models are part of its family?

A. Phi 3 is the next generation of Phi models from Microsoft, part of a family comprising Phi 3 Mini, Small, and Medium. The Mini version is a 3.8 billion parameter model, while the Small is a 7 billion parameter model and the Medium is a 14 billion parameter model.

Q3. Can I use Phi 3 for free?

A. Yes, Phi 3 models are available for free through the Hugging Face platform. Right now, only the Phi 3 mini, i.e. the 3.8 billion parameter model, is available on HuggingFace. This model can be used for commercial purposes too, based on the given license.

Q4. How well does Phi 3 handle tricky questions?

A. Phi 3 shows promising results with common-sense reasoning. The provided examples demonstrate that Phi 3 can answer tricky questions that involve humor or logic.

Q5. Are there any changes to the tokenizers in the new Phi family of models?

A. Yes. While the Phi 3 Mini still works with the regular Llama 2 tokenizer, having a vocabulary size of 32k, the new Phi 3 Small model gets a new tiktoken-based tokenizer, where the vocabulary size is extended to 100k tokens.
