Llama3
Hardware Requirements
For a local deployment test of the 8B version of Llama3, the minimum hardware configuration is as follows (a quick GPU sanity check follows the list):
- CPU: Intel Core i7 or AMD equivalent (at least 4 cores)
- GPU: NVIDIA GeForce GTX 1060 or AMD Radeon RX 580 (at least 6 GB VRAM)
- RAM: at least 16 GB
- OS: Ubuntu 20.04 or later, or Windows 10 or later
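If you are not sure your machine meets these specs, a rough pre-flight check on the GPU side can be done with PyTorch (a minimal sketch; it assumes torch is already installed and only inspects the first CUDA device):
```python
import torch

# rough check against the minimum specs listed above
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 6:
        print("Warning: less than 6 GB VRAM; the 8B model may not fit without quantization.")
else:
    print("No CUDA GPU detected; inference will fall back to CPU and be very slow.")
```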
Llama3's deployment environment is fairly strict about package versions; pay attention to them, or you will hit all sorts of errors. The environment list is attached at the end (you can also find it in the GitHub repo linked at the top; my environment may contain packages that are not needed for plain deployment). The most important one is the transformers version, which must be newer than 4.39.0 (I used 4.40.1): Llama3 is fairly new, and older transformers releases do not include the Llama3 model and tokenizer. The other thing to watch is the PyTorch and CUDA versions (torch 2.1.0 + cu118), mainly because transformers has requirements on the CUDA version. Most of the errors I ran into during deployment were package version problems.
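Before loading anything, it can save time to confirm the environment by printing the installed versions and asserting the transformers constraint (a minimal sketch; packaging ships as a dependency of transformers):
```python
import torch
import transformers
from packaging import version

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# per the note above, transformers must be newer than 4.39.0 for Llama3 support
assert version.parse(transformers.__version__) > version.parse("4.39.0"), \
    "upgrade transformers: older releases lack the Llama3 model and tokenizer"
```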
About the Llama3 Model
Llama3 is an LLM open-sourced by Meta on April 18, 2024. Two versions, 8B and 70B, are currently available, and both support a maximum sequence length of 8192 tokens (GPT-4 supports 128K). Llama3 was pre-trained on two of Meta's custom-built 24K-GPU clusters with 15T tokens of training data, of which about 5% is non-English, so Llama3's Chinese ability is somewhat weaker. Meta considers Llama3 the strongest open-source large model currently available.
Running Locally
Download
First, install modelscope:
```shell
pip install modelscope
```
Then download the model to a local directory via Python:
```python
from modelscope import snapshot_download

# download the model weights from ModelScope into a local cache directory
model_dir = snapshot_download('LLM-Research/Llama-3.2-1B-Instruct', cache_dir='D://llama')
```
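snapshot_download returns the local path it downloaded into; printing it is the easiest way to get the exact directory for the loading step below (note that the cached folder name encodes the version with underscores, which is why the path used later looks like Llama-3___2-1B-Instruct rather than Llama-3.2-1B-Instruct):
```python
# print the cache path so it can be pasted into the loading script below
print(model_dir)
```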
Run
```shell
pip install transformers
pip install torch torchvision torchaudio
pip install accelerate
```
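The plain `pip install torch` line above pulls whatever build PyPI resolves to; if you want to pin the torch 2.1.0 + cu118 combination mentioned in the environment notes, installing from the PyTorch cu118 wheel index is one option (a sketch; double-check the matching torchvision/torchaudio versions for your platform on the PyTorch site):
```shell
# pin the CUDA 11.8 build of torch 2.1.0 mentioned above
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
```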
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# path returned by the ModelScope download step (a raw string keeps Windows
# backslashes from being read as escape sequences)
model_name = r"D:\llama\LLM-Research\Llama-3___2-1B-Instruct"

# load the tokenizer and the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# keep only the newly generated tokens (drop the prompt portion)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("content:", content)
```