Llama3
Hardware Requirements
For a local deployment test of the 8B version of Llama3, the minimum hardware configuration is as follows (a quick GPU sanity check follows the list):
- CPU: Intel Core i7 or AMD equivalent (at least 4 cores)
- GPU: NVIDIA GeForce GTX 1060 or AMD Radeon RX 580 (at least 6 GB VRAM)
- RAM: at least 16 GB
- OS: Ubuntu 20.04 or later, or Windows 10 or later
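If you are not sure your machine meets these specs, a rough pre-flight check on the GPU side can be done with PyTorch (a minimal sketch; it assumes torch is already installed and only inspects the first CUDA device):
```python
import torch

# rough check against the minimum specs listed above
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 6:
        print("Warning: less than 6 GB VRAM; the 8B model may not fit without quantization.")
else:
    print("No CUDA GPU detected; inference will fall back to CPU and be very slow.")
```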
Llama3's deployment environment is fairly strict about package versions; pay attention to them, or you will hit all sorts of errors. The environment list is attached at the end (you can also find it in the GitHub repo linked at the top; my environment may contain packages that are not needed for plain deployment). The most important one is the transformers version, which must be newer than 4.39.0 (I used 4.40.1): Llama3 is fairly new, and older transformers releases do not include the Llama3 model and tokenizer. The other thing to watch is the PyTorch and CUDA versions (torch 2.1.0 + cu118), mainly because transformers has requirements on the CUDA version. Most of the errors I ran into during deployment were package version problems.
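Before loading anything, it can save time to confirm the environment by printing the installed versions and asserting the transformers constraint (a minimal sketch; packaging ships as a dependency of transformers):
```python
import torch
import transformers
from packaging import version

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# per the note above, transformers must be newer than 4.39.0 for Llama3 support
assert version.parse(transformers.__version__) > version.parse("4.39.0"), \
    "upgrade transformers: older releases lack the Llama3 model and tokenizer"
```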
About the Llama3 Model
Llama3 is an LLM open-sourced by Meta on April 18, 2024. Two versions, 8B and 70B, are currently available, and both support a maximum sequence length of 8192 tokens (GPT-4 supports 128K). Llama3 was pre-trained on two of Meta's custom-built 24K-GPU clusters with 15T tokens of training data, of which about 5% is non-English, so Llama3's Chinese ability is somewhat weaker. Meta considers Llama3 the strongest open-source large model currently available.
Running Locally
Download
First, install modelscope:
```shell
pip install modelscope
```
Then download the model to a local directory via Python:
```python
from modelscope import snapshot_download

# download the model weights from ModelScope into a local cache directory
model_dir = snapshot_download('LLM-Research/Llama-3.2-1B-Instruct', cache_dir='D://llama')
```
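snapshot_download returns the local path it downloaded into; printing it is the easiest way to get the exact directory for the loading step below (note that the cached folder name encodes the version with underscores, which is why the path used later looks like Llama-3___2-1B-Instruct rather than Llama-3.2-1B-Instruct):
```python
# print the cache path so it can be pasted into the loading script below
print(model_dir)
```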
Run
```shell
pip install transformers
pip install torch torchvision torchaudio
pip install accelerate
```
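The plain `pip install torch` line above pulls whatever build PyPI resolves to; if you want to pin the torch 2.1.0 + cu118 combination mentioned in the environment notes, installing from the PyTorch cu118 wheel index is one option (a sketch; double-check the matching torchvision/torchaudio versions for your platform on the PyTorch site):
```shell
# pin the CUDA 11.8 build of torch 2.1.0 mentioned above
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
```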
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# path returned by the ModelScope download step (a raw string keeps Windows
# backslashes from being read as escape sequences)
model_name = r"D:\llama\LLM-Research\Llama-3___2-1B-Instruct"

# load the tokenizer and the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# keep only the newly generated tokens (drop the prompt portion)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("content:", content)
```