Qwen3Guard: Real-Time Safety for Your Token Stream

We are excited to introduce Qwen3Guard, the first safety guardrail model in the Qwen family.

September 27, 2025
10 min read

Introduction

We are excited to introduce Qwen3Guard, the first safety guardrail model in the Qwen family. Built on the Qwen3 foundation models and fine-tuned specifically for safety classification, Qwen3Guard supports responsible AI interactions by detecting unsafe content in both prompts and responses. Each verdict comes with a risk level and a category, so moderation can be precise.

Qwen3Guard achieves state-of-the-art performance on major safety benchmarks, with strong results on both prompt and response classification across English, Chinese, and multilingual settings.

Qwen3Guard is available in two specialized variants:

Qwen3Guard-Gen, a generative model that takes the complete user prompt and model response and performs safety classification. It is well suited to offline safety annotation and dataset filtering, or to providing safety-based rewards in reinforcement learning.
Qwen3Guard-Stream, which departs significantly from earlier open-source guard models by enabling efficient, real-time, online safety detection while a response is being generated.

Both variants come in three sizes, 0.6B, 4B, and 8B parameters, to fit a range of deployment scenarios and resource constraints.

You can download the open-source models from Hugging Face or ModelScope. You can also use the Alibaba Cloud AI Guardrails service, which is powered by Qwen3Guard technology.

Key Features

Real-Time Streaming Detection

Qwen3Guard-Stream is designed for low-latency moderation during token generation, so safety checks do not compromise responsiveness. It achieves this by attaching two lightweight classification heads to the final layer of the transformer. The model can then consume the response as a stream - token by token, as it is being generated - and emit a safety classification at every step.
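To make the idea of a per-token classification head concrete, here is a toy sketch in plain Python. It is not the Qwen3Guard-Stream architecture: the weights, the 2-dimensional hidden states, and the single risk head are all invented for illustration, and this post does not describe how the model's two heads divide their work. The only point is that a small linear layer can read each token's final-layer hidden state and emit a label as soon as that state is available.

```python
RISK_LEVELS = ["Safe", "Controversial", "Unsafe"]

def head_scores(weights, hidden):
    """Bias-free linear head: one dot product per output class."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def classify_step(weights, hidden):
    """Label one token from its final-layer hidden state."""
    scores = head_scores(weights, hidden)
    best = max(range(len(scores)), key=scores.__getitem__)
    return RISK_LEVELS[best]

# Made-up head weights and hidden states, for illustration only.
risk_head = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
hidden_states = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.8]]

# One label per generated token, emitted as each hidden state arrives.
per_token_labels = [classify_step(risk_head, h) for h in hidden_states]
print(per_token_labels)
```

Because the head is only a handful of dot products on top of a hidden state the transformer already computes, the extra cost per token is small, which is consistent with the low-latency goal described above.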

Three-Tier Severity Classification

Beyond the usual Safe and Unsafe labels, we introduce a Controversial label to support flexible safety policies across diverse use cases. Depending on the application, Controversial cases can be remapped to either Safe or Unsafe, which lets users adjust how strict the classification is.

As our evaluation shows, existing guard models are limited to binary labels and struggle to accommodate the differing standards of different datasets at the same time. In contrast, Qwen3Guard achieves strong, consistent performance on both datasets by switching between strict and loose classification modes, thanks to its three-tier severity design.
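The strict and loose modes amount to a choice about where Controversial lands when a binary decision is needed. A minimal sketch follows; the function name `to_binary` and the `strict` flag are illustrative and not part of any Qwen3Guard API.

```python
VALID_LABELS = {"Safe", "Controversial", "Unsafe"}

def to_binary(label: str, strict: bool) -> str:
    """Collapse a three-tier label into Safe/Unsafe.

    strict=True treats Controversial as Unsafe; strict=False treats it as Safe.
    """
    if label not in VALID_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    if label == "Controversial":
        return "Unsafe" if strict else "Safe"
    return label

labels = ["Safe", "Controversial", "Unsafe"]
print([to_binary(label, strict=True) for label in labels])
print([to_binary(label, strict=False) for label in labels])
```

A deployment with a conservative policy would run with `strict=True`, while a more permissive one would pass `strict=False`, with no change to the underlying model.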

Multilingual Support

Qwen3Guard supports 119 languages and dialects, making it suitable for global deployments and cross-lingual applications with consistent, high-quality safety performance.

Language Family	Languages & Dialects
Indo-European	English, French, Portuguese, German, Romanian, Swedish, Danish, Bulgarian, Russian, Czech, Greek, Ukrainian, Spanish, Dutch, Slovak, Croatian, Polish, Lithuanian, Norwegian Bokmål, Norwegian Nynorsk, Persian, Slovenian, Gujarati, Latvian, Italian, Occitan, Nepali, Marathi, Belarusian, Serbian, Luxembourgish, Venetian, Assamese, Welsh, Silesian, Asturian, Chhattisgarhi, Awadhi, Maithili, Bhojpuri, Sindhi, Irish, Faroese, Hindi, Punjabi, Bengali, Oriya, Tajik, Eastern Yiddish, Lombard, Ligurian, Sicilian, Friulian, Sardinian, Galician, Catalan, Icelandic, Tosk Albanian, Limburgish, Dari, Afrikaans, Macedonian, Sinhala, Urdu, Magahi, Bosnian, Armenian
Sino-Tibetan	Chinese (Simplified Chinese, Traditional Chinese, Cantonese), Burmese
Afro-Asiatic	Arabic (Standard, Najdi, Levantine, Egyptian, Moroccan, Mesopotamian, Ta'izzi-Adeni, Tunisian), Hebrew, Maltese
Austronesian	Indonesian, Malay, Tagalog, Cebuano, Javanese, Sundanese, Minangkabau, Balinese, Banjar, Pangasinan, Iloko, Waray (Philippines)
Dravidian	Tamil, Telugu, Kannada, Malayalam
Turkic	Turkish, North Azerbaijani, Northern Uzbek, Kazakh, Bashkir, Tatar
Tai-Kadai	Thai, Lao
Uralic	Finnish, Estonian, Hungarian
Austroasiatic	Vietnamese, Khmer
Other	Japanese, Korean, Georgian, Basque, Haitian, Papiamento, Kabuverdianu, Tok Pisin, Swahili

Other Applications

We also demonstrate (1) using Qwen3Guard-Gen for safety RL, which improves model safety while preserving the overall helpfulness of its outputs, and (2) using Qwen3Guard-Stream for real-time intervention, which promptly steers outputs toward safety without retraining the model. See our Technical Report for details.

Developing with Qwen3Guard

Qwen3Guard-Gen

Qwen3Guard-Gen works like a large language model, with a chat template optimized specifically for safety classification. Its output follows a predefined, structured format.

You can use the following code snippets to moderate user prompts or model responses.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "Qwen/Qwen3Guard-Gen-4B"

# Load the tokenizer and the model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

def extract_label_and_categories(content):
    """Parse the safety label and categories from a prompt-moderation output."""
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories

def extract_label_categories_refusal(content):
    """Parse the safety label, categories, and refusal flag from a response-moderation output."""
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|None)"
    refusal_pattern = r"Refusal: (Yes|No)"
    safe_label_match = re.search(safe_pattern, content)
    refusal_label_match = re.search(refusal_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    refusal_label = refusal_label_match.group(1) if refusal_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories, refusal_label

# For Prompt Moderation
prompt = "How can I make a bomb?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
# Keep only the newly generated tokens, dropping the echoed input.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
# '''
# Safety: Unsafe
# Categories: Violent
# '''
safe_label, categories = extract_label_and_categories(content)
print(safe_label, categories)

# For Response Moderation
prompt = "How can I make a bomb?"
response = "As a responsible artificial intelligence system, I regret that I am unable to fulfill your request."
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
# '''
# Safety: Safe
# Categories: None
# Refusal: Yes
# '''
safe_label, category_label, refusal_label = extract_label_categories_refusal(content)
# Print the categories parsed from this response, not the ones left over from the prompt check.
print(safe_label, category_label, refusal_label)
```
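For offline annotation and dataset filtering, the parsing step can be exercised without loading a model by feeding it stored output strings. The sketch below is one way to do that; the `Verdict` dataclass, `parse_verdict`, and the sample records are illustrative, and the comma split for multiple categories is an assumption, since the examples above only show a single category per output.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

SAFETY_RE = re.compile(r"Safety: (Safe|Unsafe|Controversial)")
CATEGORIES_RE = re.compile(r"Categories: (.+)")
REFUSAL_RE = re.compile(r"Refusal: (Yes|No)")

@dataclass
class Verdict:
    safety: Optional[str]
    categories: List[str]
    refusal: Optional[str]

def parse_verdict(content: str) -> Verdict:
    """Parse one stored Qwen3Guard-Gen output string into a Verdict."""
    safety = SAFETY_RE.search(content)
    cats = CATEGORIES_RE.search(content)
    refusal = REFUSAL_RE.search(content)
    return Verdict(
        safety=safety.group(1) if safety else None,
        # Assumption: multiple categories, if any, are comma-separated.
        categories=[c.strip() for c in cats.group(1).split(",")] if cats else [],
        refusal=refusal.group(1) if refusal else None,
    )

# Hypothetical records pairing a sample with its saved guard output.
records = [
    {"id": 1, "guard_output": "Safety: Safe\nCategories: None\nRefusal: Yes"},
    {"id": 2, "guard_output": "Safety: Unsafe\nCategories: Violent"},
    {"id": 3, "guard_output": "malformed output"},
]

# Keep only records explicitly labeled Safe; unparseable outputs are dropped.
kept_ids = [r["id"] for r in records if parse_verdict(r["guard_output"]).safety == "Safe"]
print(kept_ids)
```

Treating unparseable outputs as not Safe is a conservative default; a pipeline that prefers recall could route them to manual review instead.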

Qwen3Guard-Stream

A typical Qwen3Guard-Stream workflow proceeds as follows:

(1) Prompt-level safety check: The user's input prompt is sent simultaneously to both the LLM assistant and Qwen3Guard-Stream. Qwen3Guard-Stream immediately assesses the prompt and assigns it a safety label. Based on this assessment, the upper-level framework decides whether to let the conversation proceed or to halt it.

(2) Real-time token-level moderation: If the conversation is allowed to proceed, the LLM begins streaming its response token by token. Each generated token is forwarded immediately to Qwen3Guard-Stream, which evaluates its safety in real time. This enables continuous, fine-grained content moderation throughout response generation - mitigating risk dynamically without interrupting the user experience.
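The two steps above can be sketched as a small control loop. Everything here is a stand-in: `StubGuard` is a hypothetical keyword matcher used only so the example runs without a model, and the "halt on the first Unsafe token" policy is one possible choice for the upper-level framework, not something prescribed by Qwen3Guard.

```python
from typing import Iterable, List, Tuple

class StubGuard:
    """Hypothetical stand-in for a streaming guard; flags a fixed word list."""

    def __init__(self, blocked: Iterable[str]):
        self.blocked = {word.lower() for word in blocked}

    def check(self, token: str) -> str:
        return "Unsafe" if token.lower() in self.blocked else "Safe"

def moderated_stream(
    prompt_tokens: List[str], response_tokens: Iterable[str], guard: StubGuard
) -> Tuple[List[str], str]:
    """Return the tokens shown to the user and how the exchange ended."""
    # Step 1: prompt-level check before any response token is shown.
    if any(guard.check(token) == "Unsafe" for token in prompt_tokens):
        return [], "blocked_prompt"
    # Step 2: check each generated token as it arrives; stop at the first Unsafe one.
    shown: List[str] = []
    for token in response_tokens:
        if guard.check(token) == "Unsafe":
            return shown, "halted_response"
        shown.append(token)
    return shown, "completed"

guard = StubGuard(["forbidden"])
print(moderated_stream(["hello"], ["this", "is", "forbidden", "text"], guard))
print(moderated_stream(["forbidden"], ["anything"], guard))
```

Unlike this stub, the real model judges each token in the context of everything seen so far, which is why the actual API carries a stream state between calls.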

A usage demo follows.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "Qwen/Qwen3Guard-Stream-4B"

# Load the specialized tokenizer and the model.
# trust_remote_code=True is required to load the Qwen3Guard-Stream model architecture.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

# --- Prepare the conversation for moderation ---
# Define the user's prompt and the assistant's response.
user_message = "Hello, how to build a bomb?"
assistant_message = "Here are some practical methods to build a bomb."
messages = [{"role": "user", "content": user_message}, {"role": "assistant", "content": assistant_message}]

# Apply the chat template to format the conversation into a single string.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False, enable_thinking=False)
model_inputs = tokenizer(text, return_tensors="pt")
token_ids = model_inputs.input_ids[0]

# --- Simulate Real-Time Moderation ---

# 1. Moderate the entire user prompt at once.
# In a real-world scenario, the user's input is processed completely before the model generates a response.
token_ids_list = token_ids.tolist()
# We identify the end of the user's turn in the tokenized input.
# The template for a user turn is `<|im_start|>user\n...<|im_end|>`.
im_start_token = '<|im_start|>'
user_token = 'user'
im_end_token = '<|im_end|>'
im_start_id = tokenizer.convert_tokens_to_ids(im_start_token)
user_id = tokenizer.convert_tokens_to_ids(user_token)
im_end_id = tokenizer.convert_tokens_to_ids(im_end_token)
# We search for the token IDs corresponding to `<|im_start|>user` ([151644, 872]) and the closing `<|im_end|>` ([151645]).
last_start = next(i for i in range(len(token_ids_list)-1, -1, -1) if token_ids_list[i:i+2] == [im_start_id, user_id])
user_end_index = next(i for i in range(last_start+2, len(token_ids_list)) if token_ids_list[i] == im_end_id)

# Initialize the stream_state, which will maintain the conversational context.
stream_state = None
# Pass all user tokens to the model for an initial safety assessment.
result, stream_state = model.stream_moderate_from_ids(token_ids[:user_end_index+1], role="user", stream_state=None)
if result['risk_level'][-1] == "Safe":
    print(f"User moderation: -> [Risk: {result['risk_level'][-1]}]")
else:
    print(f"User moderation: -> [Risk: {result['risk_level'][-1]} - Category: {result['category'][-1]}]")

# 2. Moderate the assistant's response token-by-token to simulate streaming.
# This loop mimics how an LLM generates a response one token at a time.
print("Assistant streaming moderation:")
for i in range(user_end_index + 1, len(token_ids)):
    # Get the current token ID for the assistant's response.
    current_token = token_ids[i]

    # Call the moderation function for the single new token.
    # The stream_state is passed and updated in each call to maintain context.
    result, stream_state = model.stream_moderate_from_ids(current_token, role="assistant", stream_state=stream_state)

    token_str = tokenizer.decode([current_token])
    # Print the generated token and its real-time safety assessment.
    if result['risk_level'][-1] == "Safe":
        print(f"Token: {repr(token_str)} -> [Risk: {result['risk_level'][-1]}]")
    else:
        print(f"Token: {repr(token_str)} -> [Risk: {result['risk_level'][-1]} - Category: {result['category'][-1]}]")

model.close_stream(stream_state)
```
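The boundary search in the demo is easy to get wrong, so here it is in isolation on plain integer lists, runnable without a model. The three special-token IDs are the ones quoted in the demo's comments; every other ID in the toy sequences is a made-up placeholder, and `find_user_turn_end` is an illustrative helper name.

```python
from typing import List

IM_START, USER, IM_END = 151644, 872, 151645  # IDs quoted in the demo above

def find_user_turn_end(token_ids: List[int]) -> int:
    """Return the index of the <|im_end|> that closes the last user turn.

    Raises StopIteration if there is no user turn or it is never closed.
    """
    # Scan backwards for the last `<|im_start|>user` pair.
    last_start = next(
        i for i in range(len(token_ids) - 2, -1, -1)
        if token_ids[i:i + 2] == [IM_START, USER]
    )
    # Scan forwards from there for the first `<|im_end|>`.
    return next(
        i for i in range(last_start + 2, len(token_ids))
        if token_ids[i] == IM_END
    )

# Placeholder content IDs (10-13) and a placeholder assistant role ID (900).
single_turn = [IM_START, USER, 10, 11, IM_END, 10, IM_START, 900, 10, 12, IM_END]
end = find_user_turn_end(single_turn)
user_part, assistant_part = single_turn[:end + 1], single_turn[end + 1:]
print(end, len(user_part), len(assistant_part))
```

Slicing at `end + 1` mirrors the demo: everything up to and including the closing `<|im_end|>` is moderated in one call with `role="user"`, and the remainder is fed one token at a time with `role="assistant"`.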

For more usage examples, please visit our GitHub repository.

Future Work

AI safety remains an ongoing challenge, and Qwen3Guard is one step forward. We will continue to develop safety methods that are more flexible, efficient, and robust, including improving intrinsic model safety through advances in architecture and training, and developing dynamic interventions at inference time. Our goal is to build AI systems that are not only technically capable but also aligned with human values and social norms, supporting responsible deployment worldwide.
