XVERSE-MoE-A36B

🤗 Hugging Face | ModelScope | 💬 WeChat

中文 | English

Update Information

  • [2024/09/13] Released the XVERSE-MoE-A36B MoE base model; the Chat version will be released later.

Model Introduction

XVERSE-MoE-A36B is a multilingual large language model independently developed by Shenzhen Yuanxiang Technology, built on a Mixture-of-Experts (MoE) architecture. The model has 255 billion total parameters, of which 36 billion are activated per token. The model released this time is the base model XVERSE-MoE-A36B. Its key features are as follows:

  • Model Structure: XVERSE-MoE-A36B uses the mainstream Decoder-only Transformer architecture and extends the FFN layer of a dense model into expert layers. Unlike traditional MoE models in which each expert is the same size as a standard FFN (e.g. Mixtral 8x7B), it uses more fine-grained experts, each 1/4 the size of a standard FFN. The experts are divided into shared and non-shared experts: shared experts are always activated during computation, while non-shared experts are selectively activated by a router (see the routing sketch after this list).
  • Training Data: The model is thoroughly trained on a large-scale, high-quality dataset covering more than 40 languages, including Chinese, English, Russian, and Spanish. The sampling ratios of the different data types are finely tuned, giving excellent Chinese and English performance while still covering other languages well. The model is trained on samples of length 8K. During training, the data mix was switched several times to dynamically introduce continuously processed high-quality data, with corresponding adjustments to the sampling ratios.
  • Training Strategy: Each time the data mix was switched, the learning rate scheduler was adjusted accordingly so that the model could quickly and thoroughly learn from the newly introduced data.
  • Training Framework: We performed deeply customized optimization for the MoE model's unique expert-routing and weight-computation logic and developed efficient fused operators to improve computational efficiency. To address the MoE model's high memory consumption and communication volume, we designed a scheme that overlaps computation, communication, and CPU offloading to increase overall throughput.
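
A minimal, illustrative sketch of the routing pattern described above (shared experts applied to every token, top-k selection over the fine-grained non-shared experts) is given below. It is not the released model code: the class name FineGrainedMoE is hypothetical, the sizes come from the table below, and the SwiGLU-style expert FFN, softmax gating, and token-by-token loop are simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    """Illustrative MoE layer: shared experts always active, top-k routing over non-shared experts."""
    def __init__(self, d_model=6144, d_ff=4096, n_non_shared=64, n_shared=2, top_k=6):
        super().__init__()
        def make_expert():  # each fine-grained expert is a small FFN (gating style assumed)
            return nn.Sequential(nn.Linear(d_model, d_ff, bias=False),
                                 nn.SiLU(),
                                 nn.Linear(d_ff, d_model, bias=False))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.experts = nn.ModuleList(make_expert() for _ in range(n_non_shared))
        self.router = nn.Linear(d_model, n_non_shared, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (num_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)       # shared experts see every token
        scores = F.softmax(self.router(x), dim=-1)           # router scores the non-shared experts
        weights, indices = scores.topk(self.top_k, dim=-1)   # keep the top-k experts per token
        for t in range(x.size(0)):                           # token-by-token loop, for clarity only
            for w, i in zip(weights[t].tolist(), indices[t].tolist()):
                out[t] = out[t] + w * self.experts[i](x[t])
        return out

In a real implementation a fused kernel would batch tokens by expert instead of looping over them, which is exactly the routing and weight-computation logic the Training Framework item above says was custom-optimized.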

The model size, architecture, and learning rate of XVERSE-MoE-A36B are shown below:

| total params | activated params | n_layers | d_model | n_heads | d_ff | n_non_shared_experts | n_shared_experts | top_k | lr |
|---|---|---|---|---|---|---|---|---|---|
| 255.4B | 36.5B | 50 | 6144 | 48 | 4096 | 64 | 2 | 6 | 2.5e−4 |
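
As a rough sanity check on these figures, each token activates the 2 shared experts plus the top-6 of the 64 non-shared experts in every layer. The back-of-the-envelope below assumes a gated FFN with three d_model × d_ff projections per expert and plain multi-head attention, and ignores embeddings and norms, so it only approximates the official 255.4B / 36.5B numbers.

# Rough parameter-count sanity check based on the table above (assumptions:
# gated FFN with 3 projections per expert, vanilla attention; embeddings/norms ignored).
n_layers, d_model, d_ff = 50, 6144, 4096
n_non_shared, n_shared, top_k = 64, 2, 6

expert_params = 3 * d_model * d_ff            # ~75.5M parameters per fine-grained expert
attn_params = 4 * d_model * d_model           # Q, K, V, O projections

total = n_layers * ((n_non_shared + n_shared) * expert_params + attn_params)
active = n_layers * ((top_k + n_shared) * expert_params + attn_params)
print(f"~{total / 1e9:.0f}B total, ~{active / 1e9:.0f}B activated")   # ~257B total, ~38B activated

Both estimates land within a few percent of the reported values; the remaining gap presumably comes from embeddings, norms, and attention details that this sketch ignores.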

Model Evaluation

To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including MMLU, C-Eval, CMMLU, RACE-M, PIQA, GSM8K, MATH, MBPP, and HumanEval, and compared the results with open-weight MoE and dense (base) models of similar parameter scale, as well as closed-source chat models. The results are as follows:

Comparison of Open-Weight Base Models - MoE

| | XVERSE-MoE-A36B | Grok-1-A85B | DeepSeek-V2-A21B | Skywork-MoE-A22B | Mixtral-8x22B-A39B | DBRX-A36B |
|---|---|---|---|---|---|---|
| Total Params | 255B | 314B | 236B | 146B | 141B | 132B |
| MMLU | 80.8 | 73 | 78.5 | 77.4 | 77.8 | 73.7 |
| C-Eval | 79.5 | - | 81.7 | 82.2 | 56.8 | 44.9 |
| CMMLU | 81.7 | - | 84 | 79.5 | 59.9 | 61.3 |
| GSM8K | 89.5 | 62.9 | 79.2 | 76.1 | 82.3 | 70.7 |
| MATH | 53.3 | 23.9 | 43.6 | 31.9 | 34.1 | 25.6 |
| HumanEval | 51.8 | 63.2 | 48.8 | 43.9 | 45.1 | 46.3 |
| MBPP | 59.8 | - | 66.6 | - | 71.2 | 58 |
| PIQA | 84.8 | - | 83.7 | - | 84.1 | 84.5 |
| RACE-M | 88.4 | - | 73.1 | - | 85.7 | 55.9 |

Comparison of Open-Weight Base Models - Dense

| | XVERSE-MoE-A36B | XVERSE-65B-2 | Llama3.1-405B | Nemotron-4-340B | Qwen1.5-110B | Qwen2-72B | Qwen1.5-72B | Llama3.1-70B |
|---|---|---|---|---|---|---|---|---|
| Total Params | 255B | 65B | 405B | 340B | 110B | 72B | 72B | 70B |
| MMLU | 80.8 | 74.4 | 85.2 | 81.1 | 80.4 | 84.2 | 77.5 | 79.3 |
| C-Eval | 79.5 | 72.4 | - | - | 89.1 | 91 | 84.1 | - |
| CMMLU | 81.7 | 75.1 | - | - | 88.3 | 90.1 | 83.5 | - |
| GSM8K | 89.5 | 72.6 | 89 | - | 85.4 | 89.5 | 79.5 | 83.7 |
| MATH | 53.3 | 20.8 | 53.8 | - | 49.6 | 51.1 | 34.1 | 41.4 |
| HumanEval | 51.8 | 37.8 | 61 | 57.3 | 54.3 | 64.6 | 46.3 | 58.5 |
| MBPP | 59.8 | 40.6 | 73.4 | - | 70.9 | 76.9 | 66.9 | 66.2 |
| PIQA | 84.8 | 79.4 | 85.6 | - | - | - | - | 83.8 |
| RACE-M | 88.4 | 90.7 | - | - | - | - | - | - |

Comparison of Closed-Source Chat Models

| | XVERSE-MoE-A36B | GPT-4o | abab-6.5-20240415 | Step-2 | Baichuan3 | GLM-4 (0520) |
|---|---|---|---|---|---|---|
| Total Params | 255B | - | Trillion scale | Trillion scale | Hundred billion scale | - |
| MMLU | 80.8 | 88.7 | 78.7 | - | 81.7 | 83.3 |
| C-Eval | 79.5 | - | - | - | - | - |
| CMMLU | 81.7 | - | - | - | 78.1 | - |
| GSM8K | 89.5 | - | 91.7 | 94 | 88.2 | 93.3 |
| MATH | 53.3 | 76.6 | 51.3 | 68.4 | 49.2 | 61.3 |
| HumanEval | 51.8 | 90.2 | 78 | 84.1 | 70.1 | 78.5 |
| MBPP | 59.8 | - | - | - | 68.2 | - |
| PIQA | 84.8 | - | - | - | - | - |
| RACE-M | 88.4 | - | - | - | - | - |

For all the comparison models mentioned above, we report the maximum value between their official results and our self-evaluation results.

Usage

Environment Setup

  1. Clone this repository:
git clone /~https://github.com/xverse-ai/XVERSE-MoE-A36B
cd XVERSE-MoE-A36B
  2. Install the dependencies using pip:
pip install -r requirements.txt

Loading with Transformers

The XVERSE-MoE-A36B model can be loaded for inference using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model. trust_remote_code is required for the custom MoE code;
# bfloat16 with device_map='auto' spreads the weights across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-MoE-A36B")
model = AutoModelForCausalLM.from_pretrained("xverse/XVERSE-MoE-A36B", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
model = model.eval()

# Chinese example prompt: "Attractions in Beijing: the Forbidden City, the Temple of Heaven, the Great Wall, etc.\nAttractions in Shenzhen:"
inputs = tokenizer('北京的景点:故宫、天坛、万里长城等。\n深圳的景点:', return_tensors='pt').input_ids
inputs = inputs.cuda()
generated_ids = model.generate(inputs, max_new_tokens=70, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.1)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
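
For interactive use, the same model and tokenizer can also stream tokens as they are generated with the standard transformers TextStreamer utility. The snippet below is a minimal sketch that reuses the objects loaded above; the English prompt is only an example.

from transformers import TextStreamer

# Print tokens to stdout as they are generated instead of waiting for the full completion.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer('An introduction to mixture-of-experts models:', return_tensors='pt').input_ids.cuda()
model.generate(inputs, max_new_tokens=128, streamer=streamer,
               eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.1)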

Web Demo

The following command starts a web server. Enter the server address in a browser to run inference with the XVERSE-MoE-A36B model:

python chat_demo.py --port='port' --model_path='/path/to/model/' --tokenizer_path='/path/to/tokenizer/'

Limitations and Disclaimer

Like all other Large Language Models (LLMs), XVERSE-MoE-A36B may produce inaccurate, biased, or otherwise offensive content under certain circumstances. Therefore, please use the content generated by the model with caution and refrain from disseminating harmful content. Before deploying any application of XVERSE-MoE-A36B, developers should conduct safety tests and optimization of the model according to its specific application.

We strongly warn against the use of the XVERSE-MoE-A36B model for producing or spreading harmful information, or conducting any activities that might harm the public, national, or social security, or violate regulations. We assume no responsibility for any problems arising from the use of the XVERSE-MoE-A36B model, whether it be data security issues, public opinion risks, or any risks and issues caused by misunderstanding, misuse, dissemination, or non-compliance with the model.

Open Source License

The use of the source code in this repository must follow the Apache-2.0 open-source license, while the use of the model weights of XVERSE-MoE-A36B needs to adhere to the Model License Agreement.

The XVERSE-MoE-A36B model weights are fully open to academic research and support free commercial use. To apply for a commercial license, please fill in the application form. For other questions or collaborations, please contact opensource@xverse.cn.