# Ryujin 3.5
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ryujin-3.5-35b-moe"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
| Benchmark | Ryujin 3.5 (6B active) | LLaMA 3 (8B dense) | GPT-3.5 Turbo |
| :--- | :--- | :--- | :--- |
| | 72.4% | 66.5% | 69.8% |
| HumanEval (Code) | 68.2% | 62.1% | 64.5% |
| Inference Speed (t/s) | 110 | 85 | 90 |
| VRAM (4-bit) | 18 GB | 6 GB | N/A (closed) |
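As a quick sanity check on the throughput row above, steady-state decode time is simply token count divided by rate (this ignores prompt prefill and batching):

```python
# Rough arithmetic behind the inference-speed column: at a steady decode
# rate of R tokens/s, emitting N tokens takes N / R seconds.
def decode_time(n_tokens: int, tokens_per_s: float) -> float:
    return n_tokens / tokens_per_s

print(f"{decode_time(512, 110):.1f} s")  # roughly 4.7 s for a 512-token reply
```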
Works best with vLLM for production serving (it supports MoE expert parallelism) or llama.cpp (with MoE kernels) for CPU inference.

## Ryujin 3.5 vs. The Competition

| Feature | Ryujin 3.5 | Mixtral 8x7B | DeepSeek-V2 |
| :--- | :--- | :--- | :--- |
| Active Params | 6B | 12B | 21B |
| Total Params | 35B | 47B | 236B |
| Expert Count | 16 | 8 | 160 |
| Context Window | 256k | 32k | 128k |
| License | Apache 2.0 | Apache 2.0 | MIT |
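The expert counts in the table come from top-k gating: a router scores all experts per token but only the k highest-scoring ones run, which is why active parameters stay far below total parameters. A minimal sketch of standard top-k gating (not Ryujin's actual router; the model's routing k is not stated here):

```python
import math

def top_k_route(logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Standard top-k MoE gating sketch: keep the k experts with the
    highest router logits and softmax-normalize their weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(idx, exps)]
```

For a 16-expert layer, `top_k_route` returns the k chosen expert indices with mixture weights summing to 1; the token's output is the weighted sum of just those experts' outputs.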
```python
prompt = "Explain the significance of the Dragon God in Shinto mythology."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```