A side-by-side comparison of every active NRP-managed LLM, followed by a card per model with strengths, trade-offs, and recommended uses. Click any model name in the matrix to jump to its card; click the HuggingFace link on any card to open the upstream model page.
Feature Matrix
| Model | Status | Params | Context | Tools | Reason | Inputs |
|---|---|---|---|---|---|---|
| qwen3 | main | 397B (A17B active) | 262,144 | ✓ | ✓ | image, video |
| qwen3-small | main | 27B | 262,144 | ✓ | ✓ | image, video |
| gpt-oss | main | 120B | 131,072 | ✓ | ✓ | — |
| gemma | main | 31B | 262,144 | ✓ | ✓ | image, video |
| gemma-small | evaluating | ~8B | 131,072 | ✓ | ✓ | image, video, audio |
| kimi | evaluating | 1T (MoE) | 262,144 | ✓ | ✓ | image, video |
| glm-4.7 | evaluating | 358B | 202,752 | ✓ | ✓ | — |
| minimax-m2 | main | 230B | 204,800 | ✓ | ✓ | — |
| olmo | evaluating | 32B | 65,536 | ✓ | — | — |
| qwen3-embedding | main | 8B | — | — | — | image, video |
Status: main = generally supported · evaluating = in testing, may change · deprecated = being retired
Capabilities: tool = function calling · reasoning = thinking mode · multimodal = image / video / audio inputs · research = active research workload (slower removal)
If your group needs a model marked for active research (so removal is communicated rather than automatic), please reach out in the Nautilus AI/ML channel on Matrix. New-model suggestions are also discussed there.
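All code sketches in the cards below reuse a single OpenAI-compatible client. This setup is a minimal sketch: the base URL and API key are placeholders, not the real NRP endpoint values.

```python
from openai import OpenAI

# Hypothetical endpoint and key; substitute the NRP-provided URL and your token.
client = OpenAI(
    base_url="https://llm.example.org/v1",
    api_key="YOUR_API_KEY",
)

# The served model IDs should line up with the matrix above.
for model in client.models.list():
    print(model.id)
```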
Generally supported
Qwen/Qwen3.5-397B-A17B-FP8 ↗
Flagship frontier multimodal MoE — Claude/Gemini-level performance, active research model.
- Parameters: 397B (A17B active)
- Context: 262,144 tokens
- Quantization: FP8 (official)
- Multimodal: image, video
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` (see the sketch below)
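Reasoning is on by default for qwen3, so simple queries pay a latency cost unless you turn it off. A minimal sketch of the toggle, assuming the catalog alias `qwen3` and the client from the setup sketch above:

```python
# Disable the default thinking mode to cut latency on simple queries.
response = client.chat.completions.create(
    model="qwen3",  # assumed catalog alias; your endpoint may expect the full HF ID
    messages=[{"role": "user", "content": "Give a one-line definition of MoE."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```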
Best for: Frontier-quality text and multimodal reasoning · Long-context document and repository analysis · Research workflows requiring reproducibility
Strengths
- Frontier-class reasoning and instruction following
- Multimodal (image + video) alongside text
- High throughput despite the large model size
- Sparse MoE keeps per-token compute low despite 397B total parameters
- Official FP8 quantization preserves model quality
- Capacity may be split across multiple GPU pools (A100 + H200) for throughput
Trade-offs
- Reasoning mode on by default — adds latency for simple queries
- One of the largest GPU footprints in the catalog
Qwen/Qwen3.6-27B ↗
Compact Qwen3.6 — multimodal, agentic, low-latency. Also accessible as qwen3-27b.
- Parameters: 27B
- Context: 262,144 tokens
- Quantization: bf16 (native)
- Multimodal: image, video (see the sketch below)
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
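A sketch of an image-input request under the same assumptions; the alias `qwen3-small` and the image URL are illustrative only:

```python
# Multimodal request: pass an image URL alongside the text prompt.
response = client.chat.completions.create(
    model="qwen3-small",  # assumed catalog alias
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {"type": "image_url", "image_url": {"url": "https://example.org/diagram.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```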
Best for: Latency-sensitive multimodal tasks · Agentic coding and tool use · Long-context tasks where qwen3 is overkill
Strengths
- Multimodal (image + video) at a fraction of qwen3's GPU cost
- 262K context window for whole-repo or long-doc work
- Strong agentic and tool-calling behavior
- Lower latency than larger models — good for interactive use
Trade-offs
- Dense model, so throughput is lower relative to its parameter count
- Lower reasoning ceiling than the 397B qwen3 on the hardest tasks
- Native bf16 weights use more memory than FP8 (an official FP8 variant is also available)
openai/gpt-oss-120b ↗
OpenAI's open-weights agentic model — tiny GPU footprint, strong tools, LTS candidate.
- Parameters: 120B
- Context: 131,072 tokens
- Quantization: MXFP4 (native)
- Tool calling (see the sketch below)
- vLLM recipe
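Since the card pitches gpt-oss at agentic workflows, here is a minimal function-calling sketch; the `get_weather` tool is purely illustrative, and the alias `gpt-oss` is an assumption:

```python
# Declare one illustrative tool and let the model decide whether to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss",  # assumed catalog alias
    messages=[{"role": "user", "content": "What's the weather in San Diego?"}],
    tools=tools,
)

# If the model chose the tool, its arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```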
Best for: General-purpose chat and assistants · Agentic tool-using workflows · Reproducible research (pinnable model)
Strengths
- Runs on a single A100 or two RTX A6000 at full context (MXFP4 + sliding-window attention)
- Strong agentic and tool-calling behavior
- Stable for reproducible research pipelines
- High throughput and low per-token cost suit high-concurrency batch use
Trade-offs
- Text-only — no vision or video input
- Smaller 128K context than the Qwen and Kimi models
google/gemma-4-31B-it ↗
Google's Gemma 4 — multimodal, reasoning optional, efficient frontier performance.
- Parameters: 31B
- Context: 262,144 tokens
- Quantization: bf16 (native)
- Multimodal: image, video
- Tool calling
- vLLM recipe
- Enable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": True}}` (see the sketch below)
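Because gemma inverts the catalog's usual default, the toggle from the qwen3 sketch flips to opt in; the alias `gemma` is assumed:

```python
# Reasoning is off by default for gemma; opt in for harder queries.
response = client.chat.completions.create(
    model="gemma",  # assumed catalog alias
    messages=[{"role": "user", "content": "Why is the sky blue? Reason it out."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
```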
Best for: Multimodal tasks (image/video QA, visual analysis) · Efficient general-purpose assistant · Workflows where reasoning is occasional, not constant · Reproducible research (pinnable model)
Strengths
- Multimodal (image + video) at a compact 31B size
- Reasoning is opt-in (off by default) — low latency unless you need it
- Solid tool calling support
- Google-quality instruction following
Trade-offs
- Dense model, so throughput is lower relative to its parameter count
- Reasoning must be explicitly enabled (unlike most catalogued models where it's the default)
minimax-m2
main · tool · reasoning
MiniMaxAI/MiniMax-M2.7 ↗
Efficient frontier coding model — 230B in native FP8, fits comfortably on four A100s.
- Parameters: 230B
- Context: 204,800 tokens
- Quantization: FP8 (native)
- Tool calling
- vLLM recipe
Best for: Cost-efficient agentic coding · Long-context code review and refactoring · Production agents needing stability
Strengths
- Frontier-level agentic coding at modest GPU cost
- Native FP8 weights — no quantization quality degradation
- High throughput for its size
- ~200K context window for large codebase work
- Generally supported (main) — stable for production use
Trade-offs
- Text-only — no vision or audio
- No reasoning toggle (reasoning is implicit in the model behavior)
qwen3-embedding
main · multimodal · research
Qwen/Qwen3-VL-Embedding-8B ↗
Multimodal embedding model for retrieval and vector search — not a chat model.
- Parameters: 8B
- Multimodal: image, video
Best for: Vector databases and semantic search · RAG pipelines · Multimodal retrieval
Strengths
- Embeddings for text, image, and video inputs
- Compatible with Jupyter AI and OpenAI embedding clients (see the sketch below)
- Compact 8B footprint
Trade-offs
- Not a chat model — DO NOT use for chat or completions
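A minimal text-embedding sketch with the same client; the alias `qwen3-embedding` is assumed, and image/video embedding inputs may need a model-specific request shape not shown here:

```python
# Embed a small batch of texts for vector search; one vector comes back per input.
result = client.embeddings.create(
    model="qwen3-embedding",  # assumed catalog alias
    input=["sparse mixture of experts", "dense transformer"],
)
for item in result.data:
    print(len(item.embedding))  # dimensionality of each returned vector
```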
Evaluating
google/gemma-4-E4B-it ↗
Tiny Gemma 4 with unique audio input — ASR and speech-to-text on an ~8B model.
- Parameters: ~8B
- Context: 131,072 tokens
- Quantization: bf16 (native)
- Multimodal: image, video, audio (see the audio sketch below)
- Tool calling
- vLLM recipe
- Enable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`
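Audio input is the distinguishing feature here, so a hedged sketch of a transcription-style request using the OpenAI `input_audio` content part; the alias `gemma-small`, the file path, and server support for this exact content type are all assumptions:

```python
import base64

# Read a local WAV file and send it as base64-encoded audio (path is illustrative).
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemma-small",  # assumed catalog alias
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this recording."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```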
Best for: Audio transcription and speech-to-text workflows · Lightweight multimodal tasks · Fast, low-cost inference for simple queries
Strengths
- Only catalogued model that accepts audio input (ASR, speech-to-text translation)
- Very small ~8B footprint — extremely low latency
- Also handles image and video input
Trade-offs
- Evaluating — availability and config may change
- Lower reasoning and instruction-following ceiling than larger models
- Reasoning must be explicitly enabled
moonshotai/Kimi-K2.6 ↗
Moonshot's 1T-parameter frontier coding model with multimodal inputs.
- Parameters: 1T (MoE)
- Context: 262,144 tokens
- Quantization: MXFP4 (native)
- Multimodal: image, video
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"thinking": False}}` (note the key is `thinking`, not `enable_thinking`)
Best for: Agentic coding (Claude Code, Kimi CLI, Crush) · Large-repo code understanding · Multimodal coding tasks (UI screenshots, diagrams)
Strengths
- Frontier-class agentic coding — close to commercial top models on coding benchmarks
- 262K context suits whole-repo analysis
- Multimodal (image + video) for screenshot debugging and design-to-code
- Native MXFP4 keeps memory cost manageable at 1T params
Trade-offs
- Evaluating — availability and config may shift
- Largest active-parameter and total-parameter MoE in the catalog; GPU-intensive and slower
- Reasoning on by default — disable for faster simple queries
zai-org/GLM-4.7-FP8 ↗
Zhipu's 358B frontier coding model with official FP8 weights.
- Parameters: 358B
- Context: 202,752 tokens
- Quantization: FP8 (official)
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
Best for: Agentic coding workflows · Long-form reasoning and text tasks · Tool-using agents
Strengths
- Strong agentic coding — competitive with commercial frontier models
- Official FP8 quantization preserves quality
- ~200K context covers large codebases
Trade-offs
- Text-only — no multimodal input
- Evaluating — configuration may shift
- Reasoning on by default; adds latency for simple tasks
allenai/Olmo-3.1-32B-Instruct ↗
Allen AI's fully open-source 32B instruction model — transparent training data.
- Parameters: 32B
- Context: 65,536 tokens
- Quantization: bf16 (native)
- Tool calling
- vLLM recipe
Best for: Work requiring fully open, auditable models · Tool-using workflows on a smaller budget · NLP research baselines
Strengths
- Fully open training data and weights — good for research reproducibility
- Supports tool calling
- 32B size balances capability and speed
Trade-offs
- Evaluating — availability may change
- Dense model, so throughput is lower relative to its parameter count
- Smaller 64K context window
- No multimodal or reasoning mode