A side-by-side comparison of every active NRP-managed LLM, followed by a card per model with strengths, trade-offs, and recommended uses. Click any model name in the matrix to jump to its card; click the HuggingFace link on any card to open the upstream model page.
Feature Matrix
| Model | Status | Params | Context | Tools | Reason | Inputs |
|---|---|---|---|---|---|---|
| qwen3 | main | 397B (A17B active) | 262,144 | ✓ | ✓ | image, video |
| qwen3-small | main | 27B | 262,144 | ✓ | ✓ | image, video |
| gpt-oss | main | 120B | 131,072 | ✓ | ✓ | — |
| gemma | main | 31B | 262,144 | ✓ | ✓ | image, video |
| gemma-small | evaluating | ~8B | 131,072 | ✓ | ✓ | image, video, audio |
| kimi | evaluating | 1T (MoE) | 262,144 | ✓ | ✓ | image, video |
| glm-4.7 | evaluating | 358B | 202,752 | ✓ | ✓ | — |
| minimax-m2 | main | 230B | 204,800 | ✓ | ✓ | — |
| olmo | evaluating | 32B | 65,536 | ✓ | — | — |
| qwen3-embedding | main | 8B | — | — | — | image, video |
Status: main = generally supported · evaluating = in testing, may change · deprecated = being retired
Capabilities: tool = function calling · reasoning = thinking mode · multimodal = image / video / audio inputs · research = active research workload (slower removal)
If your group needs a model marked for active research (so removal is communicated rather than automatic), please reach out in the Nautilus AI/ML channel on Matrix. New-model suggestions are also discussed there.
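All code sketches in the cards below reuse a single OpenAI-compatible client. This setup is a minimal sketch: the base URL and API key are placeholders, not the real NRP endpoint values.

```python
from openai import OpenAI

# Hypothetical endpoint and key; substitute the NRP-provided URL and your token.
client = OpenAI(
    base_url="https://llm.example.org/v1",
    api_key="YOUR_API_KEY",
)

# The served model IDs should line up with the matrix above.
for model in client.models.list():
    print(model.id)
```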
Generally supported
Qwen/Qwen3.5-397B-A17B-FP8 ↗
Flagship frontier multimodal MoE — Claude/Gemini-level performance, active research model.
- Parameters: 397B (A17B active)
- Context: 262,144 tokens
- Quantization: FP8 (official)
- Multimodal: image, video
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` (see the sketch below)
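Reasoning is on by default for qwen3, so simple queries pay a latency cost unless you turn it off. A minimal sketch of the toggle, assuming the catalog alias `qwen3` and the client from the setup sketch above:

```python
# Disable the default thinking mode to cut latency on simple queries.
response = client.chat.completions.create(
    model="qwen3",  # assumed catalog alias; your endpoint may expect the full HF ID
    messages=[{"role": "user", "content": "Give a one-line definition of MoE."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```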
Best for: Frontier-quality text and multimodal reasoning · Long-context document and repository analysis · Research workflows requiring reproducibility
Strengths
- Frontier-class reasoning and instruction following
- Multimodal (image + video) alongside text
- High throughput despite the large model size
- Sparse MoE keeps per-token compute low despite 397B total parameters
- Official FP8 quantization preserves model quality
- Capacity may be split across multiple GPU pools (A100 + H200) for throughput
Trade-offs
- Reasoning mode on by default — adds latency for simple queries
- One of the largest GPU footprints in the catalog
Qwen/Qwen3.6-27B ↗
Compact Qwen3.6 — multimodal, agentic, low-latency. Also accessible as qwen3-27b.
- Parameters: 27B
- Context: 262,144 tokens
- Quantization: bf16 (native)
- Multimodal: image, video (see the sketch below)
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
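A sketch of an image-input request under the same assumptions; the alias `qwen3-small` and the image URL are illustrative only:

```python
# Multimodal request: pass an image URL alongside the text prompt.
response = client.chat.completions.create(
    model="qwen3-small",  # assumed catalog alias
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {"type": "image_url", "image_url": {"url": "https://example.org/diagram.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```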
Best for: Latency-sensitive multimodal tasks · Agentic coding and tool use · Long-context tasks where qwen3 is overkill
Strengths
- Multimodal (image + video) at a fraction of qwen3's GPU cost
- 262K context window for whole-repo or long-doc work
- Strong agentic and tool-calling behavior
- Lower latency than larger models — good for interactive use
Trade-offs
- Dense model, so throughput is lower relative to its parameter count
- Lower reasoning ceiling than the 397B qwen3 on the hardest tasks
- Native bf16 weights use more memory than FP8 (an official FP8 variant is also available)
openai/gpt-oss-120b ↗
OpenAI's open-weights agentic model — tiny GPU footprint, strong tools, LTS candidate.
- Parameters: 120B
- Context: 131,072 tokens
- Quantization: MXFP4 (native)
- Tool calling (see the sketch below)
- vLLM recipe
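Since the card pitches gpt-oss at agentic workflows, here is a minimal function-calling sketch; the `get_weather` tool is purely illustrative, and the alias `gpt-oss` is an assumption:

```python
# Declare one illustrative tool and let the model decide whether to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss",  # assumed catalog alias
    messages=[{"role": "user", "content": "What's the weather in San Diego?"}],
    tools=tools,
)

# If the model chose the tool, its arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```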
Best for: General-purpose chat and assistants · Agentic tool-using workflows · Reproducible research (pinnable model)
Strengths
- Runs on a single A100 or two RTX A6000 at full context (MXFP4 + sliding-window attention)
- Strong agentic and tool-calling behavior
- Stable for reproducible research pipelines
- High throughput and low per-token cost suit high-concurrency batch use
Trade-offs
- Text-only — no vision or video input
- Smaller 128K context than the Qwen and Kimi models
google/gemma-4-31B-it ↗
Google's Gemma 4 — multimodal, reasoning optional, efficient frontier performance.
- Parameters: 31B
- Context: 262,144 tokens
- Quantization: bf16 (native)
- Multimodal: image, video
- Tool calling
- vLLM recipe
- Enable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": True}}` (see the sketch below)
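Because gemma inverts the catalog's usual default, the toggle from the qwen3 sketch flips to opt in; the alias `gemma` is assumed:

```python
# Reasoning is off by default for gemma; opt in for harder queries.
response = client.chat.completions.create(
    model="gemma",  # assumed catalog alias
    messages=[{"role": "user", "content": "Why is the sky blue? Reason it out."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
```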
Best for: Multimodal tasks (image/video QA, visual analysis) · Efficient general-purpose assistant · Workflows where reasoning is occasional, not constant · Reproducible research (pinnable model)
Strengths
- Multimodal (image + video) at a compact 31B size
- Reasoning is opt-in (off by default) — low latency unless you need it
- Solid tool calling support
- Google-quality instruction following
Trade-offs
- Dense model, so throughput is lower relative to its parameter count
- Reasoning must be explicitly enabled (unlike most catalogued models where it's the default)
minimax-m2
main · tool · reasoning
MiniMaxAI/MiniMax-M2.7 ↗
Efficient frontier coding model — 230B in native FP8, fits comfortably on four A100s.
- Parameters: 230B
- Context: 204,800 tokens
- Quantization: FP8 (native)
- Tool calling
- vLLM recipe
Best for: Cost-efficient agentic coding · Long-context code review and refactoring · Production agents needing stability
Strengths
- Frontier-level agentic coding at modest GPU cost
- Native FP8 weights — no quantization quality degradation
- High throughput for its size
- ~200K context window for large codebase work
- Generally supported (main) — stable for production use
Trade-offs
- Text-only — no vision or audio
- No reasoning toggle (reasoning is implicit in the model behavior)
qwen3-embedding
main · multimodal · research
Qwen/Qwen3-VL-Embedding-8B ↗
Multimodal embedding model for retrieval and vector search — not a chat model.
- Parameters: 8B
- Multimodal: image, video
Best for: Vector databases and semantic search · RAG pipelines · Multimodal retrieval
Strengths
- Embeddings for text, image, and video inputs
- Compatible with Jupyter AI and OpenAI embedding clients (see the sketch below)
- Compact 8B footprint
Trade-offs
- Not a chat model — DO NOT use for chat or completions
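A minimal text-embedding sketch with the same client; the alias `qwen3-embedding` is assumed, and image/video embedding inputs may need a model-specific request shape not shown here:

```python
# Embed a small batch of texts for vector search; one vector comes back per input.
result = client.embeddings.create(
    model="qwen3-embedding",  # assumed catalog alias
    input=["sparse mixture of experts", "dense transformer"],
)
for item in result.data:
    print(len(item.embedding))  # dimensionality of each returned vector
```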
Evaluating
google/gemma-4-E4B-it ↗
Tiny Gemma 4 with unique audio input — ASR and speech-to-text on an ~8B model.
- Parameters: ~8B
- Context: 131,072 tokens
- Quantization: bf16 (native)
- Multimodal: image, video, audio (see the audio sketch below)
- Tool calling
- vLLM recipe
- Enable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`
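Audio input is the distinguishing feature here, so a hedged sketch of a transcription-style request using the OpenAI `input_audio` content part; the alias `gemma-small`, the file path, and server support for this exact content type are all assumptions:

```python
import base64

# Read a local WAV file and send it as base64-encoded audio (path is illustrative).
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemma-small",  # assumed catalog alias
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this recording."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```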
Best for: Audio transcription and speech-to-text workflows · Lightweight multimodal tasks · Fast, low-cost inference for simple queries
Strengths
- Only catalogued model that accepts audio input (ASR, speech-to-text translation)
- Very small ~8B footprint — extremely low latency
- Also handles image and video input
Trade-offs
- Evaluating — availability and config may change
- Lower reasoning and instruction-following ceiling than larger models
- Reasoning must be explicitly enabled
moonshotai/Kimi-K2.6 ↗
Moonshot's 1T-parameter frontier coding model with multimodal inputs.
- Parameters: 1T (MoE)
- Context: 262,144 tokens
- Quantization: MXFP4 (native)
- Multimodal: image, video
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"thinking": False}}` (note the key is `thinking`, not `enable_thinking`)
Best for: Agentic coding (Claude Code, Kimi CLI, Crush) · Large-repo code understanding · Multimodal coding tasks (UI screenshots, diagrams)
Strengths
- Frontier-class agentic coding — close to commercial top models on coding benchmarks
- 262K context suits whole-repo analysis
- Multimodal (image + video) for screenshot debugging and design-to-code
- Native MXFP4 keeps memory cost manageable at 1T params
Trade-offs
- Evaluating — availability and config may shift
- Largest active-parameter and total-parameter MoE in the catalog; GPU-intensive and slower
- Reasoning on by default — disable for faster simple queries
zai-org/GLM-4.7-FP8 ↗
Zhipu's 358B frontier coding model with official FP8 weights.
- Parameters: 358B
- Context: 202,752 tokens
- Quantization: FP8 (official)
- Tool calling
- vLLM recipe
- Disable reasoning: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
Best for: Agentic coding workflows · Long-form reasoning and text tasks · Tool-using agents
Strengths
- Strong agentic coding — competitive with commercial frontier models
- Official FP8 quantization preserves quality
- ~200K context covers large codebases
Trade-offs
- Text-only — no multimodal input
- Evaluating — configuration may shift
- Reasoning on by default; adds latency for simple tasks
allenai/Olmo-3.1-32B-Instruct ↗
Allen AI's fully open-source 32B instruction model — transparent training data.
- Parameters: 32B
- Context: 65,536 tokens
- Quantization: bf16 (native)
- Tool calling
- vLLM recipe
Best for: Work requiring fully open, auditable models · Tool-using workflows on a smaller budget · NLP research baselines
Strengths
- Fully open training data and weights — good for research reproducibility
- Supports tool calling
- 32B size balances capability and speed
Trade-offs
- Evaluating — availability may change
- Dense model, so throughput is lower relative to its parameter count
- Smaller 64K context window
- No multimodal or reasoning mode