
Available Models

A side-by-side comparison of every active NRP-managed LLM, followed by a card per model with strengths, trade-offs, and recommended uses. Click any model name in the matrix to jump to its card; click the HuggingFace link on any card to open the upstream model page.

Feature Matrix

Model            Status      Params              Context  Tools  Reason  Inputs
qwen3            main        397B (A17B active)  262,144  ✓      ✓       image, video
qwen3-small      main        27B                 262,144  ✓      ✓       image, video
gpt-oss          main        120B                131,072  ✓      ✓       —
gemma            main        31B                 262,144  ✓      ✓       image, video
gemma-small      evaluating  ~8B                 131,072  ✓      ✓       image, video, audio
kimi             evaluating  1T (MoE)            262,144  ✓      ✓       image, video
glm-4.7          evaluating  358B                202,752  ✓      ✓       —
minimax-m2       main        230B                204,800  ✓      ✓       —
olmo             evaluating  32B                 65,536   ✓      —       —
qwen3-embedding  main        8B                  —        —      —       image, video
Status: main = generally supported · evaluating = testing, may change · deprecated = being retired
Capabilities: tool = function calling · reasoning = thinking mode · multimodal = image / video / audio inputs · research = active research workload (slower removal)

If your group needs a model marked for active research (so removal is communicated rather than automatic), please reach out via the Matrix Nautilus AI/ML channel. New-model suggestions are also discussed there.

Generally supported

qwen3

main tool reasoning multimodal research

Qwen/Qwen3.5-397B-A17B-FP8 ↗

Flagship frontier multimodal MoE — Claude/Gemini-level performance, active research model.

Parameters
397B (A17B active)
Context
262,144 tokens
Quantization
FP8 (official)
Multimodal
image, video
Tool calling
vLLM recipe
Disable reasoning
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
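
For reference, a minimal sketch of a chat request with reasoning disabled, using the openai Python client against the OpenAI-compatible API. The base URL and API key are placeholders for your NRP endpoint and token, not real values; the model name is the catalog name above.

    from openai import OpenAI

    # Placeholders: substitute your NRP endpoint URL and API token.
    client = OpenAI(base_url="https://YOUR-NRP-ENDPOINT/v1", api_key="YOUR_TOKEN")

    response = client.chat.completions.create(
        model="qwen3",
        messages=[{"role": "user", "content": "Summarize RFC 2616 in one sentence."}],
        # Skip the thinking phase to cut latency on simple queries.
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    print(response.choices[0].message.content)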

Best for: Frontier-quality text and multimodal reasoning · Long-context document and repository analysis · Research workflows requiring reproducibility

Strengths

  • Frontier-class reasoning and instruction following
  • Multimodal (image + video) alongside text
  • Sparse MoE activates only 17B of 397B parameters per token, so throughput stays high despite the model's size
  • Official FP8 quantization preserves model quality
  • Capacity may be split across multiple GPU pools (A100 + H200) for throughput

Trade-offs

  • Reasoning mode on by default — adds latency for simple queries
  • One of the largest GPU footprints in the catalog

qwen3-small

main tool reasoning multimodal research

Qwen/Qwen3.6-27B ↗

Compact Qwen3.6 — multimodal, agentic, low-latency. Also accessible as qwen3-27b.

Parameters
27B
Context
262,144 tokens
Quantization
bf16 (native)
Multimodal
image, video
Tool calling
vLLM recipe
Disable reasoning
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
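
As a sketch of the multimodal input format: vLLM's OpenAI-compatible server accepts standard image_url content parts, so an image request looks like the following. The endpoint, token, and image URL are placeholders.

    from openai import OpenAI

    client = OpenAI(base_url="https://YOUR-NRP-ENDPOINT/v1", api_key="YOUR_TOKEN")

    response = client.chat.completions.create(
        model="qwen3-small",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                # Standard OpenAI vision-style content part; a data: URL with
                # base64-encoded image bytes also works.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)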

Best for: Latency-sensitive multimodal tasks · Agentic coding and tool use · Long-context tasks where qwen3 is overkill

Strengths

  • Multimodal (image + video) at a fraction of qwen3's GPU cost
  • 262K context window for whole-repo or long-doc work
  • Strong agentic and tool-calling behavior
  • Lower latency than larger models — good for interactive use

Trade-offs

  • Lower throughput for its parameter count (dense model)
  • Lower reasoning ceiling than the 397B qwen3 on the hardest tasks
  • Native bf16 weights use more memory than FP8 (an official FP8 variant also exists)

gpt-oss

main tool reasoning research

openai/gpt-oss-120b ↗

OpenAI's open-weights agentic model — tiny GPU footprint, strong tools, LTS candidate.

Parameters
120B
Context
131,072 tokens
Quantization
MXFP4 (native)
Tool calling
vLLM recipe
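
A sketch of a tool-calling request using the standard OpenAI tools schema. The get_weather function is a hypothetical example, and the endpoint and token are placeholders.

    from openai import OpenAI

    client = OpenAI(base_url="https://YOUR-NRP-ENDPOINT/v1", api_key="YOUR_TOKEN")

    # Hypothetical tool definition, for illustration only.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-oss",
        messages=[{"role": "user", "content": "What's the weather in San Diego?"}],
        tools=tools,
    )
    # If the model chose to call the tool, inspect the structured call(s).
    for call in response.choices[0].message.tool_calls or []:
        print(call.function.name, call.function.arguments)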

Best for: General-purpose chat and assistants · Agentic tool-using workflows · Reproducible research (pinnable model)

Strengths

  • Runs on a single A100 or two RTX A6000 GPUs at full context (MXFP4 + sliding-window attention)
  • Strong agentic and tool-calling behavior
  • Stable for reproducible research pipelines
  • High throughput and low per-token cost suit high-concurrency batch use

Trade-offs

  • Text-only — no vision or video input
  • Smaller 128K context window than the Qwen and Kimi models

gemma

main tool reasoning multimodal research

google/gemma-4-31B-it ↗

Google's Gemma 4 — multimodal, reasoning optional, efficient frontier performance.

Parameters
31B
Context
262,144 tokens
Quantization
bf16 (native)
Multimodal
image, video
Tool calling
vLLM recipe
Enable reasoning
extra_body={"chat_template_kwargs": {"enable_thinking": True}}
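
Because gemma inverts the catalog default, a short sketch of opting in to reasoning for a single hard request (endpoint and token are placeholders as in the examples above):

    from openai import OpenAI

    client = OpenAI(base_url="https://YOUR-NRP-ENDPOINT/v1", api_key="YOUR_TOKEN")

    response = client.chat.completions.create(
        model="gemma",
        messages=[{"role": "user", "content": "Prove that the sum of two odd integers is even."}],
        # Reasoning is off by default for gemma; enable it per request.
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.content)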

Best for: Multimodal tasks (image/video QA, visual analysis) · Efficient general-purpose assistant · Workflows where reasoning is occasional, not constant · Reproducible research (pinnable model)

Strengths

  • Multimodal (image + video) at a compact 31B size
  • Reasoning is opt-in (off by default) — low latency unless you need it
  • Solid tool calling support
  • Google-quality instruction following

Trade-offs

  • Lower throughput for its parameter count (dense model)
  • Reasoning must be explicitly enabled (unlike most catalogued models where it's the default)

minimax-m2

main tool reasoning

MiniMaxAI/MiniMax-M2.7 ↗

Efficient frontier coding model — 230B in native FP8, fits comfortably on four A100s.

Parameters
230B
Context
204,800 tokens
Quantization
FP8 (native)
Tool calling
vLLM recipe

Best for: Cost-efficient agentic coding · Long-context code review and refactoring · Production agents needing stability

Strengths

  • Frontier-level agentic coding at modest GPU cost
  • Native FP8 weights — no quantization quality degradation
  • High throughput relative to its parameter count
  • ~200K context window for large codebase work
  • Generally supported (main) — stable for production use

Trade-offs

  • Text-only — no vision or audio
  • No reasoning toggle (reasoning is implicit in the model behavior)

qwen3-embedding

main multimodal research

Qwen/Qwen3-VL-Embedding-8B ↗

Multimodal embedding model for retrieval and vector search — not a chat model.

Parameters
8B
Multimodal
image, video

Best for: Vector databases and semantic search · RAG pipelines · Multimodal retrieval

Strengths

  • Embeddings for text, image, and video inputs
  • Compatible with Jupyter AI and OpenAI embedding clients (see the sketch at the end of this card)
  • Compact 8B footprint

Trade-offs

  • Not a chat model — DO NOT use for chat or completions
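
A minimal text-embedding sketch against the OpenAI embeddings endpoint (endpoint and token are placeholders). Image and video inputs go through model-specific request formats, so check the upstream model page for those.

    from openai import OpenAI

    client = OpenAI(base_url="https://YOUR-NRP-ENDPOINT/v1", api_key="YOUR_TOKEN")

    emb = client.embeddings.create(
        model="qwen3-embedding",
        input=["a photo of a cat", "a feline on a windowsill"],
    )
    vectors = [d.embedding for d in emb.data]
    # Two vectors, one per input; semantically close strings embed nearby.
    print(len(vectors), len(vectors[0]))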

Evaluating

gemma-small

evaluating tool reasoning multimodal

google/gemma-4-E4B-it ↗

Tiny Gemma 4 with unique audio input — ASR and speech-to-text on an ~8B model.

Parameters
~8B
Context
131,072 tokens
Quantization
bf16 (native)
Multimodal
image, video, audio
Tool calling
vLLM recipe
Enable reasoning
extra_body={"chat_template_kwargs": {"enable_thinking": True}}
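
A sketch of sending audio, assuming vLLM's audio_url extension to the OpenAI content-part schema; the exact field name should be confirmed against the vLLM recipe linked above, and the endpoint, token, and clip URL are placeholders.

    from openai import OpenAI

    client = OpenAI(base_url="https://YOUR-NRP-ENDPOINT/v1", api_key="YOUR_TOKEN")

    response = client.chat.completions.create(
        model="gemma-small",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this recording."},
                # audio_url is a vLLM extension to the OpenAI schema
                # (assumption: verify against the vLLM recipe).
                {"type": "audio_url", "audio_url": {"url": "https://example.com/clip.wav"}},
            ],
        }],
    )
    print(response.choices[0].message.content)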

Best for: Audio transcription and speech-to-text workflows · Lightweight multimodal tasks · Fast, low-cost inference for simple queries

Strengths

  • Only catalogued model that accepts audio input (ASR, speech-to-text translation)
  • Very small ~8B footprint — extremely low latency
  • Also handles image and video input

Trade-offs

  • Evaluating — availability and config may change
  • Lower reasoning and instruction-following ceiling than larger models
  • Reasoning must be explicitly enabled

kimi

evaluating tool reasoning multimodal

moonshotai/Kimi-K2.6 ↗

Moonshot's 1T-parameter frontier coding model with multimodal inputs.

Parameters
1T (MoE)
Context
262,144 tokens
Quantization
MXFP4 (native)
Multimodal
image, video
Tool calling
vLLM recipe
Disable reasoning
extra_body={"chat_template_kwargs": {"thinking": False}}

Best for: Agentic coding (Claude Code, Kimi CLI, Crush) · Large-repo code understanding · Multimodal coding tasks (UI screenshots, diagrams)

Strengths

  • Frontier-class agentic coding — close to commercial top models on coding benchmarks
  • 262K context suits whole-repo analysis
  • Multimodal (image + video) for screenshot debugging and design-to-code
  • Native MXFP4 keeps memory cost manageable at 1T params

Trade-offs

  • Evaluating — availability and config may shift
  • Largest active-parameter and total-parameter MoE in the catalog; GPU-intensive and slower
  • Reasoning on by default — disable for faster simple queries

glm-4.7

evaluating tool reasoning

zai-org/GLM-4.7-FP8 ↗

Zhipu's 358B frontier coding model with official FP8 weights.

Parameters
358B
Context
202,752 tokens
Quantization
FP8 (official)
Tool calling
vLLM recipe
Disable reasoning
extra_body={"chat_template_kwargs": {"enable_thinking": False}}

Best for: Agentic coding workflows · Long-form reasoning and text tasks · Tool-using agents

Strengths

  • Strong agentic coding — competitive with commercial frontier models
  • Official FP8 quantization preserves quality
  • ~200K context covers large codebases

Trade-offs

  • Text-only — no multimodal input
  • Evaluating — configuration may shift
  • Reasoning on by default; adds latency for simple tasks

olmo

evaluating tool

allenai/Olmo-3.1-32B-Instruct ↗

Allen AI's fully open-source 32B instruction model — transparent training data.

Parameters
32B
Context
65,536 tokens
Quantization
bf16 (native)
Tool calling
vLLM recipe

Best for: Settings that require open, auditable models · Tool-using workflows on a smaller budget · NLP research baselines

Strengths

  • Fully open training data and weights — good for research reproducibility
  • Supports tool calling
  • 32B size balances capability and speed

Trade-offs

  • Evaluating — availability may change
  • Lower throughput for its parameter count (dense model)
  • Smaller 64K context window
  • No multimodal or reasoning mode
This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.