
NRP-Managed LLMs

The NRP provides several hosted open-weights LLMs for API access or for use with our hosted chat interfaces.

Chat Interfaces

Open WebUI

If you are looking to chat with an LLM through an interface similar to ChatGPT, we provide the NRP Open WebUI, based on the Open WebUI project. This is a feature-filled chat interface for all of the NRP-hosted models. You can use it to chat with or test out the models.

Visit the NRP Open WebUI interface

On macOS, you can keep Open WebUI always available in the Dock for quick access: with Open WebUI open in Safari, click File → Add to Dock.

LibreChat

If you are looking to chat with an LLM through an interface similar to ChatGPT, we also provide LibreChat, based on the LibreChat project. This is a simple chat interface for all of the NRP-hosted models. You can use it to chat with or test out the models.

Visit the LibreChat interface

On macOS, you can keep LibreChat always available in the Dock for quick access: with LibreChat open in Safari, click File → Add to Dock.

Cherry Studio

You can install the standalone Cherry Studio desktop application.

Visit the Cherry Studio application website

Go to Settings → Model Provider → press the Add button (set Provider Name to NRP or anything else you want, and Provider Type to OpenAI) → add API Key and API Host (https://ellm.nrp-nautilus.io/v1) → press Fetch model list → press the Add models to the list button at the right of the search box to add all models.

For setting the extra_body JSON parameter, go to Assistants → select an assistant (such as Default Assistant) → click Edit Assistant → Model Settings → Custom Parameters → Add Parameter → set Parameter to extra_body, select JSON, and input the JSON contents in the textarea right below (such as {"cache_salt": "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXphYmNkZWZnaGlqa2xtbm9wcQ==", "chat_template_kwargs": {"enable_thinking": true, "thinking": true, "reasoning": {"enabled": true}}}).

Please do not set Max Tokens (unless you know what you are doing; read the API Access to LLMs via Envoy gateway section).

Chatbox

You can install the standalone Chatbox desktop application or use the web interface version.

Visit the Chatbox application website

Generate the Chatbox configuration on the LLM token generation page. Copy the generated configuration to the clipboard; it will already contain your personal token.

In Chatbox, go to Settings → Model Provider, scroll down to the end of the providers list, and click Import from clipboard.

Please leave Max Output Tokens empty (fill in only Context Window unless you know what you are doing; read the API Access to LLMs via Envoy gateway section).

API Access to LLMs via Envoy gateway

To access our LLMs through the Envoy AI Gateway, you need to be a member of a group with the LLM flag. Your membership info can be found on the namespaces page.

Start by creating a token. You can use this token to query the OpenAI-compatible LLM endpoint with curl or any OpenAI-API-compatible tool:

curl -H "Authorization: Bearer <your_token>" https://ellm.nrp-nautilus.io/v1/models

NOTE: Please only specify Max Output Tokens/max_tokens/max_output_tokens when you know what they mean. These set the maximum output length within the context, not the total context length. If a client or library strictly requires a value, set it to no more than 1/3 to 1/4 of the total context length. NEVER set it to the full context length, as such requests are guaranteed to fail.
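If you do need a value, one safe approach is to derive it from the model's context length. A minimal sketch with the OpenAI Python client (131,072 is the gpt-oss context length from the model list below; the 1/4 fraction follows the note above and is only a rule of thumb):

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

CONTEXT_LENGTH = 131072  # total context of gpt-oss (see Available Models below)

# max_tokens bounds only the output; keep it well under the context length
# so the prompt still fits. Never set it to the full context length.
response = client.chat.completions.create(
    model="gpt-oss",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=CONTEXT_LENGTH // 4,
)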

Examples

Python Code

To access the NRP LLMs, you can use the OpenAI Python client, as in the example below.

nrp-llm.py
import os

from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

completion = client.chat.completions.create(
    model="gpt-oss",
    messages=[
        {"role": "system", "content": "Talk like a pirate."},
        {
            "role": "user",
            "content": "How do I check if a Python object is an instance of a class?",
        },
    ],
)
print(completion.choices[0].message.content)
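To discover the model names accepted by the model parameter, you can also list the gateway's models from Python (equivalent to the curl query against /v1/models):

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

# List the models exposed by the gateway (equivalent to GET /v1/models).
for model in client.models.list():
    print(model.id)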

Bash+Curl

curl -H "Authorization: Bearer <TOKEN>" https://ellm.nrp-nautilus.io/v1/models

curl -H "Authorization: Bearer <TOKEN>" -X POST "https://ellm.nrp-nautilus.io/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss",
    "messages": [
      {"role": "user", "content": "How do I check if a Python object is an instance of a class?"}
    ]
  }'

OpenCode

After applying the configuration below, search for NRP in the list of models (press Ctrl+P, choose Switch models, then search).

In the configuration below, either set the environment variable OPENAI_API_KEY or replace {env:OPENAI_API_KEY} with the actual API key.

Adjust "output" as needed, but never set it to a value close to or larger than "context".


Contents of ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "NRP": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "NRP",
      "options": {
        "baseURL": "https://ellm.nrp-nautilus.io/v1",
        "apiKey": "{env:OPENAI_API_KEY}"
      },
      "models": {
        "gpt-oss": {
          "name": "gpt-oss",
          "limit": {
            "context": 131072,
            "output": 32768
          }
        }
      }
    }
  }
}

Crush

https://github.com/charmbracelet/crush

Please refer to https://www.zonca.dev/posts/2026-01-29-configure-nrp-llm-opencode-crush as well.

After applying the configuration below, search for NRP in the list of models.

Adjust "default_max_tokens" as needed, but never set it to a value close to or larger than "context_window".


Contents of ~/.config/crush/crush.json:

{
  "$schema": "https://charm.land/crush.json",
  "options": {
    "disable_metrics": true,
    "disable_provider_auto_update": false,
    "debug": false,
    "debug_lsp": false,
    "attribution": {
      "trailer_style": "none",
      "generated_with": false
    }
  },
  "mcp": {},
  "providers": {
    "nrp": {
      "name": "NRP",
      "type": "openai-compat",
      "base_url": "https://ellm.nrp-nautilus.io/v1",
      "api_key": "<LLM_API_KEY>",
      "models": [
        {
          "id": "gpt-oss",
          "name": "gpt-oss",
          "context_window": 131072,
          "default_max_tokens": 32768
        }
      ]
    }
  }
}

Kimi CLI

https://github.com/MoonshotAI/kimi-cli


Contents of ~/.kimi/config.json:

{
  "default_model": "kimi",
  "models": {
    "kimi": {
      "provider": "nrp",
      "model": "kimi",
      "max_context_size": 262144,
      "capabilities": ["thinking", "image_in", "video_in"]
    }
  },
  "providers": {
    "nrp": {
      "type": "openai_legacy",
      "base_url": "https://ellm.nrp-nautilus.io/v1",
      "api_key": "<YOUR_API_KEY>"
    }
  },
  "services": {}
}

Claude Code

https://github.com/anthropics/claude-code


Environment Variables needed for Claude Code:

ANTHROPIC_BASE_URL="https://ellm.nrp-nautilus.io/anthropic"
ANTHROPIC_API_KEY="<llm-token>"
ANTHROPIC_DEFAULT_OPUS_MODEL="<name-of-nrp-model>"
ANTHROPIC_DEFAULT_SONNET_MODEL="<name-of-nrp-model>"
ANTHROPIC_DEFAULT_HAIKU_MODEL="<name-of-nrp-model>"
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1"
CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY="1"
CLAUDE_CODE_ENABLE_TELEMETRY="0"
API_TIMEOUT_MS="3000000"
DISABLE_TELEMETRY="1"

Note that the Web Search Tool is an Anthropic-specific tool that only Claude models can invoke. Claude Code users can use other tools (MCP servers, custom commands) to perform web search instead.

Note that not all NRP models support Anthropic-compatible endpoints or tools like Claude Code. Please consult the model documentation and the vLLM integration guide: https://docs.vllm.ai/en/stable/serving/integrations/claude_code/.

Isolating cached responses

In your API call to models (vLLM and SGLang models are supported), specify the key cache_salt in extra_body. Its value should be a random base64-encoded string known only to you, where the secret random text before base64 encoding is at least 43 characters (256 bits) long.

Example (base64-encoded from abcdefghijklmnopqrstuvwxyzabcdefghijklmnopq): extra_body={"cache_salt": "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXphYmNkZWZnaGlqa2xtbm9wcQ=="}

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "cache_salt": "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXphYmNkZWZnaGlqa2xtbm9wcQ==",
    },
)
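One way to generate a suitable value is with Python's standard library; this is a minimal sketch (any source of at least 256 bits of secret randomness works equally well):

import base64
import secrets

# 32 random bytes = 256 bits of entropy, base64-encoded for use as cache_salt.
cache_salt = base64.b64encode(secrets.token_bytes(32)).decode("ascii")
print(cache_salt)

Keep the generated salt private and reuse the same value across your own requests, so your requests can still share a cache with each other.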

We recommend that API users set the above configuration whenever their prompts or responses should be cached only for themselves and not for others. This is potentially very important: it prevents other people's cached responses or prompts from showing up in your responses (and your cached responses or prompts from showing up in other users' responses), and will likely also improve the accuracy of your responses, since irrelevant caches are not referenced. However, it will slightly decrease performance.

We would really like to apply the above configuration automatically by default for each API key or in the web interfaces, but there are several hurdles right now. (Envoy AI Gateway Issue, LibreChat Issue, Open WebUI Issue)

Live inference status

The NRP site hosts LLM Status, a live view of managed inference workloads (built from Prometheus/vLLM metrics). Each card is one Kubernetes container; the title is the vLLM / Hugging Face model ID. If the same logical gateway model (for example qwen3 in API requests) runs on more than one deployment, you will see more than one card with the same model name. Usage data in Grafana or Thanos often carries OpenTelemetry labels such as gen_ai_original_model with the gateway short name (qwen3, gemma, gemma-4-e4b, gpt-oss, …), which differs from the Hugging Face ID shown on LLM Status.

Available Models

main - The model is generally supported. You can report issues with the service. However, if the model is outdated with no apparent usage purpose, it may be removed (if there is no major group or user usage) or switched to a deprecated state. This is to provide our users with the best models within our limited allocation of GPUs. Moreover, models may be upgraded to a newer variant without prior notice.

batch - The LLM is recommended for batch querying and will provide enough performance for most types of queries under moderately heavy load. For batching, code your script so that the chat request retries indefinitely, preferably with a retry interval that grows as the retry count increases (see the sketch below). Models may go down and come back up at various times.

tool - The LLM has tool calling (function calling) enabled.

multimodal - The LLM is multimodal (accepted media types are listed in the model card).

research - The LLM is used for active research purposes. Removal or deprecation will be communicated to research users. Please inform an admin if your group would like to declare research usage for certain models.

evaluating - The LLM was added for testing and we're evaluating its capabilities. The model may be unavailable at times, and configurations may change without notice.

deprecated - The LLM is deprecated and is likely to go away soon. Please do not start using this model; it remains available only for existing user groups with specific purposes for it.
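For models carrying the batch flag, the retry guidance above can look like the following minimal sketch using the OpenAI Python client (the backoff interval and cap are assumptions; tune them to your workload):

import time

from openai import APIConnectionError, APIError, OpenAI

client = OpenAI(base_url="https://ellm.nrp-nautilus.io/v1")  # API key read from OPENAI_API_KEY

def chat_with_retry(messages, model="gpt-oss"):
    # Retry indefinitely, growing the interval as the retry count increases.
    delay = 1
    while True:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (APIError, APIConnectionError) as exc:
            print(f"Request failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
            delay = min(delay * 2, 300)  # cap the backoff at 5 minutes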

You can follow all updates and participate in the discussions within our Matrix Nautilus Artificial Intelligence/Machine Learning (NRP Matrix.to, Matrix.to) channel. Suggestions and decisions for new models are also made here.

qwen3

main batch tool multimodal research

Qwen/Qwen3.5-397B-A17B-FP8

Multimodal (image, video), 262,144 context tokens, 397B parameters, Official FP8 quantization, tool calling, Claude/Gemini-level frontier multimodal performance, use extra_body={"chat_template_kwargs": {"enable_thinking": false}} to disable thinking

Capacity may be split across multiple GPU pools (for example A100 and H200); see LLM Status for per-workload allocation.

qwen3-small

main tool multimodal research

Qwen/Qwen3.5-27B

Multimodal (image, video), 262,144 context tokens, 27B parameters, bf16 weights, tool calling, efficient multimodal and agentic performance, use extra_body={"chat_template_kwargs": {"enable_thinking": false}} to disable thinking

gpt-oss

main batch tool research

openai/gpt-oss-120b

131,072 context tokens, 120B parameters, Native MXFP4 model weights, tool calling, frontier agentic task performance

gemma

main tool multimodal research

google/gemma-4-31B-it

Multimodal (image, video), 262,144 context tokens, 31B parameters, Native bf16 weights, tool calling, efficient multimodal and audio capabilities, use extra_body={"chat_template_kwargs": {"enable_thinking": true}} to enable thinking

gemma-small

evaluating tool multimodal

google/gemma-4-E4B-it

Multimodal (image, video, audio), 131,072 context tokens, ~8B parameters, Native bf16 weights, tool calling, ASR and speech-to-text translation on the small Gemma 4 line; use extra_body={"chat_template_kwargs": {"enable_thinking": true}} to enable thinking

kimi

evaluating tool multimodal

moonshotai/Kimi-K2.5

Multimodal (image, video), 262,144 context tokens, 1T parameters, Native MXFP4 model weights, tool calling, frontier agentic coding performance, use extra_body={"chat_template_kwargs": {"thinking": false}} to disable thinking

glm-4.7

evaluating tool

zai-org/GLM-4.7-FP8

202,752 context tokens, 358B parameters, Official FP8 quantization, tool calling, frontier agentic coding performance, use extra_body={"chat_template_kwargs": {"enable_thinking": false}} to disable thinking
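Several of the model cards above toggle reasoning through chat_template_kwargs passed in extra_body. A minimal sketch for qwen3 (the flag name comes from its model card; kimi uses thinking and the Gemma models use enable_thinking, as noted above):

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url="https://ellm.nrp-nautilus.io/v1",
)

# Disable thinking for qwen3, per its model card above.
response = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Summarize the CAP theorem in one sentence."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)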

Deploying Your Own Models

Please refer to Managing AI Models.

How Models are Added and Removed

Added: New NRP-managed models are added by the administrators based on user feedback and on assessments of benchmarks and community response to the models. We try to take into account quantitative benchmarks (such as https://artificialanalysis.ai), but the ultimate decision is based on qualitative evidence (such as https://www.reddit.com/r/LocalLLaMA/) and discussions between administrators and users.

Removed: We also remove models that are deemed sufficiently obsolete, for instance when smaller models perform better all-round, or when another model covers the same use case in a clearly better way.

Deprecated: An exception is when research groups need specific models for reproducibility in their research. In that case, we deprecate those models first and keep them up until the research concludes. If a model you need has been deprecated or pulled down, please reach out through the Nautilus Artificial Intelligence/Machine Learning channel below so we can track the need.

However, we still want to remove deprecated models as soon as possible: our GPU allocation for deployed LLMs is limited, and these GPUs should be diverted to more recent, better-performing models for the benefit of the whole NRP community. Administrators and researchers use these models for AI-assisted code development, and such models need to be rotated frequently as new and better models are released, since incorporating newer models vastly increases individual productivity.

Larger models that require many GPUs are judged more strictly and are likely to be removed earlier if their relative performance falls behind, while smaller, more efficient, or quantized models that do not require as many GPUs are judged somewhat more leniently.

Such discussions take place in the Nautilus Artificial Intelligence/Machine Learning (NRP Matrix.to, Matrix.to) channel.

Changelogs


April 2026:

March 2026:

  • qwen3-embedding (Qwen/Qwen3-VL-Embedding-8B) was added as an available model on the AI Gateway.

  • embed-mistral (intfloat/e5-mistral-7b-instruct) was decommissioned and replaced with qwen3-embedding due to incompatibilities with Jupyter AI.

  • llama3-sdsc (Llama-3.3-70B-Instruct) was removed from the Envoy AI Gateway after a long deprecation. It no longer appears in /v1/models; do not select it in new configs. If it briefly still appeared on LLM Status as “down”, those were stale Prometheus series that persisted until the workload fully disappeared from metrics.

  • glm-v (GLM-4.6V multimodal route) was removed from the Envoy AI Gateway; use glm-4.7 for text and other multimodal options as documented above.

February 2026:

January 2026:

Added/Changed:

Removed:

December 2025:

Added/Changed:

November 2025:

Added/Changed:

  • qwen3 (Qwen/Qwen3-235B-A22B-Thinking-2507-FP8) has been changed to Qwen/Qwen3-VL-235B-A22B-Thinking-FP8. Very similar characteristics such as number of parameters, context size, and benchmarks, but adds state-of-the-art vision and video multimodal capabilities.
  • kimi (moonshotai/Kimi-K2-Thinking) is a widely popular programming LLM model and exhibits a similar level of model performance to Claude Sonnet 4.5 or GPT-5 models.
  • glm-4.6 (QuantTrio/GLM-4.6-GPTQ-Int4-Int8Mix) is a widely popular programming LLM model and exhibits a similar level of model performance to Claude Sonnet 4 or Gemini 2.5 Pro models.
  • minimax-m2 (MiniMaxAI/MiniMax-M2) is a widely popular programming LLM model and exhibits a similar level of model performance to Claude Sonnet 4 or Gemini 2.5 Pro models, while being able to fit the official FP8 parameters in four A100 GPUs with ample context length.
  • gpt-oss (openai/gpt-oss-120b) is a very capable agentic model, adequate for general-purpose usage, while requiring only one A100 GPU or two RTX A6000 GPUs for full context due to sliding window attention and official MXFP4 quantization, a fraction of what other frontier models need. This is our candidate for an “LTS” model for reproducible research, superseding the deprecated or removed Llama3 models.
  • gemma3 was changed to 2x RTX A6000 GPUs instead of 2x A100 GPUs to conserve the latter. The model’s special sliding window attention method allows full context to fit in this case.
  • glm-v was changed to the official zai-org/GLM-4.5V-FP8 model and uses 4x L40 GPUs instead of 2x A100 GPUs to conserve the latter and gain FP8 capabilities.

Removed:

  • llama3 (meta-llama/Llama-3.2-90B-Vision-Instruct) has been officially pulled down: it consumed 4 A100 GPUs that can be used for far more capable frontier models, such as MiniMax-M2 or GLM-4.6, while performing much worse than models that fit in one GPU.
  • deepseek-r1 (QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium) has been officially pulled down: it consumed 8 GPUs while being very slow (5-6 tokens/s) at any larger context size. There are many similar models that work well, although not necessarily better in every way, and are faster. This is an example of the “larger models that require a lot of GPUs are likely to be removed earlier” policy above.
  • watt (watt-ai/watt-tool-8B) has been removed due to inactivity.
This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.