How to Use DeepSeek V4 Flash in Claude Code via Novita AI

DeepSeek V4 Flash is a 284B MoE model with a 1-million-token context window. It is available through Novita AI’s Anthropic-compatible endpoint, which means Claude Code can use it directly after setting four environment variables. At $0.14/M input tokens versus Claude Sonnet’s $3/M, the cost difference is significant for teams running continuous agentic coding sessions.

Why Use DeepSeek V4 Flash in Claude Code

The economics are the most immediate reason. Claude Code defaults to Claude Sonnet, which runs at $3/M input tokens and $15/M output tokens. DeepSeek V4 Flash on Novita AI costs $0.14/M input and $0.28/M output — roughly a 20× reduction on input and a 50× reduction on output. For a team running Claude Code across an eight-hour workday, that difference adds up fast.
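
To make the gap concrete, the arithmetic can be sketched in a few lines of Python. The prices are the ones quoted above; the daily token volumes are purely illustrative assumptions, not measurements from a real team.

```python
# Per-million-token prices quoted in this article (USD).
PRICES = {
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 20M input and 2M output tokens per developer per day.
DAILY_INPUT, DAILY_OUTPUT = 20_000_000, 2_000_000

sonnet = session_cost("claude-sonnet", DAILY_INPUT, DAILY_OUTPUT)
flash = session_cost("deepseek-v4-flash", DAILY_INPUT, DAILY_OUTPUT)

print(f"Claude Sonnet:      ${sonnet:.2f}/day")
print(f"DeepSeek V4 Flash:  ${flash:.2f}/day")
print(f"Input price ratio:  {3.00 / 0.14:.1f}x")
print(f"Output price ratio: {15.00 / 0.28:.1f}x")
```

Swap in token counts from your own usage logs to get a figure that reflects your team's workload.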

Beyond cost, V4 Flash brings two capabilities that matter specifically for agentic coding:

  • 1M-token context window — Claude Code can load an entire codebase into context without chunking. Multi-file refactors, cross-repo debugging, and long conversation histories stay coherent without manual context management.
  • Selectable reasoning modes — Non-think mode gives fast responses for boilerplate tasks; Think and Think Max modes enable step-by-step reasoning for complex architecture decisions or hard debugging sessions. You choose per-session without switching models.

Novita AI exposes an Anthropic-compatible endpoint (/anthropic), so Claude Code treats it as a drop-in replacement. No SDK changes, no plugin required — just environment variables.

What Is DeepSeek V4 Flash

DeepSeek V4 Flash is a Mixture-of-Experts (MoE) model from DeepSeek AI. It has 284B total parameters but activates only 13B per forward pass, which keeps latency and per-token cost close to a 13B dense model while retaining the knowledge capacity of a much larger network.
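
As a quick illustration of what the MoE figures imply, the share of weights that participate in each forward pass can be computed directly from the two numbers above. This is only arithmetic on the published parameter counts, not a statement about measured latency.

```python
TOTAL_PARAMS_B = 284   # total parameters, in billions (from this article)
ACTIVE_PARAMS_B = 13   # parameters activated per forward pass, in billions

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active per forward pass: {active_fraction:.1%} of all weights")
```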

Key specs at a glance:

Spec | Value
Model ID | deepseek/deepseek-v4-flash
Total parameters | 284B (13B activated per inference)
Context window | 1,048,576 tokens
Max output tokens | 393,216
Input price (Novita AI) | $0.14/M tokens
Output price (Novita AI) | $0.28/M tokens
Cache read price | $0.028/M tokens
Reasoning modes | Non-think, Think, Think Max
Function calling | Yes
Structured outputs | Yes
License | MIT

The three reasoning modes let you tune cost against quality per session. Non-think mode is fast and cheap — right for repetitive scaffolding or boilerplate generation. Think mode adds step-by-step reasoning for code review, architecture work, and debugging. Think Max uses the maximum reasoning budget and matches V4 Pro on most coding benchmarks.

Novita AI provides the full 1M-token context window and reliable uptime, which makes it a practical choice for production agentic workloads.

Getting Your Novita AI API Key

Sign up for a Novita AI account to receive free trial credits. After logging in, navigate to the Key Management page and click Create New Key.

Copy the key immediately — it won’t be shown again. Keep it in a password manager or secrets store; you’ll need it in the next step.

Installing Claude Code

Claude Code requires Node.js 18 or higher. Check your version first:

node --version

If Node is below 18, update from nodejs.org before continuing.
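
If you provision machines with a script, the same check can be automated. The following is a minimal sketch, assuming `node` is on the PATH and prints a version string such as `v20.11.1`; the helper names `node_major` and `check_node` are made up for this example.

```python
import re
import shutil
import subprocess

MIN_NODE_MAJOR = 18  # minimum version stated in this article


def node_major(version_output: str) -> int:
    """Extract the major version from `node --version` output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"Unrecognized Node.js version string: {version_output!r}")
    return int(match.group(1))


def check_node() -> str:
    """Return a human-readable status for the locally installed Node.js."""
    node_path = shutil.which("node")
    if node_path is None:
        return "Node.js not found on PATH"
    output = subprocess.run(
        [node_path, "--version"], capture_output=True, text=True, check=True
    ).stdout
    status = "OK" if node_major(output) >= MIN_NODE_MAJOR else "too old"
    return f"Node.js {output.strip()} ({status})"


if __name__ == "__main__":
    print(check_node())
```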

Windows

Open Command Prompt and run:

npm install -g @anthropic-ai/claude-code

Mac and Linux

Open Terminal and run:

npm install -g @anthropic-ai/claude-code

The global install makes claude available from any directory.

Configuring Environment Variables

These four variables redirect Claude Code to Novita AI’s Anthropic-compatible endpoint with DeepSeek V4 Flash as the active model.

Windows

set ANTHROPIC_BASE_URL=https://api.novita.ai/anthropic
set ANTHROPIC_AUTH_TOKEN=<Your Novita API Key>
set ANTHROPIC_MODEL=deepseek/deepseek-v4-flash
set ANTHROPIC_SMALL_FAST_MODEL=deepseek/deepseek-v4-flash

These persist for the current Command Prompt session. To make them permanent, set them through System Properties → Environment Variables.

Mac and Linux

export ANTHROPIC_BASE_URL="https://api.novita.ai/anthropic"
export ANTHROPIC_AUTH_TOKEN="<Your Novita API Key>"
export ANTHROPIC_MODEL="deepseek/deepseek-v4-flash"
export ANTHROPIC_SMALL_FAST_MODEL="deepseek/deepseek-v4-flash"

To persist across sessions, add these lines to your ~/.bashrc, ~/.zshrc, or equivalent shell profile.

ANTHROPIC_SMALL_FAST_MODEL controls the lightweight model Claude Code uses for fast internal tasks like file lookups and summaries. Setting it to the same model ID keeps all traffic on a single billing line and avoids unexpected Anthropic API calls.
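
If you would rather not change your shell profile, one option is a small launcher that applies the four variables only to the Claude Code process it starts. This is a sketch under assumptions: the function names are illustrative, and it expects the `claude` executable installed earlier to be on the PATH.

```python
import os
import shutil
import subprocess
from typing import Dict, Optional

NOVITA_BASE_URL = "https://api.novita.ai/anthropic"
MODEL_ID = "deepseek/deepseek-v4-flash"


def novita_env(api_key: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the four Claude Code overrides applied."""
    if not api_key:
        raise ValueError("A Novita AI API key is required")
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "ANTHROPIC_BASE_URL": NOVITA_BASE_URL,
            "ANTHROPIC_AUTH_TOKEN": api_key,
            "ANTHROPIC_MODEL": MODEL_ID,
            "ANTHROPIC_SMALL_FAST_MODEL": MODEL_ID,
        }
    )
    return env


def launch_claude(project_dir: str, api_key: str) -> int:
    """Start Claude Code in project_dir with the Novita AI settings; return its exit code."""
    claude_path = shutil.which("claude")
    if claude_path is None:
        raise FileNotFoundError("claude CLI not found; install it with npm first")
    return subprocess.run([claude_path], cwd=project_dir, env=novita_env(api_key)).returncode
```

A call such as `launch_claude(".", os.environ["NOVITA_API_KEY"])` would start a session in the current directory; `NOVITA_API_KEY` is a hypothetical variable name chosen here to keep the key out of source code.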

Starting Claude Code

Navigate to your project directory and launch Claude Code:

cd <your-project-directory>
claude

Claude Code opens an interactive session in the current directory. You’ll see the prompt appear once the connection to Novita AI’s endpoint is established. From here, describe your task in natural language — Claude Code will read your files, propose changes, and apply them with your approval.

Working With Large Codebases

The 1M-token context window is the most practical advantage of V4 Flash over smaller-context alternatives. A typical medium-sized production codebase runs 100K–300K tokens when flattened. V4 Flash can hold the entire thing in context without any chunking strategy.
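
To see where your own repository falls in that range, a rough estimate is usually enough. The sketch below assumes about four characters per token, which is a common rule of thumb rather than the model's actual tokenizer, and the extension and directory lists are arbitrary starting points.

```python
import os

CHARS_PER_TOKEN = 4          # rough heuristic, not the model's real tokenizer
CONTEXT_WINDOW = 1_048_576   # V4 Flash context window from the spec table
SOURCE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".md"}
SKIP_DIRS = {".git", "node_modules", "dist", "build", "__pycache__"}


def estimate_repo_tokens(root: str) -> int:
    """Estimate the token count of source files under root from their character count."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune dependency and build directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if os.path.splitext(name)[1] in SOURCE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens ({tokens / CONTEXT_WINDOW:.1%} of the 1M window)")
```

Treat the result as an order-of-magnitude figure; real token counts vary with language and coding style.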

A few workflows that benefit directly:

Cross-file refactors — Ask Claude Code to rename a data model, change an API contract, or refactor a service interface across every file that references it. With a full context window, it sees all dependencies simultaneously rather than file by file.

Long debug sessions — As a debugging session accumulates tool calls, file reads, and reasoning traces, smaller context windows truncate early history. V4 Flash retains the full session, so the model can reason about patterns it saw 200 tool calls ago.

Repository-wide reviews — Feed the entire codebase to V4 Flash’s Think or Think Max mode and ask for a security review, architecture assessment, or dead code analysis. This would exhaust a 128K model quickly; it fits comfortably within V4 Flash’s window.

System prompt overhead — Claude Code uses a detailed system prompt that can run 10K–20K tokens. On a 128K model, that overhead matters. On a 1M window it’s negligible, leaving nearly all of the context budget for actual code.
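
The overhead claim is easy to check with arithmetic. This snippet takes the upper end of the system prompt range cited above and compares its share of a 128K window with its share of V4 Flash's window; the 128,000-token figure is a round number used for illustration.

```python
SYSTEM_PROMPT_TOKENS = 20_000  # upper end of the 10K-20K range cited above


def prompt_share(context_window: int) -> float:
    """Fraction of the context window consumed by the system prompt."""
    return SYSTEM_PROMPT_TOKENS / context_window


for label, window in [("128K model", 128_000), ("V4 Flash", 1_048_576)]:
    print(f"{label}: {prompt_share(window):.1%} of context used by the system prompt")
```

That works out to roughly a sixth of a 128K window versus about 2% of a 1M window.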

For cost control on long sessions, Non-think mode handles the bulk of routine file edits at the lowest cost. Switch to Think mode when the task requires design reasoning, and Think Max for hard algorithmic or debugging problems. The Novita cache read price ($0.028/M) means repeated system prompt injections cost very little at scale.
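
A short calculation shows what the cache read price means for a resent system prompt. The prices come from the spec table; the prompt size and request count are hypothetical, and the cached figure assumes every resend is a cache hit, which is an idealization.

```python
INPUT_PRICE = 0.14        # USD per million input tokens (Novita AI, from this article)
CACHE_READ_PRICE = 0.028  # USD per million cached input tokens


def prompt_resend_cost(prompt_tokens: int, requests: int, cached: bool) -> float:
    """Cost in USD of sending the same prompt prefix on every request."""
    price = CACHE_READ_PRICE if cached else INPUT_PRICE
    return prompt_tokens * requests * price / 1_000_000


# Hypothetical session: a 15K-token system prompt resent on 500 requests.
uncached = prompt_resend_cost(15_000, 500, cached=False)
cached = prompt_resend_cost(15_000, 500, cached=True)

print(f"Without cache hits: ${uncached:.2f}")
print(f"With cache hits:    ${cached:.2f}")
```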

Selecting Reasoning Modes Per Session

DeepSeek V4 Flash supports three reasoning modes that you can control per session. Non-think mode returns fast, direct completions — right for boilerplate generation, routine edits, and quick lookups. Think mode enables step-by-step reasoning for code review, refactors, and architecture decisions. Think Max allocates the maximum reasoning budget and matches V4 Pro on most coding benchmarks.

The simplest way to bias Claude Code toward deeper reasoning is to append an instruction to its system prompt with the --append-system-prompt flag:

claude --append-system-prompt "Use extended thinking for architecture decisions and complex debugging."

For programmatic control, Novita AI’s endpoint accepts the budget_tokens parameter. Setting it to 0 disables thinking entirely; any positive value enables thinking up to that token budget. This is useful in agentic pipelines where only specific steps need deep reasoning:

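import anthropic

# Point the Anthropic SDK at Novita AI's Anthropic-compatible endpoint.
client = anthropic.Anthropic(
    base_url="https://api.novita.ai/anthropic",
    api_key="<Your Novita API Key>",
)

# Enable thinking with a 10,000-token reasoning budget.
# max_tokens is set above budget_tokens so there is room left for the visible answer.
response = client.messages.create(
    model="deepseek/deepseek-v4-flash",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Review this function for subtle concurrency bugs."}],
)

# Print only the final answer, skipping the reasoning blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)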
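
In a pipeline where only some steps need deep reasoning, one way to organize this is a helper that builds the request arguments per step. The sketch below is an assumption-laden illustration: the budget numbers for each mode are invented for the example, and passing `budget_tokens: 0` to disable thinking follows the description above rather than a verified request shape.

```python
from typing import Any, Dict

# Hypothetical budgets per mode; tune these for your own workload.
MODE_BUDGETS = {"non-think": 0, "think": 4_000, "think-max": 16_000}


def request_kwargs(mode: str, prompt: str, answer_tokens: int = 4_000) -> Dict[str, Any]:
    """Build keyword arguments for client.messages.create() for one pipeline step."""
    if mode not in MODE_BUDGETS:
        raise ValueError(f"Unknown mode: {mode!r}")
    budget = MODE_BUDGETS[mode]
    return {
        "model": "deepseek/deepseek-v4-flash",
        # Leave room for the visible answer on top of the reasoning budget.
        "max_tokens": budget + answer_tokens,
        # Assumption: a budget of 0 disables thinking, as described above.
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }
```

With an Anthropic SDK client configured for the Novita AI endpoint, a step would then run as `client.messages.create(**request_kwargs("think", "Explain this stack trace."))`.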

For cost-conscious sessions, start in Non-think mode and switch to Think only when you hit a problem that requires it. Because the Novita cache read price is $0.028/M tokens, repeated system prompt injections stay cheap even across long multi-step sessions.

Conclusion

DeepSeek V4 Flash on Novita AI gives Claude Code a capable, cost-efficient backbone — 1M context, selectable reasoning, and function calling at a fraction of Claude Sonnet pricing. The setup takes under five minutes. Once the environment variables are in place, your existing Claude Code workflow runs unchanged.

Try DeepSeek V4 Flash on Novita AI and see the Novita AI LLM API documentation for further configuration options.

FAQ

Does Claude Code need any plugin or extension to use Novita AI?

No. Claude Code reads the ANTHROPIC_BASE_URL environment variable at startup and routes all API calls there. No plugin, extension, or code change is required — the switch is entirely through environment variables.

Will I be billed by Anthropic when using Novita AI?

No. When ANTHROPIC_BASE_URL points to Novita AI, all traffic and billing go through your Novita AI account. Your Anthropic account is not used.

Can I switch back to Claude Sonnet without reinstalling?

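Yes. Unset all four variables (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL, and ANTHROPIC_SMALL_FAST_MODEL), or open a new shell without those exports, and Claude Code reverts to the default Anthropic endpoint with Claude Sonnet. Clearing only some of them can leave Claude Code sending your Novita AI key or the DeepSeek model ID to Anthropic’s endpoint.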

Is V4 Flash suitable for automated CI pipelines?

V4 Flash supports function calling and structured outputs, which are the two capabilities Claude Code relies on most heavily. It is a practical choice for automated coding pipelines, CI integrations, and long agentic sessions where context continuity and cost predictability matter.

What happens if the context window fills up?

At 1,048,576 tokens, V4 Flash’s context window is large enough that most sessions won’t fill it. In an extremely long session (days of accumulated history or a very large repository), Claude Code will eventually have to drop or summarize the oldest parts of the conversation. In practice, starting a fresh session for a new task is the simplest way to stay well within the limit.