DeepSeek-V4-Flash on Novita AI: Fast Reasoning at Lower Cost

DeepSeek-V4-Flash backed by Novita AI: 1M Context at $0.14/M Tokens

Most open-source models with reasoning capabilities force a trade-off: small context windows, slow throughput, or prices that climb above $1/M tokens the moment you enable extended thinking. DeepSeek-V4-Flash sidesteps that entirely — 284B parameters, only 13B activated per inference, a native 1,048,576-token context window, and three selectable reasoning modes. At $0.14/M input tokens, it lands in a category where reasoning-capable models rarely compete.

In short: DeepSeek-V4-Flash is a MoE model from DeepSeek AI that brings 1M-token context and adjustable reasoning depth to developers who need throughput without the closed-model price premium. As of today, it’s available through the Novita AI API.

What Is DeepSeek-V4-Flash?

DeepSeek-V4-Flash is a Mixture-of-Experts (MoE) language model from DeepSeek AI, released as part of the DeepSeek-V4 series alongside the larger DeepSeek-V4-Pro. The model has 284B total parameters with 13B activated at inference — keeping per-token compute cost low while retaining the parameter capacity of a much larger model.

Key capabilities at a glance:

  • 284B total / 13B activated parameters — MoE architecture, low inference cost
  • 1,048,576-token context window (1M tokens) — enabled by Hybrid Attention Architecture
  • Three reasoning modes: Non-think (fast), Think (step-by-step), Think Max (maximum reasoning budget)
  • Function calling support — tool use, structured outputs, JSON mode
  • Trained on 32T+ tokens with multi-stage post-training (SFT, RL with GRPO, on-policy distillation)
  • MIT License — weights available for download on HuggingFace; commercial use permitted
  • FP4 + FP8 mixed precision — MoE expert weights in FP4, remaining layers in FP8

Key Features: Why DeepSeek-V4-Flash Stands Out

Selectable Reasoning Depth Without Switching Models

Most models lock you into a single inference mode: either reasoning-on or reasoning-off. DeepSeek-V4-Flash gives you three distinct operating modes on the same API endpoint:

| Mode | Characteristics | Best For |
| --- | --- | --- |
| Non-think | Fast, no chain-of-thought | High-volume tasks, chat, summarization |
| Think | Step-by-step reasoning, balanced | Complex Q&A, code generation, analysis |
| Think Max | Maximum reasoning budget | Math competitions, hard coding tasks, benchmarks |

The gap between modes is significant: on GPQA Diamond, V4-Flash Non-think scores 71.2 vs Think at 87.4 and Think Max at 88.1. On LiveCodeBench, Think Max reaches 91.6 vs Non-think’s 55.2. You choose cost vs quality per request — no infrastructure change required.

Hybrid Attention Architecture for 1M-Token Context

Native million-token context is harder than it sounds. DeepSeek-V4-Flash achieves it through a purpose-built Hybrid Attention Architecture that combines two mechanisms:

  • Compressed Sparse Attention (CSA) — dramatically reduces the attention compute budget for long sequences
  • Heavily Compressed Attention (HCA) — compresses KV cache footprint for 1M-context inference

The result: inference over 1M-token inputs with manageable FLOP and memory cost. For workloads like codebase analysis, legal document review, or long-session agents, this architecture makes the difference between feasible and prohibitive.

MoE Efficiency: 13B Activated at 284B Scale

The 284B/13B activated ratio is where the cost efficiency comes from. Only 13B parameters are active per forward pass, keeping latency and per-token cost close to a 13B dense model — while the full 284B parameter pool provides knowledge capacity comparable to a much larger dense network. The FP4 + FP8 mixed precision further reduces memory bandwidth pressure on expert weights.
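
The efficiency claim can be sanity-checked with the common rule of thumb of roughly 2 FLOPs per activated parameter per decoded token (a back-of-the-envelope approximation, not a vendor-published figure):

```python
# Back-of-the-envelope: transformer decoding costs roughly 2 FLOPs per
# activated parameter per generated token.
TOTAL_PARAMS = 284e9      # full MoE parameter pool
ACTIVATED_PARAMS = 13e9   # parameters active per forward pass

flops_moe = 2 * ACTIVATED_PARAMS    # per-token cost, V4-Flash
flops_dense = 2 * TOTAL_PARAMS      # per-token cost of a hypothetical dense 284B

print(f"Per-token compute reduction: ~{flops_dense / flops_moe:.1f}x")
```

That roughly 22x gap between the parameter pool and the activated set is what lets per-token pricing sit near dense-13B levels.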

Strong Post-Training Pipeline

DeepSeek-V4-Flash follows a two-stage post-training process: first, domain-specific expert cultivation via SFT and reinforcement learning with GRPO; then, unified model consolidation through on-policy distillation. This produces a single model with differentiated capability profiles across coding, reasoning, and general knowledge — not a generic instruction-follower.

Benchmark Performance

The benchmark story for DeepSeek-V4-Flash is about reasoning mode selection. In Non-think mode, it behaves like an efficient 13B-activated model. Dial up to Think Max and it reaches a different tier entirely.

[Figure: DeepSeek-V4-Flash benchmark comparison showing performance across reasoning modes vs frontier models. Source: DeepSeek AI / HuggingFace]

Performance Across Reasoning Modes

Below are V4-Flash’s scores across key benchmarks, comparing all three operating modes:

| Benchmark | V4-Flash Non-Think | V4-Flash Think | V4-Flash Think Max |
| --- | --- | --- | --- |
| LiveCodeBench (Pass@1) | 55.2 | 88.4 | 91.6 |
| GPQA Diamond (Pass@1) | 71.2 | 87.4 | 88.1 |
| HMMT 2026 Feb (Pass@1) | 40.8 | 91.9 | 94.8 |
| IMOAnswerBench (Pass@1) | 41.9 | 85.1 | 88.4 |
| Codeforces Rating | — | 2816 | 3052 |
| SWE Verified (Resolved) | 73.7 | 78.6 | 79.0 |
| MRCR 1M (MMR) | 37.5 | 76.9 | 78.7 |
| MCPAtlas (Pass@1) | 64.0 | 67.4 | 69.0 |
| MMLU-Pro (EM) | 83.0 | 86.4 | 86.2 |

Last verified: 2026-04-27. Source: DeepSeek-V4 technical report and HuggingFace model card.

How V4-Flash Compares to Competitors

V4-Flash Think Max (79.0 SWE Verified, 91.6 LiveCodeBench) competes with models running at much higher per-token cost. It doesn’t top every leaderboard — V4-Pro Max leads on most frontier benchmarks — but for developers looking at cost-per-task rather than raw peak performance, the trade-off is favorable:

| Benchmark | V4-Flash Max | V4-Pro Max | Claude Opus 4.6 Max | Gemini 3.1 Pro High |
| --- | --- | --- | --- | --- |
| LiveCodeBench (Pass@1) | 91.6 | 93.5 | 88.8 | 91.7 |
| GPQA Diamond (Pass@1) | 88.1 | 90.1 | 91.3 | 94.3 |
| SWE Verified (Resolved) | 79.0 | 80.6 | 80.8 | 80.6 |
| HMMT 2026 Feb (Pass@1) | 94.8 | 95.2 | 96.2 | 94.7 |
| MRCR 1M (MMR) | 78.7 | 83.5 | 92.9 | 76.3 |

Last verified: 2026-04-27. Claude Opus 4.6 Max and Gemini 3.1 Pro High figures sourced from the DeepSeek-V4 technical report (V4-Pro frontier comparison table). These scores were not measured head-to-head against V4-Flash in that report.

Notably, V4-Flash Think Max on MRCR 1M (78.7) beats Gemini 3.1 Pro High (76.3) on the long-context retrieval task — the benchmark that most directly maps to 1M-context use cases. On SWE Verified, all four models cluster between 79–81, making V4-Flash competitive in the real-world coding agent category at a fraction of the closed-model price.

How to Use DeepSeek-V4-Flash via Novita AI

Option 1: Playground (No Code)

Test the model directly in your browser at the Novita AI model console. No API key required to start — switch between Non-think, Think, and Think Max modes via the chat interface.

Option 2: API (Python)

Novita serves DeepSeek-V4-Flash through an OpenAI-compatible API. Use the model ID deepseek/deepseek-v4-flash with the Novita base URL:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
print(response.choices[0].message.content)

To enable Think or Think Max mode, pass the reasoning parameter in the request body:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",
)

# Think Max mode — maximum reasoning budget
response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[{"role": "user", "content": "Solve: x^4 - 5x^2 + 4 = 0"}],
    extra_body={"reasoning": {"effort": "high"}}  # "low" = Think, "high" = Think Max
)
print(response.choices[0].message.content)

Get your API key at novita.ai/settings.

Option 3: Third-Party Tools

Because Novita AI exposes an OpenAI-compatible endpoint, DeepSeek-V4-Flash works out of the box with:

  • LangChain / LlamaIndex — use ChatOpenAI with base_url="https://api.novita.ai/v3/openai"
  • OpenWebUI — add as a custom OpenAI-compatible endpoint
  • Continue.dev / Cursor — configure as a custom model with the Novita base URL
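
Whichever tool you use, the integration reduces to the same three settings; here is a minimal sketch (the commented-out LangChain usage assumes the langchain-openai package is installed):

```python
# The three settings every OpenAI-compatible tool needs to reach Novita.
# Field names below follow LangChain's ChatOpenAI; other tools expose the
# same knobs under equivalent option names.
NOVITA_SETTINGS = {
    "base_url": "https://api.novita.ai/v3/openai",
    "api_key": "YOUR_NOVITA_API_KEY",  # from novita.ai/settings
    "model": "deepseek/deepseek-v4-flash",
}

# With LangChain:
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(**NOVITA_SETTINGS)
print(NOVITA_SETTINGS["model"])
```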

Pricing

DeepSeek-V4-Flash is priced consistently across major providers. All figures are per million tokens, as of 2026-04-27:

| Provider | Input ($/M) | Output ($/M) | Cache Read ($/M) | Max Context |
| --- | --- | --- | --- | --- |
| Novita AI | $0.14 | $0.28 | $0.028 | 1,048,576 tokens |
| DeepSeek Official | $0.14 | $0.28 | $0.028 | 131,072 tokens |
| SiliconFlow | $0.14 | $0.28 | $0.028 | 65,536 tokens |
| DeepInfra | $0.14 | $0.28 | — | 16,384 tokens |

The per-token rate is the same everywhere — but max context varies significantly. Novita AI offers the full 1M token context window. DeepInfra caps at 16,384 tokens. If your workload involves long documents, codebases, or multi-turn agents, Novita is the practical choice.
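
To translate the table into per-request numbers, here is a small cost helper (the token counts in the example are placeholders, not measurements):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Dollar cost of one request at the V4-Flash rates listed above."""
    INPUT, OUTPUT, CACHE = 0.14, 0.28, 0.028  # $ per million tokens
    fresh = input_tokens - cached_tokens
    return (fresh * INPUT + cached_tokens * CACHE + output_tokens * OUTPUT) / 1e6

# A 100K-token document summarized into 1K output tokens:
print(f"${request_cost(100_000, 1_000):.4f}")
```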

Use Cases

Autonomous Coding Agents

V4-Flash’s 1M context window means an agent can load an entire codebase into context without chunking. Combined with 79.0 SWE Verified in Think Max mode, it handles multi-file refactors and debugging without losing state between turns.
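
Before loading a repo, a pre-flight size check is worth doing. Here is a sketch using the rough 4-characters-per-token heuristic (an approximation; exact counts require the model's tokenizer):

```python
CONTEXT_WINDOW = 1_048_576   # V4-Flash context window, in tokens
CHARS_PER_TOKEN = 4          # rough heuristic, not the real tokenizer

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(files: dict[str, str], reserve: int = 8_192) -> bool:
    """Check whether a set of source files fits, leaving `reserve` tokens
    of headroom for instructions and the model's reply."""
    total = sum(estimate_tokens(src) for src in files.values())
    return total + reserve <= CONTEXT_WINDOW

# Example with in-memory "files" standing in for a real repo:
repo = {"main.py": "x = 1\n" * 500, "util.py": "def f():\n    pass\n" * 200}
print(fits_in_context(repo))
```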

Long-Document QA and RAG

V4-Flash scores 78.7 on MRCR 1M (multi-round co-reference resolution) in Think Max mode; the benchmark measures retrieval accuracy over a genuine 1M-token window. For legal documents, academic papers, or long technical specs, V4-Flash retrieves accurately at context lengths where most models degrade.

Math and Science Reasoning

94.8% on HMMT 2026 February (competition math) with Think Max. The budget-thinking mode lets you tune cost vs accuracy — use Think for standard problems, Think Max for the hard ones. A single request doesn’t burn a fixed compute budget; you choose.
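
One way to exploit per-request mode selection is a small effort router. The difficulty labels and mapping below are hypothetical application-side logic, not part of the API; only the resulting reasoning parameter is:

```python
def reasoning_params(difficulty: str) -> dict:
    """Map an application-side difficulty label to request parameters.
    Returns an extra_body dict for Think / Think Max, or {} for Non-think."""
    if difficulty == "hard":
        return {"reasoning": {"effort": "high"}}  # Think Max
    if difficulty == "medium":
        return {"reasoning": {"effort": "low"}}   # Think
    return {}                                     # Non-think: omit the parameter

print(reasoning_params("hard"))
```

Pass the returned dict as extra_body (or merge it into the request body) so easy problems stay at Non-think pricing while hard ones get the full budget.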

Production APIs with Caching

At $0.028/M cache reads, repeated system prompts and tool schemas effectively cost nothing at scale. Chatbot products and API wrappers that re-inject the same context on every call benefit from cache read pricing over raw input pricing.
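
A quick sketch of what cache reads are worth for a service that re-sends the same 5,000-token system prompt on every call (both token counts are illustrative; the rates come from the pricing table above):

```python
INPUT_RATE = 0.14 / 1_000_000        # $/token, fresh input
CACHE_READ_RATE = 0.028 / 1_000_000  # $/token, cached input

SYSTEM_PROMPT_TOKENS = 5_000         # re-sent on every request (illustrative)
REQUESTS = 1_000_000                 # monthly volume (illustrative)

cost_uncached = SYSTEM_PROMPT_TOKENS * REQUESTS * INPUT_RATE
cost_cached = SYSTEM_PROMPT_TOKENS * REQUESTS * CACHE_READ_RATE

print(f"Without caching: ${cost_uncached:,.2f}")
print(f"With cache reads: ${cost_cached:,.2f}")
print(f"Savings: ${cost_uncached - cost_cached:,.2f}")
```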

Frequently Asked Questions

What is DeepSeek-V4-Flash?

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts language model developed by DeepSeek AI, released on 2026-04-23. It activates only 13B parameters per forward pass, making it significantly faster and cheaper than dense models of comparable capability. It supports a 1,048,576-token context window and three reasoning modes: Non-think (fast), Think, and Think Max.

How is DeepSeek-V4-Flash different from DeepSeek-V4-Pro?

V4-Flash is the lighter, faster variant optimized for speed and cost. V4-Pro is the flagship model with higher peak benchmark scores (e.g., 93.5 vs 91.6 on LiveCodeBench Think Max). V4-Flash “achieves comparable reasoning performance to the Pro version when given a larger thinking budget” — in practice, V4-Flash Think Max closes most of the gap against V4-Pro Think Max at lower per-token cost.

What does “Flash” mean in the model name?

Flash signals a speed-optimized variant, consistent with how Google uses the term for Gemini Flash. DeepSeek-V4-Flash prioritizes lower latency and cost over raw maximum accuracy, with the thinking modes available when you need to close the performance gap.

Does DeepSeek-V4-Flash support a 1M context window backed by Novita AI?

Yes. Novita AI exposes the full 1,048,576-token context window — the largest available across all current providers for this model. Max completion tokens on Novita is 393,216.

How do I switch reasoning modes via the API?

Pass extra_body={"reasoning": {"effort": "low"}} for Think mode, or "effort": "high" for Think Max. Omit the parameter entirely for Non-think (fast) mode. The API is OpenAI-compatible; no SDK changes are required.

What is the pricing for DeepSeek-V4-Flash backed by Novita AI?

As of 2026-04-27: $0.14/M input tokens, $0.28/M output tokens, $0.028/M cache read tokens. This matches DeepSeek’s official pricing and is consistent across providers — the differentiator on Novita is the full 1M context window and reliable uptime.

Is DeepSeek-V4-Flash open source?

Yes. The model weights are available on HuggingFace under the MIT License — confirmed in the official DeepSeek-V4 repository. Self-hosting and commercial use are permitted under MIT terms. Using it via Novita AI’s API requires no self-hosting at all.


Start Using DeepSeek-V4-Flash Today

DeepSeek-V4-Flash is now available via Novita AI with the full 1M context window, competitive pricing, and zero infrastructure overhead. You pick the reasoning mode; Novita handles the rest.

Try DeepSeek-V4-Flash backed by Novita AI

Novita AI LLM API documentation


