Ling-2.6-flash: 340 Tokens/s, ~7x Efficiency | Novita AI

Agent token bills are spiraling: multi-step tool calls, long-context planning, and extended outputs turn what looks like a cheap per-token price into a very expensive monthly invoice. The industry’s answer — chain longer reasoning traces to push benchmark scores higher — makes the economics worse, not better.

Ling-2.6-flash is a different kind of model. Built around a hybrid linear attention architecture, it achieves up to 340 tokens/s on 4× H20 hardware, delivers 2.2× the prefill throughput of Nemotron-3-Super, and uses just ~15M output tokens to complete the full Artificial Analysis Intelligence Index — roughly one-tenth of what Nemotron-3-Super consumes. In short: Ling-2.6-flash is a 104B MoE model (7.4B active) with a 256K context window, optimized for agent workloads where speed, cost, and stability matter more than a single headline benchmark. It is now available on Novita AI.

What Is Ling-2.6-flash?

Ling-2.6-flash is a sparse Mixture-of-Experts language model with 104B total parameters and 7.4B active parameters per forward pass. Developed by the Ling team (InclusionAI), it is designed as an “Instant” category model — optimized for production agent deployments where token consumption and latency are real costs, not just benchmark headlines.

  • 104B total / 7.4B active parameters — MoE architecture with high sparsity
  • 256K token context window — enabled by hybrid linear attention
  • 340 tokens/s peak throughput on 4× H20 (TP=4)
  • Hybrid 1:7 MLA + Lightning Linear attention — 4× throughput at long contexts
  • Top agent benchmarks — leads BFCL-V4 (67.04), PinchBench (81.10), IFBench (58.10), Multi-IF Turn-3 (74.85)
  • BF16, FP8, and INT4 variants — open-source release planned, together with the Linghe kernels
  • Validated in production — ~100B daily tokens on OpenRouter within days of launch

Hybrid Linear Architecture: How Ling-2.6-flash Gets Faster at Scale

Most MoE models pair standard transformer attention with a sparse FFN layer. Ling-2.6-flash replaces most attention with a Lightning Linear layer, creating a 1:7 MLA + Lightning Linear hybrid. Attention cost grows linearly with context length rather than quadratically — critical for long agent sessions.
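To build intuition for the linear-vs-quadratic cost difference, here is a toy causal linear-attention loop: instead of the n × n score matrix of standard attention, it carries a fixed-size running state, so per-sequence cost grows linearly with length. This is a generic kernelized-attention sketch, not the actual Lightning Linear kernel.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention via a running state (S, z) of fixed size,
    giving O(n) cost in sequence length n. Feature map is an assumption."""
    phi = lambda x: np.maximum(x, 0.0) + 1.0   # simple positive feature map
    Qf, Kf = phi(Q), phi(K)
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))                   # running sum of phi(k_t) v_t^T
    z = np.zeros(d_k)                          # running normalizer sum of phi(k_t)
    out = np.empty_like(V)
    for t in range(Q.shape[0]):                # one linear pass over the sequence
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + eps)
    return out
```

The state (S, z) never grows with context length, which is why throughput holds up at 65K-token outputs where quadratic attention degrades.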

Ling-2.6-flash architecture: 157K vocabulary, 256K context, 1:7 MLA + Lightning Linear hybrid, 256 selectable experts [Source: Ling Official Blog]

Decode Throughput: Up to 4.38× at Long Outputs

On 4× H20-3e (TP=4, batch size 32), Ling-2.6-flash reaches 4.38× normalized decode throughput at 65,536-token output length vs. GLM-4.5-Air baseline. Qwen3.5-122B-A10B reaches 1.90×; Nemotron-3-Super 3.37×. The gap compounds as task output length increases.

Decode Throughput Comparison, 4× H20-3e, TP=4, Batch=32 [Source: Ling Official Blog]

Prefill Throughput: 2.2× Nemotron at Long Contexts

Ling-2.6-flash achieves ~4.68× normalized prefill throughput at 65K context vs. ~2.12× for Nemotron-3-Super. For RAG pipelines and multi-turn agents with long system prompts, this directly reduces per-request cost.

Prefill Throughput Comparison, 4× H20-3e, TP=4, Batch=32 [Source: Ling Official Blog]

Token Efficiency: 15M vs. 110M to Solve the Same Benchmarks

On the full Artificial Analysis Intelligence Index, Ling-2.6-flash uses ~15M output tokens. Nemotron-3-Super uses 110M+ — roughly 7× more — for a model that scores lower on agent tasks. For applications running hundreds of thousands of agent tasks daily, this gap is a direct line item on a cost budget.
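A back-of-envelope calculation makes the line item concrete. The token counts are the figures quoted above; the per-million-token price is a placeholder assumption for illustration, not an actual Novita or OpenRouter quote.

```python
# Hypothetical price, for illustration only: $0.50 per 1M output tokens.
PRICE_PER_M_OUTPUT_USD = 0.50

ling_tokens = 15_000_000        # ~15M output tokens for the full index
nemotron_tokens = 110_000_000   # ~110M+ output tokens for the same index

ling_cost = ling_tokens / 1_000_000 * PRICE_PER_M_OUTPUT_USD
nemotron_cost = nemotron_tokens / 1_000_000 * PRICE_PER_M_OUTPUT_USD

print(f"Ling: ${ling_cost:.2f}, Nemotron: ${nemotron_cost:.2f}, "
      f"ratio ~{nemotron_tokens / ling_tokens:.1f}x")
```

At any per-token price, the ~7.3× token gap translates into the same ~7.3× gap in output-token spend.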

Output tokens to complete Artificial Analysis Intelligence Index — Ling 2.6 Flash: ~15M vs Nemotron-3-Super: ~110M+ [Source: Artificial Analysis]
Intelligence vs. Output Tokens: Ling 2.6 Flash lands in the high-efficiency zone [Source: Artificial Analysis]

Benchmark Results: Where Ling-2.6-flash Leads

Evaluated on 19 benchmarks across 7 categories against Qwen3-57B-A14B, Qwen3.5-122B-A10B, GLM-4.5-Air, Nemotron-3-Super, and MiniMax-M1-80k:

Comprehensive benchmark table [Source: Ling Official Blog]
Agent benchmarks: Ling-2.6-flash leads on tool-use and multi-turn IF [Source: Ling Official Blog]

Where Ling-2.6-flash Leads

  • BFCL-V4 (Function Calling): 67.04 — nearest competitor Nemotron-3-Super at 35.12, roughly a 90% lead
  • PinchBench (Agent Tasks): 81.10 vs. Nemotron 73.10
  • IFBench (Instruction Following): 58.10
  • Multi-IF Turn-3: 74.85 — strong multi-turn instruction persistence
  • LongBench-v2: 54.80 — top in long-context category
  • CCAlignBench (Chinese): 7.44 — best among all tested models

Where Others Lead

  • Math (AIME 2025, MATH-500): Nemotron-3-Super and Qwen3 reasoning variants win
  • Coding (LiveCodeBench): Qwen3.5-122B-A10B leads; Ling is competitive but not top
  • GPQA-Diamond: GLM-4.5-Air and Nemotron score higher

Quick Comparison Table

| Model | Active Params | BFCL-V4 ↑ | PinchBench ↑ | Decode TP @ 65K ↑ | Output Tokens ↓ |
|---|---|---|---|---|---|
| Ling-2.6-flash | 7.4B | 67.04 | 81.10 | 4.38× | ~15M |
| Nemotron-3-Super | 49B (total) | 35.12 | 73.10 | 3.37× | ~110M+ |
| Qwen3.5-122B-A10B | 10B | — | 78.20 | 1.90× | — |
| GLM-4.5-Air | — | 50.67 | 73.30 | 1.00× (baseline) | — |
| MiniMax-M1-80k | — | 44.07 | 75.70 | — | — |
| Qwen3-57B-A14B | 14B | 52.32 | 76.30 | — | — |

Access Ling-2.6-flash backed by Novita AI

Ling-2.6-flash is available now. Try it on OpenRouter — free tier, no setup required:

Get started on OpenRouter — inclusionai/ling-2.6-flash:free. Free tier available, no code changes needed for OpenAI-compatible clients.

Ling-2.6-flash works with LangChain, LlamaIndex, and OpenAI Agent SDK — no adapter or code change needed. Streaming, function calling, and structured outputs are all supported. Pair it with Novita Agent Sandbox for secure code execution alongside inference.
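As a sketch of what a tool-use request looks like, here is an OpenAI-style Chat Completions body for Ling-2.6-flash. The get_weather tool is a made-up example; the request shape follows the OpenAI-compatible format the model supports.

```python
import json

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "inclusionai/ling-2.6-flash:free",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "stream": True,        # streaming is supported
}
body = json.dumps(payload)  # ready to POST to an OpenAI-compatible endpoint
```

Because the interface is unchanged, this same payload works from LangChain, LlamaIndex, or the OpenAI SDK by swapping only the model name.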

What the Community Is Saying

Ling-2.6-flash launched on OpenRouter as “Elephant Alpha” before the official reveal. Within days it had processed ~100B tokens and topped the platform trending leaderboard — without any announcement.

“Ling-2.6-flash is kind of work-oriented. About 75% less verbose than big models. Still a bit of boilerplate, but when it comes to writing code — it’s almost perfect.”

— Early user on X/Twitter

“Just tried Ling-2.6-flash on a few llama.cpp coding tasks. Much better than expected. Handles tool calls reliably and doesn’t pad the output with unnecessary explanation.”

— Early user on Reddit

The “75% less verbose” comment lines up with the 15M vs. 110M output-token gap on the Artificial Analysis benchmarks. The training objective appears to reward direct, complete answers — a property whose cost savings compound at production scale.

Who Should Use Ling-2.6-flash?

  • High-volume function calling / tool-use agents — BFCL-V4 leadership by a wide margin
  • Multi-turn agent sessions — consistent across long conversation histories
  • Long context RAG pipelines — 256K token window, linear-cost prefill
  • Cost-sensitive production deployments — ~7× fewer output tokens than Nemotron
  • Chinese-language applications — top CCAlignBench
  • Math competition / AIME-style reasoning — use Nemotron or Qwen3 reasoning variants
  • Maximum coding benchmark performance — Qwen3.5-122B-A10B leads

Get Started

Ling-2.6-flash is available now. Access it via the OpenRouter model page — free tier available immediately, no code changes needed for OpenAI-compatible clients. The Agent Sandbox is available alongside for teams combining inference and secure execution.

Frequently Asked Questions

What is Ling-2.6-flash?

Ling-2.6-flash is a 104B MoE model (7.4B active) with hybrid linear attention, 256K context window, and up to 340 tokens/s inference speed — optimized for agent workloads.

How do I use Ling-2.6-flash via API?

Use OpenRouter with your Novita AI API key (BYOK). Add your Novita key at openrouter.ai/settings/integrations, select Novita as the provider, and route requests to inclusionai/ling-2.6-flash:free via the OpenAI-compatible endpoint:

POST https://openrouter.ai/api/v1/chat/completions
Authorization: Bearer YOUR_OPENROUTER_API_KEY

{
  "model": "inclusionai/ling-2.6-flash:free",
  "provider": {
    "order": ["Novita"],
    "api_key": "YOUR_NOVITA_API_KEY"
  },
  "messages": [{"role": "user", "content": "Hello!"}]
}

See OpenRouter BYOK docs for full setup. When using BYOK, OpenRouter charges no fees — you pay Novita directly at free-tier pricing.
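The same BYOK request can be assembled with nothing but the Python standard library. This sketch builds the POST without sending it; swap in real keys, then uncomment the last line to execute the call.

```python
import json
import urllib.request

payload = {
    "model": "inclusionai/ling-2.6-flash:free",
    "provider": {"order": ["Novita"], "api_key": "YOUR_NOVITA_API_KEY"},
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Build the request object; nothing is sent until urlopen is called.
req = urllib.request.Request(
    "https://openrouter.ai/api/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer YOUR_OPENROUTER_API_KEY",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = urllib.request.urlopen(req)   # sends the request for real
```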

How does Ling-2.6-flash compare to Nemotron-3-Super?

Ling leads on BFCL-V4 (67.04 vs 35.12), PinchBench (81.10 vs 73.10), and uses ~7× fewer output tokens. Nemotron leads on math. For agent workloads, Ling-2.6-flash is the better economic choice.

What is the context window?

256K tokens (262,144), with linear-cost prefill thanks to hybrid linear attention. Long RAG and multi-turn sessions scale efficiently.

Is Ling-2.6-flash open source?

BF16, FP8, and INT4 variants plus Linghe kernels are planned for open-source release. Timeline TBD — check the Ling official site for updates.

