Step 3.7 Flash API on Novita AI: Multimodal Quick Start

Table Of Contents

What do you need before calling the API?
Which Step 3.7 Flash facts matter for implementation?
How do you call Step 3.7 Flash with cURL?
How do you call Step 3.7 Flash from Python?
How should you handle multimodal input?
How do function calling and structured outputs fit?
How should teams budget and test before production?
FAQ

Step 3.7 Flash is available on Novita AI as a Serverless LLM with the model ID stepfun/step-3.7-flash. The model page lists an OpenAI-compatible chat/completions endpoint, text, image, and video input, text output, function calling, structured outputs, and reasoning. This quick start focuses on the developer workflow: how to call the API, which request patterns are safe to use today, which pricing fields to budget for, and where to be careful before wiring multimodal or reasoning behavior into production. For a broader look at the model’s features and positioning, see the Step 3.7 Flash API overview.

What do you need before calling the API?

Start with three pieces of configuration:

Item	Value
API key	Create and store a Novita AI API key in an environment variable such as `NOVITA_API_KEY`.
OpenAI-compatible base URL	`https://api.novita.ai/openai`
Chat completions endpoint	`POST https://api.novita.ai/openai/v1/chat/completions`
Model ID	`stepfun/step-3.7-flash`

The Novita AI documentation index lists the OpenAI-compatible base URL, and the chat completions API reference documents the request and response fields for POST https://api.novita.ai/openai/v1/chat/completions.

Keep the API key out of source control. In local development, export it in your shell. In production, load it from your secret manager:

export NOVITA_API_KEY="your_api_key"

If your application already uses OpenAI-compatible chat completions, the migration path is usually small: point the client at Novita AI’s base URL, set the Authorization bearer token, and use the Step 3.7 Flash model ID.

Which Step 3.7 Flash facts matter for implementation?

Use the exact model ID in code and the display name in user-facing UI. The current Novita model page lists Step 3.7 Flash as a Chat model in the StepFun series.

Field	Current Novita value
Display name	Step 3.7 Flash
API model ID	`stepfun/step-3.7-flash`
Model family shown by Novita	StepFun
Hosting type	Serverless LLM
Endpoint	`chat/completions`
Input modalities	Text, image, video
Output modalities	Text
Context window	262,144 tokens
Max output tokens	256,000
Listed features	Serverless, function calling, structured outputs, reasoning
Listed labels	MoE, >100B, NEW, Featured
Default listed T1 rate limit	30 RPM and 50,000,000 TPM

As of June 18, 2026, Novita lists these token prices for stepfun/step-3.7-flash:

Token type	Listed price
Input tokens	$0.20 per 1M tokens
Output tokens	$1.15 per 1M tokens
Cache read input tokens	$0.04 per 1M tokens

Pricing, model availability, rate limits, and supported request parameters can change. Check the Step 3.7 Flash model page and the Novita AI pricing page before procurement review, production launch, or any customer-facing pricing commitment.

How do you call Step 3.7 Flash with cURL?

For the first smoke test, keep the request text-only. This confirms authentication, model routing, response parsing, and basic generation before you add tools, schemas, images, or video.

curl "https://api.novita.ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${NOVITA_API_KEY}" \
  -d '{
    "model": "stepfun/step-3.7-flash",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise technical assistant."
      },
      {
        "role": "user",
        "content": "Create a four-step checklist for testing a multimodal support bot before release."
      }
    ],
    "max_tokens": 512,
    "temperature": 0.2
  }'

A successful response follows the chat completions shape documented by Novita AI: a choices array, a message with generated content, created/model metadata, and a usage object when usage is returned. For streaming responses, the API reference notes that usage appears in the final response chunk.
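
The streaming note above can be sketched as a small accumulator. This is a minimal sketch, assuming each chunk has already been parsed into a dict that follows the OpenAI-compatible streaming shape (choices[].delta.content, with usage on the final chunk). The function name collect_stream is hypothetical, and the chunk shape should be confirmed against a real Novita response.

```python
from typing import Any, Iterable, Optional


def collect_stream(
    chunks: Iterable[dict[str, Any]],
) -> tuple[str, Optional[dict[str, Any]]]:
    """Join streamed text pieces and keep the last usage object seen."""
    parts: list[str] = []
    usage: Optional[dict[str, Any]] = None
    for chunk in chunks:
        # A usage-only chunk may carry an empty or missing choices list.
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            piece = delta.get("content")
            if piece:
                parts.append(piece)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage
```

If usage comes back as None, treat token accounting for that request as unknown rather than zero.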

Use this smoke test to verify:

The API key is valid.
The model ID is accepted.
Your client can parse choices[0].message.content.
Your logging captures prompt, completion, and total token usage without storing secrets.
Your timeout and retry policy is appropriate for the size of the prompt.
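
The checks above can be partly automated. The following is a minimal sketch, assuming the response body has been decoded into a dict with the chat completions shape described earlier. The helper name check_smoke_response is hypothetical, and it only covers response parsing and token logging, not authentication or timeouts.

```python
from typing import Any


def check_smoke_response(payload: dict[str, Any]) -> dict[str, Any]:
    """Return the fields the smoke test cares about, or raise ValueError."""
    choices = payload.get("choices")
    if not isinstance(choices, list) or not choices:
        raise ValueError("response has no choices")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("choices[0].message.content is empty")
    usage = payload.get("usage") or {}  # usage may be absent
    # Only token counts and the model name are returned, never the API key.
    return {
        "model": payload.get("model"),
        "content": content,
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }
```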

How do you call Step 3.7 Flash from Python?

The OpenAI Python SDK pattern works with Novita AI when you set the Novita base URL. Install and version-pin the SDK in your own project according to your dependency policy.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/openai",
    api_key=os.environ["NOVITA_API_KEY"],
)

response = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {
            "role": "user",
            "content": "Summarize the release risks for a customer support workflow that accepts screenshots and long text tickets.",
        },
    ],
    max_tokens=512,
    temperature=0.2,
)

print(response.choices[0].message.content)

For application code, wrap this in a small model gateway instead of scattering raw API calls across the codebase. A gateway lets you enforce default token limits, set per-route timeouts, normalize errors, and switch models for evaluation without changing business logic.

A practical production wrapper should capture:

model, prompt_tokens, completion_tokens, and total_tokens.
Request latency and retry count.
HTTP status and API error category.
Whether tools, JSON schema, image input, or video input were used.
A redacted request summary that excludes API keys and sensitive user content.

That telemetry matters because Step 3.7 Flash has a listed 262,144-token context window and a listed maximum output of 256,000 tokens. Those limits are useful, but production systems should still set an explicit max_tokens, reject oversized user uploads before the model call, and monitor output length.
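
One way to structure such a gateway is sketched below. It is an illustration under stated assumptions, not a reference implementation: call_model and CallRecord are hypothetical names, the send argument stands in for a callable such as client.chat.completions.create, and the usage attributes are assumed to follow the OpenAI-compatible response shape.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CallRecord:
    """Telemetry for one request; holds no prompt text and no API key."""
    model: str
    latency_s: float
    prompt_tokens: Optional[int]
    completion_tokens: Optional[int]
    total_tokens: Optional[int]
    used_tools: bool
    used_json_schema: bool
    error: Optional[str] = None


def call_model(
    send: Callable[..., Any],
    *,
    messages: list[dict[str, Any]],
    model: str = "stepfun/step-3.7-flash",
    max_tokens: int = 512,
    **extra: Any,
) -> tuple[Any, CallRecord]:
    """Send one request with an explicit max_tokens; return (response, record)."""
    started = time.monotonic()
    response, error = None, None
    try:
        response = send(model=model, messages=messages, max_tokens=max_tokens, **extra)
    except Exception as exc:
        # Normalize failures to a category name; the caller decides on retries.
        error = type(exc).__name__
    usage = getattr(response, "usage", None)
    record = CallRecord(
        model=model,
        latency_s=time.monotonic() - started,
        prompt_tokens=getattr(usage, "prompt_tokens", None),
        completion_tokens=getattr(usage, "completion_tokens", None),
        total_tokens=getattr(usage, "total_tokens", None),
        used_tools="tools" in extra,
        used_json_schema="response_format" in extra,
        error=error,
    )
    return response, record
```

With the client from the earlier example, a call would look like call_model(client.chat.completions.create, messages=[...]). Retry counting and image or video flags are left out to keep the sketch short.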

How should you handle multimodal input?

Novita lists text, image, and video as input modalities for Step 3.7 Flash and text as the output modality. Treat that as the supported capability boundary, then verify the exact payload shape in the current Novita docs or console before shipping a multimodal integration.

For a quick start, use this order:

Run the text-only smoke test.
Add one image input using the currently documented Novita chat message format.
Validate response quality and response shape on your real task.
Add larger image batches or video only after you have confirmed the request format, size limits, latency, and cost behavior.

Do not assume every OpenAI-compatible multimodal payload shape is accepted by every Novita-hosted model. The Step 3.7 Flash model page lists image and video input support, but video requests are more sensitive to file handling, URL access, duration, size, and model-specific formatting. If the current documentation or console example does not show the exact video payload shape you need, avoid hard-coding one from another provider’s docs.
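
For the single-image step, a candidate payload to test might look like the sketch below. It assumes the OpenAI-style content-parts format (a text part plus an image_url part carrying a base64 data URL). This article does not confirm that shape for stepfun/step-3.7-flash, so treat it as something to verify against the current Novita docs or a successful API test. The helper name build_image_message is hypothetical.

```python
import base64
from typing import Any


def build_image_message(
    text: str, image_bytes: bytes, mime_type: str = "image/png"
) -> dict[str, Any]:
    """Build one user message with a text part and an inline image part.

    ASSUMPTION: OpenAI-style content parts; confirm before production use.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }
```

Record the exact message that works in your own test and reuse that, rather than this sketch, as the production template.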

Good first image-use cases include:

Summarizing a support screenshot alongside the user’s ticket text.
Extracting UI state from a product screenshot for an internal triage assistant.
Reviewing a visual QA image and producing a text checklist.

Video should be tested more conservatively. Start with short clips, record the exact request form that works, capture latency and token usage, and define fallback behavior when video input is rejected, too large, or too slow for your route.
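
The fallback decision can be made before any model call. The sketch below is illustrative only: the size and duration limits are hypothetical placeholders, not Novita limits, and should be replaced with values measured in your own tests.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration; replace with values you have measured.
MAX_VIDEO_BYTES = 20 * 1024 * 1024
MAX_VIDEO_SECONDS = 30.0


@dataclass
class VideoDecision:
    send_video: bool
    reason: str


def decide_video_route(size_bytes: int, duration_s: float) -> VideoDecision:
    """Decide whether to send the clip or fall back to a non-video path."""
    if size_bytes <= 0:
        return VideoDecision(False, "missing_media")
    if size_bytes > MAX_VIDEO_BYTES:
        return VideoDecision(False, "too_large")
    if duration_s > MAX_VIDEO_SECONDS:
        return VideoDecision(False, "too_long")
    return VideoDecision(True, "ok")
```

When send_video is False, the route can fall back to a text-only request or ask the user for a shorter clip, and the reason string can be logged alongside the request telemetry.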

How do function calling and structured outputs fit?

Step 3.7 Flash is listed with function calling and structured outputs. In the chat completions API, function calling is exposed through tools, and structured outputs are exposed through response_format.

Use function calling when the model should choose a tool and return JSON arguments instead of directly answering the user. The API reference documents function tools with a type of function, a function.name, a description, JSON Schema parameters, and an optional strict setting.

tools = [
    {
        "type": "function",
        "function": {
            "name": "create_support_ticket",
            "description": "Create an internal support ticket from a user-reported issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "priority": {
                        "type": "string",
                        "enum": ["low", "medium", "high"],
                    },
                    "needs_human_review": {"type": "boolean"},
                },
                "required": ["summary", "priority", "needs_human_review"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[
        {
            "role": "user",
            "content": "The payment settings page returns a 500 error after I upload a screenshot.",
        }
    ],
    tools=tools,
    temperature=0.1,
)
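
After the call, the application still has to read the tool call and validate its arguments. The following is a minimal sketch, assuming the response message exposes tool_calls in the OpenAI-compatible shape (function.name plus function.arguments as a JSON string). The helper name extract_ticket_args is hypothetical.

```python
import json
from typing import Any, Optional

ALLOWED_PRIORITIES = {"low", "medium", "high"}


def extract_ticket_args(message: Any) -> Optional[dict[str, Any]]:
    """Return validated create_support_ticket arguments, or None if not called."""
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name != "create_support_ticket":
            continue
        # Arguments arrive as a JSON string; malformed JSON raises ValueError.
        args = json.loads(call.function.arguments)
        if not isinstance(args, dict):
            raise ValueError("arguments must be a JSON object")
        if not isinstance(args.get("summary"), str):
            raise ValueError("summary must be a string")
        if args.get("priority") not in ALLOWED_PRIORITIES:
            raise ValueError("priority must be low, medium, or high")
        if not isinstance(args.get("needs_human_review"), bool):
            raise ValueError("needs_human_review must be a boolean")
        return args
    return None
```

A caller would pass response.choices[0].message and treat None as "the model answered directly instead of calling the tool".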

Use structured outputs when your application needs a validated JSON response and no external tool call is required. Novita’s chat completions API reference documents response_format with json_schema and notes that strict mode supports a subset of JSON Schema. Keep early schemas small, avoid exotic schema features, and fail closed when the model response does not validate.
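
A small schema with a fail-closed parser might look like the sketch below. The response_format layout follows the OpenAI-compatible json_schema convention and is an assumption here, so confirm the exact field names in Novita’s API reference. The schema, RESPONSE_FORMAT, and parse_triage are hypothetical names for illustration.

```python
import json
from typing import Any

TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["bug", "billing", "question"]},
        "needs_human_review": {"type": "boolean"},
    },
    "required": ["category", "needs_human_review"],
    "additionalProperties": False,
}

# ASSUMPTION: OpenAI-compatible layout; verify against the Novita API reference.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "triage_result", "schema": TRIAGE_SCHEMA, "strict": True},
}


def parse_triage(content: str) -> dict[str, Any]:
    """Parse model output and raise ValueError on anything outside the schema."""
    data = json.loads(content)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(TRIAGE_SCHEMA["required"]):
        raise ValueError("unexpected or missing fields")
    if data["category"] not in TRIAGE_SCHEMA["properties"]["category"]["enum"]:
        raise ValueError("invalid category")
    if not isinstance(data["needs_human_review"], bool):
        raise ValueError("needs_human_review must be a boolean")
    return data
```

In use, RESPONSE_FORMAT would be passed as response_format to client.chat.completions.create, and parse_triage would run on response.choices[0].message.content. Validating locally as well means a schema drift or an unsupported strict feature surfaces as an error instead of bad data.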

For reasoning, distinguish model capability from request behavior. The Step 3.7 Flash model page lists reasoning as a feature, while the chat completions API reference documents reasoning-related parameters with model-specific support notes. Before relying on a reasoning field in a production parser, run an API test with stepfun/step-3.7-flash and handle the exact response shape your account receives.
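
A parser can stay tolerant of whichever shape comes back. The sketch below is illustrative: the candidate field names reasoning_content and reasoning are guesses, not fields confirmed by this article, so replace them with whatever your own API test returns.

```python
from typing import Any, Optional

# HYPOTHETICAL field names; confirm against a real response for your account.
CANDIDATE_REASONING_FIELDS = ("reasoning_content", "reasoning")


def split_answer_and_reasoning(message: Any) -> tuple[str, Optional[str]]:
    """Return (answer, reasoning), where reasoning is None if no field is present."""
    answer = getattr(message, "content", None) or ""
    for name in CANDIDATE_REASONING_FIELDS:
        value = getattr(message, name, None)
        if isinstance(value, str) and value:
            return answer, value
    return answer, None
```

Because the answer is always read from content, downstream code keeps working whether or not a reasoning field is returned.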

How should teams budget and test before production?

Use the listed token prices to estimate the first budget, then validate with real usage logs. Step 3.7 Flash is priced differently for input, output, and cache reads, so long prompts, verbose outputs, and repeated context have different cost profiles. If you are comparing Novita AI alongside other LLM API providers, the best LLM API providers 2026 guide covers pricing tiers, rate limits, and provider trade-offs. For teams still evaluating which inference provider fits an agent workload, choosing an inference provider for AI agents walks through the key evaluation criteria.

For example, an application that sends large support transcripts may spend most of its budget on input tokens. An agent that asks for long plans may spend more on output tokens. A retrieval or memory workflow that reuses context may benefit from cache-read pricing if the cache behavior applies to the deployed request pattern.
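
These cost profiles can be compared with a short estimator. It uses the prices listed in this article and assumes that cached input tokens are billed at the cache-read rate instead of the input rate. That billing assumption and the function name estimate_cost_usd are illustrative, so validate both against real usage logs.

```python
# Listed prices per 1M tokens from this article; re-check before budgeting.
INPUT_PRICE_PER_M = 0.20
OUTPUT_PRICE_PER_M = 1.15
CACHE_READ_PRICE_PER_M = 0.04


def estimate_cost_usd(
    input_tokens: int, output_tokens: int, cached_input_tokens: int = 0
) -> float:
    """Estimate one request's cost in USD.

    cached_input_tokens is the portion of input_tokens served from cache.
    ASSUMPTION: that portion is billed at the cache-read rate.
    """
    uncached = input_tokens - cached_input_tokens
    if uncached < 0 or output_tokens < 0 or cached_input_tokens < 0:
        raise ValueError("token counts are inconsistent")
    total = (
        uncached * INPUT_PRICE_PER_M
        + cached_input_tokens * CACHE_READ_PRICE_PER_M
        + output_tokens * OUTPUT_PRICE_PER_M
    )
    return total / 1_000_000


transcript_heavy = estimate_cost_usd(100_000, 500)
long_plan_agent = estimate_cost_usd(2_000, 8_000)
cached_context = estimate_cost_usd(100_000, 500, cached_input_tokens=90_000)
```

Under these assumptions the transcript-heavy request costs roughly $0.0206, the long-plan request roughly $0.0096, and the mostly cached request roughly $0.0062, which shows how input volume, output volume, and caching each shift the bill.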

Before production, run an evaluation set that includes:

Short text-only prompts for latency and baseline answer quality.
Long-context prompts near your expected upper bound, not the maximum context window.
Image prompts that match your real upload source and file handling.
Tool-call prompts where the correct behavior is to call a function.
JSON-schema prompts that intentionally test invalid, missing, and edge-case fields.
Failure cases for oversized input, missing media, invalid API keys, and timeouts.
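
An evaluation set like this can start as a very small harness. The sketch below is one possible layout, with hypothetical names (EvalCase, run_eval). The generate argument stands in for whatever function sends a prompt to stepfun/step-3.7-flash and returns the text output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval(cases: list[EvalCase], generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every case and map its name to pass or fail."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(generate(case.prompt)))
        except Exception:
            # Timeouts, API errors, and parser failures all count as failures.
            results[case.name] = False
    return results
```

Each bullet above can map to one or more cases, and the resulting pass rates give a concrete basis for the rollout decision.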

Do not route all traffic to a new model based only on a feature list. A feature list tells you what is available; evaluation tells you whether the model follows your instructions, schemas, and safety rules within your latency budget on your workload.

FAQ

Is Step 3.7 Flash available through Novita AI?

Yes. Novita lists Step 3.7 Flash as a Serverless LLM with the API model ID stepfun/step-3.7-flash.

What endpoint should I use for Step 3.7 Flash?

Use the OpenAI-compatible chat completions endpoint: POST https://api.novita.ai/openai/v1/chat/completions.

Does Step 3.7 Flash support image and video input?

Novita lists text, image, and video as input modalities for Step 3.7 Flash, with text as the output modality. Use current Novita docs or console examples to verify the exact image or video payload shape before production.

How much does Step 3.7 Flash cost?

As of June 18, 2026, Novita lists stepfun/step-3.7-flash at $0.20 per 1M input tokens, $1.15 per 1M output tokens, and $0.04 per 1M cache read input tokens.

Does Step 3.7 Flash support function calling and structured outputs?

Yes. Novita lists function calling and structured outputs as Step 3.7 Flash features. Use tools for function calling and response_format for structured outputs, then test your exact schema and parser before production.

Should I copy a video payload from another provider?

No. Even when APIs are OpenAI-compatible, multimodal file and URL handling can vary. Use a payload shape verified in current Novita documentation, console examples, or your own successful API test for stepfun/step-3.7-flash.
