- Sandbox Isolation: Process, Container, and MicroVM
- Managing Concurrent Sandbox Sessions
- Lifecycle API: Create, Execute, Terminate
- Observability: Logs, Metrics, and Traces
- CPU, Memory, and Timeout Limits
- Package Install Policies
- Network and Egress Controls
- Secrets and Credential Injection
- Ephemeral vs. Persistent File Storage
- Backend Integration: REST, WebSocket, SDK
- Failure Recovery and Cleanup
- Novita Agent Sandbox
- FAQ
Production apps running AI-generated code need a sandbox that enforces process-level isolation, supports concurrent sessions, exposes a programmable lifecycle API, provides observable logs and resource metrics, enforces package and network policies, and integrates cleanly with the app backend. Choosing a sandbox without evaluating each of those dimensions systematically is the most common way teams hit problems after launch: a workload that looked safe in staging fails under real traffic, leaks state between tenants, or silently executes code the app never intended to allow.
This guide is a requirements checklist. It covers what to verify at each isolation level, what a production lifecycle API must expose, what observability and resource controls should look like, and where backend integration patterns make or break the design. Whether you are evaluating a managed sandbox or building your own, these are the questions worth answering before you ship.
Sandbox Isolation: Process, Container, and MicroVM
Isolation is a spectrum, and each level carries different tradeoffs for performance, portability, and how much trust you are extending to generated code.
Process-level isolation uses OS primitives — namespaces, cgroups, seccomp, and AppArmor or SELinux profiles — to restrict what a process can access. It is fast and requires no separate VM kernel, but all processes share the host kernel. A kernel vulnerability or a privileged system call that slips through the seccomp filter can affect other workloads on the same host. Process isolation is a reasonable starting point for low-risk, short-lived, trusted code paths, but it is a thin boundary for untrusted AI-generated code that may attempt syscalls, subprocess spawning, or package installs.
What to verify at this level:
- Which syscalls are blocked, and what is the default policy when an unknown syscall is attempted?
- Are namespaces scoped per task, per tenant, or shared across jobs?
- Are cgroup limits enforced at the task level or only at the host level?
- Does the sandbox clean up all processes, temp files, sockets, and shared memory on exit?
Container-level isolation adds a filesystem and network namespace boundary and makes image management repeatable. Containers are faster to start than full VMs, easier to compose, and widely supported by orchestration layers. The tradeoff is that containers still share the host kernel, and the container boundary is only as strong as the underlying runtime configuration. Privileged containers, broad capability sets, mounted host sockets, and host-network mode all reduce the effective boundary to roughly nothing.
What to verify at this level:
- Is the container image minimal, with only the runtimes and tools the workload actually needs?
- Are capabilities dropped to the minimum required set?
- Does the container run rootless? If it requires root, what controls constrain that?
- Are the host PID namespace, host network, and the Docker socket explicitly excluded?
- Are mounted volumes limited to explicitly defined paths, and is the root filesystem read-only where possible?
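As a rough illustration of the container checklist above, the sketch below assembles a `docker run` argument list with a reduced-privilege profile. The function name, the volume name, and every numeric value are assumptions chosen for illustration; the flags themselves are standard Docker CLI options. It only builds the argument list and does not launch anything.

```python
from typing import List


def hardened_docker_args(image: str, workdir_volume: str) -> List[str]:
    """Build a `docker run` argument list with a reduced-privilege profile."""
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                    # drop every Linux capability
        "--security-opt=no-new-privileges",  # block privilege escalation via setuid binaries
        "--read-only",                       # read-only root filesystem
        "--network=none",                    # no network unless explicitly added later
        "--pids-limit=128",                  # cap the number of processes (illustrative value)
        "--memory=512m",                     # memory cap (illustrative value)
        "--cpus=1.0",                        # CPU quota (illustrative value)
        "--user=65534:65534",                # run as an unprivileged uid:gid
        "--tmpfs=/tmp:rw,size=64m",          # small writable scratch space
        f"--volume={workdir_volume}:/workspace:rw",  # the only explicit mount
        image,
    ]


if __name__ == "__main__":
    print(" ".join(hardened_docker_args("python:3.12-slim", "sandbox-vol")))
```

Note what is absent: no `--privileged`, no host PID or host network mode, and no Docker socket mount. A real deployment would still need a seccomp or AppArmor profile and a reviewed image on top of this.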
MicroVM isolation puts each workload inside a lightweight virtual machine — with its own guest kernel, virtual devices, and a KVM-backed boundary between the guest and host. Technologies like Firecracker use a minimal device model to reduce attack surface while keeping startup fast enough for interactive use. The microVM boundary means that a kernel exploit in the guest does not automatically affect the host or other guests.
What to verify at this level:
- Does each agent run, each tenant, or each concurrent session get a separate microVM?
- What is the startup latency from API call to ready-to-execute, and is that measured from a warm pool, snapshot, or cold boot?
- Are guest images version-controlled, audited for what runtimes and tools they include, and updated on a regular schedule?
- What happens at the host level if the guest kernel panics or becomes unresponsive?
The practical decision is about your threat model. MicroVM isolation is the strongest generally available boundary for untrusted AI-generated code, but it does not replace filesystem policy, egress controls, package governance, or secrets handling. Those controls must sit on top of whichever isolation layer you choose.
Managing Concurrent Sandbox Sessions
A production app generating code for multiple users simultaneously needs a sandbox that handles concurrency as a first-class concern, not an afterthought.
The key questions are:
Per-session isolation: When 50 sessions are running at the same time, does each session have its own isolated filesystem, process tree, network namespace, and credential scope? State leakage between sessions is one of the most damaging failure modes in multi-tenant sandboxed apps, and it is often invisible in testing where sessions run sequentially.
Session limits and backpressure: Does the sandbox surface concurrency limits as a clear API contract? If 500 requests arrive and the platform supports 100 concurrent sessions, does the API return a structured error, queue the request, or silently degrade? Production apps need that signal to implement backpressure, queue management, and user-facing feedback.
Resource fairness under load: When one session consumes unusually high CPU or memory, are other sessions protected by per-session resource limits, or can one noisy workload degrade the whole pool?
Warm pools and session startup latency: Interactive coding features need sub-second session start times. That usually requires a pool of pre-initialized environments that can be claimed immediately rather than booted on demand. Verify whether the platform documents warm pool availability and what startup latency to expect at different concurrency levels.
Session reuse vs. fresh environments: Some apps benefit from reusing a long-lived session across multiple agent turns, while others need a clean environment for each request. Verify that both patterns are supported and that session reuse does not carry stale state from a previous conversation.
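The session-limit question above can be sketched on the application side as a small admission gate. This is a minimal sketch, assuming a fixed concurrency ceiling; `SessionLimiter`, `AdmitResult`, and the `SESSION_LIMIT_REACHED` code are hypothetical names, not part of any particular platform.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdmitResult:
    admitted: bool
    error_code: Optional[str] = None  # hypothetical structured error code


class SessionLimiter:
    """Admit at most `max_sessions` concurrent sessions; reject the rest explicitly."""

    def __init__(self, max_sessions: int) -> None:
        self._slots = threading.BoundedSemaphore(max_sessions)

    def try_admit(self) -> AdmitResult:
        # Non-blocking acquire: the caller gets an immediate, structured answer
        # instead of hanging or degrading silently.
        if self._slots.acquire(blocking=False):
            return AdmitResult(admitted=True)
        return AdmitResult(admitted=False, error_code="SESSION_LIMIT_REACHED")

    def release(self) -> None:
        # Call exactly once per admitted session, typically from cleanup.
        self._slots.release()
```

Returning an explicit rejection lets the caller decide whether to queue, retry later, or tell the user, which is the backpressure signal the text asks for.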
Lifecycle API: Create, Execute, Terminate
The lifecycle API is the interface between your application and the sandbox runtime. A production-grade API must expose at minimum:
Create: Initialize a new sandbox session, optionally from a template or snapshot, with specified resource limits, environment variables, and mounted volumes. The response should include a session ID and a ready signal, not just an acknowledgment.
Execute: Submit code or a command for execution. This should be an async call that returns an execution ID. The API must support specifying a working directory, environment overrides for the call, and a timeout.
Stream output: Retrieve stdout and stderr as a stream, not only as a final result after execution completes. Streaming matters for long-running jobs, agent steps that take many seconds, and any UX that shows the user incremental progress.
Terminate: End a running execution before it completes. The sandbox should guarantee that the process tree is cleaned up, not just the parent process.
Cleanup: Destroy the session and release all associated resources: filesystem, memory, process slots, network state, and any held credentials. This call should be idempotent, so that a retry after a network error succeeds rather than failing on an already-destroyed session.
Upload and download files: Transfer input files into the sandbox before execution and retrieve output artifacts after. File transfers should be bounded by size limits and policy-controlled for which paths are writable.
Additional capabilities worth verifying for production use:
- Pause and resume: Can a long-running session be suspended and resumed later without losing state? This is useful for rate limiting, cost control, and session handoff between agent turns.
- Snapshot: Can the current session state be captured and used as the starting point for future sessions? This is the key mechanism for warm pools and reusable environments.
- Timeout enforcement: If the executing code exceeds the wall-clock timeout, does the platform terminate it cleanly and report the right exit status?
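One way to reason about the lifecycle operations above is as a small state machine. The sketch below is an in-memory model only; the state names and the transition table are assumptions made for illustration, and a real platform defines its own.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    READY = "ready"
    RUNNING = "running"
    PAUSED = "paused"
    DESTROYED = "destroyed"


# Hypothetical transition table: which states may follow each state.
ALLOWED: Dict[State, Set[State]] = {
    State.READY: {State.RUNNING, State.PAUSED, State.DESTROYED},
    State.RUNNING: {State.READY, State.DESTROYED},
    State.PAUSED: {State.READY, State.DESTROYED},
    State.DESTROYED: set(),
}


class Session:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.state = State.READY

    def _move(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {target.value} not allowed")
        self.state = target

    def execute(self) -> None:
        self._move(State.RUNNING)

    def terminate(self) -> None:
        # Stop the current execution but keep the session usable.
        self._move(State.READY)

    def pause(self) -> None:
        self._move(State.PAUSED)

    def resume(self) -> None:
        self._move(State.READY)

    def cleanup(self) -> bool:
        # Idempotent: a repeated call is a no-op that returns False, not an error.
        if self.state is State.DESTROYED:
            return False
        self._move(State.DESTROYED)
        return True
```

Making `cleanup` return a flag rather than raise on the second call is one way to keep client-side retries safe.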
Observability: Logs, Metrics, and Traces
You cannot debug or audit what you cannot see. Production sandboxes need observability built in, not bolted on.
Stdout and stderr capture: Every execution should produce a captured output record associated with the session ID and execution ID. This should be accessible via the API after execution completes, not only available as a real-time stream.
Execution logs: The platform should record what code ran, when it started, when it finished, what the exit code was, which user or tenant owned the session, and which template or snapshot was used. These records are the minimum needed to reconstruct what happened when something goes wrong.
Resource metrics: Production apps need per-session metrics for CPU usage, memory peak, wall-clock time, and filesystem writes. This allows capacity planning, anomaly detection, and per-session cost attribution.
Error tracing: When a sandbox fails to start, execute, or clean up, the error surface should be structured: error code, message, session ID, and enough context to distinguish a user error (bad code, missing package) from a platform error (quota exceeded, internal failure).
Audit trail: For multi-tenant apps, the audit trail should make agent behavior reconstructable: session ID, tenant, execution sequence, package installs, external domains contacted, files written, and cleanup result. Raw customer code and full command output may not belong in audit logs by default — design for what your retention and access policies can actually support.
What to avoid: a sandbox that surfaces only “execution failed” with no structured error, no session-level logs, and no way to distinguish a timeout from an OOM from a process escape attempt. That forces you to instrument everything at the application layer, which duplicates work and misses events the sandbox can observe directly.
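To make the distinction between failure types concrete, here is a minimal sketch of a structured execution record. The field names and error codes are hypothetical; the point is only that a timeout, an out-of-memory kill, a user-code failure, and a platform fault each map to a different machine-readable value.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExecutionRecord:
    session_id: str
    execution_id: str
    tenant: str
    exit_code: Optional[int]
    timed_out: bool = False
    oom_killed: bool = False
    platform_fault: bool = False


def classify(rec: ExecutionRecord) -> dict:
    """Map raw execution facts to a structured outcome with a code and a source."""
    if rec.platform_fault:
        code, source = "PLATFORM_ERROR", "platform"
    elif rec.timed_out:
        code, source = "TIMEOUT", "limit"
    elif rec.oom_killed:
        code, source = "OUT_OF_MEMORY", "limit"
    elif rec.exit_code == 0:
        code, source = "OK", "none"
    else:
        code, source = "NONZERO_EXIT", "user"
    return {**asdict(rec), "error_code": code, "error_source": source}


def to_log_line(rec: ExecutionRecord) -> str:
    # One JSON object per line keeps records easy to ship and query.
    return json.dumps(classify(rec), sort_keys=True)
```

The record deliberately omits the code and its output, in line with the caution above about what belongs in audit logs by default.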
CPU, Memory, and Timeout Limits
Unbounded resource consumption is one of the simplest ways a sandboxed workload can cause problems in production — either by degrading other sessions or by creating unexpected infrastructure costs.
A production sandbox must enforce limits at the session level, not just at the host level:
CPU: Limit how much CPU time a single session can consume. A session stuck in an infinite loop should not degrade other sessions on the same host. Verify whether the limit is a hard cap (the process is throttled or killed) or a soft limit (the process competes with others for available CPU).
Memory: Set a memory cap that triggers cleanup or termination rather than allowing the session to exhaust host memory. Verify what happens when the limit is hit: OOM kill, structured error response, or silent hang.
Wall-clock timeout: Every execution call should have a maximum duration. The timeout should be enforceable at the platform level, not only at the client level — if the client drops the connection, the sandbox should still terminate the execution at the configured limit.
Disk usage: Generated code may write large output files, install large packages, or fill the working directory. A disk quota on the session working directory prevents runaway writes.
Process count: AI-generated code may spawn subprocesses, background workers, or shell commands that themselves spawn more processes. A limit on the total number of processes in the session’s namespace prevents fork bombs and runaway subprocess trees.
When evaluating a sandbox platform, check whether these limits are configurable per session (so different user tiers or task types can have different limits), whether they are enforced at the sandbox level, and whether hitting a limit produces a structured API error or a silent failure.
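The wall-clock timeout and process-tree points above can be sketched with the standard library. This is a minimal host-side sketch, assuming a POSIX system; `run_with_timeout` and `RunResult` are hypothetical names. It enforces only the wall-clock limit. CPU, memory, disk, and process-count caps need cgroups or an equivalent mechanism and are not shown.

```python
import os
import signal
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    status: str               # "ok", "error", or "timeout"
    exit_code: Optional[int]  # None when the run was killed on timeout
    stdout: str


def run_with_timeout(code: str, timeout_s: float) -> RunResult:
    """Run Python code in its own process group with a wall-clock limit (POSIX only)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the whole group so spawned subprocesses do not outlive the call.
        os.killpg(proc.pid, signal.SIGKILL)
        out, _ = proc.communicate()
        return RunResult("timeout", None, out)
    status = "ok" if proc.returncode == 0 else "error"
    return RunResult(status, proc.returncode, out)
```

The caller gets a distinct `"timeout"` status instead of a generic failure, which is the structured signal the section asks platforms to provide.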
Package Install Policies
AI-generated code frequently requests package installs — pip install, npm install, apt-get, Git clones, direct URL fetches. Each of those operations pulls external code into the sandbox at runtime, which is one of the highest-risk operations a sandbox needs to govern.
A production package policy should cover:
Registry allowlists: Which package registries are permitted? PyPI and npm are defaults, but many teams want the option to restrict to internal mirrors, curated registries, or explicitly approved sources.
Install caching: When many sessions install the same popular packages, a layer cache or pull-through proxy avoids redundant downloads, reduces startup latency, and gives you a point to inspect what is being fetched.
Offline mode: Some workloads should run with no package installs at all — the environment is pre-baked into the image or template, and install attempts should fail with a clear error. This is the appropriate mode for evaluation runs where reproducibility matters more than flexibility.
Hash verification and lockfiles: When packages are allowed, pinned versions and hash verification reduce the risk of a registry compromise changing what code runs inside the sandbox.
Size limits: Packages and their transitive dependencies can be large. A size cap on the total downloaded footprint per session prevents accidental or intentional storage exhaustion.
Package logging: Every install attempt should be recorded in the execution audit log: package name, version requested, registry source, and success or failure. This is the data you need to reconstruct what entered the sandbox during an incident.
The question to ask a sandbox vendor is not “can users install packages?” but “how is each install audited, what registries are allowed by default, and can I configure a stricter policy for sensitive workloads?”
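A minimal sketch of the registry allowlist, hash verification, and install logging ideas above follows. The allowlist contents and function names are assumptions for illustration; a production setup would more likely enforce this in a pull-through proxy than in application code.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical allowlist; substitute internal mirrors or curated registries.
ALLOWED_REGISTRY_HOSTS = {"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"}


def registry_allowed(url: str) -> bool:
    """Permit only HTTPS URLs whose host is on the allowlist (exact match)."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_REGISTRY_HOSTS


def sha256_matches(payload: bytes, expected_hex: str) -> bool:
    """Compare a downloaded artifact against a pinned SHA-256 digest."""
    return hashlib.sha256(payload).hexdigest() == expected_hex.lower()


def install_audit_entry(name: str, version: str, url: str, ok: bool) -> dict:
    # Fields follow the logging list above: name, version, source, result.
    return {"package": name, "version": version, "source": url, "success": ok}
```

Exact host matching is intentional: a suffix or substring check would accept a lookalike host such as `pypi.org.evil.example`.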
Network and Egress Controls
After package installs, network access is the second major way sandboxed code can reach unexpected destinations. Default-open egress is convenient in development but is a poor default for production apps running AI-generated code.
Default-deny egress: The strongest production posture is to block all outbound connections by default and explicitly allowlist the destinations a session legitimately needs. This requires more configuration but makes the access model auditable.
Allowlisted destinations: For coding agents, typical allowed destinations may include package registries, a specific set of public APIs the agent is built to call, and nothing else. For data analysis agents, the list may include specific data sources. Verify that the platform supports per-session or per-tenant destination allowlists.
DNS policy: DNS should be handled consistently with egress policy. A session that cannot reach arbitrary HTTP destinations should also not be able to resolve arbitrary DNS names and use that to infer network topology or bypass controls through DNS-based channels.
Internal service access: AI-generated code should not be able to reach cloud metadata endpoints (e.g., the AWS instance metadata service), internal APIs, private databases, or admin panels unless those are explicitly configured. Verify whether the sandbox’s default network policy blocks well-known internal address ranges.
Package download egress: Package installs are network operations. If egress is restricted, make sure the package registry allowlist is consistent with the egress policy, or use a pull-through proxy inside the trusted network.
Logging outbound connections: Even when egress is permitted, logging which domains and IPs a session contacted is useful for incident investigation. Not all sandbox platforms provide this natively; verify what you will get.
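The default-deny and internal-address rules above reduce to a small decision function. The sketch below shows only that decision logic, with a hypothetical domain allowlist; actual enforcement belongs in the network layer (firewall rules or an egress proxy), not in application code.

```python
import ipaddress

# Hypothetical per-session allowlist of DNS names.
ALLOWED_DOMAINS = {"pypi.org", "api.example.com"}


def is_internal(ip_text: str) -> bool:
    """True for private, loopback, and link-local addresses (e.g. 169.254.169.254)."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def egress_allowed(host: str, resolved_ip: str) -> bool:
    """Default-deny: the name must be allowlisted and must not resolve inward."""
    name = host.lower().rstrip(".")
    name_ok = any(name == d or name.endswith("." + d) for d in ALLOWED_DOMAINS)
    # Check the resolved address too, so an allowed name cannot point at
    # a metadata endpoint or another internal service.
    return name_ok and not is_internal(resolved_ip)
```

Checking both the name and the resolved address keeps the DNS policy consistent with the egress policy, as the DNS paragraph above recommends.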
Secrets and Credential Injection
AI agents frequently need credentials — API keys, database connections, OAuth tokens, short-lived cloud credentials. How a sandbox handles secrets matters for both security and operational reliability.
Narrow scope: Each session should receive only the secrets it needs for the specific task it is executing. Mounting a broad environment file with all credentials into every session is operationally convenient but means that compromised or misbehaving code in any session can reach all of those credentials.
Short-lived credentials: Where the backend supports it, prefer short-lived tokens with a TTL scoped to the session duration. This limits the window during which a leaked credential is useful.
Injection mechanism: Verify whether secrets are injected as environment variables, mounted files, or through a secrets API. Environment variables are accessible to all processes in the session by default; mounted files can be scoped to a path and permission set. For the most sensitive credentials, consider a secrets API that provides values only to an explicitly authorized process.
Redaction: The sandbox should not echo secrets back through stdout, stderr, execution logs, error messages, or model-visible tool responses. Redaction is an application-layer responsibility, but a sandbox that supports configurable log scrubbing reduces the blast radius of accidental exposure.
Cleanup: After the session ends, verify that environment variables, mounted secret files, and any cached credential data are cleaned up as part of the session teardown, not left behind for the next session to inherit.
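Narrow scoping and redaction, as described above, can be sketched in a few lines on the application side. The task-to-secret mapping and the function names are hypothetical. The redaction shown handles only exact matches of known values, so encoded or transformed secrets would slip past it.

```python
from typing import Dict, Iterable

# Hypothetical mapping from task type to the secret names it may receive.
TASK_SECRETS = {
    "fetch_report": {"REPORTS_API_TOKEN"},
    "run_tests": set(),
}


def scoped_env(task: str, vault: Dict[str, str]) -> Dict[str, str]:
    """Return only the secrets the task is entitled to; unknown tasks get none."""
    allowed = TASK_SECRETS.get(task, set())
    return {k: v for k, v in vault.items() if k in allowed}


def redact(text: str, secret_values: Iterable[str]) -> str:
    """Replace known secret values before output reaches logs or the model."""
    # Longest first, so a secret that contains another is fully masked.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:
            text = text.replace(value, "[REDACTED]")
    return text
```

For example, `scoped_env("run_tests", vault)` yields an empty mapping, so a test-running session never sees the reporting token at all.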
Ephemeral vs. Persistent File Storage
Different workloads have different persistence needs, and a production sandbox should support both patterns clearly.
Ephemeral sessions: The default for short-lived code execution is a session that creates a clean working directory, runs code, produces output, and is destroyed. Ephemeral sessions are easy to reason about: each run starts from a known baseline, no state accumulates, and cleanup is straightforward. They are the right choice for evaluation jobs, one-shot code completions, and any task where reproducibility matters more than continuity.
Persistent workspaces: Long-running coding agents, iterative development workflows, and multi-turn agent sessions often need a workspace that survives across multiple execution calls. Files installed, dependencies cached, code written, and history accumulated in one turn should be available in the next. Persistent workspaces are more complex to operate: they accumulate state, they can drift from the template, and they need an explicit lifecycle — when is the workspace cleaned up, who owns it, and what access controls protect it between sessions?
Snapshots and templates: Templates let you define a known-good baseline environment (runtimes, tools, dependencies) and launch sessions from it consistently. Snapshots capture the current state of a running session and use it as the starting point for future sessions. Both are useful for teams that need repeatable environments and low startup latency. Verify that templates are versioned, that permission to create and update them is controlled, and that snapshots are isolated by tenant.
Output artifact export: After execution, what can leave the sandbox? A production policy should define which file paths are exportable, what size limits apply, and whether artifacts are reviewed or filtered before the application receives them.
Cross-session state: Be explicit about whether your app design intends sessions to share state or not. Accidental sharing — through a shared package cache, a shared volume, or a misrouted workspace — is a common multi-tenant isolation failure.
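The ephemeral-session and artifact-export ideas above can be illustrated locally with a temporary directory standing in for the session workspace. The `out` directory convention, the size cap, and the function names are assumptions for this sketch.

```python
import tempfile
from pathlib import Path
from typing import List

MAX_EXPORT_BYTES = 1_000_000  # illustrative size cap
EXPORT_DIR = "out"            # hypothetical exportable subdirectory


def exportable(workspace: Path, candidate: Path) -> bool:
    """Allow export only for regular files under <workspace>/out within the size cap."""
    root = (workspace / EXPORT_DIR).resolve()
    target = candidate.resolve()  # resolves symlinks and ".." segments
    if root not in target.parents:
        return False
    return target.is_file() and target.stat().st_size <= MAX_EXPORT_BYTES


def run_ephemeral() -> List[str]:
    """Create a throwaway workspace, write files, and list names allowed to leave it.

    The directory and its contents are deleted when the `with` block exits.
    """
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / EXPORT_DIR).mkdir()
        (ws / EXPORT_DIR / "result.txt").write_text("42")
        (ws / "scratch.txt").write_text("not exported")
        files = [p for p in ws.rglob("*") if p.is_file()]
        return sorted(p.name for p in files if exportable(ws, p))
```

Resolving the candidate path before the containment check means a path such as `out/../scratch.txt` is rejected rather than exported.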
Backend Integration: REST, WebSocket, SDK
A sandbox is only useful if it integrates cleanly into the application backend. The three main integration patterns are REST, WebSocket, and SDK.
REST: A REST API is the lowest-friction integration for apps that submit discrete execution requests and poll for results. It works well for short-lived tasks, is easy to debug with standard HTTP tooling, and fits naturally into existing service architectures. The tradeoff is that polling for results adds latency compared to push notifications, and streaming long-running output requires either SSE or polling a log endpoint.
WebSocket: A WebSocket connection supports bidirectional, low-latency communication between the application and the sandbox. This is the right choice for interactive use cases: a coding assistant that streams output as code runs, a browser agent that needs to send commands and receive responses in real time, or an evaluation harness that monitors execution continuously. The tradeoff is operational complexity: WebSocket connections require persistent state, reconnect handling, and more complex infrastructure on both the client and server side.
SDK: A language-native SDK hides transport details, handles authentication, provides typed interfaces for session management and execution, and often includes helpers for streaming output, uploading files, and managing templates. An SDK is the fastest path to integration for most app developers. Verify that the SDK is actively maintained, covers the full API surface, and handles errors in a structured way that your application can act on.
Integration points your application needs to own: Regardless of transport, your application is responsible for authorization (which users can create sessions and with which resource limits), approval gates (which tool calls or code executions require human review before running), result handling (how the sandbox output is surfaced or acted on by the agent), and cleanup (triggering session teardown when the user flow completes or the agent turn ends).
A well-designed sandbox API does not try to own your application’s business logic. It exposes primitives — create, execute, stream, terminate, cleanup — and lets your application layer build the right product behavior on top.
Failure Recovery and Cleanup
Production systems fail. A sandbox that handles failure gracefully prevents resource leaks, stale state, and difficult-to-debug incidents.
Execution timeout handling: When a running execution exceeds its timeout, the platform should terminate the process tree cleanly and return a structured error response — not leave a zombie session consuming resources. Verify what happens to the session after a timeout: is it automatically cleaned up, or does it require an explicit cleanup call?
Session crash recovery: If the sandbox host crashes or the session VM exits unexpectedly, the platform should detect the failure, mark the session as terminated, and surface that state through the API so the application can react. Sessions should not silently disappear with no API signal.
Cleanup guarantees: A cleanup or terminate API call should reliably release all resources: CPU and memory allocations, filesystem quota, process slots, network state, and credentials. The cleanup should be idempotent — calling it multiple times on the same session ID should not return an error. This matters in practice: application code that retries cleanup after a network error should not break.
Partial execution failures: When code fails mid-execution — an unhandled exception, a killed process, a missing package — the sandbox should return a structured result that distinguishes partial success (some output was produced before the failure) from total failure. Applications built on partial results need this to avoid presenting incomplete or misleading output to users.
Runaway process handling: If generated code creates a background process that survives the main execution, the sandbox should terminate it as part of session cleanup rather than allowing it to run indefinitely. Verify whether the platform’s cleanup covers the full process tree, not only the immediate child of the execution call.
Capacity and quota errors: When the platform is at session capacity or a tenant has reached their quota, the API should return a specific error code the application can handle explicitly, not a generic 500 or a silent hang. This allows the application to queue, back off, or surface a useful message to the user.
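When the platform does return a specific capacity error, the application can respond with bounded retries. The sketch below assumes a hypothetical `CapacityError` exception raised by whatever client wraps the sandbox API; the `sleep` parameter is injectable so the delays can be observed without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CapacityError(Exception):
    """Hypothetical error raised when the platform reports it is at capacity."""


def with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on CapacityError with exponential backoff; re-raise on the last try."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except CapacityError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller queue or inform the user
            sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")
```

Only the capacity error is retried. Other exceptions propagate immediately, since retrying a user-code failure or an authorization error would just repeat it.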
Novita Agent Sandbox
Novita Agent Sandbox is a managed sandbox platform built for agent workloads. It targets coding agents, data analysis agents, browser-oriented workflows, and longer-running agent sessions where generated code needs to run in an isolated, observable environment without landing on application servers or shared infrastructure.
For teams already using Novita AI model APIs, Agent Sandbox can be part of a broader agent architecture: the model plans and generates code, the sandbox provides isolated execution with a programmable lifecycle, and the application layer owns authorization, approval gates, and result handling.
Novita has described capabilities including microVM isolation, concurrent session support, a lifecycle API covering create, execute, stream, terminate, and cleanup, Pause and Autoresume for managing session state, templates and snapshots for fast and repeatable environment startup, and integration with Novita model APIs. Verify current feature availability, resource configuration options, and pricing on the Novita Agent Sandbox documentation and product page before making architecture decisions. Claims about specific isolation boundaries, concurrency limits, startup latency, and network policy should be confirmed against current product documentation.
When evaluating Novita Agent Sandbox against the requirements in this guide, apply the same checklist as any other vendor: isolation boundary per session, lifecycle API completeness, observability surface, configurable resource limits, package policy options, egress controls, secrets handling, persistence model, and backend integration support.
FAQ
What isolation model should I choose for AI-generated code?
MicroVM isolation gives the strongest boundary for untrusted AI-generated code, but it adds operational complexity. Container isolation is adequate for lower-risk workloads when the container is correctly hardened — no privileged mode, minimal capabilities, read-only root filesystem where possible, and no host socket mounts. Process isolation alone is too thin a boundary for untrusted code that may attempt syscalls, subprocess spawning, or package installs. Match the isolation level to your actual threat model.
How do I handle package installs in a production sandbox?
Use registry allowlists rather than default-open access. Add a pull-through cache to reduce redundant downloads and give you an inspection point. Log every install attempt with package name, version, source, and result. For workloads where reproducibility matters more than flexibility — evaluation runs, automated pipelines — consider an offline mode where the environment is pre-baked and installs are disallowed entirely.
What should a lifecycle API expose at minimum?
Create, execute with streaming output, terminate, and cleanup. Streaming output is the capability most often missing from minimal implementations, and it is the one that matters most for interactive agent UIs. Cleanup must be idempotent and must cover the full process tree, not just the entry-point process.
How do I prevent secrets from leaking through a sandbox?
Scope credentials narrowly to the task — not a broad environment file. Prefer short-lived tokens. Do not log full stdout by default if secrets may appear there. Verify that the sandbox cleans up environment variables and mounted secret files on session teardown. Treat redaction as an application responsibility, not a sandbox guarantee.
