- AI Data Analyst Architecture: Upload, Analyze, Review
- What Runs Inside a Python Sandbox for Data Analysis?
- How Should CSV Upload and Schema Inspection Work?
- How Does the Model Generate and Run Python Safely?
- Controlled Python Package Access for AI Data Analysis
- How to Validate Charts and Output Files
- Security Checkpoints Before Production
- Using Novita Agent Sandbox as the Execution Layer
- Conclusion
- FAQ
An AI data analyst needs sandboxed Python when user-provided datasets, model-generated code, package installs, generated charts, and downloadable outputs must run in an isolated, observable environment. The practical implementation flow is: upload a file, inspect the schema with trusted code, ask the model for a plan, review the generated Python, run it in a constrained sandbox, validate the output artifacts, and show the user what happened.
AI Data Analyst Architecture: Upload, Analyze, Review
The product pattern is simple on the surface: a user uploads a CSV, asks a natural-language question, and expects useful tables, charts, and downloadable files. Under the hood, the app is running a small agent workflow with real side effects. The model plans the analysis and drafts Python, while the application decides what code, packages, files, network access, and outputs are allowed.
Build the first version around one clear path:
- Accept a CSV upload for one analysis job.
- Create a job-scoped sandbox workspace.
- Run schema inspection code that you own before asking the model for Python.
- Ask the model for an analysis plan, then a script that follows your file and package rules.
- Execute the script with time, memory, disk, package, and network limits.
- Collect only validated artifacts from a known output directory.
- Show the user the answer, charts, warnings, logs, and files selected for download.
That separation keeps responsibilities clear. The model proposes and explains analysis. The backend applies product policy and orchestration. The sandbox runs the code with constrained files, packages, time, memory, network access, and secrets.
What Runs Inside a Python Sandbox for Data Analysis?
Put the analysis workspace inside the sandbox, not inside your main application server. The sandbox should receive a narrow input bundle for one analysis job: the uploaded file, a small manifest, a generated script, and any approved runtime configuration. The application backend should keep authentication, billing, user identity, long-term storage, and production secrets outside that workspace.
For an AI data analyst, the sandbox usually owns these tasks:
| Sandbox task | Why it belongs there |
|---|---|
| File staging | The uploaded CSV can be scanned and copied into an isolated working directory before Python touches it. |
| Schema inspection | The app can infer column names, types, null rates, row count, and sample values without exposing the full file to the model. |
| Python execution | Model-generated code runs away from the application server and can be time-boxed. |
| Package preparation | Only approved dependencies are installed or made available to the job. |
| Chart rendering | Plot images are written as files and reviewed before download. |
| Result packaging | Final artifacts can be collected from a known output directory. |
| Cleanup | Temporary files, generated code, and session state can be deleted or allowed to expire. |
Keep the model’s prompt smaller than the data. Send a schema summary, a few representative rows if policy allows, column descriptions, user intent, and constraints such as “do not train a model” or “only use approved packages.” The raw dataset should remain in the sandbox file system unless your product has a specific, reviewed reason to expose more.
How Should CSV Upload and Schema Inspection Work?
Start by treating every upload as untrusted input. Validate the file type, size, encoding, delimiter, row count, column count, and suspicious formulas before the model gets involved. A CSV can still contain values that trigger spreadsheet formula execution when opened later, so exported files should be sanitized for the target format as well.
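A minimal sketch of those pre-model checks, using only the Python standard library. The limits, the delimiter candidates, and the function names are illustrative assumptions rather than recommended values, and the delimiter guess is deliberately simple (it counts candidates in the header line).

```python
import csv
import io

# Illustrative limits; tune them to your product's policy.
MAX_BYTES = 50 * 1024 * 1024
MAX_ROWS = 1_000_000
MAX_COLUMNS = 200
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]
FORMULA_PREFIXES = ("=", "+", "-", "@")


def looks_like_formula(cell: str) -> bool:
    """True for text a spreadsheet may evaluate; signed numbers are ignored."""
    if not cell.startswith(FORMULA_PREFIXES):
        return False
    try:
        float(cell)
    except ValueError:
        return True
    return False


def validate_csv_upload(raw: bytes) -> dict:
    """Return a small report, or raise ValueError when a limit is broken."""
    if len(raw) > MAX_BYTES:
        raise ValueError("file too large")
    try:
        text = raw.decode("utf-8-sig")  # accepts UTF-8 with or without a BOM
    except UnicodeDecodeError as exc:
        raise ValueError("file is not valid UTF-8") from exc

    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Pick the candidate that occurs most often in the header; ties fall back to ",".
    delimiter = max(CANDIDATE_DELIMITERS, key=first_line.count)

    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if not header:
        raise ValueError("file has no header row")
    if len(header) > MAX_COLUMNS:
        raise ValueError("too many columns")

    rows = 0
    formula_like = 0
    for row in reader:
        rows += 1
        if rows > MAX_ROWS:
            raise ValueError("too many rows")
        formula_like += sum(1 for cell in row if looks_like_formula(cell))

    return {
        "delimiter": delimiter,
        "columns": len(header),
        "rows": rows,
        "formula_like_cells": formula_like,
    }
```

A non-zero `formula_like_cells` count does not have to block the job; it can simply flag that any CSV export derived from this file needs escaping.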
A practical upload flow looks like this:
- The user uploads a CSV to the app.
- The backend stores the original file under a job-scoped object key or staging path.
- The backend creates a sandbox session for the job.
- The backend copies the file into a sandbox working directory.
- A small, deterministic inspection script reads the file and produces a schema summary.
- The model receives the schema summary, user question, allowed libraries, and output requirements.
The inspection step should be deterministic code you own, not model-generated code. It can produce a compact JSON summary like this:
    {
      "file": "sales.csv",
      "rows": 84231,
      "columns": [
        {"name": "order_date", "type": "date", "null_rate": 0.01},
        {"name": "region", "type": "string", "sample_values": ["NA", "EMEA", "APAC"]},
        {"name": "revenue", "type": "number", "null_rate": 0.0}
      ],
      "safe_sample_rows": 5
    }
That summary gives the model enough context to draft an analysis without handing it the whole dataset. For sensitive workloads, reduce or remove sample values, mask columns, or require the user to approve which columns can be used.
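One way such an owned inspector could look is sketched below with the standard library only. The type inference is deliberately naive (numbers, ISO dates, otherwise strings), the whole file is loaded into memory for brevity, and names such as `inspect_schema` and the `masked` parameter are assumptions made for illustration.

```python
import csv
import io
import json
from datetime import date


def _all_parse(values, parser):
    """True when every value is accepted by the parser."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Classify non-empty strings as 'number', 'date', or 'string'."""
    if values and _all_parse(values, float):
        return "number"
    if values and _all_parse(values, date.fromisoformat):
        return "date"
    return "string"


def inspect_schema(csv_text, file_name, masked=frozenset(), max_samples=3):
    """Build a compact summary; masked columns never contribute sample values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = []
    for name in reader.fieldnames or []:
        values = [(row.get(name) or "").strip() for row in rows]
        present = [value for value in values if value]
        column = {
            "name": name,
            "type": infer_type(present),
            "null_rate": round(1 - len(present) / len(rows), 4) if rows else 0.0,
        }
        # Only low-risk string columns expose a few distinct sample values.
        if column["type"] == "string" and name not in masked:
            column["sample_values"] = sorted(set(present))[:max_samples]
        columns.append(column)
    return {"file": file_name, "rows": len(rows), "columns": columns}
```

Calling `json.dumps(inspect_schema(text, "sales.csv", masked={"region"}))` yields a summary with no sample values for the masked column, which is the shape the model prompt would receive.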
How Does the Model Generate and Run Python Safely?
The model should produce a plan before it produces code. A good plan names the columns it will use, the transformations it intends to run, the charts it expects to create, and the output files it will write. This gives your application a checkpoint for policy and user review.
After the plan is accepted, ask for Python that follows a narrow contract:
- Read input files only from an input/ directory.
- Write artifacts only to an output/ directory.
- Use approved packages only.
- Avoid network calls unless the job policy explicitly allows them.
- Print a structured summary at the end.
- Fail clearly when required columns are missing.
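The package part of this contract can be pre-checked before anything runs. The sketch below uses Python's standard `ast` module to flag imports outside an allowlist and a few dynamic-execution calls. The allowlist contents, the blocked-call set, and the name `check_script_policy` are illustrative assumptions. A static check like this is easy to evade, so treat it as a fast pre-filter; the sandbox remains the enforcement boundary.

```python
import ast

# Illustrative policy: approved analysis packages plus two stdlib helpers.
ALLOWED_IMPORTS = {"pandas", "numpy", "matplotlib", "json", "pathlib"}
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__"}


def check_script_policy(source):
    """Return a list of policy violations; an empty list means no findings."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]  # relative imports have no module name
        else:
            modules = []
        for module in modules:
            root = module.split(".")[0]
            if root not in ALLOWED_IMPORTS:
                violations.append(
                    f"line {node.lineno}: import of '{module}' is not allowed"
                )
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BLOCKED_CALLS
        ):
            violations.append(
                f"line {node.lineno}: call to '{node.func.id}' is not allowed"
            )
    return violations
```

Returning a list instead of raising on the first finding lets the application show every violation to a reviewer, or feed them all back to the model in one repair request.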
At a conceptual level, the orchestration loop looks like this:
    # Stage the upload in a job-scoped sandbox.
    job = create_analysis_job(user_id, uploaded_file)
    sandbox = create_sandbox(job_id=job.id, timeout_seconds=300)
    copy_file_to_sandbox(sandbox, uploaded_file, sandbox_path="/work/input/data.csv")

    # Inspect the schema with trusted code before the model is involved.
    schema = run_owned_schema_inspector(sandbox, "/work/input/data.csv")

    # Plan first, then review the plan against policy.
    plan = ask_model_for_analysis_plan(
        user_question=job.question,
        schema=schema,
        allowed_packages=["pandas", "numpy", "matplotlib"],
        output_contract={"directory": "/work/output", "formats": ["png", "csv", "json"]},
    )
    review_policy(plan)

    # Generate code only after the plan is accepted, then review it statically.
    script = ask_model_for_python(plan=plan, schema=schema)
    review_static_code_policy(script)

    # Run with explicit limits, then collect and review outputs.
    result = run_python_in_sandbox(
        sandbox=sandbox,
        script=script,
        working_dir="/work",
        timeout_seconds=120,
        memory_limit_mb=1024,
    )
    artifacts = collect_outputs(sandbox, "/work/output")
    review_outputs(artifacts)
    return_answer_to_user(result.summary, artifacts)
This is pseudocode, not a product SDK contract. The point is the boundary: generated code is reviewed, run with a timeout, constrained to known directories, and followed by output collection and review.
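To make the timeout and memory parameters concrete, here is a local stand-in for the run step. It is a sketch under loud assumptions: it uses a plain POSIX child process with `resource` limits, which is not an isolation boundary and does not restrict file or network access, and names like `run_script`, `analysis.py`, and `SAFE_ENV_VARS` are hypothetical. A real deployment would hand the script to the sandbox instead.

```python
import os
import resource  # POSIX only
import subprocess
import sys
from pathlib import Path

# Only these variables are passed through, so host secrets stay out of the child.
SAFE_ENV_VARS = {"PATH", "LANG", "LC_ALL", "LD_LIBRARY_PATH"}


def _cap(kind, value):
    """Lower a resource limit to `value`, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    new = value if hard == resource.RLIM_INFINITY else min(value, hard)
    resource.setrlimit(kind, (new, new))


def run_script(script, work_dir, timeout_seconds=120, memory_limit_mb=1024):
    """Run a script with wall-clock, CPU, and address-space limits."""
    work = Path(work_dir).resolve()
    for name in ("input", "output"):
        (work / name).mkdir(parents=True, exist_ok=True)
    script_path = work / "analysis.py"
    script_path.write_text(script, encoding="utf-8")

    def apply_limits():  # runs in the child process just before exec
        _cap(resource.RLIMIT_AS, memory_limit_mb * 1024 * 1024)
        _cap(resource.RLIMIT_CPU, timeout_seconds)

    env = {k: v for k, v in os.environ.items() if k in SAFE_ENV_VARS}
    try:
        proc = subprocess.run(
            [sys.executable, "-I", str(script_path)],  # -I: isolated mode
            cwd=work,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # wall-clock limit enforced by the parent
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "exit_code": None, "stdout": "", "stderr": ""}
    return {
        "status": "ok" if proc.returncode == 0 else "error",
        "exit_code": proc.returncode,
        "stdout": proc.stdout[-4000:],  # keep only a bounded tail of each stream
        "stderr": proc.stderr[-4000:],
    }
```

The wall-clock timeout and the CPU limit are separate on purpose: a script that sleeps or blocks on I/O uses almost no CPU time, so only the wall-clock limit stops it.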
If the script fails, send the error message and a small code excerpt back to the model for repair. Do not send unlimited logs. Error repair should keep the same package, file, network, and output policy as the first attempt.
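A bounded repair payload might be assembled as in the sketch below. It assumes the generated script is saved under the hypothetical name `analysis.py`, so traceback frames can be matched to it, and the size limits are illustrative.

```python
import re


def build_repair_context(script, stderr, script_name="analysis.py",
                         max_error_chars=2000, context_lines=3):
    """Bundle a bounded error tail and a short code excerpt for a repair prompt."""
    error_tail = stderr[-max_error_chars:]

    # Find the last traceback frame that points into the generated script.
    pattern = r'File "[^"]*' + re.escape(script_name) + r'", line (\d+)'
    hits = re.findall(pattern, stderr)

    lines = script.splitlines()
    if hits:
        failing = int(hits[-1])
        start = max(failing - context_lines, 1)
        end = min(failing + context_lines, len(lines))
    else:
        # No frame found: fall back to the first few lines of the script.
        failing = None
        start, end = 1, min(2 * context_lines + 1, len(lines))

    excerpt = "\n".join(f"{n}: {lines[n - 1]}" for n in range(start, end + 1))
    return {"error": error_tail, "failing_line": failing, "excerpt": excerpt}
```

Keeping the tail of the error output, not the head, preserves the exception type and message, which is usually the part the model needs.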
Controlled Python Package Access for AI Data Analysis
Package access is where many AI data analyst demos become risky. A model may ask for a library because it saw it in a tutorial, because a package name looks plausible, or because the user’s prompt suggested it. Your app should not turn those suggestions into unrestricted package installs.
Use a policy that matches the sensitivity of the data:
| Package policy | Best fit | Tradeoff |
|---|---|---|
| Prebuilt image only | Production workloads with predictable analysis needs | Lowest flexibility, simplest review surface |
| Allowlisted packages | Most CSV analysis assistants | Good balance for pandas, plotting, and common statistics packages |
| Version-pinned installs | Reproducible analysis jobs | Requires package maintenance and vulnerability review |
| Cached internal mirror | Enterprise or regulated data workflows | More operational work, better control over supply chain |
| User-approved installs | Exploratory tools for trusted users | More flexible, but slower and needs clear warnings |
For a first production version, start with a prebuilt environment or a short allowlist. Most CSV questions can be answered with a small set of libraries: pandas, numpy, matplotlib, seaborn, scipy, and sometimes scikit-learn. If a job needs another package, have the model explain why, then route that request through human approval or a package review workflow.
Log package name, version, source registry, install time, and the reason the package was requested. If your security team uses dependency scanners or private registries, integrate with that process instead of letting the agent bypass it.
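As a rough illustration of an allowlist with pinned versions plus an audit record, consider the sketch below. The pins, the mirror URL, the decision labels, and the function name are all hypothetical placeholders; the record captures the request time, and a real installer would add the actual install time.

```python
from datetime import datetime, timezone

# Hypothetical pins and mirror; replace with versions your team has reviewed.
ALLOWLIST = {"pandas": "2.2.2", "numpy": "1.26.4", "matplotlib": "3.8.4"}
REGISTRY = "https://mirror.example.internal/simple"


def review_package_request(name, reason, requested_version=None):
    """Decide on a package request and return a record suitable for logging."""
    normalized = name.strip().lower().replace("_", "-")
    pinned = ALLOWLIST.get(normalized)

    if pinned is None:
        decision = "needs_review"  # route to human approval, never auto-install
    elif requested_version not in (None, pinned):
        decision = "denied_version_mismatch"
    else:
        decision = "approved"

    return {
        "package": normalized,
        "version": pinned,
        "requested_version": requested_version,
        "source_registry": REGISTRY if pinned else None,
        "reason": reason,
        "decision": decision,
        "requested_at": datetime.now(timezone.utc).isoformat(),
    }
```

Every request produces a record, including denied ones, so reviewers can see which packages the model keeps asking for and decide whether the allowlist should grow.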
How to Validate Charts and Output Files
Generated files are part of the product experience, but they are also part of the trust boundary. A chart can be wrong. A CSV can contain formula-like values. A notebook can include hidden code. A ZIP can contain unexpected paths. Treat outputs as artifacts to inspect, not just files to download.
Define a simple output contract:
    {
      "required_files": ["summary.json"],
      "optional_files": ["chart-*.png", "filtered-data.csv"],
      "blocked_extensions": [".exe", ".sh", ".bat", ".html"],
      "max_total_size_mb": 25
    }
For each completed job, collect files only from the expected output directory. Validate MIME type, extension, size, and path. For images, generate thumbnails for preview. For CSV exports, escape spreadsheet formulas if the file may be opened in Excel or Google Sheets. For JSON summaries, validate against a schema before using them in the UI.
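The checks above could be sketched as follows, reusing the contract shown earlier in this section. The function names and rejection messages are assumptions, the PNG check only compares the 8-byte file signature, and the formula escaping uses the common leading-apostrophe approach; thumbnailing and JSON schema validation are left out.

```python
import fnmatch
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the standard 8-byte PNG signature

CONTRACT = {
    "required_files": ["summary.json"],
    "optional_files": ["chart-*.png", "filtered-data.csv"],
    "blocked_extensions": [".exe", ".sh", ".bat", ".html"],
    "max_total_size_mb": 25,
}


def escape_formula_cell(value):
    """Prefix formula-like text with an apostrophe; signed numbers are kept."""
    if value.startswith(("=", "+", "-", "@")):
        try:
            float(value)
        except ValueError:
            return "'" + value
    return value


def _starts_with(path, magic):
    with path.open("rb") as handle:
        return handle.read(len(magic)) == magic


def validate_artifacts(output_dir, contract=CONTRACT):
    """Sort files in the output directory into accepted, rejected, and missing."""
    root = Path(output_dir).resolve()
    patterns = contract["required_files"] + contract["optional_files"]
    budget = contract["max_total_size_mb"] * 1024 * 1024
    accepted, rejected, used = [], {}, 0

    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        suffix = path.suffix.lower()
        if path.is_symlink():
            rejected[rel] = "symlink"
        elif path.is_dir():
            continue
        elif "/" in rel or rel.startswith("."):
            rejected[rel] = "nested or hidden path"
        elif suffix in contract["blocked_extensions"]:
            rejected[rel] = "blocked extension"
        elif not any(fnmatch.fnmatchcase(rel, p) for p in patterns):
            rejected[rel] = "not in output contract"
        elif suffix == ".png" and not _starts_with(path, PNG_MAGIC):
            rejected[rel] = "content is not a PNG image"
        elif used + path.stat().st_size > budget:
            rejected[rel] = "size budget exceeded"
        else:
            used += path.stat().st_size
            accepted.append(rel)

    missing = [n for n in contract["required_files"] if n not in accepted]
    return {"accepted": accepted, "rejected": rejected, "missing": missing}
```

Nested paths are rejected before pattern matching because `fnmatch` lets `*` match a path separator, so `chart-*.png` would otherwise accept a file inside a subdirectory.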
Give users a review step before they download or share results. The review screen should show:
- The original question.
- The dataset name and schema used.
- The analysis steps in plain language.
- The generated charts and tables.
- Any columns excluded for policy reasons.
- Warnings, errors, retries, or package requests.
The model can write a narrative explanation, but the app should ground that explanation in files and logs from the sandbox run.
Security Checkpoints Before Production
An AI data analyst is a useful internal tool only if security and platform teams can reason about what it is allowed to do. The review should cover isolation, resource limits, package policy, network behavior, secrets, logs, and deletion.
Use this checklist before moving beyond a prototype:
| Checkpoint | Question to answer |
|---|---|
| Isolation boundary | What separates one user’s code and files from the host and other users? |
| File access | Can generated code read only the job directory, or can it see broader storage? |
| Resource limits | What caps CPU time, memory, disk, process count, and wall-clock time? |
| Network policy | Is outbound network access off, allowlisted, proxied, or fully open? |
| Package policy | Which packages can be installed, from where, and with what version controls? |
| Secret boundary | Are API keys, database credentials, and service tokens kept out of the sandbox unless explicitly scoped? |
| Logs | Are commands, package installs, errors, file reads/writes, and output artifacts recorded? |
| Human review | Which plans, code snippets, package requests, and outputs need approval? |
| Cleanup | When are sandbox state, uploaded files, generated scripts, logs, and outputs deleted? |
Avoid absolute claims such as “the code cannot escape” or “data cannot leak.” The practical standard is more concrete: define the boundary, document the controls, test failure modes, and keep enough audit trail to investigate unexpected behavior.
For network and package policy, remember that dependency installation is a form of network egress unless packages come from a prebuilt image or controlled mirror. If the dataset is sensitive, network access should be blocked or tightly allowlisted by default. If the analyst needs live external data, make that a separate tool with its own approval and logging path.
Using Novita Agent Sandbox as the Execution Layer
Novita Agent Sandbox provides isolated, stateful execution environments for AI agents. The current Novita docs describe support for running code, installing dependencies, accessing files, using browsers, and preserving execution state across sessions. For an AI data analyst, those primitives map directly to the execution part of the architecture: create a job workspace, move files in, run analysis code, collect artifacts, and clean up or preserve state based on the session design.
The Novita Agent Sandbox SDK and CLI documentation lists official SDK support for Python and JavaScript/TypeScript, which fits common application backends. The sandbox filesystem documentation describes an isolated filesystem with fixed 20 GB storage space for sandboxes, useful for staging CSV files and generated artifacts within a job-scoped workspace.
Keep the distinction clear:
- Implementation guidance in this article describes a general architecture for AI data analyst apps.
- Novita Agent Sandbox can provide the sandbox execution layer for those workflows.
- Your application still owns user authentication, data retention policy, package approval, network policy, output review, and publish/deployment decisions.
That separation helps teams build with a clean responsibility model. The model suggests and explains analysis. The application enforces product policy. The sandbox provides the controlled runtime where code, files, packages, charts, and logs can be handled away from the main application server.
Conclusion
The strongest AI data analyst design is not “let the model run Python.” It is a controlled loop: inspect the dataset, ask the model for a plan, review generated code, run it in a sandbox, collect validated artifacts, show the user what happened, and clean up state when the job is done. That structure keeps the user experience predictable while giving engineering and security teams concrete checkpoints to evaluate before production.
For teams building this pattern, start small: CSV upload, schema inspection, a short package allowlist, chart output, strict timeouts, and a visible review screen. Add broader package access, network tools, persistence, and automation only after the boundaries are documented and tested.
FAQ
Why does an AI data analyst need a sandbox?
It needs a sandbox because the workflow combines untrusted files, model-generated Python, package requests, chart generation, and downloadable artifacts. Running that work in a separate environment gives your app a place to apply file, resource, package, network, logging, and cleanup controls.
Should the model see the full CSV?
Usually no. Start by sending the model a schema summary, safe samples, column descriptions, and the user’s question. Keep the raw file in the sandbox unless your product has a reviewed reason to expose more data to the model.
Can package installs be allowed?
Yes, but they should be controlled. Use a prebuilt image, allowlist, pinned versions, private mirror, or approval workflow. Do not let model-generated code install arbitrary packages from the public internet without review.
What files should the app return to users?
Return only validated files from a known output directory, such as chart images, summary JSON, and sanitized CSV exports. Block unexpected extensions, large files, hidden paths, and artifacts that were not part of the output contract.
Is this a compliance guarantee?
No. A sandbox is one part of the execution architecture. Compliance and security approval depend on your data, threat model, controls, logging, retention, review process, and deployment environment.
