Air-Gapped AI Fine-Tuning: How to Train Custom LLMs Without Internet Access


Running an LLM in an air-gapped environment is straightforward. Download weights, transfer them in, serve with vLLM or Ollama. Plenty of guides cover this.

Fine-tuning in an air-gapped environment is different. Training custom models on proprietary data inside a network with zero external connectivity introduces challenges that most enterprise AI guides skip entirely.

This guide covers what it actually takes to fine-tune LLMs behind an air gap: the infrastructure requirements, the data pipeline, the training workflow, and the evaluation process. No hand-waving about "secure environments." Specific steps for teams handling classified, regulated, or sensitive data who need custom models that never touch the public internet.

Why Fine-Tune Instead of Just Running Inference?

Most air-gapped AI content focuses on inference. Download a model, run it locally, done. But inference alone limits what you can do.

A base model trained on public data doesn't know your organization's terminology, procedures, or domain expertise. It can't answer questions about your internal systems. It hallucinates when asked about proprietary processes.

RAG helps with some of this. Retrieval-augmented generation connects the model to your documents at query time. But RAG has limitations:

  • Retrieval typically adds 200-500 ms of latency per query
  • Context windows cap how much retrieved content you can inject
  • The model still doesn't internalize your domain knowledge
  • Complex reasoning across multiple documents degrades quality

Fine-tuning solves these problems by baking domain knowledge directly into the model weights. A fine-tuned model responds instantly, understands your terminology natively, and reasons about your domain without retrieval overhead.

In enterprise evaluations, fine-tuned models often outperform base models paired with RAG on narrow, domain-specific tasks, and the gap tends to widen with task complexity.

The trade-off: fine-tuning requires more infrastructure and expertise than RAG. In an air-gapped environment, that complexity increases further.

What Makes Air-Gapped Fine-Tuning Different

Air-gapped fine-tuning differs from standard fine-tuning in four ways:

No package management during training. Every Python library, CUDA driver, and dependency must be pre-loaded. A missing package means stopping the job, transferring files through your security process, and restarting. Plan for this upfront.

No real-time debugging. You can't pip install a fix or pull a patch from GitHub. Every tool in your troubleshooting toolkit needs to exist inside the perimeter before training begins.

No cloud compute fallback. If your on-premise GPUs can't handle the job, you don't spin up cloud instances. You either reduce model size, optimize training parameters, or upgrade hardware through procurement cycles that take months.

No model hub access. Base models, tokenizers, and adapter weights all need to be transferred in advance. You can't pull the latest checkpoint mid-training.

These constraints shape every decision in your fine-tuning pipeline. The sections below walk through how to handle each one.

Infrastructure Requirements

GPU Sizing

Fine-tuning demands more GPU memory than inference. You're storing model weights, optimizer states, gradients, and activations simultaneously.

| Model Size | Full Fine-Tune (FP16) | LoRA Fine-Tune | QLoRA (4-bit) |
|---|---|---|---|
| 7B params | 112GB VRAM | 28GB VRAM | 12GB VRAM |
| 13B params | 208GB VRAM | 52GB VRAM | 20GB VRAM |
| 34B params | 544GB VRAM | 136GB VRAM | 48GB VRAM |
| 70B params | 1.1TB VRAM | 280GB VRAM | 96GB VRAM |

Full fine-tuning updates all model parameters. For most enterprise use cases, it's overkill. LoRA fine-tuning adds trainable low-rank matrices to specific layers while freezing the base weights. QLoRA combines LoRA with 4-bit quantization of the frozen base weights, cutting their memory footprint by roughly 75% with minimal quality loss.

For a 7B model with QLoRA, a single NVIDIA A100 (80GB) handles training comfortably. A 70B model needs at least 2x A100s or 4x H100s. Most enterprise teams fine-tune 7B-13B models and see strong results.
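The rules of thumb behind the table above can be expressed as bytes per parameter: full FP16 fine-tuning holds weights (2 B), gradients (2 B), and FP32 Adam optimizer states (8 B) plus activation overhead, roughly 16 B/param; LoRA freezes the base weights and trains only small adapters, roughly 4 B/param. A minimal sketch (a rough planning heuristic, not a guarantee; QLoRA is omitted because its fixed overhead doesn't scale linearly):

```python
def estimate_vram_gb(num_params: float, method: str) -> float:
    """Rough training-VRAM estimate in GB, using bytes-per-parameter rules of thumb.

    "full": FP16 weights (2) + gradients (2) + FP32 Adam states (8) + activations ~= 16 B/param.
    "lora": frozen FP16 base weights plus small adapter states ~= 4 B/param.
    QLoRA adds 4-bit base weights (~0.55 B/param) plus overhead that doesn't
    scale linearly, so it isn't modeled here.
    """
    bytes_per_param = {"full": 16.0, "lora": 4.0}
    return num_params * bytes_per_param[method] / 1e9

print(estimate_vram_gb(7e9, "full"))  # 112.0
print(estimate_vram_gb(7e9, "lora"))  # 28.0
```

Run this against your target model size before procurement conversations; it is cheaper to discover a shortfall in a spreadsheet than mid-training.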

Storage

Training generates checkpoints, logs, and intermediate files. Budget 3-5x the model size for working storage:

  • 7B model: 50-100GB
  • 13B model: 100-200GB
  • 70B model: 500GB-1TB

Fast NVMe storage matters. Training I/O bottlenecks on slow disks, especially when loading large datasets or writing checkpoints. RAID configurations with redundancy protect against mid-training failures.

Networking

Air-gapped doesn't mean isolated from all networks. Internal networking between training nodes, storage, and monitoring systems needs high bandwidth. For distributed training across multiple GPUs:

  • Single-node multi-GPU: NVLink (up to 600 GB/s per GPU on A100, 900 GB/s on H100) or PCIe Gen4/Gen5 (roughly 32-64 GB/s per x16 link)
  • Multi-node: InfiniBand (200-400 Gbps) or high-speed Ethernet

For single-node training (most common for 7B-13B models), standard networking suffices.

Software Stack

Pre-install everything before the air gap closes:

Core dependencies:

  • Python 3.10+
  • PyTorch 2.0+ with CUDA support
  • Transformers library
  • PEFT (Parameter-Efficient Fine-Tuning)
  • bitsandbytes (for quantization)
  • accelerate (for distributed training)

Training frameworks:

  • Axolotl (handles QLoRA, LoRA, full fine-tuning)
  • LLaMA-Factory (supports 100+ models)
  • Hugging Face TRL (RLHF and DPO training)

Serving:

  • vLLM (production inference)
  • Ollama (simpler deployments)
  • TGI (Hugging Face's inference server)

Monitoring:

  • Weights & Biases (can run offline with local server)
  • MLflow (self-hosted tracking)
  • Prometheus + Grafana (infrastructure metrics)

Package these as Docker containers or conda environments. Test the full stack on a connected system before transferring to the air-gapped environment. Missing dependencies mid-training cause painful delays.
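One cheap safeguard: run an import preflight inside the transferred container before the gap closes. A minimal sketch (the package list is illustrative; substitute your own stack):

```python
import importlib

# Illustrative core dependencies; extend with your full training stack
REQUIRED = ["torch", "transformers", "peft", "bitsandbytes", "accelerate"]

def preflight(modules):
    """Try to import each named module; return the ones that fail."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Demo with stdlib modules plus a deliberately bogus name
print(preflight(["json", "hashlib", "not_a_real_package_xyz"]))  # ['not_a_real_package_xyz']
```

Run `preflight(REQUIRED)` on the target hardware; an empty list means every dependency imports cleanly.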

Data Pipeline for Air-Gapped Training

Dataset Preparation (Connected Environment)

Prepare your fine-tuning dataset before transferring it inside the air gap. This happens on your connected infrastructure:

1. Collect raw data Internal documents, support tickets, code repositories, domain-specific texts. For enterprise dataset automation, tools can extract and structure this automatically.

2. Clean and deduplicate Remove personally identifiable information. Strip formatting artifacts. Deduplicate to prevent the model from memorizing repeated content.

3. Format for training Convert to instruction-response pairs for instruction tuning:

{
  "instruction": "Summarize the key risks in this compliance report",
  "input": "[document text]",
  "output": "[expected summary]"
}

Or conversational format for chat fine-tuning:

{
  "conversations": [
    {"role": "user", "content": "What's our policy on data retention?"},
    {"role": "assistant", "content": "According to section 4.2..."}
  ]
}

4. Split into train/validation/test Standard splits: 80% train, 10% validation, 10% test. For small datasets (under 10,000 examples), consider 90/5/5.

5. Validate format Run the dataset through your training framework on a connected test system. Catch formatting errors before transfer.
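The format check in step 5 can start with a few lines of stdlib Python run before transfer (the required keys here match the instruction format above; adjust for the conversational format):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl_lines(lines):
    """Return a list of (line_number, error) for malformed JSONL records."""
    errors = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((i, f"invalid JSON: {exc}"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((i, f"missing keys: {sorted(missing)}"))
    return errors

sample = [
    '{"instruction": "a", "input": "b", "output": "c"}',
    '{"instruction": "a"}',
    'not json',
]
print(validate_jsonl_lines(sample))
```

This catches structural errors only; still run the full dataset through your actual training framework on a connected test system, as step 5 describes.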

Secure Transfer

Moving datasets into air-gapped environments follows your organization's security protocols. Common methods:

  • Encrypted USB drives with hardware write-protection
  • Cross-domain solutions (CDS) for classified environments
  • Secure file transfer appliances with air-gap-compliant protocols

Transfer the dataset, base model weights, tokenizer files, and any adapter configurations together. Verify checksums after transfer to confirm integrity.
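The checksum verification can be as simple as the following sketch, which streams files in chunks so multi-gigabyte model shards never need to fit in RAM (the file name in the comment is hypothetical):

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against a manifest generated on the connected side, e.g.:
# assert sha256_file("model-00001-of-00004.safetensors") == expected_hashes["model-00001-of-00004.safetensors"]
```

Generate the manifest of expected hashes on the connected system and transfer it alongside the files, then verify every file inside the perimeter.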

Base Model Selection

Not every model works well for fine-tuning. Consider:

License compliance. Commercial use restrictions vary. Meta's Llama models allow commercial use with acceptable use policies. Mistral models use Apache 2.0. Qwen models have specific restrictions for large deployments.

Base capability. Stronger base models fine-tune better. A 7B model with strong reasoning (like Qwen2.5-7B or Llama-3.1-8B) often outperforms a 13B model with weaker foundations.

Domain alignment. Code models fine-tune better for technical tasks. Instruction-tuned models adapt faster than base models for chat applications.

Quantization support. Some models quantize cleanly to 4-bit. Others lose significant capability. Test before committing.

For most enterprise use cases, these models work well:

| Use Case | Recommended Base Models |
|---|---|
| General enterprise tasks | Llama-3.1-8B, Qwen2.5-7B, Mistral-7B-v0.3 |
| Code generation | DeepSeek-Coder-7B, CodeLlama-7B, Qwen2.5-Coder-7B |
| Long documents | Llama-3.1-8B (128K context), Qwen2.5-7B (128K context) |
| Multilingual | Qwen2.5-7B, Aya-23-8B |

Download models from Hugging Face on a connected system. Include all files: config.json, tokenizer files, and safetensor weights. Transfer the complete package.

Training Configuration

LoRA Parameters

LoRA injects trainable low-rank matrices into specific model layers. Key parameters:

rank (r): Controls adapter capacity. Higher rank = more parameters = more capability but slower training. Start with r=16 for most tasks. Increase to r=64 for complex domains.

alpha: Scaling factor. Common practice: alpha = 2 * rank.

target_modules: Which layers to adapt. For Llama-style models:

  • Minimum: q_proj, v_proj (attention layers)
  • Recommended: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

dropout: 0.05-0.1 for regularization. Prevents overfitting on small datasets.

Example LoRA config for Llama-3.1-8B:

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
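To get a feel for what this config actually trains, count the adapter parameters: each adapted linear layer of shape (d_out, d_in) gains r * (d_in + d_out) trainable weights. The sketch below uses commonly published Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 32 layers, grouped-query KV projections of width 1024); verify these against your model's config.json before relying on the numbers:

```python
def lora_param_count(r, layer_shapes, num_layers):
    """Total LoRA parameters: r * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in layer_shapes)
    return per_layer * num_layers

# (d_out, d_in) for the seven target modules above, assumed Llama-3.1-8B dims
shapes = [
    (4096, 4096),   # q_proj
    (1024, 4096),   # k_proj (grouped-query attention)
    (1024, 4096),   # v_proj
    (4096, 4096),   # o_proj
    (14336, 4096),  # gate_proj
    (14336, 4096),  # up_proj
    (4096, 14336),  # down_proj
]
total = lora_param_count(16, shapes, num_layers=32)
print(f"{total:,} trainable params")  # roughly 42M, well under 1% of the 8B base
```

That ratio is why LoRA training fits on a single GPU: the optimizer only tracks tens of millions of parameters, not billions.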

Training Hyperparameters

Learning rate: 1e-4 to 2e-4 for LoRA. Lower (5e-5) for full fine-tuning.

Batch size: Maximize within GPU memory. Use gradient accumulation if batch size is too small. Effective batch size of 32-128 works well.

Epochs: 1-3 for large datasets (50K+ examples). 3-10 for small datasets (1K-10K examples). Watch validation loss for overfitting.

Warmup: 3-10% of total steps. Prevents early training instability.

Scheduler: Cosine decay or linear. Both work. Cosine sometimes converges better.
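Two derived quantities are worth computing before launch: the effective batch size and the warmup step count. A minimal sketch of the arithmetic described above:

```python
import math

def training_schedule(num_examples, epochs, per_device_batch,
                      grad_accum, num_gpus=1, warmup_ratio=0.05):
    """Effective batch size, total optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# 10,000 examples, 3 epochs, per-device batch 4 with 16-step gradient accumulation
print(training_schedule(10_000, 3, per_device_batch=4, grad_accum=16))  # (64, 471, 23)
```

Here the effective batch size of 64 sits inside the recommended 32-128 range, and warmup covers 5% of the 471 total steps.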

Monitoring During Training

Without internet access, you need local monitoring:

Validation loss: Primary metric. Should decrease steadily. Sudden increases indicate overfitting or data issues.

Training loss: Should decrease faster than validation loss. Large gaps suggest overfitting.

GPU utilization: Target 90%+. Lower utilization means bottlenecks elsewhere (data loading, CPU preprocessing).

Memory usage: Stay under 90% VRAM to avoid OOM errors during gradient accumulation.

Log everything to local storage. MLflow running on-premise handles experiment tracking without external connectivity. Export logs for later analysis if needed.
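If you want a fallback that depends on nothing but the standard library, even a CSV logger covers the basics. A minimal sketch, not a replacement for MLflow (file name and field set are illustrative):

```python
import csv
from pathlib import Path

class LocalMetricLogger:
    """Append step-level training metrics to a CSV file on local storage."""

    def __init__(self, path, fields=("step", "train_loss", "val_loss", "lr")):
        self.path = Path(path)
        self.fields = list(fields)
        if not self.path.exists():
            with self.path.open("w", newline="") as f:
                csv.writer(f).writerow(self.fields)  # header row once

    def log(self, **metrics):
        with self.path.open("a", newline="") as f:
            csv.writer(f).writerow([metrics.get(k, "") for k in self.fields])

logger = LocalMetricLogger("run_01_metrics.csv")
logger.log(step=100, train_loss=1.82, val_loss=1.95, lr=2e-4)
```

Because each row is flushed on write, a mid-training crash still leaves a readable log for post-mortem analysis.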

Evaluation Without External Dependencies

Standard LLM benchmarks (MMLU, HumanEval, HellaSwag) require downloading test sets. For air-gapped evaluation, build custom benchmarks:

Domain-Specific Test Sets

Create evaluation sets from held-out production data:

  1. Task accuracy: Does the model complete the actual tasks you need? Grade responses against gold-standard answers.
  2. Factual consistency: Does the model hallucinate about your domain? Test with questions that have verifiable answers from your documentation.
  3. Format compliance: Does output match required structure? Automated checks for JSON validity, required fields, length constraints.
  4. Edge cases: How does the model handle unusual inputs? Include adversarial examples in your test set.

LLM-as-Judge Evaluation

Use your fine-tuned model (or a separate judge model) to evaluate outputs. This runs entirely locally:

judge_prompt = """
Rate this response on a scale of 1-5:
- Accuracy: Does it contain correct information?
- Completeness: Does it fully answer the question?
- Format: Does it follow the required structure?

Question: {question}
Response: {response}
"""

For LLM reliability evaluation, combine automated metrics with human review on a sample of outputs.
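Judge outputs need parsing before they become metrics. A tolerant extractor for the 1-5 rubric above, assuming the judge replies with lines like "Accuracy: 4" (a sketch; real judge outputs are messier and deserve a fallback path):

```python
import re

def parse_judge_scores(text, criteria=("Accuracy", "Completeness", "Format")):
    """Pull 'Criterion: <1-5>' scores out of free-form judge output."""
    scores = {}
    for criterion in criteria:
        match = re.search(rf"{criterion}\s*:\s*([1-5])\b", text, re.IGNORECASE)
        if match:
            scores[criterion] = int(match.group(1))
    return scores

reply = "Accuracy: 4\nCompleteness: 5\nFormat: 3 (missing one field)"
print(parse_judge_scores(reply))  # {'Accuracy': 4, 'Completeness': 5, 'Format': 3}
```

Records where a criterion fails to parse should be routed to human review rather than silently dropped, or they will bias the aggregate scores.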

A/B Comparison

Compare your fine-tuned model against:

  • The base model (measures fine-tuning lift)
  • The previous fine-tuned version (measures iteration improvement)
  • RAG baseline (validates fine-tuning value over retrieval)
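Pairwise comparisons aggregate into a simple win rate. One common convention counts ties as half a win; a minimal sketch:

```python
def win_rate(judgments):
    """Win rate for model A from pairwise judgments.

    judgments: iterable of "a" (A preferred), "b" (B preferred), or "tie".
    Ties count as half a win, one common convention.
    """
    judgments = list(judgments)
    if not judgments:
        return 0.0
    wins = sum(1.0 if j == "a" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return wins / len(judgments)

print(win_rate(["a", "a", "b", "tie"]))  # 0.625
```

A win rate meaningfully above 0.5 against the base model quantifies the fine-tuning lift; against the previous version, it quantifies iteration improvement.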

Document results for compliance and audit purposes. Air-gapped environments often require evidence that models meet accuracy thresholds before deployment.

Deployment After Training

Merge Adapters (Optional)

LoRA produces adapter weights separate from base model weights. You can serve them separately (faster iteration) or merge them (simpler deployment):

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the frozen base model, attach the trained LoRA adapter,
# then fold the adapter weights into the base weights
base_model = AutoModelForCausalLM.from_pretrained("llama-3.1-8b")
peft_model = PeftModel.from_pretrained(base_model, "path/to/adapter")
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("merged-model")

Merged models load faster and require less configuration. Separate adapters let you swap fine-tunes without reloading the base model.

Quantize for Production

Training typically uses FP16 or BF16 precision. Production inference can use lower precision:

| Precision | Memory Reduction | Speed Impact | Quality Impact |
|---|---|---|---|
| FP16 | Baseline | Baseline | None |
| INT8 | 50% | +20-40% throughput | Minimal |
| INT4 | 75% | +50-100% throughput | Small (1-3% accuracy drop) |

For most enterprise tasks, INT8 quantization provides the best trade-off. INT4 works for simpler tasks where small accuracy drops are acceptable.

Tools like llama.cpp, GPTQ, and AWQ handle quantization. Serve models locally with vLLM for production inference with automatic batching and PagedAttention memory optimization.

Version Control

Track every deployed model:

  • Base model version and source
  • Training dataset version and hash
  • Hyperparameters used
  • Evaluation results
  • Deployment date and environment

Audit requirements in regulated industries demand this traceability. Build it into your workflow from the start.
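The items above fit naturally into a manifest written next to each checkpoint. A minimal sketch (field names are illustrative; hashing the dataset ties the recorded evaluation results to the exact data used):

```python
import hashlib
import json
from datetime import date

def write_manifest(path, base_model, dataset_path, hyperparams, eval_results):
    """Record provenance for one training run as a JSON manifest."""
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "base_model": base_model,
        "dataset": {"path": dataset_path, "sha256": dataset_hash},
        "hyperparameters": hyperparams,
        "evaluation": eval_results,
        "date": date.today().isoformat(),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Commit these manifests to your internal version control alongside adapter weights; auditors can then trace any deployed model back to its data, config, and evaluation evidence.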

When Fine-Tuning Gets Complex

The workflow above handles standard supervised fine-tuning. Some use cases require more:

RLHF/DPO for Alignment

If the fine-tuned model needs preference alignment (choosing better responses, avoiding certain behaviors), you need:

  • Preference data: pairs of responses labeled as better/worse
  • Training setup for DPO or PPO algorithms
  • Significantly more compute and iteration cycles

Model alignment processes add complexity but matter for customer-facing applications where response quality directly impacts business outcomes.

Continual Learning

One-time fine-tuning works for static domains. Dynamic environments need ongoing updates. Continual learning frameworks handle:

  • Incremental updates without full retraining
  • Catastrophic forgetting prevention
  • Efficient data replay strategies

Multi-Model Distillation

Training smaller models to match larger ones. Data distillation creates 10x smaller models that run on modest hardware while preserving capability. Useful when the air-gapped deployment target has limited GPU resources.

Managed Platforms for Air-Gapped Fine-Tuning

Building the full fine-tuning stack from scratch requires significant engineering effort. For teams without dedicated ML infrastructure expertise, managed platforms simplify the process.

Prem AI handles the entire fine-tuning lifecycle within air-gapped environments:

Dataset preparation: Drag-and-drop upload supports JSONL, PDF, TXT, DOCX. Automatic PII redaction protects sensitive information. Synthetic data augmentation expands small datasets.

Model training: 30+ base models available (Mistral, Llama, Qwen, Gemma). The autonomous fine-tuning system handles hyperparameter optimization automatically. Run up to 6 concurrent experiments.

Evaluation: Built-in LLM-as-judge scoring, side-by-side model comparisons, custom evaluation rubrics.

Deployment: One-click deployment to AWS VPC or on-premise infrastructure via Prem-Operator for Kubernetes.

Swiss jurisdiction under FADP provides legal protection beyond GDPR. SOC 2, GDPR, and HIPAA compliance certifications meet regulatory requirements.

The platform runs entirely within your perimeter. No data leaves your environment during training, evaluation, or deployment.

Common Failures and How to Avoid Them

Missing dependencies mid-training. Build a complete Docker image with all packages installed. Test the full training run on a connected system before transfer. Include troubleshooting tools in the image.

GPU out-of-memory errors. Calculate memory requirements before starting. Use gradient checkpointing to trade compute for memory. Reduce batch size and increase gradient accumulation.

Overfitting on small datasets. Monitor validation loss closely. Stop training when validation loss increases for 2-3 consecutive checkpoints. Use dropout and weight decay for regularization. Data augmentation helps when you can't collect more examples.
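The stopping rule in that paragraph is easy to make mechanical. A patience-based check, assuming validation losses are recorded in checkpoint order:

```python
def should_stop(val_losses, patience=3):
    """True once validation loss has failed to improve for `patience` checkpoints."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    # Stop only if none of the last `patience` checkpoints beat the earlier best
    return all(loss >= best_so_far for loss in val_losses[-patience:])

print(should_stop([2.1, 1.9, 1.8, 1.85, 1.9, 1.95]))  # True
print(should_stop([2.1, 1.9, 1.8, 1.75]))             # False
```

Pair this with frequent checkpointing so you can roll back to the checkpoint with the best validation loss rather than the last one written.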

Poor evaluation metrics. Your evaluation set might not represent real usage. Collect examples from actual production queries. Include edge cases and adversarial examples.

Slow data loading. Preprocessing bottlenecks stall GPU utilization. Pre-tokenize datasets. Use multiple data loading workers. Store data on fast NVMe storage.

Checklist Before You Start

Before transferring anything into your air-gapped environment:

Infrastructure:

  • [ ] GPU memory sufficient for your model + training method
  • [ ] Storage provisioned (3-5x model size)
  • [ ] Network configured between training nodes and storage
  • [ ] Monitoring tools installed and tested

Software:

  • [ ] Complete Docker image with all dependencies
  • [ ] Training framework tested on target hardware
  • [ ] Serving framework installed and configured
  • [ ] Logging and experiment tracking operational

Data:

  • [ ] Dataset cleaned and formatted
  • [ ] PII removed or redacted
  • [ ] Train/validation/test splits created
  • [ ] Dataset validated against training framework

Models:

  • [ ] Base model weights downloaded completely
  • [ ] Tokenizer files included
  • [ ] Model configuration verified
  • [ ] License compliance confirmed

Evaluation:

  • [ ] Domain-specific test set prepared
  • [ ] Evaluation metrics defined
  • [ ] Comparison baselines established
  • [ ] Acceptance criteria documented

Summary

Fine-tuning LLMs in air-gapped environments requires more planning than standard cloud-based training. Every dependency, dataset, and model must be prepared and transferred before training begins. Debugging happens without internet access. Compute resources are fixed.

The payoff: custom models trained on your proprietary data that never leave your secure perimeter. For organizations handling classified, regulated, or sensitive information, this is the only path to production-grade AI that meets compliance requirements.

Start with parameter-efficient methods like QLoRA to minimize hardware requirements. Build evaluation pipelines before training, not after. Document everything for audit and compliance purposes.

For teams building this capability internally, the checklist above covers the major preparation steps. For teams that need the capability without building from scratch, platforms like Prem AI package the full workflow for air-gapped deployment.


FAQ

Can you fine-tune LLMs without any internet connection?

Yes. Once base models, datasets, and all software dependencies are transferred into the air-gapped environment, fine-tuning runs entirely locally. The model weights update based on your training data with no external calls.

What's the minimum hardware for air-gapped fine-tuning?

A single NVIDIA A10 (24GB VRAM) or RTX 4090 handles QLoRA fine-tuning of 7B models. Production deployments typically use A100s or H100s for faster training and larger models. Enterprise AI doesn't always need enterprise hardware when using efficient methods.

How long does fine-tuning take in air-gapped environments?

Training time depends on dataset size, model size, and hardware. A 7B model with 10,000 training examples on a single A100 typically completes in 2-4 hours. Larger datasets or models scale proportionally. Checkpoint frequently in case of interruptions.

Is RAG or fine-tuning better for air-gapped enterprise AI?

RAG works well when your knowledge changes frequently and you need citations. Fine-tuning works better for stable domain knowledge, lower latency requirements, and tasks requiring internalized understanding. Many production systems combine both approaches.

How do you update fine-tuned models in air-gapped environments?

The same secure transfer process used for initial deployment. Prepare updated datasets and retrain on connected infrastructure, or run incremental training inside the air gap using continual learning methods. Version control ensures you can roll back if updates degrade performance.
