Llama 4 Drops with Enterprise Features: What Open Weights Mean Now
Meta's Llama 4 release includes 8B, 70B, and 405B variants with commercial licenses. Here's how the capability gap has closed and what it means for build-vs-buy decisions.
The release: Meta dropped Llama 4 in three sizes: 8B, 70B, and 405B parameters. All ship with commercial usage rights, instruction-tuned versions, and multimodal capabilities. The 405B model approaches GPT-4o performance on most benchmarks.
Why this matters: The gap between open-weight and proprietary models continues to narrow. Llama 4 405B is the first open model genuinely competitive with frontier APIs for many enterprise use cases.
The builder's question: Does Llama 4 change the self-hosting equation? When does running your own models beat API access?
| Variant | Parameters | Context | Multimodal | License |
|---|---|---|---|---|
| Llama 4 8B | 8B | 128K | Yes | Commercial |
| Llama 4 70B | 70B | 128K | Yes | Commercial |
| Llama 4 405B | 405B | 128K | Yes | Commercial* |
*405B commercial license requires acceptance of additional terms for deployments exceeding 700M monthly active users.
The commercial license is notably permissive. No revenue caps, no usage restrictions for typical enterprise deployments. Fine-tuned derivatives can also be commercially deployed.
Meta's published benchmarks position Llama 4 405B as frontier-competitive:
| Benchmark | Llama 4 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMLU | 88.2% | 87.2% | 88.7% |
| HumanEval | 89.0% | 90.2% | 92.0% |
| MATH | 73.1% | 76.6% | 71.1% |
| MT-Bench | 8.8 | 9.0 | 8.9 |
| Vision (MMMU) | 61.3% | 63.0% | N/A |
The story: Llama 4 405B is within striking distance of proprietary frontiers. For many practical applications, the performance difference won't matter.
Running Llama 4 yourself means:
| Variant | GPU memory (FP16 weights) | Typical setup |
|---|---|---|
| 8B | ~16GB | Single RTX 4090 or A100 |
| 70B | ~140GB | 2x A100 80GB, or 4x RTX 4090 with quantisation |
| 405B | ~810GB | 8x H100 80GB (FP8) or a 16x A100 80GB cluster |
The 405B model is infrastructure-intensive. Most self-hosting deployments will use the 70B variant, which delivers excellent performance at manageable resource requirements.
For a medium-volume workload (100M tokens/month):
| Option | Monthly cost | Notes |
|---|---|---|
| GPT-4o API | $750 | $2.50/1M input, $10/1M output |
| Claude 3.5 Sonnet | $900 | $3/1M input, $15/1M output |
| Llama 4 70B (cloud GPU) | $2,000-3,000 | 2x A100 spot instance |
| Llama 4 70B (dedicated) | $5,000-8,000 | Reserved instances |
At this volume, API access is cheaper. But the equation changes at higher volumes:
| Option | 1B tokens/month |
|---|---|
| GPT-4o API | $7,500 |
| Llama 4 70B (cloud GPU) | $3,000-4,000 |
| Llama 4 70B (dedicated) | $5,000-8,000 |
Above approximately 500M tokens/month, self-hosting becomes economically attractive.
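As a rough sanity check on that break-even point, here's a back-of-the-envelope calculation using the illustrative figures from the tables above; the blended API rate and fixed GPU cost are assumptions drawn from those ranges, not quotes:

```python
# Back-of-the-envelope break-even between API and self-hosted serving.
# Figures come from the illustrative tables above, not real pricing quotes.
API_RATE_PER_M_TOKENS = 7.50      # USD per 1M tokens, blended (implied by $7,500 / 1B)
SELF_HOSTED_MONTHLY = 3_500.0     # USD/month, midpoint of the 2x A100 cloud GPU range

def api_cost(tokens_millions: float) -> float:
    """API spend scales linearly with volume."""
    return tokens_millions * API_RATE_PER_M_TOKENS

def breakeven_tokens_millions() -> float:
    """Volume at which API spend matches the (roughly fixed) GPU bill."""
    return SELF_HOSTED_MONTHLY / API_RATE_PER_M_TOKENS

print(f"Break-even: ~{breakeven_tokens_millions():.0f}M tokens/month")  # ~467M
for volume in (100, 500, 1000):
    print(f"{volume}M tokens: API ${api_cost(volume):,.0f} vs self-hosted ~${SELF_HOSTED_MONTHLY:,.0f}")
```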
Self-hosting isn't just GPU costs: budget for deployment engineering, monitoring, model updates, capacity planning, and on-call support. For teams without ML infrastructure experience, these operational costs can exceed the GPU savings.
Data sovereignty requirements: When data cannot leave your infrastructure, self-hosting is the only option.
```typescript
// Self-hosted inference - data never leaves your network
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://internal-llm.company.local:8000/v1',
  apiKey: 'internal-token', // placeholder - the internal gateway handles auth
});
```
Predictable high volume: If you're running 1B+ tokens monthly with predictable patterns, the economics favour self-hosting.
Custom fine-tuning: Building domain-specific models requires weights access. Llama 4's permissive license enables commercial fine-tuning deployments.
Latency requirements: Self-hosted models in your data centre eliminate network round-trips. Critical for real-time applications.
Variable demand: Burst workloads are better served by API elasticity.
Frontier capabilities: If you need GPT-4o or Claude Opus capabilities, Llama 4 may not match them for your specific use case.
Limited ML ops capacity: The operational overhead of self-hosting is real. Teams without dedicated infrastructure expertise should think twice.
Run Llama 4 on your infrastructure using vLLM or TensorRT-LLM:
```bash
# vLLM deployment
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-70B-Instruct \
  --tensor-parallel-size 2
```
This provides an OpenAI-compatible endpoint, making integration straightforward.
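As a quick smoke test, the endpoint can be exercised with the standard OpenAI Python client; a minimal sketch, assuming the server above is reachable on localhost:8000:

```python
# Query the self-hosted vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the vLLM server started above
    api_key="unused",  # vLLM does not check the key unless you configure one
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-70B-Instruct",
    messages=[{"role": "user", "content": "Summarise our incident response policy."}],
)
print(response.choices[0].message.content)
```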
Major cloud providers offer managed Llama 4 deployments:
| Provider | Service | Pricing model |
|---|---|---|
| AWS | Bedrock | Per-token |
| Azure | AI Studio | Per-token |
| Google Cloud | Vertex AI | Per-token |
| Together AI | Inference API | Per-token |
| Fireworks AI | Inference API | Per-token |
Managed services eliminate ops overhead while retaining Llama 4's cost advantages over proprietary models.
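Several of these providers expose OpenAI-compatible endpoints, so moving between them is mostly a base-URL change. A minimal sketch against Together AI's endpoint; the model identifier is illustrative and should be taken from the provider's catalogue:

```python
# Calling a managed Llama 4 deployment via an OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-70B-Instruct",  # illustrative catalogue name
    messages=[{"role": "user", "content": "Classify this support ticket: 'My invoice is wrong.'"}],
)
print(response.choices[0].message.content)
```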
For cost-sensitive deployments, quantised Llama 4 variants reduce resource requirements:
| Quantisation | Memory reduction | Quality loss |
|---|---|---|
| FP16 (native) | Baseline | None |
| INT8 | ~50% | Minimal |
| INT4 (AWQ) | ~75% | Slight |
| GGUF/Q4_K_M | ~75% | Slight |
With INT4 quantisation, Llama 4 70B fits on a single A100 80GB, or on high-end consumer hardware such as a pair of 24GB cards.
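vLLM can serve AWQ-quantised checkpoints directly; a minimal sketch, assuming an AWQ build of the 70B model is available (the checkpoint name below is hypothetical):

```python
# Serving an INT4 (AWQ) quantised 70B checkpoint on a single 80GB GPU with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-70B-Instruct-AWQ",  # hypothetical AWQ build
    quantization="awq",
)

outputs = llm.generate(
    ["Summarise the key obligations in this supplier contract:"],
    SamplingParams(temperature=0.2, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```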
Many enterprises are adopting hybrid approaches:
```typescript
import OpenAI from 'openai';

type Model = OpenAI;
type TaskType = 'internal-analysis' | 'customer-facing' | 'high-volume-classification' | 'general';

// llamaClient, claudeClient and gpt4oClient are pre-configured OpenAI-compatible clients
declare const llamaClient: Model, claudeClient: Model, gpt4oClient: Model;

interface ModelRouter {
  route(task: TaskType): Model;
}

const router: ModelRouter = {
  route(task) {
    switch (task) {
      case 'internal-analysis':
        // Data stays internal
        return llamaClient;
      case 'customer-facing':
        // Need best quality
        return claudeClient;
      case 'high-volume-classification':
        // Cost-optimised
        return llamaClient;
      default:
        return gpt4oClient;
    }
  }
};
```
This captures the benefits of self-hosting for appropriate workloads while maintaining access to frontier models.
Llama 4's open weights enable fine-tuning for domain-specific performance:
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load base model, sharded across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-70B-Instruct",
    device_map="auto",
)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Fine-tune on your domain data...
```
Fine-tuned Llama 4 models often outperform larger general-purpose models on specific tasks.
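After training, the LoRA adapter loads back onto the base model for inference; a minimal sketch, assuming the adapter was saved to a hypothetical ./llama4-domain-adapter directory:

```python
# Load the trained LoRA adapter on top of the base model for inference.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-70B-Instruct", device_map="auto"
)
model = PeftModel.from_pretrained(base, "./llama4-domain-adapter")  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-70B-Instruct")

inputs = tokenizer("Draft a summary of claim 4821:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```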
Llama 4 puts pricing pressure on proprietary providers. OpenAI and Anthropic now have to justify their premiums through frontier capability, reliability, and tooling rather than raw model access alone. Expect continued API price reductions.
More negotiating leverage with API providers. The credible alternative of self-hosting strengthens enterprise bargaining positions.
Lower barrier to AI product development. Building on Llama 4 means no per-token margin paid to a model vendor, full control over weights and fine-tuning, and no lock-in to a single provider's roadmap.
Llama 4 represents a maturation point for open-weight models. The 405B variant is genuinely frontier-competitive. The 70B variant offers excellent performance at reasonable infrastructure requirements.
For enterprises, the implications:
Self-hosting is viable. For the right workloads, Llama 4 delivers production-quality results without API dependencies.
Hybrid is the answer. Most organisations will benefit from combining self-hosted Llama 4 for appropriate workloads with API access for frontier capabilities.
The economics are improving. GPU costs are falling. Model efficiency is improving. The self-hosting breakeven point will continue to decline.
Data control matters. For regulated industries and sensitive data, self-hosted Llama 4 provides a path to AI capabilities without data sovereignty concerns.
The open-weight ecosystem has arrived. Build your AI strategy accordingly.