News · 20 Nov 2025 · 7 min read

Llama 4 Drops with Enterprise Features: What Open Weights Mean Now

Meta's Llama 4 release spans 8B, 70B, and 405B variants with commercial licenses. Here's how the capability gap has narrowed and what it means for build-versus-buy decisions.

Max Beech
Head of Content

The release: Meta dropped Llama 4 with three variants - 8B, 70B, and 405B parameters. All include commercial usage rights, instruction-tuned variants, and multimodal capabilities. The 405B model approaches GPT-4o performance on most benchmarks.

Why this matters: The gap between open-weight and proprietary models continues to narrow. Llama 4 405B is the first open model genuinely competitive with frontier APIs for many enterprise use cases.

The builder's question: Does Llama 4 change the self-hosting equation? When does running your own models beat API access?

Model specifications

Variant | Parameters | Context | Multimodal | License
Llama 4 8B | 8B | 128K | Yes | Commercial
Llama 4 70B | 70B | 128K | Yes | Commercial
Llama 4 405B | 405B | 128K | Yes | Commercial*

*405B commercial license requires acceptance of additional terms for deployments exceeding 700M monthly active users.

The commercial license is notably permissive. No revenue caps, no usage restrictions for typical enterprise deployments. Fine-tuned derivatives can also be commercially deployed.

Benchmark performance

Meta's published benchmarks position Llama 4 405B as frontier-competitive:

Benchmark | Llama 4 405B | GPT-4o | Claude 3.5 Sonnet
MMLU | 88.2% | 87.2% | 88.7%
HumanEval | 89.0% | 90.2% | 92.0%
MATH | 73.1% | 76.6% | 71.1%
MT-Bench | 8.8 | 9.0 | 8.9
Vision (MMMU) | 61.3% | 63.0% | N/A

The story: Llama 4 405B is within striking distance of proprietary frontiers. For many practical applications, the performance difference won't matter.

Self-hosting economics

Running Llama 4 yourself comes down to two questions: what hardware the models need, and what that hardware costs.

Infrastructure requirements

Variant | GPU memory (FP16) | Typical setup
8B | 16GB | Single A100 or RTX 4090
70B | 140GB | 2x A100 80GB, or 4x RTX 4090 with quantisation
405B | 800GB+ | 8x A100 80GB or 4x H100, with quantisation

The 405B model is infrastructure-intensive. Most self-hosting deployments will use the 70B variant, which delivers excellent performance with manageable resource requirements.
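As a rough sanity check on those figures: weight memory alone is parameter count times bytes per parameter, which is why the consumer-GPU and smaller multi-GPU setups in the table imply some quantisation. A back-of-envelope sketch in Python:

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights only, in decimal GB; KV cache and activations add more on top."""
    return params_billions * bytes_per_param

for name, params_b in [("8B", 8), ("70B", 70), ("405B", 405)]:
    fp16 = weight_memory_gb(params_b, 2.0)   # FP16/BF16: 2 bytes per parameter
    int4 = weight_memory_gb(params_b, 0.5)   # 4-bit quantised: ~0.5 bytes per parameter
    print(f"{name}: ~{fp16:.0f} GB FP16, ~{int4:.0f} GB INT4")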

Cost comparison

For a medium-volume workload (100M tokens/month):

Option | Monthly cost | Notes
GPT-4o API | $750 | $2.50/1M input, $10/1M output
Claude 3.5 Sonnet | $900 | $3/1M input, $15/1M output
Llama 4 70B (cloud GPU) | $2,000-3,000 | 2x A100 spot instance
Llama 4 70B (dedicated) | $5,000-8,000 | Reserved instances

At this volume, API access is cheaper. But the equation changes at higher volumes:

Option | 1B tokens/month
GPT-4o API | $7,500
Llama 4 70B (cloud GPU) | $3,000-4,000
Llama 4 70B (dedicated) | $5,000-8,000

Above approximately 500M tokens/month, self-hosting becomes economically attractive.
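A quick sketch of that breakeven, using the blended rates implied by the tables above (roughly $7.50 per million tokens for GPT-4o, and about $3,500/month for the 2x A100 cloud-GPU option); the figures are illustrative, not quoted prices:

# API cost grows linearly with volume; a self-hosted deployment is roughly flat.
API_RATE_PER_M_TOKENS = 7.50   # blended $/1M tokens implied by the GPT-4o rows above
SELF_HOSTED_MONTHLY = 3_500    # midpoint of the cloud-GPU estimate

def api_cost(tokens_millions: float) -> float:
    return tokens_millions * API_RATE_PER_M_TOKENS

breakeven = SELF_HOSTED_MONTHLY / API_RATE_PER_M_TOKENS
print(f"Breakeven: ~{breakeven:.0f}M tokens/month")  # ~470M, in line with the ~500M figure

for volume in (100, 500, 1_000):
    print(f"{volume}M tokens: API ${api_cost(volume):,.0f} vs self-hosted ${SELF_HOSTED_MONTHLY:,}")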

Hidden costs to consider

Self-hosting isn't just GPU costs:

  • Inference optimisation: vLLM, TensorRT-LLM setup and tuning
  • Ops overhead: Monitoring, scaling, failover
  • Model updates: Incorporating new releases, fine-tuning pipelines
  • Security: Model access controls, output filtering

For teams without ML infrastructure experience, these costs can exceed the GPU savings.

When to self-host

Strong cases for self-hosting

Data sovereignty requirements: When data cannot leave your infrastructure, self-hosting is the only option.

// Self-hosted inference - data never leaves your network
const client = new OpenAI({
  baseURL: 'http://internal-llm.company.local:8000/v1',
  apiKey: 'internal-token'
});

Predictable high volume: If you're running 1B+ tokens monthly with predictable patterns, the economics favour self-hosting.

Custom fine-tuning: Building domain-specific models requires weights access. Llama 4's permissive license enables commercial fine-tuning deployments.

Latency requirements: Self-hosted models in your data centre eliminate network round-trips. Critical for real-time applications.

Weak cases for self-hosting

Variable demand: Burst workloads are better served by API elasticity.

Frontier capabilities: If you need GPT-4o- or Claude Opus-level performance, Llama 4 may not match it for your specific use case.

Limited ML ops capacity: The operational overhead of self-hosting is real. Teams without dedicated infrastructure expertise should think twice.

Deployment options

Direct deployment

Run Llama 4 on your infrastructure using vLLM or TensorRT-LLM:

# vLLM deployment
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-70B-Instruct \
  --tensor-parallel-size 2

This provides an OpenAI-compatible endpoint, making integration straightforward.
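Any OpenAI SDK can point at that endpoint. A minimal Python sketch, with the URL, key, and prompt as placeholders:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-70B-Instruct",
    messages=[{"role": "user", "content": "Summarise this incident report..."}],
)
print(response.choices[0].message.content)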

Managed inference

Major cloud providers offer managed Llama 4 deployments:

Provider | Service | Pricing model
AWS | Bedrock | Per-token
Azure | AI Studio | Per-token
Google Cloud | Vertex AI | Per-token
Together AI | Inference API | Per-token
Fireworks AI | Inference API | Per-token

Managed services eliminate ops overhead while retaining much of Llama 4's cost advantage over proprietary models.

Quantised variants

For cost-sensitive deployments, quantised Llama 4 variants reduce resource requirements:

Quantisation | Memory reduction | Quality loss
FP16 (native) | Baseline | None
INT8 | ~50% | Minimal
INT4 (AWQ) | ~75% | Slight
GGUF (Q4_K_M) | ~75% | Slight

With INT4 quantisation, Llama 4 70B fits on a single A100 80GB or consumer hardware.
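Loading a 4-bit model for experimentation takes a few lines with transformers and bitsandbytes; production serving would more likely use an AWQ checkpoint under vLLM. A minimal sketch, reusing the checkpoint name from the deployment example above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantisation with bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",  # place layers across whatever GPUs are available
)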

Enterprise adoption patterns

Hybrid architecture

Many enterprises are adopting hybrid approaches:

// llamaClient, claudeClient and gpt4oClient are OpenAI-compatible clients
// configured elsewhere (the self-hosted one as in the earlier snippet).
type TaskType = 'internal-analysis' | 'customer-facing' | 'high-volume-classification' | 'general';
type Model = OpenAI;

interface ModelRouter {
  route(task: TaskType): Model;
}

const router: ModelRouter = {
  route(task) {
    switch (task) {
      case 'internal-analysis':
        // Data stays internal
        return llamaClient;

      case 'customer-facing':
        // Need best quality
        return claudeClient;

      case 'high-volume-classification':
        // Cost-optimised
        return llamaClient;

      default:
        return gpt4oClient;
    }
  }
};

This captures the benefits of self-hosting for appropriate workloads while maintaining access to frontier models.

Fine-tuning pipeline

Llama 4's open weights enable fine-tuning for domain-specific performance:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Load the base model, sharded across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Fine-tune on your domain data with TrainingArguments and a Trainer...

Fine-tuned Llama 4 models often outperform larger general-purpose models on specific tasks.

Competitive implications

For API providers

Llama 4 puts pricing pressure on proprietary providers. OpenAI and Anthropic must justify premiums through:

  • Capability advantages (reasoning, safety)
  • Ease of use (no infrastructure needed)
  • Ecosystem (fine-tuning APIs, assistants framework)

Expect continued API price reductions.

For enterprises

More negotiating leverage with API providers. The credible alternative of self-hosting strengthens enterprise bargaining positions.

For startups

Lower barrier to AI product development. Building on Llama 4 means:

  • No per-token API costs eating into margins
  • Complete control over model behaviour
  • No exposure to providers changing their terms

Our assessment

Llama 4 represents a maturation point for open-weight models. The 405B variant is genuinely frontier-competitive. The 70B variant offers excellent performance with reasonable infrastructure requirements.

For enterprises, the implications:

  1. Self-hosting is viable. For the right workloads, Llama 4 delivers production-quality results without API dependencies.

  2. Hybrid is the answer. Most organisations will benefit from combining self-hosted Llama 4 for appropriate workloads with API access for frontier capabilities.

  3. The economics are improving. GPU costs are falling. Model efficiency is improving. The self-hosting breakeven point will continue to decline.

  4. Data control matters. For regulated industries and sensitive data, self-hosted Llama 4 provides a path to AI capabilities without data sovereignty concerns.

The open-weight ecosystem has arrived. Build your AI strategy accordingly.

