# Gemma 4 12B Release Source Ledger

Captured: 2026-06-04

## Scope

This ledger records the source bundle for the June 3, 2026 Gemma 4 12B release. It is a production model-release ingest rather than a paper ingest: the search performed at capture time found no separate arXiv paper or technical-report PDF. The strongest evidence is the official Google launch blog, the Google Developers Blog developer guide, the Google AI for Developers model overview and model card, the official Hugging Face model card, and the X announcement by Michael Tschannen.

`X_BEARER_TOKEN` was unavailable in the local environment, so the X post was captured through public oEmbed and third-party public tweet mirrors. Thread reconstruction is therefore best-effort. The first post resolves to Michael Tschannen's June 3, 2026 post at `https://x.com/mtschannen/status/2062236357351579915`.
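
For reference, a token-free oEmbed request can be sketched as below. This is a minimal sketch, not a record of the exact command used for the capture: the `publish.twitter.com/oembed` endpoint and the helper name `build_oembed_url` are assumptions, and the ledger does not state which endpoint produced `source_x_oembed.json`.

```python
from urllib.parse import urlencode

# Assumption: the public oEmbed endpoint for X/Twitter posts, which does not
# require a bearer token. The ledger does not name the endpoint that was used.
OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"

# Post URL recorded in this ledger.
POST_URL = "https://x.com/mtschannen/status/2062236357351579915"


def build_oembed_url(post_url: str) -> str:
    """Return the oEmbed request URL for a single public post (hypothetical helper)."""
    # urlencode percent-encodes the post URL so it is safe as a query value.
    return f"{OEMBED_ENDPOINT}?{urlencode({'url': post_url})}"


request_url = build_oembed_url(POST_URL)
print(request_url)
```

An oEmbed response describes one post at a time, which is consistent with the thread reconstruction being best-effort.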

## Captured Artifacts

- `00README.json` - capture manifest for the source bundle.
- `source_x_oembed.json` - public X oEmbed capture for the user-provided X URL.
- `source_x_fxtwitter.json` - public mirror capture of the X post.
- `source_x_vxtwitter.json` - public mirror capture of the X post.
- `source_google_blog.html` / `source_google_blog.txt` - Google launch blog, "Introducing Gemma 4 12B: a unified, encoder-free multimodal model."
- `source_developer_blog.html` / `source_developer_blog.txt` - Google Developers Blog, "Gemma 4 12B: The Developer Guide."
- `source_google_ai_edge_blog.html` / `source_google_ai_edge_blog.txt` - Google Developers Blog, "Bringing Gemma 4 12B to your Laptop."
- `source_gemma4_overview.html` / `source_gemma4_overview.txt` - Google AI for Developers Gemma 4 overview.
- `source_gemma4_model_card.html` / `source_gemma4_model_card.txt` - Google AI for Developers Gemma 4 model card.
- `source_huggingface_gemma-4-12B-it.html` / `source_huggingface_gemma-4-12B-it.txt` - official Hugging Face instruction-tuned model card.
- `source_visual_guide.html` / `source_visual_guide.txt` - Maarten Grootendorst visual guide linked from the Google Developers Blog.
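
The artifact list above can be checked mechanically against a local copy of the bundle. The following is a minimal sketch: the file names are copied from the list, while the helper names `expected_artifacts` and `missing_artifacts` and the idea of a single flat bundle directory are assumptions made for illustration.

```python
from pathlib import Path

# Files captured once, as listed above.
SINGLE_FILES = [
    "00README.json",
    "source_x_oembed.json",
    "source_x_fxtwitter.json",
    "source_x_vxtwitter.json",
]

# Stems captured as both .html and .txt, as listed above.
PAIRED_STEMS = [
    "source_google_blog",
    "source_developer_blog",
    "source_google_ai_edge_blog",
    "source_gemma4_overview",
    "source_gemma4_model_card",
    "source_huggingface_gemma-4-12B-it",
    "source_visual_guide",
]


def expected_artifacts() -> list[str]:
    """Return every file name the ledger lists (hypothetical helper)."""
    names = list(SINGLE_FILES)
    for stem in PAIRED_STEMS:
        names += [f"{stem}.html", f"{stem}.txt"]
    return names


def missing_artifacts(bundle_dir: str) -> list[str]:
    """Return listed file names that are absent from bundle_dir.

    Assumption: all artifacts sit directly in one flat directory.
    """
    root = Path(bundle_dir)
    return [name for name in expected_artifacts() if not (root / name).is_file()]
```

Calling `missing_artifacts("path/to/bundle")` returns an empty list when every listed file is present.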

## Load-Bearing Extracts

The X post frames Gemma 4 12B as a dense encoder-free model processing raw text, image, and audio inputs, and explicitly connects it to Michael Tschannen's research focus on unifying models and training paradigms across modalities.

The Google launch blog describes Gemma 4 12B as a mid-sized model with native audio input and no multimodal encoders: vision and audio inputs flow into the LLM backbone. It also cites a 16 GB local execution target, an Apache 2.0 release, MTP drafters, and production deployment options through Google Cloud, including Model Garden, Cloud Run, and GKE.

The Google Developers Blog narrows the mechanism. On the vision side, the vision transformer layers used by the other medium-sized Gemma 4 models are replaced with a 35M-parameter embedder that projects raw 48x48 pixel patches to the LLM hidden dimension using a single matrix multiplication plus a coordinate lookup. On the audio side, the conformer-style encoder used in E2B/E4B is removed: raw 16 kHz audio is sliced into 40 ms frames of 640 floats each, and those frames are linearly projected into the LLM input space.
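
The audio numbers are internally consistent: 16,000 samples per second times 0.040 seconds is 640 samples per frame. The sketch below illustrates that framing and a plain linear projection. It is not Google's implementation: non-overlapping frames, zero-padding of the final partial frame, the function names, and the tiny hidden size in the usage example are all assumptions.

```python
SAMPLE_RATE_HZ = 16_000  # from the Developers Blog extract
FRAME_MS = 40            # from the Developers Blog extract
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 640 samples per frame


def frame_audio(samples):
    """Split a mono waveform into frames of FRAME_LEN samples.

    Assumptions: frames do not overlap, and the last partial frame
    is zero-padded to full length.
    """
    frames = []
    for start in range(0, len(samples), FRAME_LEN):
        frame = list(samples[start:start + FRAME_LEN])
        frame += [0.0] * (FRAME_LEN - len(frame))  # pad the tail frame
        frames.append(frame)
    return frames


def project(frames, weight):
    """Linearly project each frame with a FRAME_LEN x hidden weight matrix."""
    hidden = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(frame)) for j in range(hidden)]
        for frame in frames
    ]


# Usage: one second of silence yields 25 frames of 640 samples each.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = frame_audio(one_second)
print(len(frames), len(frames[0]))

# Hypothetical hidden size of 4, chosen only to keep the example small.
toy_weight = [[1.0] * 4 for _ in range(FRAME_LEN)]
embeddings = project(frames, toy_weight)
```

Each frame becomes one input vector for the LLM, so one second of audio corresponds to 25 positions under these assumptions.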

The Google AI model card confirms that "Unified" means the 12B model eliminates dedicated vision and audio encoders, projects raw image patches and audio waveforms into the LLM embedding space through lightweight linear layers, routes all modalities into a single decoder-only transformer, supports text/image/audio, has a 256K context window, and reports official benchmarks across reasoning, vision, audio, and long-context tasks.
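
To make the image path concrete, the sketch below cuts an image into the 48x48 patches quoted in the Developers Blog extract and flattens each one into a vector, keeping the grid coordinate that a positional lookup would use. It is an illustration only: the function name, the nested-list image layout, the three-channel input, and the requirement that image sides be multiples of 48 are assumptions, and the sources captured here do not describe resizing or padding.

```python
PATCH = 48  # patch side length from the Developers Blog extract


def extract_patches(image):
    """Cut an image into PATCH x PATCH tiles and flatten each tile.

    image is a nested list indexed as image[row][col][channel].
    Returns (coords, patches): coords[i] is the (patch_row, patch_col)
    grid position of patches[i].
    Assumption: height and width are exact multiples of PATCH.
    """
    height, width = len(image), len(image[0])
    if height % PATCH or width % PATCH:
        raise ValueError("image sides must be multiples of the patch size")
    coords, patches = [], []
    for top in range(0, height, PATCH):
        for left in range(0, width, PATCH):
            flat = [
                value
                for row in image[top:top + PATCH]
                for pixel in row[left:left + PATCH]
                for value in pixel
            ]
            coords.append((top // PATCH, left // PATCH))
            patches.append(flat)
    return coords, patches


# Usage: a 48 x 96 image with 3 channels yields two patches.
image = [[[0.0, 0.0, 0.0] for _ in range(96)] for _ in range(48)]
coords, patches = extract_patches(image)
print(coords, len(patches[0]))
```

With three channels, each flattened patch holds 48 x 48 x 3 = 6,912 values, which is the vector a single linear layer would map to the LLM hidden dimension.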

The Hugging Face card is the official weight artifact for the instruction-tuned checkpoint and repeats the encoder-free and deployment framing. It also lists the Apache 2.0 license and includes usage snippets for the supported local multimodal workflows.

The visual guide gives the clearest explanatory breakdown of the architectural trade-off: Gemma 4 12B removes separate non-text encoders, but still uses lightweight projection and positional machinery; the LLM backbone must take over much of the representation work that encoders previously performed.

## Boundary Notes

- Treat "encoder-free" as "no separate multimodal transformer/conformer encoder modules," not as "no input preprocessing." The model still uses image patch extraction, linear projections, positional embeddings/normalization for vision, audio framing, and linear projection for audio.
- Treat benchmark, latency, and quality claims as Google-reported release evidence until independent evaluations are available.
- Treat production relevance as release/deployment evidence: open weights, official model cards, Google AI Edge support, local runtime support, and Google Cloud deployment paths. It is not the same kind of evidence as a peer-reviewed architecture paper.
