# Beat Saber Automapper
An open-source AI system that takes an audio file and produces a playable Beat Saber level — notes, arcs, chains, bombs, obstacles, and a synchronized light show — packaged as a v3-format .zip. The pipeline is a three-stage transformer: a shared audio encoder feeds an onset detector, a note-sequence decoder, and a lighting decoder, each conditioned on difficulty, genre, and per-frame song-structure features.
## Purpose
The goal is to replicate what good human mappers do — density planning, swing-direction flow, lighting that tracks song energy — from audio alone, with a model small enough to train overnight on a single RTX 5090. Existing automappers either target the obsolete v1 format, skip lighting entirely, or produce maps that feel mechanical. This project targets v3, includes lighting, and is trained on a curated high-rating slice of BeatSaver.
## Highlights
- Style-cohort training (V5). Replaced a single averaged model with per-mapper style cohorts — 18 mappers in 9 style buckets — each trained independently.
- Auto-researcher harness. A YAML queue of training specs runs overnight; each result is scored on a playability + style-closeness composite and written to a leaderboard (first sketch below).
- Rich conditioning. Every stage receives difficulty + genre + song-structure embeddings (RMS, onset strength, band energies, section id, section progress) so soft sections slow down and drops get dense (second sketch below).
- Token-level note generation. Stage 2 uses an autoregressive transformer over a 183-token vocabulary, with beam search or nucleus sampling and an ergonomics loss that penalizes parity violations (third sketch below).
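The harness loop is small enough to sketch. Below is a minimal Python version under assumed names: the spec fields, scoring stubs, composite weights, and file paths are illustrative, not the project's actual schema.

```python
"""Minimal sketch of the overnight auto-researcher loop.

Spec fields, score weights, and paths are illustrative assumptions.
"""
import json
import yaml  # PyYAML


def train_and_score(spec: dict) -> dict:
    # Stand-in for the real pipeline: train a model from the spec, then
    # score it on the playability + style-closeness composite.
    playability = 0.0      # placeholder for the real playability metric
    style_closeness = 0.0  # placeholder for the real style-closeness metric
    return {
        "name": spec.get("name", "unnamed"),
        # composite score: the 0.6/0.4 weighting is a made-up example
        "score": 0.6 * playability + 0.4 * style_closeness,
    }


def run_queue(queue_path: str, leaderboard_path: str) -> None:
    with open(queue_path) as f:
        specs = yaml.safe_load(f)  # the YAML queue: a list of training specs
    results = sorted((train_and_score(s) for s in specs),
                     key=lambda r: r["score"], reverse=True)
    with open(leaderboard_path, "w") as f:
        json.dump(results, f, indent=2)  # persisted leaderboard


if __name__ == "__main__":
    run_queue("experiments/queue.yaml", "experiments/leaderboard.json")
```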
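For the song-structure conditioning, here is a hedged sketch of how the per-frame features could be extracted with librosa. The hop length, band count, and stacking order are assumptions; section id and section progress, which need a structure segmentation, are omitted.

```python
# Sketch of per-frame conditioning features via librosa; hop length, band
# count, and stacking order are assumptions, not the project's exact recipe.
import numpy as np
import librosa


def song_structure_features(path: str, hop: int = 512) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050, mono=True)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]                 # loudness per frame
    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)  # rhythmic salience
    # coarse band energies: an 8-band log-mel spectrogram
    bands = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=8, hop_length=hop))
    n = min(len(rms), len(onset), bands.shape[1])  # align frame counts
    return np.vstack([rms[:n], onset[:n], bands[:, :n]]).T  # (frames, 10)
```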
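Nucleus (top-p) sampling itself is standard; a sketch over the note-token vocabulary follows, with p=0.9 as an illustrative threshold rather than the project's tuned value. The ergonomics loss is a training-time term and is not shown here.

```python
# Nucleus (top-p) sampling over the 183-token note vocabulary; p=0.9 is an
# illustrative choice, not the project's tuned value.
import torch


def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """logits: (vocab,) unnormalized scores for the next note token."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep the smallest prefix whose mass reaches p (always at least 1 token)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_idx[choice].item())

# usage: next_token = nucleus_sample(decoder_logits[-1])
```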
## Technical notes
| Stage | Task | Architecture | Loss |
|---|---|---|---|
| 1 | Onset detection | 6-block TCN + 2-layer transformer | BCE on fuzzy labels |
| 2 | Note sequence | 8-layer transformer decoder | CE + flow + ergonomics |
| 3 | Lighting events | 4-layer transformer decoder | CE + label smoothing |
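"BCE on fuzzy labels" presumably means each ground-truth onset is widened into a soft target window so the loss tolerates off-by-one frames. A sketch follows; the triangular window shape and width are assumptions about the actual labeling.

```python
# Sketch of "fuzzy" onset targets: each ground-truth onset frame is widened
# into a small triangular window. Window shape and width are assumptions.
import torch
import torch.nn.functional as F


def fuzzy_targets(onset_frames: torch.Tensor, n_frames: int) -> torch.Tensor:
    target = torch.zeros(n_frames)
    for f in onset_frames.tolist():
        for offset, value in ((-2, 0.25), (-1, 0.5), (0, 1.0), (1, 0.5), (2, 0.25)):
            i = f + offset
            if 0 <= i < n_frames:
                target[i] = max(target[i], value)  # overlapping windows keep the max
    return target


def onset_loss(logits: torch.Tensor, onset_frames: torch.Tensor) -> torch.Tensor:
    # logits: per-frame onset scores from the TCN + transformer head
    return F.binary_cross_entropy_with_logits(
        logits, fuzzy_targets(onset_frames, logits.numel()))
```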
All stages share one audio encoder: 4-layer CNN frontend → sinusoidal positional encoding → 6-layer transformer (d_model=512). Trained end-to-end on Beat Saber map data — no pretrained speech weights, because what matters here is low-level rhythmic structure, not semantics.
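For concreteness, here is a PyTorch sketch matching that shape. The input feature size, kernel widths, channel schedule, and attention head count are assumptions, not values from the repo.

```python
# PyTorch sketch of the shared encoder: 4-layer CNN frontend -> sinusoidal
# positional encoding -> 6-layer transformer, d_model=512. Input feature
# size, kernel widths, and head count are assumptions.
import math
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    def __init__(self, n_feats: int = 128, d_model: int = 512, max_len: int = 16384):
        super().__init__()
        # 4-layer 1-D CNN frontend over the per-frame feature axis
        layers, c_in = [], n_feats
        for c_out in (128, 256, 512, d_model):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1), nn.GELU()]
            c_in = c_out
        self.frontend = nn.Sequential(*layers)
        # fixed sinusoidal positional encoding
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # 6-layer transformer encoder
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=6)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_feats) -> (batch, frames, d_model)
        h = self.frontend(x.transpose(1, 2)).transpose(1, 2)
        return self.transformer(h + self.pe[: h.size(1)])
```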
Note: training on an RTX 5090 (sm_120) requires a PyTorch nightly build with cu128; stable wheels aren't yet built for Blackwell.
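A small runtime check can catch a mismatched build early. The capability check below uses standard torch APIs; the install command in the comment follows PyTorch's nightly cu128 index.

```python
# Quick environment check: an RTX 5090 reports CUDA compute capability (12, 0).
# A cu128 nightly wheel can be installed with:
#   pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
import torch

if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability()
    print(f"GPU sm_{cap[0]}{cap[1]}, torch {torch.__version__} (CUDA {torch.version.cuda})")
    if cap >= (12, 0) and not torch.version.cuda.startswith("12.8"):
        print("Blackwell GPU detected: install a cu128 nightly wheel.")
```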
## Links
- Repo: github.com/Kwoolford/beatsaber_automapper
- Previewer: ArcViewer (drag in a generated .zip)
- Map format spec: BSMG Wiki