
Beat Saber Automapper

An open-source AI system that takes an audio file and produces a playable Beat Saber level — notes, arcs, chains, bombs, obstacles, and a synchronized light show — packaged as a v3-format .zip. The pipeline is a three-stage transformer: a shared audio encoder feeds an onset detector, a note-sequence decoder, and a lighting decoder, each conditioned on difficulty, genre, and per-frame song-structure features.

Purpose

The goal is to replicate what good human mappers do — density planning, swing-direction flow, lighting that tracks song energy — from audio alone, with a model small enough to train overnight on a single RTX 5090. Existing automappers either target the obsolete v1 format, skip lighting entirely, or produce maps that feel mechanical. This project targets v3, includes lighting, and is trained on a curated high-rating slice of BeatSaver.

Highlights

  • Style-cohort training (V5). Replaced a single averaged model with per-mapper style cohorts — 18 mappers in 9 style buckets — each trained independently.
  • Auto-researcher harness. A YAML queue of training specs runs overnight; each result is scored on a playability + style-closeness composite and written to a leaderboard.
  • Rich conditioning. Every stage receives difficulty + genre + song-structure embeddings (RMS, onset strength, band energies, section id, section progress) so soft sections slow down and drops get dense.
  • Token-level note generation. Stage 2 uses an autoregressive transformer over a 183-token vocabulary, with beam search or nucleus sampling and an ergonomics loss that penalizes parity violations.
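
The training-spec queue format isn't shown in this summary; a hypothetical entry, just to make the harness idea concrete (every field name here is an assumption, not the project's actual schema):

```yaml
# One entry per overnight run; each finished run is scored on the
# playability + style-closeness composite and ranked on the leaderboard.
- name: cohort-tech-v5          # hypothetical run id
  cohort: tech                  # one of the 9 style buckets
  epochs: 40
  lr: 3.0e-4
  score: playability+style      # composite used for ranking
```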
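
The 183-token vocabulary and the actual decoder live in the repo; to illustrate the nucleus-sampling side of Stage 2 decoding, here is a minimal pure-Python sketch (function name and toy probabilities are illustrative, not project code):

```python
import random

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token id from the smallest set of tokens whose
    cumulative probability reaches p (nucleus / top-p sampling)."""
    rng = rng or random.Random()
    # Sort token ids by descending probability.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Renormalize over the nucleus and draw one token.
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

With p = 0.8 and token probabilities [0.5, 0.3, 0.15, 0.05], the nucleus is just the first two tokens, so the tail never fires — which is the point: low-probability "mechanical" patterns get cut off while the head stays diverse.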

Technical notes

Stage  Task             Architecture                  Loss
1      onset detection  6-block TCN + 2L transformer  BCE on fuzzy labels
2      note sequence    8L transformer decoder        CE + flow + ergo
3      lighting events  4L transformer decoder        CE + label smoothing
All stages share one audio encoder: 4-layer CNN frontend → sinusoidal positional encoding → 6-layer transformer (d_model=512). Trained end-to-end on Beat Saber map data — no pretrained speech weights, because what matters here is low-level rhythmic structure, not semantics.
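
The sinusoidal positional encoding in the shared encoder is the standard transformer formulation, pe[t, 2i] = sin(t / 10000^(2i/d)) and pe[t, 2i+1] = cos(t / 10000^(2i/d)); a minimal pure-Python sketch of that formula (the real encoder uses d_model=512, shown small here):

```python
import math

def sinusoidal_pe(length, d_model):
    """Standard sinusoidal positional encoding table of shape
    [length, d_model]: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(length)]
    for t in range(length):
        for i in range(0, d_model, 2):
            angle = t / (10000 ** (i / d_model))
            pe[t][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[t][i + 1] = math.cos(angle)
    return pe
```

The table is added to the CNN frontend's output before the transformer layers, giving the attention stack absolute frame positions without any learned parameters.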

Note. Training on an RTX 5090 (sm_120) requires a PyTorch nightly build with cu128; stable wheels are not yet compiled for Blackwell.
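
At the time of writing, the cu128 nightly can be installed like this (check pytorch.org's install selector for the current command — index URLs change between releases):

```
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
```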