Model used: acestep-v15-xl-base
Prompt (Music Caption): house music style, soulful vocal performance, driving groove, vocal track singing, 122 bpm
Lyrics: [Style: raw house diva vocals, gritty soul tone, passionate female voice]
[Vibe: 90s club vocal hook, loop ready, dancefloor energy]
[Hook: rhythmic vocal chops]
You took your love away from me.
[Hook: driving rhythm]
Left me dancing with your memory.
[Refrain: emotional peak]
Why did you leave me? Why did you walk away?
# Project Blueprint & Civitai Release Documentation
* Model Architecture: High-Capacity Audio Diffusion LoRA
* Project Code: house_vocal_monster_xl
* Production Date: May 24, 2026
---
## 1. Project Overview & Dev Notes
* Current Model Name: Deep House Vocal Beast XL (Rank 128 / -11 LUFS Hi-Fi) - V2
* Target Style: Deep House Vocals, Soulful Performances, Driving Grooves
* Hardware Environment: NVIDIA RTX 3090 (24 GB VRAM)
* Training Framework: ACE-Step v1.5 (PyTorch 2.7.1 + CUDA 12.8 / Windows 11)
### Revision History: The Breakthrough from V1 to V2
This release is the second training attempt (V2) and the first functional one.
* The V1 Failure: The initial training run produced a broken model with an undersized .safetensors file. Alpha had been accidentally set to 6 against Rank 128. In standard LoRA the learned update is scaled by alpha / rank, so a ratio of 6 / 128 (about 0.05) attenuated the adapter so strongly that the learned vocal characteristics barely transferred onto the base model.
* The V2 Correction: This iteration sets Alpha to 128 to match Rank 128. With a 1:1 alpha-to-rank ratio the adapter is applied at full strength, which gives the model the expressive range needed for high-fidelity vocal textures. The result is a fully functional 244 MB model file.
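
The difference between the two Alpha values can be put into numbers. The sketch below assumes the trainer applies the conventional LoRA scaling of alpha / rank; the source does not confirm which scaling rule ACE-Step uses, and `lora_scale` is a hypothetical helper name used only for illustration.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Conventional LoRA multiplier applied to the low-rank update (assumed here)."""
    return alpha / rank


v1_scale = lora_scale(alpha=6, rank=128)    # failed run
v2_scale = lora_scale(alpha=128, rank=128)  # corrected run

print(f"V1 scale: {v1_scale:.4f}")           # 0.0469
print(f"V2 scale: {v2_scale:.4f}")           # 1.0000
print(f"V2 / V1:  {v2_scale / v1_scale:.1f}x")  # 21.3x
```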
---
## 2. Configuration Code Comparison (JSON)
### Failed Run (V1 - Thin Abstraction Layer)
    {
      "name": "house_xl_failed_attempt_v1",
      "description": "FAILED RUN - Abstraction layer too thin due to low alpha value",
      "adapter_type": "lora",
      "rank": 128,
      "alpha": 6,
      "dropout": 0.0,
      "target_modules_str": "q_proj k_proj v_proj o_proj w1 w2 c_proj gate_proj up_proj down_proj",
      "attention_type": "both",
      "bias": "none",
      "learning_rate": 5e-05,
      "batch_size": 1,
      "gradient_accumulation": 4,
      "epochs": 100,
      "warmup_steps": 100,
      "weight_decay": 0.01,
      "max_grad_norm": 1.0,
      "seed": -1,
      "shift": 1.0,
      "num_inference_steps": 50,
      "optimizer_type": "adamw8bit",
      "scheduler_type": "cosine_restarts",
      "cfg_ratio": 0.0,
      "save_every": 5,
      "log_every": 10,
      "log_heavy_every": 25,
      "gradient_checkpointing": true,
      "offload_encoder": false,
      "sample_every_n_epochs": 0
    }
### Successful Run (V2 - "The Beast" Full Capacity)
    {
      "name": "house_xl_300mb_beast",
      "description": "High capacity dynamic rank 128 deep house vocal model",
      "adapter_type": "lora",
      "rank": 128,
      "alpha": 128,
      "dropout": 0.0,
      "target_modules_str": "q_proj k_proj v_proj o_proj w1 w2 c_proj gate_proj up_proj down_proj",
      "attention_type": "both",
      "bias": "none",
      "learning_rate": 5e-05,
      "batch_size": 1,
      "gradient_accumulation": 4,
      "epochs": 100,
      "warmup_steps": 100,
      "weight_decay": 0.01,
      "max_grad_norm": 1.0,
      "seed": -1,
      "shift": 1.0,
      "num_inference_steps": 50,
      "optimizer_type": "adamw8bit",
      "scheduler_type": "cosine_restarts",
      "cfg_ratio": 0.0,
      "save_every": 5,
      "log_every": 10,
      "log_heavy_every": 25,
      "gradient_checkpointing": true,
      "offload_encoder": false,
      "sample_every_n_epochs": 0
    }
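
Apart from the name and description labels, the two configurations above differ in a single field. The sketch below makes that visible by diffing excerpts of both configs; the values are copied from the listings, while `diff_configs` is a hypothetical helper and not part of the training framework. It also derives the effective batch size, which is the per-step batch size multiplied by the gradient accumulation count.

```python
# Excerpts of the two configurations listed above (labels omitted).
v1 = {"adapter_type": "lora", "rank": 128, "alpha": 6, "dropout": 0.0,
      "learning_rate": 5e-05, "batch_size": 1, "gradient_accumulation": 4,
      "epochs": 100, "scheduler_type": "cosine_restarts"}
v2 = {"adapter_type": "lora", "rank": 128, "alpha": 128, "dropout": 0.0,
      "learning_rate": 5e-05, "batch_size": 1, "gradient_accumulation": 4,
      "epochs": 100, "scheduler_type": "cosine_restarts"}


def diff_configs(a: dict, b: dict) -> dict:
    """Map each key whose value differs to a (value_in_a, value_in_b) pair."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


changed = diff_configs(v1, v2)
effective_batch = v2["batch_size"] * v2["gradient_accumulation"]

print(changed)          # {'alpha': (6, 128)}
print(effective_batch)  # 4
```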
---
## 3. Dataset Preprocessing Pipeline
The training audio was prepared by a preprocessing script with the following steps:
* Slicing: 30.0-second segments with a 1-second overlap between consecutive segments.
* Click-Free Boundaries: Segment borders are moved to the nearest zero crossing (find_nearest_zero_crossing) to avoid clicks and pops at the cut points.
* Loudness Normalization: Each segment is normalized to a target of -11.0 LUFS, with a cap on the applied gain so that the signal does not clip.
* Format: Uncompressed 24-bit PCM WAV (pcm_s24le) sampled at 44.1 kHz.
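
The steps above can be sketched in plain Python. This is a minimal illustration, not the original script: only the name `find_nearest_zero_crossing` and the numeric targets come from this document. The search radius, the -1.0 dBFS peak ceiling, and the reading of "gain cap" as a headroom limit are assumptions, and the loudness measurement itself is taken as an input rather than computed.

```python
SAMPLE_RATE = 44_100   # Hz, from the format spec above
SEGMENT_S = 30.0       # segment length in seconds
OVERLAP_S = 1.0        # overlap between consecutive segments
TARGET_LUFS = -11.0    # loudness target


def segment_starts(total_samples, sr=SAMPLE_RATE, seg_s=SEGMENT_S, overlap_s=OVERLAP_S):
    """Start indices of full-length segments; consecutive segments share overlap_s."""
    seg = int(round(seg_s * sr))
    hop = int(round((seg_s - overlap_s) * sr))
    if total_samples < seg:
        return []
    return list(range(0, total_samples - seg + 1, hop))


def find_nearest_zero_crossing(samples, index, radius=2048):
    """Index closest to `index` where the signal is zero or changes sign.

    Falls back to `index` if no crossing lies within `radius` samples.
    """
    lo = max(index - radius, 1)
    hi = min(index + radius, len(samples) - 1)
    best = None
    for i in range(lo, hi + 1):
        crossed = samples[i] == 0.0 or (samples[i - 1] < 0.0) != (samples[i] < 0.0)
        if crossed and (best is None or abs(i - index) < abs(best - index)):
            best = i
    return index if best is None else best


def capped_gain_db(measured_lufs, peak_dbfs, target_lufs=TARGET_LUFS, ceiling_dbfs=-1.0):
    """Gain toward the loudness target, limited so the peak stays under the ceiling."""
    wanted = target_lufs - measured_lufs
    headroom = ceiling_dbfs - peak_dbfs
    return min(wanted, headroom)


starts = segment_starts(90 * SAMPLE_RATE)                           # 90 s of audio
cut = find_nearest_zero_crossing([1.0] * 10 + [-1.0] * 10, 7, radius=5)
gain = capped_gain_db(measured_lufs=-16.0, peak_dbfs=-3.0)

print(starts)  # [0, 1278900, 2557800]
print(cut)     # 10
print(gain)    # 2.0
```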
---
## 4. Civitai Platform Upload Content
### Model Title
Deep House Vocal Beast XL (Rank 128 / -11 LUFS Hi-Fi) - V2
### Short Description
V2 Second Attempt! High-capacity, professionally preprocessed LoRA trained on premium deep house vocals. Fixed abstraction layers (Alpha 128) for full expressiveness.
### Tags & Trigger Words
v2, second attempt, house music style, soulful vocal performance, driving groove, clear mix, vocal track singing, 122 bpm
### Official Release Description (For Civitai Textfield)
Welcome to Deep House Vocal Beast XL (Second Attempt)
DEV NOTE / WHY THIS IS V2:
This is the successful second attempt at training this model. In the first run, Alpha was accidentally set to 6 against Rank 128. That scaled the adapter down so far that it could barely express the training data, and the run produced an undersized safetensors file. This version corrects the setting to Rank 128 / Alpha 128, so the adapter is applied at full strength and the capacity of the Rank 128 layers is actually used.
This professional-grade audio diffusion LoRA is engineered specifically for soulful, driving, and deep house vocal generation. It is designed to deliver studio-quality vocal tracks that sit perfectly in a club mix without requiring heavy post-processing.
Technical Blueprint & Dataset Care:
Unlike standard datasets, the source material for this model underwent a rigorous, surgical audio pre-processing pipeline:
- Corrected High-Capacity Architecture: Trained at Rank 128 / Alpha 128 to capture the finest vocal nuances, breath textures, and emotional delivery.
- Phase-Synchronized Slicing: All training samples (30.0s) were sliced strictly at mathematical amplitude zero-crossings to eliminate digital clicks and artifacts.
- Club-Standard Loudness: Digitally normalized to a consistent, powerful -11.0 LUFS (using strict gain capping to prevent clipping) for that signature warm, saturated house groove.
- Studio Grade Quality: Processed entirely in uncompressed 24-bit PCM WAV (44.1 kHz) to preserve full dynamic range.
How to Prompt & Trigger Words:
The model responds well to musical terms, including key and tempo tags. For best results, use the following tags in your prompt:
house music style, soulful vocal performance, driving groove, classic house piano chord, rhythmic bassline, 909 percussion, clear mix
To make the model sing specific lines, format your prompt like this:
vocal track singing: 'YOUR LYRICS HERE', 122 bpm, key of A Minor
Recommended Settings:
- LoRA Weight: 0.8 - 1.0 (Start at 1.0, lower slightly if vocals become too aggressive)
- CFG Scale: 3.5 - 7.0 (Depending on your base model)
- Inference Steps: 50
Give your house tracks the voice they deserve. Download, test it out, and leave a review with your generations!
---
## 5. Training Metrics & Loss Analytics
### Hardware Performance Profile
* GPU Engine: NVIDIA RTX 3090 (24 GB VRAM)
* VRAM Allocation: Consistent 12.1 / 24.0 GiB (50% Load) via Gradient Checkpointing
* Time per Epoch: ~191.8 Seconds
* Total Elapsed Time: 5 hours 28 minutes 10 seconds
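
The timing figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers from this profile. The gap between the summed epoch time and the total elapsed time is computed, not explained: attributing it to checkpoint saves or logging would be a guess that the log does not confirm.

```python
EPOCHS = 100
SECONDS_PER_EPOCH = 191.8                   # approximate, from the profile above
TOTAL_ELAPSED_S = 5 * 3600 + 28 * 60 + 10   # 5 h 28 min 10 s

epoch_time_s = round(EPOCHS * SECONDS_PER_EPOCH, 1)
overhead_s = round(TOTAL_ELAPSED_S - epoch_time_s, 1)
vram_fraction = 12.1 / 24.0

print(TOTAL_ELAPSED_S)                           # 19690
print(epoch_time_s)                              # 19180.0
print(overhead_s)                                # 510.0
print(f"{overhead_s / TOTAL_ELAPSED_S:.1%}")     # 2.6%
print(f"{vram_fraction:.1%}")                    # 50.4%
```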
### Training Loss Curve (Text Summary)
Step loss over the 100 training epochs (vertical axis: loss value, 0.0 to 1.0; horizontal axis: epochs, 0 to 100). Points labeled on the curve:
* Epoch 1: 0.8917
* Epoch 2: 0.7154
* Epoch 5: 0.6214
* Epoch 51: 0.6903
* Epoch 66: 0.5939
* Epoch 98: 0.7701
* Epoch 100: 0.5045 (best)
### Loss Step Log (Complete Session Data)
| Epoch | Global Step | Step Loss | Training Status / Notes |
| :--- | :--- | :--- | :--- |
| Epoch 1 | Step 0050 | 0.8917 | Training Initialization & Warmup |
| Epoch 2 | Step 0090 | 0.7154 | Initial Descent & Style Adaptation |
| Epoch 5 | Step 0290 | 0.6214 | Checkpoint 1 Generated (244 MB) |
| Epoch 20 | Step 1150 | 0.6019 | Smooth Convergence, Saturated House Cadence |
| Epoch 50 | Step 2950 | 0.7071 | Mid-way Checkpoint Generated (Best Interim Loss: 0.5267) |
| Epoch 65 | Step 3835 | 0.6968 | Checkpoint Saved, Scheduler Restart Active |
| Epoch 66 | Step 3850 | 0.5939 | Pattern Stabilization & Fine Accent Capture |
| Epoch 98 | Step 5770 | 0.7701 | Convergence Stage (Absolute Best Historical Step Loss: 0.5045) |
| Epoch 100 | Step 5800 | 0.5045 | Final Model Export & Consolidation Successful |
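
The logged rows above can be summarized programmatically. The sketch below copies the epoch, step, and loss values from the table and reports the lowest logged step loss, the average steps per epoch implied by the final row, and the spread of the losses from epoch 20 onward. Individual step losses in diffusion training tend to be noisy, so a single row says little about overall convergence; the variable names are illustrative only.

```python
# (epoch, global_step, step_loss) copied from the table above
rows = [
    (1, 50, 0.8917), (2, 90, 0.7154), (5, 290, 0.6214),
    (20, 1150, 0.6019), (50, 2950, 0.7071), (65, 3835, 0.6968),
    (66, 3850, 0.5939), (98, 5770, 0.7701), (100, 5800, 0.5045),
]

best = min(rows, key=lambda r: r[2])          # row with the lowest step loss
steps_per_epoch = rows[-1][1] / rows[-1][0]   # from the final row

late_losses = [loss for epoch, _, loss in rows if epoch >= 20]
late_mean = sum(late_losses) / len(late_losses)
late_spread = max(late_losses) - min(late_losses)

print(best)                    # (100, 5800, 0.5045)
print(steps_per_epoch)         # 58.0
print(f"{late_mean:.4f}")      # 0.6457
print(f"{late_spread:.4f}")    # 0.2656
```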
Description
Initial functional release (V2), replacing the failed V1 attempt. The Alpha setting was corrected from 6 to 128 to match Rank 128, so the adapter is now applied at full strength. This enables proper vocal expression and detailed texture reproduction.
