MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling


Jingjing Tang1*  Xin Wang2   Zhe Zhang2   Junichi Yamagishi2   Geraint Wiggins1,3   George Fazekas1

1Centre for Digital Music, Queen Mary University of London, UK
2National Institute of Informatics, Japan
3Vrije Universiteit Brussel, Belgium



Figure 1: Architecture of the MIDI-VALLE model.
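For readers without the figure to hand, the sketch below illustrates the VALL-E-style neural codec language modelling the title refers to: an autoregressive (AR) stage continues the first-codebook Encodec token stream conditioned on MIDI tokens and a short acoustic prompt, and a non-autoregressive (NAR) stage fills in the remaining codebooks. This is a minimal sketch under those assumptions only; the random stand-in predictors, names, and shapes (ar_generate, nar_predict, eight codebooks, 1024-way vocabulary, 75 frames/s) are illustrative, not the released implementation.

```python
# Hypothetical sketch of VALL-E-style two-stage decoding; all names, shapes,
# and the random stand-in predictors are illustrative only.
import torch

N_Q, VOCAB = 8, 1024            # assumed: 8 Encodec codebooks, 1024-way tokens
T_PROMPT, T_TARGET = 225, 750   # ~3 s prompt and ~10 s target at 75 frames/s

def ar_generate(midi_cond: torch.Tensor, acoustic_prefix: torch.Tensor,
                steps: int) -> torch.Tensor:
    """AR stage: continue the first-codebook token stream frame by frame."""
    out = acoustic_prefix.tolist()
    for _ in range(steps):                       # real model: sample from an LM
        out.append(int(torch.randint(VOCAB, (1,))))
    return torch.tensor(out)

def nar_predict(midi_cond: torch.Tensor, decoded: torch.Tensor,
                level: int) -> torch.Tensor:
    """NAR stage: predict codebook `level` for all frames in parallel."""
    return torch.randint(VOCAB, (decoded.shape[-1],))  # real model: one forward pass

midi_cond = torch.randint(512, (T_PROMPT + T_TARGET,))  # tokenised MIDI condition
prompt_codes = torch.randint(VOCAB, (N_Q, T_PROMPT))    # Encodec codes of audio prompt

layers = [ar_generate(midi_cond, prompt_codes[0], steps=T_TARGET)]
for q in range(1, N_Q):
    layers.append(nar_predict(midi_cond, torch.stack(layers), level=q))
codes = torch.stack(layers)  # (8, T_PROMPT + T_TARGET): fed to the Encodec decoder
print(codes.shape)
```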



Fine-Tuning Encodec on ATEPP Audio


Five 10-second reconstructed samples, randomly selected from the ATEPP test set, are provided as listening examples.
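The reconstruction setup can be approximated with the Encodec implementation in Hugging Face transformers; the sketch below round-trips a 10-second clip through the stock 24 kHz checkpoint. We assume the fine-tuned Piano-Encodec weights would be loaded the same way from their own checkpoint, and the sine tone is a stand-in for a real ATEPP clip.

```python
# Round-trip a 10-second clip through Encodec (stock facebook/encodec_24khz
# weights; the fine-tuned Piano-Encodec would load the same way from its own
# checkpoint). The sine tone stands in for a real ATEPP recording.
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

sr = processor.sampling_rate                              # 24 kHz
t = np.linspace(0, 10, 10 * sr, endpoint=False)
audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype("float32")

inputs = processor(raw_audio=audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    enc = model.encode(inputs["input_values"], inputs["padding_mask"])
    recon = model.decode(enc.audio_codes, enc.audio_scales,
                         inputs["padding_mask"])[0]
print(recon.shape)  # (1, 1, ~240000): the reconstructed waveform
```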


Source                     | Sample A | Sample B | Sample C | Sample D | Sample E
Ground truth               | [audio]  | [audio]  | [audio]  | [audio]  | [audio]
Encodec (w/o fine-tuning)  | [audio]  | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec (fine-tuned) | [audio]  | [audio]  | [audio]  | [audio]  | [audio]


Synthesis Quality: MIDI-VALLE vs. M2A


To demonstrate the effectiveness of the proposed model, we compare MIDI-VALLE with the M2A model [1] on the ATEPP, Maestro, and Pijama datasets. Four segments are randomly selected from each test set to illustrate the performance of the proposed model.


ATEPP: Transcribed MIDI in Classical Style

Source                       | Sample A | Sample B | Sample C | Sample D
Ground-truth human recording | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec reconstruction | [audio]  | [audio]  | [audio]  | [audio]
MIDI-VALLE generation        | [audio]  | [audio]  | [audio]  | [audio]
M2A generation               | [audio]  | [audio]  | [audio]  | [audio]


Maestro: Recorded MIDI in Classical Style

Source                       | Sample A | Sample B | Sample C | Sample D
Ground-truth human recording | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec reconstruction | [audio]  | [audio]  | [audio]  | [audio]
MIDI-VALLE generation        | [audio]  | [audio]  | [audio]  | [audio]
M2A generation               | [audio]  | [audio]  | [audio]  | [audio]


Pijama: Transcribed MIDI in Jazz Style

Source                       | Sample A | Sample B | Sample C | Sample D
Ground-truth human recording | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec reconstruction | [audio]  | [audio]  | [audio]  | [audio]
MIDI-VALLE generation        | [audio]  | [audio]  | [audio]  | [audio]
M2A generation               | [audio]  | [audio]  | [audio]  | [audio]


Compatibility with EPRs: MIDI-VALLE vs. M2A


Three expressive performance rendering (EPR) models, M2M [1], VirtuosoNet [2], and DExter [3], are selected to evaluate how MIDI-VALLE and M2A perform when integrated into a two-stage music performance synthesis (MPS) pipeline. For each EPR system, four segments are generated for comparison.
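Conceptually, the two stages compose as score-to-expressive-MIDI followed by MIDI-to-audio. The sketch below shows that composition as glue code only: render_expressive_midi and synthesise_audio are hypothetical wrappers standing in for the EPR models and for MIDI-VALLE/M2A respectively, not published APIs.

```python
# Hypothetical glue code for the two-stage MPS pipeline:
# score -> expressive MIDI (EPR stage) -> audio (synthesis stage).
# All function names here are placeholders, not released APIs.
from pathlib import Path

def render_expressive_midi(score: Path, epr_model: str) -> Path:
    """Stage 1 (EPR): M2M, VirtuosoNet, or DExter adds human-like timing,
    dynamics, and articulation to the score; returns a performance MIDI."""
    ...  # call the chosen EPR model here

def synthesise_audio(performance_midi: Path, prompt_wav: Path) -> Path:
    """Stage 2: MIDI-VALLE (or M2A) renders the performance MIDI to audio;
    MIDI-VALLE is additionally conditioned on a 3-second prompt."""
    ...  # call the synthesiser here

for epr in ("M2M", "VirtuosoNet", "DExter"):
    perf_midi = render_expressive_midi(Path("score.musicxml"), epr_model=epr)
    audio = synthesise_audio(perf_midi, prompt_wav=Path("prompt.wav"))
```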


M2M

Source           | Sample A | Sample B | Sample C | Sample D
M2M + MIDI-VALLE | [audio]  | [audio]  | [audio]  | [audio]
M2M + M2A        | [audio]  | [audio]  | [audio]  | [audio]

VirtuosoNet

Source                   | Sample A | Sample B | Sample C | Sample D
VirtuosoNet + MIDI-VALLE | [audio]  | [audio]  | [audio]  | [audio]
VirtuosoNet + M2A        | [audio]  | [audio]  | [audio]  | [audio]

DExter

Source              | Sample A | Sample B | Sample C | Sample D
DExter + MIDI-VALLE | [audio]  | [audio]  | [audio]  | [audio]
DExter + M2A        | [audio]  | [audio]  | [audio]  | [audio]


Prompt Effects on MIDI-VALLE


The following audio samples demonstrate the effect of different prompts on expressive performances synthesised with MIDI-VALLE. In some cases, even when the acoustics of the audio prompt differ markedly from the ground truth, the synthesis is barely affected, as in Sample B. For segments like Sample A, however, these variations lead to noticeable differences in both timbre and the reproduction of the recording environment. This may stem from mismatches between the MIDI prompt and the target MIDI in note range, chords, harmonic content, rhythm, and other musical features. We do not investigate these effects further, as they fall outside our primary focus.
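In code terms, the study amounts to holding the target MIDI fixed while swapping the 3-second audio/MIDI prompt pair, along the lines of the sketch below. MIDIVALLEStub and its generate signature are hypothetical placeholders, not the actual MIDI-VALLE interface.

```python
# Hypothetical sketch of the prompt study: the target MIDI is fixed while the
# 3-second audio/MIDI prompt pair varies. The class and its signature are
# placeholders, not the real MIDI-VALLE interface.
class MIDIVALLEStub:
    def generate(self, target_midi: str, prompt_audio: str, prompt_midi: str):
        ...  # tokenise, run AR + NAR decoding, decode with Piano-Encodec

midi_valle = MIDIVALLEStub()
conditions = {
    "First 3 Seconds": ("target_first3s.wav", "target_first3s.mid"),
    "Prompt I":   ("prompt1.wav", "prompt1.mid"),
    "Prompt II":  ("prompt2.wav", "prompt2.mid"),
    "Prompt III": ("prompt3.wav", "prompt3.mid"),
}
for name, (wav, mid) in conditions.items():
    audio = midi_valle.generate(
        target_midi="target.mid",   # identical across conditions
        prompt_audio=wav,           # timbre and room acoustics come from here
        prompt_midi=mid,            # mismatch with target MIDI may degrade output
    )
```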


Sample A

Condition       | 3-second Prompt | Performance
Ground truth    | N/A             | [audio]
First 3 seconds | [audio]         | [audio]
Prompt I        | [audio]         | [audio]
Prompt II       | [audio]         | [audio]
Prompt III      | [audio]         | [audio]


Sample B

Condition       | 3-second Prompt | Performance
Ground truth    | N/A             | [audio]
First 3 seconds | [audio]         | [audio]
Prompt I        | [audio]         | [audio]
Prompt II       | [audio]         | [audio]
Prompt III      | [audio]         | [audio]


References


[1] Tang, J., Cooper, E., Wang, X., Yamagishi, J., & Fazekas, G. (2025). Towards an Integrated Approach for Expressive Piano Performance Synthesis from Music Scores. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
[2] Jeong, D., Kwon, T., Kim, Y., Lee, K., & Nam, J. (2019). VirtuosoNet: A Hierarchical RNN-Based System for Modeling Expressive Piano Performance. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) (pp. 908-915).
[3] Zhang, H., Chowdhury, S., Cancino-Chacón, C. E., Liang, J., Dixon, S., & Widmer, G. (2024). DExter: Learning and Controlling Performance Expression with Diffusion Models. Applied Sciences, 14(15), 6543.