MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling


Jingjing Tang1*  Xin Wang2   Zhe Zhang2   Junichi Yamagishi2   Geraint Wiggins1,3   George Fazekas1

1Centre for Digital Music, Queen Mary University of London, UK
2National Institute of Informatics, Japan
3Vrije Universiteit Brussel, Belgium



Figure 1: Architecture of the MIDI-VALLE model.
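For readers without the figure to hand, the sketch below illustrates the VALL-E-style neural codec language modelling the title refers to: an autoregressive (AR) stage continues the first-codebook Encodec token stream conditioned on MIDI tokens and a short acoustic prompt, and a non-autoregressive (NAR) stage fills in the remaining codebooks. This is a minimal sketch under those assumptions only; the random stand-in predictors, names, and shapes (ar_generate, nar_predict, eight codebooks, 1024-way vocabulary, 75 frames/s) are illustrative, not the released implementation.

```python
# Hypothetical sketch of VALL-E-style two-stage decoding; all names, shapes,
# and the random stand-in predictors are illustrative only.
import torch

N_Q, VOCAB = 8, 1024            # assumed: 8 Encodec codebooks, 1024-way tokens
T_PROMPT, T_TARGET = 225, 750   # ~3 s prompt and ~10 s target at 75 frames/s

def ar_generate(midi_cond: torch.Tensor, acoustic_prefix: torch.Tensor,
                steps: int) -> torch.Tensor:
    """AR stage: continue the first-codebook token stream frame by frame."""
    out = acoustic_prefix.tolist()
    for _ in range(steps):                       # real model: sample from an LM
        out.append(int(torch.randint(VOCAB, (1,))))
    return torch.tensor(out)

def nar_predict(midi_cond: torch.Tensor, decoded: torch.Tensor,
                level: int) -> torch.Tensor:
    """NAR stage: predict codebook `level` for all frames in parallel."""
    return torch.randint(VOCAB, (decoded.shape[-1],))  # real model: one forward pass

midi_cond = torch.randint(512, (T_PROMPT + T_TARGET,))  # tokenised MIDI condition
prompt_codes = torch.randint(VOCAB, (N_Q, T_PROMPT))    # Encodec codes of audio prompt

layers = [ar_generate(midi_cond, prompt_codes[0], steps=T_TARGET)]
for q in range(1, N_Q):
    layers.append(nar_predict(midi_cond, torch.stack(layers), level=q))
codes = torch.stack(layers)  # (8, T_PROMPT + T_TARGET): fed to the Encodec decoder
print(codes.shape)
```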



Fine-Tuning Encodec on ATEPP Audio


Five 10-second reconstructed samples, randomly selected from the ATEPP test set, are provided as listening examples.
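The reconstruction setup can be approximated with the Encodec implementation in Hugging Face transformers; the sketch below round-trips a 10-second clip through the stock 24 kHz checkpoint. We assume the fine-tuned Piano-Encodec weights would be loaded the same way from their own checkpoint, and the sine tone is a stand-in for a real ATEPP clip.

```python
# Round-trip a 10-second clip through Encodec (stock facebook/encodec_24khz
# weights; the fine-tuned Piano-Encodec would load the same way from its own
# checkpoint). The sine tone stands in for a real ATEPP recording.
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

sr = processor.sampling_rate                              # 24 kHz
t = np.linspace(0, 10, 10 * sr, endpoint=False)
audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype("float32")

inputs = processor(raw_audio=audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    enc = model.encode(inputs["input_values"], inputs["padding_mask"])
    recon = model.decode(enc.audio_codes, enc.audio_scales,
                         inputs["padding_mask"])[0]
print(recon.shape)  # (1, 1, ~240000): the reconstructed waveform
```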


Source                     | Sample A | Sample B | Sample C | Sample D | Sample E
Ground truth               | [audio]  | [audio]  | [audio]  | [audio]  | [audio]
Encodec (w/o fine-tuning)  | [audio]  | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec (fine-tuned) | [audio]  | [audio]  | [audio]  | [audio]  | [audio]


Synthesis Quality: MIDI-VALLE vs. M2A


To demonstrate the effectiveness of the proposed model, we compare MIDI-VALLE with the M2A model [1] on the ATEPP, Maestro, and Pijama datasets. Four segments are randomly selected from each test set to illustrate the performance of the proposed model.


ATEPP: Transcribed MIDI in Classical Style

Source                       | Sample A | Sample B | Sample C | Sample D
Ground-truth human recording | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec reconstruction | [audio]  | [audio]  | [audio]  | [audio]
MIDI-VALLE generation        | [audio]  | [audio]  | [audio]  | [audio]
M2A generation               | [audio]  | [audio]  | [audio]  | [audio]


Maestro: Recorded MIDI in Classical Style

Source                       | Sample A | Sample B | Sample C | Sample D
Ground-truth human recording | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec reconstruction | [audio]  | [audio]  | [audio]  | [audio]
MIDI-VALLE generation        | [audio]  | [audio]  | [audio]  | [audio]
M2A generation               | [audio]  | [audio]  | [audio]  | [audio]


Pijama: Transcribed MIDI in Jazz Style

Source                       | Sample A | Sample B | Sample C | Sample D
Ground-truth human recording | [audio]  | [audio]  | [audio]  | [audio]
Piano-Encodec reconstruction | [audio]  | [audio]  | [audio]  | [audio]
MIDI-VALLE generation        | [audio]  | [audio]  | [audio]  | [audio]
M2A generation               | [audio]  | [audio]  | [audio]  | [audio]


Compatibility with EPRs: MIDI-VALLE vs. M2A


Three expressive performance rendering (EPR) models, M2M [1], VirtuosoNet [2], and DExter [3], are selected to evaluate how MIDI-VALLE and M2A perform when integrated into a two-stage music performance synthesis (MPS) pipeline. For each EPR system, four segments are generated for comparison.
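Conceptually, the two stages compose as score-to-expressive-MIDI followed by MIDI-to-audio. The sketch below shows that composition as glue code only: render_expressive_midi and synthesise_audio are hypothetical wrappers standing in for the EPR models and for MIDI-VALLE/M2A respectively, not published APIs.

```python
# Hypothetical glue code for the two-stage MPS pipeline:
# score -> expressive MIDI (EPR stage) -> audio (synthesis stage).
# All function names here are placeholders, not released APIs.
from pathlib import Path

def render_expressive_midi(score: Path, epr_model: str) -> Path:
    """Stage 1 (EPR): M2M, VirtuosoNet, or DExter adds human-like timing,
    dynamics, and articulation to the score; returns a performance MIDI."""
    ...  # call the chosen EPR model here

def synthesise_audio(performance_midi: Path, prompt_wav: Path) -> Path:
    """Stage 2: MIDI-VALLE (or M2A) renders the performance MIDI to audio;
    MIDI-VALLE is additionally conditioned on a 3-second prompt."""
    ...  # call the synthesiser here

for epr in ("M2M", "VirtuosoNet", "DExter"):
    perf_midi = render_expressive_midi(Path("score.musicxml"), epr_model=epr)
    audio = synthesise_audio(perf_midi, prompt_wav=Path("prompt.wav"))
```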


M2M

Source           | Sample A | Sample B | Sample C | Sample D
M2M + MIDI-VALLE | [audio]  | [audio]  | [audio]  | [audio]
M2M + M2A        | [audio]  | [audio]  | [audio]  | [audio]

VirtuosoNet

Source                   | Sample A | Sample B | Sample C | Sample D
VirtuosoNet + MIDI-VALLE | [audio]  | [audio]  | [audio]  | [audio]
VirtuosoNet + M2A        | [audio]  | [audio]  | [audio]  | [audio]

DExter

Source              | Sample A | Sample B | Sample C | Sample D
DExter + MIDI-VALLE | [audio]  | [audio]  | [audio]  | [audio]
DExter + M2A        | [audio]  | [audio]  | [audio]  | [audio]


Prompt Effects on MIDI-VALLE


The following audio samples demonstrate the effect of different prompts on expressive performances synthesised with MIDI-VALLE. In some cases, even when the acoustics of the audio prompt differ markedly from the ground truth, the synthesis is barely affected, as in Sample B. For segments like Sample A, however, these variations lead to noticeable differences in both timbre and the reproduction of the recording environment. This may stem from mismatches between the MIDI prompt and the target MIDI in note range, chords, harmonic content, rhythm, and other musical features. We do not investigate these effects further, as they fall outside our primary focus.
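In code terms, the study amounts to holding the target MIDI fixed while swapping the 3-second audio/MIDI prompt pair, along the lines of the sketch below. MIDIVALLEStub and its generate signature are hypothetical placeholders, not the actual MIDI-VALLE interface.

```python
# Hypothetical sketch of the prompt study: the target MIDI is fixed while the
# 3-second audio/MIDI prompt pair varies. The class and its signature are
# placeholders, not the real MIDI-VALLE interface.
class MIDIVALLEStub:
    def generate(self, target_midi: str, prompt_audio: str, prompt_midi: str):
        ...  # tokenise, run AR + NAR decoding, decode with Piano-Encodec

midi_valle = MIDIVALLEStub()
conditions = {
    "First 3 Seconds": ("target_first3s.wav", "target_first3s.mid"),
    "Prompt I":   ("prompt1.wav", "prompt1.mid"),
    "Prompt II":  ("prompt2.wav", "prompt2.mid"),
    "Prompt III": ("prompt3.wav", "prompt3.mid"),
}
for name, (wav, mid) in conditions.items():
    audio = midi_valle.generate(
        target_midi="target.mid",   # identical across conditions
        prompt_audio=wav,           # timbre and room acoustics come from here
        prompt_midi=mid,            # mismatch with target MIDI may degrade output
    )
```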


Sample A

Condition       | 3-second Prompt | Performance
Ground truth    | N/A             | [audio]
First 3 seconds | [audio]         | [audio]
Prompt I        | [audio]         | [audio]
Prompt II       | [audio]         | [audio]
Prompt III      | [audio]         | [audio]


Sample B

Condition       | 3-second Prompt | Performance
Ground truth    | N/A             | [audio]
First 3 seconds | [audio]         | [audio]
Prompt I        | [audio]         | [audio]
Prompt II       | [audio]         | [audio]
Prompt III      | [audio]         | [audio]


References


[1] Tang, J., Cooper, E., Wang, X., Yamagishi, J., & Fazekas, G. (2025). Towards an Integrated Approach for Expressive Piano Performance Synthesis from Music Scores. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
[2] Jeong, D., Kwon, T., Kim, Y., Lee, K., & Nam, J. (2019). VirtuosoNet: A Hierarchical RNN-Based System for Modeling Expressive Piano Performance. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) (pp. 908-915).
[3] Zhang, H., Chowdhury, S., Cancino-Chacón, C. E., Liang, J., Dixon, S., & Widmer, G. (2024). DExter: Learning and Controlling Performance Expression with Diffusion Models. Applied Sciences, 14(15), 6543.