Five 10-second samples are randomly selected from the ATEPP test set as listening examples, comparing reconstructions from the off-the-shelf Encodec codec with those from the fine-tuned Piano-Encodec. A sketch of the codec round trip follows the table.
Source
Sample A
Sample B
Sample C
Sample D
Sample E
Ground Truth
Encodec (w/o fine-tuning)
Piano-Encodec (fine-tuned)
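As a rough illustration of the round trip behind these examples, the sketch below encodes and decodes a clip with the off-the-shelf 24 kHz EnCodec model via the open-source encodec package. The fine-tuned Piano-Encodec checkpoint is assumed to load through the same interface, and the file names are placeholders, not paths from this project.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Off-the-shelf 24 kHz EnCodec model (the "w/o fine-tuning" condition);
# the fine-tuned Piano-Encodec weights would be loaded in its place.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # target bitrate in kbps

# "segment.wav" is a placeholder for a 10-second ATEPP test clip.
wav, sr = torchaudio.load("segment.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)             # discrete codec tokens per frame
    reconstruction = model.decode(frames)  # waveform decoded from the tokens

torchaudio.save("reconstruction.wav", reconstruction.squeeze(0), model.sample_rate)
```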
Synthesis Quality: MIDI-VALLE vs. M2A
To demonstrate the effectiveness of the proposed model, we compare MIDI-VALLE with the M2A model [1] on the ATEPP, Maestro, and Pijama datasets, with four randomly selected segments from each.
ATEPP: Transcribed MIDI in Classical Style
Source
Sample A
Sample B
Sample C
Sample D
Ground Truth Human Recording
Piano-Encodec Reconstruction
MIDI-VALLE Generation
M2A Generation
Maestro: Recorded MIDI in Classical Style
Source
Sample A
Sample B
Sample C
Sample D
Ground Truth Human Recording
Piano-Encodec Reconstruction
MIDI-VALLE Generation
M2A Generation
Pijama: Transcribed MIDI in Jazz Style
Source
Sample A
Sample B
Sample C
Sample D
Ground Truth Human Recording
Piano-Encodec Reconstruction
MIDI-VALLE Generation
M2A Generation
Compatibility with EPRs: MIDI-VALLE vs. M2A
Three expressive performance rendering (EPR) models are selected: M2M [1], VirtuosoNet [2], and DExter [3]. They are used to evaluate how MIDI-VALLE and M2A perform as the synthesis back end of a two-stage music performance synthesis (MPS) pipeline. For each EPR system, four segments are generated for comparison.
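The loop below is a minimal sketch of the resulting comparison grid (three EPR front ends, two synthesis back ends). The function names, bodies, and paths are placeholders; neither the EPR models nor MIDI-VALLE/M2A expose this exact interface.

```python
# Hypothetical two-stage MPS pipeline; function names and file handling
# are illustrative only and do not reflect the models' real APIs.

def render_expressive_midi(score_path: str, epr: str) -> str:
    """Stage 1 (EPR): predict expressive timing, dynamics and pedalling
    from a score. `epr` selects M2M, VirtuosoNet or DExter."""
    return f"{epr}/performance.mid"  # placeholder path for rendered MIDI

def synthesise_audio(performance_midi: str, backend: str) -> str:
    """Stage 2 (synthesis): turn the expressive MIDI into audio with
    MIDI-VALLE or M2A."""
    return f"{backend}/{performance_midi}.wav"  # placeholder output path

if __name__ == "__main__":
    for epr in ("M2M", "VirtuosoNet", "DExter"):
        midi = render_expressive_midi("score.mid", epr)
        for backend in ("MIDI-VALLE", "M2A"):
            print(synthesise_audio(midi, backend))
```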
M2M
Source
Sample A
Sample B
Sample C
Sample D
M2M + MIDI-VALLE
M2M + M2A
VirtuosoNet
Source
Sample A
Sample B
Sample C
Sample D
VirtuosoNet + MIDI-VALLE
VirtuosoNet + M2A
DExter
Source
Sample A
Sample B
Sample C
Sample D
DExter + MIDI-VALLE
DExter + M2A
Prompt Effects on MIDI-VALLE
The following audio samples demonstrate the effect of different prompts on expressive performances synthesised by MIDI-VALLE. In some cases, the synthesis is barely affected even when the acoustics of the audio prompt differ significantly from the ground truth, as in Sample B. For segments like Sample A, however, these variations lead to noticeable differences in both timbre and the reconstruction of the recording environment. This may stem from differences between the prompt MIDI and the target MIDI in note range, chords, harmonic features, rhythm, and other musical elements. We do not investigate these details further, as they fall outside our primary focus. A sketch of how paired 3-second prompts can be cut from a recording is given after the tables below.
Sample A
Condition
3-second Prompt
Performance
Ground Truth
N/A
First 3 Seconds
Prompt I
Prompt II
Prompt III
Sample B
Condition
3-second Prompt
Performance
Ground Truth
N/A
First 3 Seconds
Prompt I
Prompt II
Prompt III
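For reference, the snippet below shows one way paired 3-second prompts like those above could be cut from a reference recording and its MIDI. File names are placeholders, and this is a sketch rather than the project's actual preprocessing code.

```python
import torchaudio
import pretty_midi

PROMPT_SECONDS = 3.0

# Cut the audio prompt (file names are placeholders).
wav, sr = torchaudio.load("reference_performance.wav")
prompt_wav = wav[:, : int(PROMPT_SECONDS * sr)]
torchaudio.save("prompt.wav", prompt_wav, sr)

# Cut the matching MIDI prompt: keep notes starting inside the window and
# clip their offsets to the boundary so audio and MIDI stay aligned.
midi = pretty_midi.PrettyMIDI("reference_performance.mid")
prompt_midi = pretty_midi.PrettyMIDI()
for inst in midi.instruments:
    clipped = pretty_midi.Instrument(program=inst.program)
    for note in inst.notes:
        if note.start < PROMPT_SECONDS:
            clipped.notes.append(pretty_midi.Note(
                velocity=note.velocity, pitch=note.pitch,
                start=note.start, end=min(note.end, PROMPT_SECONDS)))
    prompt_midi.instruments.append(clipped)
prompt_midi.write("prompt.mid")
```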
References
[1] Tang, J., Cooper, E., Wang, X., Yamagishi, J., & Fazekas, G. (2025, April). Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
[2] Jeong, D., Kwon, T., Kim, Y., Lee, K., & Nam, J. (2019, November). VirtuosoNet: A Hierarchical RNN-based System for Modeling Expressive Piano Performance. In ISMIR (pp. 908-915).
[3] Zhang, H., Chowdhury, S., Cancino-Chacón, C. E., Liang, J., Dixon, S., & Widmer, G. (2024). DExter: Learning and Controlling Performance Expression with Diffusion Models. Applied Sciences, 14(15), 6543.