Jingjing Tang1* Erica Cooper2 Xin Wang2 Junichi Yamagishi2 George Fazekas1
1Centre for Digital Music, Queen Mary University of London, UK
2National Institute of Informatics, Japan
Figure 1: Proposed Integrated System vs. Proposed Baseline System.
This demo page showcases the results of our research on transforming symbolic music scores into expressive piano performance audio. The approach combines a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, offering a streamlined method for converting inexpressive score MIDI files into rich, expressive piano performances.
Our integrated system is designed to directly generate expressive audio performances from score inputs by combining MIDI-to-MIDI (M2M) and MIDI-to-Audio (M2A) models. The M2M model is responsible for rendering expressive MIDI files, while the M2A model, which has been fine-tuned for this task, generates the corresponding audio outputs. This demo page illustrates the improvement achieved through the fine-tuning of the M2A model and presents a comparison of the proposed system (M2M + M2A) with other existing systems.
The evaluation conducted in this study highlights the system's effectiveness in reconstructing human-like expressiveness and capturing the acoustic ambiance of environments such as concert halls and recording studios. The proposed system is the first of its kind to seamlessly convert inexpressive score MIDI files into expressive piano performance audio using purely deep learning models.
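The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative sketch only: the class names, method signatures, sample rate, and the toy note perturbation/synthesis logic are all assumptions for demonstration, not the actual EPR Transformer or neural synthesiser used in this work.

```python
# Illustrative sketch of the score -> expressive MIDI -> audio pipeline.
# All names and numerical choices here are hypothetical stand-ins.
import numpy as np

class M2MModel:
    """Stand-in for the Transformer-based EPR model (score MIDI -> expressive MIDI).
    Here it merely perturbs onsets and velocities to mimic expressive rendering."""
    def render(self, score_notes):
        rng = np.random.default_rng(0)
        expressive = []
        for onset, pitch, velocity in score_notes:
            expressive.append((
                onset + rng.normal(0.0, 0.02),                      # timing deviation
                pitch,
                int(np.clip(velocity + rng.normal(0.0, 8.0), 1, 127))  # dynamics
            ))
        return expressive

class M2AModel:
    """Stand-in for the fine-tuned neural MIDI synthesiser (expressive MIDI -> audio).
    Here it renders decaying sinusoids instead of running a neural vocoder."""
    sample_rate = 24000
    def synthesise(self, notes, duration_s=2.0):
        t = np.arange(int(duration_s * self.sample_rate)) / self.sample_rate
        audio = np.zeros_like(t)
        for onset, pitch, velocity in notes:
            freq = 440.0 * 2 ** ((pitch - 69) / 12)            # MIDI pitch -> Hz
            env = (t >= onset) * np.exp(-3.0 * np.clip(t - onset, 0.0, None))
            audio += (velocity / 127.0) * env * np.sin(2 * np.pi * freq * t)
        return audio / max(1e-9, np.abs(audio).max())          # peak-normalise

def score_to_audio(score_notes):
    expressive = M2MModel().render(score_notes)   # stage 1: expressive rendering (M2M)
    return M2AModel().synthesise(expressive)      # stage 2: audio synthesis (M2A)

score = [(0.0, 60, 64), (0.5, 64, 64), (1.0, 67, 64)]  # (onset_s, pitch, velocity)
audio = score_to_audio(score)
```

The key design point is the clean interface between the stages: the M2M model outputs ordinary MIDI, so the M2A synthesiser (or a substitute such as Pianoteq) can consume either ground-truth or rendered performances interchangeably, which is exactly what the comparisons below exploit.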
This section demonstrates the improvements achieved by fine-tuning the M2A model. The audio samples (all from the test set) before and after fine-tuning are presented below, along with the ground-truth audio recordings, allowing a direct comparison of the model's performance.
| Model | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
| Groundtruth | (audio) | (audio) | (audio) | (audio) |
| M2A (Before Fine-tuning) | (audio) | (audio) | (audio) | (audio) |
| M2A (After Fine-tuning) | (audio) | (audio) | (audio) | (audio) |
In this section, we present a comparison between the proposed M2M + M2A system and other systems. Audio samples generated by the following methods are provided:
| System | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
| Groundtruth | (audio) | (audio) | (audio) | (audio) |
| Groundtruth midi + Pianoteq | (audio) | (audio) | (audio) | (audio) |
| Groundtruth midi + M2A | (audio) | (audio) | (audio) | (audio) |
| M2M output + Pianoteq | (audio) | (audio) | (audio) | (audio) |
| M2M output + M2A | (audio) | (audio) | (audio) | (audio) |
| Baseline | (audio) | (audio) | (audio) | (audio) |
| Score + Pianoteq | (audio) | (audio) | (audio) | (audio) |
In this section, we present additional, longer samples generated with our proposed M2M + M2A system and the baseline systems. The results demonstrate how the integrated system effectively balances musical expressiveness with audio quality, outperforming the baseline models.
| Model | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
| M2M+M2A | (audio) | (audio) | (audio) | (audio) |
| Baseline | (audio) | (audio) | (audio) | (audio) |
| Groundtruth | (audio) | (audio) | (audio) | (audio) |