Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores

Jingjing Tang, Erica Cooper, Xin Wang, Junichi Yamagishi, George Fazekas

Introduction

This demo page showcases the results of our research on transforming symbolic music scores into expressive piano performance audio. The approach combines a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, offering a streamlined method for converting inexpressive score MIDI files into rich, expressive piano performances.

Our integrated system is designed to generate expressive audio performances directly from score inputs by combining MIDI-to-MIDI (M2M) and MIDI-to-Audio (M2A) models. The M2M model is responsible for rendering expressive MIDI files, while the M2A model, which has been fine-tuned for this task, generates the corresponding audio outputs. This demo page illustrates the improvements achieved by fine-tuning the M2A model and compares the proposed system (M2M + M2A) with other existing systems.
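
To make the two-stage design concrete, the sketch below wires the stages together in Python. The M2MModel and M2AModel wrappers and their render/synthesise methods are hypothetical placeholders, not the interfaces of our actual models; their bodies fall back to pretty_midi's built-in synthesis so the script runs end to end.

# Minimal sketch of the two-stage score-to-audio pipeline described above.
# M2MModel and M2AModel are hypothetical stand-ins, not the released models.

import pretty_midi      # pip install pretty_midi
import soundfile as sf  # pip install soundfile


class M2MModel:
    """Stand-in for the Transformer-based EPR model (M2M)."""

    def render(self, score: pretty_midi.PrettyMIDI) -> pretty_midi.PrettyMIDI:
        # The real model predicts expressive timing, dynamics, and
        # articulation; this placeholder passes the score through unchanged.
        return score


class M2AModel:
    """Stand-in for the fine-tuned neural MIDI synthesiser (M2A)."""

    def synthesise(self, performance: pretty_midi.PrettyMIDI):
        # The real model generates the waveform with a neural synthesiser;
        # this placeholder uses pretty_midi's simple built-in synthesis.
        sample_rate = 44100
        return performance.synthesize(fs=sample_rate), sample_rate


def score_to_performance_audio(score_path: str, out_path: str) -> None:
    """Score MIDI -> expressive MIDI (M2M) -> performance audio (M2A)."""
    score = pretty_midi.PrettyMIDI(score_path)
    expressive = M2MModel().render(score)             # stage 1: M2M
    waveform, sr = M2AModel().synthesise(expressive)  # stage 2: M2A
    sf.write(out_path, waveform, sr)


if __name__ == "__main__":
    score_to_performance_audio("score.mid", "performance.wav")

Keeping the two stages decoupled also makes it straightforward to swap out either component, for example replacing the M2A stage with Pianoteq, as in the comparisons below.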

The evaluation conducted in this study highlights the system's effectiveness in reconstructing human-like expressiveness and capturing the acoustic ambience of environments such as concert halls and recording studios. The proposed system is the first of its kind to seamlessly convert inexpressive score MIDI files into expressive piano performance audio using purely deep learning models.

Fine-tuned M2A Model

This section demonstrates the improvements achieved by fine-tuning the M2A model. Audio samples (all drawn from the test set) before and after fine-tuning are presented below, along with the ground-truth audio recordings, allowing for a direct comparison of the model's performance.

[Audio table: Samples 1–4 for Groundtruth, M2A (Before Fine-tuning), and M2A (After Fine-tuning)]

Comparison of Different Systems with Proposed M2M + M2A

In this section, we compare the proposed M2M + M2A system with other systems. Audio samples generated by the following methods are provided for comparison:

  1. Groundtruth audio recording,
  2. Groundtruth MIDI + Pianoteq,
  3. Groundtruth MIDI + M2A,
  4. M2M output + Pianoteq,
  5. M2M output + M2A (proposed),
  6. Baseline,
  7. Score + Pianoteq.

The results demonstrate how the integrated system effectively balances musical expressiveness with audio quality, outperforming the baseline models.

[Audio table: Samples 1–4 for each of the seven systems listed above]

More and Longer Generation Samples from the Proposed M2M + M2A and Baseline Models

In this section, we present additional, longer samples generated by the proposed M2M + M2A system and the baseline system. These extended examples further illustrate how the integrated system balances musical expressiveness with audio quality, outperforming the baseline.

[Audio table: Samples 1–4 for M2M + M2A, Baseline, and Groundtruth]