From 00d5b71988397874627552e4f55f44d4731577af Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C3=ABl=20Defferrard?=
Date: Wed, 15 Nov 2017 11:02:15 +0100
Subject: [PATCH] creation: integrate technical metadata in tracks dataframe (will fix #4)

---
 creation.ipynb | 93 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 74 insertions(+), 19 deletions(-)

diff --git a/creation.ipynb b/creation.ipynb
index a8542e20..8cb2b9d8 100644
--- a/creation.ipynb
+++ b/creation.ipynb
@@ -158,15 +158,7 @@
    "source": [
     "## 2 Format metadata\n",
     "\n",
-    "Todo:\n",
-    "* Sanitize values, e.g. list of words for tags, valid links in `artist_wikipedia_page`, remove html markup in free-form text.\n",
-    " * Clean tags. E.g. some tags are just artist names.\n",
-    "* Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), 16bits?.\n",
-    "* Update duration from audio\n",
-    " * 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.\n",
-    " * 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56\n",
-    " * ffmpeg: Estimating duration from bitrate, this may be inaccurate\n",
-    " * Solution, decode the complete mp3: `ffmpeg -i input.mp3 -f null -`"
+    "Todo: sanitize values, e.g. list of words for tags, valid links in `artist_wikipedia_page`, remove html markup in free-form text. Clean tags. E.g. some tags are just artist names."
    ]
   },
   {
@@ -208,15 +200,6 @@
     "tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -357,7 +340,79 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 2.4 Merge DataFrames"
+    "### 2.4 Technical metadata from mp3 files\n",
+    "\n",
+    "Problem: the bitrate and duration returned by the FMA API are sometimes wrong (look for example at the duration reported for track ID 2624). These errors cause us to then wrongly extract the 30s excerpts (see this [GitHub issue](https://github.com/mdeff/fma/issues/4)).\n",
+    "\n",
+    "Solution: get these metadata (among others) from the mp3 headers. The most accurate measure of duration is to decode the mp3 and count the number of frames.\n",
+    "\n",
+    "Limitation: three files (track IDs 23430, 153189, 155249) could not be opened by [mutagen](https://github.com/quodlibet/mutagen). As such, the `bit_rate` was set to the value given by [ffmpeg](https://ffmpeg.org) and the `mode` to `UNKNOWN`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The extraction work is done by a script."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mp3_metadata = pd.read_csv('mp3_metadata.csv', index_col=0)\n",
+    "#assert (mp3_metadata.index == tracks.index).all()\n",
+    "\n",
+    "failed = mp3_metadata.index[mp3_metadata['channels'] == 0]\n",
+    "print('Could not extract metadata from {} tracks.'.format(len(failed)))\n",
+    "\n",
+    "mp3_metadata.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "duration = tracks.at[2624, 'track_duration']\n",
+    "print('API reported duration: {}'.format(duration))\n",
+    "duration = mp3_metadata.loc[2624, 'samples'] / mp3_metadata.loc[2624, 'sample_rate']\n",
+    "print('Real duration after decoding: {:.0f}s'.format(duration))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tracks['track_channels'] = mp3_metadata['channels']\n",
+    "tracks['track_bit_rate_mode'] = mp3_metadata['mode']\n",
+    "tracks['track_bit_rate'] = mp3_metadata['bit_rate']\n",
+    "tracks['track_sample_rate'] = mp3_metadata['sample_rate']\n",
+    "tracks['track_samples'] = mp3_metadata['samples']\n",
+    "duration = mp3_metadata['samples'] / mp3_metadata['sample_rate']\n",
+    "duration[tracks['track_samples'] == 0] = 0 # Fix NaN introduced with division by 0.\n",
+    "tracks['track_duration'] = duration.round().astype(int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.5 Merge DataFrames"
    ]
   },
   {