creation: integrate technical metadata in tracks dataframe (will fix #4)
mdeff committed Jun 17, 2020
1 parent 60b2a05 commit 00d5b71
Showing 1 changed file with 74 additions and 19 deletions.
93 changes: 74 additions & 19 deletions creation.ipynb
@@ -158,15 +158,7 @@
"source": [
"## 2 Format metadata\n",
"\n",
"Todo:\n",
"* Sanitize values, e.g. list of words for tags, valid links in `artist_wikipedia_page`, remove html markup in free-form text.\n",
" * Clean tags. E.g. some tags are just artist names.\n",
"* Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), 16bits?.\n",
"* Update duration from audio\n",
" * 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.\n",
" * 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56\n",
" * ffmpeg: Estimating duration from bitrate, this may be inaccurate\n",
" * Solution, decode the complete mp3: `ffmpeg -i input.mp3 -f null -`"
"Todo: sanitize values, e.g. convert tags to lists of words, validate links in `artist_wikipedia_page`, and remove HTML markup in free-form text. Also clean the tags: some are just artist names."
]
},
{
@@ -208,15 +200,6 @@
"tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -357,7 +340,79 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Merge DataFrames"
"### 2.4 Technical metadata from mp3 files\n",
"\n",
"Problem: the bitrate and duration returned by the FMA API are sometimes wrong (see, for example, the duration reported for track ID 2624). These errors cause the 30s excerpts to be extracted incorrectly (see this [GitHub issue](https://github.com/mdeff/fma/issues/4)).\n",
"\n",
"Solution: get these metadata (along with others) from the mp3 headers. The most accurate way to measure the duration is to decode the mp3 and count the decoded samples.\n",
"\n",
"Limitation: three files (track IDs 23430, 153189, 155249) could not be opened by [mutagen](https://github.com/quodlibet/mutagen). For those, `bit_rate` was set to the value given by [ffmpeg](https://ffmpeg.org) and `mode` to `UNKNOWN`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The extraction work is done by a script."
]
},
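The extraction script itself is not part of this diff. As a rough sketch of the approach described in section 2.4 (decode the whole mp3, count the samples, divide by the sample rate), something like the following could work. The function names, the choice of 16-bit PCM output, and the assumption that the channel count is already known from the header are illustrative and not taken from the repository.

```python
import subprocess

BYTES_PER_SAMPLE = 2  # s16le output: signed 16-bit little-endian PCM


def ffmpeg_decode_command(path):
    """Build an ffmpeg command that decodes `path` to raw 16-bit PCM on stdout."""
    return ['ffmpeg', '-nostdin', '-v', 'error', '-i', path,
            '-f', 's16le', '-acodec', 'pcm_s16le', '-']


def count_samples(n_bytes, channels):
    """Samples per channel contained in `n_bytes` of interleaved 16-bit PCM."""
    if channels <= 0:
        return 0  # Mirrors the notebook's convention: 0 channels means failure.
    return n_bytes // (BYTES_PER_SAMPLE * channels)


def duration_seconds(samples, sample_rate):
    """Duration in whole seconds, with 0 instead of a division by zero."""
    if sample_rate <= 0:
        return 0
    return round(samples / sample_rate)


def decode_and_count(path, channels, chunk_size=1 << 16):
    """Decode the complete file with ffmpeg and count samples per channel.

    Hypothetical helper: assumes ffmpeg is on the PATH and that it keeps the
    source channel count (its default when no -ac option is given).
    """
    n_bytes = 0
    with subprocess.Popen(ffmpeg_decode_command(path),
                          stdout=subprocess.PIPE) as proc:
        # Stream the decoded audio so long tracks never sit fully in memory.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b''):
            n_bytes += len(chunk)
    if proc.returncode != 0:
        raise RuntimeError('ffmpeg failed on {}'.format(path))
    return count_samples(n_bytes, channels)
```

Decoding every file is slow compared to reading headers, but it avoids the estimate-from-bitrate path that produces wrong durations for some tracks.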
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mp3_metadata = pd.read_csv('mp3_metadata.csv', index_col=0)\n",
"#assert (mp3_metadata.index == tracks.index).all()\n",
"\n",
"failed = mp3_metadata.index[mp3_metadata['channels'] == 0]\n",
"print('Could not extract metadata from {} tracks.'.format(len(failed)))\n",
"\n",
"mp3_metadata.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"duration = tracks.at[2624, 'track_duration']\n",
"print('API reported duration: {}'.format(duration))\n",
"duration = mp3_metadata.loc[2624, 'samples'] / mp3_metadata.loc[2624, 'sample_rate']\n",
"print('Real duration after decoding: {:.0f}s'.format(duration))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tracks['track_channels'] = mp3_metadata['channels']\n",
"tracks['track_bit_rate_mode'] = mp3_metadata['mode']\n",
"tracks['track_bit_rate'] = mp3_metadata['bit_rate']\n",
"tracks['track_sample_rate'] = mp3_metadata['sample_rate']\n",
"tracks['track_samples'] = mp3_metadata['samples']\n",
"duration = mp3_metadata['samples'] / mp3_metadata['sample_rate']\n",
"duration[tracks['track_samples'] == 0] = 0 # Fix NaN introduced with division by 0.\n",
"tracks['track_duration'] = duration.round().astype(int)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 Merge DataFrames"
]
},
{