Intent to showcase more results and/or stats? #14

Hello there!
I was made aware of this cool project, good stuff. Quick question: I know you said that it's mostly a project to satisfy your curiosity, but do you plan on showcasing more output from the tool? As a software engineer I'd be curious to examine the output you get before diving into understanding the codebase. Having like 10+ varied chart outputs that you consider representative of this tool's capabilities, and maybe also some stats on its performance along with what metrics it's evaluated on, would be super appreciated.
I understand if it's not something you planned though :)
Good luck with continuing this neat project!

Comments
There are some videos on Bilibili demonstrating the outputs, you can take a look: https://www.bilibili.com/video/BV1Gk4y1J7xN

As for metrics, the image generation field uses FID or IS, but it's difficult to apply them directly to charting. Feel free to share your ideas with me. Thank you for your support of this project!
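For reference, the FID computation itself is simple once you have a feature extractor; the feature extractor for charts is the missing piece. A minimal sketch, assuming numpy/scipy and that `feats_real` / `feats_fake` come from some chart encoder (illustrative names, not project code):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussian fits of two feature sets.
    feats_*: (n_samples, feat_dim) arrays from some chart/audio encoder."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```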
Hey, thanks for the quick reply! Allow me some time to study FD in general so I don't say nonsense. My background is in object detection, so I don't have all the GAN knowledge required right now. At first glance though, it seems like a promising metric if we were to figure out a representative way to encode a chart as some curve/distribution (uneducated guess lol).

In terms of results, I'm not entirely sure if you're simply limited by the state of the art in audio feature extraction. A GAN (or any type of ML model for that matter) can only do so much if the available audio features aren't super accurate. One can easily see that the tool currently misses a lot of clear sounds that don't have a strong/loud attack, such as piano notes during a section where other instruments kinda share the same frequency space. There are other issues such as consistency in what is layered, but I think that's only important once the problem of accurately pinpointing relevant notes is solved.

If you don't mind sparing me the search through the repo, what audio analysis tools does this project use to extract audio features? Looking forward to the evolution of this ^^

EDIT:
@mat100payette Thanks for your valuable suggestions!
The current training loss is the denoising loss, i.e., the ability of the model to recover the real VAE representation from noise. Maybe it could serve as a metric?
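If we used it that way, it would just be the same noise-prediction objective evaluated on held-out charts. A rough DDPM-style sketch, assuming PyTorch; the model signature and `alphas_cumprod` name are illustrative, not the actual MugDiffusion code:

```python
import torch

@torch.no_grad()
def heldout_denoising_loss(model, latents, alphas_cumprod, n_steps=1000):
    """Average noise-prediction MSE on held-out VAE latents (illustrative sketch)."""
    t = torch.randint(0, n_steps, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    # Broadcast the per-sample noise level over the latent dimensions.
    a_bar = alphas_cumprod[t].view(-1, *([1] * (latents.dim() - 1)))
    noisy = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise
    pred = model(noisy, t)  # in practice also conditioned on audio + prompt
    return torch.mean((pred - noise) ** 2).item()
```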
I used a mel spectrogram (https://github.com/Keytoyze/Mug-Diffusion/blob/master/mug/util.py#L133) with n_mels=128 to extract audio features. It's a common technique in speech recognition and audio processing. Currently I'm training a second-generation model designed to solve these problems. I think there may be two possible reasons:
I'd love to hear your comments or suggestions for improvements XD
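For context, the feature extraction boils down to something like this minimal librosa sketch with n_mels=128 (illustrative only; the actual implementation is in mug/util.py and may use different parameters):

```python
import librosa
import numpy as np

def extract_mel_features(audio_path, n_mels=128, hop_length=512):
    """Log-scaled mel spectrogram of shape (n_mels, n_frames)."""
    y, sr = librosa.load(audio_path, sr=None)  # keep the file's sample rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, hop_length=hop_length
    )
    return librosa.power_to_db(mel, ref=np.max)  # dB scale is friendlier to models

# Example: each column of the result covers hop_length / sr seconds of audio.
# feats = extract_mel_features("song.ogg")
```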
I'll throw ideas/concerns your way, feel free to consider just those you think make any sense, if any.

1- While I personally think it might be detrimental in a charting domain (because a lot of times some pretty quiet sections are still relevant to chart), maybe this is an indicator that your approach could be right. Not sure.

2-

3-
WOW, thank you very much for your comment!! I will read the BSRNN paper.

As for the consistency, I suspect that the model does not grasp the overall song well: mappers may use different patterns for the same instrument/sound in different charts, and the model may mix those different patterns within a single chart. I'll try to make the model learn to chart better from an overall perspective.

I think your idea about layering the sounds makes sense, but I wonder how to layer the charts in the dataset in an unsupervised way to generate the corresponding training data? Or do you just want to use this pipeline during the inference stage?

For the metrics, maybe some snapping-correctness measures could be used, e.g., the ratio of missing/overmapped notes, or the global offset of the chart. What do you think?
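As a rough sketch of what those numbers could look like, assuming notes are (time, column) pairs and a tolerance window (all names and the 30 ms tolerance are illustrative):

```python
def snapping_metrics(pred_notes, true_notes, tol=0.030):
    """pred_notes / true_notes: lists of (time_seconds, column) tuples.
    Returns (missing_ratio, overmap_ratio, global_offset_seconds)."""
    matched_true, matched_pred, offsets = set(), set(), []
    for i, (pt, pc) in enumerate(pred_notes):
        # Greedily match the closest unmatched ground-truth note in the same column.
        best, best_dt = None, tol
        for j, (tt, tc) in enumerate(true_notes):
            if j in matched_true or tc != pc:
                continue
            dt = abs(pt - tt)
            if dt <= best_dt:
                best, best_dt = j, dt
        if best is not None:
            matched_true.add(best)
            matched_pred.add(i)
            offsets.append(pt - true_notes[best][0])
    missing = 1.0 - len(matched_true) / max(len(true_notes), 1)
    overmap = 1.0 - len(matched_pred) / max(len(pred_notes), 1)
    offset = sorted(offsets)[len(offsets) // 2] if offsets else 0.0  # median offset
    return missing, overmap, offset
```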
I'm really unsure about how to approach same-sound consistency on a per-chart basis. I think it's heavily dependent on what input you're able to feed it, and that's limited to the trained VAE's output. I have some ideas, but I don't yet fully understand how you integrate the conditions (audio + prompt + noise level) in the denoising training. If you don't mind, could you please try to explain that process to me? Without that, I can't really provide any insight on the loss function.

Now for the metrics (specifically the VAE's performance), I think what you suggest could work. Keep in mind that if you don't analyze your VAE's output independently (i.e. not combined with the denoising), you're most likely pushing too much unpredictable randomness into the denoising training.

A popular metric in object detection is mAP, which is a single value that basically tells you "on average, did I detect relevant things, and how close to the real objects were the detections?". It might be possible to apply that same logic to the VAE if you consider the ground-truth notes as objects to detect (in this case, to decode), with the quality of the decoding being "how close to the real note was I". In object detection you consider a "good detection" to be one that has a big enough IoU (given a manually chosen threshold). In a chart's space, you'd simply ensure that the decoded note is within t time of the real note. You'd also have to factor in the note type though, which introduces classification too. To evaluate both localization + classification, a tool called TIDE is generally used nowadays. Here's a resource that gives a good idea of what it looks for (including a plot of the error-type distribution): https://towardsdatascience.com/a-better-map-for-object-detection-32662767d424

As you can see, all of these error types would be applicable to your VAE, except they'd be in a 1D space instead of 2D, which makes them even easier to define. I think that if you manage to get such information for your VAE, you'll have a much deeper understanding of the type and amplitude of noise your encoding output pushes into your denoising training. For example, if the VAE has a high amount of dupe errors, you'll know that it decodes duplicate notes close to each other instead of the single real note. If it has a lot of bkgd errors (decoding a note that doesn't exist), you'll know it creates random ghost notes during decoding, etc.

Hopefully this is helpful to you! I don't know if this project is in the context of your PhD or if it's a side project, so maybe you have time constraints I'm unaware of. Regardless, do let me know if/when some things are out of scope for you ^^
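To show what the 1D version of those error types could look like, here's a rough sketch (TIDE-inspired bucketing, not TIDE itself; the note representation and thresholds are made up):

```python
from collections import Counter

def classify_decoding_errors(decoded, truth, tol=0.030, loose_tol=0.120):
    """decoded / truth: lists of (time_seconds, note_type) tuples.
    Buckets each decoded note into TIDE-like 1D error types."""
    errors = Counter()
    claimed = set()  # ground-truth notes already matched by a decoded note
    for d_time, d_type in decoded:
        # Nearest ground-truth note in time, regardless of type.
        nearest = min(range(len(truth)),
                      key=lambda j: abs(truth[j][0] - d_time), default=None)
        if nearest is None or abs(truth[nearest][0] - d_time) > loose_tol:
            errors["bkgd"] += 1          # ghost note: nothing real nearby
            continue
        gt_time, gt_type = truth[nearest]
        close = abs(gt_time - d_time) <= tol
        if close and gt_type == d_type and nearest not in claimed:
            claimed.add(nearest)
            errors["correct"] += 1
        elif close and gt_type == d_type:
            errors["dupe"] += 1          # a second decode of an already-matched note
        elif close:
            errors["cls"] += 1           # right time, wrong note type
        else:
            errors["loc"] += 1           # right note, but snapped too far away
    errors["missed"] = len(truth) - len(claimed)
    return errors
```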
@mat100payette I have read the BSRNN paper. I found the core idea is similar to the mel spectrogram, which allocates more bands to low frequencies and fewer bands to high frequencies. BSRNN uses a separate model (Norm + MLP) to extract features in each band. As for transferring it to AI charting, I have two concerns:
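To make the band-split idea concrete, here's a rough PyTorch sketch of per-band Norm + MLP feature extraction (illustrative only, not the actual BSRNN implementation; the band edges are made up):

```python
import torch
import torch.nn as nn

class BandSplitEncoder(nn.Module):
    """Split a (batch, freq, time) spectrogram into uneven frequency bands
    and embed each band with its own LayerNorm + Linear layer."""
    def __init__(self, band_edges, emb_dim=128):
        super().__init__()
        # Finer bands at low frequencies, coarser bands at high frequencies.
        self.band_edges = band_edges
        self.norms = nn.ModuleList()
        self.mlps = nn.ModuleList()
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            width = hi - lo
            self.norms.append(nn.LayerNorm(width))
            self.mlps.append(nn.Linear(width, emb_dim))

    def forward(self, spec):  # spec: (batch, freq, time)
        spec = spec.transpose(1, 2)  # -> (batch, time, freq)
        feats = []
        for (lo, hi), norm, mlp in zip(
            zip(self.band_edges[:-1], self.band_edges[1:]), self.norms, self.mlps
        ):
            band = spec[..., lo:hi]        # (batch, time, band_width)
            feats.append(mlp(norm(band)))  # (batch, time, emb_dim)
        return torch.stack(feats, dim=1)   # (batch, n_bands, time, emb_dim)

# Example: a 513-bin spectrogram split into 6 uneven bands.
spec = torch.randn(2, 513, 100).abs()
enc = BandSplitEncoder([0, 16, 32, 64, 128, 256, 513])
print(enc(spec).shape)  # torch.Size([2, 6, 100, 128])
```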