-
-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v0.5.6 - Performance Part 2 (Is that a new scan loop?) #1500
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* Staging the code for the new scan loop. * Implemented a basic idea of changes on drives triggering scan loop. Issues: 1. Scan by folder does not work, 2. Queuing system is very hacky and needs a separate thread, 3. Performance degregation could be very real. * Started writing unit test for new loop code * Implemented a basic method to scan a folder path with ignore support (not implemented, code in place) * Added some code to the parser to build out the idea of processing series in batches based on some top level folder. * Scan Series now uses the new code (folder based parsing) and now handles the LocalizedSeries issue. * Got library scan working with the new folder-based scan loop. Updated code to set FolderPath (for improved scan times and partial scan support). * Wrote some notes on update library scan loop. * Removed migration for merge * Reapplied the SeriesFolder migration after merge * Refactored a check that used multiple db calls into one. * Made lots of progress on ignore support, but some confusion on underlying library. Ticket created. On hold till then. * Updated Scan Library and Scan Series to exit early if no changes are on the underlying folders that need to be scanned. * Implemented the ability to have .kavitaignore files within your directories and Kavita will parse them and ignore files and directories based on rules within them. * Fixed an issue where ignore files nested wouldn't stack with higher level ignores * Wrote out some basic code that showcases how we can scan series or library based on file events on the underlying system. Very buggy, needs lots of edge case testing and logging and dupplication checking. * Things are working kinda. I'm getting lost in my own code and complexity. I'm not sure it's worth it. * Refactored ScanFiles out to Directory Service. * Refactored more code out to keep the code clean. * More unit tests * Refactored the signature of ParsedSeries to use IList. Started writing unit tests and reworked the UpdateLibrary to work how it used to with new scan loop code (note: using async update library/series does not work). * Fixed the bug where processSeriesInfos was being invoked twice per series and made the code work very similar to old code (except loose leaf files dont work) but with folder based scanning. * Prep for unit tests (updating broken ones with new implementations) * Just some notes. Not sure I want to finish this work. * Refactored the LibraryWatcher with some comments and state variables. * Undid the migrations in case I don't move forward with this branch * Started to clean the code and prepare for finishing this work. * Fixed a bad merge * Updated signatures to cleanup the code and commit to the new strategy for scanning. * Swapped out the code with async processing of series on a small library * The new scan loop is working in both Sync and Async methods. The code is slow and not optimized. This represents a good point to start polling and applying optimizations. * Refactored UpdateSeries out of Scanner and into a dedicated file. * Refactored how ProcessTasks are awaited to allow more async * Fixed an issue where side nav item wouldn't show correct highlight and migrated to OnPush * Moved where we start to stopwatch to encapsulate the full scan * Cleaned up SignalR events to report correctly (still needs a redesign) * Remove the "remove" code until I figure it out * Put in extremely expensive series deletion code for library scan. * Have Genre and Tag update the DB immediately to avoid dup issues * Taking a break * Moving to a lock with People was successful. Need to apply to others. * Refactored code for series level and tag and genre with new locking strategy. * New scan loop works. Next up optimization * Swapped out the Kavita log with svg for faster load * Refactored metadata updates to occur when the series are being updated. * Code cleanup * Added a new type of generic message (Info) to inform the user. * Code cleanup * Implemented an optimization which prevents any I/O (other than an attribute lookup) for Library/Series Scan. This can bring a recently updated library on network storage (650 series) to fully process in 2 seconds. Fixed a bug where File Analysis was running everytime for each non-epub file. * Fixed ARM x64 builds not being able to view PDF cover images due to a bad update in DocNet. * Some code cleanup * Added experimental signalr update code to have a more natural refresh of library-detail page * Hooked in ability to send new series events to UI * Moved all scan (file scan only) tasks into Scan Queue. Made it so scheduled ScanLibraries will now check if any existing task is being run and reschedule for 3 hours, and 10 mins for scan series. * Implemented the info event in the events widget and added a clear all button to dismiss all infos and errors. Added --event-widget-info-bg-color * Remove --drawer-background-color since it's not used * When new series added, inject directly into the view. * Some debug code cleanup * Fixed up the unit tests * Ensure all config directories exist on startup * Disabled Library Watching (that will go in next build) * Ensure update for series is admin only * Lots of code changes, scan series kinda works, specials are splitting, optimizations are failing. Demotivated on this work again. * Removed SeriesFolder migration * Added the SeriesFolder migration * Added a new pipe for dates so we can provide some nicer defaults. Added folder path to the series detail. * The scan optimizations now work for NTFS systems. * Removed a TODO * Migrated all the times to use DateTime.Now and not Utc. * Refactored some repo calls to use the includes flag pattern * Implemented a check for the library scan optimization check to validate if the library was updated (type change, library rename, folder change, or series deleted) and let the optimization be bypassed. * Added another optimization which will use just folder attribute of last write time if the drive is not NTFS. * Fixed a unit test * Some code cleanup
* Fixed collection cover images not rendering * added a try/catch on sending email, so we fail silently if it doesn't send. * Fixed Go Back not returning to last scroll position due to layoutmode change resetting, despite nothing changing. * Fixed a bug where when turning between pages on default mode, the height calculations could get skewed. * Fixed a missing case for card item where it wouldn't show tooltip title for series.
* Refactored ScanSeries to avoid a lot of extra work and fixed a bug where Scan Series would invoke the processing twice. Refactored the series selection code during process such that we use Localized Name as well, for cases where the original name was changed. Undid an optimization around Last Write time, since Linux file systems match how NTFS works. * Fixed part of the query * Added a NormalizedLocalizedName for quick searching in which a series needs grouping. Reworked scan loop code a bit to ensure we don't do extra work. Tweaked the widget logic to help display better and not show "Nothing going on here". * Fixed a bug where archives with ._ files would be counted as valid files, while they are actually just metadata files on Mac's. * Fixed a broken unit test
* Simplify parent lookup with Directory.GetParent * Address comments
* Added Last Folder Scanned time to series info modal. Tweaked the info event detail modal to have a primary and thus be auto-dismissable * Added an error event when multiple series are found in processing a series. * Fixed a bug where a series could get stuck with other series due to a bad select query. Started adding the force flag hook for the UI and designing the confirm. Confirm service now also has ability to hide the close button. Updated error events and logging in the loop, to be more informative * Fixed a bug where confirm service wasn't showing the proper body content. * Hooked up force scan series * refresh metadata now has force update * Fixed up the messaging with the prompt on scan, hooked it up properly in the scan library to avoid the check if the whole library needs to even be scanned. Fixed a bug where NormalizedLocalizedName wasn't being calculated on new entities. Started adding unit tests for this problematic repo method. * Fixed a bug where we updated NormalizedLocalizedName before we set it. * Send an info to the UI when series are spread between multiple library level folders. * Added some logger output when there are no files found in a folder. Return early if there are no files found, so we can avoid some small loops of code. * Fixed an issue where multiple series in a folder with localized series would cause unintended grouping. This is not supported and hence we will warn them and allow the bad grouping. * Added a case where scan series fails due to the folder being removed. We will now log an error * Normalize paths when finding the highest directory till root. * Fixed an issue with Scan Series where changing a series' folder to a different path but the original series folder existed with another series in it, would cause the series to not be deleted. * Fixed some bugs around specials causing a series merge issue on scan series. * Removed a bug marker * Cleaned up some of the scan loop and removed a test I don't need. * Remove any prompts for force flow, it doesn't work well. Leave the API as is though. * Fixed up a check for duplicate ScanLibrary calls
* When we navigate from a page then back, resume back on the last scroll key (if clicked) * Resume jump key position when navigating back to a page. Removed some extra blank space on collection detail when a collection doesn't have a summary or cover image. * Ignore progress events on series cards * Added a url to swagger for /, which could be reverse proxy url
* Misc fixes - Fixed modal being stretched when not needed. - Fixed Logo vertical align - Fixed drawer content scroll, and from it being squished due to overridden by bootstrap. * series detail cover image stretch fix - Fixes: Fixes series detail cover image being stretched on larger resolutions * fixing empty lists scrollbar * Fixing want to read error * fixing unnecessary scrollbar * Fixing recently updated tooltip
* Hooked in a server setting to enable/disable folder watching * Validated the file rename change event * Validated delete file works * Tweaked some logic to determine if a change occurs on a folder or a file. * Added a note for an upcoming branch * Some minor changes in the loop that just shift where code runs. * Implemented ScanFolder api * Ensure we restart watchers when we modify a library folder. * Fixed a unit test
* Updated scan time for watcher to 30 seconds for non-dev. Moved ScanFolder off the Scan queue as it doesn't need to be there. Updated loggers * Fixed jumpbar missing * Tweaked the messaging for CoverGen * When we return early due to nothing being done on library and series scan, make sure we kick off other tasks that need to occur. * Fixed a foreign constraint issue on Volumes when we were adding to a new series. * Fixed a case where when picking normalized series, capitalization differences wouldn't stack when they should. * Reduced the logging output on dev and prod settings. * Fixed a bug in the code that finds the highest directory from a file, where we were not checking against a normalized path. * Cleaned up some code * Fixed broken unit tests
* Added a ToList() to avoid a bug where a person could be removed from a list while iterating over the list. * When deleting a series, want to read page will now automatically remove that series from the view. * Fixed a series lookup which was ignoring format * Ignore XML comment warnings * Removed a note since it was already working that way * Fixed unit test
* Tweaked a Migration to log correctly only if something is going to be done. * Refactored Reading List Controller code into a dedicated service and cleaned up some methods that aren't needed anymore. * Fixed a bug where adding a new item to a reading list wasn't adding it at the end. * Fixed an issue where collection page would re-render the same covers on multiple items. * Fixed a missing margin-top which made the page extras drawer not render correctly and hence unclosable on small screens. * Added some timeout on manage users screen to give data time to flush. Added a dedicated token log for account flows, in case url encoding plays a part (but from testing it doesn't). * Reverted back to building for ES6 instead of es2020 for old Safari 12.5.5 browsers (10MB difference in build size). * Cleaned up the logic in removing series not found during scan loop. * Tweaked the timings for Library Watcher to 1 min and reprocess queue every 30 seconds.
* Fixed a bootstrap bug * Fixed repeating images on collection detail * Fixed up some logic in library watcher which wasn't processing all of the queue. * When parsing non-epubs in Book library, use Manga parsing for Volume support to better support Light Novels * Fixed some bugs with the tachiyomi plugin api's for progress tracking
* Adding Health controller - Added: Added API endpoint for a health check to streamline docker healthy status. * review comment fixes
* Refactored Library Watcher to use Hangfire under the hood. * Support .kavitaignore at root level. * Refactored a lot of the library watching code to process faster and handle when FileSystemWatcher runs out of internal buffer space. It's still not perfect, but good enough for basic use. * Make folder watching as experimental and default it to off by default. * Revert #1479 * Tweaked the messaging for OPDS to remove a note about download role. Moved some code closer to where it's used. * Cleaned up how the events widget reports * Fixed a null issue when deleting series in the UI * Cleaned up some debug code * Added more information for when we skip a scan * Cleaned up some logging messages in CoverGen tasks * More log message tweaks * Added some debug to help identify a rare issue * Fixed a bug where save bookmarks as webp could get reset to false when saving other server settings * Updated some documentation on library watcher. * Make LibraryWatcher fire every 5 mins
…1487) * Sort series by chapter number only when some chapters have no volume information * Implement a Default static instance of ChapterSortComparer * Further use Default static Comparers * Add missing ToLit() as per comments
* Update to use SQLIte for Hangfire to retain information on tasks * Updated all external links to have noopener noreferrer * When watching folders, ensure the folders exist before creating watchers. * Tweaked the messaging for Email Service and added link to the project.
* Fixed a bug where typeahead wouldn't automatically show results on relationship screen without an additional click. * Tweaked the code which checks if a modification occured to check on seconds rather than minutes * Clear cache will now clear temp/ directory as well. * Fixed an issue where Chrome was caching api responses when it shouldn't had. * Added a cleanup temp code * Ensure genres get removed during series scan when removed from metadata. * Fixed a bug where all epubs with a volume would show as Volume 0 in reading list * When a scan is in progress, don't let the user delete the library.
* Refactored invite user flow to separate error handling on create user flow and email flow. This should help users that have unique situations. * Switch to using files to check LastWriteTime. Debug code in for Robbie to test on rclone * Updated Parser namespace. Changed the LastWriteTime to check all files and folders.
* Added a no data section to collection detail. * Remove an optimization for skipping the whole library scan as it wasn't reliable * When resetting password, ensure the input is colored correctly * Fixed setting new password after resetting, throwing an error despite it actually being successful. Fixed incorrect messaging for Password Reset page. * Fixed a bug where reset password would show the side nav button and skew the page. Updated a lot of references to use Typed version for formcontrols. * Removed a migration from 0.5.0, 6 releases ago. * Added a null check so we don't throw an exception when connecting with signalR on unauthenticated users.
* Tweaked log messaging for library scan when no files were scanned. * When a theme that is set gets removed due to a scan, inform the user to refresh. * Fixed a typo and make Darkness -> Brightness * Make download theme files allowed to be invoked by non-authenticated users, to allow new users to get the default theme. * Hide all series side nav item if there are no libraries exposed to the user * Fixed an API for Tachiyomi when syncing progress * Fixed dashboard not responding to Series Removed and Added events. Ensure we send SeriesRemoved events when they are deleted. * Reverted Hangfire SQLite due to aborted jobs being resumed, when they shouldnt. Fixed some scan loop issues where cover gen wouldn't be invoked always on new libraries.
# FIxed - Fixed: Fixed an issue with series detail cover when scaled down.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
For the past 3 months, I have been working tirelessly to rebuild our main scan loop, which is not only the most complicated part of Kavita but also the most critical. It translates files on your system into the series, volumes, and chapters you know in the UI. This goal of a new loop was so difficult, I quit 3 times, but finally made it through. The new loop is finally ready to share with you all, thanks to many of our nightly users helping test and provide feedback.
What is this new scan loop I speak of? The new scan loop is a new way (and a new set of requirements) to scan your disks and process files extremely efficiently. This changes Kavita to using a folder-based parsing mechanism rather than the previous mechanism where we would parse all files, group then perform DB operations in groups of 50. The reasoning for this drastic shift is that the old method failed to scale on users with massive libraries (we have one user with half a million files per stats) and would take an exceedingly long time for networked (rclone) users.
To dive in, we need to talk about some requirement changes. In order to utilize folder-based scanning, we have to remove the flexibility of users can put files anywhere they want and Kavita will group them for you. When I started, I chose this mechanism based on feedback from Plex and Jellyfin, but in reality most of our users have proper folder-based structures, so this new requirement change shouldn't affect many users.
Kavita expects all series to be in their own folder, so a library cannot have loose leaf files in it. You can still do /library root/manga/Publisher/Series A/, /library root/manga/Publisher/Series B/, just not /library root/my book.pdf. Kavita will now scan folder by folder an for each folder, group the files in and process them as a single series in a task (async). This task will do all DB operations and invoke Cover Gen and File Analysis (word counting) for just that series. This change is important because it will now process series-by-series and update your UI in that manner. Hence you should be able to more quickly see your series and use them completely than before on first scan. This however comes with a drawback due to increased I/O and trips to the DB, the library scan is slower (on first scan). For me, previously it took 30 mins to scan all files into the DB (without covers or word counts), on the new loop it takes 80 mins (but has covers and word counts). The main difference is on the old loop I had to wait at least 10 mins till the first chunk inserts, with the new loop, you don't have that issue. This also means the Chunk errors users sometimes receive should no longer be an issue and if there is an error, that series will tell you directly. You can find more information about the new Scan loop on the wiki.
With that said, this also allows some deep optimizations. Kavita will no longer do any I/O (except an attribute lookup on folders) if a folder hasn't changed since the last scan time. This single optimization means that my library scan can run in 5 seconds for my 660 series library, as it skips work on anything that hasn't changed. From the nightly testers, many users have seen drastic time reduction, esp our rclone users. This also allows for 2 enhancements, the ability to ignore certain files/folders (aka .kavitaignore) and the ability to hook into Kavita and say a folder has changed and let Kavita kick off a scan.
Next up is .kavitaignore, inspired from .plexignore and a feature request as well, users can now tell Kavita to ignore files and folders at any level of a scan. You can have at library root to have a global set of rules, like ignore cover.png, or you can have nested into your folders. You can have one for each folder and Kavita will intelligently combine the rules as it moves down the folder tree. This is another degree of freedom which allows you to not only filter, but speed up scans as well. This feature is experimental and still being tested and tweaked, but works in many cases. Give it a go and report back on issues faced or clever tricks that we can enhance our wiki with.
Lastly, we have folder watching. This new, experimental feature is really something cool. Why wait for nightly scans, when Kavita can pickup on changes and instantly run an appropriate scan? This new feature does just that. When a change is detected, a task is queued up and after 5 minutes executed. Any changes that related to that first task, will get disregarded. Please note, this is experimental and likely to not work if you download directly to your library. This feature will be fletched out further in later releases.
TLDR: There are new requirements for what Kavita expects from file layout. Read about it on the wiki. If you scan without conforming your layout, you may run into issues.
If you have been enjoying Kavita, please consider donating or backing me. Kavita is primarily built by just me in my spare time and each release is hundreds of hours of work.
Added
Changed
Fixed
Removed
KavitaEmail