Fundamentally, a novel isn't much different from a manga (ID, title, chapters).
The differences only appear at the FetchPages and FetchImages level (which obviously won't be called FetchImages for a novel).
MANGA/NOVEL detection
=======================
If a website is known to host only novels, we can use a generic decorator to extract the text.
The problem lies with Madara, Mangastream, HeanCMS and the like, where mangas and novels can sit in the same list and the difference only shows up when getting the "Pages".
For Madara, a particular CSS element tells us the content is a bunch of HTML. For Mangastream I've seen an isNovel JS variable, but I guess we can use a proper CSS selector as well. In HeanCMS, the API returns the HTML content for a novel chapter, or a list of pictures for a manga chapter.
That is the fundamental problem: at which level are we able to tell "this is a novel" or "this is a manga"? It varies from one website to another, and many websites handle both as similar content until it's time to display "the pages".
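To make the "decide at page level" idea concrete, here is a minimal sketch for the HeanCMS-style case, where the payload is either HTML or a list of pictures. The `ChapterContent` type and the `classifyChapter` function are hypothetical names for illustration, not existing code, and the payload shape is an assumption.

```typescript
// Hypothetical result type: not taken from any existing codebase.
type ChapterContent =
  | { kind: 'manga'; images: string[] }
  | { kind: 'novel'; html: string };

// Classify a raw chapter payload once it has been fetched.
// Assumption: the payload is either a list of image URLs or an HTML string.
function classifyChapter(payload: unknown): ChapterContent {
  if (Array.isArray(payload) && payload.every(item => typeof item === 'string')) {
    return { kind: 'manga', images: payload };
  }
  if (typeof payload === 'string') {
    return { kind: 'novel', html: payload };
  }
  throw new Error('Unrecognized chapter payload');
}
```

With a tagged result like this, the caller can branch on `kind` and the manga/novel decision stays local to each website's page-fetching code.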
What to do with content
=======================
So we've got the HTML, whether from the page or from the API. Now what?
There is the bloat removal step. I think by default we can remove script tags and the "onxxx" attributes. Beyond that, further cleanup will be needed depending on the website.
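A rough sketch of that default cleanup, working on the HTML string. `stripBloat` and its `extraPatterns` parameter are hypothetical names, and the regex approach is an assumption made to keep the example short: it will not cope with every malformed page (for instance attribute values containing `>`), so a DOM-based pass would likely be more robust.

```typescript
// Hypothetical helper: removes <script> blocks and inline "on..." handlers.
// `extraPatterns` stands in for the per-website cleanup rules.
function stripBloat(html: string, extraPatterns: RegExp[] = []): string {
  // Drop whole <script>...</script> blocks, including their content.
  let cleaned = html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');
  // Inside each tag only, drop attributes such as onclick="..." or onload='...'.
  cleaned = cleaned.replace(/<[^>]+>/g, tag =>
    tag.replace(/\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, ''),
  );
  // Apply website-specific removals last.
  for (const pattern of extraPatterns) {
    cleaned = cleaned.replace(pattern, '');
  }
  return cleaned;
}
```

For example, `stripBloat('<p onclick="x()">Hi</p><script>alert(1)</script>')` gives `<p>Hi</p>`.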
Some novel chapters come with pictures. Should we download them too? If so, should we rewrite the HTML to point at the downloaded image paths?
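If the answer is yes, fixing the HTML could look roughly like the sketch below. `rewriteImageSources` and the `downloaded` map (original URL to saved path) are hypothetical, and the regex only handles quoted `src` attributes.

```typescript
// Hypothetical helper: point <img> tags at locally saved files.
// `downloaded` maps the original image URL to the path it was saved under.
function rewriteImageSources(html: string, downloaded: Map<string, string>): string {
  return html.replace(
    /(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2/gi,
    (match, prefix: string, quote: string, src: string) => {
      const local = downloaded.get(src);
      // Leave the tag untouched when the image was not downloaded.
      return local === undefined ? match : `${prefix}${quote}${local}${quote}`;
    },
  );
}
```

Images missing from the map keep their remote URL, so a partially failed download still leaves a readable chapter.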
Can the user choose to save it as HTML or as a picture?
When saving as HTML, how do we handle theming? Should we ship themed HTML templates that the user can choose from? How do we handle previewing? Just previewing it as a picture is safer, I think.
Are we still using html2canvas to generate a picture from the text?
More questions incoming.