An option to remove chapter title #1375

Closed
xeolod opened this issue Jul 9, 2024 · 10 comments

Comments

@xeolod

xeolod commented Jul 9, 2024

Is your feature request related to a problem? Please describe.
When downloading from royalroad, many novels repeat the chapter title in the chapter body, so each downloaded chapter ends up with two titles: one from the chapter title and one from the body text.

Describe the solution you'd like
An option to remove the chapter title, just like the existing option to remove author notes.

Describe alternatives you've considered
I tried adding a manual parser, but I couldn't make it work properly, so I am requesting this option.

Additional context
This option (Remove chapter title) might work for all hosts, as it would nullify the chapter title only.

@Kiradien
Collaborator

Kiradien commented Jul 9, 2024

Can you share an example of a chapter this actually happens in?

@xeolod
Author

xeolod commented Jul 10, 2024

Here you go. I have also linked the novel used below.

Also, novels downloaded from royalroad and webnovel now contain p class= and div data-ejs attributes, respectively, with random text. Some time ago the output contained only the novel text. If possible, please correct this so that only the novel text is shown.

royalroad

webnovel

@Kiradien
Collaborator

> Also the downloaded novels from royalroad and webnovel have this p class = and div data ejs, respectively with random text. Some time ago it would only have the novel text only. If possible, please correct them to show novel text only.

For future reference, data-ejs attributes were removed from webnovel in PR #1363. These changes aren't currently in the live build, and some junk data does still persist. They should, however, be included in the build linked here: #1368 (comment)

I'll check to see if something similar can be done for RR, but scrubbing classes isn't as cut and dried as removing entire attributes.
Either way, I'll give both of these a shot; I have a few ideas for both of these issues...
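
To illustrate why scrubbing classes is less clear-cut than removing an attribute: a whole data-ejs attribute can be dropped unconditionally, whereas a class value may mix legitimate names with generated ones, so each token has to be judged on its own. A minimal sketch, assuming the generated names are long alphanumeric tokens that mix letters and digits. `looksGenerated` and `scrubClassNames` are hypothetical names, and the heuristic is a guess for illustration, not the rule WebToEpub uses.

```javascript
// Hypothetical heuristic: treat a class token as generated if it is long,
// purely alphanumeric, and mixes letters with digits. This is an assumption
// for illustration, not the pattern Royal Road or WebToEpub actually uses.
function looksGenerated(token) {
    return token.length >= 20
        && /^[A-Za-z0-9]+$/.test(token)
        && /[A-Za-z]/.test(token)
        && /[0-9]/.test(token);
}

// Return the class attribute value with generated-looking tokens removed.
function scrubClassNames(classValue) {
    return classValue
        .split(/\s+/)
        .filter((token) => token !== "" && !looksGenerated(token))
        .join(" ");
}
```

In a DOM context the result would be written back to each element's className, and the class attribute removed altogether when the result is empty.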

Kiradien added a commit to Kiradien/WebToEpub that referenced this issue Jul 10, 2024
Removed random identifier generated className.
@dteviot
Owner

dteviot commented Jul 10, 2024

@Kiradien @xeolod
I'm going to suggest that the "double title removal" might be better done as a post-processing step using EpubEditor.
Logic might be something like:

  1. Find the H1 header, then the text in it.
  2. Search for any other text nodes with the same text.
  3. If any found, delete their enclosing element.

@Kiradien
Collaborator

As dteviot said above, that is probably the best way. I played around with a config to do the same, and it could be a bit funky, especially due to author notes. I have, however, pushed a PR with the cleanup code.

@xeolod
Author

xeolod commented Jul 11, 2024

> For future reference, data-ejs attributes were removed from webnovel in PR #1363. These changes aren't currently in the live build, and some junk data does still persist. They should, however be included in the build linked here: #1368 (comment)

Tested it on webnovel; almost all the junk data is removed. One div data-ejs attribute still remains, but I removed it using a regex.
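
A regex of the kind mentioned above might look like the following. This is only a sketch: it assumes the leftover attribute has the form data-ejs="..." with a quoted value, and `stripDataEjs` is a hypothetical name, not part of WebToEpub or EpubEditor. Regexes applied to markup are fragile, so the output is worth checking by eye.

```javascript
// Remove every data-ejs="..." (or '...') attribute from a chunk of XHTML.
// Assumes the value is quoted and contains no quote of the same kind.
function stripDataEjs(xhtml) {
    return xhtml.replace(/\s+data-ejs\s*=\s*(?:"[^"]*"|'[^']*')/g, "");
}

// Example call on a made-up fragment.
console.log(stripDataEjs('<div data-ejs="abc123"><p>Text</p></div>'));
```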

Kiradien added a commit that referenced this issue Jul 11, 2024
@dteviot
Owner

dteviot commented Jul 13, 2024

Test versions for Firefox and Chrome with Kiradien's Royal Road cleanup have been uploaded to https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8?usp=sharing.

dteviot added a commit to dteviot/EpubEditor that referenced this issue Jul 19, 2024
Remove title text when story text starts with copy of title.
See: dteviot/WebToEpub#1375
@dteviot
Owner

dteviot commented Jul 19, 2024

@xeolod

Try this script to remove duplicated title text.

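// `dom` (the chapter's document) and the top-level `return` statements
// assume EpubEditor runs this script as a function body.

// Text node holding the chapter title inside the first <h1>.
let titleNode = dom.querySelector("h1")?.firstChild;
let titleText = titleNode?.data;

// Accept only text nodes, other than the title itself, whose text equals the title.
let filter = (node) => {
    return (node !== titleNode) && (node.data === titleText)
        ? NodeFilter.FILTER_ACCEPT
        : NodeFilter.FILTER_SKIP;
};

let walker = dom.createTreeWalker(
    dom.body,
    NodeFilter.SHOW_TEXT,
    filter
);

// Remove the element enclosing the first duplicate, if there is one.
let node = walker.firstChild()?.parentNode;
if (node != null) {
    console.log(node.outerHTML);
    node.remove();
    return true;
}
return false;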

Tested with:

For my notes: 24 minutes work
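
The script above only matches a text node whose data is exactly equal to the title text. If the copy in the chapter body differs in whitespace or letter case, a looser comparison may help. A minimal sketch, assuming such cosmetic differences are the only ones that matter; `normalizeTitle` and `isSameTitle` are hypothetical helpers, not part of EpubEditor.

```javascript
// Collapse whitespace runs, trim, and lower-case so that cosmetic
// differences between the heading and the body copy are ignored.
function normalizeTitle(text) {
    return (text ?? "").replace(/\s+/g, " ").trim().toLowerCase();
}

// True when both strings are non-empty and equal after normalisation.
function isSameTitle(titleText, candidateText) {
    const title = normalizeTitle(titleText);
    return title !== "" && title === normalizeTitle(candidateText);
}
```

The equality check on node.data in the filter could then be swapped for a call such as isSameTitle(titleText, node.data).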

@xeolod
Author

xeolod commented Aug 5, 2024

Thanks, it's working.

@dteviot
Owner

dteviot commented Aug 23, 2024

@xeolod
Updated version (0.0.0.167) has been submitted to Firefox and Chrome stores.
Firefox version is available now.
The Chrome version might be available in anywhere from a few hours to 21 days.
