Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMPROVEMENT] Filter bad words #1139

Merged
merged 22 commits into from
Jan 5, 2020

Conversation

NilsIrl
Copy link
Contributor

@NilsIrl NilsIrl commented Dec 8, 2019

Fix #1114

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have used CCExtractor just a couple of times.
  • I have mentioned this change in the changelog.

This PR reformats most of the capitalization parts as well to make a better interface.

This PR also removes one of the list that was used for capitalization before spell_lower as it was useless.

PS: also btw, for the list of words, I took them from Wikipedia, I don't have that good of a vocab 😂

Copy link
Contributor

@cfsmp3 cfsmp3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few issues there, but it's a decent PR. Should be quick to fix.

src/lib_ccx/ccx_common_common.c Show resolved Hide resolved
src/lib_ccx/ccx_decoders_structs.h Outdated Show resolved Hide resolved
src/lib_ccx/ccx_encoders_helpers.c Outdated Show resolved Hide resolved
src/lib_ccx/ccx_encoders_helpers.c Outdated Show resolved Hide resolved
src/lib_ccx/ccx_encoders_helpers.h Outdated Show resolved Hide resolved
src/lib_ccx/params.c Outdated Show resolved Hide resolved
@aadibajpai
Copy link
Member

Just generally curious but would it be possible to match patterns? Like fuck* would match not only fuck but also fucking or fucker and so on. I found an old google archive with such a wordlist https://code.google.com/archive/p/badwordslist/downloads

@NilsIrl
Copy link
Contributor Author

NilsIrl commented Dec 8, 2019

I found an old google archive with such a wordlist https://code.google.com/archive/p/badwordslist/downloads

I also found a list from a university that was around 10k words. But most of the words were not profanity in themselves so I gave up, and just took some words to use as placeholder.

e.g. of irrelevant word: gay, transgender, mum...

@NilsIrl
Copy link
Contributor Author

NilsIrl commented Dec 8, 2019

Acted on feedback missed 2 actually

src/lib_ccx/ccx_common_common.c Show resolved Hide resolved
src/lib_ccx/ccx_encoders_helpers.c Show resolved Hide resolved
src/lib_ccx/ccx_encoders_helpers.c Outdated Show resolved Hide resolved
@@ -468,6 +540,6 @@ void shell_sort(void *base, int nb, size_t size, int(*compar)(const void*p1, con

void ccx_encoders_helpers_perform_shellsort_words(void)
{
shell_sort(spell_lower, spell_words, sizeof(*spell_lower), string_cmp_function, NULL);
shell_sort(spell_correct, spell_words, sizeof(*spell_correct), string_cmp_function, NULL);
shell_sort(spell_lower.words, spell_lower.len, sizeof(*spell_lower.words), string_cmp_function, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems strange to be passing a length there and also using sizeof, why?

Copy link
Contributor Author

@NilsIrl NilsIrl Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write that code but:

Seems strange to be passing a length

So that it the function knows how many elements it needs

using sizeof

To know the size of each element

The best would probably to use a stdlib sorting function. (qsort)

Also I didn't sort the list of profane words, which means binary search won't work (so far it has worked because the list was already sorted).

I'll be fixing that tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of shell sort was added in #41 because it was faster than quicksort.

Also to clarify, spell_words was the length of the array (now spell_correct.len) so no behaviour should have changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not too reassuring :-) You need to test a lot and be sure that you are not introducing any bug...

src/lib_ccx/params.c Outdated Show resolved Hide resolved
@cfsmp3
Copy link
Contributor

cfsmp3 commented Dec 10, 2019

@NilsIrl once you have tested it write here how you tested specifically, the output, etc, and we'll merge.

@NilsIrl
Copy link
Contributor Author

NilsIrl commented Dec 17, 2019

It would probably be better to modify the subtitles before they are passed to the encoder. This abstracts these features away from encoders.

What do you think @cfsmp3

For example, right now I'm implementing the scc encoder and have to deal with this which is even worse because I need to implement the "previous" version, the version before this PR.

There also seems to be autodash and trim_subs which might be related (though I haven't looked into them yet and they don't appear on all encoders).

@cfsmp3
Copy link
Contributor

cfsmp3 commented Dec 18, 2019 via email

@NilsIrl NilsIrl mentioned this pull request Dec 22, 2019
7 tasks
@NilsIrl
Copy link
Contributor Author

NilsIrl commented Dec 27, 2019

Multi word swear words will not work

Fix 1: Remove multiple word swear words
Fix 2: Add the option to have multi word swear words, (could this be considered for capitalisation)

"Jesus fuck",
"Jesus Harold Christ",
"Jesus wept",
"Judas Priest",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this should be considered as a swear word :p

If there's ever a documentary on https://en.wikipedia.org/wiki/Judas_Priest, this will be annoying ;)

@canihavesomecoffee
Copy link
Member

@NilsIrl The existing capfile (the dictionary) no longer works:

E:\GitHub\ccextractor\windows\Debug>ccextractorwin.exe --capfile ....\Dictionary\MattS_dictionary.txt E:\Downloads\c83f765c661595e1bfa4750756a54c006c6f2c697a436bc0726986f71f0706cd.ts
Error: There was an error processing the capitalization file.

Please fix this before we can merge.

@cfsmp3
Copy link
Contributor

cfsmp3 commented Jan 1, 2020

@canihavesomecoffee all OK now?

@canihavesomecoffee
Copy link
Member

@canihavesomecoffee all OK now?

Didn't retest yet. Can do that tomorrow.

@canihavesomecoffee
Copy link
Member

Dictionary still works as intended, built-in list also works.

Will trigger a final re-run of the Test Suite to check.

@CCExtractor CCExtractor deleted a comment from ccextractor-bot Jan 5, 2020
@CCExtractor CCExtractor deleted a comment from ccextractor-bot Jan 5, 2020
@ccextractor-bot
Copy link
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results:

Report Name Tests Passed
Broken 12/13
DVB 3/7
DVR-MS 2/2
General 27/27
Hauppage 3/3
MP4 3/3
NoCC 10/10
Teletext 14/21
WTV 13/13
XDS 34/34
CEA-708 14/14
DVD 3/3
Options 86/86

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Your PR breaks these cases:


Check the result page for more info.

@ccextractor-bot
Copy link
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results:

Report Name Tests Passed
Broken 12/13
DVB 4/7
DVR-MS 2/2
General 26/27
Hauppage 3/3
MP4 2/3
NoCC 10/10
Teletext 21/21
WTV 8/13
XDS 0/34
CEA-708 0/14
DVD 0/3
Options 0/86

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Your PR breaks these cases:


Check the result page for more info.

@canihavesomecoffee
Copy link
Member

I'd say it's mergeable.

@canihavesomecoffee canihavesomecoffee merged commit af67596 into CCExtractor:master Jan 5, 2020
@NilsIrl NilsIrl deleted the filter_bad_words branch January 5, 2020 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PROPOSAL] Add kid-friendly(-kf) parameter
5 participants