[IMPROVEMENT] Filter bad words #1139

NilsIrl · 2019-12-08T15:24:42Z

Fix #1114

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.
I have used CCExtractor just a couple of times.
I have mentioned this change in the changelog.

This PR also reworks most of the capitalization code to provide a better interface.

This PR also removes one of the lists previously used for capitalization, spell_lower, as it was useless.

PS: I took the list of words from Wikipedia; I don't have that good of a vocab 😂

cfsmp3

There are a few issues there, but it's a decent PR. Should be quick to fix.

src/lib_ccx/ccx_common_common.c

src/lib_ccx/ccx_decoders_structs.h

src/lib_ccx/ccx_encoders_helpers.c

src/lib_ccx/ccx_encoders_helpers.h

src/lib_ccx/params.c

src/lib_ccx/ccx_encoders_helpers.c

aadibajpai · 2019-12-08T21:33:24Z

Just generally curious but would it be possible to match patterns? Like fuck* would match not only fuck but also fucking or fucker and so on. I found an old google archive with such a wordlist https://code.google.com/archive/p/badwordslist/downloads

NilsIrl · 2019-12-08T21:35:45Z

> I found an old google archive with such a wordlist https://code.google.com/archive/p/badwordslist/downloads

I also found a list from a university that was around 10k words. But most of the words were not profanity in themselves, so I gave up and just took some words to use as placeholders.

Examples of irrelevant words: gay, transgender, mum...
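
For illustration, the trailing-wildcard idea raised above could look roughly like the following C sketch. The helper name `matches_pattern` and the trailing `*` convention are assumptions made for this example and are not taken from the PR; a mild placeholder word is used.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of CCExtractor: returns 1 if `word` matches
 * `pattern`. A trailing '*' in the pattern means "any suffix"; otherwise the
 * whole word must match. Comparison is case-insensitive (ASCII only). */
static int matches_pattern(const char *pattern, const char *word)
{
	assert(pattern != NULL && word != NULL);

	size_t plen = strlen(pattern);
	int is_prefix = plen > 0 && pattern[plen - 1] == '*';
	size_t n = is_prefix ? plen - 1 : plen;

	for (size_t i = 0; i < n; i++)
	{
		/* This also stops at the end of `word`: its '\0' never equals
		 * a pattern character, so we return before reading past it. */
		if (tolower((unsigned char)pattern[i]) != tolower((unsigned char)word[i]))
			return 0;
	}
	/* Prefix patterns accept any remainder; exact patterns require the end of the word. */
	return is_prefix || word[n] == '\0';
}
```

With this sketch, `matches_pattern("darn*", "darning")` returns 1 while `matches_pattern("darn", "darning")` returns 0. Wildcard entries would not fit a plain binary search over exact words, so a list like this would likely need a linear scan or a prefix-aware lookup.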

NilsIrl · 2019-12-08T23:36:28Z

~~Acted on feedback~~ missed 2 actually

src/lib_ccx/ccx_common_common.c

src/lib_ccx/ccx_encoders_helpers.c

cfsmp3 · 2019-12-09T00:27:22Z

src/lib_ccx/ccx_encoders_helpers.c

@@ -468,6 +540,6 @@ void shell_sort(void *base, int nb, size_t size, int(*compar)(const void*p1, con

 void ccx_encoders_helpers_perform_shellsort_words(void)
 {
-	shell_sort(spell_lower, spell_words, sizeof(*spell_lower), string_cmp_function, NULL);
-	shell_sort(spell_correct, spell_words, sizeof(*spell_correct), string_cmp_function, NULL);
+	shell_sort(spell_lower.words,   spell_lower.len,   sizeof(*spell_lower.words),   string_cmp_function, NULL);


Seems strange to be passing a length there and also using sizeof, why?

NilsIrl · 2019-12-09

I didn't write that code but:

> Seems strange to be passing a length

So that the function knows how many elements it needs to sort.

> using sizeof

To know the size of each element.

The best would probably be to use a stdlib sorting function (qsort).

Also, I didn't sort the list of profane words, which means binary search won't work (so far it has worked because the list was already sorted).

I'll be fixing that tomorrow.

The use of shell sort was added in #41 because it was faster than quicksort.

Also, to clarify, spell_words was the length of the array (now spell_correct.len), so no behaviour should have changed.

cfsmp3 · 2019-12-09

That's not too reassuring :-) You need to test a lot and be sure that you are not introducing any bugs...
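
As a rough sketch of the two points in this thread (the element count and the element size are separate arguments, and binary search only works on a sorted list), here is how it could look with the standard `qsort` and `bsearch`. This is not the PR's code: the PR keeps the project's own `shell_sort`, and the struct name `word_list` and the helper names are hypothetical; only the `words` and `len` fields appear in the diff.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the PR's list type; only `words` and `len`
 * appear in the diff, the struct name is invented for this example. */
struct word_list
{
	char **words;
	size_t len;
};

/* Comparator for an array of `char *`: each argument points at one element. */
static int cmp_words(const void *a, const void *b)
{
	return strcmp(*(char *const *)a, *(char *const *)b);
}

static void sort_list(struct word_list *list)
{
	assert(list != NULL);
	/* Number of elements and size of one element are two separate arguments. */
	qsort(list->words, list->len, sizeof(*list->words), cmp_words);
}

/* Returns 1 if `word` is in the list. Only valid after sort_list(). */
static int list_contains(const struct word_list *list, const char *word)
{
	/* bsearch hands the key to the comparator as-is, so pass &word to make
	 * the key look like an array element (a pointer to a string pointer). */
	return bsearch(&word, list->words, list->len, sizeof(*list->words), cmp_words) != NULL;
}
```

Calling `list_contains` on an unsorted list may miss entries that are present, which is the failure mode described above for the unsorted profanity list.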

src/lib_ccx/params.c

cfsmp3 · 2019-12-10T16:40:40Z

@NilsIrl once you have tested it, write here specifically how you tested it, the output, etc., and we'll merge.

NilsIrl · 2019-12-17T21:35:06Z

It would probably be better to modify the subtitles before they are passed to the encoder. This abstracts these features away from encoders.

What do you think, @cfsmp3?

For example, right now I'm implementing the scc encoder and have to deal with this which is even worse because I need to implement the "previous" version, the version before this PR.

There also seems to be autodash and trim_subs which might be related (though I haven't looked into them yet and they don't appear on all encoders).

cfsmp3 · 2019-12-18T21:46:03Z

Not so easy - for example for 608 you'd need to modify the grid, which is fixed in size - it would be a pain in the ass. It's not a bad idea though, but I don't think it would save as much time as you'd think :-)

NilsIrl · 2019-12-27T16:59:53Z

Multi-word swear words will not work.

Fix 1: Remove multi-word swear words.
Fix 2: Add support for multi-word swear words (could this also be considered for capitalisation?).
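
A minimal sketch of why multi-word entries fail, assuming the filter looks up one word at a time and masks matches with asterisks. Both of those are assumptions about the approach; the function names and the placeholder list are hypothetical, and the lookup here is a case-sensitive linear scan rather than the sorted list with binary search used in the PR.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical exact-match lookup of the token [tok, tok + tok_len). */
static int in_list(const char *const *list, size_t len, const char *tok, size_t tok_len)
{
	for (size_t i = 0; i < len; i++)
		if (strlen(list[i]) == tok_len && strncmp(list[i], tok, tok_len) == 0)
			return 1;
	return 0;
}

/* Masks every listed word in `line` in place and returns how many were masked.
 * Tokens are maximal runs of letters, so a token never contains a space. */
static int censor_line(char *line, const char *const *list, size_t len)
{
	assert(line != NULL);

	int masked = 0;
	char *p = line;
	while (*p)
	{
		if (!isalpha((unsigned char)*p))
		{
			p++;
			continue;
		}
		char *start = p;
		while (isalpha((unsigned char)*p))
			p++;
		if (in_list(list, len, start, (size_t)(p - start)))
		{
			/* Same-length replacement: the line does not grow or shrink. */
			memset(start, '*', (size_t)(p - start));
			masked++;
		}
	}
	return masked;
}
```

With a list containing `"darn"` and `"holy cow"`, the line `"well darn it"` becomes `"well **** it"`, but `"holy cow indeed"` is left untouched: the tokens `holy` and `cow` are looked up separately and neither equals the two-word entry.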

canihavesomecoffee · 2019-12-29T21:35:28Z

src/lib_ccx/ccx_encoders_helpers.c

+	"Jesus fuck",
+	"Jesus Harold Christ",
+	"Jesus wept",
+	"Judas Priest",


I'm not sure this should be considered as a swear word :p

If there's ever a documentary on https://en.wikipedia.org/wiki/Judas_Priest, this will be annoying ;)

canihavesomecoffee · 2019-12-30T02:28:06Z

@NilsIrl The existing capfile (the dictionary) no longer works:

E:\GitHub\ccextractor\windows\Debug>ccextractorwin.exe --capfile ....\Dictionary\MattS_dictionary.txt E:\Downloads\c83f765c661595e1bfa4750756a54c006c6f2c697a436bc0726986f71f0706cd.ts
Error: There was an error processing the capitalization file.

Please fix this before we can merge.

cfsmp3 · 2020-01-01T21:17:53Z

@canihavesomecoffee all OK now?

canihavesomecoffee · 2020-01-01T21:53:59Z

> @canihavesomecoffee all OK now?

Didn't retest yet. Can do that tomorrow.

canihavesomecoffee · 2020-01-05T13:29:19Z

Dictionary still works as intended, built-in list also works.

Will trigger a final re-run of the Test Suite to check.

ccextractor-bot · 2020-01-05T14:26:29Z

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results:

Report Name	Tests Passed
Broken	12/13
DVB	3/7
DVR-MS	2/2
General	27/27
Hauppage	3/3
MP4	3/3
NoCC	10/10
Teletext	14/21
WTV	13/13
XDS	34/34
CEA-708	14/14
DVD	3/3
Options	86/86

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Your PR breaks these cases:

ccextractor -out=sami -latin1 -autoprogram 5b4e0a6034...
ccextractor -datapid 5603 -autoprogram -out=srt -latin1 -teletext 85c7fc1ad7...
ccextractor -autoprogram -out=srt -latin1 85271be4d2...
ccextractor -autoprogram -out=srt -latin1 4e56e88ba4...
ccextractor -autoprogram -out=ttxt -latin1 c0d2fba8c0...
ccextractor -autoprogram -out=ttxt -latin1 27d7a43dd6...
ccextractor -autoprogram -out=ttxt -latin1 efbe129086...
ccextractor -autoprogram -out=ttxt -latin1 e2e2b501e0...
ccextractor -autoprogram -out=ttxt -latin1 -datets dcada745de...
ccextractor -autoprogram -out=srt -latin1 -teletext -tpage 398 3b276ad8bf...
ccextractor -stdout -quiet -nofc 79a51f3500...
ccextractor -stdout -quiet -nofc 767b546f96...

Check the result page for more info.

ccextractor-bot · 2020-01-05T16:36:21Z

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results:

Report Name	Tests Passed
Broken	12/13
DVB	4/7
DVR-MS	2/2
General	26/27
Hauppage	3/3
MP4	2/3
NoCC	10/10
Teletext	21/21
WTV	8/13
XDS	0/34
CEA-708	0/14
DVD	0/3
Options	0/86

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Your PR breaks these cases:

ccextractor -out=sami -latin1 -autoprogram 5b4e0a6034...
ccextractor -autoprogram -out=srt -latin1 85271be4d2...
ccextractor -out=srt -latin1 da75bdee47...
ccextractor -out=srt -latin1 bd6f33a669...
ccextractor -out=srt -latin1 0e5e6b26be...
ccextractor -out=srt -latin1 a226cc302d...
ccextractor -out=srt -latin1 ae6327683e...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 725a49f871...
ccextractor -autoprogram -out=ttxt -xds -latin1 -ucla d037c7509e...
ccextractor -autoprogram -out=srt -latin1 -ucla d037c7509e...
ccextractor -autoprogram -out=smptett -latin1 -ucla e274a73653...
ccextractor -autoprogram -out=ttxt -xds -latin1 -ucla e274a73653...
ccextractor -autoprogram -out=ttxt -xds -latin1 -ucla 85058ad37e...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds b22260d065...
ccextractor -autoprogram -out=srt -latin1 -ucla b22260d065...
ccextractor -autoprogram -out=ttxt -latin1 -xds -ucla c813e713a0...
ccextractor -autoprogram -out=srt -latin1 -ucla c813e713a0...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 27fab4dbb6...
ccextractor -autoprogram -out=srt -latin1 -ucla 27fab4dbb6...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds bbd5bb52fc...
ccextractor -autoprogram -out=srt -latin1 -ucla bbd5bb52fc...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds b992e0cccb...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds d0291cdcf6...
ccextractor -autoprogram -out=ttxt -latin1 -ucla 7d2730d38e...
ccextractor -autoprogram -out=srt -latin1 -ucla 7d2730d38e...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds c8dc039a88...
ccextractor -autoprogram -out=srt -latin1 -ucla c8dc039a88...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 53339f3455...
ccextractor -autoprogram -out=srt -latin1 -ucla 53339f3455...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 83b03036a2...
ccextractor -autoprogram -out=srt -latin1 -ucla 83b03036a2...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 7d3f25c32c...
ccextractor -autoprogram -out=srt -latin1 -ucla 7d3f25c32c...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds f41d4c29a1...
ccextractor -autoprogram -out=srt -latin1 -ucla f41d4c29a1...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 88cd42b89a...
ccextractor -autoprogram -out=srt -latin1 -ucla 88cd42b89a...
ccextractor -autoprogram -out=srt -latin1 -2 -ucla 88cd42b89a...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 7f41299cc7...
ccextractor -autoprogram -out=srt -latin1 -ucla 7f41299cc7...
ccextractor -autoprogram -out=ttxt -latin1 -ucla -xds 0069dffd21...
ccextractor -autoprogram -out=ttxt -latin1 5ae2007a79...
ccextractor -autoprogram -out=ttxt -latin1 1e44efd810...
ccextractor -autoprogram -out=ttxt -latin1 add511677c...
ccextractor -out=srt -latin1 -autoprogram 29e5ffd34b...
ccextractor -svc 1 -out=txt -nobom -noru ea83ff7bcb...
ccextractor -svc 1 -out=txt f17524b53f...
ccextractor -svc 1 -out=txt da904de35d...
ccextractor -svc 1 -out=txt 80848c45f8...
ccextractor -svc 1 -out=txt -nobom -noru b5d6aad89f...
ccextractor -svc 1[EUC-KR] -out=txt -noru b5d6aad89f...
ccextractor -svc 1 -out=srt da904de35d...
ccextractor -svc 1 -out=sami da904de35d...
ccextractor -svc 1 -out=ttxt da904de35d...
ccextractor -svc 1[EUC-KR] b5d6aad89f...
ccextractor -svc 1[EUC-KR] -noru b5d6aad89f...
ccextractor -svc all da904de35d...
ccextractor -svc all[EUC-KR] b5d6aad89f...
ccextractor -svc 1,2[UTF-8],3[EUC-KR],54 -out=txt da904de35d...
ccextractor -autoprogram -out=srt -latin1 -1 a65d39ccb3...
ccextractor -autoprogram -out=srt -latin1 -2 a65d39ccb3...
ccextractor -autoprogram c83f765c66...
ccextractor -svc 1 c83f765c66...
ccextractor -in=ts c83f765c66...
ccextractor -out=srt c83f765c66...
ccextractor -out=sami c83f765c66...
ccextractor -out=dvdraw c83f765c66...
ccextractor -out=txt c83f765c66...
ccextractor -out=ttxt c83f765c66...
ccextractor -out=smptett c83f765c66...
ccextractor -out=spupng c83f765c66...
ccextractor -gt c83f765c66...
ccextractor -nogt c83f765c66...
ccextractor --fixpadding c83f765c66...
ccextractor -90090 c83f765c66...
ccextractor -mythtv c83f765c66...
ccextractor -pn 1 c83f765c66...
ccextractor -datapid 256 c83f765c66...
ccextractor -datastreamtype 2 c83f765c66...
ccextractor -datastreamtype 2 -streamtype 2 c83f765c66...
ccextractor -noautotimeref c83f765c66...
ccextractor -bom c83f765c66...
ccextractor -nobom c83f765c66...
ccextractor -unicode c83f765c66...
ccextractor -utf8 c83f765c66...
ccextractor -latin1 c83f765c66...
ccextractor -nofc c83f765c66...
ccextractor -nots c83f765c66...
ccextractor -trim c83f765c66...
ccextractor -sc c83f765c66...
ccextractor --capfile /repository/Dictionary/MattS_dictionary.txt c83f765c66...
ccextractor -unixts 5 -out=txt c83f765c66...
ccextractor -out=txt -datets c83f765c66...
ccextractor -out=txt -sects c83f765c66...
ccextractor -out=txt -UCLA c83f765c66...
ccextractor -out=txt -lf c83f765c66...
ccextractor -autodash -trim c83f765c66...
ccextractor -bi c83f765c66...
ccextractor -nobi c83f765c66...
ccextractor -bs 1M c83f765c66...
ccextractor -dru c83f765c66...
ccextractor -noru c83f765c66...
ccextractor -ru1 c83f765c66...
ccextractor -ru2 c83f765c66...
ccextractor -ru3 c83f765c66...
ccextractor -delay 200 c83f765c66...
ccextractor -startat 4 -endat 7 c83f765c66...
ccextractor -nocodec dvbsub c83f765c66...
ccextractor -debug -out=srt c83f765c66...
ccextractor -608 -out=srt c83f765c66...
ccextractor -708 -out=srt c83f765c66...
ccextractor -goppts -out=srt c83f765c66...
ccextractor -xdsdebug -out=srt c83f765c66...
ccextractor -vides -out=srt c83f765c66...
ccextractor -cbraw -out=srt c83f765c66...
ccextractor -nosync -out=srt c83f765c66...
ccextractor -fullbin -out=srt c83f765c66...
ccextractor -parsedebug -out=srt c83f765c66...
ccextractor -parsePAT -out=srt c83f765c66...
ccextractor -parsePMT -out=srt c83f765c66...
ccextractor -investigate_packets -out=srt c83f765c66...
ccextractor -in=ps e9b9008fdf...
ccextractor -in=es dc7169d7c4...
ccextractor -in=asf 6395b281ad...
ccextractor -in=wtv b46e9e8e3f...
ccextractor -in=bin 988d4e8bba...
ccextractor -in=raw fb79021542...
ccextractor -in=mp4 b2771c84c2...
ccextractor -mp4vidtrack 5df914ce77...
ccextractor -wtvconvertfix acf871cbfd...
ccextractor -wtvmpeg2 10f0f77cf4...
ccextractor --hauppauge d6df1b227a...
ccextractor -xmltv -out=null 96efd279cf...
ccextractor -codec dvbsub -out=spupng 85271be4d2...
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --endcreditstext "CCextractor Ends crdit Testing" addf5e2fc9...
ccextractor --endcreditsforatleast 3 --endcreditstext "CCextractor Ends crdit Testing" addf5e2fc9...
ccextractor --endcreditsforatmost 2 --endcreditstext "CCextractor Ends crdit Testing" addf5e2fc9...
ccextractor -tpage 801 4e56e88ba4...
ccextractor -tverbose 4e56e88ba4...
ccextractor -teletext 4e56e88ba4...
ccextractor -autoprogram -out=srt -bom -latin1 8849331dda...
ccextractor -stdout -quiet -nofc 79a51f3500...
ccextractor -stdout -quiet -nofc 767b546f96...

Check the result page for more info.

canihavesomecoffee · 2020-01-05T16:37:51Z

I'd say it's mergeable.

cfsmp3 requested changes Dec 8, 2019

aadibajpai reviewed Dec 8, 2019

cfsmp3 requested changes Dec 9, 2019

NilsIrl mentioned this pull request Dec 22, 2019

[FEATURE] SCC and CCD encoder #1154

Merged

7 tasks

NilsIrl added 19 commits December 28, 2019 22:56

Remove space before ';'

5b29db3

Add --kf option and parse files

99a12b8

Rename profanity_file to filter_profanity_file. Dump params

7d8499a

Use correct function

a7d2264

Add missing continue

2739602

Fix double free error

8ef89f6

Censor word when in dictionary

59a8c7a

Fix '\0' in output file

37e4d41

Make a fix_subtitles function

57eb179

Fix bug with asterisk

e3e810f

Remove lower_spell list as it's useless

f4961a0

Remove useless wrappers

e5575a0

Fix subtitles for more encoders

84cff4d

Feedback

b0e5eb0

Rename fix_subtitles to correct_spelling_and_censor_words_608

fc78fc3

Remove checking if function is called twice

f739d54

Sort both capitalization and profanity lists

70ac7f9

Fix error where wrong return valued is checked

b2d3a2f

Rename spell_correct to capitalization_list

5fcb31d

NilsIrl force-pushed the filter_bad_words branch from a0c861a to 5fcb31d Compare December 28, 2019 23:24

Fix syntax error because of forgotten brace

4fe32b1

canihavesomecoffee reviewed Dec 29, 2019

Fix crash

e1d3060

NilsIrl force-pushed the filter_bad_words branch from d009e9e to e1d3060 Compare January 1, 2020 17:16

Remove multi word profanity

af64fa8

CCExtractor deleted a comment from ccextractor-bot Jan 5, 2020

canihavesomecoffee merged commit af67596 into CCExtractor:master Jan 5, 2020

NilsIrl deleted the filter_bad_words branch January 5, 2020 19:37

NilsIrl mentioned this pull request Jan 10, 2020

[PROPOSAL] Add kid-friendly(-kf) parameter #1114

Closed

10 tasks

NilsIrl mentioned this pull request Jan 18, 2020

[bug] warning in build function in linux #1188

Closed

8 tasks
