How to ignore Content-Transfer-Encoding when the text(mail) crawling #1449

Closed
anatomo opened this issue Jan 17, 2018 · 0 comments


anatomo commented Jan 17, 2018

Hello @marevol.

After asking #1442, I decoded the mail files to UTF-8 before crawling, and then crawled the decoded files.
However, the crawler still seems to parse the mail header (the header part uses a different encoding).
Could you advise how to ignore Content-Transfer-Encoding or the mail header?
(I want to crawl these files as text/plain or UTF-8.)

When I crawl this file, the result looks good:
test1.txt

But when I crawl this file, the message part is not shown (the digest field does not contain the message part):
test2.txt
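
For context, here is a minimal sketch of the kind of preprocessing I have in mind, in case it helps the discussion. It uses Python's standard `email` package to drop the header block and decode the body, so only plain text is left to crawl. It assumes the input is the original raw mail (RFC 822 style), and the function name `mail_to_plain_text` is only illustrative. Whether the crawler then treats the output as text/plain is an assumption I have not verified.

```python
from email import message_from_bytes, policy


def mail_to_plain_text(raw: bytes) -> str:
    """Return only the decoded text/plain body of a raw mail message.

    Hypothetical helper: headers are parsed and then discarded, and the
    Content-Transfer-Encoding (base64, quoted-printable, ...) is undone
    by the email package when the body is read.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    # Pick the text/plain part (or the message itself if it is not multipart).
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        return ""
    # get_content() decodes the transfer encoding and the declared charset.
    return body.get_content()
```

The returned string could then be written out as a UTF-8 `.txt` file and crawled in place of the original mail file.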

Thanks.

@anatomo anatomo closed this as completed Jan 25, 2018