scraped telegram posts having line breaks like the original post #687

mohammadali-seifkashani · 2023-01-25T06:04:44Z

When I tested snscrape on Telegram channels, I noticed that the scraped posts lose their line breaks, which caused problems for my text analysis. To fix this, I added the static method "get_post_text()".

I also did some minor cleanup in the file:

Remove redundant parentheses in conditions
Convert long lines to paragraphs


JustAnotherArchivist

Please revert the indentation and empty line changes, then I can review the other changes.

JustAnotherArchivist

The indentation is still wrong and the diff unreadable.


mohammadali-seifkashani · 2023-02-12T20:14:57Z

Please do me a favor and check the code.

JustAnotherArchivist

Thanks for the cleanup. The indentation of the new function is still incorrect though as it uses spaces, not tabs, so the code is a syntax error currently. Other things below.
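To illustrate the indentation problem described above, here is a minimal sketch that lists space-indented lines and checks whether the source compiles. The helper names and the `SAMPLE` source are hypothetical and only mimic the situation; they are not taken from snscrape.

```python
def space_indented_lines(source: str) -> list[int]:
    """Return 1-based numbers of lines whose indentation starts with a space."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if line.startswith(' ')
    ]


def compiles(source: str) -> bool:
    """Report whether Python accepts the source.

    IndentationError and TabError are subclasses of SyntaxError,
    so catching SyntaxError covers all of them.
    """
    try:
        compile(source, '<sketch>', 'exec')
    except SyntaxError:
        return False
    return True


# Hypothetical miniature: a tab-indented class with a space-indented
# method appended to it (assumption, not the actual telegram.py).
SAMPLE = (
    'class Scraper:\n'
    '\tdef get_items(self):\n'
    '\t\treturn []\n'
    '    @staticmethod\n'
    '    def get_post_text(post):\n'
    '        return ""\n'
)
```

Running `space_indented_lines(SAMPLE)` points at the three offending lines, and `compiles(SAMPLE)` returns False until the spaces are replaced with tabs.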

JustAnotherArchivist · 2023-02-13T00:27:29Z

snscrape/modules/telegram.py

-		return cls._cli_construct(args, args.channel)
+		return cls._cli_construct(args, args.channel)


Still an undesired whitespace change. There should be a LF at the end of a (text-ish) file.

(Looks like GitHub doesn't display this correctly on the PR page itself, only in the full diff: https://github.com/JustAnotherArchivist/snscrape/pull/687/files#diff-7f40c11448f92ed2f5d1764136d372d15faa3d4da0272813e88478c4d8870a09L203)
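A quick way to check for the missing final LF locally, sketched under the assumption that reading the whole file as bytes is acceptable. Both function names are hypothetical.

```python
from pathlib import Path


def ends_with_lf(data: bytes) -> bool:
    """True if the content is empty or its final byte is a line feed."""
    return not data or data.endswith(b'\n')


def file_ends_with_lf(path: str) -> bool:
    """Apply the same check to a file on disk."""
    return ends_with_lf(Path(path).read_bytes())
```

For example, `file_ends_with_lf('snscrape/modules/telegram.py')` would return False for a file whose last line lacks the trailing LF.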

JustAnotherArchivist · 2023-02-13T00:30:10Z

snscrape/modules/telegram.py

+        result = []
+        # Using the features of the BS4 module itself
+        for s in post.stripped_strings:
+            result.append(s)
+        return '\n'.join(result)


This can be simplified to '\n'.join(post.stripped_strings), but it doesn't do the right thing anyway. It splits out links into separate lines, and it doesn't preserve multiple line breaks. A good test case for both is https://t.me/telegram/201. Looks like this might require explicitly replacing the <br> tags.

JustAnotherArchivist · 2023-02-13T00:32:16Z

snscrape/modules/telegram.py

@@ -143,6 +143,14 @@ def get_items(self):
 				raise snscrape.base.ScraperException(f'Got status code {r.status_code}')
 			soup = bs4.BeautifulSoup(r.text, 'lxml')

+    @staticmethod
+    def get_post_text(post) -> str:


Should be message, not post, to avoid confusion with the variable in _soup_to_items. This should also not be public API. So _get_message_text(message).

mohammadali-seifkashani added 3 commits January 25, 2023 09:13

adding static method get_post_text to class TelegramChannelScraper fo…

6546279

…r having line breaks in result

remove extra line soup.get_text(separator="\n") from my TelegramChann…

117cab7

…elScraper._soup_to_items method

reformat telegram.py file: remove redundant parentheses in conditions…

70c1a1f

… - convert long lines to paragraphs

JustAnotherArchivist requested changes Jan 27, 2023


revert identations because of the request of the repository owner

ed3d520

mohammadali-seifkashani requested review from JustAnotherArchivist and removed request for JustAnotherArchivist February 4, 2023 15:29

JustAnotherArchivist requested changes Feb 4, 2023


Just copy the source code and add my function to it without any other…

41aeff1

… change

mohammadali-seifkashani requested a review from JustAnotherArchivist February 5, 2023 08:15

JustAnotherArchivist requested changes Feb 13, 2023


JustAnotherArchivist added the enhancement (New feature or request) and module:telegram labels Feb 13, 2023
