twarc2 search without configure on Windows throws JSON parse error #441

Closed
osemele opened this issue Apr 22, 2021 · 63 comments

@osemele

osemele commented Apr 22, 2021

I ran the request below:
twarc2 search '#ENDSARS-is:retweet' --start-time 2017-12-01 --end-time 2020-11-30 --flatten --archive C:\Users\USER\Desktop\MyTwarcResults.json

and I got this error message below:

Traceback (most recent call last):
  File "C:\Users\USER\PycharmProjects\workspace\venv\Scripts\twarc2-script.py", line 33, in <module>
    sys.exit(load_entry_point('twarc==2.0.6', 'console_scripts', 'twarc2')())
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\click\decorators.py", line 33, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\twarc\decorators.py", line 172, in __call__
    result = e.response.json()
  File "c:\users\user\pycharmprojects\workspace\venv\lib\site-packages\requests\models.py", line 900, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

What exactly could be the cause of this error, and how can I get help?

@igorbrigadir
Contributor

'#ENDSARS-is:retweet'

i think this query is missing a space, it should be "#ENDSARS -is:retweet"

Another issue may be the ' vs " quotes - so the full command that might work is:

twarc2 search --start-time "2017-12-01" --end-time "2020-11-30" --flatten --archive "#ENDSARS -is:retweet" "C:\Users\USER\Desktop\MyTwarcResults.json"

Does that give the same error?

@edsu
Member

edsu commented Apr 22, 2021

I noticed the missing space too. I couldn't get it to throw the same error though. Maybe it's a Windows only behavior? Does anyone else with access have time to confirm?

@igorbrigadir
Contributor

I saw this error before from someone else, and the issue was a failing connection to the API, solved by setting alternative DNS servers. "Expecting value: line 1 column 1 (char 0)" comes up if there's a blank or no response from the API, but I can't reproduce this either. I will have to load up Windows again for more testing.
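
As a quick illustration of why that message appears, here is a minimal sketch using only Python's standard json module. The helper name try_parse is made up for this example and is not part of twarc or requests. It shows that an empty body and a non-JSON body (for instance an HTML page returned by something sitting between the client and the API) both fail at the very first character.

```python
import json


def try_parse(body: str):
    """Return (parsed_value, error_message) for a response body."""
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        # str(exc) is the same text that ends the traceback above
        return None, str(exc)


# An empty body and an HTML body both fail on the first character.
print(try_parse(""))
print(try_parse("<html>blocked</html>"))
# A real JSON body parses normally.
print(try_parse('{"data": []}'))
```

So the traceback by itself only says that the body was not JSON; it does not say why.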

@osemele
Author

osemele commented Apr 23, 2021

'#ENDSARS-is:retweet'

i think this query is missing a space, it should be "#ENDSARS -is:retweet"

Another issue may be the ' vs " quotes - so the full command that might work is:

twarc2 search --start-time "2017-12-01" --end-time "2020-11-30" --flatten --archive "#ENDSARS -is:retweet" "C:\Users\USER\Desktop\MyTwarcResults.json"

Does that give the same error?

Yes, it still does, unfortunately. I am really stuck and don't know what to do.

@osemele
Author

osemele commented Apr 23, 2021

Running other simple commands, like
twarc search blacklivesmatter > search.jsonl
gives me the same error.

@osemele
Author

osemele commented Apr 24, 2021

I have tried all the suggestions from @igorbrigadir and @edsu, yet, i get the error response:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

@edsu
Member

edsu commented Apr 25, 2021

Interesting! @osemele can you paste the last 20 lines or so of your twarc.log file? You should find the file in the directory where you are running your twarc command.

edsu added a commit that referenced this issue Apr 25, 2021
If errors from the Twitter API are not JSON they can cause strange
errors. Instead we should catch these and log what was received from
Twitter instead of JSON.

Refs #441
@edsu
Member

edsu commented Apr 25, 2021

Also, I've just released twarc v2.0.7 that should log what was received from the Twitter API when an error message is not JSON. Could you try installing it with pip3 install --upgrade twarc and see what error message you get?
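
For readers following along, the idea of falling back to the raw text when an error body is not JSON can be sketched roughly as below. This is not twarc's actual code; the function name error_details, the logger name, and the 200-character truncation are assumptions made for illustration.

```python
import json
import logging

log = logging.getLogger("example")  # hypothetical logger name


def error_details(status_code: int, body: str) -> str:
    """Summarize an API error body without assuming it is JSON."""
    try:
        parsed = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Log what was actually received instead of crashing on the parse.
        log.error("unable to parse %s error as JSON: %r", status_code, body[:200])
        return f"{status_code}: non-JSON response: {body[:200]!r}"
    return f"{status_code}: {json.dumps(parsed)}"


print(error_details(400, "Bad Request"))
print(error_details(429, '{"title": "Too Many Requests"}'))
```

The point of the fallback is that the user sees whatever the server (or an intermediary) really sent, rather than a JSONDecodeError.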

@edsu
Member

edsu commented Apr 25, 2021

Thinking about @igorbrigadir's point about DNS I'd also be interested to see if the v1.1 API is working. Can you try the older twarc client and see what happens?

twarc search blacklivesmatter

Note twarc instead of twarc2.

@osemele
Author

osemele commented Apr 26, 2021 via email

@edsu
Member

edsu commented Apr 26, 2021

Strange, I thought I was catching JSONDecodeError in v2.0.7 and up. What do you see when you run this:

twarc2 version

@osemele
Author

osemele commented Apr 27, 2021 via email

@igorbrigadir
Contributor

If twarc works, but twarc2 does not, I would first check if the app is setup for v2 access on https://developer.twitter.com/en/portal/dashboard - and if that's correct (the app should be under "Academic Research" Project, not "Standard") try this command, replacing AAA...zzz with your bearer token:

curl "https://api.twitter.com/2/tweets/search/all?query=from%3Atwitterdev%20new%20-is%3Aretweet&max_results=10" -H "Authorization: Bearer AAA...zzz"

If that fails, something else is wrong.

If you do not have Academic Access, you will not be able to use --archive and can search within the last 7 days only. Standard endpoint example is:

curl "https://api.twitter.com/2/tweets/search/recent?query=from%3Atwitterdev%20new%20-is%3Aretweet&max_results=10" -H "Authorization: Bearer AAA...zzz"
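
If hand-encoding the query string is error-prone, the same URLs can be built with Python's standard library. This is only a sketch for checking the encoding; the helper name search_url is made up here, and it builds the URL without sending any request.

```python
from urllib.parse import quote, urlencode

BASE = "https://api.twitter.com/2/tweets/search"


def search_url(endpoint: str, query: str, max_results: int = 10) -> str:
    """Build a search URL for the 'all' or 'recent' endpoint."""
    params = {"query": query, "max_results": max_results}
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{BASE}/{endpoint}?{urlencode(params, quote_via=quote)}"


print(search_url("all", "from:twitterdev new -is:retweet"))
print(search_url("recent", "from:twitterdev new -is:retweet"))
```

The first line printed should match the full-archive URL used in the curl command above.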

If those work, but twarc2 still does not, I would recommend reinstalling twarc, or trying it in a brand new virtualenv environment:

pip install --upgrade --force-reinstall twarc

Another thing i found, is that maybe there is a space in your user name, and your Anaconda / pip is broken as a result: https://stackoverflow.com/questions/42152589/anaconda-failed-to-create-process

@edsu
Member

edsu commented Apr 28, 2021

I've been meaning to check what happens when you use --archive with keys that don't have access to the Academic Product Track. If that really is the cause here I think twarc2 should give an understandable error.

@edsu
Member

edsu commented Apr 28, 2021

@osemele can you please paste the full stack trace you see when you run twarc2 version ?

@osemele
Author

osemele commented Apr 29, 2021 via email

@edsu
Member

edsu commented Apr 29, 2021

@osemele I accidentally introduced a new error when trying to catch the one you found earlier. Could you upgrade twarc to v2.0.9 and try your twarc2 command again and paste any errors you see?

@osemele
Author

osemele commented Apr 29, 2021 via email

@igorbrigadir
Contributor

http://www.avg.com/email-signature?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail
Virus-free.
www.avg.com
http://www.avg.com/email-signature?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail
<#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>

Thanks! This is making sense to me now - It appears that AVG is blocking or redirecting requests to api.twitter.com. You will need to configure AVG to allow connections to twitter, or allow twarc2.exe or something like that - i don't know AVG settings, but that's the source of your error, which was hidden by our JSONDecode bug.

@osemele
Author

osemele commented Apr 29, 2021 via email

@osemele
Author

osemele commented Apr 29, 2021 via email

@edsu
Member

edsu commented Apr 29, 2021

This is just for reference but @igorbrigadir is doing some fine detective work on this over in the Twitter Forum: https://twittercommunity.com/t/mining-historical-data-using-twarc/153350/13

@edsu
Member

edsu commented Apr 29, 2021

@osemele what is AVG? Can you try to reinstall twarc with this command and see if twarc2 works?

python -m pip install --upgrade --force-reinstall twarc

@osemele
Author

osemele commented Apr 30, 2021 via email

@igorbrigadir
Contributor

And what's the exact command you're running and where (Anaconda prompt? cmd.exe?)

What does

pip list

output?

@osemele
Author

osemele commented Apr 30, 2021 via email

@igorbrigadir
Contributor

And what does this output? (Replacing AAA...zzz with your own bearer token and removing it from the output before posting here)

curl -v -H "Authorization: Bearer AAA...zzz" "https://api.twitter.com/2/tweets/search/recent?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=10&query=endsars"

and

curl -v -H "Authorization: Bearer AAA...zzz" "https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=10&query=%23endsars&start_time=2006-03-21T00%3A00%3A00%2B00%3A00"

You may also have to replace curl with curl.exe

@osemele
Author

osemele commented Apr 30, 2021 via email

@edsu
Member

edsu commented May 5, 2021

Yeah, a 400 error from the API is documented as:

The request was invalid or cannot be otherwise served. An accompanying error message will explain further. Requests without authentication or with invalid query parameters are considered invalid and will yield this response.

and the documentation then suggests:

Double check the format of your JSON query. For example, if your rule contains double-quote characters associated with an exact-match or other operator, you may need to escape them using a backslash to distinguish them from the structure of the JSON format.

But we're not actually sending any JSON as part of the search/recent API call; it's just a GET. I guess if we set the logging level to DEBUG we might get some underlying information from requests/urllib3? A --log-level option might actually be nice to have...

But I agree @igorbrigadir it seems like something is interfering with the execution of twarc2.exe? I don't know how feasible it is, but it might be nice to be able to run twarc2 like this:

python -m twarc.command2 search ...
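
For anyone who wants to try the DEBUG idea from a Python session, here is a minimal sketch using only the standard library. The function name enable_http_debug_logging is made up for this example, and nothing here is an existing twarc option. Note that with debuglevel set, http.client prints request headers to stdout, which would include the Authorization header, so the output should not be pasted publicly.

```python
import http.client
import logging


def enable_http_debug_logging() -> None:
    """Turn on verbose logging for the HTTP stack that requests relies on."""
    # Send log records to the console (no effect if logging is already configured).
    logging.basicConfig(level=logging.DEBUG)
    # urllib3 logs connection setup and one line per request at DEBUG level.
    logging.getLogger("urllib3").setLevel(logging.DEBUG)
    # http.client prints raw request and response headers when debuglevel > 0.
    http.client.HTTPConnection.debuglevel = 1


enable_http_debug_logging()
```

Any requests call made after this in the same process should then show which host was contacted and which status line came back.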

@osemele
Author

osemele commented May 7, 2021 via email

@edsu
Member

edsu commented May 7, 2021

@osemele that is awesome news! Do you know what you did to fix it? It would be useful for us to know if this situation ever arises again.

@osemele
Author

osemele commented May 7, 2021 via email

@edsu
Member

edsu commented May 7, 2021

That's very helpful thanks @osemele . We will test running twarc2 search without having run twarc2 configure first on Windows.

@edsu changed the title from "Error Message after running Twarc command" to "twarc2 search without configure on Windows throws JSON parse error" on May 7, 2021
@igorbrigadir
Contributor

igorbrigadir commented May 9, 2021

Maybe we could try and read the old twarc config file if it exists to auto configure twarc2?
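
A rough sketch of what that could look like, assuming the old config is an INI-style file with one section per profile. The file layout, the default profile name "main", and the function name load_legacy_keys are all assumptions for illustration, not confirmed details of twarc.

```python
import configparser
from pathlib import Path


def load_legacy_keys(path: Path, profile: str = "main"):
    """Return the key/value pairs of one profile, or None if unavailable."""
    if not path.is_file():
        return None
    parser = configparser.ConfigParser()
    parser.read(path)
    if not parser.has_section(profile):
        return None
    # Each option in the section becomes one dictionary entry.
    return dict(parser.items(profile))
```

A caller could try this first and fall back to prompting for keys when it returns None.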

@edsu
Member

edsu commented May 9, 2021

Yeah, that would be nice if it wasn't too tricky. Do the old stand-alone apps have access to the Twitter v2 API? I guess it is confusing for someone who might use twarc and twarc2 concurrently. I wanted to update twarc2 to allow for "profiles" like twarc.

@igorbrigadir
Contributor

Do the old stand-alone apps have access to the Twitter v2 API?

Not by default, but, the same keys work for both v1.1 and v2, if the app is set up in a Project on the dashboard - https://developer.twitter.com/en/portal/projects-and-apps so we could link that in a "warning" when loading configs this way maybe?

"Profiles" sound like a good feature for sure.

@osemele
Author

osemele commented May 10, 2021 via email

@igorbrigadir
Contributor

Yes very! Thanks again for digging in with the debugging!

@edsu
Member

edsu commented May 18, 2021

This also has me wondering if the input should actually display the keys on the console. It seems to be causing some confusion.

@edsu
Member

edsu commented May 26, 2021

It's not really clear to me that the configuration was actually the root of the problem here. We are seeing the AVG firewall issue coming up in #469 as well.

@AbirRes

AbirRes commented Jun 9, 2021

Hi @edsu, not sure if this thread is still running. I am facing a similar issue as @osemele, "unable to parse 400 error as json: Bad request" with twarc2. I have been able to successfully configure twarc2 as well as twarc, so the above-suggested fix does not work for me. twarc runs perfectly for me, but twarc2, unfortunately, does not. When I run the command: twarc2 stream blm > tweets.json1, it creates a file "tweets" but without any data. I have tried installing, uninstalling Anaconda, Python, etc., but unfortunately, nothing has worked so far. I also tried on a computer where the username does not have any space in it to avoid the pip breaking down, but that did not seem to be the problem as well. I am sorry for the long post, but I can't seem to find the fix while twarc2 seems to do exactly what I need which is why I really want it to work.
I would really appreciate any suggestions that you could kindly provide.

@edsu
Member

edsu commented Jun 9, 2021

@AbirRes thank you for the details! Can you say what operating system + version you are using as well as what version of Python you have installed and where you got it from?

@AbirRes

AbirRes commented Jun 9, 2021

@edsu I am using Windows 10 and python 3.9.5. I downloaded it from their official website. I also tried it after downloading Anaconda, where then I used the Anaconda prompt to run the commands. Furthermore, I followed the usual/suggested install methods and did not do anything custom to change the path, etc.

@edsu
Member

edsu commented Jun 9, 2021

It is very interesting that twarc works but twarc2 does not. Can you share which commands you were using for each?

@AbirRes

AbirRes commented Jun 9, 2021

I have only tried the basic commands so far. I have tried: twarc search #covid; twarc filter #covid; twarc trends. For twarc2, I have tried twarc2 stream #covid19; twarc2 stream blm; twarc2 stream blm > tweets.json1. All give "unable to parse json" error, but the latter one creates an empty tweets folder.

@osemele
Author

osemele commented Jun 10, 2021 via email

@edsu
Member

edsu commented Jun 10, 2021

@osemele the file extension for the output file would not cause this problem.

@AbirRes what do you see when you run twarc2 search blm ?

@AbirRes

AbirRes commented Jun 10, 2021

@AbirRes what do you see when you run twarc2 search blm ?

I get the message: Unable to parse 400 error as JSON: Bad Request.

I am sorry, I can't post a snapshot as I am not in front of my system right now.

edsu added a commit that referenced this issue Jun 10, 2021
This commit adds twarc.config.ConfigProvider which is based on
click_config_file.configobj_provider and stores the file path for the
config file that was used. This is useful for logging.

Also when --verbose is used the log will now contain the keys that are
being used to talk to the API. This isn't something you would normally
want in your logs, but it can be useful for debugging situations like #441
and #469.
@edsu
Member

edsu commented Jun 10, 2021

@AbirRes v2.1.4 of twarc was just released, which has some improved logging. Would you be willing to upgrade:

pip install --upgrade twarc

and then run a search with verbose logging:

twarc2 --verbose search blm --limit 500 

Then can you email me your twarc.log at ehs@pobox.com? You should see the twarc.log file in the same directory where you ran the twarc2 command.

IMPORTANT! Please understand that --verbose will cause your API credentials to be written to the log, so please DO NOT upload the log file here to this GitHub issue. I promise not to use your keys other than to test from my side. You can even reset them once I'm done testing if you want. I completely understand if you would rather not do this, but I would be very grateful as we've had other people with this issue and we have not been able to replicate it in our Windows testing environments.

@edsu
Member

edsu commented Jun 11, 2021

@AbirRes thanks for sending the twarc.log file. I responded with an email asking if you would be willing to run twarc2 configure again (now that the token will display properly) and see if that will help you run other twarc2 commands.

@edsu
Member

edsu commented Jun 11, 2021

Oh, and the reason why twarc was working but twarc2 was not is that they actually use separate configuration files. Eventually the twarc one will go away when the v1.1 API endpoint is retired.

@edsu
Member

edsu commented Jun 12, 2021

With @AbirRes' help we were able to figure out that the bearer token was not persisted to the configuration file correctly. It was a ctrl-v character, which seemed to really confuse the Twitter API. I think the ctrl-v ended up in the configuration file because we were previously hiding the input of the token (for screen recording). It could be that some Windows terminals aren't set up to do ctrl-v properly, and users could not see that it wasn't working since it was hidden. Tokens should now appear in the console to help catch this in the future.

So if you have this problem, please make sure you are using twarc v2.1.5 or higher:

pip install --upgrade twarc

and then reconfigure twarc2:

twarc2 configure

Hopefully that will allow you to use twarc2 subcommands going forwards. Thanks for everyone's patience on this!
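
For anyone who wants to check a stored token by hand, here is a small sketch of the kind of test that would catch a stray Ctrl-V. A literal Ctrl-V keypress is the control character 0x16. The function name suspicious_characters is made up for this example and is not something twarc provides.

```python
import unicodedata


def suspicious_characters(token: str):
    """Return (index, code point) pairs for characters a token should not contain."""
    found = []
    for i, ch in enumerate(token):
        # Unicode categories starting with "C" cover control and format characters.
        if unicodedata.category(ch).startswith("C") or ch.isspace():
            found.append((i, f"U+{ord(ch):04X}"))
    return found


print(suspicious_characters("\x16"))          # [(0, 'U+0016')]
print(suspicious_characters("AAAAzzzz%3D"))   # []
```

An empty list means the string contains no control or whitespace characters; it says nothing about whether the token itself is valid.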

@edsu closed this as completed on Jun 12, 2021
@GoPro13

GoPro13 commented Jun 16, 2023

Don't do that again
