UnicodeDecodeError on scraping data from multiple sources #424

Closed
RobertLucian opened this issue Nov 29, 2017 · 16 comments

@RobertLucian

When I run the following code, I get UnicodeDecodeError: 'utf-8' codec can't decode byte [x] in position [y]: invalid continuation byte, where [x] and [y] vary depending on the requested stock or data source:

from pandas_datareader import data, wb
import datetime

start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 11, 1)

bac = data.DataReader('BAC', 'google', start, end)

I've tested various sources (3 so far, with 4 or 5 stocks) and I consistently get this error.
I'm running this in a conda environment with Python 3.6 and pandas_datareader 0.5.0.

Could someone point out what the issue is here?

@ghost

ghost commented Nov 29, 2017

I am experiencing the same error here. I think it happens with the google source.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last) 

     16     def __init__(self, symbol, from_date, to_date):
---> 17         raw_data = web.DataReader(symbol, "google", from_date, to_date) 

~/.local/lib/python3.5/site-packages/pandas_datareader/data.py in DataReader(name, data_source, start, end, retry_count, pause, session, access_key)
    135                                  chunksize=25,
    136                                  retry_count=retry_count, pause=pause,
--> 137                                  session=session).read()
    138 
    139     elif data_source == "enigma":

~/.local/lib/python3.5/site-packages/pandas_datareader/base.py in read(self)
    179         if isinstance(self.symbols, (compat.string_types, int)):
    180             df = self._read_one_data(self.url,
--> 181                                      params=self._get_params(self.symbols))
    182         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    183         elif isinstance(self.symbols, DataFrame):

~/.local/lib/python3.5/site-packages/pandas_datareader/base.py in _read_one_data(self, url, params)
     77         """ read one data from specified URL """
     78         if self._format == 'string':
---> 79             out = self._read_url_as_StringIO(url, params=params)
     80         elif self._format == 'json':
     81             out = self._get_response(url, params=params).json()

~/.local/lib/python3.5/site-packages/pandas_datareader/base.py in _read_url_as_StringIO(self, url, params)
     96                           "inputs: {}".format(service, self.url))
     97         if isinstance(text, compat.binary_type):
---> 98             out.write(bytes_to_str(text))
     99         else:
    100             out.write(text)

~/.local/lib/python3.5/site-packages/pandas/compat/__init__.py in bytes_to_str(b, encoding)
     72 
     73     def bytes_to_str(b, encoding=None):
---> 74         return b.decode(encoding or 'utf-8')
     75 
     76     # The signature version below is directly copied from Django,

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 14207: invalid start byte

@tuannguyenthe

I am also having the same problem.


UnicodeDecodeError                        Traceback (most recent call last)
in ()
      2 end = dt.datetime(2017, 11, 30)
      3
----> 4 f = web.DataReader('AAPL', 'google', start, end)

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/data.py in DataReader(name, data_source, start, end, retry_count, pause, session, access_key)
    135                                  chunksize=25,
    136                                  retry_count=retry_count, pause=pause,
--> 137                                  session=session).read()
    138
    139     elif data_source == "enigma":

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/base.py in read(self)
    179         if isinstance(self.symbols, (compat.string_types, int)):
    180             df = self._read_one_data(self.url,
--> 181                                      params=self._get_params(self.symbols))
    182         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    183         elif isinstance(self.symbols, DataFrame):

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/base.py in _read_one_data(self, url, params)
     77         """ read one data from specified URL """
     78         if self._format == 'string':
---> 79             out = self._read_url_as_StringIO(url, params=params)
     80         elif self._format == 'json':
     81             out = self._get_response(url, params=params).json()

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/base.py in _read_url_as_StringIO(self, url, params)
     96                           "inputs: {}".format(service, self.url))
     97         if isinstance(text, compat.binary_type):
---> 98             out.write(bytes_to_str(text))
     99         else:
    100             out.write(text)

~/envs/3.5/lib/python3.5/site-packages/pandas/compat/__init__.py in bytes_to_str(b, encoding)
     70
     71     def bytes_to_str(b, encoding=None):
---> 72         return b.decode(encoding or 'utf-8')
     73
     74     # The signature version below is directly copied from Django,

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 352: invalid continuation byte

@nzd31155

nzd31155 commented Dec 1, 2017

I get the same problem. This all worked fine just last week.

import pandas_datareader.data as wb
import datetime

start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 11, 1)

stocks = ['LON:KGF', 'LON:ADM']
f = wb.DataReader(stocks, 'google', start, end)

error message

f = web.DataReader(stocks, 'google', start,end)
Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/data.py", line 137, in DataReader
    session=session).read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 186, in read
    df = self._dl_mult_symbols(self.symbols)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 197, in _dl_mult_symbols
    self._get_params(sym))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 79, in _read_one_data
    out = self._read_url_as_StringIO(url, params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 98, in _read_url_as_StringIO
    out.write(bytes_to_str(text))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/compat/__init__.py", line 73, in bytes_to_str
    return b.decode(encoding or 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 28735: invalid start byte

@bsolomon1124

bsolomon1124 commented Dec 1, 2017

A quick fix is below, porting from the source, paring it down and making a few slight tweaks.

I believe the issue is with the body returned by requests.get() and how the resulting bytes are read. (The traceback agrees with this.) For instance, try data = requests.get(url).content (gets bytes); decoding those bytes as UTF-8 will fail. Below, data = requests.get(url).text works.

I really haven't tested this rigorously but the Google API does appear to be working okay. For instance, the export link generated by url does work just fine at the moment.
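A minimal offline sketch of the failure mode described above, assuming the response body contains bytes that are not valid UTF-8. The sample payload is made up for illustration (it is not real Google Finance output); it uses a Latin-1 pound sign, byte 0xa3, the same byte reported in one of the tracebacks earlier in this thread. As far as I know, requests' .text decodes with the encoding taken from the response headers (or a guess) rather than always assuming UTF-8, which would explain why it succeeds where a hard-coded UTF-8 decode does not.

```python
# Made-up sample payload: a CSV body containing a pound sign encoded as
# Latin-1 (single byte 0xa3). This stands in for the real response body.
raw = "Date,Close\n2017-11-30,\u00a3265.01\n".encode("latin-1")

# Decoding as UTF-8 fails, because 0xa3 is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")

print(utf8_error)
print(text)
```

Which encoding the server really used is an assumption here; Latin-1 is only one candidate that happens to accept every byte value.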

import datetime
import requests
from io import StringIO
# This is just a wrapper importing the compatible version of
#     urllib's urlencode--see pandas docs
from pandas.io.common import urlencode
import pandas as pd

BASE = 'http://finance.google.com/finance/historical'


# There seems to be confusion over whether the date api has changed.
# https://github.com/pydata/pandas-datareader/pull/425
# Both formats seem to work, but I'll use the "newer" one here to be safe
def get_params(symbol, start, end):
    params = {
        'q': symbol,
        'startdate': start.strftime('%Y/%m/%d'),
        'enddate': end.strftime('%Y/%m/%d'),
        'output': "csv"
    }
    return params


def build_url(symbol, start, end):
    params = get_params(symbol, start, end)
    return BASE + '?' + urlencode(params)


start = datetime.datetime(2010, 1, 1)
end = datetime.datetime.today()
sym = 'SPY'
url = build_url(sym, start, end)

data = requests.get(url).text
data = pd.read_csv(StringIO(data), index_col='Date', parse_dates=True)

print(data.head())
#               Open    High     Low   Close     Volume
# Date
# 2017-11-30  263.76  266.05  263.67  265.01  127894389
# 2017-11-29  263.02  263.63  262.20  262.71   77512102
# 2017-11-28  260.76  262.90  260.66  262.87   98971719
# 2017-11-27  260.41  260.75  260.00  260.23   52274922
# 2017-11-24  260.32  260.48  260.16  260.36   27856514
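The URL-building part of the snippet above can also be sketched with only the standard library, assuming Python 3, where urlencode lives in urllib.parse. This variant makes no network request; the end date is fixed only so the result is reproducible, and whether the endpoint still accepts these parameters is not something this sketch checks.

```python
import datetime
from urllib.parse import urlencode

BASE = 'http://finance.google.com/finance/historical'


def build_url(symbol, start, end):
    """Build the CSV export URL using the same parameters as the fix above."""
    params = {
        'q': symbol,
        'startdate': start.strftime('%Y/%m/%d'),
        'enddate': end.strftime('%Y/%m/%d'),
        'output': 'csv',
    }
    # urlencode percent-encodes the slashes in the dates.
    return BASE + '?' + urlencode(params)


url = build_url('SPY',
                datetime.datetime(2010, 1, 1),
                datetime.datetime(2017, 11, 30))
print(url)
```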

@qmpzqmpz

qmpzqmpz commented Dec 2, 2017

Check if GoogleDailyReader.url() in pandas_datareader/google/daily.py returns 'http://www.google.com/finance/historical'.
If so, change it to 'http://finance.google.com/finance/historical'.
(www -> finance)
The return value of GoogleDailyReader.url() was 'http://www.google.com/finance/historical' when I downloaded pandas-datareader in PyCharm yesterday. I don't know why.

@bsolomon1124

That's very strange @qmpzqmpz because the url seems to be correct in source, at least in 0.5.0:

https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/google/daily.py#L34

But when I test, the url attribute shows the "old" url.

import pandas_datareader as pdr
c = pdr.google.daily.GoogleDailyReader()

c.url
# 'http://www.google.com/finance/historical'

pdr.__version__
# '0.5.0'

@coulanuk

coulanuk commented Dec 3, 2017

@bsolomon1124 your fix works well.
I have just added:
data.sort_index(ascending=True, inplace=True)
This puts the dataframe in ascending date order, which was the original behaviour, I believe.
It enables one to calculate returns etc.
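A small sketch of why the sort order matters for returns, using three of the SPY closes printed earlier in this thread as a hand-built frame (no download involved). The "Return" column name is my own choice for illustration.

```python
import pandas as pd

# Newest-first rows, mimicking the order of the CSV output shown earlier.
data = pd.DataFrame(
    {"Close": [265.01, 262.71, 262.87]},
    index=pd.to_datetime(["2017-11-30", "2017-11-29", "2017-11-28"]),
)
data.index.name = "Date"

# Sort oldest-first so each row is compared with the previous trading day.
data = data.sort_index(ascending=True)

# pct_change compares each row with the row before it, so on unsorted
# (newest-first) data the returns would run backwards in time.
data["Return"] = data["Close"].pct_change()
print(data)
```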

@paintdog

paintdog commented Dec 3, 2017

Can someone write a pull request and fix the bugs???

@bsolomon1124

Busy week for me, but I can try to submit this this weekend. That said, it looks like some other commits have been failing the Travis CI build.

@nzd31155

nzd31155 commented Dec 4, 2017

Testing your fix works nicely, thanks. In the datareader I pulled pricing for a list of tickers (e.g. LON:BARC, LON:KGF, LON:BLND) and the result gave me a dataframe with a panel for each stock. If I feed a list to the fix above, the API doesn't like it. I know I can iterate through the list, but I wanted to do it once and return a panel. Am I missing something stupid?

@bsolomon1124

bsolomon1124 commented Dec 4, 2017

@nzd31155 Yeah, the pandas-datareader code for reading multiple symbols is a loop that reads each one individually and then returns a Panel. You can find it here:

https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/base.py#L189

The class structure is like this--

  • _BaseReader is the base class for not just Google/Yahoo but other datareader modules as well.
  • _DailyBaseReader inherits from _BaseReader and is a base class for Google/Yahoo specifically.
  • GoogleDailyReader inherits from _DailyBaseReader and implements the url and param retrieval.
  • DataReader (the interface that most are familiar with) is just a wrapper that maps data_source to one of the classes such as GoogleDailyReader

Just fyi that Panel has a deprecation warning on it as of pandas 0.20. A MultiIndex df would be a good alternative.
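A rough sketch of the MultiIndex alternative, assuming a per-symbol fetch function is available. Both read_many and fake_fetch are hypothetical names, not part of pandas-datareader, and the prices are placeholder values; in practice the fetch argument would be something like the single-symbol fix posted earlier.

```python
import pandas as pd


def fake_fetch(symbol):
    """Hypothetical stand-in for a single-symbol download (placeholder data)."""
    index = pd.DatetimeIndex(["2017-11-01", "2017-11-02"], name="Date")
    return pd.DataFrame({"Close": [1.0, 2.0]}, index=index)


def read_many(symbols, fetch):
    """Fetch each symbol separately and stack the results under one index."""
    frames = {sym: fetch(sym) for sym in symbols}
    # The dict keys become the outer index level; each frame's own
    # Date index becomes the inner level.
    return pd.concat(frames, names=["Symbol", "Date"])


combined = read_many(["LON:KGF", "LON:ADM"], fake_fetch)
print(combined)

# Selecting one symbol gives back an ordinary date-indexed frame.
print(combined.loc["LON:KGF"])
```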

I'm not an active developer on pandas-datareader but glad to take a deeper look when I have a moment and try to get something going that passes the build tests.

@bsolomon1124

bsolomon1124 commented Dec 5, 2017

Just an update regarding the url: it's correct in the GitHub repo, but outdated in the PyPI download with equivalent version. (Go figure...)

To check:

>>> import pandas_datareader as pdr
>>> test = pdr.google.daily.GoogleDailyReader('')
>>> test.url??
Type:        property
String form: <property object at 0x10bdb39a8>
Source:     
# test.url.fget
@property
def url(self):
    return 'http://www.google.com/finance/historical'

@bashtage
Contributor

This appears to be fixed in master, so closing for now. Reopen if this persists after 0.6.0.

@wangliangliang2

@bashtage The problem still exists.

@wangliangliang2

from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016', data_source='google')
goog.head()

@AlessandroVol23

I still have the same issue as well.

  • Python: 3.6.5
  • pandas-datareader: 0.6.0

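import pandas as pd
# Workaround applied before the import: pandas-datareader 0.6.0 appears to
# expect is_list_like in pandas.core.common, which newer pandas versions lack.
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data
goog = data.DataReader("GOOG", start="2004", end="2016", data_source="google")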

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 17880: invalid start byte
