UnicodeDecodeError on scraping data from multiple sources #424

RobertLucian · 2017-11-29T16:20:51Z

When I run the following code, I get UnicodeDecodeError: 'utf-8' codec can't decode byte [x] in position [y]: invalid continuation byte, where [x] and [y] vary with the requested stock or data source:

from pandas_datareader import data, wb
import datetime

start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 11, 1)

bac = data.DataReader('BAC', 'google', start, end)

I've tested various sources (3 so far, with 4 or 5 stocks) and I consistently get this error.
I'm running this in a conda environment with Python 3.6 and pandas_datareader 0.5.0.

Could someone point out what the issue is here?


ghost · 2017-11-29T18:32:57Z

I am experiencing the same error here. I think it happens with the google source.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last) 

     16     def __init__(self, symbol, from_date, to_date):
---> 17         raw_data = web.DataReader(symbol, "google", from_date, to_date) 

~/.local/lib/python3.5/site-packages/pandas_datareader/data.py in DataReader(name, data_source, start, end, retry_count, pause, session, access_key)
    135                                  chunksize=25,
    136                                  retry_count=retry_count, pause=pause,
--> 137                                  session=session).read()
    138 
    139     elif data_source == "enigma":

~/.local/lib/python3.5/site-packages/pandas_datareader/base.py in read(self)
    179         if isinstance(self.symbols, (compat.string_types, int)):
    180             df = self._read_one_data(self.url,
--> 181                                      params=self._get_params(self.symbols))
    182         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    183         elif isinstance(self.symbols, DataFrame):

~/.local/lib/python3.5/site-packages/pandas_datareader/base.py in _read_one_data(self, url, params)
     77         """ read one data from specified URL """
     78         if self._format == 'string':
---> 79             out = self._read_url_as_StringIO(url, params=params)
     80         elif self._format == 'json':
     81             out = self._get_response(url, params=params).json()

~/.local/lib/python3.5/site-packages/pandas_datareader/base.py in _read_url_as_StringIO(self, url, params)
     96                           "inputs: {}".format(service, self.url))
     97         if isinstance(text, compat.binary_type):
---> 98             out.write(bytes_to_str(text))
     99         else:
    100             out.write(text)

~/.local/lib/python3.5/site-packages/pandas/compat/__init__.py in bytes_to_str(b, encoding)
     72 
     73     def bytes_to_str(b, encoding=None):
---> 74         return b.decode(encoding or 'utf-8')
     75 
     76     # The signature version below is directly copied from Django,

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 14207: invalid start byte

tuannguyenthe · 2017-12-01T07:17:19Z

I am also having the same problem.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
in ()
      2 end = dt.datetime(2017, 11, 30)
      3
----> 4 f = web.DataReader('AAPL', 'google', start, end)

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/data.py in DataReader(name, data_source, start, end, retry_count, pause, session, access_key)
    135                                  chunksize=25,
    136                                  retry_count=retry_count, pause=pause,
--> 137                                  session=session).read()
    138
    139     elif data_source == "enigma":

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/base.py in read(self)
    179         if isinstance(self.symbols, (compat.string_types, int)):
    180             df = self._read_one_data(self.url,
--> 181                                      params=self._get_params(self.symbols))
    182         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    183         elif isinstance(self.symbols, DataFrame):

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/base.py in _read_one_data(self, url, params)
     77         """ read one data from specified URL """
     78         if self._format == 'string':
---> 79             out = self._read_url_as_StringIO(url, params=params)
     80         elif self._format == 'json':
     81             out = self._get_response(url, params=params).json()

~/envs/3.5/lib/python3.5/site-packages/pandas_datareader/base.py in _read_url_as_StringIO(self, url, params)
     96                           "inputs: {}".format(service, self.url))
     97         if isinstance(text, compat.binary_type):
---> 98             out.write(bytes_to_str(text))
     99         else:
    100             out.write(text)

~/envs/3.5/lib/python3.5/site-packages/pandas/compat/__init__.py in bytes_to_str(b, encoding)
     70
     71     def bytes_to_str(b, encoding=None):
---> 72         return b.decode(encoding or 'utf-8')
     73
     74     # The signature version below is directly copied from Django,

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 352: invalid continuation byte

nzd31155 · 2017-12-01T13:20:15Z

I get the same problem. This all worked fine just last week.

import pandas_datareader.data as wb
import datetime

start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 11, 1)

stocks = ['LON:KGF', 'LON:ADM']
f = wb.DataReader(stocks, 'google', start, end)

Error message:

f = web.DataReader(stocks, 'google', start,end)
Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/data.py", line 137, in DataReader
    session=session).read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 186, in read
    df = self._dl_mult_symbols(self.symbols)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 197, in _dl_mult_symbols
    self._get_params(sym))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 79, in _read_one_data
    out = self._read_url_as_StringIO(url, params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_datareader/base.py", line 98, in _read_url_as_StringIO
    out.write(bytes_to_str(text))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/compat/__init__.py", line 73, in bytes_to_str
    return b.decode(encoding or 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 28735: invalid start byte

bsolomon1124 · 2017-12-01T15:28:22Z

A quick fix is below. It is ported from the pandas-datareader source, pared down, with a few slight tweaks.

I believe the issue lies in how the body returned by requests.get() is read and decoded. (The traceback agrees with this.) For instance, data = requests.get(url).content returns raw bytes, and decoding those bytes as UTF-8 is the step that fails. Below, data = requests.get(url).text works, because requests decodes the body itself using the encoding it infers for the response.

I haven't tested this rigorously, but the Google API itself appears to be working. For instance, the export link generated by url works fine at the moment.

import datetime
import requests
from io import StringIO
# This is just a wrapper importing the compatible version of
#     urllib's urlencode--see pandas docs
from pandas.io.common import urlencode
import pandas as pd

BASE = 'http://finance.google.com/finance/historical'


# There seems to be confusion over whether the date api has changed.
# https://github.com/pydata/pandas-datareader/pull/425
# Both formats seem to work, but I'll use the "newer" one here to be safe
def get_params(symbol, start, end):
    params = {
        'q': symbol,
        'startdate': start.strftime('%Y/%m/%d'),
        'enddate': end.strftime('%Y/%m/%d'),
        'output': "csv"
    }
    return params


def build_url(symbol, start, end):
    params = get_params(symbol, start, end)
    return BASE + '?' + urlencode(params)


start = datetime.datetime(2010, 1, 1)
end = datetime.datetime.today()
sym = 'SPY'
url = build_url(sym, start, end)

data = requests.get(url).text
data = pd.read_csv(StringIO(data), index_col='Date', parse_dates=True)

print(data.head())
#               Open    High     Low   Close     Volume
# Date
# 2017-11-30  263.76  266.05  263.67  265.01  127894389
# 2017-11-29  263.02  263.63  262.20  262.71   77512102
# 2017-11-28  260.76  262.90  260.66  262.87   98971719
# 2017-11-27  260.41  260.75  260.00  260.23   52274922
# 2017-11-24  260.32  260.48  260.16  260.36   27856514
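
The bytes-versus-text difference described above can be reproduced offline. The sketch below is only an illustration: the payload is made up, and Latin-1 is an assumed fallback encoding, since nothing in this thread confirms which encoding Google's response actually uses. The helper names decode_utf8_strict and decode_with_fallback are hypothetical and are not part of pandas or pandas-datareader.

```python
# Made-up payload: "£" encoded as Latin-1 is the single byte 0xa3,
# which is not a valid start byte in UTF-8.
raw = "Price (£),Close\n".encode("latin-1")


def decode_utf8_strict(payload: bytes) -> str:
    """Decode the way bytes_to_str does in the traceback: UTF-8 only."""
    return payload.decode("utf-8")


def decode_with_fallback(payload: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in order and return the first successful decode.

    The fallback list is an assumption made for this sketch.
    """
    for enc in encodings:
        try:
            return payload.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the payload")


try:
    decode_utf8_strict(raw)
except UnicodeDecodeError as exc:
    # Same class of error as reported in this issue.
    print("strict UTF-8 failed:", exc)

print(decode_with_fallback(raw))
```

A fallback like this only hides the symptom. If the server declares its encoding, reading response.text (as in the fix above) or setting response.encoding explicitly is usually the cleaner route.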

qmpzqmpz · 2017-12-02T19:26:10Z

Check if GoogleDailyReader.url() in pandas_datareader/google/daily.py returns 'http://www.google.com/finance/historical'.
If so, change it to 'http://finance.google.com/finance/historical'.
(www -> finance)
The return value of GoogleDailyReader.url() was 'http://www.google.com/finance/historical' when I downloaded pandas-datareader in PyCharm yesterday. I don't know why.

bsolomon1124 · 2017-12-02T21:26:40Z

That's very strange @qmpzqmpz because the url seems to be correct in source, at least in 0.5.0:

https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/google/daily.py#L34

But when I test, the url attribute shows the "old" url.

import pandas_datareader as pdr
c = pdr.google.daily.GoogleDailyReader('')

c.url
# 'http://www.google.com/finance/historical'

pdr.__version__
# '0.5.0'

coulanuk · 2017-12-03T14:23:16Z

@bsolomon1124 your fix works well.
I have just added:
data.sort_index(ascending=True, inplace=True)
to get the DataFrame in ascending date order, which I believe was the original behaviour.
This makes it possible to calculate returns etc.
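
As a rough illustration of why the ordering matters, here is a small offline sketch. It reuses the five SPY rows printed in the fix above, typed in by hand as CSV text with ISO dates (an assumption for this sketch, not necessarily the format Google returns), and the variable names are chosen only for this example.

```python
from io import StringIO

import pandas as pd

# Rows copied from the sample output earlier in the thread, newest first.
csv_text = """Date,Open,High,Low,Close,Volume
2017-11-30,263.76,266.05,263.67,265.01,127894389
2017-11-29,263.02,263.63,262.20,262.71,77512102
2017-11-28,260.76,262.90,260.66,262.87,98971719
2017-11-27,260.41,260.75,260.00,260.23,52274922
2017-11-24,260.32,260.48,260.16,260.36,27856514
"""

data = pd.read_csv(StringIO(csv_text), index_col="Date", parse_dates=True)

# Put the oldest row first so each return compares a day with the previous one.
data = data.sort_index(ascending=True)

# Simple daily returns on the close; the first row has no prior day, so it is NaN.
returns = data["Close"].pct_change()
print(returns)
```

Without the sort, pct_change() would compare each day with the following day, which gives returns with the wrong sign and the wrong base.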

paintdog · 2017-12-03T15:35:45Z

Can someone write a pull request and fix the bugs???

bsolomon1124 · 2017-12-04T16:10:18Z

Busy week for me, but I can try to submit this over the weekend. That said, it looks like some other commits have been failing the Travis CI build.

nzd31155 · 2017-12-04T16:33:59Z

I tested your fix and it works nicely, thanks. With the datareader I pulled pricing for a list of tickers (e.g. LON:BARC, LON:KGF, LON:BLND), and the result was a Panel with an entry for each stock. If I feed a list to the fix above, the API doesn't like it. I know I can iterate through the list, but I wanted to do it in one call and get a Panel back. Am I missing something stupid?

bsolomon1124 · 2017-12-04T18:38:31Z

@nzd31155 Yeah, the pandas-datareader code for reading multiple symbols is a loop that reads each one individually and then returns a Panel. You can find it here:

https://github.com/pydata/pandas-datareader/blob/master/pandas_datareader/base.py#L189

The class structure is like this--

_BaseReader is the base class for not just Google/Yahoo but other datareader modules as well.
_DailyBaseReader inherits from _BaseReader and is a base class for Google/Yahoo specifically.
GoogleDailyReader inherits from _DailyBaseReader and implements the url and param retrieval.
DataReader (the interface that most are familiar with) is just a wrapper that maps data_source to one of the classes such as GoogleDailyReader

Just FYI, Panel has carried a deprecation warning since pandas 0.20. A MultiIndex DataFrame would be a good alternative.
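
A minimal sketch of that loop-then-combine approach, returning a MultiIndex DataFrame instead of a Panel, could look like the following. The names fetch_many and fake_fetch are hypothetical and exist only for this example; fake_fetch stands in for a real single-symbol download (such as the requests-based fix earlier in the thread) so that the sketch runs without network access.

```python
import pandas as pd


def fetch_many(symbols, fetch_one):
    """Fetch each symbol separately and stack the results.

    fetch_one is any callable that takes a symbol and returns a
    date-indexed DataFrame; it is an assumption of this sketch.
    """
    frames = [fetch_one(sym) for sym in symbols]
    # keys= adds an outer index level holding the symbol for each block of rows.
    return pd.concat(frames, keys=list(symbols), names=["Symbol", "Date"])


def fake_fetch(symbol):
    """Stand-in for a real download: two rows of made-up prices."""
    idx = pd.to_datetime(["2017-11-29", "2017-11-30"])
    return pd.DataFrame({"Close": [1.0, 2.0]}, index=idx)


combined = fetch_many(["LON:BARC", "LON:KGF", "LON:BLND"], fake_fetch)
print(combined)

# Select one symbol's rows from the outer level.
print(combined.loc["LON:KGF"])
```

Selecting with combined.loc[symbol] gives back an ordinary date-indexed frame per symbol, which covers most of what the per-stock Panel access was used for.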

I'm not an active developer on pandas-datareader but glad to take a deeper look when I have a moment and try to get something going that passes the build tests.

bsolomon1124 · 2017-12-05T00:55:06Z

Just an update regarding the url: it's correct in the GitHub repo, but outdated in the PyPI download with equivalent version. (Go figure...)

To check:

>>> import pandas_datareader as pdr
>>> test = pdr.google.daily.GoogleDailyReader('')
>>> test.url??
Type:        property
String form: <property object at 0x10bdb39a8>
Source:     
# test.url.fget
@property
def url(self):
    return 'http://www.google.com/finance/historical'

bashtage · 2018-01-18T22:22:08Z

This appears to be fixed in master, so closing for now. Reopen if this persists after 0.6.0

wangliangliang2 · 2018-08-05T07:10:31Z

@bashtage
The problem still exists.

wangliangliang2 · 2018-08-05T07:11:16Z

from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016', data_source='google')
goog.head()

AlessandroVol23 · 2018-08-06T09:15:25Z

I still have the same issue as well.

Python: 3.6.5
pandas-datareader: 0.6.0

pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data
goog = data.DataReader("GOOG", start="2004", end="2016", data_source="google")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 17880: invalid start byte

bsolomon1124 mentioned this issue Dec 1, 2017

Disruption of equities data :: pandas_datareader dependency on Yahoo and Google Finance API rsvp/fecon235#7

Open

ghost mentioned this issue Dec 6, 2017

Fix of UnicodeDecodeError on scraping data from multiple sources #427

Closed

3 tasks

nzd31155 mentioned this issue Dec 11, 2017

'utf-8' Encoding + 'unable to read URL' errors: when downloading financial data from yahoo and google #428

Closed

bashtage added the yahoo-finance label Jan 13, 2018

bashtage closed this as completed Jan 18, 2018
