Add in LCRA waterquality data #111
Conversation
A couple of points.
Thanks @dharhas.
I'll hold off work on this until I find your scraping code and see more info on the rewrite.
Hey Nathan. The code Dharhas is talking about is in https://github.com/twdb/swis/blob/master/data/lcra.py; the most relevant part is https://github.com/twdb/swis/blob/master/data/lcra.py#L222-L233. That script pulls from http://hydromet.lcra.org. I'm not sure how different the water quality site is, but it looks pretty similar. The "trick" with hydromet is that the server passes session data in the form of cookies and hidden input elements, and that data needs to be continually passed along (it's a chain that gets validated server-side on each subsequent request). It's bananas.
Got it. I'm actually making a POST to waterquality.lcra.org, which is beyond bananas, because my best guess is that there is some preprocessing of the form data in JS before the form submits. I would use a version of the hydromet script, but the map just throws you back to the form to select parameters and do the POST. Maybe I'm missing something in one of these? https://gist.github.com/nathanhilbert/79bc915a002b985ca027
I might be able to help if you send me the lcra.py code. Looking at Chrome…
Quasi-private…
The following code works for me. It's a slightly modified version of the lcra.py code.

```python
from bs4 import BeautifulSoup
import requests


def _extract_headers_for_next_request(request):
    # collect every <input> so the ASP.NET session state (hidden inputs
    # like __VIEWSTATE) gets echoed back on the next request
    payload = dict()
    for tag in BeautifulSoup(request.content, 'html.parser').findAll('input'):
        tag_dict = dict(tag.attrs)
        # some tags don't have a value and are used w/ JS to toggle a set of checkboxes
        payload[tag_dict['name']] = tag_dict.get('value')
    return payload


def _make_next_request(url, previous_request, data):
    data_headers = _extract_headers_for_next_request(previous_request)
    data_headers.update(data)
    return requests.post(url, cookies=previous_request.cookies, data=data_headers)


url1 = 'http://waterquality.lcra.org/parameter.aspx?qrySite=12148'
url2 = 'http://waterquality.lcra.org/events.aspx'

initial_request = requests.get(url1)
# 00300, 00301 are parameter numbers
result = _make_next_request(url2, initial_request,
                            {'multiple': ['00300', '00301'], 'site': '12148'})

# write html file as a check ('wb' since result.content is bytes)
with open('test.html', 'wb') as f:
    f.write(result.content)
```

You can get the list of available parameters by parsing initial_request.
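A minimal sketch of what that parsing might look like, continuing from the script above. It assumes the parameter checkboxes on parameter.aspx are `<input>` elements named `multiple` (the same field name used in the POST); the actual markup may differ.

```python
# Hypothetical sketch: pull the available parameter codes out of the
# parameter.aspx response. Assumes each parameter checkbox is an <input>
# named 'multiple' whose value attribute holds the parameter number.
soup = BeautifulSoup(initial_request.content, 'html.parser')
parameter_codes = [tag.attrs['value']
                   for tag in soup.findAll('input', attrs={'name': 'multiple'})
                   if 'value' in tag.attrs]
print(parameter_codes)  # e.g. ['00300', '00301', ...]
```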
Adding in your parsing code from the gist seems to complete the process.

```python
from bs4 import BeautifulSoup
import requests


def _extract_headers_for_next_request(request):
    payload = dict()
    for tag in BeautifulSoup(request.content, 'html.parser').findAll('input'):
        tag_dict = dict(tag.attrs)
        # some tags don't have a value and are used w/ JS to toggle a set of checkboxes
        payload[tag_dict['name']] = tag_dict.get('value')
    return payload


def _make_next_request(url, previous_request, data):
    data_headers = _extract_headers_for_next_request(previous_request)
    data_headers.update(data)
    return requests.post(url, cookies=previous_request.cookies, data=data_headers)


url1 = 'http://waterquality.lcra.org/parameter.aspx?qrySite=12148'
url2 = 'http://waterquality.lcra.org/events.aspx'

initial_request = requests.get(url1)
# 00300, 00301 are parameter numbers
result = _make_next_request(url2, initial_request,
                            {'multiple': ['00300', '00301'], 'site': '12148'})

# parse the results table
soup = BeautifulSoup(result.content, 'html.parser')
gridview = soup.find(id="GridView1")

results = []
# get the headers and the index of them
headers = [head.string for head in gridview.findAll('th')]
# the page uses \xa0 (non-breaking space) for blank cells
for row in gridview.findAll('tr'):
    vals = [aux.string for aux in row.findAll('td')]
    results.append(dict(zip(headers, vals)))
print(results)
```
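If it helps downstream, the list of dicts drops straight into pandas. This is not part of the original snippet and assumes pandas is installed:

```python
import pandas as pd

# Hypothetical follow-up: tabulate the scraped rows. The header <tr> has
# no <td> cells, so it produced an empty dict that gets filtered out here;
# '\xa0' (the page's blank marker) is normalized to NaN.
df = pd.DataFrame([r for r in results if r])
df = df.replace(u'\xa0', float('nan'))
print(df.head())
```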
Yup. This solves it. I couldn't find my previous request cookies, and I…
If you are up to it, merging in the hydromet scraper code would be great. Having an lcra module that can hit the hydromet and waterquality pages would probably be useful to a lot of folks.
I wouldn't mind throwing in the hydromet code as well for #17. I'll create a new PR for that down the line. Any idea why AppVeyor is unhappy?
I haven't finished setting up AppVeyor; I created an account a few days ago. I'll push my refactor branch to GitHub tomorrow or Friday.
Hey, my refactor has kinda stalled and I'm not sure when I'll be able to pick it up again. In the meantime, I'm pushing a new version out (tagged 0.8) with Python 2 and 3 compatibility. Can you rebase this PR onto the latest master and I'll merge it in? We have AppVeyor & Travis integration now, so we should be able to test on Windows/Linux/OSX.
Hey Dharhas, Nathan contacted someone at LCRA and there is a possibility they can make their data available as a web service, so we should probably wait on this.
Yeah, they said that 10 years ago as well; I wouldn't hold my breath. It would be fairly easy to update this and merge it in, but that is y'all's call. I don't need this particular dataset for any of my work.
Pinging on this. We also need to get this data at TNRIS. Any updates on the LCRA web service situation, @solomon-negusse? Or the refactor situation, @dharhas?
The refactor is stalled out. I don't have time, and with no one else really contributing to ulmo, I'm not sure it is worth the effort. The best way forward right now would be to merge the sample code I wrote earlier in this discussion with the code from lcra.py and make a pull request for an lcra module for ulmo. It wouldn't be a huge amount of work to generalize both. I have no need for this data, so I can't justify working on it.
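For what it's worth, a rough sketch of what that generalization might look like, built only from the sample code in this thread; the function name and signature are hypothetical, not ulmo's actual API:

```python
from bs4 import BeautifulSoup
import requests


def get_site_data(site_code, parameter_codes):
    """Scrape water quality events for one LCRA site (hypothetical API)."""
    # first request establishes the session and yields the hidden inputs
    # (ASP.NET state) that must be echoed back on the POST
    initial = requests.get(
        'http://waterquality.lcra.org/parameter.aspx?qrySite=%s' % site_code)
    payload = {tag.attrs['name']: tag.attrs.get('value')
               for tag in BeautifulSoup(initial.content, 'html.parser').findAll('input')
               if 'name' in tag.attrs}
    payload.update({'multiple': parameter_codes, 'site': site_code})
    result = requests.post('http://waterquality.lcra.org/events.aspx',
                           cookies=initial.cookies, data=payload)
    # parse the results grid into row dicts, skipping the header row
    gridview = BeautifulSoup(result.content, 'html.parser').find(id='GridView1')
    headers = [head.string for head in gridview.findAll('th')]
    return [dict(zip(headers, [td.string for td in row.findAll('td')]))
            for row in gridview.findAll('tr') if row.findAll('td')]
```

A matching hydromet function could share the hidden-input/cookie plumbing, which is the part worth generalizing.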
No worries. I do and I can. I'll take a stab at it if we're not still holding off on the refactor.
Sounds good. Please make sure tests pass on py2/py3; AppVeyor/Travis should test on both.
Will do.
Hey, I just saw this. I sent a couple of emails to the guy at LCRA a few weeks ago asking if there was progress but didn't hear back :/ So I was going to work on this at some point this month, if you can wait a couple of weeks.
Superseded by #117.
This is a placeholder to continue work. I'm seeking thoughts on what the developers think about the ghost module and how to better handle the import. Ghost.py was necessary to interact with LCRA's .aspx framework, which uses multiple security measures that make scraping data difficult. Ghost.py depends on PySide, the Python bindings for Qt. That requires some compiling, but it can be installed from a binary.
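On the import question, one common pattern is to defer the Ghost.py import to call time so ulmo itself imports cleanly without PySide. A minimal sketch, assuming Ghost.py's usual `from ghost import Ghost` entry point:

```python
def _get_ghost():
    # deferred import: only users of the LCRA scraper pay the
    # Ghost.py/PySide dependency; everyone else can import ulmo freely
    try:
        from ghost import Ghost
    except ImportError:
        raise ImportError(
            "scraping LCRA water quality data requires ghost.py "
            "(and its PySide dependency); please install it first")
    return Ghost()
```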
Also, any thoughts on testing?
To do: