Add in LCRA waterquality data #111

Closed
nathanhilbert wants to merge 4 commits

Conversation

nathanhilbert (Author)

This is a placeholder so work can continue. I'm seeking the developers' thoughts on the Ghost.py module and how best to handle the import. Ghost.py was necessary to interact with the LCRA's .aspx framework, which uses multiple security measures that make scraping data difficult. Ghost.py depends on PySide, a Python binding for Qt; this requires some compiling, but it could be installed from a binary.
Also, any thoughts on testing?

To do:

  • return data from the date specified in the function call
  • return as_dataframe
  • figure out mock failure in the tests
  • document

@dharhas (Contributor) commented Jul 21, 2015

@nathanhilbert

A few points.

  1. I had built a custom scraper for the LCRA hydromet site. It should be in the twdb waterdatafortexas repo. It was built by inspecting the traffic back and forth with Chrome dev tools and using requests to mimic it. From what I recall, the first time you hit the site it creates some session information that you have to send back in all the subsequent requests. You may be able to build on that (a minimal sketch of the pattern follows this list).

  2. PySide is a very heavy dependency; if you can get away without using it, I think you will be better off in the long run. One of the design goals of ulmo is to be cross-platform.

  3. I'm currently refactoring the entire codebase to be more consistent across services and plugin-based (see issue Complete Refactor of Ulmo #109). With that approach you could either make it a plugin that is only enabled when Ghost.py is available or, if you would like, maintain your own set of plugins in a different repo.

  4. Looks like PySide and PyQt are available in the Anaconda Python distribution, so that would be the preferred way to install them.
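
A minimal sketch of the traffic-mimicking pattern from point 1 (the page path and form fields here are hypothetical; the real ones would come from watching the browser traffic in dev tools):

import requests

# A Session persists cookies across requests, so whatever session
# information the server sets on the first hit is sent back automatically.
session = requests.Session()

# The first request establishes the server-side session.
session.get('http://hydromet.lcra.org/')

# Later requests reuse the same cookies; 'somepage.aspx' and 'site' are
# placeholders for the real endpoint and form fields.
response = session.post('http://hydromet.lcra.org/somepage.aspx',
                        data={'site': '12148'})
print(response.status_code)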

@nathanhilbert (Author)

Thanks @dharhas

  1. I haven't been able to find anything related to pulling LCRA data in the TWDB org. WDFT has four repos associated with it. Would your script be in WDFT-common, perhaps?
  2. Yes, I don't like using PySide, but I was getting a 500 error using requests, mechanize, and RoboBrowser even when using sessions. It seems the JS needs to run for the call to be valid, which is why your work with Chrome dev tools probably succeeds.
    The new wheel format does a pretty good job of making modules cross-platform, and PySide is supported: http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyside
  3. I was planning on making this an optional part of the code, enabled when Ghost.py is available. I've seen Complete Refactor of Ulmo #109; is there any code, roadmap, or module architecture layout? Based on the new core code, I could see which approach might be best.
  4. Anaconda would be a straightforward bulk install, but non-Anaconda users could use the PySide wheel.

I'll hold off working on this until I find your scraping code and see more info on the rewrite.
On the less technical side, I contacted LCRA directly and got forwarded to the person who manages the site. We might be able to figure something out on their end.

@dharhas (Contributor) commented Jul 22, 2015

  • The wheels are an improvement but nowhere close to perfect; I haven't tested the PySide wheel, but I've had issues with GDAL and others. So far I have found conda the best way forward.
  • The code should be in the WDFT lake repo; I forget what it is called. There is a folder with scripts that download lake data from the USGS, LCRA, USACE, USBR, etc., every day using Celery. In some cases these are just wrappers around ulmo. The file you need is probably called lcra.py. @wilsaj should be able to find it for you if you can't. I no longer have access to the repo.
  • I'm currently experimenting with a couple of layouts for the code. I'm hoping to have something semi-working this week and push it to a branch for comments.
  • Ruben Solis has good contacts at LCRA. Unfortunately, their data infrastructure is pretty fixed, so you may be able to get a CD dump of their data, but I doubt they will change their delivery technique. If they are using KISTERS or one of the other commercial packages, they might be able to turn on WaterML2 services.

@wilsaj (Contributor) commented Jul 22, 2015

Hey Nathan.

The code Dharhas is talking about is in: https://github.com/twdb/swis/blob/master/data/lcra.py

The most relevant part is https://github.com/twdb/swis/blob/master/data/lcra.py#L222-L233

That script pulls from http://hydromet.lcra.org - I'm not sure how different the water quality site is, but it looks pretty similar. The "trick" with hydromet is that there is session data passed by the server in the form of cookies and hidden input elements that needs to be continually passed along (it's a chain that gets validated server-side on each subsequent request). It's bananas.
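
As a rough sketch of that chain (dharhas posts a fuller working version further down), each response's hidden inputs and cookies get folded into the next request:

import requests
from bs4 import BeautifulSoup

def follow_chain(url, previous_response):
    # ASP.NET stashes its validation state (__VIEWSTATE and friends) in
    # hidden <input> elements; collect them all from the previous page.
    soup = BeautifulSoup(previous_response.content, 'html.parser')
    payload = {tag.get('name'): tag.get('value')
               for tag in soup.find_all('input', type='hidden')
               if tag.get('name')}
    # Echo the hidden fields and cookies back so the server-side chain
    # validates the next request.
    return requests.post(url, data=payload, cookies=previous_response.cookies)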

@nathanhilbert (Author)

Got it. I'm actually making a POST to waterquality.lcra.org, which is beyond bananas, because my best guess is that there is some preprocessing of the form data in JS before the form submits. I would use a version of the hydromet script, but the map just throws you back to the form to select parameters and do the POST. Maybe I'm missing something in one of these? https://gist.github.com/nathanhilbert/79bc915a002b985ca027

@dharhas (Contributor) commented Jul 22, 2015

I might be able to help if you send me the lcra.py code. Looking at Chrome dev tools while accessing data on the WQ page, it looks very familiar. There is a __VIEWSTATE parameter that you need to get on the first request and pass along to subsequent calls.
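
A minimal sketch of just that step, using the URLs from this thread (in practice the other hidden fields, __EVENTVALIDATION and friends, need the same treatment, which is what the fuller snippet below does):

from bs4 import BeautifulSoup
import requests

# Fetch the form page, pull out __VIEWSTATE, and echo it back in the POST.
first = requests.get('http://waterquality.lcra.org/parameter.aspx?qrySite=12148')
soup = BeautifulSoup(first.content, 'html.parser')
viewstate = soup.find('input', {'name': '__VIEWSTATE'})['value']

result = requests.post('http://waterquality.lcra.org/events.aspx',
                       data={'__VIEWSTATE': viewstate, 'site': '12148'},
                       cookies=first.cookies)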


@nathanhilbert (Author)

Quasi-private: https://gist.github.com/nathanhilbert/de017ec906e173e5ac1d
The __VIEWSTATE is being handled by a hidden input along with a few other params. Thanks again.

@dharhas (Contributor) commented Jul 22, 2015

The following code works for me. It's a slightly modified version of the lcra.py code.

from bs4 import BeautifulSoup
import requests


def _extract_headers_for_next_request(request):
    payload = dict()
    for tag in BeautifulSoup(request.content, 'html.parser').findAll('input'):
        tag_dict = dict(tag.attrs)
        # some tags don't have a value and are used w/ JS to toggle a set of checkboxes
        payload[tag_dict['name']] = tag_dict.get('value')
    return payload


def _make_next_request(url, previous_request, data):
    data_headers = _extract_headers_for_next_request(previous_request)
    data_headers.update(data)
    return requests.post(url, cookies=previous_request.cookies, data=data_headers)


url1 = 'http://waterquality.lcra.org/parameter.aspx?qrySite=12148'
url2 = 'http://waterquality.lcra.org/events.aspx'

initial_request = requests.get(url1)
#00300, 00301 are parameter numbers
result = _make_next_request(url2, initial_request, {'multiple': ['00300', '00301'], 'site': '12148'})

# Parse the result with beautifulsoup... You can probably work this out

# write html file as a check ('wb' because result.content is bytes)
with open('test.html', 'wb') as f:
    f.write(result.content)

You can get the list of available parameters by parsing initial_request.
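
For example (a sketch; that the parameter checkboxes are the inputs named 'multiple' is an assumption based on the form data posted above, not on inspecting the page):

from bs4 import BeautifulSoup
import requests

initial_request = requests.get('http://waterquality.lcra.org/parameter.aspx?qrySite=12148')
soup = BeautifulSoup(initial_request.content, 'html.parser')
# Each parameter checkbox carries its parameter code as the value.
parameter_codes = [tag.get('value')
                   for tag in soup.find_all('input', {'name': 'multiple'})]
print(parameter_codes)  # e.g. ['00300', '00301', ...]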

@dharhas (Contributor) commented Jul 22, 2015

Adding in your parsing code from the gist seems to complete the process.

from bs4 import BeautifulSoup
import requests


def _extract_headers_for_next_request(request):
    payload = dict()
    for tag in BeautifulSoup(request.content, 'html.parser').findAll('input'):
        tag_dict = dict(tag.attrs)
        # some tags don't have a value and are used w/ JS to toggle a set of checkboxes
        payload[tag_dict['name']] = tag_dict.get('value')
    return payload


def _make_next_request(url, previous_request, data):
    data_headers = _extract_headers_for_next_request(previous_request)
    data_headers.update(data)
    return requests.post(url, cookies=previous_request.cookies, data=data_headers)


url1 = 'http://waterquality.lcra.org/parameter.aspx?qrySite=12148'
url2 = 'http://waterquality.lcra.org/events.aspx'

initial_request = requests.get(url1)
#00300, 00301 are parameter numbers
result = _make_next_request(url2, initial_request, {'multiple': ['00300', '00301'], 'site': '12148'})

# Parse the result with beautifulsoup... You can probably work this out

soup = BeautifulSoup(result.content, 'html.parser')
gridview = soup.find(id="GridView1")
results = []

# get the column headers from the th cells
headers = [head.string for head in gridview.findAll('th')]

# blank cells come through as \xa0

for row in gridview.findAll('tr'):
    vals = [aux.string for aux in row.findAll('td')]
    if not vals:  # the header row has th cells, not td, so skip it
        continue
    results.append(dict(zip(headers, vals)))

print(results)
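
Since the to-do list at the top mentions an as_dataframe option, the results list of dicts converts directly; a sketch, assuming pandas is installed:

import pandas as pd

# Each dict maps column header -> cell value, so the list feeds straight
# into a DataFrame.
df = pd.DataFrame(results)
print(df.head())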

@nathanhilbert (Author)

Yup, this solves it. I couldn't find my previous request cookies, and I didn't know the multiple param could be passed like that. I'll incorporate this code in the PR and squash. At least it can stand as a placeholder for the rewrite.


@dharhas (Contributor) commented Jul 22, 2015

If you are up to it, merging in the hydromet scraper code would be great. Having an LCRA module that can hit the hydromet and water quality pages would probably be useful to a lot of folks.

@nathanhilbert (Author)

I wouldn't mind throwing in the hydromet code as well for #17. I'll create a new PR for this down the line.

Any idea why appveyor is unhappy?
Any other notes on the most recent code updates?

nathanhilbert changed the title from "Add in LCRA data pulls would close #17" to "Add in LCRA waterquality data" on Jul 22, 2015
@dharhas (Contributor) commented Jul 22, 2015

I haven't finished setting up AppVeyor. I created an account a few days ago but haven't put in the info it needs to build and test ulmo.

I'll push my refactor branch to GitHub tomorrow or Friday.


@dharhas (Contributor) commented Sep 15, 2015

@nathanhilbert

Hey, my refactor has kinda stalled and I'm not sure when I'll be able to pick it up again. In the meantime, I'm pushing a new version out (tagged 0.8) with Python 2 and 3 compatibility. Can you rebase this PR onto the latest master, and I'll merge it in? We have AppVeyor & Travis integration now, so we should be able to test on Windows/Linux/OSX.

@solomon-negusse (Member)

Hey Dharhas, Nathan had contacted someone at LCRA, and there is a possibility they can make their data available as a web service, so we should probably wait on this.

@dharhas (Contributor) commented Sep 15, 2015

Yeah they said that 10 years ago as well. I wouldn't hold my breath.

It would be fairly easy to update this and merge it in. But that is y'all's call; I don't need this particular dataset for any of my work.

@wilsaj (Contributor) commented Nov 24, 2015

Pinging on this. We also need to get this data at TNRIS.

Any updates on the LCRA webservice situation @solomon-negusse ?

Or the refactor situation @dharhas ?

@dharhas (Contributor) commented Nov 24, 2015

@wilsaj @solomon-negusse

The refactor is stalled out. I don't have time, and with no one else really contributing to ulmo, I'm not sure it is worth the effort.

The best way forward right now would be to merge the sample code I wrote earlier in this discussion plus the code from lcra.py and make a pull request for an lcra module for ulmo. It wouldn't be a huge amount of work to generalize both. I have no need for this data, so I can't justify working on it.

@wilsaj (Contributor) commented Nov 24, 2015

No worries. I do and I can. I'll take a stab at it if we're not still holding off on the refactor.

@dharhas (Contributor) commented Nov 24, 2015

Sounds good. Please make sure tests pass on py2/py3; the AppVeyor/Travis setup should test on both.

@wilsaj (Contributor) commented Nov 24, 2015

Will do

@solomon-negusse (Member)

Hey, I just saw this. I sent a couple of emails to the guy at LCRA a few weeks ago asking if there was progress but didn't hear back :/ so I was going to work on this at some point this month, if you can wait a couple of weeks.

@dharhas (Contributor) commented Dec 22, 2015

Superseded by #117.

dharhas closed this on Dec 22, 2015.
emiliom mentioned this pull request on Sep 2, 2021.