Add in LCRA waterquality data #111
Conversation
A couple of points.
Thanks @dharhas.
I'll hold off work on this until I find your scraping code and see more info on the rewrite.
Hey Nathan. The code Dharhas is talking about is in https://github.com/twdb/swis/blob/master/data/lcra.py; the most relevant part is https://github.com/twdb/swis/blob/master/data/lcra.py#L222-L233. That script pulls from http://hydromet.lcra.org. I'm not sure how different the water quality site is, but it looks pretty similar. The "trick" with hydromet is that the server passes session data in the form of cookies and hidden input elements, and that data needs to be continually passed along (it's a chain that gets validated server-side on each subsequent request). It's bananas.
Got it. I'm actually making a POST to waterquality.lcra.org, which is beyond bananas, because my best guess is that there is some preprocessing of the form data in JS before the form submits. I would use a version of the hydromet script, but the map just throws you back to the form to select parameters and do the POST. Maybe I'm missing something in one of these? https://gist.github.com/nathanhilbert/79bc915a002b985ca027
I might be able to help if you send me the lcra.py code. Looking at Chrome…
Quasi-private…
The following code works for me. It's a slightly modified version of the lcra.py code.

```python
from bs4 import BeautifulSoup
import requests


def _extract_headers_for_next_request(request):
    # collect every <input> so the ASP.NET session state (hidden inputs
    # like __VIEWSTATE) gets echoed back on the next request
    payload = dict()
    for tag in BeautifulSoup(request.content, 'html.parser').findAll('input'):
        tag_dict = dict(tag.attrs)
        # some tags don't have a value and are used w/ JS to toggle a set of checkboxes
        payload[tag_dict['name']] = tag_dict.get('value')
    return payload


def _make_next_request(url, previous_request, data):
    data_headers = _extract_headers_for_next_request(previous_request)
    data_headers.update(data)
    return requests.post(url, cookies=previous_request.cookies, data=data_headers)


url1 = 'http://waterquality.lcra.org/parameter.aspx?qrySite=12148'
url2 = 'http://waterquality.lcra.org/events.aspx'

initial_request = requests.get(url1)
# 00300, 00301 are parameter numbers
result = _make_next_request(url2, initial_request,
                            {'multiple': ['00300', '00301'], 'site': '12148'})

# write html file as a check ('wb' since result.content is bytes)
with open('test.html', 'wb') as f:
    f.write(result.content)
```

You can get the list of available parameters by parsing initial_request.
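A minimal sketch of what that parsing might look like, continuing from the script above. It assumes the parameter checkboxes on parameter.aspx are `<input>` elements named `multiple` (the same field name used in the POST); the actual markup may differ.

```python
# Hypothetical sketch: pull the available parameter codes out of the
# parameter.aspx response. Assumes each parameter checkbox is an <input>
# named 'multiple' whose value attribute holds the parameter number.
soup = BeautifulSoup(initial_request.content, 'html.parser')
parameter_codes = [tag.attrs['value']
                   for tag in soup.findAll('input', attrs={'name': 'multiple'})
                   if 'value' in tag.attrs]
print(parameter_codes)  # e.g. ['00300', '00301', ...]
```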
Adding in your parsing code from the gist seems to complete the process.

```python
from bs4 import BeautifulSoup
import requests


def _extract_headers_for_next_request(request):
    payload = dict()
    for tag in BeautifulSoup(request.content, 'html.parser').findAll('input'):
        tag_dict = dict(tag.attrs)
        # some tags don't have a value and are used w/ JS to toggle a set of checkboxes
        payload[tag_dict['name']] = tag_dict.get('value')
    return payload


def _make_next_request(url, previous_request, data):
    data_headers = _extract_headers_for_next_request(previous_request)
    data_headers.update(data)
    return requests.post(url, cookies=previous_request.cookies, data=data_headers)


url1 = 'http://waterquality.lcra.org/parameter.aspx?qrySite=12148'
url2 = 'http://waterquality.lcra.org/events.aspx'

initial_request = requests.get(url1)
# 00300, 00301 are parameter numbers
result = _make_next_request(url2, initial_request,
                            {'multiple': ['00300', '00301'], 'site': '12148'})

# parse the results table
soup = BeautifulSoup(result.content, 'html.parser')
gridview = soup.find(id="GridView1")

results = []
# get the headers and the index of them
headers = [head.string for head in gridview.findAll('th')]
# the page uses \xa0 (non-breaking space) for blank cells
for row in gridview.findAll('tr'):
    vals = [aux.string for aux in row.findAll('td')]
    results.append(dict(zip(headers, vals)))
print(results)
```
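If it helps downstream, the list of dicts drops straight into pandas. This is not part of the original snippet and assumes pandas is installed:

```python
import pandas as pd

# Hypothetical follow-up: tabulate the scraped rows. The header <tr> has
# no <td> cells, so it produced an empty dict that gets filtered out here;
# '\xa0' (the page's blank marker) is normalized to NaN.
df = pd.DataFrame([r for r in results if r])
df = df.replace(u'\xa0', float('nan'))
print(df.head())
```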
Yup. This solves it. I couldn't find my previous request cookies, and I…
If you are up to it, merging in the hydromet scraper code would be great. Having an lcra module that can hit the hydromet and waterquality pages would probably be useful to a lot of folks.
I wouldn't mind throwing in the hydromet code as well for #17. I'll create a new PR for that down the line. Any idea why AppVeyor is unhappy?
I haven't finished setting up AppVeyor; I created an account a few days ago. I'll push my refactor branch to GitHub tomorrow or Friday.
Hey, my refactor has kinda stalled and I'm not sure when I'll be able to pick it up again. In the meantime, I'm pushing a new version out (tagged 0.8) with Python 2 and 3 compatibility. Can you rebase this PR onto the latest master and I'll merge it in? We have AppVeyor & Travis integration now, so we should be able to test on Windows/Linux/OSX.
Hey Dharhas, Nathan contacted someone at LCRA and there is a possibility they can make their data available as a web service, so we should probably wait on this.
Yeah, they said that 10 years ago as well; I wouldn't hold my breath. It would be fairly easy to update this and merge it in, but that is y'all's call. I don't need this particular dataset for any of my work.
Pinging on this. We also need to get this data at TNRIS. Any updates on the LCRA web service situation, @solomon-negusse? Or the refactor situation, @dharhas?
The refactor is stalled out. I don't have time, and with no one else really contributing to ulmo, I'm not sure it is worth the effort. The best way forward right now would be to merge the sample code I wrote earlier in this discussion with the code from lcra.py and make a pull request for an lcra module for ulmo. It wouldn't be a huge amount of work to generalize both. I have no need for this data, so I can't justify working on it.
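For what it's worth, a rough sketch of what that generalization might look like, built only from the sample code in this thread; the function name and signature are hypothetical, not ulmo's actual API:

```python
from bs4 import BeautifulSoup
import requests


def get_site_data(site_code, parameter_codes):
    """Scrape water quality events for one LCRA site (hypothetical API)."""
    # first request establishes the session and yields the hidden inputs
    # (ASP.NET state) that must be echoed back on the POST
    initial = requests.get(
        'http://waterquality.lcra.org/parameter.aspx?qrySite=%s' % site_code)
    payload = {tag.attrs['name']: tag.attrs.get('value')
               for tag in BeautifulSoup(initial.content, 'html.parser').findAll('input')
               if 'name' in tag.attrs}
    payload.update({'multiple': parameter_codes, 'site': site_code})
    result = requests.post('http://waterquality.lcra.org/events.aspx',
                           cookies=initial.cookies, data=payload)
    # parse the results grid into row dicts, skipping the header row
    gridview = BeautifulSoup(result.content, 'html.parser').find(id='GridView1')
    headers = [head.string for head in gridview.findAll('th')]
    return [dict(zip(headers, [td.string for td in row.findAll('td')]))
            for row in gridview.findAll('tr') if row.findAll('td')]
```

A matching hydromet function could share the hidden-input/cookie plumbing, which is the part worth generalizing.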
No worries. I do and I can. I'll take a stab at it if we're not still holding off on the refactor.
Sounds good. Please make sure tests pass on py2/py3; AppVeyor/Travis should test on both.
Will do.
Hey, I just saw this. I sent a couple of emails to the guy at LCRA a few weeks ago asking if there was progress but didn't hear back :/ So I was going to work on this at some point this month, if you can wait a couple of weeks.
Superseded by #117.
This is a placeholder to continue work. I'm seeking thoughts on what the developers think about the ghost module and how to better handle the import. Ghost.py was necessary to interact with LCRA's .aspx framework, which uses multiple security measures that make scraping data difficult. Ghost.py depends on PySide, the Python bindings for Qt. That requires some compiling, but it can be installed from a binary.
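On the import question, one common pattern is to defer the Ghost.py import to call time so ulmo itself imports cleanly without PySide. A minimal sketch, assuming Ghost.py's usual `from ghost import Ghost` entry point:

```python
def _get_ghost():
    # deferred import: only users of the LCRA scraper pay the
    # Ghost.py/PySide dependency; everyone else can import ulmo freely
    try:
        from ghost import Ghost
    except ImportError:
        raise ImportError(
            "scraping LCRA water quality data requires ghost.py "
            "(and its PySide dependency); please install it first")
    return Ghost()
```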
Also, any thoughts on testing?
To do: