.extract() is unable to get data properly from sparse tables #192
Comments
AFAICT, this is expected. Were you expecting some other value?
PS: to reproduce:
In [1]: html = """<!DOCTYPE html>
...: <html lang="en">
...: <table class="manual_table">
...: <thead>
...: <tr>
...: <th class="">Mar 2008</th>
...: <th class="">Mar 2009</th>
...: <th class="">Mar 2010</th>
...: </tr>
...: </thead>
...: <tbody>
...: <tr>
...: <td class="">8,626</td>
...: <td class="">8,427</td>
...: <td class="">11,525</td>
...: </tr>
...: <tr>
...: <td class="">16,408</td>
...: <td class="">19,582</td>
...: <td class=""></td>
...: </tr>
...: <tr>
...: <td class=""></td>
...: <td class="">22,574</td>
...: <td class="">21,755</td>
...: </tr>
...: </tbody>
...: </table>"""
In [2]: from parsel import Selector
In [3]: s = Selector(text=html)
In [4]: rows = s.css(".manual_table tbody tr")
In [5]: rows[0].css("td::text").extract()
Out[5]: ['8,626', '8,427', '11,525']
In [6]: rows[1].css("td::text").extract()
Out[6]: ['16,408', '19,582']
In [7]: rows[2].css("td::text").extract()
Out[7]: ['22,574', '21,755']
Thanks for the clarification.
Is there a better way to parse a sparse table other than the looping method? What I suggest is that similar
Is anyone looking into this?
I believe that is the right way to do it with Parsel.
@shubham-MLwiz
>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
Full code
from parsel import Selector
html = """<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
<thead>
<tr>
<th class="">Mar 2008</th>
<th class="">Mar 2009</th>
<th class="">Mar 2010</th>
</tr>
</thead>
<tbody>
<tr>
<td class="">8,626</td>
<td class="">8,427</td>
<td class="">11,525</td>
</tr>
<tr>
<td class="">16,408</td>
<td class="">19,582</td>
<td class=""></td>
</tr>
<tr>
<td class=""></td>
<td class="">22,574</td>
<td class="">21,755</td>
</tr>
</tbody>
</table>
</html>"""
s = Selector(text=html)
rows = s.css(".manual_table tbody tr")
dt = []
for row in rows:
    for data in row.css("td"):
        # default='' keeps a placeholder for cells that have no text node
        dt.append(data.css("::text").get(default=''))
print("Loop:", dt)
dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
print("One-liner:", dt2)
Output
I'm commenting on this old issue because I've faced it today.
I created a manual table to reproduce the bug I am facing.
When I run the code below on the above HTML, this is the output I get:
As you can see, it does not give proper output for empty data cells. It ignores all empty values, which seems to be a bug.
Similarly, if you run the code below, you will find some weird results. I am confused because it is not supposed to behave like this.
Both .getall() and .extract() show the same issue.