problem with html2creole AttributeError: 'NoneType' object has no attribute 'parent' #6

Closed
binarytemple opened this issue Feb 20, 2012 · 3 comments

Comments

@binarytemple

I'm getting an error when running the following code

f = open("/tmp/test.html","r")
fr = f.read()
html2creole(unicode(fr,errors='ignore'))

In [53]: html2creole(unicode(fr,errors='ignore'))

AttributeError Traceback (most recent call last)

/tmp/ in ()

/usr/local/lib/python2.7/dist-packages/creole/__init__.pyc in html2creole(html_string, debug, parser_kwargs, emitter_kwargs, unknown_emit)
110 warnings.warn("parser_kwargs argument in html2creole would be removed in the future!", PendingDeprecationWarning)
111
--> 112 document_tree = parse_html(html_string, debug=debug)
113
114 emitter_kwargs2 = {

/usr/local/lib/python2.7/dist-packages/creole/__init__.pyc in parse_html(html_string, debug)
91
92 h2c = HtmlParser(debug=debug)
---> 93 document_tree = h2c.feed(html_string)
94 if debug:
95 h2c.debug()

/usr/local/lib/python2.7/dist-packages/creole/html_parser/parser.pyc in feed(self, raw_data)
157 # print("-"*79)
158
--> 159 HTMLParser2.feed(self, data)
160
161 return self.root

/usr/lib/python2.7/HTMLParser.pyc in feed(self, data)
107 """
108 self.rawdata = self.rawdata + data
--> 109 self.goahead(0)
110
111 def close(self):

/usr/lib/python2.7/HTMLParser.pyc in goahead(self, end)
151 k = self.parse_starttag(i)
152 elif startswith("</", i):
--> 153 k = self.parse_endtag(i)
154 elif startswith("<!--", i):
155 k = self.parse_comment(i)

/usr/local/lib/python2.7/dist-packages/creole/shared/html_parser.pyc in parse_endtag(self, i)
98 return j
99 # --- changed end -----------------------------------------------------
--> 100 self.handle_endtag(tag.lower())
101 self.clear_cdata_mode()
102 return j

/usr/local/lib/python2.7/dist-packages/creole/html_parser/parser.pyc in handle_endtag(self, tag)
255 self._go_up()
256 else:
--> 257 self.cur = self.cur.parent
258
259 #-------------------------------------------------------------------------

AttributeError: 'NoneType' object has no attribute 'parent'

Here's the actual HTML (I don't know if I can attach files):

<html>
 <head>
  <title>
   Regions - Online Help - EN
  </title>
  <link href="AppStyles.css" type="text/css" rel="stylesheet" />
  <link href="pagestyles.css" type="text/css" rel="stylesheet" />
  <link href="style_blue.css" type="text/css" rel="stylesheet" />
  <script type="text/javascript" src="static_page.js">
  </script>
  <meta http-equiv="Cache-Control" content="no-cache" />
  <meta http-equiv="Pragma" content="no-cache" />
  <meta http-equiv="expires" content="FRI, 13 APR 1999 01:00:00 GMT" />
  <meta name="ROBOTS" content="NOINDEX, NOFOLLOW, NOARCHIVE" />
 </head>
 <body class="page_body">
  <p>
   <span class="breadcrumbs">
    <a href="Welcome.htm" title="">
     Home
    </a>
    &nbsp;&gt;&nbsp;
    <a href="Welcome.htm" title="">
     Welcome
    </a>
    &nbsp;&gt;&nbsp;
    <a href="reporting1.htm" title="">
     Reporting
    </a>
    &nbsp;&gt;&nbsp;
    <a href="regions.htm" title="">
     Regions
    </a>
   </span>
  </p>
  <p>
   <span class="heading">
    Regions
   </span>
  </p>
  <p>
   This demographic report allows you to view the regional breakdown of mentions by country.&nbsp;
   <br />
   Viewed via the
   <img alt="" style="border:0px solid;" src="./images/regions.gif" />
   icon in the
   <strong>
    Icon Panel
   </strong>
   . It can also be viewed by double-clicking on the
   <strong>
    <a href="summary_dashboard1.htm">
     Summary&nbsp;Dashboard
    </a>
   </strong>
   and selecting the&nbsp;appropriate&nbsp;option. &nbsp;This has two components,
   <strong>
    Report
   </strong>
   and
   <strong>
   </strong>
   <strong>
    <a href="data_explorer.htm">
     Data Explorer
    </a>
   </strong>
   .
  </p>
  <br />
  <span style="font-size: 18px;">
   <strong>
    Report
   </strong>
  </span>
  <br />
  <br />
  <img alt="" style="border:0px solid;" src="./images/Regions.png" />
  <br />
  <br />
  You can change the way that the mentions are displayed using the drop down list accessed via the
  <img alt="" style="border:0px solid;" src="./images/config-over.gif" />
  icon.
  <br />
  <br />
  <strong>
   <span style="font-size: 16px;">
   </span>
  </strong>
  <strong>
   <a href="data_explorer.htm">
    Data Explorer
   </a>
  </strong>
  <br />
  <br />
  The
  <strong>
  </strong>
  <strong>
   <a href="data_explorer.htm">
    Data Explorer
   </a>
  </strong>
  displays the mentions that make up the data shown in the
  <strong>
   Report
  </strong>
  panel. In addition there is the ability to filter the mentions by country via the filter located to the right of the
  <strong>
   Email
  </strong>
  button.&nbsp;
 </body>
</html>
@jedie
Owner

jedie commented Apr 4, 2012

Sorry for the late response.

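You have to extract the body content and pass only that to html2creole().

Do something like this:

import re
from creole import html2creole

body_re = re.compile(r'<body[^>]*>(.*?)</body>', re.S | re.I)

f = open("/tmp/test.html","r")
html = f.read()
f.close()
# findall() returns a list of matches, so take the first <body> content
content = body_re.findall(html)[0]
# decode to unicode, as in the original report
creole = html2creole(unicode(content, errors='ignore'))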
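
The body-extraction step above can also be sketched on its own, without python-creole installed. This is only a minimal Python 3 illustration using the standard library; `extract_body` is a hypothetical helper name, not part of python-creole, and falling back to the unchanged input when no `<body>` is found is an assumption of this sketch.

```python
import re

# Same pattern as in the comment above: capture everything between
# the opening <body ...> tag and the closing </body> tag.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.S | re.I)


def extract_body(html):
    """Return the content of the first <body> element.

    If the input has no <body> element, it is returned unchanged
    (an assumption of this sketch).
    """
    match = BODY_RE.search(html)
    if match is None:
        return html
    return match.group(1).strip()


sample = (
    "<html><head><title>t</title></head>"
    "<body class='page_body'><p>Regions</p></body></html>"
)
print(extract_body(sample))  # prints: <p>Regions</p>
```

The returned string is what would then be handed to html2creole().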

@jedie jedie closed this as completed Apr 4, 2012
@jedie
Owner

jedie commented Apr 4, 2012

I found a bug related to "AttributeError: 'NoneType' object has no attribute 'parent'" and fixed it in 9e5b5dd.

I have created a new release, v1.0.2.
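
The traceback in the report ends at `self.cur = self.cur.parent` inside `handle_endtag`. One way such an error can arise in a tree-building parser is an end tag arriving while the cursor is already at the root, so the cursor becomes `None` and the next end tag fails. The sketch below (Python 3, standard library only) shows a guard against that. `Node` and `TreeBuilder` are hypothetical names for illustration; this is not the code from 9e5b5dd and may not match the actual cause fixed there.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical tree node: a tag name, a parent link, and children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a simple node tree and never moves the cursor above the root."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        # Descend into a new child node for every start tag.
        self.cur = Node(tag, self.cur)

    def handle_endtag(self, tag):
        # Guard: without this check, a surplus end tag at the root would
        # set self.cur to None, and the following end tag would raise
        # AttributeError: 'NoneType' object has no attribute 'parent'.
        if self.cur.parent is not None:
            self.cur = self.cur.parent


builder = TreeBuilder()
builder.feed("<p>text</p></p></div>")  # two surplus end tags
print(builder.cur is builder.root)
```

With the guard in place, the surplus end tags are simply ignored and the cursor stays at the root.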

@binarytemple
Author

Dude. It works beautifully now. You are the man!

Here is a full sample that works with the html from my first post, no need for the regex.

from creole import html2creole

# Python 2: read the file, then decode to unicode before converting
f = open("/tmp/blah.html","r")
html = f.read()
f.close()
creole = html2creole(unicode(html))
print creole
