Unstructured library is unable to extract CDATA from XML data #3075

@PhaneendraGunda

Description

Sample XML:

<GENERAL_INFO><TITLE><![CDATA[Mobile Apple Devices (iPhones, iPads, and Smartwatches)]]></TITLE><SUMMARY><![CDATA[<p>This article highlights the key benefits and specifications of Apple iPhones, iPads, and Smartwatches.</p></SUMMARY></GENERAL_INFO>

Code to fetch data from the XML:

from unstructured.partition.html import partition_html

# _html_text holds the sample XML above as a string
_text = ' '.join([element.text for element in partition_html(text=_html_text)])

Is there a flag or function that enables extracting the content of CDATA sections?
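
For comparison, here is a minimal workaround sketch that uses only the Python standard library rather than any unstructured feature. It relies on the fact that xml.etree.ElementTree returns CDATA content as ordinary element text. It assumes the sample above is meant to be well-formed, i.e. that the SUMMARY section is closed with ]]> before </SUMMARY>. The names SAMPLE_XML, strip_html, and extract_cdata_text are hypothetical helpers introduced for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: same sample as in the issue, with the SUMMARY CDATA closed by "]]>".
SAMPLE_XML = (
    "<GENERAL_INFO>"
    "<TITLE><![CDATA[Mobile Apple Devices (iPhones, iPads, and Smartwatches)]]></TITLE>"
    "<SUMMARY><![CDATA[<p>This article highlights the key benefits and "
    "specifications of Apple iPhones, iPads, and Smartwatches.</p>]]></SUMMARY>"
    "</GENERAL_INFO>"
)


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment with whitespace normalized."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def extract_cdata_text(xml_text):
    """Parse the XML and return the non-empty text of every element.

    ElementTree exposes CDATA content through element.text, so no
    CDATA-specific handling is needed here.
    """
    root = ET.fromstring(xml_text)
    texts = []
    for element in root.iter():
        if element.text and element.text.strip():
            texts.append(strip_html(element.text))
    return texts


if __name__ == "__main__":
    print(" ".join(extract_cdata_text(SAMPLE_XML)))
```

If element typing from unstructured is still needed, each extracted CDATA string could instead be passed to partition_html(text=...) in place of strip_html. Whether that gives the desired elements for this data is untested here.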

Labels: bug (Something isn't working), xml (Related to partitioning XML documents)
