Skip to content

Commit

Permalink
DOC: move enhancements in 1.5 release notes (pandas-dev#47375)
Browse files Browse the repository at this point in the history
* DOC: move enhancements in 1.5 release notes

* remove rogue hash
  • Loading branch information
simonjayhawkins authored and yehoshuadimarsky committed Jul 13, 2022
1 parent e97367e commit 334ffc2
Showing 1 changed file with 82 additions and 76 deletions.
158 changes: 82 additions & 76 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,85 @@ If the compression method cannot be inferred, use the ``compression`` argument:
(``mode`` being one of ``tarfile.open``'s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)


.. _whatsnew_150.enhancements.read_xml_dtypes:

read_xml now supports ``dtype``, ``converters``, and ``parse_dates``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns,
apply converter methods, and parse dates (:issue:`43567`).

.. ipython:: python
xml_dates = """<?xml version='1.0' encoding='utf-8'?>
<data>
<row>
<shape>square</shape>
<degrees>00360</degrees>
<sides>4.0</sides>
<date>2020-01-01</date>
</row>
<row>
<shape>circle</shape>
<degrees>00360</degrees>
<sides/>
<date>2021-01-01</date>
</row>
<row>
<shape>triangle</shape>
<degrees>00180</degrees>
<sides>3.0</sides>
<date>2022-01-01</date>
</row>
</data>"""
df = pd.read_xml(
xml_dates,
dtype={'sides': 'Int64'},
converters={'degrees': str},
parse_dates=['date']
)
df
df.dtypes
.. _whatsnew_150.enhancements.read_xml_iterparse:

read_xml now supports large XML using ``iterparse``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
which are memory-efficient methods to iterate through XML trees and extract specific elements
and attributes without holding entire tree in memory (:issue:`45442`).

.. code-block:: ipython
In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]})
... )
df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
[3578765 rows x 3 columns]
.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse

.. _whatsnew_150.enhancements.other:

Other enhancements
Expand Down Expand Up @@ -294,83 +373,10 @@ upon serialization. (Related issue :issue:`12997`)
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_150.api_breaking.read_xml_dtypes:

read_xml now supports ``dtype``, ``converters``, and ``parse_dates``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns,
apply converter methods, and parse dates (:issue:`43567`).

.. ipython:: python
xml_dates = """<?xml version='1.0' encoding='utf-8'?>
<data>
<row>
<shape>square</shape>
<degrees>00360</degrees>
<sides>4.0</sides>
<date>2020-01-01</date>
</row>
<row>
<shape>circle</shape>
<degrees>00360</degrees>
<sides/>
<date>2021-01-01</date>
</row>
<row>
<shape>triangle</shape>
<degrees>00180</degrees>
<sides>3.0</sides>
<date>2022-01-01</date>
</row>
</data>"""
.. _whatsnew_150.api_breaking.api_breaking1:

df = pd.read_xml(
xml_dates,
dtype={'sides': 'Int64'},
converters={'degrees': str},
parse_dates=['date']
)
df
df.dtypes
.. _whatsnew_150.read_xml_iterparse:

read_xml now supports large XML using ``iterparse``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
which are memory-efficient methods to iterate through XML trees and extract specific elements
and attributes without holding entire tree in memory (:issue:`#45442`).

.. code-block:: ipython
In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]})
... )
df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
[3578765 rows x 3 columns]
.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
api_breaking_change1
^^^^^^^^^^^^^^^^^^^^

.. _whatsnew_150.api_breaking.api_breaking2:

Expand Down

0 comments on commit 334ffc2

Please sign in to comment.