From 3a45ddff4beaf7577441151d17bd8966a1e598ef Mon Sep 17 00:00:00 2001 From: cobalt <61329810+cobaltt7@users.noreply.github.com> Date: Sun, 2 Feb 2025 03:47:05 +0000 Subject: [PATCH 1/4] Split up the IO docs --- doc/redirects.csv | 3 +- doc/source/user_guide/index.rst | 2 +- doc/source/user_guide/io.rst | 6487 ----------------- doc/source/user_guide/io/clipboard.rst | 57 + .../user_guide/io/community_packages.rst | 27 + doc/source/user_guide/io/csv.rst | 1729 +++++ doc/source/user_guide/io/excel.rst | 531 ++ doc/source/user_guide/io/feather.rst | 67 + doc/source/user_guide/io/hdf5.rst | 1096 +++ doc/source/user_guide/io/html.rst | 459 ++ doc/source/user_guide/io/index.rst | 273 + doc/source/user_guide/io/json.rst | 618 ++ doc/source/user_guide/io/latex.rst | 35 + doc/source/user_guide/io/orc.rst | 62 + doc/source/user_guide/io/parquet.rst | 187 + doc/source/user_guide/io/pickling.rst | 121 + doc/source/user_guide/io/sas.rst | 47 + doc/source/user_guide/io/spss.rst | 38 + doc/source/user_guide/io/sql.rst | 523 ++ doc/source/user_guide/io/stata.rst | 171 + doc/source/user_guide/io/xml.rst | 548 ++ doc/source/whatsnew/v1.4.0.rst | 2 +- 22 files changed, 6593 insertions(+), 6490 deletions(-) delete mode 100644 doc/source/user_guide/io.rst create mode 100644 doc/source/user_guide/io/clipboard.rst create mode 100644 doc/source/user_guide/io/community_packages.rst create mode 100644 doc/source/user_guide/io/csv.rst create mode 100644 doc/source/user_guide/io/excel.rst create mode 100644 doc/source/user_guide/io/feather.rst create mode 100644 doc/source/user_guide/io/hdf5.rst create mode 100644 doc/source/user_guide/io/html.rst create mode 100644 doc/source/user_guide/io/index.rst create mode 100644 doc/source/user_guide/io/json.rst create mode 100644 doc/source/user_guide/io/latex.rst create mode 100644 doc/source/user_guide/io/orc.rst create mode 100644 doc/source/user_guide/io/parquet.rst create mode 100644 doc/source/user_guide/io/pickling.rst create mode 100644 
doc/source/user_guide/io/sas.rst create mode 100644 doc/source/user_guide/io/spss.rst create mode 100644 doc/source/user_guide/io/sql.rst create mode 100644 doc/source/user_guide/io/stata.rst create mode 100644 doc/source/user_guide/io/xml.rst diff --git a/doc/redirects.csv b/doc/redirects.csv index c11e4e242f128..fc5cb74b60979 100644 --- a/doc/redirects.csv +++ b/doc/redirects.csv @@ -24,7 +24,8 @@ gotchas,user_guide/gotchas groupby,user_guide/groupby indexing,user_guide/indexing integer_na,user_guide/integer_na -io,user_guide/io +io,user_guide/io/index +user_guide/io,user_guide/io/index merging,user_guide/merging missing_data,user_guide/missing_data options,user_guide/options diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst index f0d6a76f0de5b..220456848211a 100644 --- a/doc/source/user_guide/index.rst +++ b/doc/source/user_guide/index.rst @@ -63,7 +63,7 @@ Guides 10min dsintro basics - io + io/index pyarrow indexing advanced diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst deleted file mode 100644 index daf323acff129..0000000000000 --- a/doc/source/user_guide/io.rst +++ /dev/null @@ -1,6487 +0,0 @@ -.. _io: - -.. currentmodule:: pandas - - -=============================== -IO tools (text, CSV, HDF5, ...) -=============================== - -The pandas I/O API is a set of top level ``reader`` functions accessed like -:func:`pandas.read_csv` that generally return a pandas object. The corresponding -``writer`` functions are object methods that are accessed like -:meth:`DataFrame.to_csv`. Below is a table containing available ``readers`` and -``writers``. - -.. 
csv-table:: - :header: "Format Type", "Data Description", "Reader", "Writer" - :widths: 30, 100, 60, 60 - - text,`CSV `__, :ref:`read_csv`, :ref:`to_csv` - text,Fixed-Width Text File, :ref:`read_fwf` , NA - text,`JSON `__, :ref:`read_json`, :ref:`to_json` - text,`HTML `__, :ref:`read_html`, :ref:`to_html` - text,`LaTeX `__, :ref:`Styler.to_latex` , NA - text,`XML `__, :ref:`read_xml`, :ref:`to_xml` - text, Local clipboard, :ref:`read_clipboard`, :ref:`to_clipboard` - binary,`MS Excel `__ , :ref:`read_excel`, :ref:`to_excel` - binary,`OpenDocument `__, :ref:`read_excel`, NA - binary,`HDF5 Format `__, :ref:`read_hdf`, :ref:`to_hdf` - binary,`Feather Format `__, :ref:`read_feather`, :ref:`to_feather` - binary,`Parquet Format `__, :ref:`read_parquet`, :ref:`to_parquet` - binary,`ORC Format `__, :ref:`read_orc`, :ref:`to_orc` - binary,`Stata `__, :ref:`read_stata`, :ref:`to_stata` - binary,`SAS `__, :ref:`read_sas` , NA - binary,`SPSS `__, :ref:`read_spss` , NA - binary,`Python Pickle Format `__, :ref:`read_pickle`, :ref:`to_pickle` - SQL,`SQL `__, :ref:`read_sql`,:ref:`to_sql` - -:ref:`Here ` is an informal performance comparison for some of these IO methods. - -.. note:: - For examples that use the ``StringIO`` class, make sure you import it - with ``from io import StringIO`` for Python 3. - -.. _io.read_csv_table: - -CSV & text files ----------------- - -The workhorse function for reading text files (a.k.a. flat files) is -:func:`read_csv`. See the :ref:`cookbook` for some advanced strategies. - -Parsing options -''''''''''''''' - -:func:`read_csv` accepts the following common arguments: - -Basic -+++++ - -filepath_or_buffer : various - Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`) - URL (including http, ftp, and S3 - locations), or any object with a ``read()`` method (such as an open file or - :class:`~python:io.StringIO`). -sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table` - Delimiter to use. 
If sep is ``None``, the C engine cannot automatically detect - the separator, but the Python parsing engine can, meaning the latter will be - used and automatically detect the separator by Python's builtin sniffer tool, - :class:`python:csv.Sniffer`. In addition, separators longer than 1 character and - different from ``'\s+'`` will be interpreted as regular expressions and - will also force the use of the Python parsing engine. Note that regex - delimiters are prone to ignoring quoted data. Regex example: ``'\\r\\t'``. -delimiter : str, default ``None`` - Alternative argument name for sep. - -Column and index locations and names -++++++++++++++++++++++++++++++++++++ - -header : int or list of ints, default ``'infer'`` - Row number(s) to use as the column names, and the start of the - data. Default behavior is to infer the column names: if no names are - passed the behavior is identical to ``header=0`` and column names - are inferred from the first line of the file, if column names are - passed explicitly then the behavior is identical to - ``header=None``. Explicitly pass ``header=0`` to be able to replace - existing names. - - The header can be a list of ints that specify row locations - for a MultiIndex on the columns e.g. ``[0,1,3]``. Intervening rows - that are not specified will be skipped (e.g. 2 in this example is - skipped). Note that this parameter ignores commented lines and empty - lines if ``skip_blank_lines=True``, so header=0 denotes the first - line of data rather than the first line of the file. -names : array-like, default ``None`` - List of column names to use. If file contains no header row, then you should - explicitly pass ``header=None``. Duplicates in this list are not allowed. -index_col : int, str, sequence of int / str, or False, optional, default ``None`` - Column(s) to use as the row labels of the ``DataFrame``, either given as - string name or column index. If a sequence of int / str is given, a - MultiIndex is used. - - .. 
note:: - ``index_col=False`` can be used to force pandas to *not* use the first - column as the index, e.g. when you have a malformed file with delimiters at - the end of each line. - - The default value of ``None`` instructs pandas to guess. If the number of - fields in the column header row is equal to the number of fields in the body - of the data file, then a default index is used. If it is larger, then - the first columns are used as index so that the remaining number of fields in - the body are equal to the number of fields in the header. - - The first row after the header is used to determine the number of columns, - which will go into the index. If the subsequent rows contain less columns - than the first row, they are filled with ``NaN``. - - This can be avoided through ``usecols``. This ensures that the columns are - taken as is and the trailing data are ignored. -usecols : list-like or callable, default ``None`` - Return a subset of the columns. If list-like, all elements must either - be positional (i.e. integer indices into the document columns) or strings - that correspond to column names provided either by the user in ``names`` or - inferred from the document header row(s). If ``names`` are given, the document - header row(s) are not taken into account. For example, a valid list-like - ``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. - - Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To - instantiate a DataFrame from ``data`` with element order preserved use - ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns - in ``['foo', 'bar']`` order or - ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for - ``['bar', 'foo']`` order. - - If callable, the callable function will be evaluated against the column names, - returning names where the callable function evaluates to True: - - .. 
ipython:: python - - import pandas as pd - from io import StringIO - - data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" - pd.read_csv(StringIO(data)) - pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"]) - - Using this parameter results in much faster parsing time and lower memory usage - when using the c engine. The Python engine loads the data first before deciding - which columns to drop. - -General parsing configuration -+++++++++++++++++++++++++++++ - -dtype : Type name or dict of column -> type, default ``None`` - Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}`` - Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve - and not interpret dtype. If converters are specified, they will be applied INSTEAD - of dtype conversion. - - .. versionadded:: 1.5.0 - - Support for defaultdict was added. Specify a defaultdict as input where - the default determines the dtype of the columns which are not explicitly - listed. - -dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames - Which dtype_backend to use, e.g. whether a DataFrame should have NumPy - arrays, nullable dtypes are used for all dtypes that have a nullable - implementation when "numpy_nullable" is set, pyarrow is used for all - dtypes if "pyarrow" is set. - - The dtype_backends are still experimental. - - .. versionadded:: 2.0 - -engine : {``'c'``, ``'python'``, ``'pyarrow'``} - Parser engine to use. The C and pyarrow engines are faster, while the python engine - is currently more feature-complete. Multithreading is currently only supported by - the pyarrow engine. - - .. versionadded:: 1.4.0 - - The "pyarrow" engine was added as an *experimental* engine, and some features - are unsupported, or may not work correctly, with this engine. -converters : dict, default ``None`` - Dict of functions for converting values in certain columns. Keys can either be - integers or column labels. 
-true_values : list, default ``None`` - Values to consider as ``True``. -false_values : list, default ``None`` - Values to consider as ``False``. -skipinitialspace : boolean, default ``False`` - Skip spaces after delimiter. -skiprows : list-like or integer, default ``None`` - Line numbers to skip (0-indexed) or number of lines to skip (int) at the start - of the file. - - If callable, the callable function will be evaluated against the row - indices, returning True if the row should be skipped and False otherwise: - - .. ipython:: python - - data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" - pd.read_csv(StringIO(data)) - pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0) - -skipfooter : int, default ``0`` - Number of lines at bottom of file to skip (unsupported with engine='c'). - -nrows : int, default ``None`` - Number of rows of file to read. Useful for reading pieces of large files. -low_memory : boolean, default ``True`` - Internally process the file in chunks, resulting in lower memory use - while parsing, but possibly mixed type inference. To ensure no mixed - types either set ``False``, or specify the type with the ``dtype`` parameter. - Note that the entire file is read into a single ``DataFrame`` regardless, - use the ``chunksize`` or ``iterator`` parameter to return the data in chunks. - (Only valid with C parser) -memory_map : boolean, default False - If a filepath is provided for ``filepath_or_buffer``, map the file object - directly onto memory and access the data directly from there. Using this - option can improve performance because there is no longer any I/O overhead. - -NA and missing data handling -++++++++++++++++++++++++++++ - -na_values : scalar, str, list-like, or dict, default ``None`` - Additional strings to recognize as NA/NaN. If dict passed, specific per-column - NA values. See :ref:`na values const ` below - for a list of the values interpreted as NaN by default. 
- -keep_default_na : boolean, default ``True`` - Whether or not to include the default NaN values when parsing the data. - Depending on whether ``na_values`` is passed in, the behavior is as follows: - - * If ``keep_default_na`` is ``True``, and ``na_values`` are specified, ``na_values`` - is appended to the default NaN values used for parsing. - * If ``keep_default_na`` is ``True``, and ``na_values`` are not specified, only - the default NaN values are used for parsing. - * If ``keep_default_na`` is ``False``, and ``na_values`` are specified, only - the NaN values specified ``na_values`` are used for parsing. - * If ``keep_default_na`` is ``False``, and ``na_values`` are not specified, no - strings will be parsed as NaN. - - Note that if ``na_filter`` is passed in as ``False``, the ``keep_default_na`` and - ``na_values`` parameters will be ignored. -na_filter : boolean, default ``True`` - Detect missing value markers (empty strings and the value of na_values). In - data without any NAs, passing ``na_filter=False`` can improve the performance - of reading a large file. -verbose : boolean, default ``False`` - Indicate number of NA values placed in non-numeric columns. -skip_blank_lines : boolean, default ``True`` - If ``True``, skip over blank lines rather than interpreting as NaN values. - -.. _io.read_csv_table.datetime: - -Datetime handling -+++++++++++++++++ - -parse_dates : boolean or list of ints or names or list of lists or dict, default ``False``. - * If ``True`` -> try parsing the index. - * If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date - column. - - .. note:: - A fast-path exists for iso8601-formatted dates. -date_format : str or dict of column -> format, default ``None`` - If used in conjunction with ``parse_dates``, will parse dates according to this - format. For anything more complex, - please read in as ``object`` and then apply :func:`to_datetime` as-needed. - - .. 
versionadded:: 2.0.0 -dayfirst : boolean, default ``False`` - DD/MM format dates, international and European format. -cache_dates : boolean, default True - If True, use a cache of unique, converted dates to apply the datetime - conversion. May produce significant speed-up when parsing duplicate - date strings, especially ones with timezone offsets. - -Iteration -+++++++++ - -iterator : boolean, default ``False`` - Return ``TextFileReader`` object for iteration or getting chunks with - ``get_chunk()``. -chunksize : int, default ``None`` - Return ``TextFileReader`` object for iteration. See :ref:`iterating and chunking - ` below. - -Quoting, compression, and file format -+++++++++++++++++++++++++++++++++++++ - -compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'`` - For on-the-fly decompression of on-disk data. If 'infer', then use gzip, - bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2', - '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip', - the ZIP file must contain only one data file to be read in. - Set to ``None`` for no decompression. Can also be a dict with key ``'method'`` - set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are - forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``. - As an example, the following could be passed for faster compression and to - create a reproducible gzip archive: - ``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``. - - .. versionchanged:: 1.2.0 Previous versions forwarded dict entries for 'gzip' to ``gzip.open``. -thousands : str, default ``None`` - Thousands separator. -decimal : str, default ``'.'`` - Character to recognize as decimal point. E.g. use ``','`` for European data. 
-float_precision : string, default None - Specifies which converter the C engine should use for floating-point values. - The options are ``None`` for the ordinary converter, ``high`` for the - high-precision converter, and ``round_trip`` for the round-trip converter. -lineterminator : str (length 1), default ``None`` - Character to break file into lines. Only valid with C parser. -quotechar : str (length 1) - The character used to denote the start and end of a quoted item. Quoted items - can include the delimiter and it will be ignored. -quoting : int or ``csv.QUOTE_*`` instance, default ``0`` - Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of - ``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or - ``QUOTE_NONE`` (3). -doublequote : boolean, default ``True`` - When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``, - indicate whether or not to interpret two consecutive ``quotechar`` elements - **inside** a field as a single ``quotechar`` element. -escapechar : str (length 1), default ``None`` - One-character string used to escape delimiter when quoting is ``QUOTE_NONE``. -comment : str, default ``None`` - Indicates remainder of line should not be parsed. If found at the beginning of - a line, the line will be ignored altogether. This parameter must be a single - character. Like empty lines (as long as ``skip_blank_lines=True``), fully - commented lines are ignored by the parameter ``header`` but not by ``skiprows``. - For example, if ``comment='#'``, parsing '#empty\\na,b,c\\n1,2,3' with - ``header=0`` will result in 'a,b,c' being treated as the header. -encoding : str, default ``None`` - Encoding to use for UTF when reading/writing (e.g. ``'utf-8'``). `List of - Python standard encodings - `_. 
-dialect : str or :class:`python:csv.Dialect` instance, default ``None`` - If provided, this parameter will override values (default or not) for the - following parameters: ``delimiter``, ``doublequote``, ``escapechar``, - ``skipinitialspace``, ``quotechar``, and ``quoting``. If it is necessary to - override values, a ParserWarning will be issued. See :class:`python:csv.Dialect` - documentation for more details. - -Error handling -++++++++++++++ - -on_bad_lines : {'error', 'warn', 'skip'}, default 'error' - Specifies what to do upon encountering a bad line (a line with too many fields). - Allowed values are: - - - 'error', raise a ParserError when a bad line is encountered. - - 'warn', print a warning when a bad line is encountered and skip that line. - - 'skip', skip bad lines without raising or warning when they are encountered. - - .. versionadded:: 1.3.0 - -.. _io.dtypes: - -Specifying column data types -'''''''''''''''''''''''''''' - -You can indicate the data type for the whole ``DataFrame`` or individual -columns: - -.. ipython:: python - - import numpy as np - - data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" - print(data) - - df = pd.read_csv(StringIO(data), dtype=object) - df - df["a"][0] - df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) - df.dtypes - -Fortunately, pandas offers more than one way to ensure that your column(s) -contain only one ``dtype``. If you're unfamiliar with these concepts, you can -see :ref:`here` to learn more about dtypes, and -:ref:`here` to learn more about ``object`` conversion in -pandas. - - -For instance, you can use the ``converters`` argument -of :func:`~pandas.read_csv`: - -.. ipython:: python - - data = "col_1\n1\n2\n'A'\n4.22" - df = pd.read_csv(StringIO(data), converters={"col_1": str}) - df - df["col_1"].apply(type).value_counts() - -Or you can use the :func:`~pandas.to_numeric` function to coerce the -dtypes after reading in the data, - -.. 
ipython:: python - - df2 = pd.read_csv(StringIO(data)) - df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") - df2 - df2["col_1"].apply(type).value_counts() - -which will convert all valid parsing to floats, leaving the invalid parsing -as ``NaN``. - -Ultimately, how you deal with reading in columns containing mixed dtypes -depends on your specific needs. In the case above, if you wanted to ``NaN`` out -the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. -However, if you wanted for all the data to be coerced, no matter the type, then -using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be -worth trying. - -.. note:: - In some cases, reading in abnormal data with columns containing mixed dtypes - will result in an inconsistent dataset. If you rely on pandas to infer the - dtypes of your columns, the parsing engine will go and infer the dtypes for - different chunks of the data, rather than the whole dataset at once. Consequently, - you can end up with column(s) with mixed dtypes. For example, - - .. ipython:: python - :okwarning: - - col_1 = list(range(500000)) + ["a", "b"] + list(range(500000)) - df = pd.DataFrame({"col_1": col_1}) - df.to_csv("foo.csv") - mixed_df = pd.read_csv("foo.csv") - mixed_df["col_1"].apply(type).value_counts() - mixed_df["col_1"].dtype - - will result with ``mixed_df`` containing an ``int`` dtype for certain chunks - of the column, and ``str`` for others due to the mixed dtypes from the - data that was read in. It is important to note that the overall column will be - marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. - -.. ipython:: python - :suppress: - - import os - - os.remove("foo.csv") - -Setting ``dtype_backend="numpy_nullable"`` will result in nullable dtypes for every column. - -.. 
ipython:: python - - data = """a,b,c,d,e,f,g,h,i,j - 1,2.5,True,a,,,,,12-31-2019, - 3,4.5,False,b,6,7.5,True,a,12-31-2019, - """ - - df = pd.read_csv(StringIO(data), dtype_backend="numpy_nullable", parse_dates=["i"]) - df - df.dtypes - -.. _io.categorical: - -Specifying categorical dtype -'''''''''''''''''''''''''''' - -``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` or -``dtype=CategoricalDtype(categories, ordered)``. - -.. ipython:: python - - data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" - - pd.read_csv(StringIO(data)) - pd.read_csv(StringIO(data)).dtypes - pd.read_csv(StringIO(data), dtype="category").dtypes - -Individual columns can be parsed as a ``Categorical`` using a dict -specification: - -.. ipython:: python - - pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes - -Specifying ``dtype='category'`` will result in an unordered ``Categorical`` -whose ``categories`` are the unique values observed in the data. For more -control on the categories and order, create a -:class:`~pandas.api.types.CategoricalDtype` ahead of time, and pass that for -that column's ``dtype``. - -.. ipython:: python - - from pandas.api.types import CategoricalDtype - - dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True) - pd.read_csv(StringIO(data), dtype={"col1": dtype}).dtypes - -When using ``dtype=CategoricalDtype``, "unexpected" values outside of -``dtype.categories`` are treated as missing values. - -.. ipython:: python - - dtype = CategoricalDtype(["a", "b", "d"]) # No 'c' - pd.read_csv(StringIO(data), dtype={"col1": dtype}).col1 - -This matches the behavior of :meth:`Categorical.set_categories`. - -.. note:: - - With ``dtype='category'``, the resulting categories will always be parsed - as strings (object dtype). If the categories are numeric they can be - converted using the :func:`to_numeric` function, or as appropriate, another - converter such as :func:`to_datetime`. 
- - When ``dtype`` is a ``CategoricalDtype`` with homogeneous ``categories`` ( - all numeric, all datetimes, etc.), the conversion is done automatically. - - .. ipython:: python - - df = pd.read_csv(StringIO(data), dtype="category") - df.dtypes - df["col3"] - new_categories = pd.to_numeric(df["col3"].cat.categories) - df["col3"] = df["col3"].cat.rename_categories(new_categories) - df["col3"] - - -Naming and using columns -'''''''''''''''''''''''' - -.. _io.headers: - -Handling column names -+++++++++++++++++++++ - -A file may or may not have a header row. pandas assumes the first row should be -used as the column names: - -.. ipython:: python - - data = "a,b,c\n1,2,3\n4,5,6\n7,8,9" - print(data) - pd.read_csv(StringIO(data)) - -By specifying the ``names`` argument in conjunction with ``header`` you can -indicate other names to use and whether or not to throw away the header row (if -any): - -.. ipython:: python - - print(data) - pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=0) - pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=None) - -If the header is in a row other than the first, pass the row number to -``header``. This will skip the preceding rows: - -.. ipython:: python - - data = "skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9" - pd.read_csv(StringIO(data), header=1) - -.. note:: - - Default behavior is to infer the column names: if no names are - passed the behavior is identical to ``header=0`` and column names - are inferred from the first non-blank line of the file, if column - names are passed explicitly then the behavior is identical to - ``header=None``. - -.. _io.dupe_names: - -Duplicate names parsing -''''''''''''''''''''''' - -If the file or header contains duplicate names, pandas will by default -distinguish between them so as to prevent overwriting data: - -.. 
ipython:: python - - data = "a,b,a\n0,1,2\n3,4,5" - pd.read_csv(StringIO(data)) - -There is no more duplicate data because duplicate columns 'X', ..., 'X' become -'X', 'X.1', ..., 'X.N'. - -.. _io.usecols: - -Filtering columns (``usecols``) -+++++++++++++++++++++++++++++++ - -The ``usecols`` argument allows you to select any subset of the columns in a -file, either using the column names, position numbers or a callable: - -.. ipython:: python - - data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz" - pd.read_csv(StringIO(data)) - pd.read_csv(StringIO(data), usecols=["b", "d"]) - pd.read_csv(StringIO(data), usecols=[0, 2, 3]) - pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["A", "C"]) - -The ``usecols`` argument can also be used to specify which columns not to -use in the final result: - -.. ipython:: python - - pd.read_csv(StringIO(data), usecols=lambda x: x not in ["a", "c"]) - -In this case, the callable is specifying that we exclude the "a" and "c" -columns from the output. - -Comments and empty lines -'''''''''''''''''''''''' - -.. _io.skiplines: - -Ignoring line comments and empty lines -++++++++++++++++++++++++++++++++++++++ - -If the ``comment`` parameter is specified, then completely commented lines will -be ignored. By default, completely blank lines will be ignored as well. - -.. ipython:: python - - data = "\na,b,c\n \n# commented line\n1,2,3\n\n4,5,6" - print(data) - pd.read_csv(StringIO(data), comment="#") - -If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines: - -.. ipython:: python - - data = "a,b,c\n\n1,2,3\n\n\n4,5,6" - pd.read_csv(StringIO(data), skip_blank_lines=False) - -.. warning:: - - The presence of ignored lines might create ambiguities involving line numbers; - the parameter ``header`` uses row numbers (ignoring commented/empty - lines), while ``skiprows`` uses line numbers (including commented/empty lines): - - .. 
ipython:: python - - data = "#comment\na,b,c\nA,B,C\n1,2,3" - pd.read_csv(StringIO(data), comment="#", header=1) - data = "A,B,C\n#comment\na,b,c\n1,2,3" - pd.read_csv(StringIO(data), comment="#", skiprows=2) - - If both ``header`` and ``skiprows`` are specified, ``header`` will be - relative to the end of ``skiprows``. For example: - -.. ipython:: python - - data = ( - "# empty\n" - "# second empty line\n" - "# third emptyline\n" - "X,Y,Z\n" - "1,2,3\n" - "A,B,C\n" - "1,2.,4.\n" - "5.,NaN,10.0\n" - ) - print(data) - pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1) - -.. _io.comments: - -Comments -++++++++ - -Sometimes comments or meta data may be included in a file: - -.. ipython:: python - - data = ( - "ID,level,category\n" - "Patient1,123000,x # really unpleasant\n" - "Patient2,23000,y # wouldn't take his medicine\n" - "Patient3,1234018,z # awesome" - ) - with open("tmp.csv", "w") as fh: - fh.write(data) - - print(open("tmp.csv").read()) - -By default, the parser includes the comments in the output: - -.. ipython:: python - - df = pd.read_csv("tmp.csv") - df - -We can suppress the comments using the ``comment`` keyword: - -.. ipython:: python - - df = pd.read_csv("tmp.csv", comment="#") - df - -.. ipython:: python - :suppress: - - os.remove("tmp.csv") - -.. _io.unicode: - -Dealing with Unicode data -''''''''''''''''''''''''' - -The ``encoding`` argument should be used for encoded unicode data, which will -result in byte strings being decoded to unicode in the result: - -.. ipython:: python - - from io import BytesIO - - data = b"word,length\n" b"Tr\xc3\xa4umen,7\n" b"Gr\xc3\xbc\xc3\x9fe,5" - data = data.decode("utf8").encode("latin-1") - df = pd.read_csv(BytesIO(data), encoding="latin-1") - df - df["word"][1] - -Some formats which encode all characters as multiple bytes, like UTF-16, won't -parse correctly at all without specifying the encoding. `Full list of Python -standard encodings -`_. - -.. 
_io.index_col: - -Index columns and trailing delimiters -''''''''''''''''''''''''''''''''''''' - -If a file has one more column of data than the number of column names, the -first column will be used as the ``DataFrame``'s row names: - -.. ipython:: python - - data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10" - pd.read_csv(StringIO(data)) - -.. ipython:: python - - data = "index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10" - pd.read_csv(StringIO(data), index_col=0) - -Ordinarily, you can achieve this behavior using the ``index_col`` option. - -There are some exception cases when a file has been prepared with delimiters at -the end of each data line, confusing the parser. To explicitly disable the -index column inference and discard the last column, pass ``index_col=False``: - -.. ipython:: python - - data = "a,b,c\n4,apple,bat,\n8,orange,cow," - print(data) - pd.read_csv(StringIO(data)) - pd.read_csv(StringIO(data), index_col=False) - -If a subset of data is being parsed using the ``usecols`` option, the -``index_col`` specification is based on that subset, not the original data. - -.. ipython:: python - - data = "a,b,c\n4,apple,bat,\n8,orange,cow," - print(data) - pd.read_csv(StringIO(data), usecols=["b", "c"]) - pd.read_csv(StringIO(data), usecols=["b", "c"], index_col=0) - -.. _io.parse_dates: - -Date Handling -''''''''''''' - -Specifying date columns -+++++++++++++++++++++++ - -To better facilitate working with datetime data, :func:`read_csv` -uses the keyword arguments ``parse_dates`` and ``date_format`` -to allow users to specify a variety of columns and date/time formats to turn the -input text data into ``datetime`` objects. - -The simplest case is to just pass in ``parse_dates=True``: - -.. ipython:: python - - with open("foo.csv", mode="w") as f: - f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5") - - # Use a column as an index, and parse it as dates. 
- df = pd.read_csv("foo.csv", index_col=0, parse_dates=True) - df - - # These are Python datetime objects - df.index - -It is often the case that we may want to store date and time data separately, -or store various date fields separately. The ``parse_dates`` keyword can be -used to specify columns to parse the dates and/or times. - - -.. note:: - If a column or index contains an unparsable date, the entire column or - index will be returned unaltered as an object data type. For non-standard - datetime parsing, use :func:`to_datetime` after ``pd.read_csv``. - - -.. note:: - ``read_csv`` has a fast path for parsing datetime strings in ISO 8601 format, - e.g. "2000-01-01T00:01:02+00:00" and similar variations. If you can arrange - for your data to store datetimes in this format, load times will be - significantly faster; a speedup of roughly 20x has been observed. - - -Date parsing functions -++++++++++++++++++++++ - -Finally, the parser allows you to specify a custom ``date_format``. -Performance-wise, you should try these methods of parsing dates in order: - -1. If you know the format, use ``date_format``, e.g.: - ``date_format="%d/%m/%Y"`` or ``date_format={column_name: "%d/%m/%Y"}``. - -2. If you have different formats for different columns, or want to pass any extra options (such - as ``utc``) to ``to_datetime``, then you should read in your data as ``object`` dtype, and - then use ``to_datetime``. - - -.. _io.csv.mixed_timezones: - -Parsing a CSV with mixed timezones -++++++++++++++++++++++++++++++++++ - -pandas cannot natively represent a column or index with mixed timezones. If your CSV -file contains columns with a mixture of timezones, the default result will be -an object-dtype column with strings, even with ``parse_dates``. -To parse the mixed-timezone values as a datetime column, read in as ``object`` dtype and -then call :func:`to_datetime` with ``utc=True``. - - -.. 
ipython:: python - - content = """\ - a - 2000-01-01T00:00:00+05:00 - 2000-01-01T00:00:00+06:00""" - df = pd.read_csv(StringIO(content)) - df["a"] = pd.to_datetime(df["a"], utc=True) - df["a"] - - -.. _io.dayfirst: - - -Inferring datetime format -+++++++++++++++++++++++++ - -Here are some examples of datetime strings that can be guessed (all -representing December 30th, 2011 at 00:00:00): - -* "20111230" -* "2011/12/30" -* "20111230 00:00:00" -* "12/30/2011 00:00:00" -* "30/Dec/2011 00:00:00" -* "30/December/2011 00:00:00" - -Note that format inference is sensitive to ``dayfirst``. With -``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With -``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th. - -If you try to parse a column of date strings, pandas will attempt to guess the format -from the first non-NaN element, and will then parse the rest of the column with that -format. If pandas fails to guess the format (for example if your first string is -``'01 December US/Pacific 2000'``), then a warning will be raised and each -row will be parsed individually by ``dateutil.parser.parse``. The safest -way to parse dates is to explicitly set ``format=``. - -.. ipython:: python - - df = pd.read_csv( - "foo.csv", - index_col=0, - parse_dates=True, - ) - df - -In the case that you have mixed datetime formats within the same column, you can -pass ``format='mixed'`` - -.. ipython:: python - - data = StringIO("date\n12 Jan 2000\n2000-01-13\n") - df = pd.read_csv(data) - df['date'] = pd.to_datetime(df['date'], format='mixed') - df - -or, if your datetime formats are all ISO8601 (possibly not identically-formatted): - -.. ipython:: python - - data = StringIO("date\n2020-01-01\n2020-01-01 03:00\n") - df = pd.read_csv(data) - df['date'] = pd.to_datetime(df['date'], format='ISO8601') - df - -.. 
ipython:: python - :suppress: - - os.remove("foo.csv") - -International date formats -++++++++++++++++++++++++++ - -While US date formats tend to be MM/DD/YYYY, many international formats use -DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided: - -.. ipython:: python - - data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c" - print(data) - with open("tmp.csv", "w") as fh: - fh.write(data) - - pd.read_csv("tmp.csv", parse_dates=[0]) - pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0]) - -.. ipython:: python - :suppress: - - os.remove("tmp.csv") - -Writing CSVs to binary file objects -+++++++++++++++++++++++++++++++++++ - -.. versionadded:: 1.2.0 - -``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object -opened in binary mode. In most cases, it is not necessary to specify -``mode`` as pandas will auto-detect whether the file object is -opened in text or binary mode. - -.. ipython:: python - - import io - - data = pd.DataFrame([0, 1, 2]) - buffer = io.BytesIO() - data.to_csv(buffer, encoding="utf-8", compression="gzip") - -.. _io.float_precision: - -Specifying method for floating-point conversion -''''''''''''''''''''''''''''''''''''''''''''''' - -The parameter ``float_precision`` can be specified in order to use -a specific floating-point converter during parsing with the C engine. -The options are the ordinary converter, the high-precision converter, and -the round-trip converter (which is guaranteed to round-trip values after -writing to a file). For example: - -.. ipython:: python - - val = "0.3066101993807095471566981359501369297504425048828125" - data = "a,b,c\n1,2,{0}".format(val) - abs( - pd.read_csv( - StringIO(data), - engine="c", - float_precision=None, - )["c"][0] - float(val) - ) - abs( - pd.read_csv( - StringIO(data), - engine="c", - float_precision="high", - )["c"][0] - float(val) - ) - abs( - pd.read_csv(StringIO(data), engine="c", float_precision="round_trip")["c"][0] - - float(val) - ) - - -.. 
_io.thousands: - -Thousand separators -''''''''''''''''''' - -For large numbers that have been written with a thousands separator, you can -set the ``thousands`` keyword to a string of length 1 so that integers will be parsed -correctly. - -By default, numbers with a thousands separator will be parsed as strings: - -.. ipython:: python - - data = ( - "ID|level|category\n" - "Patient1|123,000|x\n" - "Patient2|23,000|y\n" - "Patient3|1,234,018|z" - ) - - with open("tmp.csv", "w") as fh: - fh.write(data) - - df = pd.read_csv("tmp.csv", sep="|") - df - - df.level.dtype - -The ``thousands`` keyword allows integers to be parsed correctly: - -.. ipython:: python - - df = pd.read_csv("tmp.csv", sep="|", thousands=",") - df - - df.level.dtype - -.. ipython:: python - :suppress: - - os.remove("tmp.csv") - -.. _io.na_values: - -NA values -''''''''' - -To control which values are parsed as missing values (which are signified by -``NaN``), specify a string in ``na_values``. If you specify a list of strings, -then all values in it are considered to be missing values. If you specify a -number (a ``float``, like ``5.0`` or an ``integer`` like ``5``), the -corresponding equivalent values will also imply a missing value (in this case -effectively ``[5.0, 5]`` are recognized as ``NaN``). - -To completely override the default values that are recognized as missing, specify ``keep_default_na=False``. - -.. _io.navaluesconst: - -The default ``NaN`` recognized values are ``['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', -'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', 'None', '']``. - -Let us consider some examples: - -.. code-block:: python - - pd.read_csv("path_to_file.csv", na_values=[5]) - -In the example above ``5`` and ``5.0`` will be recognized as ``NaN``, in -addition to the defaults. A string will first be interpreted as a numerical -``5``, then as a ``NaN``. - -.. 
code-block:: python - - pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=[""]) - -Above, only an empty field will be recognized as ``NaN``. - -.. code-block:: python - - pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=["NA", "0"]) - -Above, both ``NA`` and ``0`` as strings are ``NaN``. - -.. code-block:: python - - pd.read_csv("path_to_file.csv", na_values=["Nope"]) - -The default values, in addition to the string ``"Nope"`` are recognized as -``NaN``. - -.. _io.infinity: - -Infinity -'''''''' - -``inf`` like values will be parsed as ``np.inf`` (positive infinity), and ``-inf`` as ``-np.inf`` (negative infinity). -These will ignore the case of the value, meaning ``Inf``, will also be parsed as ``np.inf``. - -.. _io.boolean: - -Boolean values -'''''''''''''' - -The common values ``True``, ``False``, ``TRUE``, and ``FALSE`` are all -recognized as boolean. Occasionally you might want to recognize other values -as being boolean. To do this, use the ``true_values`` and ``false_values`` -options as follows: - -.. ipython:: python - - data = "a,b,c\n1,Yes,2\n3,No,4" - print(data) - pd.read_csv(StringIO(data)) - pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"]) - -.. _io.bad_lines: - -Handling "bad" lines -'''''''''''''''''''' - -Some files may have malformed lines with too few fields or too many. Lines with -too few fields will have NA values filled in the trailing fields. Lines with -too many fields will raise an error by default: - -.. ipython:: python - :okexcept: - - data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" - pd.read_csv(StringIO(data)) - -You can elect to skip bad lines: - -.. ipython:: python - - data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" - pd.read_csv(StringIO(data), on_bad_lines="skip") - -.. versionadded:: 1.4.0 - -Or pass a callable function to handle the bad line if ``engine="python"``. -The bad line will be a list of strings that was split by the ``sep``: - -.. 
ipython:: python - - external_list = [] - def bad_lines_func(line): - external_list.append(line) - return line[-3:] - pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") - external_list - -.. note:: - - The callable function will handle only a line with too many fields. - Bad lines caused by other errors will be silently skipped. - - .. ipython:: python - - bad_lines_func = lambda line: print(line) - - data = 'name,type\nname a,a is of type a\nname b,"b\" is of type b"' - data - pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") - - The line was not processed in this case, as a "bad line" here is caused by an escape character. - -You can also use the ``usecols`` parameter to eliminate extraneous column -data that appear in some lines but not others: - -.. ipython:: python - :okexcept: - - pd.read_csv(StringIO(data), usecols=[0, 1, 2]) - -In case you want to keep all data including the lines with too many fields, you can -specify a sufficient number of ``names``. This ensures that lines with not enough -fields are filled with ``NaN``. - -.. ipython:: python - - pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd']) - -.. _io.dialect: - -Dialect -''''''' - -The ``dialect`` keyword gives greater flexibility in specifying the file format. -By default it uses the Excel dialect but you can specify either the dialect name -or a :class:`python:csv.Dialect` instance. - -Suppose you had data with unenclosed quotes: - -.. ipython:: python - - data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f" - print(data) - -By default, ``read_csv`` uses the Excel dialect and treats the double quote as -the quote character, which causes it to fail when it finds a newline before it -finds the closing double quote. - -We can get around this using ``dialect``: - -.. 
ipython:: python - :okwarning: - - import csv - - dia = csv.excel() - dia.quoting = csv.QUOTE_NONE - pd.read_csv(StringIO(data), dialect=dia) - -All of the dialect options can be specified separately by keyword arguments: - -.. ipython:: python - - data = "a,b,c~1,2,3~4,5,6" - pd.read_csv(StringIO(data), lineterminator="~") - -Another common dialect option is ``skipinitialspace``, to skip any whitespace -after a delimiter: - -.. ipython:: python - - data = "a, b, c\n1, 2, 3\n4, 5, 6" - print(data) - pd.read_csv(StringIO(data), skipinitialspace=True) - -The parsers make every attempt to "do the right thing" and not be fragile. Type -inference is a pretty big deal. If a column can be coerced to integer dtype -without altering the contents, the parser will do so. Any non-numeric -columns will come through as object dtype as with the rest of pandas objects. - -.. _io.quoting: - -Quoting and Escape Characters -''''''''''''''''''''''''''''' - -Quotes (and other escape characters) in embedded fields can be handled in any -number of ways. One way is to use backslashes; to properly parse this data, you -should pass the ``escapechar`` option: - -.. ipython:: python - - data = 'a,b\n"hello, \\"Bob\\", nice to see you",5' - print(data) - pd.read_csv(StringIO(data), escapechar="\\") - -.. _io.fwf_reader: -.. _io.fwf: - -Files with fixed width columns -'''''''''''''''''''''''''''''' - -While :func:`read_csv` reads delimited data, the :func:`read_fwf` function works -with data files that have known and fixed column widths. The function parameters -to ``read_fwf`` are largely the same as ``read_csv`` with two extra parameters, and -a different usage of the ``delimiter`` parameter: - -* ``colspecs``: A list of pairs (tuples) giving the extents of the - fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). - String value 'infer' can be used to instruct the parser to try detecting - the column specifications from the first 100 rows of the data. 
Default - behavior, if not specified, is to infer. -* ``widths``: A list of field widths which can be used instead of 'colspecs' - if the intervals are contiguous. -* ``delimiter``: Characters to consider as filler characters in the fixed-width file. - Can be used to specify the filler character of the fields - if it is not spaces (e.g., '~'). - -Consider a typical fixed-width data file: - -.. ipython:: python - - data1 = ( - "id8141 360.242940 149.910199 11950.7\n" - "id1594 444.953632 166.985655 11788.4\n" - "id1849 364.136849 183.628767 11806.2\n" - "id1230 413.836124 184.375703 11916.8\n" - "id1948 502.953953 173.237159 12468.3" - ) - with open("bar.csv", "w") as f: - f.write(data1) - -In order to parse this file into a ``DataFrame``, we simply need to supply the -column specifications to the ``read_fwf`` function along with the file name: - -.. ipython:: python - - # Column specifications are a list of half-intervals - colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)] - df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0) - df - -Note how the parser automatically picks column names X. when -``header=None`` argument is specified. Alternatively, you can supply just the -column widths for contiguous columns: - -.. ipython:: python - - # Widths are a list of integers - widths = [6, 14, 13, 10] - df = pd.read_fwf("bar.csv", widths=widths, header=None) - df - -The parser will take care of extra white spaces around the columns -so it's ok to have extra separation between the columns in the file. - -By default, ``read_fwf`` will try to infer the file's ``colspecs`` by using the -first 100 rows of the file. It can do it only in cases when the columns are -aligned and correctly separated by the provided ``delimiter`` (default delimiter -is whitespace). - -.. 
ipython:: python - - df = pd.read_fwf("bar.csv", header=None, index_col=0) - df - -``read_fwf`` supports the ``dtype`` parameter for specifying the types of -parsed columns to be different from the inferred type. - -.. ipython:: python - - pd.read_fwf("bar.csv", header=None, index_col=0).dtypes - pd.read_fwf("bar.csv", header=None, dtype={2: "object"}).dtypes - -.. ipython:: python - :suppress: - - os.remove("bar.csv") - - -Indexes -''''''' - -Files with an "implicit" index column -+++++++++++++++++++++++++++++++++++++ - -Consider a file with one fewer entry in the header than the number of data -columns: - -.. ipython:: python - - data = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5" - print(data) - with open("foo.csv", "w") as f: - f.write(data) - -In this special case, ``read_csv`` assumes that the first column is to be used -as the index of the ``DataFrame``: - -.. ipython:: python - - pd.read_csv("foo.csv") - -Note that the dates weren't automatically parsed. In that case you would need -to do as before: - -.. ipython:: python - - df = pd.read_csv("foo.csv", parse_dates=True) - df.index - -.. ipython:: python - :suppress: - - os.remove("foo.csv") - - -Reading an index with a ``MultiIndex`` -++++++++++++++++++++++++++++++++++++++ - -.. _io.csv_multiindex: - -Suppose you have data indexed by two columns: - -.. ipython:: python - - data = 'year,indiv,zit,xit\n1977,"A",1.2,.6\n1977,"B",1.5,.5' - print(data) - with open("mindex_ex.csv", mode="w") as f: - f.write(data) - -The ``index_col`` argument to ``read_csv`` can take a list of -column numbers to turn multiple columns into a ``MultiIndex`` for the index of the -returned object: - -.. ipython:: python - - df = pd.read_csv("mindex_ex.csv", index_col=[0, 1]) - df - df.loc[1977] - -.. ipython:: python - :suppress: - - os.remove("mindex_ex.csv") - -.. 
_io.multi_index_columns: - -Reading columns with a ``MultiIndex`` -+++++++++++++++++++++++++++++++++++++ - -By specifying a list of row locations for the ``header`` argument, you -can read in a ``MultiIndex`` for the columns. Specifying non-consecutive -rows will skip the intervening rows. - -.. ipython:: python - - mi_idx = pd.MultiIndex.from_arrays([[1, 2, 3, 4], list("abcd")], names=list("ab")) - mi_col = pd.MultiIndex.from_arrays([[1, 2], list("ab")], names=list("cd")) - df = pd.DataFrame(np.ones((4, 2)), index=mi_idx, columns=mi_col) - df.to_csv("mi.csv") - print(open("mi.csv").read()) - pd.read_csv("mi.csv", header=[0, 1, 2, 3], index_col=[0, 1]) - -``read_csv`` is also able to interpret a more common format -of multi-column indices. - -.. ipython:: python - - data = ",a,a,a,b,c,c\n,q,r,s,t,u,v\none,1,2,3,4,5,6\ntwo,7,8,9,10,11,12" - print(data) - with open("mi2.csv", "w") as fh: - fh.write(data) - - pd.read_csv("mi2.csv", header=[0, 1], index_col=0) - -.. note:: - If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it - with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will - be *lost*. - -.. ipython:: python - :suppress: - - os.remove("mi.csv") - os.remove("mi2.csv") - -.. _io.sniff: - -Automatically "sniffing" the delimiter -'''''''''''''''''''''''''''''''''''''' - -``read_csv`` can infer the delimiter of delimited (not necessarily -comma-separated) files, as pandas uses the :class:`python:csv.Sniffer` -class of the csv module. For this, you have to specify ``sep=None``. - -.. ipython:: python - - df = pd.DataFrame(np.random.randn(10, 4)) - df.to_csv("tmp2.csv", sep=":", index=False) - pd.read_csv("tmp2.csv", sep=None, engine="python") - -.. ipython:: python - :suppress: - - os.remove("tmp2.csv") - -.. _io.multiple_files: - -Reading multiple files to create a single DataFrame -''''''''''''''''''''''''''''''''''''''''''''''''''' - -It's best to use :func:`~pandas.concat` to combine multiple files. 
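
A minimal sketch of this pattern follows. It assumes in-memory buffers in place of real file paths; the ``sources`` list and its contents are made up for illustration and are not part of the original guide.

```python
from io import StringIO

import pandas as pd

# Two in-memory "files" stand in for real paths (hypothetical contents).
sources = [StringIO("a,b\n1,2\n3,4"), StringIO("a,b\n5,6")]

# Read each source into its own DataFrame, then combine them in one call.
frames = [pd.read_csv(src) for src in sources]

# ignore_index=True discards the per-file row labels and renumbers from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the frames in a list and calling ``concat`` once avoids repeatedly copying data, which is what happens when frames are appended one at a time in a loop.
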
-See the :ref:`cookbook` for an example. - -.. _io.chunking: - -Iterating through files chunk by chunk -'''''''''''''''''''''''''''''''''''''' - -Suppose you wish to iterate through a (potentially very large) file lazily -rather than reading the entire file into memory, such as the following: - - -.. ipython:: python - - df = pd.DataFrame(np.random.randn(10, 4)) - df.to_csv("tmp.csv", index=False) - table = pd.read_csv("tmp.csv") - table - - -By specifying a ``chunksize`` to ``read_csv``, the return -value will be an iterable object of type ``TextFileReader``: - -.. ipython:: python - - with pd.read_csv("tmp.csv", chunksize=4) as reader: - print(reader) - for chunk in reader: - print(chunk) - -.. versionchanged:: 1.2 - - ``read_csv/json/sas`` return a context-manager when iterating through a file. - -Specifying ``iterator=True`` will also return the ``TextFileReader`` object: - -.. ipython:: python - - with pd.read_csv("tmp.csv", iterator=True) as reader: - print(reader.get_chunk(5)) - -.. ipython:: python - :suppress: - - os.remove("tmp.csv") - -Specifying the parser engine -'''''''''''''''''''''''''''' - -pandas currently supports three engines, the C engine, the python engine, and an experimental -pyarrow engine (requires the ``pyarrow`` package). In general, the pyarrow engine is fastest -on larger workloads and is equivalent in speed to the C engine on most other workloads. -The python engine tends to be slower than the pyarrow and C engines on most workloads. However, -the pyarrow engine is much less robust than the C engine, which lacks a few features compared to the -Python engine. - -Where possible, pandas uses the C parser (specified as ``engine='c'``), but it may fall -back to Python if C-unsupported options are specified. - -Currently, options unsupported by the C and pyarrow engines include: - -* ``sep`` other than a single character (e.g. 
regex separators) -* ``skipfooter`` - -Specifying any of the above options will produce a ``ParserWarning`` unless the -python engine is selected explicitly using ``engine='python'``. - -Options that are unsupported by the pyarrow engine which are not covered by the list above include: - -* ``float_precision`` -* ``chunksize`` -* ``comment`` -* ``nrows`` -* ``thousands`` -* ``memory_map`` -* ``dialect`` -* ``on_bad_lines`` -* ``quoting`` -* ``lineterminator`` -* ``converters`` -* ``decimal`` -* ``iterator`` -* ``dayfirst`` -* ``verbose`` -* ``skipinitialspace`` -* ``low_memory`` - -Specifying these options with ``engine='pyarrow'`` will raise a ``ValueError``. - -.. _io.remote: - -Reading/writing remote files -'''''''''''''''''''''''''''' - -You can pass in a URL to read or write remote files to many of pandas' IO -functions - the following example shows reading a CSV file: - -.. code-block:: python - - df = pd.read_csv("https://download.bls.gov/pub/time.series/cu/cu.item", sep="\t") - -.. versionadded:: 1.3.0 - -A custom header can be sent alongside HTTP(s) requests by passing a dictionary -of header key value mappings to the ``storage_options`` keyword argument as shown below: - -.. code-block:: python - - headers = {"User-Agent": "pandas"} - df = pd.read_csv( - "https://download.bls.gov/pub/time.series/cu/cu.item", - sep="\t", - storage_options=headers - ) - -All URLs which are not local files or HTTP(s) are handled by -`fsspec`_, if installed, and its various filesystem implementations -(including Amazon S3, Google Cloud, SSH, FTP, webHDFS...). -Some of these implementations will require additional packages to be -installed, for example -S3 URLs require the `s3fs -`_ library: - -.. code-block:: python - - df = pd.read_json("s3://pandas-test/adatafile.json") - -When dealing with remote storage systems, you might need -extra configuration with environment variables or config files in -special locations. 
For example, to access data in your S3 bucket, -you will need to define credentials in one of the several ways listed in -the `S3Fs documentation -`_. The same is true -for several of the storage backends, and you should follow the links -at `fsimpl1`_ for implementations built into ``fsspec`` and `fsimpl2`_ -for those not included in the main ``fsspec`` -distribution. - -You can also pass parameters directly to the backend driver. Since ``fsspec`` does not -utilize the ``AWS_S3_HOST`` environment variable, we can directly define a -dictionary containing the endpoint_url and pass the object into the storage -option parameter: - -.. code-block:: python - - storage_options = {"client_kwargs": {"endpoint_url": "http://127.0.0.1:5555"}} - df = pd.read_json("s3://pandas-test/test-1", storage_options=storage_options) - -More sample configurations and documentation can be found at `S3Fs documentation -`__. - -If you do *not* have S3 credentials, you can still access public -data by specifying an anonymous connection, such as - -.. versionadded:: 1.2.0 - -.. code-block:: python - - pd.read_csv( - "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013" - "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv", - storage_options={"anon": True}, - ) - -``fsspec`` also allows complex URLs, for accessing data in compressed -archives, local caching of files, and more. To locally cache the above -example, you would modify the call to - -.. code-block:: python - - pd.read_csv( - "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/" - "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv", - storage_options={"s3": {"anon": True}}, - ) - -where we specify that the "anon" parameter is meant for the "s3" part of -the implementation, not to the caching implementation. Note that this caches to a temporary -directory for the duration of the session only, but you can also specify -a permanent store. - -.. 
_fsspec: https://filesystem-spec.readthedocs.io/en/latest/ -.. _fsimpl1: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations -.. _fsimpl2: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations - -Writing out data -'''''''''''''''' - -.. _io.store_in_csv: - -Writing to CSV format -+++++++++++++++++++++ - -The ``Series`` and ``DataFrame`` objects have an instance method ``to_csv`` which -allows storing the contents of the object as a comma-separated-values file. The -function takes a number of arguments. Only the first is required. - -* ``path_or_buf``: A string path to the file to write or a file object. If a file object it must be opened with ``newline=''`` -* ``sep`` : Field delimiter for the output file (default ",") -* ``na_rep``: A string representation of a missing value (default '') -* ``float_format``: Format string for floating point numbers -* ``columns``: Columns to write (default None) -* ``header``: Whether to write out the column names (default True) -* ``index``: whether to write row (index) names (default True) -* ``index_label``: Column label(s) for index column(s) if desired. If None - (default), and ``header`` and ``index`` are True, then the index names are - used. (A sequence should be given if the ``DataFrame`` uses MultiIndex). -* ``mode`` : Python write mode, default 'w' -* ``encoding``: a string representing the encoding to use if the contents are - non-ASCII, for Python versions prior to 3 -* ``lineterminator``: Character sequence denoting line end (default ``os.linesep``) -* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). 
Note that if you have set a ``float_format`` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric -* ``quotechar``: Character used to quote fields (default '"') -* ``doublequote``: Control quoting of ``quotechar`` in fields (default True) -* ``escapechar``: Character used to escape ``sep`` and ``quotechar`` when - appropriate (default None) -* ``chunksize``: Number of rows to write at a time -* ``date_format``: Format string for datetime objects - -Writing a formatted string -++++++++++++++++++++++++++ - -.. _io.formatting: - -The ``DataFrame`` object has an instance method ``to_string`` which allows control -over the string representation of the object. All arguments are optional: - -* ``buf`` default None, for example a StringIO object -* ``columns`` default None, which columns to write -* ``col_space`` default None, minimum width of each column. -* ``na_rep`` default ``NaN``, representation of NA value -* ``formatters`` default None, a dictionary (by column) of functions each of - which takes a single argument and returns a formatted string -* ``float_format`` default None, a function which takes a single (float) - argument and returns a formatted string; to be applied to floats in the - ``DataFrame``. -* ``sparsify`` default True, set to False for a ``DataFrame`` with a hierarchical - index to print every MultiIndex key at each row. -* ``index_names`` default True, will print the names of the indices -* ``index`` default True, will print the index (ie, row labels) -* ``header`` default True, will print the column labels -* ``justify`` default ``left``, will print column headers left- or - right-justified - -The ``Series`` object also has a ``to_string`` method, but with only the ``buf``, -``na_rep``, ``float_format`` arguments. There is also a ``length`` argument -which, if set to ``True``, will additionally output the length of the Series. - -.. _io.json: - -JSON ----- - -Read and write ``JSON`` format files and strings. 
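
As a quick orientation, the sketch below round-trips a small frame through a JSON string. The frame contents are invented for illustration, and ``orient="split"`` is just one of the available layouts.

```python
from io import StringIO

import pandas as pd

# Illustrative frame; the column names and values are made up for this sketch.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# With no path given, to_json returns the JSON document as a string.
json_str = df.to_json(orient="split")
print(json_str)

# read_json parses the string back; literal JSON is wrapped in StringIO.
restored = pd.read_json(StringIO(json_str), orient="split")
print(restored)
```

Passing the same ``orient`` on both sides keeps the writer and the reader in agreement about the layout of the JSON document.
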
- -.. _io.json_writer: - -Writing JSON -'''''''''''' - -A ``Series`` or ``DataFrame`` can be converted to a valid JSON string. Use ``to_json`` -with optional parameters: - -* ``path_or_buf`` : the pathname or buffer to write the output. - This can be ``None`` in which case a JSON string is returned. -* ``orient`` : - - ``Series``: - * default is ``index`` - * allowed values are {``split``, ``records``, ``index``} - - ``DataFrame``: - * default is ``columns`` - * allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} - - The format of the JSON string - - .. csv-table:: - :widths: 20, 150 - - ``split``, dict like {index -> [index]; columns -> [columns]; data -> [values]} - ``records``, list like [{column -> value}; ... ] - ``index``, dict like {index -> {column -> value}} - ``columns``, dict like {column -> {index -> value}} - ``values``, just the values array - ``table``, adhering to the JSON `Table Schema`_ - -* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601. -* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10. -* ``force_ascii`` : force encoded string to be ASCII, default True. -* ``date_unit`` : The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'. -* ``default_handler`` : The handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object. -* ``lines`` : If ``records`` orient, then will write each record per line as json. -* ``mode`` : string, writer mode when writing to path. 'w' for write, 'a' for append. 
Default 'w' - -Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datetime`` objects will be converted based on the ``date_format`` and ``date_unit`` parameters. - -.. ipython:: python - - dfj = pd.DataFrame(np.random.randn(5, 2), columns=list("AB")) - json = dfj.to_json() - json - -Orient options -++++++++++++++ - -There are a number of different options for the format of the resulting JSON -file / string. Consider the following ``DataFrame`` and ``Series``: - -.. ipython:: python - - dfjo = pd.DataFrame( - dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)), - columns=list("ABC"), - index=list("xyz"), - ) - dfjo - sjo = pd.Series(dict(x=15, y=16, z=17), name="D") - sjo - -**Column oriented** (the default for ``DataFrame``) serializes the data as -nested JSON objects with column labels acting as the primary index: - -.. ipython:: python - - dfjo.to_json(orient="columns") - # Not available for Series - -**Index oriented** (the default for ``Series``) similar to column oriented -but the index labels are now primary: - -.. ipython:: python - - dfjo.to_json(orient="index") - sjo.to_json(orient="index") - -**Record oriented** serializes the data to a JSON array of column -> value records, -index labels are not included. This is useful for passing ``DataFrame`` data to plotting -libraries, for example the JavaScript library ``d3.js``: - -.. ipython:: python - - dfjo.to_json(orient="records") - sjo.to_json(orient="records") - -**Value oriented** is a bare-bones option which serializes to nested JSON arrays of -values only, column and index labels are not included: - -.. ipython:: python - - dfjo.to_json(orient="values") - # Not available for Series - -**Split oriented** serializes to a JSON object containing separate entries for -values, index and columns. Name is also included for ``Series``: - -.. 
ipython:: python - - dfjo.to_json(orient="split") - sjo.to_json(orient="split") - -**Table oriented** serializes to the JSON `Table Schema`_, allowing for the -preservation of metadata including but not limited to dtypes and index names. - -.. note:: - - Any orient option that encodes to a JSON object will not preserve the ordering of - index and column labels during round-trip serialization. If you wish to preserve - label ordering use the ``split`` option as it uses ordered containers. - -Date handling -+++++++++++++ - -Writing in ISO date format: - -.. ipython:: python - - dfd = pd.DataFrame(np.random.randn(5, 2), columns=list("AB")) - dfd["date"] = pd.Timestamp("20130101") - dfd = dfd.sort_index(axis=1, ascending=False) - json = dfd.to_json(date_format="iso") - json - -Writing in ISO date format, with microseconds: - -.. ipython:: python - - json = dfd.to_json(date_format="iso", date_unit="us") - json - -Writing to a file, with a date index and a date column: - -.. ipython:: python - - dfj2 = dfj.copy() - dfj2["date"] = pd.Timestamp("20130101") - dfj2["ints"] = list(range(5)) - dfj2["bools"] = True - dfj2.index = pd.date_range("20130101", periods=5) - dfj2.to_json("test.json", date_format="iso") - - with open("test.json") as fh: - print(fh.read()) - -Fallback behavior -+++++++++++++++++ - -If the JSON serializer cannot handle the container contents directly it will -fall back in the following manner: - -* if the dtype is unsupported (e.g. ``np.complex_``) then the ``default_handler``, if provided, will be called - for each value, otherwise an exception is raised. - -* if an object is unsupported it will attempt the following: - - - - check if the object has defined a ``toDict`` method and call it. - A ``toDict`` method should return a ``dict`` which will then be JSON serialized. - - - invoke the ``default_handler`` if one was provided. - - - convert the object to a ``dict`` by traversing its contents. 
However this will often fail - with an ``OverflowError`` or give unexpected results. - -In general the best approach for unsupported objects or dtypes is to provide a ``default_handler``. -For example: - -.. code-block:: python - - >>> pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json() # raises - RuntimeError: Unhandled numpy dtype 15 - -can be dealt with by specifying a simple ``default_handler``: - -.. ipython:: python - - pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str) - -.. _io.json_reader: - -Reading JSON -'''''''''''' - -Reading a JSON string into a pandas object can take a number of parameters. -The parser will try to parse a ``DataFrame`` if ``typ`` is not supplied or -is ``None``. To explicitly force ``Series`` parsing, pass ``typ="series"``. - -* ``filepath_or_buffer`` : a **VALID** JSON string or file handle / StringIO. The string could be - a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host - is expected. For instance, a local file could be - ``file://localhost/path/to/table.json`` -* ``typ`` : type of object to recover (series or frame), default 'frame' -* ``orient`` : - - Series : - * default is ``index`` - * allowed values are {``split``, ``records``, ``index``} - - DataFrame - * default is ``columns`` - * allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} - - The format of the JSON string - - .. csv-table:: - :widths: 20, 150 - - ``split``, dict like {index -> [index]; columns -> [columns]; data -> [values]} - ``records``, list like [{column -> value} ...] - ``index``, dict like {index -> {column -> value}} - ``columns``, dict like {column -> {index -> value}} - ``values``, just the values array - ``table``, adhering to the JSON `Table Schema`_ - - -* ``dtype`` : if ``True``, infer dtypes; if a dict of column to dtype, use those; if ``False``, don't infer dtypes at all. Default is ``True``. Applies only to the data. 
-* ``convert_axes`` : boolean, try to convert the axes to the proper dtypes, default is ``True`` -* ``convert_dates`` : a list of columns to parse for dates; if ``True``, then try to parse date-like columns, default is ``True``. -* ``keep_default_dates`` : boolean, default ``True``. If parsing dates, then parse the default date-like columns. -* ``precise_float`` : boolean, default ``False``. Set to enable usage of the higher precision (strtod) function when decoding string to double values. Default (``False``) is to use fast but less precise builtin functionality. -* ``date_unit`` : string, the timestamp unit to detect if converting dates. Default - ``None``. By default the timestamp precision is detected; if this is not desired, - pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to - seconds, milliseconds, microseconds or nanoseconds respectively. -* ``lines`` : reads the file as one JSON object per line. -* ``encoding`` : The encoding to use to decode bytes. -* ``chunksize`` : when used in combination with ``lines=True``, return a ``pandas.api.typing.JsonReader`` which reads in ``chunksize`` lines per iteration. -* ``engine``: Either ``"ujson"``, the built-in JSON parser, or ``"pyarrow"`` which dispatches to pyarrow's ``pyarrow.json.read_json``. - The ``"pyarrow"`` engine is only available when ``lines=True``. - -The parser will raise one of ``ValueError/TypeError/AssertionError`` if the JSON is not parseable. - -If a non-default ``orient`` was used when encoding to JSON, be sure to pass the same -option here so that decoding produces sensible results; see `Orient Options`_ for an -overview. - -Data conversion -+++++++++++++++ - -The default of ``convert_axes=True``, ``dtype=True``, and ``convert_dates=True`` -will try to parse the axes, and all of the data into appropriate types, -including dates. If you need to override specific dtypes, pass a dict to -``dtype``. 
``convert_axes`` should only be set to ``False`` if you need to -preserve string-like numbers (e.g. '1', '2') in an axes. - -.. note:: - - Large integer values may be converted to dates if ``convert_dates=True`` and the data and / or column labels appear 'date-like'. The exact threshold depends on the ``date_unit`` specified. 'date-like' means that the column label meets one of the following criteria: - - * it ends with ``'_at'`` - * it ends with ``'_time'`` - * it begins with ``'timestamp'`` - * it is ``'modified'`` - * it is ``'date'`` - -.. warning:: - - When reading JSON data, automatic coercing into dtypes has some quirks: - - * an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization - * a column that was ``float`` data will be converted to ``integer`` if it can be done safely, e.g. a column of ``1.`` - * bool columns will be converted to ``integer`` on reconstruction - - Thus there are times where you may want to specify specific dtypes via the ``dtype`` keyword argument. - -Reading from a JSON string: - -.. ipython:: python - - from io import StringIO - pd.read_json(StringIO(json)) - -Reading from a file: - -.. ipython:: python - - pd.read_json("test.json") - -Don't convert any data (but still convert axes and dates): - -.. ipython:: python - - pd.read_json("test.json", dtype=object).dtypes - -Specify dtypes for conversion: - -.. ipython:: python - - pd.read_json("test.json", dtype={"A": "float32", "bools": "int8"}).dtypes - -Preserve string indices: - -.. ipython:: python - - from io import StringIO - si = pd.DataFrame( - np.zeros((4, 4)), columns=list(range(4)), index=[str(i) for i in range(4)] - ) - si - si.index - si.columns - json = si.to_json() - - sij = pd.read_json(StringIO(json), convert_axes=False) - sij - sij.index - sij.columns - -Dates written in nanoseconds need to be read back in nanoseconds: - -.. 
ipython:: python - - from io import StringIO - json = dfj2.to_json(date_format="iso", date_unit="ns") - - # Try to parse timestamps as milliseconds -> Won't Work - dfju = pd.read_json(StringIO(json), date_unit="ms") - dfju - - # Let pandas detect the correct precision - dfju = pd.read_json(StringIO(json)) - dfju - - # Or specify that all timestamps are in nanoseconds - dfju = pd.read_json(StringIO(json), date_unit="ns") - dfju - -By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame. - -.. ipython:: python - - data = ( - '{"a":{"0":1,"1":3},"b":{"0":2.5,"1":4.5},"c":{"0":true,"1":false},"d":{"0":"a","1":"b"},' - '"e":{"0":null,"1":6.0},"f":{"0":null,"1":7.5},"g":{"0":null,"1":true},"h":{"0":null,"1":"a"},' - '"i":{"0":"12-31-2019","1":"12-31-2019"},"j":{"0":null,"1":null}}' - ) - df = pd.read_json(StringIO(data), dtype_backend="pyarrow") - df - df.dtypes - -.. _io.json_normalize: - -Normalization -''''''''''''' - -pandas provides a utility function to take a dict or list of dicts and *normalize* this semi-structured data -into a flat table. - -.. ipython:: python - - data = [ - {"id": 1, "name": {"first": "Coleen", "last": "Volk"}}, - {"name": {"given": "Mark", "family": "Regner"}}, - {"id": 2, "name": "Faye Raker"}, - ] - pd.json_normalize(data) - -.. ipython:: python - - data = [ - { - "state": "Florida", - "shortname": "FL", - "info": {"governor": "Rick Scott"}, - "county": [ - {"name": "Dade", "population": 12345}, - {"name": "Broward", "population": 40000}, - {"name": "Palm Beach", "population": 60000}, - ], - }, - { - "state": "Ohio", - "shortname": "OH", - "info": {"governor": "John Kasich"}, - "county": [ - {"name": "Summit", "population": 1234}, - {"name": "Cuyahoga", "population": 1337}, - ], - }, - ] - - pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]]) - -The max_level parameter provides more control over which level to end normalization. 
-With ``max_level=1`` the following snippet normalizes only up to the first nesting level of the provided dict. - -.. ipython:: python - - data = [ - { - "CreatedBy": {"Name": "User001"}, - "Lookup": { - "TextField": "Some text", - "UserField": {"Id": "ID001", "Name": "Name001"}, - }, - "Image": {"a": "b"}, - } - ] - pd.json_normalize(data, max_level=1) - -.. _io.jsonl: - -Line delimited json -''''''''''''''''''' - -pandas is able to read and write line-delimited JSON files that are common in data processing pipelines -using Hadoop or Spark. - -For line-delimited JSON files, pandas can also return an iterator which reads in ``chunksize`` lines at a time. This can be useful for large files or to read from a stream. - -.. ipython:: python - - from io import StringIO - jsonl = """ - {"a": 1, "b": 2} - {"a": 3, "b": 4} - """ - df = pd.read_json(StringIO(jsonl), lines=True) - df - df.to_json(orient="records", lines=True) - - # reader is an iterator that returns ``chunksize`` lines each iteration - with pd.read_json(StringIO(jsonl), lines=True, chunksize=1) as reader: - reader - for chunk in reader: - print(chunk) - -Line-delimited JSON can also be read using the pyarrow reader by specifying ``engine="pyarrow"``. - -.. ipython:: python - - from io import BytesIO - df = pd.read_json(BytesIO(jsonl.encode()), lines=True, engine="pyarrow") - df - -.. versionadded:: 2.0.0 - -.. _io.table_schema: - -Table schema -'''''''''''' - -`Table Schema`_ is a spec for describing tabular datasets as a JSON -object. The JSON includes information on the field names, types, and -other attributes. You can use the orient ``table`` to build -a JSON string with two fields, ``schema`` and ``data``. - -.. 
ipython:: python - - df = pd.DataFrame( - { - "A": [1, 2, 3], - "B": ["a", "b", "c"], - "C": pd.date_range("2016-01-01", freq="D", periods=3), - }, - index=pd.Index(range(3), name="idx"), - ) - df - df.to_json(orient="table", date_format="iso") - -The ``schema`` field contains the ``fields`` key, which itself contains -a list of column name to type pairs, including the ``Index`` or ``MultiIndex`` -(see below for a list of types). -The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index -is unique. - -The second field, ``data``, contains the serialized data with the ``records`` -orient. -The index is included, and any datetimes are ISO 8601 formatted, as required -by the Table Schema spec. - -The full list of types supported are described in the Table Schema -spec. This table shows the mapping from pandas types: - -=============== ================= -pandas type Table Schema type -=============== ================= -int64 integer -float64 number -bool boolean -datetime64[ns] datetime -timedelta64[ns] duration -categorical any -object str -=============== ================= - -A few notes on the generated table schema: - -* The ``schema`` object contains a ``pandas_version`` field. This contains - the version of pandas' dialect of the schema, and will be incremented - with each revision. -* All dates are converted to UTC when serializing. Even timezone naive values, - which are treated as UTC with an offset of 0. - - .. ipython:: python - - from pandas.io.json import build_table_schema - - s = pd.Series(pd.date_range("2016", periods=4)) - build_table_schema(s) - -* datetimes with a timezone (before serializing), include an additional field - ``tz`` with the time zone name (e.g. ``'US/Central'``). - - .. ipython:: python - - s_tz = pd.Series(pd.date_range("2016", periods=12, tz="US/Central")) - build_table_schema(s_tz) - -* Periods are converted to timestamps before serialization, and so have the - same behavior of being converted to UTC. 
In addition, periods will contain - an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``. - - .. ipython:: python - - s_per = pd.Series(1, index=pd.period_range("2016", freq="Y-DEC", periods=4)) - build_table_schema(s_per) - -* Categoricals use the ``any`` type and an ``enum`` constraint listing - the set of possible values. Additionally, an ``ordered`` field is included: - - .. ipython:: python - - s_cat = pd.Series(pd.Categorical(["a", "b", "a"])) - build_table_schema(s_cat) - -* A ``primaryKey`` field, containing an array of labels, is included - *if the index is unique*: - - .. ipython:: python - - s_dupe = pd.Series([1, 2], index=[1, 1]) - build_table_schema(s_dupe) - -* The ``primaryKey`` behavior is the same with MultiIndexes, but in this - case the ``primaryKey`` is an array: - - .. ipython:: python - - s_multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)])) - build_table_schema(s_multi) - -* The default naming roughly follows these rules: - - - For ``Series``, the ``object.name`` is used. If that is ``None``, then the - name is ``values``. - - For ``DataFrames``, the stringified version of the column name is used. - - For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a - fallback to ``index`` if that is ``None``. - - For ``MultiIndex``, ``mi.names`` is used. If any level has no name, - then ``level_<i>`` is used. - -``read_json`` also accepts ``orient='table'`` as an argument. This allows for -the preservation of metadata such as dtypes and index names in a -round-trippable manner. - -.. 
ipython:: python - - df = pd.DataFrame( - { - "foo": [1, 2, 3, 4], - "bar": ["a", "b", "c", "d"], - "baz": pd.date_range("2018-01-01", freq="D", periods=4), - "qux": pd.Categorical(["a", "b", "c", "c"]), - }, - index=pd.Index(range(4), name="idx"), - ) - df - df.dtypes - - df.to_json("test.json", orient="table") - new_df = pd.read_json("test.json", orient="table") - new_df - new_df.dtypes - -Please note that the literal string 'index' as the name of an :class:`Index` -is not round-trippable, nor are any names beginning with ``'level_'`` within a -:class:`MultiIndex`. These are used by default in :func:`DataFrame.to_json` to -indicate missing values and the subsequent read cannot distinguish the intent. - -.. ipython:: python - :okwarning: - - df.index.name = "index" - df.to_json("test.json", orient="table") - new_df = pd.read_json("test.json", orient="table") - print(new_df.index.name) - -.. ipython:: python - :suppress: - - os.remove("test.json") - -When using ``orient='table'`` along with user-defined ``ExtensionArray``, -the generated schema will contain an additional ``extDtype`` key in the respective -``fields`` element. This extra key is not standard but does enable JSON roundtrips -for extension types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``). - -The ``extDtype`` key carries the name of the extension, if you have properly registered -the ``ExtensionDtype``, pandas will use said name to perform a lookup into the registry -and re-convert the serialized data into your custom dtype. - -.. _Table Schema: https://specs.frictionlessdata.io/table-schema/ - - -HTML ----- - -.. _io.read_html: - -Reading HTML content -'''''''''''''''''''''' - -.. warning:: - - We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas ` - below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers. 
- -The top-level :func:`~pandas.io.html.read_html` function can accept an HTML -string/file/URL and will parse HTML tables into list of pandas ``DataFrames``. -Let's look at a few examples. - -.. note:: - - ``read_html`` returns a ``list`` of ``DataFrame`` objects, even if there is - only a single table contained in the HTML content. - -Read a URL with no options: - -.. code-block:: ipython - - In [320]: url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list" - - In [321]: pd.read_html(url) - Out[321]: - [ Bank NameBank CityCity StateSt ... Acquiring InstitutionAI Closing DateClosing FundFund - 0 Almena State Bank Almena KS ... Equity Bank October 23, 2020 10538 - 1 First City Bank of Florida Fort Walton Beach FL ... United Fidelity Bank, fsb October 16, 2020 10537 - 2 The First State Bank Barboursville WV ... MVB Bank, Inc. April 3, 2020 10536 - 3 Ericson State Bank Ericson NE ... Farmers and Merchants Bank February 14, 2020 10535 - 4 City National Bank of New Jersey Newark NJ ... Industrial Bank November 1, 2019 10534 - .. ... ... ... ... ... ... ... - 558 Superior Bank, FSB Hinsdale IL ... Superior Federal, FSB July 27, 2001 6004 - 559 Malta National Bank Malta OH ... North Valley Bank May 3, 2001 4648 - 560 First Alliance Bank & Trust Co. Manchester NH ... Southern New Hampshire Bank & Trust February 2, 2001 4647 - 561 National State Bank of Metropolis Metropolis IL ... Banterra Bank of Marion December 14, 2000 4646 - 562 Bank of Honolulu Honolulu HI ... Bank of the Orient October 13, 2000 4645 - - [563 rows x 7 columns]] - -.. note:: - - The data from the above URL changes every Monday so the resulting data above may be slightly different. - -Read a URL while passing headers alongside the HTTP request: - -.. 
code-block:: ipython - - In [322]: url = 'https://www.sump.org/notes/request/' # HTTP request reflector - - In [323]: pd.read_html(url) - Out[323]: - [ 0 1 - 0 Remote Socket: 51.15.105.256:51760 - 1 Protocol Version: HTTP/1.1 - 2 Request Method: GET - 3 Request URI: /notes/request/ - 4 Request Query: NaN, - 0 Accept-Encoding: identity - 1 Host: www.sump.org - 2 User-Agent: Python-urllib/3.8 - 3 Connection: close] - - In [324]: headers = { - .....: 'User-Agent':'Mozilla Firefox v14.0', - .....: 'Accept':'application/json', - .....: 'Connection':'keep-alive', - .....: 'Auth':'Bearer 2*/f3+fe68df*4' - .....: } - - In [325]: pd.read_html(url, storage_options=headers) - Out[325]: - [ 0 1 - 0 Remote Socket: 51.15.105.256:51760 - 1 Protocol Version: HTTP/1.1 - 2 Request Method: GET - 3 Request URI: /notes/request/ - 4 Request Query: NaN, - 0 User-Agent: Mozilla Firefox v14.0 - 1 AcceptEncoding: gzip, deflate, br - 2 Accept: application/json - 3 Connection: keep-alive - 4 Auth: Bearer 2*/f3+fe68df*4] - -.. note:: - - We see above that the headers we passed are reflected in the HTTP request. - -Read in the content of the file from the above URL and pass it to ``read_html`` -as a string: - -.. ipython:: python - - html_str = """ - - - - - - - - - - - -
ABC
abc
- """ - - with open("tmp.html", "w") as f: - f.write(html_str) - df = pd.read_html("tmp.html") - df[0] - -.. ipython:: python - :suppress: - - os.remove("tmp.html") - -You can even pass in an instance of ``StringIO`` if you so desire: - -.. ipython:: python - - dfs = pd.read_html(StringIO(html_str)) - dfs[0] - -.. note:: - - The following examples are not run by the IPython evaluator due to the fact - that having so many network-accessing functions slows down the documentation - build. If you spot an error or an example that doesn't run, please do not - hesitate to report it over on `pandas GitHub issues page - `__. - - -Read a URL and match a table that contains specific text: - -.. code-block:: python - - match = "Metcalf Bank" - df_list = pd.read_html(url, match=match) - -Specify a header row (by default ```` or ```` elements located within a -```` are used to form the column index, if multiple rows are contained within -```` then a MultiIndex is created); if specified, the header row is taken -from the data minus the parsed header elements (```` elements). - -.. code-block:: python - - dfs = pd.read_html(url, header=0) - -Specify an index column: - -.. code-block:: python - - dfs = pd.read_html(url, index_col=0) - -Specify a number of rows to skip: - -.. code-block:: python - - dfs = pd.read_html(url, skiprows=0) - -Specify a number of rows to skip using a list (``range`` works -as well): - -.. code-block:: python - - dfs = pd.read_html(url, skiprows=range(2)) - -Specify an HTML attribute: - -.. code-block:: python - - dfs1 = pd.read_html(url, attrs={"id": "table"}) - dfs2 = pd.read_html(url, attrs={"class": "sortable"}) - print(np.array_equal(dfs1[0], dfs2[0])) # Should be True - -Specify values that should be converted to NaN: - -.. code-block:: python - - dfs = pd.read_html(url, na_values=["No Acquirer"]) - -Specify whether to keep the default set of NaN values: - -.. 
code-block:: python - - dfs = pd.read_html(url, keep_default_na=False) - -Specify converters for columns. This is useful for numerical text data that has -leading zeros. By default columns that are numerical are cast to numeric -types and the leading zeros are lost. To avoid this, we can convert these -columns to strings. - -.. code-block:: python - - url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code?oldid=899173761" - dfs = pd.read_html( - url_mcc, - match="Telekom Albania", - header=0, - converters={"MNC": str}, - ) - -Use some combination of the above: - -.. code-block:: python - - dfs = pd.read_html(url, match="Metcalf Bank", index_col=0) - -Read in pandas ``to_html`` output (with some loss of floating point precision): - -.. code-block:: python - - df = pd.DataFrame(np.random.randn(2, 2)) - s = df.to_html(float_format="{0:.40g}".format) - dfin = pd.read_html(s, index_col=0) - -The ``lxml`` backend will raise an error on a failed parse if that is the only -parser you provide. If you only have a single parser you can provide just a -string, but it is considered good practice to pass a list with one string if, -for example, the function expects a sequence of strings. You may use: - -.. code-block:: python - - dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"]) - -Or you could pass ``flavor='lxml'`` without a list: - -.. code-block:: python - - dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml") - -However, if you have bs4 and html5lib installed and pass ``None`` or ``['lxml', -'bs4']`` then the parse will most likely succeed. Note that *as soon as a parse -succeeds, the function will return*. - -.. code-block:: python - - dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"]) - -Links can be extracted from cells along with the text using ``extract_links="all"``. - -.. ipython:: python - - html_table = """ - - - - - - - -
GitHub
pandas
- """ - - df = pd.read_html( - StringIO(html_table), - extract_links="all" - )[0] - df - df[("GitHub", None)] - df[("GitHub", None)].str[1] - -.. versionadded:: 1.5.0 - -.. _io.html: - -Writing to HTML files -'''''''''''''''''''''' - -``DataFrame`` objects have an instance method ``to_html`` which renders the -contents of the ``DataFrame`` as an HTML table. The function arguments are as -in the method ``to_string`` described above. - -.. note:: - - Not all of the possible options for ``DataFrame.to_html`` are shown here for - brevity's sake. See :func:`.DataFrame.to_html` for the - full set of options. - -.. note:: - - In an HTML-rendering supported environment like a Jupyter Notebook, ``display(HTML(...))``` - will render the raw HTML into the environment. - -.. ipython:: python - - from IPython.display import display, HTML - - df = pd.DataFrame(np.random.randn(2, 2)) - df - html = df.to_html() - print(html) # raw html - display(HTML(html)) - -The ``columns`` argument will limit the columns shown: - -.. ipython:: python - - html = df.to_html(columns=[0]) - print(html) - display(HTML(html)) - -``float_format`` takes a Python callable to control the precision of floating -point values: - -.. ipython:: python - - html = df.to_html(float_format="{0:.10f}".format) - print(html) - display(HTML(html)) - - -``bold_rows`` will make the row labels bold by default, but you can turn that -off: - -.. ipython:: python - - html = df.to_html(bold_rows=False) - print(html) - display(HTML(html)) - - -The ``classes`` argument provides the ability to give the resulting HTML -table CSS classes. Note that these classes are *appended* to the existing -``'dataframe'`` class. - -.. ipython:: python - - print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"])) - -The ``render_links`` argument provides the ability to add hyperlinks to cells -that contain URLs. - -.. 
ipython:: python - - url_df = pd.DataFrame( - { - "name": ["Python", "pandas"], - "url": ["https://www.python.org/", "https://pandas.pydata.org"], - } - ) - html = url_df.to_html(render_links=True) - print(html) - display(HTML(html)) - -Finally, the ``escape`` argument allows you to control whether the -"<", ">" and "&" characters escaped in the resulting HTML (by default it is -``True``). So to get the HTML without escaped characters pass ``escape=False`` - -.. ipython:: python - - df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)}) - -Escaped: - -.. ipython:: python - - html = df.to_html() - print(html) - display(HTML(html)) - -Not escaped: - -.. ipython:: python - - html = df.to_html(escape=False) - print(html) - display(HTML(html)) - -.. note:: - - Some browsers may not show a difference in the rendering of the previous two - HTML tables. - - -.. _io.html.gotchas: - -HTML Table Parsing Gotchas -'''''''''''''''''''''''''' - -There are some versioning issues surrounding the libraries that are used to -parse HTML tables in the top-level pandas io function ``read_html``. - -**Issues with** |lxml|_ - -* Benefits - - - |lxml|_ is very fast. - - - |lxml|_ requires Cython to install correctly. - -* Drawbacks - - - |lxml|_ does *not* make any guarantees about the results of its parse - *unless* it is given |svm|_. - - - In light of the above, we have chosen to allow you, the user, to use the - |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ - fails to parse - - - It is therefore *highly recommended* that you install both - |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid - result (provided everything else is valid) even if |lxml|_ fails. - -**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** - -* The above issues hold here as well since |BeautifulSoup4|_ is essentially - just a wrapper around a parser backend. 
- -**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** - -* Benefits - - - |html5lib|_ is far more lenient than |lxml|_ and consequently deals - with *real-life markup* in a much saner way rather than just, e.g., - dropping an element without notifying you. - - - |html5lib|_ *generates valid HTML5 markup from invalid markup - automatically*. This is extremely important for parsing HTML tables, - since it guarantees a valid document. However, that does NOT mean that - it is "correct", since the process of fixing markup does not have a - single definition. - - - |html5lib|_ is pure Python and requires no additional build steps beyond - its own installation. - -* Drawbacks - - - The biggest drawback to using |html5lib|_ is that it is slow as - molasses. However consider the fact that many tables on the web are not - big enough for the parsing algorithm runtime to matter. It is more - likely that the bottleneck will be in the process of reading the raw - text from the URL over the web, i.e., IO (input-output). For very large - tables, this might not be true. - - -.. |svm| replace:: **strictly valid markup** -.. _svm: https://validator.w3.org/docs/help.html#validation_basics - -.. |html5lib| replace:: **html5lib** -.. _html5lib: https://github.com/html5lib/html5lib-python - -.. |BeautifulSoup4| replace:: **BeautifulSoup4** -.. _BeautifulSoup4: https://www.crummy.com/software/BeautifulSoup - -.. |lxml| replace:: **lxml** -.. _lxml: https://lxml.de - -.. _io.latex: - -LaTeX ------ - -.. versionadded:: 1.3.0 - -Currently there are no methods to read from LaTeX, only output methods. - -Writing to LaTeX files -'''''''''''''''''''''' - -.. note:: - - DataFrame *and* Styler objects currently have a ``to_latex`` method. 
We recommend - using the `Styler.to_latex() <../reference/api/pandas.io.formats.style.Styler.to_latex.rst>`__ method - over `DataFrame.to_latex() <../reference/api/pandas.DataFrame.to_latex.rst>`__ due to the former's greater flexibility with - conditional styling, and the latter's possible future deprecation. - -Review the documentation for `Styler.to_latex <../reference/api/pandas.io.formats.style.Styler.to_latex.rst>`__, -which gives examples of conditional styling and explains the operation of its keyword -arguments. - -For simple application the following pattern is sufficient. - -.. ipython:: python - - df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["c", "d"]) - print(df.style.to_latex()) - -To format values before output, chain the `Styler.format <../reference/api/pandas.io.formats.style.Styler.format.rst>`__ -method. - -.. ipython:: python - - print(df.style.format("€ {}").to_latex()) - -XML ---- - -.. _io.read_xml: - -Reading XML -''''''''''' - -.. versionadded:: 1.3.0 - -The top-level :func:`~pandas.io.xml.read_xml` function can accept an XML -string/file/URL and will parse nodes and attributes into a pandas ``DataFrame``. - -.. note:: - - Since there is no standard XML structure where design types can vary in - many ways, ``read_xml`` works best with flatter, shallow versions. If - an XML document is deeply nested, use the ``stylesheet`` feature to - transform XML into a flatter version. - -Let's look at a few examples. - -Read an XML string: - -.. ipython:: python - - from io import StringIO - xml = """ - - - Everyday Italian - Giada De Laurentiis - 2005 - 30.00 - - - Harry Potter - J K. Rowling - 2005 - 29.99 - - - Learning XML - Erik T. Ray - 2003 - 39.95 - - """ - - df = pd.read_xml(StringIO(xml)) - df - -Read a URL with no options: - -.. ipython:: python - - df = pd.read_xml("https://www.w3schools.com/xml/books.xml") - df - -Read in the content of the "books.xml" file and pass it to ``read_xml`` -as a string: - -.. 
ipython:: python - - file_path = "books.xml" - with open(file_path, "w") as f: - f.write(xml) - - with open(file_path, "r") as f: - df = pd.read_xml(StringIO(f.read())) - df - -Read in the content of the "books.xml" as instance of ``StringIO`` or -``BytesIO`` and pass it to ``read_xml``: - -.. ipython:: python - - with open(file_path, "r") as f: - sio = StringIO(f.read()) - - df = pd.read_xml(sio) - df - -.. ipython:: python - - with open(file_path, "rb") as f: - bio = BytesIO(f.read()) - - df = pd.read_xml(bio) - df - -Even read XML from AWS S3 buckets such as NIH NCBI PMC Article Datasets providing -Biomedical and Life Science Journals: - -.. code-block:: python - - >>> df = pd.read_xml( - ... "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml", - ... xpath=".//journal-meta", - ...) - >>> df - journal-id journal-title issn publisher - 0 Cardiovasc Ultrasound Cardiovascular Ultrasound 1476-7120 NaN - -With `lxml`_ as default ``parser``, you access the full-featured XML library -that extends Python's ElementTree API. One powerful tool is ability to query -nodes selectively or conditionally with more expressive XPath: - -.. _lxml: https://lxml.de - -.. ipython:: python - - df = pd.read_xml(file_path, xpath="//book[year=2005]") - df - -Specify only elements or only attributes to parse: - -.. ipython:: python - - df = pd.read_xml(file_path, elems_only=True) - df - -.. ipython:: python - - df = pd.read_xml(file_path, attrs_only=True) - df - -.. ipython:: python - :suppress: - - os.remove("books.xml") - -XML documents can have namespaces with prefixes and default namespaces without -prefixes both of which are denoted with a special attribute ``xmlns``. In order -to parse by node under a namespace context, ``xpath`` must reference a prefix. - -For example, below XML contains a namespace with prefix, ``doc``, and URI at -``https://example.com``. In order to parse ``doc:row`` nodes, -``namespaces`` must be used. - -.. 
ipython:: python - - xml = """ - - - square - 360 - 4.0 - - - circle - 360 - - - - triangle - 180 - 3.0 - - """ - - df = pd.read_xml(StringIO(xml), - xpath="//doc:row", - namespaces={"doc": "https://example.com"}) - df - -Similarly, an XML document can have a default namespace without prefix. Failing -to assign a temporary prefix will return no nodes and raise a ``ValueError``. -But assigning *any* temporary name to correct URI allows parsing by nodes. - -.. ipython:: python - - xml = """ - - - square - 360 - 4.0 - - - circle - 360 - - - - triangle - 180 - 3.0 - - """ - - df = pd.read_xml(StringIO(xml), - xpath="//pandas:row", - namespaces={"pandas": "https://example.com"}) - df - -However, if XPath does not reference node names such as default, ``/*``, then -``namespaces`` is not required. - -.. note:: - - Since ``xpath`` identifies the parent of content to be parsed, only immediate - descendants which include child nodes or current attributes are parsed. - Therefore, ``read_xml`` will not parse the text of grandchildren or other - descendants and will not parse attributes of any descendant. To retrieve - lower level content, adjust xpath to lower level. For example, - - .. ipython:: python - :okwarning: - - xml = """ - - - square - 360 - - - circle - 360 - - - triangle - 180 - - """ - - df = pd.read_xml(StringIO(xml), xpath="./row") - df - - shows the attribute ``sides`` on ``shape`` element was not parsed as - expected since this attribute resides on the child of ``row`` element - and not ``row`` element itself. In other words, ``sides`` attribute is a - grandchild level descendant of ``row`` element. However, the ``xpath`` - targets ``row`` element which covers only its children and attributes. - -With `lxml`_ as parser, you can flatten nested XML documents with an XSLT -script which also can be string/file/URL types. 
As background, `XSLT`_ is -a special-purpose language written in a special XML file that can transform -original XML documents into other XML, HTML, even text (CSV, JSON, etc.) -using an XSLT processor. - -.. _lxml: https://lxml.de -.. _XSLT: https://www.w3.org/TR/xslt/ - -For example, consider this somewhat nested structure of Chicago "L" Rides -where station and rides elements encapsulate data in their own sections. -With below XSLT, ``lxml`` can transform original nested document into a flatter -output (as shown below for demonstration) for easier parse into ``DataFrame``: - -.. ipython:: python - - xml = """ - - - - 2020-09-01T00:00:00 - - 864.2 - 534 - 417.2 - - - - - 2020-09-01T00:00:00 - - 2707.4 - 1909.8 - 1438.6 - - - - - 2020-09-01T00:00:00 - - 2949.6 - 1657 - 1453.8 - - - """ - - xsl = """ - - - - - - - - - - - - - - - """ - - output = """ - - - 40850 - Library - 2020-09-01T00:00:00 - 864.2 - 534 - 417.2 - - - 41700 - Washington/Wabash - 2020-09-01T00:00:00 - 2707.4 - 1909.8 - 1438.6 - - - 40380 - Clark/Lake - 2020-09-01T00:00:00 - 2949.6 - 1657 - 1453.8 - - """ - - df = pd.read_xml(StringIO(xml), stylesheet=StringIO(xsl)) - df - -For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml` -supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_ -which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes. -without holding entire tree in memory. - -.. versionadded:: 1.5.0 - -.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk -.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse - -To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument. -Files should not be compressed or point to online sources but stored on local disk. 
Also, ``iterparse`` should be -a dictionary where the key is the repeating nodes in document (which become the rows) and the value is a list of -any element or attribute that is a descendant (i.e., child, grandchild) of repeating node. Since XPath is not -used in this method, descendants do not need to share same relationship with one another. Below shows example -of reading in Wikipedia's very large (12 GB+) latest article data dump. - -.. code-block:: ipython - - In [1]: df = pd.read_xml( - ... "/path/to/downloaded/enwikisource-latest-pages-articles.xml", - ... iterparse = {"page": ["title", "ns", "id"]} - ... ) - ... df - Out[2]: - title ns id - 0 Gettysburg Address 0 21450 - 1 Main Page 0 42950 - 2 Declaration by United Nations 0 8435 - 3 Constitution of the United States of America 0 8435 - 4 Declaration of Independence (Israel) 0 17858 - ... ... ... ... - 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649 - 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649 - 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649 - 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291 - 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450 - - [3578765 rows x 3 columns] - -.. _io.xml: - -Writing XML -''''''''''' - -.. versionadded:: 1.3.0 - -``DataFrame`` objects have an instance method ``to_xml`` which renders the -contents of the ``DataFrame`` as an XML document. - -.. note:: - - This method does not support special properties of XML including DTD, - CData, XSD schemas, processing instructions, comments, and others. - Only namespaces at the root level is supported. However, ``stylesheet`` - allows design changes after initial output. - -Let's look at a few examples. - -Write an XML without options: - -.. ipython:: python - - geom_df = pd.DataFrame( - { - "shape": ["square", "circle", "triangle"], - "degrees": [360, 360, 180], - "sides": [4, np.nan, 3], - } - ) - - print(geom_df.to_xml()) - - -Write an XML with new root and row name: - -.. 
ipython:: python - - print(geom_df.to_xml(root_name="geometry", row_name="objects")) - -Write an attribute-centric XML: - -.. ipython:: python - - print(geom_df.to_xml(attr_cols=geom_df.columns.tolist())) - -Write a mix of elements and attributes: - -.. ipython:: python - - print( - geom_df.to_xml( - index=False, - attr_cols=['shape'], - elem_cols=['degrees', 'sides']) - ) - -Any ``DataFrames`` with hierarchical columns will be flattened for XML element names -with levels delimited by underscores: - -.. ipython:: python - - ext_geom_df = pd.DataFrame( - { - "type": ["polygon", "other", "polygon"], - "shape": ["square", "circle", "triangle"], - "degrees": [360, 360, 180], - "sides": [4, np.nan, 3], - } - ) - - pvt_df = ext_geom_df.pivot_table(index='shape', - columns='type', - values=['degrees', 'sides'], - aggfunc='sum') - pvt_df - - print(pvt_df.to_xml()) - -Write an XML with default namespace: - -.. ipython:: python - - print(geom_df.to_xml(namespaces={"": "https://example.com"})) - -Write an XML with namespace prefix: - -.. ipython:: python - - print( - geom_df.to_xml(namespaces={"doc": "https://example.com"}, - prefix="doc") - ) - -Write an XML without declaration or pretty print: - -.. ipython:: python - - print( - geom_df.to_xml(xml_declaration=False, - pretty_print=False) - ) - -Write an XML and transform with stylesheet: - -.. ipython:: python - - xsl = """ - - - - - - - - - - - polygon - - - - - - - - """ - - print(geom_df.to_xml(stylesheet=StringIO(xsl))) - - -XML Final Notes -''''''''''''''' - -* All XML documents adhere to `W3C specifications`_. Both ``etree`` and ``lxml`` - parsers will fail to parse any markup document that is not well-formed or - does not follow XML syntax rules. Be aware that HTML is not an XML document unless it - follows XHTML specs. However, other popular markup types including KML, XAML, - RSS, MusicXML, MathML are compliant `XML schemas`_. 
- -* For above reason, if your application builds XML prior to pandas operations, - use appropriate DOM libraries like ``etree`` and ``lxml`` to build the necessary - document and not by string concatenation or regex adjustments. Always remember - XML is a *special* text file with markup rules. - -* With very large XML files (several hundred MBs to GBs), XPath and XSLT - can become memory-intensive operations. Be sure to have enough available - RAM for reading and writing to large XML files (roughly about 5 times the - size of text). - -* Because XSLT is a programming language, use it with caution since such scripts - can pose a security risk in your environment and can run large or infinite - recursive operations. Always test scripts on small fragments before full run. - -* The `etree`_ parser supports all functionality of both ``read_xml`` and - ``to_xml`` except for complex XPath and any XSLT. Though limited in features, - ``etree`` is still a reliable and capable parser and tree builder. Its - performance may trail ``lxml`` to a certain degree for larger files but - relatively unnoticeable on small to medium size files. - -.. _`W3C specifications`: https://www.w3.org/TR/xml/ -.. _`XML schemas`: https://en.wikipedia.org/wiki/List_of_types_of_XML_schemas -.. _`etree`: https://docs.python.org/3/library/xml.etree.elementtree.html - - - -.. _io.excel: - -Excel files ------------ - -The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files -using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files -can be read using ``xlrd``. Binary Excel (``.xlsb``) -files can be read using ``pyxlsb``. All formats can be read -using :ref:`calamine` engine. -The :meth:`~DataFrame.to_excel` instance method is used for -saving a ``DataFrame`` to Excel. Generally the semantics are -similar to working with :ref:`csv` data. -See the :ref:`cookbook` for some advanced strategies. - -.. 
note:: - - When ``engine=None``, the following logic will be used to determine the engine: - - - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), - then `odf `_ will be used. - - Otherwise if ``path_or_buffer`` is an xls format, ``xlrd`` will be used. - - Otherwise if ``path_or_buffer`` is in xlsb format, ``pyxlsb`` will be used. - - Otherwise ``openpyxl`` will be used. - -.. _io.excel_reader: - -Reading Excel files -''''''''''''''''''' - -In the most basic use-case, ``read_excel`` takes a path to an Excel -file, and the ``sheet_name`` indicating which sheet to parse. - -When using the ``engine_kwargs`` parameter, pandas will pass these arguments to the -engine. For this, it is important to know which function pandas is -using internally. - -* For the engine openpyxl, pandas is using :func:`openpyxl.load_workbook` to read in (``.xlsx``) and (``.xlsm``) files. - -* For the engine xlrd, pandas is using :func:`xlrd.open_workbook` to read in (``.xls``) files. - -* For the engine pyxlsb, pandas is using :func:`pyxlsb.open_workbook` to read in (``.xlsb``) files. - -* For the engine odf, pandas is using :func:`odf.opendocument.load` to read in (``.ods``) files. - -* For the engine calamine, pandas is using :func:`python_calamine.load_workbook` - to read in (``.xlsx``), (``.xlsm``), (``.xls``), (``.xlsb``), (``.ods``) files. - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.xls", sheet_name="Sheet1") - - -.. _io.excel.excelfile_class: - -``ExcelFile`` class -+++++++++++++++++++ - -To facilitate working with multiple sheets from the same file, the ``ExcelFile`` -class can be used to wrap the file and can be passed into ``read_excel``. -There will be a performance benefit for reading multiple sheets as the file is -read into memory only once. - -.. code-block:: python - - xlsx = pd.ExcelFile("path_to_file.xls") - df = pd.read_excel(xlsx, "Sheet1") - -The ``ExcelFile`` class can also be used as a context manager. - -.. 
code-block:: python - - with pd.ExcelFile("path_to_file.xls") as xls: - df1 = pd.read_excel(xls, "Sheet1") - df2 = pd.read_excel(xls, "Sheet2") - -The ``sheet_names`` property will generate -a list of the sheet names in the file. - -The primary use-case for an ``ExcelFile`` is parsing multiple sheets with -different parameters: - -.. code-block:: python - - data = {} - # For when Sheet1's format differs from Sheet2 - with pd.ExcelFile("path_to_file.xls") as xls: - data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"]) - data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1) - -Note that if the same parsing parameters are used for all sheets, a list -of sheet names can simply be passed to ``read_excel`` with no loss in performance. - -.. code-block:: python - - # using the ExcelFile class - data = {} - with pd.ExcelFile("path_to_file.xls") as xls: - data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"]) - data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=None, na_values=["NA"]) - - # equivalent using the read_excel function - data = pd.read_excel( - "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"] - ) - -``ExcelFile`` can also be called with a ``xlrd.book.Book`` object -as a parameter. This allows the user to control how the excel file is read. -For example, sheets can be loaded on demand by calling ``xlrd.open_workbook()`` -with ``on_demand=True``. - -.. code-block:: python - - import xlrd - - xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True) - with pd.ExcelFile(xlrd_book) as xls: - df1 = pd.read_excel(xls, "Sheet1") - df2 = pd.read_excel(xls, "Sheet2") - -.. _io.excel.specifying_sheets: - -Specifying sheets -+++++++++++++++++ - -.. note:: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``. - -.. note:: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets. 
- -* The argument ``sheet_name`` allows specifying the sheet or sheets to read. -* The default value for ``sheet_name`` is 0, indicating to read the first sheet. -* Pass a string to refer to the name of a particular sheet in the workbook. -* Pass an integer to refer to the index of a sheet. Indices follow Python - convention, beginning at 0. -* Pass a list of either strings or integers, to return a dictionary of specified sheets. -* Pass a ``None`` to return a dictionary of all available sheets. - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"]) - -Using the sheet index: - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.xls", 0, index_col=None, na_values=["NA"]) - -Using all default values: - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.xls") - -Using None to get all sheets: - -.. code-block:: python - - # Returns a dictionary of DataFrames - pd.read_excel("path_to_file.xls", sheet_name=None) - -Using a list to get multiple sheets: - -.. code-block:: python - - # Returns the 1st and 4th sheet, as a dictionary of DataFrames. - pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3]) - -``read_excel`` can read more than one sheet, by setting ``sheet_name`` to either -a list of sheet names, a list of sheet positions, or ``None`` to read all sheets. -Sheets can be specified by sheet index or sheet name, using an integer or string, -respectively. - -.. _io.excel.reading_multiindex: - -Reading a ``MultiIndex`` -++++++++++++++++++++++++ - -``read_excel`` can read a ``MultiIndex`` index, by passing a list of columns to ``index_col`` -and a ``MultiIndex`` column by passing a list of rows to ``header``. If either the ``index`` -or ``columns`` have serialized level names those will be read in as well by specifying -the rows/columns that make up the levels. - -For example, to read in a ``MultiIndex`` index without names: - -.. 
ipython:: python - - df = pd.DataFrame( - {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}, - index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]), - ) - df.to_excel("path_to_file.xlsx") - df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1]) - df - -If the index has level names, they will be parsed as well, using the same -parameters. - -.. ipython:: python - - df.index = df.index.set_names(["lvl1", "lvl2"]) - df.to_excel("path_to_file.xlsx") - df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1]) - df - - -If the source file has both ``MultiIndex`` index and columns, lists specifying each -should be passed to ``index_col`` and ``header``: - -.. ipython:: python - - df.columns = pd.MultiIndex.from_product([["a"], ["b", "d"]], names=["c1", "c2"]) - df.to_excel("path_to_file.xlsx") - df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1], header=[0, 1]) - df - -.. ipython:: python - :suppress: - - os.remove("path_to_file.xlsx") - -Missing values in columns specified in ``index_col`` will be forward filled to -allow roundtripping with ``to_excel`` for ``merge_cells=True``. To avoid forward -filling the missing values use ``set_index`` after reading the data instead of -``index_col``. - -Parsing specific columns -++++++++++++++++++++++++ - -It is often the case that users will insert columns to do temporary computations -in Excel and you may not want to read in those columns. ``read_excel`` takes -a ``usecols`` keyword to allow you to specify a subset of columns to parse. - -You can specify a comma-delimited set of Excel columns and ranges as a string: - -.. code-block:: python - - pd.read_excel("path_to_file.xls", "Sheet1", usecols="A,C:E") - -If ``usecols`` is a list of integers, then it is assumed to be the file column -indices to be parsed. - -.. code-block:: python - - pd.read_excel("path_to_file.xls", "Sheet1", usecols=[0, 2, 3]) - -Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. 
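The order-insensitivity described above can be checked with a small round trip. This is a minimal sketch, not part of the original guide: it assumes the optional ``openpyxl`` dependency is installed, and the frame contents and variable names are made up for illustration.

```python
from io import BytesIO

import pandas as pd

# Build a small workbook in memory so the sketch is self-contained.
# Assumption: the optional openpyxl engine is available.
source = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
buffer = BytesIO()
source.to_excel(buffer, index=False, engine="openpyxl")
workbook_bytes = buffer.getvalue()

# Both calls select file columns 0 and 2; the order of the list is ignored
# and the columns come back in file order.
forward = pd.read_excel(BytesIO(workbook_bytes), usecols=[0, 2], engine="openpyxl")
backward = pd.read_excel(BytesIO(workbook_bytes), usecols=[2, 0], engine="openpyxl")

# If a particular column order is needed, reorder after reading.
reordered = backward[["c", "a"]]
```

Selecting with ``[["c", "a"]]`` after the read is one way to get a specific order, since ``usecols`` only decides which columns are parsed.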
- -If ``usecols`` is a list of strings, it is assumed that each string corresponds -to a column name provided either by the user in ``names`` or inferred from the -document header row(s). Those strings define which columns will be parsed: - -.. code-block:: python - - pd.read_excel("path_to_file.xls", "Sheet1", usecols=["foo", "bar"]) - -Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``. - -If ``usecols`` is callable, the callable function will be evaluated against -the column names, returning names where the callable function evaluates to ``True``. - -.. code-block:: python - - pd.read_excel("path_to_file.xls", "Sheet1", usecols=lambda x: x.isalpha()) - -Parsing dates -+++++++++++++ - -Datetime-like values are normally automatically converted to the appropriate -dtype when reading the excel file. But if you have a column of strings that -*look* like dates (but are not actually formatted as dates in excel), you can -use the ``parse_dates`` keyword to parse those strings to datetimes: - -.. code-block:: python - - pd.read_excel("path_to_file.xls", "Sheet1", parse_dates=["date_strings"]) - - -Cell converters -+++++++++++++++ - -It is possible to transform the contents of Excel cells via the ``converters`` -option. For instance, to convert a column to boolean: - -.. code-block:: python - - pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyBools": bool}) - -This option handles missing values and treats exceptions in the converters -as missing data. Transformations are applied cell by cell rather than to the -column as a whole, so the array dtype is not guaranteed. For instance, a -column of integers with missing values cannot be transformed to an array -with integer dtype, because NaN is strictly a float. You can manually mask -missing data to recover integer dtype: - -.. 
code-block:: python - - def cfun(x): - return int(x) if x else -1 - - - pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun}) - -Dtype specifications -++++++++++++++++++++ - -As an alternative to converters, the type for an entire column can -be specified using the ``dtype`` keyword, which takes a dictionary -mapping column names to types. To interpret data with -no type inference, use the type ``str`` or ``object``. - -.. code-block:: python - - pd.read_excel("path_to_file.xls", dtype={"MyInts": "int64", "MyText": str}) - -.. _io.excel_writer: - -Writing Excel files -''''''''''''''''''' - -Writing Excel files to disk -+++++++++++++++++++++++++++ - -To write a ``DataFrame`` object to a sheet of an Excel file, you can use the -``to_excel`` instance method. The arguments are largely the same as ``to_csv`` -described above, the first argument being the name of the excel file, and the -optional second argument the name of the sheet to which the ``DataFrame`` should be -written. For example: - -.. code-block:: python - - df.to_excel("path_to_file.xlsx", sheet_name="Sheet1") - -Files with a -``.xlsx`` extension will be written using ``xlsxwriter`` (if available) or -``openpyxl``. - -The ``DataFrame`` will be written in a way that tries to mimic the REPL output. -The ``index_label`` will be placed in the second -row instead of the first. You can place it in the first row by setting the -``merge_cells`` option in ``to_excel()`` to ``False``: - -.. code-block:: python - - df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False) - -In order to write separate ``DataFrames`` to separate sheets in a single Excel file, -one can pass an :class:`~pandas.io.excel.ExcelWriter`. - -.. code-block:: python - - with pd.ExcelWriter("path_to_file.xlsx") as writer: - df1.to_excel(writer, sheet_name="Sheet1") - df2.to_excel(writer, sheet_name="Sheet2") - -.. 
_io.excel_writing_buffer: - -When using the ``engine_kwargs`` parameter, pandas will pass these arguments to the -engine. For this, it is important to know which function pandas is using internally. - -* For the engine openpyxl, pandas is using :func:`openpyxl.Workbook` to create a new sheet and :func:`openpyxl.load_workbook` to append data to an existing sheet. The openpyxl engine writes to (``.xlsx``) and (``.xlsm``) files. - -* For the engine xlsxwriter, pandas is using :func:`xlsxwriter.Workbook` to write to (``.xlsx``) files. - -* For the engine odf, pandas is using :func:`odf.opendocument.OpenDocumentSpreadsheet` to write to (``.ods``) files. - -Writing Excel files to memory -+++++++++++++++++++++++++++++ - -pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or -``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`. - -.. code-block:: python - - from io import BytesIO - - bio = BytesIO() - - # By setting the 'engine' in the ExcelWriter constructor. - writer = pd.ExcelWriter(bio, engine="xlsxwriter") - df.to_excel(writer, sheet_name="Sheet1") - - # Save the workbook and release the writer - writer.close() - - # Seek to the beginning and read to copy the workbook to a variable in memory - bio.seek(0) - workbook = bio.read() - -.. note:: - - ``engine`` is optional but recommended. Setting the engine determines - which writer library produces the workbook. ``xlrd`` is a reader only and - cannot be used as a writer engine. Using either ``'openpyxl'`` or - ``'xlsxwriter'`` will produce an Excel 2007-format workbook (xlsx). If - omitted, an Excel 2007-formatted workbook is produced. - - -.. _io.excel.writers: - -Excel writer engines -'''''''''''''''''''' - -pandas chooses an Excel writer via two methods: - -1. the ``engine`` keyword argument -2. the filename extension (via the default specified in config options) - -By default, pandas uses the `XlsxWriter`_ for ``.xlsx``, `openpyxl`_ -for ``.xlsm``. 
If you have multiple -engines installed, you can set the default engine through :ref:`setting the -config options ` ``io.excel.xlsx.writer`` and -``io.excel.xls.writer``. pandas will fall back on `openpyxl`_ for ``.xlsx`` -files if `Xlsxwriter`_ is not available. - -.. _XlsxWriter: https://xlsxwriter.readthedocs.io -.. _openpyxl: https://openpyxl.readthedocs.io/ - -To specify which writer you want to use, you can pass an engine keyword -argument to ``to_excel`` and to ``ExcelWriter``. The built-in engines are: - -* ``openpyxl``: version 2.4 or higher is required -* ``xlsxwriter`` - -.. code-block:: python - - # By setting the 'engine' in the DataFrame 'to_excel()' methods. - df.to_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="xlsxwriter") - - # By setting the 'engine' in the ExcelWriter constructor. - writer = pd.ExcelWriter("path_to_file.xlsx", engine="xlsxwriter") - - # Or via pandas configuration. - from pandas import options # noqa: E402 - - options.io.excel.xlsx.writer = "xlsxwriter" - - df.to_excel("path_to_file.xlsx", sheet_name="Sheet1") - -.. _io.excel.style: - -Style and formatting -'''''''''''''''''''' - -The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the ``DataFrame``'s ``to_excel`` method. - -* ``float_format`` : Format string for floating point numbers (default ``None``). -* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). - -.. note:: - - As of pandas 3.0, by default spreadsheets created with the ``to_excel`` method - will not contain any styling. Users wishing to bold text, add bordered styles, - etc in a worksheet output by ``to_excel`` can do so by using :meth:`Styler.to_excel` - to create styled excel files. For documentation on styling spreadsheets, see - `here `__. - - -.. 
code-block:: python - - css = "border: 1px solid black; font-weight: bold;" - df.style.map_index(lambda x: css).map_index(lambda x: css, axis=1).to_excel("myfile.xlsx") - -Using the `Xlsxwriter`_ engine provides many options for controlling the -format of an Excel worksheet created with the ``to_excel`` method. Excellent examples can be found in the -`Xlsxwriter`_ documentation here: https://xlsxwriter.readthedocs.io/working_with_pandas.html - -.. _io.ods: - -OpenDocument Spreadsheets -------------------------- - -The io methods for `Excel files`_ also support reading and writing OpenDocument spreadsheets -using the `odfpy `__ module. The semantics and features for reading and writing -OpenDocument spreadsheets match what can be done for `Excel files`_ using -``engine='odf'``. The optional dependency 'odfpy' needs to be installed. - -The :func:`~pandas.read_excel` method can read OpenDocument spreadsheets: - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.ods", engine="odf") - -Similarly, the :meth:`~DataFrame.to_excel` method can write OpenDocument spreadsheets: - -.. code-block:: python - - # Writes DataFrame to a .ods file - df.to_excel("path_to_file.ods", engine="odf") - -.. _io.xlsb: - -Binary Excel (.xlsb) files --------------------------- - -The :func:`~pandas.read_excel` method can also read binary Excel files -using the ``pyxlsb`` module. The semantics and features for reading -binary Excel files mostly match what can be done for `Excel files`_ using -``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types -in files and will return floats instead (you can use :ref:`calamine` -if you need to recognize datetime types). - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.xlsb", engine="pyxlsb") - -.. note:: - - Currently pandas only supports *reading* binary Excel files. Writing - is not implemented. - -.. 
_io.calamine: - -Calamine (Excel and ODS files) ------------------------------- - -The :func:`~pandas.read_excel` method can read Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``) -and OpenDocument spreadsheets (``.ods``) using the ``python-calamine`` module. -This module is a binding for the Rust library `calamine `__ -and is faster than other engines in most cases. The optional dependency 'python-calamine' needs to be installed. - -.. code-block:: python - - # Returns a DataFrame - pd.read_excel("path_to_file.xlsb", engine="calamine") - -.. _io.clipboard: - -Clipboard ---------- - -A handy way to grab data is to use the :func:`~pandas.read_clipboard` function, -which takes the contents of the clipboard buffer and passes them to the -``read_csv`` method. For instance, you can copy the following text to the -clipboard (CTRL-C on many operating systems): - -.. code-block:: console - - A B C - x 1 4 p - y 2 5 q - z 3 6 r - -And then import the data directly to a ``DataFrame`` by calling: - -.. code-block:: python - - >>> clipdf = pd.read_clipboard() - >>> clipdf - A B C - x 1 4 p - y 2 5 q - z 3 6 r - -The ``to_clipboard`` method can be used to write the contents of a ``DataFrame`` to -the clipboard. You can then paste the clipboard contents into other -applications (CTRL-V on many operating systems). Here we illustrate writing a -``DataFrame`` to the clipboard and reading it back. - -.. code-block:: python - - >>> df = pd.DataFrame( - ... {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"] - ... ) - - >>> df - A B C - x 1 4 p - y 2 5 q - z 3 6 r - >>> df.to_clipboard() - >>> pd.read_clipboard() - A B C - x 1 4 p - y 2 5 q - z 3 6 r - -We can see that we got the same content back, which we had earlier written to the clipboard. - -.. note:: - - You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods. - -.. 
_io.pickle: - -Pickling --------- - -All pandas objects are equipped with ``to_pickle`` methods which use Python's -``pickle`` module to save data structures to disk using the pickle format. - -.. ipython:: python - - df - df.to_pickle("foo.pkl") - -The ``read_pickle`` function in the ``pandas`` namespace can be used to load -any pickled pandas object (or any other pickled object) from file: - - -.. ipython:: python - - pd.read_pickle("foo.pkl") - -.. ipython:: python - :suppress: - - os.remove("foo.pkl") - -.. warning:: - - Loading pickled data received from untrusted sources can be unsafe. - - See: https://docs.python.org/3/library/pickle.html - -.. warning:: - - :func:`read_pickle` is only guaranteed to be backwards compatible back to a few minor releases. - -.. _io.pickle.compression: - -Compressed pickle files -''''''''''''''''''''''' - -:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read -and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, ``zstd`` are supported for reading and writing. -The ``zip`` file format only supports reading and must contain only one data file -to be read. - -The compression type can be an explicit parameter or be inferred from the file extension. -If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, ``zstd`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, -``'.xz'``, or ``'.zst'``, respectively. - -The compression parameter can also be a ``dict`` in order to pass options to the -compression protocol. It must have a ``'method'`` key set to the name -of the compression protocol, which must be one of -{``'zip'``, ``'gzip'``, ``'bz2'``, ``'xz'``, ``'zstd'``}. All other key-value pairs are passed to -the underlying compression library. - -.. ipython:: python - - df = pd.DataFrame( - { - "A": np.random.randn(1000), - "B": "foo", - "C": pd.date_range("20130101", periods=1000, freq="s"), - } - ) - df - -Using an explicit compression type: - -.. 
ipython:: python - - df.to_pickle("data.pkl.compress", compression="gzip") - rt = pd.read_pickle("data.pkl.compress", compression="gzip") - rt - -Inferring compression type from the extension: - -.. ipython:: python - - df.to_pickle("data.pkl.xz", compression="infer") - rt = pd.read_pickle("data.pkl.xz", compression="infer") - rt - -The default is to 'infer': - -.. ipython:: python - - df.to_pickle("data.pkl.gz") - rt = pd.read_pickle("data.pkl.gz") - rt - - df["A"].to_pickle("s1.pkl.bz2") - rt = pd.read_pickle("s1.pkl.bz2") - rt - -Passing options to the compression protocol in order to speed up compression: - -.. ipython:: python - - df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1}) - -.. ipython:: python - :suppress: - - os.remove("data.pkl.compress") - os.remove("data.pkl.xz") - os.remove("data.pkl.gz") - os.remove("s1.pkl.bz2") - -.. _io.msgpack: - -msgpack -------- - -pandas support for ``msgpack`` has been removed in version 1.0.0. It is -recommended to use :ref:`pickle ` instead. - -Alternatively, you can also use the Arrow IPC serialization format for on-the-wire -transmission of pandas objects. For documentation on pyarrow, see -`here `__. - - -.. _io.hdf5: - -HDF5 (PyTables) ---------------- - -``HDFStore`` is a dict-like object which reads and writes pandas objects in -the high-performance HDF5 format using the excellent `PyTables -`__ library. See the :ref:`cookbook ` -for some advanced strategies. - -.. warning:: - - pandas uses PyTables for reading and writing HDF5 files, which allows - serializing object-dtype data with pickle. Loading pickled data received from - untrusted sources can be unsafe. - - See: https://docs.python.org/3/library/pickle.html for more. - -.. ipython:: python - :suppress: - :okexcept: - - os.remove("store.h5") - -.. ipython:: python - - store = pd.HDFStore("store.h5") - print(store) - -Objects can be written to the file just like adding key-value pairs to a -dict: - -.. 
ipython:: python - - index = pd.date_range("1/1/2000", periods=8) - s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"]) - df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"]) - - # store.put('s', s) is an equivalent method - store["s"] = s - - store["df"] = df - - store - -In a current or later Python session, you can retrieve stored objects: - -.. ipython:: python - - # store.get('df') is an equivalent method - store["df"] - - # dotted (attribute) access provides get as well - store.df - -Deletion of the object specified by the key: - -.. ipython:: python - - # store.remove('df') is an equivalent method - del store["df"] - - store - -Closing a Store and using a context manager: - -.. ipython:: python - - store.close() - store - store.is_open - - # Working with, and automatically closing the store using a context manager - with pd.HDFStore("store.h5") as store: - store.keys() - -.. ipython:: python - :suppress: - - store.close() - os.remove("store.h5") - - - -Read/write API -'''''''''''''' - -``HDFStore`` supports a top-level API using ``read_hdf`` for reading and ``to_hdf`` for writing, -similar to how ``read_csv`` and ``to_csv`` work. - -.. ipython:: python - - df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))}) - df_tl.to_hdf("store_tl.h5", key="table", append=True) - pd.read_hdf("store_tl.h5", "table", where=["index>2"]) - -.. ipython:: python - :suppress: - :okexcept: - - os.remove("store_tl.h5") - - -HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting ``dropna=True``. - - -.. 
ipython:: python - - df_with_missing = pd.DataFrame( - { - "col1": [0, np.nan, 2], - "col2": [1, np.nan, np.nan], - } - ) - df_with_missing - - df_with_missing.to_hdf("file.h5", key="df_with_missing", format="table", mode="w") - - pd.read_hdf("file.h5", "df_with_missing") - - df_with_missing.to_hdf( - "file.h5", key="df_with_missing", format="table", mode="w", dropna=True - ) - pd.read_hdf("file.h5", "df_with_missing") - - -.. ipython:: python - :suppress: - - os.remove("file.h5") - - -.. _io.hdf5-fixed: - -Fixed format -'''''''''''' - -The examples above show storing using ``put``, which writes the data to HDF5 via ``PyTables`` in a fixed array format, called -the ``fixed`` format. These types of stores are **not** appendable once written (though you can simply -remove them and rewrite). Nor are they **queryable**; they must be -retrieved in their entirety. They also do not support dataframes with non-unique column names. -The ``fixed`` format stores offer very fast writing and slightly faster reading than ``table`` stores. -This format is specified by default when using ``put`` or ``to_hdf`` or by ``format='fixed'`` or ``format='f'``. - -.. warning:: - - A ``fixed`` format will raise a ``TypeError`` if you try to retrieve using a ``where``: - - .. ipython:: python - :okexcept: - - pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", key="df") - pd.read_hdf("test_fixed.h5", "df", where="index>5") - - .. ipython:: python - :suppress: - - os.remove("test_fixed.h5") - - -.. _io.hdf5-table: - -Table format -'''''''''''' - -``HDFStore`` supports another ``PyTables`` format on disk, the ``table`` -format. Conceptually a ``table`` is shaped very much like a DataFrame, -with rows and columns. A ``table`` may be appended to in the same or -other sessions. In addition, delete and query type operations are -supported. This format is specified by ``format='table'`` or ``format='t'`` -to ``append`` or ``put`` or ``to_hdf``. 
- -This format can be set as an option as well ``pd.set_option('io.hdf.default_format','table')`` to -enable ``put/append/to_hdf`` to by default store in the ``table`` format. - -.. ipython:: python - :suppress: - :okexcept: - - os.remove("store.h5") - -.. ipython:: python - - store = pd.HDFStore("store.h5") - df1 = df[0:4] - df2 = df[4:] - - # append data (creates a table automatically) - store.append("df", df1) - store.append("df", df2) - store - - # select the entire object - store.select("df") - - # the type of stored data - store.root.df._v_attrs.pandas_type - -.. note:: - - You can also create a ``table`` by passing ``format='table'`` or ``format='t'`` to a ``put`` operation. - -.. _io.hdf5-keys: - -Hierarchical keys -''''''''''''''''' - -Keys to a store can be specified as a string. These can be in a -hierarchical path-name like format (e.g. ``foo/bar/bah``), which will -generate a hierarchy of sub-stores (or ``Groups`` in PyTables -parlance). Keys can be specified without the leading '/' and are **always** -absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove -everything in the sub-store and **below**, so be *careful*. - -.. ipython:: python - - store.put("foo/bar/bah", df) - store.append("food/orange", df) - store.append("food/apple", df) - store - - # a list of keys are returned - store.keys() - - # remove all nodes under this level - store.remove("food") - store - - -You can walk through the group hierarchy using the ``walk`` method which -will yield a tuple for each group key along with the relative keys of its contents. - -.. ipython:: python - - for (path, subgroups, subkeys) in store.walk(): - for subgroup in subgroups: - print("GROUP: {}/{}".format(path, subgroup)) - for subkey in subkeys: - key = "/".join([path, subkey]) - print("KEY: {}".format(key)) - print(store.get(key)) - - - -.. warning:: - - Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored under the root node. - - .. 
ipython:: python - :okexcept: - - store.foo.bar.bah - - .. ipython:: python - - # you can directly access the actual PyTables node by using the root node - store.root.foo.bar.bah - - Instead, use explicit string based keys: - - .. ipython:: python - - store["foo/bar/bah"] - - -.. _io.hdf5-types: - -Storing types -''''''''''''' - -Storing mixed types in a table -++++++++++++++++++++++++++++++ - -Storing mixed-dtype data is supported. Strings are stored as -fixed-width using the maximum size of the appended column. Subsequent attempts -at appending longer strings will raise a ``ValueError``. - -Passing ``min_itemsize={'values': size}`` as a parameter to append -will set a larger minimum for the string columns. Storing ``floats, -strings, ints, bools, datetime64`` is currently supported. For string -columns, passing ``nan_rep = 'nan'`` to append will change the default -nan representation on disk (which converts to/from ``np.nan``); this -defaults to ``nan``. - -.. ipython:: python - - df_mixed = pd.DataFrame( - { - "A": np.random.randn(8), - "B": np.random.randn(8), - "C": np.array(np.random.randn(8), dtype="float32"), - "string": "string", - "int": 1, - "bool": True, - "datetime64": pd.Timestamp("20010102"), - }, - index=list(range(8)), - ) - df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan - - store.append("df_mixed", df_mixed, min_itemsize={"values": 50}) - df_mixed1 = store.select("df_mixed") - df_mixed1 - df_mixed1.dtypes.value_counts() - - # we have provided a minimum string column size - store.root.df_mixed.table - -Storing MultiIndex DataFrames -+++++++++++++++++++++++++++++ - -Storing MultiIndex ``DataFrames`` as tables is very similar to -storing/selecting from homogeneous index ``DataFrames``. - -.. 
ipython:: python - - index = pd.MultiIndex( - levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], - codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], - names=["foo", "bar"], - ) - df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"]) - df_mi - - store.append("df_mi", df_mi) - store.select("df_mi") - - # the levels are automatically included as data columns - store.select("df_mi", "foo=bar") - -.. note:: - The ``index`` keyword is reserved and cannot be used as a level name. - -.. _io.hdf5-query: - -Querying -'''''''' - -Querying a table -++++++++++++++++ - -``select`` and ``delete`` operations have an optional criterion that can -be specified to select/delete only a subset of the data. This allows one -to have a very large on-disk table and retrieve only a portion of the -data. - -A query is specified using the ``Term`` class under the hood, as a boolean expression. - -* ``index`` and ``columns`` are supported indexers of ``DataFrames``. -* if ``data_columns`` are specified, these can be used as additional indexers. -* level name in a MultiIndex, with default name ``level_0``, ``level_1``, … if not provided. - -Valid comparison operators are: - -``=, ==, !=, >, >=, <, <=`` - -Valid boolean expressions are combined with: - -* ``|`` : or -* ``&`` : and -* ``(`` and ``)`` : for grouping - -These rules are similar to how boolean expressions are used in pandas for indexing. - -.. 
note:: - - - ``=`` will be automatically expanded to the comparison operator ``==`` - - ``~`` is the not operator, but can only be used in very limited - circumstances - - If a list/tuple of expressions is passed they will be combined via ``&`` - -The following are valid expressions: - -* ``'index >= date'`` -* ``"columns = ['A', 'D']"`` -* ``"columns in ['A', 'D']"`` -* ``'columns = A'`` -* ``'columns == A'`` -* ``"~(columns = ['A', 'B'])"`` -* ``'index > df.index[3] & string = "bar"'`` -* ``'(index > df.index[3] & index <= df.index[6]) | string = "bar"'`` -* ``"ts >= Timestamp('2012-02-01')"`` -* ``"major_axis>=20130101"`` - -The ``indexers`` are on the left-hand side of the sub-expression: - -``columns``, ``major_axis``, ``ts`` - -The right-hand side of the sub-expression (after a comparison operator) can be: - -* functions that will be evaluated, e.g. ``Timestamp('2012-02-01')`` -* strings, e.g. ``"bar"`` -* date-like, e.g. ``20130101``, or ``"20130101"`` -* lists, e.g. ``"['A', 'B']"`` -* variables that are defined in the local namespace, e.g. ``date`` - -.. note:: - - Passing a string to a query by interpolating it into the query - expression is not recommended. Simply assign the string of interest to a - variable and use that variable in an expression. For example, do this - - .. code-block:: python - - string = "HolyMoly'" - store.select("df", "index == string") - - instead of this - - .. code-block:: python - - string = "HolyMoly'" - store.select('df', f'index == {string}') - - The latter will **not** work and will raise a ``SyntaxError``. Note that - there's a single quote followed by a double quote in the ``string`` - variable. - - If you *must* interpolate, use the ``'%r'`` format specifier - - .. code-block:: python - - store.select("df", "index == %r" % string) - - which will quote ``string``. - - -Here are some examples: - -.. 
ipython:: python - - dfq = pd.DataFrame( - np.random.randn(10, 4), - columns=list("ABCD"), - index=pd.date_range("20130101", periods=10), - ) - store.append("dfq", dfq, format="table", data_columns=True) - -Use boolean expressions, with in-line function evaluation. - -.. ipython:: python - - store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']") - -Use inline column reference. - -.. ipython:: python - - store.select("dfq", where="A>0 or C>0") - -The ``columns`` keyword can be supplied to select a list of columns to be -returned; this is equivalent to passing a -``'columns=list_of_columns_to_filter'``: - -.. ipython:: python - - store.select("df", "columns=['A', 'B']") - -``start`` and ``stop`` parameters can be specified to limit the total search -space. These are in terms of the total number of rows in a table. - -.. note:: - - ``select`` will raise a ``ValueError`` if the query expression has an unknown - variable reference. Usually this means that you are trying to select on a column - that is **not** a data_column. - - ``select`` will raise a ``SyntaxError`` if the query expression is not valid. - - -.. _io.hdf5-timedelta: - -Query timedelta64[ns] -+++++++++++++++++++++ - -You can store and query using the ``timedelta64[ns]`` type. Terms can be -specified in the format: ``<float>(<unit>)``, where float may be signed (and fractional), and unit can be -``D,s,ms,us,ns`` for the timedelta. Here's an example: - -.. ipython:: python - - from datetime import timedelta - - dftd = pd.DataFrame( - { - "A": pd.Timestamp("20130101"), - "B": [ - pd.Timestamp("20130101") + timedelta(days=i, seconds=10) - for i in range(10) - ], - } - ) - dftd["C"] = dftd["A"] - dftd["B"] - dftd - store.append("dftd", dftd, data_columns=True) - store.select("dftd", "C<'-3.5D'") - -.. _io.query_multi: - -Query MultiIndex -++++++++++++++++ - -Selecting from a ``MultiIndex`` can be achieved by using the name of the level. - -.. 
ipython:: python - - df_mi.index.names - store.select("df_mi", "foo=baz and bar=two") - -If the ``MultiIndex`` level names are ``None``, the levels are automatically made available via -the ``level_n`` keyword with ``n`` the level of the ``MultiIndex`` you want to select from. - -.. ipython:: python - - index = pd.MultiIndex( - levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], - codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], - ) - df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"]) - df_mi_2 - - store.append("df_mi_2", df_mi_2) - - # the levels are automatically included as data columns with keyword level_n - store.select("df_mi_2", "level_0=foo and level_1=two") - - -Indexing -++++++++ - -You can create/modify an index for a table with ``create_table_index`` -after data is already in the table (after an ``append/put`` -operation). Creating a table index is **highly** encouraged. This will -speed up your queries a great deal when you use a ``select`` with the -indexed dimension as the ``where``. - -.. note:: - - Indexes are automagically created on the indexables - and any data columns you specify. This behavior can be turned off by passing - ``index=False`` to ``append``. - -.. ipython:: python - - # we have automagically already created an index (in the first section) - i = store.root.df.table.cols.index.index - i.optlevel, i.kind - - # change an index by passing new parameters - store.create_table_index("df", optlevel=9, kind="full") - i = store.root.df.table.cols.index.index - i.optlevel, i.kind - -Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate at the end. - -.. 
ipython:: python - - df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB")) - df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB")) - - st = pd.HDFStore("appends.h5", mode="w") - st.append("df", df_1, data_columns=["B"], index=False) - st.append("df", df_2, data_columns=["B"], index=False) - st.get_storer("df").table - -Then create the index when finished appending. - -.. ipython:: python - - st.create_table_index("df", columns=["B"], optlevel=9, kind="full") - st.get_storer("df").table - - st.close() - -.. ipython:: python - :suppress: - :okexcept: - - os.remove("appends.h5") - -See `here `__ for how to create a completely-sorted-index (CSI) on an existing store. - -.. _io.hdf5-query-data-columns: - -Query via data columns -++++++++++++++++++++++ - -You can designate (and index) certain columns that you want to be able -to perform queries (other than the ``indexable`` columns, which you can -always query). For instance say you want to perform this common -operation, on-disk, and return just the frame that matches this -query. You can specify ``data_columns = True`` to force all columns to -be ``data_columns``. - -.. 
ipython:: python - - df_dc = df.copy() - df_dc["string"] = "foo" - df_dc.loc[df_dc.index[4:6], "string"] = np.nan - df_dc.loc[df_dc.index[7:9], "string"] = "bar" - df_dc["string2"] = "cool" - df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0 - df_dc - - # on-disk operations - store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"]) - store.select("df_dc", where="B > 0") - - # getting creative - store.select("df_dc", "B > 0 & C > 0 & string == foo") - - # this is in-memory version of this type of selection - df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")] - - # we have automagically created this index and the B/C/string/string2 - # columns are stored separately as ``PyTables`` columns - store.root.df_dc.table - -There is some performance degradation by making lots of columns into -``data columns``, so it is up to the user to designate these. In addition, -you cannot change data columns (nor indexables) after the first -append/put operation (Of course you can simply read in the data and -create a new table!). - -Iterator -++++++++ - -You can pass ``iterator=True`` or ``chunksize=number_in_a_chunk`` -to ``select`` and ``select_as_multiple`` to return an iterator on the results. -The default is 50,000 rows returned in a chunk. - -.. ipython:: python - - for df in store.select("df", chunksize=3): - print(df) - -.. note:: - - You can also use the iterator with ``read_hdf`` which will open, then - automatically close the store when finished iterating. - - .. code-block:: python - - for df in pd.read_hdf("store.h5", "df", chunksize=3): - print(df) - -Note, that the chunksize keyword applies to the **source** rows. So if you -are doing a query, then the chunksize will subdivide the total rows in the table -and the query applied, returning an iterator on potentially unequal sized chunks. - -Here is a recipe for generating a query and using it to create equal sized return -chunks. - -.. 
ipython:: python - - dfeq = pd.DataFrame({"number": np.arange(1, 11)}) - dfeq - - store.append("dfeq", dfeq, data_columns=["number"]) - - def chunks(l, n): - return [l[i: i + n] for i in range(0, len(l), n)] - - evens = [2, 4, 6, 8, 10] - coordinates = store.select_as_coordinates("dfeq", "number=evens") - for c in chunks(coordinates, 2): - print(store.select("dfeq", where=c)) - -Advanced queries -++++++++++++++++ - -Select a single column -^^^^^^^^^^^^^^^^^^^^^^ - -To retrieve a single indexable or data column, use the -method ``select_column``. This will, for example, enable you to get the index -very quickly. These return a ``Series`` of the result, indexed by the row number. -These do not currently accept the ``where`` selector. - -.. ipython:: python - - store.select_column("df_dc", "index") - store.select_column("df_dc", "string") - -.. _io.hdf5-selecting_coordinates: - -Selecting coordinates -^^^^^^^^^^^^^^^^^^^^^ - -Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an -``Index`` of the resulting locations. These coordinates can also be passed to subsequent -``where`` operations. - -.. ipython:: python - - df_coord = pd.DataFrame( - np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000) - ) - store.append("df_coord", df_coord) - c = store.select_as_coordinates("df_coord", "index > 20020101") - c - store.select("df_coord", where=c) - -.. _io.hdf5-where_mask: - -Selecting using a where mask -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Sometimes your query can involve creating a list of rows to select. Usually this ``mask`` would -be a resulting ``index`` from an indexing operation. This example selects the rows of -a ``DatetimeIndex`` whose month is 5. - -.. 
ipython:: python - - df_mask = pd.DataFrame( - np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000) - ) - store.append("df_mask", df_mask) - c = store.select_column("df_mask", "index") - where = c[pd.DatetimeIndex(c).month == 5].index - store.select("df_mask", where=where) - -Storer object -^^^^^^^^^^^^^ - -If you want to inspect the stored object, retrieve via -``get_storer``. You could use this programmatically to say get the number -of rows in an object. - -.. ipython:: python - - store.get_storer("df_dc").nrows - - -Multiple table queries -++++++++++++++++++++++ - -The methods ``append_to_multiple`` and -``select_as_multiple`` can perform appending/selecting from -multiple tables at once. The idea is to have one table (call it the -selector table) that you index most/all of the columns, and perform your -queries. The other table(s) are data tables with an index matching the -selector table's index. You can then perform a very fast query -on the selector table, yet get lots of data back. This method is similar to -having a very wide table, but enables more efficient queries. - -The ``append_to_multiple`` method splits a given single DataFrame -into multiple tables according to ``d``, a dictionary that maps the -table names to a list of 'columns' you want in that table. If ``None`` -is used in place of a list, that table will have the remaining -unspecified columns of the given DataFrame. The argument ``selector`` -defines which table is the selector table (which you can make queries from). -The argument ``dropna`` will drop rows from the input ``DataFrame`` to ensure -tables are synchronized. This means that if a row for one of the tables -being written to is entirely ``np.nan``, that row will be dropped from all tables. - -If ``dropna`` is False, **THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES**. 
-Remember that entirely ``np.nan`` rows are not written to the HDFStore, so if -you choose to pass ``dropna=False``, some tables may have more rows than others, -and therefore ``select_as_multiple`` may not work or it may return unexpected -results. - -.. ipython:: python - - df_mt = pd.DataFrame( - np.random.randn(8, 6), - index=pd.date_range("1/1/2000", periods=8), - columns=["A", "B", "C", "D", "E", "F"], - ) - df_mt["foo"] = "bar" - df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan - - # you can also create the tables individually - store.append_to_multiple( - {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt" - ) - store - - # individual tables were created - store.select("df1_mt") - store.select("df2_mt") - - # as a multiple - store.select_as_multiple( - ["df1_mt", "df2_mt"], - where=["A>0", "B>0"], - selector="df1_mt", - ) - - -Delete from a table -''''''''''''''''''' - -You can delete from a table selectively by specifying a ``where``. In -deleting rows, it is important to understand that ``PyTables`` deletes -rows by erasing the rows, then **moving** the following data. Thus -deleting can potentially be a very expensive operation depending on the -orientation of your data. To get optimal performance, it's -worthwhile to have the dimension you are deleting be the first of the -``indexables``. - -Data is ordered (on the disk) in terms of the ``indexables``. Here's a -simple use case. You store panel-type data, with dates in the -``major_axis`` and ids in the ``minor_axis``. The data is then -interleaved like this: - -* date_1 - * id_1 - * id_2 - * . - * id_n -* date_2 - * id_1 - * . - * id_n - -It should be clear that a delete operation on the ``major_axis`` will be -fairly quick, as one chunk is removed, then the following data moved. On -the other hand a delete operation on the ``minor_axis`` will be very -expensive. In this case it would almost certainly be faster to rewrite -the table using a ``where`` that selects all but the missing data. 
- -.. warning:: - - Please note that HDF5 **DOES NOT RECLAIM SPACE** in the h5 files - automatically. Thus, repeatedly deleting (or removing nodes) and adding - again, **WILL TEND TO INCREASE THE FILE SIZE**. - - To *repack and clean* the file, use :ref:`ptrepack `. - -.. _io.hdf5-notes: - -Notes & caveats -''''''''''''''' - - -Compression -+++++++++++ - -``PyTables`` allows the stored data to be compressed. This applies to -all kinds of stores, not just tables. Two parameters are used to -control compression: ``complevel`` and ``complib``. - -* ``complevel`` specifies if and how hard data is to be compressed. - ``complevel=0`` and ``complevel=None`` disable compression and - ``0<complevel<10`` enables compression. - -* ``complib`` specifies which compression library to use. - If nothing is specified, the default library ``zlib`` is used. The - supported compression libraries are: - - - `zlib `_: The default compression library. - A classic in terms of compression, achieves good compression - rates but is somewhat slow. - - `lzo `_: Fast - compression and decompression. - - `bzip2 `_: Good compression rates. - - `blosc `_: Fast compression and - decompression. - - Support for alternative blosc compressors: - - - `blosc:blosclz `_ This is the - default compressor for ``blosc`` - - `blosc:lz4 - `_: - A compact, very popular and fast compressor. - - `blosc:lz4hc - `_: - A tweaked version of LZ4, produces better - compression ratios at the expense of speed. - - `blosc:snappy `_: - A popular compressor used in many places. - - `blosc:zlib `_: A classic; - somewhat slower than the previous ones, but - achieving better compression ratios. - - `blosc:zstd `_: An - extremely well balanced codec; it provides the best - compression ratios among the others above, and at - reasonably fast speed. - - If ``complib`` is defined as something other than the listed libraries a - ``ValueError`` exception is issued. - -.. note:: - - If the library specified with the ``complib`` option is missing on your platform, - compression defaults to ``zlib`` without further ado. - -Enable compression for all objects within the file: - -.. 
code-block:: python - - store_compressed = pd.HDFStore( - "store_compressed.h5", complevel=9, complib="blosc:blosclz" - ) - -Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled: - -.. code-block:: python - - store.append("df", df, complib="zlib", complevel=5) - -.. _io.hdf5-ptrepack: - -ptrepack -++++++++ - -``PyTables`` offers better write performance when tables are compressed after -they are written, as opposed to turning on compression at the very -beginning. You can use the supplied ``PyTables`` utility -``ptrepack``. In addition, ``ptrepack`` can change compression levels -after the fact. - -.. code-block:: console - - ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5 - -Furthermore ``ptrepack in.h5 out.h5`` will *repack* the file to allow -you to reuse previously deleted space. Alternatively, one can simply -remove the file and write again, or use the ``copy`` method. - -.. _io.hdf5-caveats: - -Caveats -+++++++ - -.. warning:: - - ``HDFStore`` is **not-threadsafe for writing**. The underlying - ``PyTables`` only supports concurrent reads (via threading or - processes). If you need reading and writing *at the same time*, you - need to serialize these operations in a single thread in a single - process. You will corrupt your data otherwise. See the (:issue:`2397`) for more information. - -* If you use locks to manage write access between multiple processes, you - may want to use :py:func:`~os.fsync` before releasing write locks. For - convenience you can use ``store.flush(fsync=True)`` to do this for you. -* Once a ``table`` is created columns (DataFrame) - are fixed; only exactly the same columns can be appended -* Be aware that timezones (e.g., ``zoneinfo.ZoneInfo('US/Eastern')``) - are not necessarily equal across timezone versions. 
So if data is - localized to a specific timezone in the HDFStore using one version - of a timezone library and that data is updated with another version, the data - will be converted to UTC since these timezones are not considered - equal. Either use the same version of timezone library or use ``tz_convert`` with - the updated timezone definition. - -.. warning:: - - ``PyTables`` will show a ``NaturalNameWarning`` if a column name - cannot be used as an attribute selector. - *Natural* identifiers contain only letters, numbers, and underscores, - and may not begin with a number. - Other identifiers cannot be used in a ``where`` clause - and are generally a bad idea. - -.. _io.hdf5-data_types: - -DataTypes -''''''''' - -``HDFStore`` will map an object dtype to the ``PyTables`` underlying -dtype. This means the following types are known to work: - -====================================================== ========================= -Type Represents missing values -====================================================== ========================= -floating : ``float64, float32, float16`` ``np.nan`` -integer : ``int64, int32, int8, uint64,uint32, uint8`` -boolean -``datetime64[ns]`` ``NaT`` -``timedelta64[ns]`` ``NaT`` -categorical : see the section below -object : ``strings`` ``np.nan`` -====================================================== ========================= - -``unicode`` columns are not supported, and **WILL FAIL**. - -.. _io.hdf5-categorical: - -Categorical data -++++++++++++++++ - -You can write data that contains ``category`` dtypes to a ``HDFStore``. -Queries work the same as if it was an object array. However, the ``category`` dtyped data is -stored in a more efficient manner. - -.. 
ipython:: python - - dfcat = pd.DataFrame( - {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)} - ) - dfcat - dfcat.dtypes - cstore = pd.HDFStore("cats.h5", mode="w") - cstore.append("dfcat", dfcat, format="table", data_columns=["A"]) - result = cstore.select("dfcat", where="A in ['b', 'c']") - result - result.dtypes - -.. ipython:: python - :suppress: - :okexcept: - - cstore.close() - os.remove("cats.h5") - - -String columns -++++++++++++++ - -**min_itemsize** - -The underlying implementation of ``HDFStore`` uses a fixed column width (itemsize) for string columns. -A string column itemsize is calculated as the maximum of the -length of data (for that column) that is passed to the ``HDFStore``, **in the first append**. Subsequent appends, -may introduce a string for a column **larger** than the column can hold, an Exception will be raised (otherwise you -could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and -allow a user-specified truncation to occur. - -Pass ``min_itemsize`` on the first table creation to a-priori specify the minimum length of a particular string column. -``min_itemsize`` can be an integer, or a dict mapping a column name to an integer. You can pass ``values`` as a key to -allow all *indexables* or *data_columns* to have this min_itemsize. - -Passing a ``min_itemsize`` dict will cause all passed columns to be created as *data_columns* automatically. - -.. note:: - - If you are not passing any ``data_columns``, then the ``min_itemsize`` will be the maximum of the length of any string passed - -.. 
ipython:: python - - dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5))) - dfs - - # A and B have a size of 30 - store.append("dfs", dfs, min_itemsize=30) - store.get_storer("dfs").table - - # A is created as a data_column with a size of 30 - # B's size is calculated - store.append("dfs2", dfs, min_itemsize={"A": 30}) - store.get_storer("dfs2").table - -**nan_rep** - -String columns will serialize a ``np.nan`` (a missing value) with the ``nan_rep`` string representation. This defaults to the string value ``nan``. -As a result, a column that contains the literal string ``nan`` could inadvertently be read back as a missing value. - -.. ipython:: python - - dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]}) - dfss - - store.append("dfss", dfss) - store.select("dfss") - - # here you need to specify a different nan rep - store.append("dfss2", dfss, nan_rep="_nan_") - store.select("dfss2") - - -Performance -''''''''''' - -* The ``table`` format comes with a writing performance penalty as compared to - ``fixed`` stores. The benefit is the ability to append/delete and - query (potentially very large amounts of data). Write times are - generally longer as compared with regular stores. Query times can - be quite fast, especially on an indexed axis. -* You can pass ``chunksize=`` to ``append``, specifying the - write chunksize (default is 50000). This will significantly lower - your memory usage on writing. -* You can pass ``expectedrows=`` to the first ``append``, - to set the TOTAL number of rows that ``PyTables`` will expect. - This will optimize read/write performance. -* Duplicate rows can be written to tables, but are filtered out in - selection (with the last items being selected; thus a table is - unique on major, minor pairs) -* A ``PerformanceWarning`` will be raised if you are attempting to - store types that will be pickled by PyTables (rather than stored as - endemic types). See - `Here `__ - for more information and some solutions. - - -.. 
ipython:: python - :suppress: - - store.close() - os.remove("store.h5") - - -.. _io.feather: - -Feather -------- - -Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data -frames efficient, and to make sharing data across data analysis languages easy. - -Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas -dtypes, including extension dtypes such as categorical and datetime with tz. - -Several caveats: - -* The format will NOT write an ``Index``, or ``MultiIndex`` for the - ``DataFrame`` and will raise an error if a non-default one is provided. You - can ``.reset_index()`` to store the index or ``.reset_index(drop=True)`` to - ignore it. -* Duplicate column names and non-string columns names are not supported -* Actual Python objects in object dtype columns are not supported. These will - raise a helpful error message on an attempt at serialization. - -See the `Full Documentation `__. - -.. ipython:: python - - import pytz - - df = pd.DataFrame( - { - "a": list("abc"), - "b": list(range(1, 4)), - "c": np.arange(3, 6).astype("u1"), - "d": np.arange(4.0, 7.0, dtype="float64"), - "e": [True, False, True], - "f": pd.Categorical(list("abc")), - "g": pd.date_range("20130101", periods=3), - "h": pd.date_range("20130101", periods=3, tz=pytz.timezone("US/Eastern")), - "i": pd.date_range("20130101", periods=3, freq="ns"), - } - ) - - df - df.dtypes - -Write to a feather file. - -.. ipython:: python - :okwarning: - - df.to_feather("example.feather") - -Read from a feather file. - -.. ipython:: python - :okwarning: - - result = pd.read_feather("example.feather") - result - - # we preserve dtypes - result.dtypes - -.. ipython:: python - :suppress: - - os.remove("example.feather") - - -.. _io.parquet: - -Parquet -------- - -`Apache Parquet `__ provides a partitioned binary columnar serialization for data frames. 
It is designed to -make reading and writing data frames efficient, and to make sharing data across data analysis -languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible -while still maintaining good read performance. - -Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas -dtypes, including extension dtypes such as datetime with tz. - -Several caveats. - -* Duplicate column names and non-string columns names are not supported. -* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default - indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can - force including or omitting indexes with the ``index`` argument, regardless of the underlying engine. -* Index level names, if specified, must be strings. -* In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype. -* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag. -* Non supported types include ``Interval`` and actual Python object types. These will raise a helpful error message - on an attempt at serialization. ``Period`` type is supported with pyarrow >= 0.16.0. -* The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data - type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols, - see the :ref:`extension types documentation `). - -You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``. -If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``, -then ``pyarrow`` is tried, and falling back to ``fastparquet``. 
- -See the documentation for `pyarrow `__ and `fastparquet `__. - -.. note:: - - These engines are very similar and should read/write nearly identical parquet format files. - ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes. - These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library). - -.. ipython:: python - - df = pd.DataFrame( - { - "a": list("abc"), - "b": list(range(1, 4)), - "c": np.arange(3, 6).astype("u1"), - "d": np.arange(4.0, 7.0, dtype="float64"), - "e": [True, False, True], - "f": pd.date_range("20130101", periods=3), - "g": pd.date_range("20130101", periods=3, tz="US/Eastern"), - "h": pd.Categorical(list("abc")), - "i": pd.Categorical(list("abc"), ordered=True), - } - ) - - df - df.dtypes - -Write to a parquet file. - -.. ipython:: python - - df.to_parquet("example_pa.parquet", engine="pyarrow") - df.to_parquet("example_fp.parquet", engine="fastparquet") - -Read from a parquet file. - -.. ipython:: python - - result = pd.read_parquet("example_fp.parquet", engine="fastparquet") - result = pd.read_parquet("example_pa.parquet", engine="pyarrow") - - result.dtypes - -By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame. - -.. ipython:: python - - result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow") - - result.dtypes - -.. note:: - - Note that this is not supported for ``fastparquet``. - - -Read only certain columns of a parquet file. - -.. ipython:: python - - result = pd.read_parquet( - "example_fp.parquet", - engine="fastparquet", - columns=["a", "b"], - ) - result = pd.read_parquet( - "example_pa.parquet", - engine="pyarrow", - columns=["a", "b"], - ) - result.dtypes - - -.. 
ipython:: python - :suppress: - - os.remove("example_pa.parquet") - os.remove("example_fp.parquet") - - -Handling indexes -'''''''''''''''' - -Serializing a ``DataFrame`` to parquet may include the implicit index as one or -more columns in the output file. Thus, this code: - -.. ipython:: python - - df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}) - df.to_parquet("test.parquet", engine="pyarrow") - -creates a parquet file with *three* columns if you use ``pyarrow`` for serialization: -``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the -index `may or may not `_ -be written to the file. - -This unexpected extra column causes some databases like Amazon Redshift to reject -the file, because that column doesn't exist in the target table. - -If you want to omit a dataframe's indexes when writing, pass ``index=False`` to -:func:`~pandas.DataFrame.to_parquet`: - -.. ipython:: python - - df.to_parquet("test.parquet", index=False) - -This creates a parquet file with just the two expected columns, ``a`` and ``b``. -If your ``DataFrame`` has a custom index, you won't get it back when you load -this file into a ``DataFrame``. - -Passing ``index=True`` will *always* write the index, even if that's not the -underlying engine's default behavior. - -.. ipython:: python - :suppress: - - os.remove("test.parquet") - - -Partitioning Parquet files -'''''''''''''''''''''''''' - -Parquet supports partitioning of data based on the values of one or more columns. - -.. ipython:: python - - df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}) - df.to_parquet(path="test", engine="pyarrow", partition_cols=["a"], compression=None) - -The ``path`` specifies the parent directory to which data will be saved. -The ``partition_cols`` are the column names by which the dataset will be partitioned. -Columns are partitioned in the order they are given. The partition splits are -determined by the unique values in the partition columns. 
-The above example creates a partitioned dataset that may look like: - -.. code-block:: text - - test - ├── a=0 - │ ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet - │ └── ... - └── a=1 - ├── e6ab24a4f45147b49b54a662f0c412a3.parquet - └── ... - -.. ipython:: python - :suppress: - - from shutil import rmtree - - try: - rmtree("test") - except OSError: - pass - -.. _io.orc: - -ORC ---- - -Similar to the :ref:`parquet ` format, the `ORC Format `__ is a binary columnar serialization -for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the -ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow `__ library. - -.. warning:: - - * It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow. - * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0. - * :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies `. - * For supported dtypes please refer to `supported ORC features in Arrow `__. - * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files. - -.. ipython:: python - - df = pd.DataFrame( - { - "a": list("abc"), - "b": list(range(1, 4)), - "c": np.arange(4.0, 7.0, dtype="float64"), - "d": [True, False, True], - "e": pd.date_range("20130101", periods=3), - } - ) - - df - df.dtypes - -Write to an orc file. - -.. ipython:: python - - df.to_orc("example_pa.orc", engine="pyarrow") - -Read from an orc file. - -.. ipython:: python - - result = pd.read_orc("example_pa.orc") - - result.dtypes - -Read only certain columns of an orc file. - -.. ipython:: python - - result = pd.read_orc( - "example_pa.orc", - columns=["a", "b"], - ) - result.dtypes - - -.. ipython:: python - :suppress: - - os.remove("example_pa.orc") - - -.. 
_io.sql: - -SQL queries ------------ - -The :mod:`pandas.io.sql` module provides a collection of query wrappers to both -facilitate data retrieval and to reduce dependency on DB-specific API. - -Where available, users may first want to opt for `Apache Arrow ADBC -`_ drivers. These drivers -should provide the best performance, null handling, and type detection. - - .. versionadded:: 2.2.0 - - Added native support for ADBC drivers - -For a full list of ADBC drivers and their development status, see the `ADBC Driver -Implementation Status `_ -documentation. - -Where an ADBC driver is not available or may be missing functionality, -users should opt for installing SQLAlchemy alongside their database driver library. -Examples of such drivers are `psycopg2 `__ -for PostgreSQL or `pymysql `__ for MySQL. -For `SQLite `__ this is -included in Python's standard library by default. -You can find an overview of supported drivers for each SQL dialect in the -`SQLAlchemy docs `__. - -If SQLAlchemy is not installed, you can use a :class:`sqlite3.Connection` in place of -a SQLAlchemy engine, connection, or URI string. - -See also some :ref:`cookbook examples ` for some advanced strategies. - -The key functions are: - -.. autosummary:: - - read_sql_table - read_sql_query - read_sql - DataFrame.to_sql - -.. note:: - - The function :func:`~pandas.read_sql` is a convenience wrapper around - :func:`~pandas.read_sql_table` and :func:`~pandas.read_sql_query` (and for - backward compatibility) and will delegate to specific function depending on - the provided input (database table name or sql query). - Table names do not need to be quoted if they have special characters. - -In the following example, we use the `SQlite `__ SQL database -engine. You can use a temporary SQLite database where data are stored in -"memory". - -To connect using an ADBC driver you will want to install the ``adbc_driver_sqlite`` using your -package manager. 
Once installed, you can use the DBAPI interface provided by the ADBC driver -to connect to your database. - -.. code-block:: python - - import adbc_driver_sqlite.dbapi as sqlite_dbapi - - # Create the connection - with sqlite_dbapi.connect("sqlite:///:memory:") as conn: - df = pd.read_sql_table("data", conn) - -To connect with SQLAlchemy you use the :func:`create_engine` function to create an engine -object from database URI. You only need to create the engine once per database you are -connecting to. -For more information on :func:`create_engine` and the URI formatting, see the examples -below and the SQLAlchemy `documentation `__ - -.. ipython:: python - - from sqlalchemy import create_engine - - # Create your engine. - engine = create_engine("sqlite:///:memory:") - -If you want to manage your own connections you can pass one of those instead. The example below opens a -connection to the database using a Python context manager that automatically closes the connection after -the block has completed. -See the `SQLAlchemy docs `__ -for an explanation of how the database connection is handled. - -.. code-block:: python - - with engine.connect() as conn, conn.begin(): - data = pd.read_sql_table("data", conn) - -.. warning:: - - When you open a connection to a database you are also responsible for closing it. - Side effects of leaving a connection open may include locking the database or - other breaking behaviour. - -Writing DataFrames -'''''''''''''''''' - -Assuming the following data is in a ``DataFrame`` ``data``, we can insert it into -the database using :func:`~pandas.DataFrame.to_sql`. 
- -+-----+------------+-------+-------+-------+ -| id | Date | Col_1 | Col_2 | Col_3 | -+=====+============+=======+=======+=======+ -| 26 | 2010-10-18 | X | 27.5 | True | -+-----+------------+-------+-------+-------+ -| 42 | 2010-10-19 | Y | -12.5 | False | -+-----+------------+-------+-------+-------+ -| 63 | 2010-10-20 | Z | 5.73 | True | -+-----+------------+-------+-------+-------+ - - -.. ipython:: python - - import datetime - - c = ["id", "Date", "Col_1", "Col_2", "Col_3"] - d = [ - (26, datetime.datetime(2010, 10, 18), "X", 27.5, True), - (42, datetime.datetime(2010, 10, 19), "Y", -12.5, False), - (63, datetime.datetime(2010, 10, 20), "Z", 5.73, True), - ] - - data = pd.DataFrame(d, columns=c) - - data - data.to_sql("data", con=engine) - -With some databases, writing large DataFrames can result in errors due to -packet size limitations being exceeded. This can be avoided by setting the -``chunksize`` parameter when calling ``to_sql``. For example, the following -writes ``data`` to the database in batches of 1000 rows at a time: - -.. ipython:: python - - data.to_sql("data_chunked", con=engine, chunksize=1000) - -SQL data types -++++++++++++++ - -Ensuring consistent data type management across SQL databases is challenging. -Not every SQL database offers the same types, and even when they do the implementation -of a given type can vary in ways that have subtle effects on how types can be -preserved. - -For the best odds at preserving database types users are advised to use -ADBC drivers when available. The Arrow type system offers a wider array of -types that more closely match database types than the historical pandas/NumPy -type system. 
To illustrate, note this (non-exhaustive) listing of types -available in different databases and pandas backends: - -+-----------------+-----------------------+----------------+---------+ -|numpy/pandas |arrow |postgres |sqlite | -+=================+=======================+================+=========+ -|int16/Int16 |int16 |SMALLINT |INTEGER | -+-----------------+-----------------------+----------------+---------+ -|int32/Int32 |int32 |INTEGER |INTEGER | -+-----------------+-----------------------+----------------+---------+ -|int64/Int64 |int64 |BIGINT |INTEGER | -+-----------------+-----------------------+----------------+---------+ -|float32 |float32 |REAL |REAL | -+-----------------+-----------------------+----------------+---------+ -|float64 |float64 |DOUBLE PRECISION|REAL | -+-----------------+-----------------------+----------------+---------+ -|object |string |TEXT |TEXT | -+-----------------+-----------------------+----------------+---------+ -|bool |``bool_`` |BOOLEAN | | -+-----------------+-----------------------+----------------+---------+ -|datetime64[ns] |timestamp(us) |TIMESTAMP | | -+-----------------+-----------------------+----------------+---------+ -|datetime64[ns,tz]|timestamp(us,tz) |TIMESTAMPTZ | | -+-----------------+-----------------------+----------------+---------+ -| |date32 |DATE | | -+-----------------+-----------------------+----------------+---------+ -| |month_day_nano_interval|INTERVAL | | -+-----------------+-----------------------+----------------+---------+ -| |binary |BINARY |BLOB | -+-----------------+-----------------------+----------------+---------+ -| |decimal128 |DECIMAL [#f1]_ | | -+-----------------+-----------------------+----------------+---------+ -| |list |ARRAY [#f1]_ | | -+-----------------+-----------------------+----------------+---------+ -| |struct |COMPOSITE TYPE | | -| | | [#f1]_ | | -+-----------------+-----------------------+----------------+---------+ - -.. rubric:: Footnotes - -.. 
[#f1] Not implemented as of writing, but theoretically possible - -If you are interested in preserving database types as best as possible -throughout the lifecycle of your DataFrame, users are encouraged to -leverage the ``dtype_backend="pyarrow"`` argument of :func:`~pandas.read_sql` - -.. code-block:: ipython - - # for roundtripping - with pg_dbapi.connect(uri) as conn: - df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow") - -This will prevent your data from being converted to the traditional pandas/NumPy -type system, which often converts SQL types in ways that make them impossible to -round-trip. - -In case an ADBC driver is not available, :func:`~pandas.DataFrame.to_sql` -will try to map your data to an appropriate SQL data type based on the dtype of -the data. When you have columns of dtype ``object``, pandas will try to infer -the data type. - -You can always override the default type by specifying the desired SQL type of -any of the columns by using the ``dtype`` argument. This argument needs a -dictionary mapping column names to SQLAlchemy types (or strings for the sqlite3 -fallback mode). -For example, specifying to use the sqlalchemy ``String`` type instead of the -default ``Text`` type for string columns: - -.. ipython:: python - - from sqlalchemy.types import String - - data.to_sql("data_dtype", con=engine, dtype={"Col_1": String}) - -.. note:: - - Due to the limited support for timedelta's in the different database - flavors, columns with type ``timedelta64`` will be written as integer - values as nanoseconds to the database and a warning will be raised. The only - exception to this is when using the ADBC PostgreSQL driver in which case a - timedelta will be written to the database as an ``INTERVAL`` - -.. note:: - - Columns of ``category`` dtype will be converted to the dense representation - as you would get with ``np.asarray(categorical)`` (e.g. for string categories - this gives an array of strings). 
- Because of this, reading the database table back in does **not** generate - a categorical. - -.. _io.sql_datetime_data: - -Datetime data types -''''''''''''''''''' - -Using ADBC or SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing -datetime data that is timezone naive or timezone aware. However, the resulting -data stored in the database ultimately depends on the supported data type -for datetime data of the database system being used. - -The following table lists supported data types for datetime data for some -common databases. Other database dialects may have different data types for -datetime data. - -=========== ============================================= =================== -Database SQL Datetime Types Timezone Support -=========== ============================================= =================== -SQLite ``TEXT`` No -MySQL ``TIMESTAMP`` or ``DATETIME`` No -PostgreSQL ``TIMESTAMP`` or ``TIMESTAMP WITH TIME ZONE`` Yes -=========== ============================================= =================== - -When writing timezone aware data to databases that do not support timezones, -the data will be written as timezone naive timestamps that are in local time -with respect to the timezone. - -:func:`~pandas.read_sql_table` is also capable of reading datetime data that is -timezone aware or naive. When reading ``TIMESTAMP WITH TIME ZONE`` types, pandas -will convert the data to UTC. - -.. _io.sql.method: - -Insertion method -++++++++++++++++ - -The parameter ``method`` controls the SQL insertion clause used. -Possible values are: - -- ``None``: Uses standard SQL ``INSERT`` clause (one per row). -- ``'multi'``: Pass multiple values in a single ``INSERT`` clause. - It uses a *special* SQL syntax not supported by all backends. - This usually provides better performance for analytic databases - like *Presto* and *Redshift*, but has worse performance for - traditional SQL backend if the table contains many columns. 
- For more information check the SQLAlchemy `documentation - `__. -- callable with signature ``(pd_table, conn, keys, data_iter)``: - This can be used to implement a more performant insertion method based on - specific backend dialect features. - -Example of a callable using PostgreSQL `COPY clause -`__:: - - # Alternative to_sql() *method* for DBs that support COPY FROM - import csv - from io import StringIO - - def psql_insert_copy(table, conn, keys, data_iter): - """ - Execute SQL statement inserting data - - Parameters - ---------- - table : pandas.io.sql.SQLTable - conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection - keys : list of str - Column names - data_iter : Iterable that iterates the values to be inserted - """ - # gets a DBAPI connection that can provide a cursor - dbapi_conn = conn.connection - with dbapi_conn.cursor() as cur: - s_buf = StringIO() - writer = csv.writer(s_buf) - writer.writerows(data_iter) - s_buf.seek(0) - - columns = ', '.join(['"{}"'.format(k) for k in keys]) - if table.schema: - table_name = '{}.{}'.format(table.schema, table.name) - else: - table_name = table.name - - sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format( - table_name, columns) - cur.copy_expert(sql=sql, file=s_buf) - -Reading tables -'''''''''''''' - -:func:`~pandas.read_sql_table` will read a database table given the -table name and optionally a subset of columns to read. - -.. note:: - - In order to use :func:`~pandas.read_sql_table`, you **must** have the - ADBC driver or SQLAlchemy optional dependency installed. - -.. ipython:: python - - pd.read_sql_table("data", engine) - -.. note:: - - ADBC drivers will map database types directly back to arrow types. For other drivers - note that pandas infers column dtypes from query outputs, and not by looking - up data types in the physical database schema. For example, assume ``userid`` - is an integer column in a table. 
Then, intuitively, ``select userid ...`` will - return integer-valued series, while ``select cast(userid as text) ...`` will - return object-valued (str) series. Accordingly, if the query output is empty, - then all resulting columns will be returned as object-valued (since they are - most general). If you foresee that your query will sometimes generate an empty - result, you may want to explicitly typecast afterwards to ensure dtype - integrity. - -You can also specify the name of the column as the ``DataFrame`` index, -and specify a subset of columns to be read. - -.. ipython:: python - - pd.read_sql_table("data", engine, index_col="id") - pd.read_sql_table("data", engine, columns=["Col_1", "Col_2"]) - -And you can explicitly force columns to be parsed as dates: - -.. ipython:: python - - pd.read_sql_table("data", engine, parse_dates=["Date"]) - -If needed you can explicitly specify a format string, or a dict of arguments -to pass to :func:`pandas.to_datetime`: - -.. code-block:: python - - pd.read_sql_table("data", engine, parse_dates={"Date": "%Y-%m-%d"}) - pd.read_sql_table( - "data", - engine, - parse_dates={"Date": {"format": "%Y-%m-%d %H:%M:%S"}}, - ) - - -You can check if a table exists using :func:`~pandas.io.sql.has_table` - -Schema support -'''''''''''''' - -Reading from and writing to different schemas is supported through the ``schema`` -keyword in the :func:`~pandas.read_sql_table` and :func:`~pandas.DataFrame.to_sql` -functions. Note however that this depends on the database flavor (sqlite does not -have schemas). For example: - -.. code-block:: python - - df.to_sql(name="table", con=engine, schema="other_schema") - pd.read_sql_table("table", engine, schema="other_schema") - -Querying -'''''''' - -You can query using raw SQL in the :func:`~pandas.read_sql_query` function. -In this case you must use the SQL variant appropriate for your database. 
-When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs, -which are database-agnostic. - -.. ipython:: python - - pd.read_sql_query("SELECT * FROM data", engine) - -Of course, you can specify a more "complex" query. - -.. ipython:: python - - pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine) - -The :func:`~pandas.read_sql_query` function supports a ``chunksize`` argument. -Specifying this will return an iterator through chunks of the query result: - -.. ipython:: python - - df = pd.DataFrame(np.random.randn(20, 3), columns=list("abc")) - df.to_sql(name="data_chunks", con=engine, index=False) - -.. ipython:: python - - for chunk in pd.read_sql_query("SELECT * FROM data_chunks", engine, chunksize=5): - print(chunk) - - -Engine connection examples -'''''''''''''''''''''''''' - -To connect with SQLAlchemy you use the :func:`create_engine` function to create an engine -object from database URI. You only need to create the engine once per database you are -connecting to. - -.. code-block:: python - - from sqlalchemy import create_engine - - engine = create_engine("postgresql://scott:tiger@localhost:5432/mydatabase") - - engine = create_engine("mysql+mysqldb://scott:tiger@localhost/foo") - - engine = create_engine("oracle://scott:tiger@127.0.0.1:1521/sidname") - - engine = create_engine("mssql+pyodbc://mydsn") - - # sqlite:/// - # where is relative: - engine = create_engine("sqlite:///foo.db") - - # or absolute, starting with a slash: - engine = create_engine("sqlite:////absolute/path/to/foo.db") - -For more information see the examples the SQLAlchemy `documentation `__ - - -Advanced SQLAlchemy queries -''''''''''''''''''''''''''' - -You can use SQLAlchemy constructs to describe your query. - -Use :func:`sqlalchemy.text` to specify query parameters in a backend-neutral way - -.. 
ipython:: python - - import sqlalchemy as sa - - pd.read_sql( - sa.text("SELECT * FROM data where Col_1=:col1"), engine, params={"col1": "X"} - ) - -If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy expressions - -.. ipython:: python - - metadata = sa.MetaData() - data_table = sa.Table( - "data", - metadata, - sa.Column("index", sa.Integer), - sa.Column("Date", sa.DateTime), - sa.Column("Col_1", sa.String), - sa.Column("Col_2", sa.Float), - sa.Column("Col_3", sa.Boolean), - ) - - # use the column's is_() operator; Python's "is" would be evaluated before SQLAlchemy sees it - pd.read_sql(sa.select(data_table).where(data_table.c.Col_3.is_(True)), engine) - -You can combine SQLAlchemy expressions with parameters passed to :func:`read_sql` using :func:`sqlalchemy.bindparam` - -.. ipython:: python - - import datetime as dt - - expr = sa.select(data_table).where(data_table.c.Date > sa.bindparam("date")) - pd.read_sql(expr, engine, params={"date": dt.datetime(2010, 10, 18)}) - - -Sqlite fallback -''''''''''''''' - -The use of sqlite is supported without using SQLAlchemy. -This mode requires a Python database adapter that respects the `Python -DB-API `__. - -You can create connections like so: - -.. code-block:: python - - import sqlite3 - - con = sqlite3.connect(":memory:") - -And then issue the following queries: - -.. code-block:: python - - data.to_sql("data", con) - pd.read_sql_query("SELECT * FROM data", con) - - -.. _io.bigquery: - -Google BigQuery ---------------- - -The ``pandas-gbq`` package provides functionality to read/write from Google BigQuery. - -Full documentation can be found `here `__. - -.. _io.stata: - -STATA format ------------- - -.. _io.stata_writer: - -Writing to stata format -''''''''''''''''''''''' - -The method :func:`.DataFrame.to_stata` will write a DataFrame -into a .dta file. The format version of this file is always 115 (Stata 12). - -.. 
ipython:: python - - df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB")) - df.to_stata("stata.dta") - -*Stata* data files have limited data type support; only strings with -244 or fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` -and ``float64`` can be stored in ``.dta`` files. Additionally, -*Stata* reserves certain values to represent missing data. Exporting a -non-missing value that is outside of the permitted range in Stata for -a particular data type will retype the variable to the next larger -size. For example, ``int8`` values are restricted to lie between -127 -and 100 in Stata, and so variables with values above 100 will trigger -a conversion to ``int16``. ``nan`` values in floating points data -types are stored as the basic missing data type (``.`` in *Stata*). - -.. note:: - - It is not possible to export missing data values for integer data types. - - -The *Stata* writer gracefully handles other data types including ``int64``, -``bool``, ``uint8``, ``uint16``, ``uint32`` by casting to -the smallest supported type that can represent the data. For example, data -with a type of ``uint8`` will be cast to ``int8`` if all values are less than -100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are -outside of this range, the variable is cast to ``int16``. - - -.. warning:: - - Conversion from ``int64`` to ``float64`` may result in a loss of precision - if ``int64`` values are larger than 2**53. - -.. warning:: - - :class:`~pandas.io.stata.StataWriter` and - :func:`.DataFrame.to_stata` only support fixed width - strings containing up to 244 characters, a limitation imposed by the version - 115 dta file format. Attempting to write *Stata* dta files with strings - longer than 244 characters raises a ``ValueError``. - -.. 
_io.stata_reader: - -Reading from Stata format -''''''''''''''''''''''''' - -The top-level function ``read_stata`` will read a dta file and return -either a ``DataFrame`` or a :class:`pandas.api.typing.StataReader` that can -be used to read the file incrementally. - -.. ipython:: python - - pd.read_stata("stata.dta") - -Specifying a ``chunksize`` yields a -:class:`pandas.api.typing.StataReader` instance that can be used to -read ``chunksize`` lines from the file at a time. The ``StataReader`` -object can be used as an iterator. - -.. ipython:: python - - with pd.read_stata("stata.dta", chunksize=3) as reader: - for df in reader: - print(df.shape) - -For more fine-grained control, use ``iterator=True`` and specify -``chunksize`` with each call to -:func:`~pandas.io.stata.StataReader.read`. - -.. ipython:: python - - with pd.read_stata("stata.dta", iterator=True) as reader: - chunk1 = reader.read(5) - chunk2 = reader.read(5) - -Currently the ``index`` is retrieved as a column. - -The parameter ``convert_categoricals`` indicates whether value labels should be -read and used to create a ``Categorical`` variable from them. Value labels can -also be retrieved by the function ``value_labels``, which requires :func:`~pandas.io.stata.StataReader.read` -to be called before use. - -The parameter ``convert_missing`` indicates whether missing value -representations in Stata should be preserved. If ``False`` (the default), -missing values are represented as ``np.nan``. If ``True``, missing values are -represented using ``StataMissingValue`` objects, and columns containing missing -values will have ``object`` data type. - -.. note:: - - :func:`~pandas.read_stata` and - :class:`~pandas.io.stata.StataReader` support .dta formats 113-115 - (Stata 10-12), 117 (Stata 13), and 118 (Stata 14). - -.. note:: - - Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types: - ``int64`` for all integer types and ``float64`` for floating point data. 
By default, - the Stata data types are preserved when importing. - -.. note:: - - All :class:`~pandas.io.stata.StataReader` objects, whether created by :func:`~pandas.read_stata` - (when using ``iterator=True`` or ``chunksize``) or instantiated by hand, must be used as context - managers (e.g. the ``with`` statement). - While the :meth:`~pandas.io.stata.StataReader.close` method is available, its use is unsupported. - It is not part of the public API and will be removed in with future without warning. - -.. ipython:: python - :suppress: - - os.remove("stata.dta") - -.. _io.stata-categorical: - -Categorical data -++++++++++++++++ - -``Categorical`` data can be exported to *Stata* data files as value labeled data. -The exported data consists of the underlying category codes as integer data values -and the categories as value labels. *Stata* does not have an explicit equivalent -to a ``Categorical`` and information about *whether* the variable is ordered -is lost when exporting. - -.. warning:: - - *Stata* only supports string value labels, and so ``str`` is called on the - categories when exporting data. Exporting ``Categorical`` variables with - non-string categories produces a warning, and can result a loss of - information if the ``str`` representations of the categories are not unique. - -Labeled data can similarly be imported from *Stata* data files as ``Categorical`` -variables using the keyword argument ``convert_categoricals`` (``True`` by default). -The keyword argument ``order_categoricals`` (``True`` by default) determines -whether imported ``Categorical`` variables are ordered. - -.. note:: - - When importing categorical data, the values of the variables in the *Stata* - data file are not preserved since ``Categorical`` variables always - use integer data types between ``-1`` and ``n-1`` where ``n`` is the number - of categories. 
If the original values in the *Stata* data file are required, - these can be imported by setting ``convert_categoricals=False``, which will - import original data (but not the variable labels). The original values can - be matched to the imported categorical data since there is a simple mapping - between the original *Stata* data values and the category codes of imported - Categorical variables: missing values are assigned code ``-1``, and the - smallest original value is assigned ``0``, the second smallest is assigned - ``1`` and so on until the largest original value is assigned the code ``n-1``. - -.. note:: - - *Stata* supports partially labeled series. These series have value labels for - some but not all data values. Importing a partially labeled series will produce - a ``Categorical`` with string categories for the values that are labeled and - numeric categories for values with no label. - -.. _io.sas: - -.. _io.sas_reader: - -SAS formats ------------ - -The top-level function :func:`read_sas` can read (but not write) SAS -XPORT (.xpt) and SAS7BDAT (.sas7bdat) format files. - -SAS files only contain two value types: ASCII text and floating point -values (usually 8 bytes but sometimes truncated). For xport files, -there is no automatic type conversion to integers, dates, or -categoricals. For SAS7BDAT files, the format codes may allow date -variables to be automatically converted to dates. By default the -whole file is read and returned as a ``DataFrame``. - -Specify a ``chunksize`` or use ``iterator=True`` to obtain reader -objects (``XportReader`` or ``SAS7BDATReader``) for incrementally -reading the file. The reader objects also have attributes that -contain additional information about the file and its variables. - -Read a SAS7BDAT file: - -.. code-block:: python - - df = pd.read_sas("sas_data.sas7bdat") - -Obtain an iterator and read an XPORT file 100,000 lines at a time: - -.. 
code-block:: python - - def do_something(chunk): - pass - - - with pd.read_sas("sas_xport.xpt", chunk=100000) as rdr: - for chunk in rdr: - do_something(chunk) - -The specification_ for the xport file format is available from the SAS -web site. - -.. _specification: https://support.sas.com/content/dam/SAS/support/en/technical-papers/record-layout-of-a-sas-version-5-or-6-data-set-in-sas-transport-xport-format.pdf - -No official documentation is available for the SAS7BDAT format. - -.. _io.spss: - -.. _io.spss_reader: - -SPSS formats ------------- - -The top-level function :func:`read_spss` can read (but not write) SPSS -SAV (.sav) and ZSAV (.zsav) format files. - -SPSS files contain column names. By default the -whole file is read, categorical columns are converted into ``pd.Categorical``, -and a ``DataFrame`` with all columns is returned. - -Specify the ``usecols`` parameter to obtain a subset of columns. Specify ``convert_categoricals=False`` -to avoid converting categorical columns into ``pd.Categorical``. - -Read an SPSS file: - -.. code-block:: python - - df = pd.read_spss("spss_data.sav") - -Extract a subset of columns contained in ``usecols`` from an SPSS file and -avoid converting categorical columns into ``pd.Categorical``: - -.. code-block:: python - - df = pd.read_spss( - "spss_data.sav", - usecols=["foo", "bar"], - convert_categoricals=False, - ) - -More information about the SAV and ZSAV file formats is available here_. - -.. _here: https://www.ibm.com/docs/en/spss-statistics/22.0.0 - -.. _io.other: - -Other file formats ------------------- - -pandas itself only supports IO with a limited set of file formats that map -cleanly to its tabular data model. For reading and writing other file formats -into and from pandas, we recommend these packages from the broader community. 
- -netCDF -'''''' - -xarray_ provides data structures inspired by the pandas ``DataFrame`` for working -with multi-dimensional datasets, with a focus on the netCDF file format and -easy conversion to and from pandas. - -.. _xarray: https://xarray.pydata.org/en/stable/ - -.. _io.perf: - -Performance considerations --------------------------- - -This is an informal comparison of various IO methods, using pandas -0.24.2. Timings are machine dependent and small differences should be -ignored. - -.. code-block:: ipython - - In [1]: sz = 1000000 - In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz}) - - In [3]: df.info() - - RangeIndex: 1000000 entries, 0 to 999999 - Data columns (total 2 columns): - A 1000000 non-null float64 - B 1000000 non-null int64 - dtypes: float64(1), int64(1) - memory usage: 15.3 MB - -The following test functions will be used below to compare the performance of several IO methods: - -.. code-block:: python - - - - import numpy as np - - import os - - sz = 1000000 - df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz}) - - sz = 1000000 - np.random.seed(42) - df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz}) - - - def test_sql_write(df): - if os.path.exists("test.sql"): - os.remove("test.sql") - sql_db = sqlite3.connect("test.sql") - df.to_sql(name="test_table", con=sql_db) - sql_db.close() - - - def test_sql_read(): - sql_db = sqlite3.connect("test.sql") - pd.read_sql_query("select * from test_table", sql_db) - sql_db.close() - - - def test_hdf_fixed_write(df): - df.to_hdf("test_fixed.hdf", key="test", mode="w") - - - def test_hdf_fixed_read(): - pd.read_hdf("test_fixed.hdf", "test") - - - def test_hdf_fixed_write_compress(df): - df.to_hdf("test_fixed_compress.hdf", key="test", mode="w", complib="blosc") - - - def test_hdf_fixed_read_compress(): - pd.read_hdf("test_fixed_compress.hdf", "test") - - - def test_hdf_table_write(df): - df.to_hdf("test_table.hdf", key="test", mode="w", format="table") - - - def 
test_hdf_table_read(): - pd.read_hdf("test_table.hdf", "test") - - - def test_hdf_table_write_compress(df): - df.to_hdf( - "test_table_compress.hdf", key="test", mode="w", complib="blosc", format="table" - ) - - - def test_hdf_table_read_compress(): - pd.read_hdf("test_table_compress.hdf", "test") - - - def test_csv_write(df): - df.to_csv("test.csv", mode="w") - - - def test_csv_read(): - pd.read_csv("test.csv", index_col=0) - - - def test_feather_write(df): - df.to_feather("test.feather") - - - def test_feather_read(): - pd.read_feather("test.feather") - - - def test_pickle_write(df): - df.to_pickle("test.pkl") - - - def test_pickle_read(): - pd.read_pickle("test.pkl") - - - def test_pickle_write_compress(df): - df.to_pickle("test.pkl.compress", compression="xz") - - - def test_pickle_read_compress(): - pd.read_pickle("test.pkl.compress", compression="xz") - - - def test_parquet_write(df): - df.to_parquet("test.parquet") - - - def test_parquet_read(): - pd.read_parquet("test.parquet") - -When writing, the top three functions in terms of speed are ``test_feather_write``, ``test_hdf_fixed_write`` and ``test_hdf_fixed_write_compress``. - -.. code-block:: ipython - - In [4]: %timeit test_sql_write(df) - 3.29 s ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [5]: %timeit test_hdf_fixed_write(df) - 19.4 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [6]: %timeit test_hdf_fixed_write_compress(df) - 19.6 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - - In [7]: %timeit test_hdf_table_write(df) - 449 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [8]: %timeit test_hdf_table_write_compress(df) - 448 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [9]: %timeit test_csv_write(df) - 3.66 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [10]: %timeit test_feather_write(df) - 9.75 ms ± 117 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each) - - In [11]: %timeit test_pickle_write(df) - 30.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - - In [12]: %timeit test_pickle_write_compress(df) - 4.29 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [13]: %timeit test_parquet_write(df) - 67.6 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - -When reading, the top three functions in terms of speed are ``test_feather_read``, ``test_pickle_read`` and -``test_hdf_fixed_read``. - - -.. code-block:: ipython - - In [14]: %timeit test_sql_read() - 1.77 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [15]: %timeit test_hdf_fixed_read() - 19.4 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - - In [16]: %timeit test_hdf_fixed_read_compress() - 19.5 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - - In [17]: %timeit test_hdf_table_read() - 38.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - - In [18]: %timeit test_hdf_table_read_compress() - 38.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - - In [19]: %timeit test_csv_read() - 452 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [20]: %timeit test_feather_read() - 12.4 ms ± 99.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) - - In [21]: %timeit test_pickle_read() - 18.4 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) - - In [22]: %timeit test_pickle_read_compress() - 915 ms ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - - In [23]: %timeit test_parquet_read() - 24.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) - - -The files ``test.pkl.compress``, ``test.parquet`` and ``test.feather`` took the least space on disk (in bytes). - -.. 
code-block:: none - - 29519500 Oct 10 06:45 test.csv - 16000248 Oct 10 06:45 test.feather - 8281983 Oct 10 06:49 test.parquet - 16000857 Oct 10 06:47 test.pkl - 7552144 Oct 10 06:48 test.pkl.compress - 34816000 Oct 10 06:42 test.sql - 24009288 Oct 10 06:43 test_fixed.hdf - 24009288 Oct 10 06:43 test_fixed_compress.hdf - 24458940 Oct 10 06:44 test_table.hdf - 24458940 Oct 10 06:44 test_table_compress.hdf diff --git a/doc/source/user_guide/io/clipboard.rst b/doc/source/user_guide/io/clipboard.rst new file mode 100644 index 0000000000000..67aefc0480ae9 --- /dev/null +++ b/doc/source/user_guide/io/clipboard.rst @@ -0,0 +1,57 @@ +.. _io.clipboard: + +========= +Clipboard +========= + +A handy way to grab data is to use the :meth:`~DataFrame.read_clipboard` method, +which takes the contents of the clipboard buffer and passes them to the +``read_csv`` method. For instance, you can copy the following text to the +clipboard (CTRL-C on many operating systems): + +.. code-block:: console + + A B C + x 1 4 p + y 2 5 q + z 3 6 r + +And then import the data directly to a ``DataFrame`` by calling: + +.. code-block:: python + + >>> clipdf = pd.read_clipboard() + >>> clipdf + A B C + x 1 4 p + y 2 5 q + z 3 6 r + +The ``to_clipboard`` method can be used to write the contents of a ``DataFrame`` to +the clipboard. Following which you can paste the clipboard contents into other +applications (CTRL-V on many operating systems). Here we illustrate writing a +``DataFrame`` into clipboard and reading it back. + +.. code-block:: python + + >>> df = pd.DataFrame( + ... {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"] + ... ) + + >>> df + A B C + x 1 4 p + y 2 5 q + z 3 6 r + >>> df.to_clipboard() + >>> pd.read_clipboard() + A B C + x 1 4 p + y 2 5 q + z 3 6 r + +We can see that we got the same content back, which we had earlier written to the clipboard. + +.. 
note:: + + You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods. diff --git a/doc/source/user_guide/io/community_packages.rst b/doc/source/user_guide/io/community_packages.rst new file mode 100644 index 0000000000000..2b96d539c0ec3 --- /dev/null +++ b/doc/source/user_guide/io/community_packages.rst @@ -0,0 +1,27 @@ +.. _io.other: + +================================ +Community-supported file formats +================================ + +pandas itself only supports IO with a limited set of file formats that map +cleanly to its tabular data model. For reading and writing other file formats +into and from pandas, we recommend these packages from the broader community. + +.. _io.bigquery: + +Google BigQuery +''''''''''''''' + +The pandas-gbq_ package provides functionality to read/write from Google BigQuery. + +.. _pandas-gbq: https://pandas-gbq.readthedocs.io/en/latest/ + +netCDF +'''''' + +xarray_ provides data structures inspired by the pandas ``DataFrame`` for working +with multi-dimensional datasets, with a focus on the netCDF file format and +easy conversion to and from pandas. + +.. _xarray: https://xarray.pydata.org/en/stable/ diff --git a/doc/source/user_guide/io/csv.rst b/doc/source/user_guide/io/csv.rst new file mode 100644 index 0000000000000..829457f45880c --- /dev/null +++ b/doc/source/user_guide/io/csv.rst @@ -0,0 +1,1729 @@ +.. _io.read_csv_table: + +================ +CSV & text files +================ + +The workhorse function for reading text files (a.k.a. flat files) is +:func:`read_csv`. See the :ref:`cookbook` for some advanced strategies. + +Parsing options +''''''''''''''' + +:func:`read_csv` accepts the following common arguments: + +Basic ++++++ + +filepath_or_buffer : various + Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`) + URL (including http, ftp, and S3 + locations), or any object with a ``read()`` method (such as an open file or + :class:`~python:io.StringIO`). 
+sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table` + Delimiter to use. If sep is ``None``, the C engine cannot automatically detect + the separator, but the Python parsing engine can, meaning the latter will be + used and automatically detect the separator by Python's builtin sniffer tool, + :class:`python:csv.Sniffer`. In addition, separators longer than 1 character and + different from ``'\s+'`` will be interpreted as regular expressions and + will also force the use of the Python parsing engine. Note that regex + delimiters are prone to ignoring quoted data. Regex example: ``'\\r\\t'``. +delimiter : str, default ``None`` + Alternative argument name for sep. + +Column and index locations and names +++++++++++++++++++++++++++++++++++++ + +header : int or list of ints, default ``'infer'`` + Row number(s) to use as the column names, and the start of the + data. Default behavior is to infer the column names: if no names are + passed the behavior is identical to ``header=0`` and column names + are inferred from the first line of the file, if column names are + passed explicitly then the behavior is identical to + ``header=None``. Explicitly pass ``header=0`` to be able to replace + existing names. + + The header can be a list of ints that specify row locations + for a MultiIndex on the columns e.g. ``[0,1,3]``. Intervening rows + that are not specified will be skipped (e.g. 2 in this example is + skipped). Note that this parameter ignores commented lines and empty + lines if ``skip_blank_lines=True``, so header=0 denotes the first + line of data rather than the first line of the file. +names : array-like, default ``None`` + List of column names to use. If file contains no header row, then you should + explicitly pass ``header=None``. Duplicates in this list are not allowed. 
+index_col : int, str, sequence of int / str, or False, optional, default ``None`` + Column(s) to use as the row labels of the ``DataFrame``, either given as + string name or column index. If a sequence of int / str is given, a + MultiIndex is used. + + .. note:: + ``index_col=False`` can be used to force pandas to *not* use the first + column as the index, e.g. when you have a malformed file with delimiters at + the end of each line. + + The default value of ``None`` instructs pandas to guess. If the number of + fields in the column header row is equal to the number of fields in the body + of the data file, then a default index is used. If it is larger, then + the first columns are used as index so that the remaining number of fields in + the body are equal to the number of fields in the header. + + The first row after the header is used to determine the number of columns, + which will go into the index. If the subsequent rows contain less columns + than the first row, they are filled with ``NaN``. + + This can be avoided through ``usecols``. This ensures that the columns are + taken as is and the trailing data are ignored. +usecols : list-like or callable, default ``None`` + Return a subset of the columns. If list-like, all elements must either + be positional (i.e. integer indices into the document columns) or strings + that correspond to column names provided either by the user in ``names`` or + inferred from the document header row(s). If ``names`` are given, the document + header row(s) are not taken into account. For example, a valid list-like + ``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. + + Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. 
To + instantiate a DataFrame from ``data`` with element order preserved use + ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns + in ``['foo', 'bar']`` order or + ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for + ``['bar', 'foo']`` order. + + If callable, the callable function will be evaluated against the column names, + returning names where the callable function evaluates to True: + + .. ipython:: python + + import pandas as pd + from io import StringIO + + data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"]) + + Using this parameter results in much faster parsing time and lower memory usage + when using the c engine. The Python engine loads the data first before deciding + which columns to drop. + +General parsing configuration ++++++++++++++++++++++++++++++ + +dtype : Type name or dict of column -> type, default ``None`` + Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}`` + Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve + and not interpret dtype. If converters are specified, they will be applied INSTEAD + of dtype conversion. + + .. versionadded:: 1.5.0 + + Support for defaultdict was added. Specify a defaultdict as input where + the default determines the dtype of the columns which are not explicitly + listed. + +dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames + Which dtype_backend to use, e.g. whether a DataFrame should have NumPy + arrays, nullable dtypes are used for all dtypes that have a nullable + implementation when "numpy_nullable" is set, pyarrow is used for all + dtypes if "pyarrow" is set. + + The dtype_backends are still experimental. + + .. versionadded:: 2.0 + +engine : {``'c'``, ``'python'``, ``'pyarrow'``} + Parser engine to use. 
The C and pyarrow engines are faster, while the python engine + is currently more feature-complete. Multithreading is currently only supported by + the pyarrow engine. + + .. versionadded:: 1.4.0 + + The "pyarrow" engine was added as an *experimental* engine, and some features + are unsupported, or may not work correctly, with this engine. +converters : dict, default ``None`` + Dict of functions for converting values in certain columns. Keys can either be + integers or column labels. +true_values : list, default ``None`` + Values to consider as ``True``. +false_values : list, default ``None`` + Values to consider as ``False``. +skipinitialspace : boolean, default ``False`` + Skip spaces after delimiter. +skiprows : list-like or integer, default ``None`` + Line numbers to skip (0-indexed) or number of lines to skip (int) at the start + of the file. + + If callable, the callable function will be evaluated against the row + indices, returning True if the row should be skipped and False otherwise: + + .. ipython:: python + + from io import StringIO + + data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0) + +skipfooter : int, default ``0`` + Number of lines at bottom of file to skip (unsupported with engine='c'). + +nrows : int, default ``None`` + Number of rows of file to read. Useful for reading pieces of large files. +low_memory : boolean, default ``True`` + Internally process the file in chunks, resulting in lower memory use + while parsing, but possibly mixed type inference. To ensure no mixed + types either set ``False``, or specify the type with the ``dtype`` parameter. + Note that the entire file is read into a single ``DataFrame`` regardless, + use the ``chunksize`` or ``iterator`` parameter to return the data in chunks. 
+ (Only valid with C parser) +memory_map : boolean, default False + If a filepath is provided for ``filepath_or_buffer``, map the file object + directly onto memory and access the data directly from there. Using this + option can improve performance because there is no longer any I/O overhead. + +NA and missing data handling +++++++++++++++++++++++++++++ + +na_values : scalar, str, list-like, or dict, default ``None`` + Additional strings to recognize as NA/NaN. If dict passed, specific per-column + NA values. See :ref:`na values const ` below + for a list of the values interpreted as NaN by default. + +keep_default_na : boolean, default ``True`` + Whether or not to include the default NaN values when parsing the data. + Depending on whether ``na_values`` is passed in, the behavior is as follows: + + * If ``keep_default_na`` is ``True``, and ``na_values`` are specified, ``na_values`` + is appended to the default NaN values used for parsing. + * If ``keep_default_na`` is ``True``, and ``na_values`` are not specified, only + the default NaN values are used for parsing. + * If ``keep_default_na`` is ``False``, and ``na_values`` are specified, only + the NaN values specified ``na_values`` are used for parsing. + * If ``keep_default_na`` is ``False``, and ``na_values`` are not specified, no + strings will be parsed as NaN. + + Note that if ``na_filter`` is passed in as ``False``, the ``keep_default_na`` and + ``na_values`` parameters will be ignored. +na_filter : boolean, default ``True`` + Detect missing value markers (empty strings and the value of na_values). In + data without any NAs, passing ``na_filter=False`` can improve the performance + of reading a large file. +verbose : boolean, default ``False`` + Indicate number of NA values placed in non-numeric columns. +skip_blank_lines : boolean, default ``True`` + If ``True``, skip over blank lines rather than interpreting as NaN values. + +.. 
_io.read_csv_table.datetime: + +Datetime handling ++++++++++++++++++ + +parse_dates : boolean or list of ints or names or list of lists or dict, default ``False``. + * If ``True`` -> try parsing the index. + * If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date + column. + + .. note:: + A fast-path exists for iso8601-formatted dates. +date_format : str or dict of column -> format, default ``None`` + If used in conjunction with ``parse_dates``, will parse dates according to this + format. For anything more complex, + please read in as ``object`` and then apply :func:`to_datetime` as-needed. + + .. versionadded:: 2.0.0 +dayfirst : boolean, default ``False`` + DD/MM format dates, international and European format. +cache_dates : boolean, default True + If True, use a cache of unique, converted dates to apply the datetime + conversion. May produce significant speed-up when parsing duplicate + date strings, especially ones with timezone offsets. + +Iteration ++++++++++ + +iterator : boolean, default ``False`` + Return ``TextFileReader`` object for iteration or getting chunks with + ``get_chunk()``. +chunksize : int, default ``None`` + Return ``TextFileReader`` object for iteration. See :ref:`iterating and chunking + ` below. + +Quoting, compression, and file format ++++++++++++++++++++++++++++++++++++++ + +compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'`` + For on-the-fly decompression of on-disk data. If 'infer', then use gzip, + bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2', + '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip', + the ZIP file must contain only one data file to be read in. + Set to ``None`` for no decompression. 
Can also be a dict with key ``'method'`` + set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are + forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``. + As an example, the following could be passed for faster compression and to + create a reproducible gzip archive: + ``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``. + + .. versionchanged:: 1.2.0 Previous versions forwarded dict entries for 'gzip' to ``gzip.open``. +thousands : str, default ``None`` + Thousands separator. +decimal : str, default ``'.'`` + Character to recognize as decimal point. E.g. use ``','`` for European data. +float_precision : string, default None + Specifies which converter the C engine should use for floating-point values. + The options are ``None`` for the ordinary converter, ``high`` for the + high-precision converter, and ``round_trip`` for the round-trip converter. +lineterminator : str (length 1), default ``None`` + Character to break file into lines. Only valid with C parser. +quotechar : str (length 1) + The character used to denote the start and end of a quoted item. Quoted items + can include the delimiter and it will be ignored. +quoting : int or ``csv.QUOTE_*`` instance, default ``0`` + Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of + ``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or + ``QUOTE_NONE`` (3). +doublequote : boolean, default ``True`` + When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``, + indicate whether or not to interpret two consecutive ``quotechar`` elements + **inside** a field as a single ``quotechar`` element. +escapechar : str (length 1), default ``None`` + One-character string used to escape delimiter when quoting is ``QUOTE_NONE``. +comment : str, default ``None`` + Indicates remainder of line should not be parsed. 
If found at the beginning of + a line, the line will be ignored altogether. This parameter must be a single + character. Like empty lines (as long as ``skip_blank_lines=True``), fully + commented lines are ignored by the parameter ``header`` but not by ``skiprows``. + For example, if ``comment='#'``, parsing '#empty\\na,b,c\\n1,2,3' with + ``header=0`` will result in 'a,b,c' being treated as the header. +encoding : str, default ``None`` + Encoding to use for UTF when reading/writing (e.g. ``'utf-8'``). `List of + Python standard encodings + `_. +dialect : str or :class:`python:csv.Dialect` instance, default ``None`` + If provided, this parameter will override values (default or not) for the + following parameters: ``delimiter``, ``doublequote``, ``escapechar``, + ``skipinitialspace``, ``quotechar``, and ``quoting``. If it is necessary to + override values, a ParserWarning will be issued. See :class:`python:csv.Dialect` + documentation for more details. + +Error handling +++++++++++++++ + +on_bad_lines : {'error', 'warn', 'skip'}, default 'error' + Specifies what to do upon encountering a bad line (a line with too many fields). + Allowed values are: + + - 'error', raise a ParserError when a bad line is encountered. + - 'warn', print a warning when a bad line is encountered and skip that line. + - 'skip', skip bad lines without raising or warning when they are encountered. + + .. versionadded:: 1.3.0 + +.. _io.dtypes: + +Specifying column data types +'''''''''''''''''''''''''''' + +You can indicate the data type for the whole ``DataFrame`` or individual +columns: + +.. 
ipython:: python + + import numpy as np + from io import StringIO + + data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" + print(data) + + df = pd.read_csv(StringIO(data), dtype=object) + df + df["a"][0] + df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) + df.dtypes + +Fortunately, pandas offers more than one way to ensure that your column(s) +contain only one ``dtype``. If you're unfamiliar with these concepts, you can +see :ref:`here` to learn more about dtypes, and +:ref:`here` to learn more about ``object`` conversion in +pandas. + + +For instance, you can use the ``converters`` argument +of :func:`~pandas.read_csv`: + +.. ipython:: python + + from io import StringIO + + data = "col_1\n1\n2\n'A'\n4.22" + df = pd.read_csv(StringIO(data), converters={"col_1": str}) + df + df["col_1"].apply(type).value_counts() + +Or you can use the :func:`~pandas.to_numeric` function to coerce the +dtypes after reading in the data, + +.. ipython:: python + + from io import StringIO + + df2 = pd.read_csv(StringIO(data)) + df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") + df2 + df2["col_1"].apply(type).value_counts() + +which will convert all valid parsing to floats, leaving the invalid parsing +as ``NaN``. + +Ultimately, how you deal with reading in columns containing mixed dtypes +depends on your specific needs. In the case above, if you wanted to ``NaN`` out +the data anomalies, then :func:`~pandas.to_numeric` is probably your best option. +However, if you wanted for all the data to be coerced, no matter the type, then +using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be +worth trying. + +.. note:: + In some cases, reading in abnormal data with columns containing mixed dtypes + will result in an inconsistent dataset. If you rely on pandas to infer the + dtypes of your columns, the parsing engine will go and infer the dtypes for + different chunks of the data, rather than the whole dataset at once. 
Consequently, + you can end up with column(s) with mixed dtypes. For example, + + .. ipython:: python + :okwarning: + + col_1 = list(range(500000)) + ["a", "b"] + list(range(500000)) + df = pd.DataFrame({"col_1": col_1}) + df.to_csv("foo.csv") + mixed_df = pd.read_csv("foo.csv") + mixed_df["col_1"].apply(type).value_counts() + mixed_df["col_1"].dtype + + will result with ``mixed_df`` containing an ``int`` dtype for certain chunks + of the column, and ``str`` for others due to the mixed dtypes from the + data that was read in. It is important to note that the overall column will be + marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes. + +.. ipython:: python + :suppress: + + import os + + os.remove("foo.csv") + +Setting ``dtype_backend="numpy_nullable"`` will result in nullable dtypes for every column. + +.. ipython:: python + + from io import StringIO + + data = """a,b,c,d,e,f,g,h,i,j + 1,2.5,True,a,,,,,12-31-2019, + 3,4.5,False,b,6,7.5,True,a,12-31-2019, + """ + + df = pd.read_csv(StringIO(data), dtype_backend="numpy_nullable", parse_dates=["i"]) + df + df.dtypes + +.. _io.categorical: + +Specifying categorical dtype +'''''''''''''''''''''''''''' + +``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` or +``dtype=CategoricalDtype(categories, ordered)``. + +.. ipython:: python + + from io import StringIO + + data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" + + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data)).dtypes + pd.read_csv(StringIO(data), dtype="category").dtypes + +Individual columns can be parsed as a ``Categorical`` using a dict +specification: + +.. ipython:: python + + from io import StringIO + + pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes + +Specifying ``dtype='category'`` will result in an unordered ``Categorical`` +whose ``categories`` are the unique values observed in the data. 
For more +control on the categories and order, create a +:class:`~pandas.api.types.CategoricalDtype` ahead of time, and pass that for +that column's ``dtype``. + +.. ipython:: python + + from pandas.api.types import CategoricalDtype + from io import StringIO + + dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True) + pd.read_csv(StringIO(data), dtype={"col1": dtype}).dtypes + +When using ``dtype=CategoricalDtype``, "unexpected" values outside of +``dtype.categories`` are treated as missing values. + +.. ipython:: python + + from io import StringIO + + dtype = CategoricalDtype(["a", "b", "d"]) # No 'c' + pd.read_csv(StringIO(data), dtype={"col1": dtype}).col1 + +This matches the behavior of :meth:`Categorical.set_categories`. + +.. note:: + + With ``dtype='category'``, the resulting categories will always be parsed + as strings (object dtype). If the categories are numeric they can be + converted using the :func:`to_numeric` function, or as appropriate, another + converter such as :func:`to_datetime`. + + When ``dtype`` is a ``CategoricalDtype`` with homogeneous ``categories`` ( + all numeric, all datetimes, etc.), the conversion is done automatically. + + .. ipython:: python + + from io import StringIO + + df = pd.read_csv(StringIO(data), dtype="category") + df.dtypes + df["col3"] + new_categories = pd.to_numeric(df["col3"].cat.categories) + df["col3"] = df["col3"].cat.rename_categories(new_categories) + df["col3"] + + +Naming and using columns +'''''''''''''''''''''''' + +.. _io.headers: + +Handling column names ++++++++++++++++++++++ + +A file may or may not have a header row. pandas assumes the first row should be +used as the column names: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n1,2,3\n4,5,6\n7,8,9" + print(data) + pd.read_csv(StringIO(data)) + +By specifying the ``names`` argument in conjunction with ``header`` you can +indicate other names to use and whether or not to throw away the header row (if +any): + +.. 
ipython:: python + + from io import StringIO + + print(data) + pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=0) + pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=None) + +If the header is in a row other than the first, pass the row number to +``header``. This will skip the preceding rows: + +.. ipython:: python + + from io import StringIO + + data = "skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9" + pd.read_csv(StringIO(data), header=1) + +.. note:: + + Default behavior is to infer the column names: if no names are + passed the behavior is identical to ``header=0`` and column names + are inferred from the first non-blank line of the file, if column + names are passed explicitly then the behavior is identical to + ``header=None``. + +.. _io.dupe_names: + +Duplicate names parsing +''''''''''''''''''''''' + +If the file or header contains duplicate names, pandas will by default +distinguish between them so as to prevent overwriting data: + +.. ipython:: python + + from io import StringIO + + data = "a,b,a\n0,1,2\n3,4,5" + pd.read_csv(StringIO(data)) + +There is no more duplicate data because duplicate columns 'X', ..., 'X' become +'X', 'X.1', ..., 'X.N'. + +.. _io.usecols: + +Filtering columns (``usecols``) ++++++++++++++++++++++++++++++++ + +The ``usecols`` argument allows you to select any subset of the columns in a +file, either using the column names, position numbers or a callable: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz" + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), usecols=["b", "d"]) + pd.read_csv(StringIO(data), usecols=[0, 2, 3]) + pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["A", "C"]) + +The ``usecols`` argument can also be used to specify which columns not to +use in the final result: + +.. 
ipython:: python + + from io import StringIO + + pd.read_csv(StringIO(data), usecols=lambda x: x not in ["a", "c"]) + +In this case, the callable is specifying that we exclude the "a" and "c" +columns from the output. + +Comments and empty lines +'''''''''''''''''''''''' + +.. _io.skiplines: + +Ignoring line comments and empty lines +++++++++++++++++++++++++++++++++++++++ + +If the ``comment`` parameter is specified, then completely commented lines will +be ignored. By default, completely blank lines will be ignored as well. + +.. ipython:: python + + from io import StringIO + + data = "\na,b,c\n \n# commented line\n1,2,3\n\n4,5,6" + print(data) + pd.read_csv(StringIO(data), comment="#") + +If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n\n1,2,3\n\n\n4,5,6" + pd.read_csv(StringIO(data), skip_blank_lines=False) + +.. warning:: + + The presence of ignored lines might create ambiguities involving line numbers; + the parameter ``header`` uses row numbers (ignoring commented/empty + lines), while ``skiprows`` uses line numbers (including commented/empty lines): + + .. ipython:: python + + from io import StringIO + + data = "#comment\na,b,c\nA,B,C\n1,2,3" + pd.read_csv(StringIO(data), comment="#", header=1) + data = "A,B,C\n#comment\na,b,c\n1,2,3" + pd.read_csv(StringIO(data), comment="#", skiprows=2) + + If both ``header`` and ``skiprows`` are specified, ``header`` will be + relative to the end of ``skiprows``. For example: + +.. ipython:: python + + from io import StringIO + + data = ( + "# empty\n" + "# second empty line\n" + "# third emptyline\n" + "X,Y,Z\n" + "1,2,3\n" + "A,B,C\n" + "1,2.,4.\n" + "5.,NaN,10.0\n" + ) + print(data) + pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1) + +.. _io.comments: + +Comments +++++++++ + +Sometimes comments or meta data may be included in a file: + +.. 
ipython:: python + + data = ( + "ID,level,category\n" + "Patient1,123000,x # really unpleasant\n" + "Patient2,23000,y # wouldn't take his medicine\n" + "Patient3,1234018,z # awesome" + ) + with open("tmp.csv", "w") as fh: + fh.write(data) + + print(open("tmp.csv").read()) + +By default, the parser includes the comments in the output: + +.. ipython:: python + + df = pd.read_csv("tmp.csv") + df + +We can suppress the comments using the ``comment`` keyword: + +.. ipython:: python + + df = pd.read_csv("tmp.csv", comment="#") + df + +.. ipython:: python + :suppress: + + os.remove("tmp.csv") + +.. _io.unicode: + +Dealing with Unicode data +''''''''''''''''''''''''' + +The ``encoding`` argument should be used for encoded unicode data, which will +result in byte strings being decoded to unicode in the result: + +.. ipython:: python + + from io import BytesIO + + data = b"word,length\n" b"Tr\xc3\xa4umen,7\n" b"Gr\xc3\xbc\xc3\x9fe,5" + data = data.decode("utf8").encode("latin-1") + df = pd.read_csv(BytesIO(data), encoding="latin-1") + df + df["word"][1] + +Some formats which encode all characters as multiple bytes, like UTF-16, won't +parse correctly at all without specifying the encoding. `Full list of Python +standard encodings +`_. + +.. _io.index_col: + +Index columns and trailing delimiters +''''''''''''''''''''''''''''''''''''' + +If a file has one more column of data than the number of column names, the +first column will be used as the ``DataFrame``'s row names: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10" + pd.read_csv(StringIO(data)) + +.. ipython:: python + + from io import StringIO + + data = "index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10" + pd.read_csv(StringIO(data), index_col=0) + +Ordinarily, you can achieve this behavior using the ``index_col`` option. + +There are some exception cases when a file has been prepared with delimiters at +the end of each data line, confusing the parser. 
To explicitly disable the +index column inference and discard the last column, pass ``index_col=False``: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n4,apple,bat,\n8,orange,cow," + print(data) + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), index_col=False) + +If a subset of data is being parsed using the ``usecols`` option, the +``index_col`` specification is based on that subset, not the original data. + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n4,apple,bat,\n8,orange,cow," + print(data) + pd.read_csv(StringIO(data), usecols=["b", "c"]) + pd.read_csv(StringIO(data), usecols=["b", "c"], index_col=0) + +.. _io.parse_dates: + +Date Handling +''''''''''''' + +Specifying date columns ++++++++++++++++++++++++ + +To better facilitate working with datetime data, :func:`read_csv` +uses the keyword arguments ``parse_dates`` and ``date_format`` +to allow users to specify a variety of columns and date/time formats to turn the +input text data into ``datetime`` objects. + +The simplest case is to just pass in ``parse_dates=True``: + +.. ipython:: python + + with open("foo.csv", mode="w") as f: + f.write("date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5") + + # Use a column as an index, and parse it as dates. + df = pd.read_csv("foo.csv", index_col=0, parse_dates=True) + df + + # These are Python datetime objects + df.index + +It is often the case that we may want to store date and time data separately, +or store various date fields separately. The ``parse_dates`` keyword can be +used to specify columns to parse the dates and/or times. + + +.. note:: + If a column or index contains an unparsable date, the entire column or + index will be returned unaltered as an object data type. For non-standard + datetime parsing, use :func:`to_datetime` after ``pd.read_csv``. + + +..
note:: + ``read_csv`` has a fast path for parsing datetime strings in ISO 8601 format, + e.g. "2000-01-01T00:01:02+00:00" and similar variations. If you can arrange + for your data to store datetimes in this format, load times will be + significantly faster; a speedup of ~20x has been observed. + + +Date parsing functions +++++++++++++++++++++++ + +Finally, the parser allows you to specify a custom ``date_format``. +Performance-wise, you should try these methods of parsing dates in order: + +1. If you know the format, use ``date_format``, e.g.: + ``date_format="%d/%m/%Y"`` or ``date_format={column_name: "%d/%m/%Y"}``. + +2. If you have different formats for different columns, or want to pass any extra options (such + as ``utc``) to ``to_datetime``, then you should read in your data as ``object`` dtype, and + then use ``to_datetime``. + + +.. _io.csv.mixed_timezones: + +Parsing a CSV with mixed timezones +++++++++++++++++++++++++++++++++++ + +pandas cannot natively represent a column or index with mixed timezones. If your CSV +file contains columns with a mixture of timezones, the default result will be +an object-dtype column with strings, even with ``parse_dates``. +To parse the mixed-timezone values as a datetime column, read in as ``object`` dtype and +then call :func:`to_datetime` with ``utc=True``. + + +.. ipython:: python + + from io import StringIO + + content = """\ + a + 2000-01-01T00:00:00+05:00 + 2000-01-01T00:00:00+06:00""" + df = pd.read_csv(StringIO(content)) + df["a"] = pd.to_datetime(df["a"], utc=True) + df["a"] + + +.. _io.dayfirst: + + +Inferring datetime format ++++++++++++++++++++++++++ + +Here are some examples of datetime strings that can be guessed (all +representing December 30th, 2011 at 00:00:00): + +* "20111230" +* "2011/12/30" +* "20111230 00:00:00" +* "12/30/2011 00:00:00" +* "30/Dec/2011 00:00:00" +* "30/December/2011 00:00:00" + +Note that format inference is sensitive to ``dayfirst``. With +``dayfirst=True``, it will guess "01/12/2011" to be December 1st.
With +``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th. + +If you try to parse a column of date strings, pandas will attempt to guess the format +from the first non-NaN element, and will then parse the rest of the column with that +format. If pandas fails to guess the format (for example if your first string is +``'01 December US/Pacific 2000'``), then a warning will be raised and each +row will be parsed individually by ``dateutil.parser.parse``. The safest +way to parse dates is to explicitly set ``format=``. + +.. ipython:: python + + df = pd.read_csv( + "foo.csv", + index_col=0, + parse_dates=True, + ) + df + +In the case that you have mixed datetime formats within the same column, you can +pass ``format='mixed'`` + +.. ipython:: python + + from io import StringIO + + data = StringIO("date\n12 Jan 2000\n2000-01-13\n") + df = pd.read_csv(data) + df['date'] = pd.to_datetime(df['date'], format='mixed') + df + +or, if your datetime formats are all ISO8601 (possibly not identically-formatted): + +.. ipython:: python + + from io import StringIO + + data = StringIO("date\n2020-01-01\n2020-01-01 03:00\n") + df = pd.read_csv(data) + df['date'] = pd.to_datetime(df['date'], format='ISO8601') + df + +.. ipython:: python + :suppress: + + os.remove("foo.csv") + +International date formats +++++++++++++++++++++++++++ + +While US date formats tend to be MM/DD/YYYY, many international formats use +DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided: + +.. ipython:: python + + data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c" + print(data) + with open("tmp.csv", "w") as fh: + fh.write(data) + + pd.read_csv("tmp.csv", parse_dates=[0]) + pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0]) + +.. ipython:: python + :suppress: + + os.remove("tmp.csv") + +Writing CSVs to binary file objects ++++++++++++++++++++++++++++++++++++ + +.. 
versionadded:: 1.2.0 + +``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object +opened in binary mode. In most cases, it is not necessary to specify +``mode`` as pandas will auto-detect whether the file object is +opened in text or binary mode. + +.. ipython:: python + + import io + + data = pd.DataFrame([0, 1, 2]) + buffer = io.BytesIO() + data.to_csv(buffer, encoding="utf-8", compression="gzip") + +.. _io.float_precision: + +Specifying method for floating-point conversion +''''''''''''''''''''''''''''''''''''''''''''''' + +The parameter ``float_precision`` can be specified in order to use +a specific floating-point converter during parsing with the C engine. +The options are the ordinary converter, the high-precision converter, and +the round-trip converter (which is guaranteed to round-trip values after +writing to a file). For example: + +.. ipython:: python + + from io import StringIO + + val = "0.3066101993807095471566981359501369297504425048828125" + data = "a,b,c\n1,2,{0}".format(val) + abs( + pd.read_csv( + StringIO(data), + engine="c", + float_precision=None, + )["c"][0] - float(val) + ) + abs( + pd.read_csv( + StringIO(data), + engine="c", + float_precision="high", + )["c"][0] - float(val) + ) + abs( + pd.read_csv(StringIO(data), engine="c", float_precision="round_trip")["c"][0] + - float(val) + ) + + +.. _io.thousands: + +Thousand separators +''''''''''''''''''' + +For large numbers that have been written with a thousands separator, you can +set the ``thousands`` keyword to a string of length 1 so that integers will be parsed +correctly. + +By default, numbers with a thousands separator will be parsed as strings: + +.. ipython:: python + + data = ( + "ID|level|category\n" + "Patient1|123,000|x\n" + "Patient2|23,000|y\n" + "Patient3|1,234,018|z" + ) + + with open("tmp.csv", "w") as fh: + fh.write(data) + + df = pd.read_csv("tmp.csv", sep="|") + df + + df.level.dtype + +The ``thousands`` keyword allows integers to be parsed correctly: + +..
ipython:: python + + df = pd.read_csv("tmp.csv", sep="|", thousands=",") + df + + df.level.dtype + +.. ipython:: python + :suppress: + + os.remove("tmp.csv") + +.. _io.na_values: + +NA values +''''''''' + +To control which values are parsed as missing values (which are signified by +``NaN``), specify a string in ``na_values``. If you specify a list of strings, +then all values in it are considered to be missing values. If you specify a +number (a ``float``, like ``5.0`` or an ``integer`` like ``5``), the +corresponding equivalent values will also imply a missing value (in this case +effectively ``[5.0, 5]`` are recognized as ``NaN``). + +To completely override the default values that are recognized as missing, specify ``keep_default_na=False``. + +.. _io.navaluesconst: + +The default ``NaN`` recognized values are ``['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', +'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', 'None', '']``. + +Let us consider some examples: + +.. code-block:: python + + pd.read_csv("path_to_file.csv", na_values=[5]) + +In the example above ``5`` and ``5.0`` will be recognized as ``NaN``, in +addition to the defaults. A string will first be interpreted as a numerical +``5``, then as a ``NaN``. + +.. code-block:: python + + pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=[""]) + +Above, only an empty field will be recognized as ``NaN``. + +.. code-block:: python + + pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=["NA", "0"]) + +Above, both ``NA`` and ``0`` as strings are ``NaN``. + +.. code-block:: python + + pd.read_csv("path_to_file.csv", na_values=["Nope"]) + +The default values, in addition to the string ``"Nope"`` are recognized as +``NaN``. + +.. _io.infinity: + +Infinity +'''''''' + +``inf`` like values will be parsed as ``np.inf`` (positive infinity), and ``-inf`` as ``-np.inf`` (negative infinity).
+These will ignore the case of the value, meaning ``Inf``, will also be parsed as ``np.inf``. + +.. _io.boolean: + +Boolean values +'''''''''''''' + +The common values ``True``, ``False``, ``TRUE``, and ``FALSE`` are all +recognized as boolean. Occasionally you might want to recognize other values +as being boolean. To do this, use the ``true_values`` and ``false_values`` +options as follows: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n1,Yes,2\n3,No,4" + print(data) + pd.read_csv(StringIO(data)) + pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"]) + +.. _io.bad_lines: + +Handling "bad" lines +'''''''''''''''''''' + +Some files may have malformed lines with too few fields or too many. Lines with +too few fields will have NA values filled in the trailing fields. Lines with +too many fields will raise an error by default: + +.. ipython:: python + :okexcept: + + from io import StringIO + + data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" + pd.read_csv(StringIO(data)) + +You can elect to skip bad lines: + +.. ipython:: python + + from io import StringIO + + data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10" + pd.read_csv(StringIO(data), on_bad_lines="skip") + +.. versionadded:: 1.4.0 + +Or pass a callable function to handle the bad line if ``engine="python"``. +The bad line will be a list of strings that was split by the ``sep``: + +.. ipython:: python + + from io import StringIO + + external_list = [] + def bad_lines_func(line): + external_list.append(line) + return line[-3:] + pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") + external_list + +.. note:: + + The callable function will handle only a line with too many fields. + Bad lines caused by other errors will be silently skipped. + + .. 
ipython:: python + + from io import StringIO + + bad_lines_func = lambda line: print(line) + + data = 'name,type\nname a,a is of type a\nname b,"b\" is of type b"' + data + pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") + + The line was not processed in this case, as a "bad line" here is caused by an escape character. + +You can also use the ``usecols`` parameter to eliminate extraneous column +data that appear in some lines but not others: + +.. ipython:: python + :okexcept: + + from io import StringIO + + pd.read_csv(StringIO(data), usecols=[0, 1, 2]) + +In case you want to keep all data including the lines with too many fields, you can +specify a sufficient number of ``names``. This ensures that lines with not enough +fields are filled with ``NaN``. + +.. ipython:: python + + from io import StringIO + + pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd']) + +.. _io.dialect: + +Dialect +''''''' + +The ``dialect`` keyword gives greater flexibility in specifying the file format. +By default it uses the Excel dialect but you can specify either the dialect name +or a :class:`python:csv.Dialect` instance. + +Suppose you had data with unenclosed quotes: + +.. ipython:: python + + data = "label1,label2,label3\n" 'index1,"a,c,e\n' "index2,b,d,f" + print(data) + +By default, ``read_csv`` uses the Excel dialect and treats the double quote as +the quote character, which causes it to fail when it finds a newline before it +finds the closing double quote. + +We can get around this using ``dialect``: + +.. ipython:: python + :okwarning: + + import csv + from io import StringIO + + dia = csv.excel() + dia.quoting = csv.QUOTE_NONE + pd.read_csv(StringIO(data), dialect=dia) + +All of the dialect options can be specified separately by keyword arguments: + +.. 
ipython:: python + + from io import StringIO + + data = "a,b,c~1,2,3~4,5,6" + pd.read_csv(StringIO(data), lineterminator="~") + +Another common dialect option is ``skipinitialspace``, to skip any whitespace +after a delimiter: + +.. ipython:: python + + from io import StringIO + + data = "a, b, c\n1, 2, 3\n4, 5, 6" + print(data) + pd.read_csv(StringIO(data), skipinitialspace=True) + +The parsers make every attempt to "do the right thing" and not be fragile. Type +inference is a pretty big deal. If a column can be coerced to integer dtype +without altering the contents, the parser will do so. Any non-numeric +columns will come through as object dtype as with the rest of pandas objects. + +.. _io.quoting: + +Quoting and Escape Characters +''''''''''''''''''''''''''''' + +Quotes (and other escape characters) in embedded fields can be handled in any +number of ways. One way is to use backslashes; to properly parse this data, you +should pass the ``escapechar`` option: + +.. ipython:: python + + from io import StringIO + + data = 'a,b\n"hello, \\"Bob\\", nice to see you",5' + print(data) + pd.read_csv(StringIO(data), escapechar="\\") + +.. _io.fwf_reader: +.. _io.fwf: + +Files with fixed width columns +'''''''''''''''''''''''''''''' + +While :func:`read_csv` reads delimited data, the :func:`read_fwf` function works +with data files that have known and fixed column widths. The function parameters +to ``read_fwf`` are largely the same as ``read_csv`` with two extra parameters, and +a different usage of the ``delimiter`` parameter: + +* ``colspecs``: A list of pairs (tuples) giving the extents of the + fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). + String value 'infer' can be used to instruct the parser to try detecting + the column specifications from the first 100 rows of the data. Default + behavior, if not specified, is to infer. 
+* ``widths``: A list of field widths which can be used instead of 'colspecs' + if the intervals are contiguous. +* ``delimiter``: Characters to consider as filler characters in the fixed-width file. + Can be used to specify the filler character of the fields + if it is not spaces (e.g., '~'). + +Consider a typical fixed-width data file: + +.. ipython:: python + + data1 = ( + "id8141 360.242940 149.910199 11950.7\n" + "id1594 444.953632 166.985655 11788.4\n" + "id1849 364.136849 183.628767 11806.2\n" + "id1230 413.836124 184.375703 11916.8\n" + "id1948 502.953953 173.237159 12468.3" + ) + with open("bar.csv", "w") as f: + f.write(data1) + +In order to parse this file into a ``DataFrame``, we simply need to supply the +column specifications to the ``read_fwf`` function along with the file name: + +.. ipython:: python + + # Column specifications are a list of half-intervals + colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)] + df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0) + df + +Note how the parser automatically picks column names X.<column number> when the +``header=None`` argument is specified. Alternatively, you can supply just the +column widths for contiguous columns: + +.. ipython:: python + + # Widths are a list of integers + widths = [6, 14, 13, 10] + df = pd.read_fwf("bar.csv", widths=widths, header=None) + df + +The parser will take care of extra white spaces around the columns +so it's ok to have extra separation between the columns in the file. + +By default, ``read_fwf`` will try to infer the file's ``colspecs`` by using the +first 100 rows of the file. It can do it only in cases when the columns are +aligned and correctly separated by the provided ``delimiter`` (default delimiter +is whitespace). + +.. ipython:: python + + df = pd.read_fwf("bar.csv", header=None, index_col=0) + df + +``read_fwf`` supports the ``dtype`` parameter for specifying the types of +parsed columns to be different from the inferred type. + +..
ipython:: python + + pd.read_fwf("bar.csv", header=None, index_col=0).dtypes + pd.read_fwf("bar.csv", header=None, dtype={2: "object"}).dtypes + +.. ipython:: python + :suppress: + + os.remove("bar.csv") + + +Indexes +''''''' + +Files with an "implicit" index column ++++++++++++++++++++++++++++++++++++++ + +Consider a file with one fewer entry in the header than the number of data +columns: + +.. ipython:: python + + data = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5" + print(data) + with open("foo.csv", "w") as f: + f.write(data) + +In this special case, ``read_csv`` assumes that the first column is to be used +as the index of the ``DataFrame``: + +.. ipython:: python + + pd.read_csv("foo.csv") + +Note that the dates weren't automatically parsed. In that case you would need +to do as before: + +.. ipython:: python + + df = pd.read_csv("foo.csv", parse_dates=True) + df.index + +.. ipython:: python + :suppress: + + os.remove("foo.csv") + + +Reading an index with a ``MultiIndex`` +++++++++++++++++++++++++++++++++++++++ + +.. _io.csv_multiindex: + +Suppose you have data indexed by two columns: + +.. ipython:: python + + data = 'year,indiv,zit,xit\n1977,"A",1.2,.6\n1977,"B",1.5,.5' + print(data) + with open("mindex_ex.csv", mode="w") as f: + f.write(data) + +The ``index_col`` argument to ``read_csv`` can take a list of +column numbers to turn multiple columns into a ``MultiIndex`` for the index of the +returned object: + +.. ipython:: python + + df = pd.read_csv("mindex_ex.csv", index_col=[0, 1]) + df + df.loc[1977] + +.. ipython:: python + :suppress: + + os.remove("mindex_ex.csv") + +.. _io.multi_index_columns: + +Reading columns with a ``MultiIndex`` ++++++++++++++++++++++++++++++++++++++ + +By specifying a list of row locations for the ``header`` argument, you +can read in a ``MultiIndex`` for the columns. Specifying non-consecutive +rows will skip the intervening rows. + +..
ipython:: python + + mi_idx = pd.MultiIndex.from_arrays([[1, 2, 3, 4], list("abcd")], names=list("ab")) + mi_col = pd.MultiIndex.from_arrays([[1, 2], list("ab")], names=list("cd")) + df = pd.DataFrame(np.ones((4, 2)), index=mi_idx, columns=mi_col) + df.to_csv("mi.csv") + print(open("mi.csv").read()) + pd.read_csv("mi.csv", header=[0, 1, 2, 3], index_col=[0, 1]) + +``read_csv`` is also able to interpret a more common format +of multi-column indices. + +.. ipython:: python + + data = ",a,a,a,b,c,c\n,q,r,s,t,u,v\none,1,2,3,4,5,6\ntwo,7,8,9,10,11,12" + print(data) + with open("mi2.csv", "w") as fh: + fh.write(data) + + pd.read_csv("mi2.csv", header=[0, 1], index_col=0) + +.. note:: + If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it + with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will + be *lost*. + +.. ipython:: python + :suppress: + + os.remove("mi.csv") + os.remove("mi2.csv") + +.. _io.sniff: + +Automatically "sniffing" the delimiter +'''''''''''''''''''''''''''''''''''''' + +``read_csv`` is capable of inferring the delimiter of delimited (not necessarily +comma-separated) files, as pandas uses the :class:`python:csv.Sniffer` +class of the csv module. For this, you have to specify ``sep=None``. + +.. ipython:: python + + df = pd.DataFrame(np.random.randn(10, 4)) + df.to_csv("tmp2.csv", sep=":", index=False) + pd.read_csv("tmp2.csv", sep=None, engine="python") + +.. ipython:: python + :suppress: + + os.remove("tmp2.csv") + +.. _io.multiple_files: + +Reading multiple files to create a single DataFrame +''''''''''''''''''''''''''''''''''''''''''''''''''' + +It's best to use :func:`~pandas.concat` to combine multiple files. +See the :ref:`cookbook` for an example. + +.. _io.chunking: + +Iterating through files chunk by chunk +'''''''''''''''''''''''''''''''''''''' + +Suppose you wish to iterate through a (potentially very large) file lazily +rather than reading the entire file into memory, such as the following: + + +..
ipython:: python + + df = pd.DataFrame(np.random.randn(10, 4)) + df.to_csv("tmp.csv", index=False) + table = pd.read_csv("tmp.csv") + table + + +By specifying a ``chunksize`` to ``read_csv``, the return +value will be an iterable object of type ``TextFileReader``: + +.. ipython:: python + + with pd.read_csv("tmp.csv", chunksize=4) as reader: + print(reader) + for chunk in reader: + print(chunk) + +.. versionchanged:: 1.2 + + ``read_csv/json/sas`` return a context-manager when iterating through a file. + +Specifying ``iterator=True`` will also return the ``TextFileReader`` object: + +.. ipython:: python + + with pd.read_csv("tmp.csv", iterator=True) as reader: + print(reader.get_chunk(5)) + +.. ipython:: python + :suppress: + + os.remove("tmp.csv") + +Specifying the parser engine +'''''''''''''''''''''''''''' + +pandas currently supports three engines, the C engine, the python engine, and an experimental +pyarrow engine (requires the ``pyarrow`` package). In general, the pyarrow engine is fastest +on larger workloads and is equivalent in speed to the C engine on most other workloads. +The python engine tends to be slower than the pyarrow and C engines on most workloads. However, +the pyarrow engine is much less robust than the C engine, which lacks a few features compared to the +Python engine. + +Where possible, pandas uses the C parser (specified as ``engine='c'``), but it may fall +back to Python if C-unsupported options are specified. + +Currently, options unsupported by the C and pyarrow engines include: + +* ``sep`` other than a single character (e.g. regex separators) +* ``skipfooter`` + +Specifying any of the above options will produce a ``ParserWarning`` unless the +python engine is selected explicitly using ``engine='python'``. 
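The fallback described above can be observed directly. The following is a minimal sketch, not part of the original guide: the sample data and the variable names (``fallback``, ``explicit``, ``caught``) are illustrative assumptions. It reads the same text twice with ``skipfooter``, once with the default engine and once with ``engine='python'``, and records whether a ``ParserWarning`` was emitted in each case.

```python
import warnings
from io import StringIO

import pandas as pd

# Illustrative data: the last line is a footer we want to drop.
data = "a,b\n1,2\n3,4\ntotal,6"

# No engine given: ``skipfooter`` is unsupported by the C parser, so pandas
# is expected to fall back to the Python parser and warn about it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = pd.read_csv(StringIO(data), skipfooter=1)

# Engine selected explicitly: same result, no fallback warning expected.
with warnings.catch_warnings(record=True) as caught_explicit:
    warnings.simplefilter("always")
    explicit = pd.read_csv(StringIO(data), skipfooter=1, engine="python")

fallback_warned = any(
    issubclass(w.category, pd.errors.ParserWarning) for w in caught
)
explicit_warned = any(
    issubclass(w.category, pd.errors.ParserWarning) for w in caught_explicit
)
print(fallback_warned, explicit_warned)
```

Both calls should return the same two-row frame; only the first is expected to warn. Passing ``engine='python'`` whenever you rely on one of these options keeps the choice of parser explicit.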
+ +Options that are unsupported by the pyarrow engine which are not covered by the list above include: + +* ``float_precision`` +* ``chunksize`` +* ``comment`` +* ``nrows`` +* ``thousands`` +* ``memory_map`` +* ``dialect`` +* ``on_bad_lines`` +* ``quoting`` +* ``lineterminator`` +* ``converters`` +* ``decimal`` +* ``iterator`` +* ``dayfirst`` +* ``verbose`` +* ``skipinitialspace`` +* ``low_memory`` + +Specifying these options with ``engine='pyarrow'`` will raise a ``ValueError``. + +.. _io.remote: + +Reading/writing remote files +'''''''''''''''''''''''''''' + +You can pass in a URL to read or write remote files to many of pandas' IO +functions - the following example shows reading a CSV file: + +.. code-block:: python + + df = pd.read_csv("https://download.bls.gov/pub/time.series/cu/cu.item", sep="\t") + +.. versionadded:: 1.3.0 + +A custom header can be sent alongside HTTP(s) requests by passing a dictionary +of header key value mappings to the ``storage_options`` keyword argument as shown below: + +.. code-block:: python + + headers = {"User-Agent": "pandas"} + df = pd.read_csv( + "https://download.bls.gov/pub/time.series/cu/cu.item", + sep="\t", + storage_options=headers + ) + +All URLs which are not local files or HTTP(s) are handled by +`fsspec`_, if installed, and its various filesystem implementations +(including Amazon S3, Google Cloud, SSH, FTP, webHDFS...). +Some of these implementations will require additional packages to be +installed, for example +S3 URLs require the `s3fs +`_ library: + +.. code-block:: python + + df = pd.read_json("s3://pandas-test/adatafile.json") + +When dealing with remote storage systems, you might need +extra configuration with environment variables or config files in +special locations. For example, to access data in your S3 bucket, +you will need to define credentials in one of the several ways listed in +the `S3Fs documentation +`_. 
The same is true +for several of the storage backends, and you should follow the links +at `fsimpl1`_ for implementations built into ``fsspec`` and `fsimpl2`_ +for those not included in the main ``fsspec`` +distribution. + +You can also pass parameters directly to the backend driver. Since ``fsspec`` does not +utilize the ``AWS_S3_HOST`` environment variable, we can directly define a +dictionary containing the endpoint_url and pass the object into the storage +option parameter: + +.. code-block:: python + + storage_options = {"client_kwargs": {"endpoint_url": "http://127.0.0.1:5555"}} + df = pd.read_json("s3://pandas-test/test-1", storage_options=storage_options) + +More sample configurations and documentation can be found at `S3Fs documentation +`__. + +If you do *not* have S3 credentials, you can still access public +data by specifying an anonymous connection, such as + +.. versionadded:: 1.2.0 + +.. code-block:: python + + pd.read_csv( + "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013" + "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv", + storage_options={"anon": True}, + ) + +``fsspec`` also allows complex URLs, for accessing data in compressed +archives, local caching of files, and more. To locally cache the above +example, you would modify the call to + +.. code-block:: python + + pd.read_csv( + "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/" + "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv", + storage_options={"s3": {"anon": True}}, + ) + +where we specify that the "anon" parameter is meant for the "s3" part of +the implementation, not to the caching implementation. Note that this caches to a temporary +directory for the duration of the session only, but you can also specify +a permanent store. + +.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/ +.. _fsimpl1: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations +.. 
_fsimpl2: https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations + +Writing out data +'''''''''''''''' + +.. _io.store_in_csv: + +Writing to CSV format ++++++++++++++++++++++ + +The ``Series`` and ``DataFrame`` objects have an instance method ``to_csv`` which +allows storing the contents of the object as a comma-separated-values file. The +function takes a number of arguments. Only the first is required. + +* ``path_or_buf``: A string path to the file to write or a file object. If a file object it must be opened with ``newline=''`` +* ``sep`` : Field delimiter for the output file (default ",") +* ``na_rep``: A string representation of a missing value (default '') +* ``float_format``: Format string for floating point numbers +* ``columns``: Columns to write (default None) +* ``header``: Whether to write out the column names (default True) +* ``index``: whether to write row (index) names (default True) +* ``index_label``: Column label(s) for index column(s) if desired. If None + (default), and ``header`` and ``index`` are True, then the index names are + used. (A sequence should be given if the ``DataFrame`` uses MultiIndex). +* ``mode`` : Python write mode, default 'w' +* ``encoding``: a string representing the encoding to use if the contents are + non-ASCII, for Python versions prior to 3 +* ``lineterminator``: Character sequence denoting line end (default ``os.linesep``) +* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). 
Note that if you have set a ``float_format`` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric +* ``quotechar``: Character used to quote fields (default '"') +* ``doublequote``: Control quoting of ``quotechar`` in fields (default True) +* ``escapechar``: Character used to escape ``sep`` and ``quotechar`` when + appropriate (default None) +* ``chunksize``: Number of rows to write at a time +* ``date_format``: Format string for datetime objects + +Writing a formatted string +++++++++++++++++++++++++++ + +.. _io.formatting: + +The ``DataFrame`` object has an instance method ``to_string`` which allows control +over the string representation of the object. All arguments are optional: + +* ``buf`` default None, for example a StringIO object +* ``columns`` default None, which columns to write +* ``col_space`` default None, minimum width of each column. +* ``na_rep`` default ``NaN``, representation of NA value +* ``formatters`` default None, a dictionary (by column) of functions each of + which takes a single argument and returns a formatted string +* ``float_format`` default None, a function which takes a single (float) + argument and returns a formatted string; to be applied to floats in the + ``DataFrame``. +* ``sparsify`` default True, set to False for a ``DataFrame`` with a hierarchical + index to print every MultiIndex key at each row. +* ``index_names`` default True, will print the names of the indices +* ``index`` default True, will print the index (ie, row labels) +* ``header`` default True, will print the column labels +* ``justify`` default ``left``, will print column headers left- or + right-justified + +The ``Series`` object also has a ``to_string`` method, but with only the ``buf``, +``na_rep``, ``float_format`` arguments. There is also a ``length`` argument +which, if set to ``True``, will additionally output the length of the Series. 
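A few of the ``to_csv`` and ``to_string`` arguments listed above can be combined as in the following minimal sketch. The frame contents and variable names are assumptions made for illustration; ``lineterminator`` is set explicitly so the output does not depend on the platform default:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
df = pd.DataFrame({"x": [1.5, np.nan], "y": ["a", "b"]})

# Write to an in-memory buffer instead of a file path.
buf = io.StringIO()
df.to_csv(
    buf,
    index=False,          # do not write row labels
    na_rep="NA",          # text used for missing values
    float_format="%.2f",  # two decimal places for floats
    lineterminator="\n",  # fixed line ending, independent of os.linesep
)
csv_text = buf.getvalue()

# String representation with a custom marker for missing values.
frame_text = df.to_string(index=False, na_rep="-")

# Series.to_string accepts length=True to append the Series length.
series_text = df["x"].to_string(na_rep="-", length=True)

print(csv_text)
print(frame_text)
print(series_text)
```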
diff --git a/doc/source/user_guide/io/excel.rst b/doc/source/user_guide/io/excel.rst new file mode 100644 index 0000000000000..41ff0e7477235 --- /dev/null +++ b/doc/source/user_guide/io/excel.rst @@ -0,0 +1,531 @@ +.. _io.excel: + +=========== +Excel files +=========== + +The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files +using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files +can be read using ``xlrd``. Binary Excel (``.xlsb``) +files can be read using ``pyxlsb``. All formats can be read +using :ref:`calamine` engine. +The :meth:`~DataFrame.to_excel` instance method is used for +saving a ``DataFrame`` to Excel. Generally the semantics are +similar to working with :ref:`csv` data. +See the :ref:`cookbook` for some advanced strategies. + +.. note:: + + When ``engine=None``, the following logic will be used to determine the engine: + + - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), + then `odf `_ will be used. + - Otherwise if ``path_or_buffer`` is an xls format, ``xlrd`` will be used. + - Otherwise if ``path_or_buffer`` is in xlsb format, ``pyxlsb`` will be used. + - Otherwise ``openpyxl`` will be used. + +.. _io.excel_reader: + +Reading Excel files +''''''''''''''''''' + +In the most basic use-case, ``read_excel`` takes a path to an Excel +file, and the ``sheet_name`` indicating which sheet to parse. + +When using the ``engine_kwargs`` parameter, pandas will pass these arguments to the +engine. For this, it is important to know which function pandas is +using internally. + +* For the engine openpyxl, pandas is using :func:`openpyxl.load_workbook` to read in (``.xlsx``) and (``.xlsm``) files. + +* For the engine xlrd, pandas is using :func:`xlrd.open_workbook` to read in (``.xls``) files. + +* For the engine pyxlsb, pandas is using :func:`pyxlsb.open_workbook` to read in (``.xlsb``) files. + +* For the engine odf, pandas is using :func:`odf.opendocument.load` to read in (``.ods``) files. 
+ +* For the engine calamine, pandas is using :func:`python_calamine.load_workbook` + to read in (``.xlsx``), (``.xlsm``), (``.xls``), (``.xlsb``), (``.ods``) files. + +.. code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.xls", sheet_name="Sheet1") + + +.. _io.excel.excelfile_class: + +``ExcelFile`` class ++++++++++++++++++++ + +To facilitate working with multiple sheets from the same file, the ``ExcelFile`` +class can be used to wrap the file and can be passed into ``read_excel`` +There will be a performance benefit for reading multiple sheets as the file is +read into memory only once. + +.. code-block:: python + + xlsx = pd.ExcelFile("path_to_file.xls") + df = pd.read_excel(xlsx, "Sheet1") + +The ``ExcelFile`` class can also be used as a context manager. + +.. code-block:: python + + with pd.ExcelFile("path_to_file.xls") as xls: + df1 = pd.read_excel(xls, "Sheet1") + df2 = pd.read_excel(xls, "Sheet2") + +The ``sheet_names`` property will generate +a list of the sheet names in the file. + +The primary use-case for an ``ExcelFile`` is parsing multiple sheets with +different parameters: + +.. code-block:: python + + data = {} + # For when Sheet1's format differs from Sheet2 + with pd.ExcelFile("path_to_file.xls") as xls: + data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"]) + data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1) + +Note that if the same parsing parameters are used for all sheets, a list +of sheet names can simply be passed to ``read_excel`` with no loss in performance. + +.. 
code-block:: python + + # using the ExcelFile class + data = {} + with pd.ExcelFile("path_to_file.xls") as xls: + data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"]) + data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=None, na_values=["NA"]) + + # equivalent using the read_excel function + data = pd.read_excel( + "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"] + ) + +``ExcelFile`` can also be called with a ``xlrd.book.Book`` object +as a parameter. This allows the user to control how the Excel file is read. +For example, sheets can be loaded on demand by calling ``xlrd.open_workbook()`` +with ``on_demand=True``. + +.. code-block:: python + + import xlrd + + xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True) + with pd.ExcelFile(xlrd_book) as xls: + df1 = pd.read_excel(xls, "Sheet1") + df2 = pd.read_excel(xls, "Sheet2") + +.. _io.excel.specifying_sheets: + +Specifying sheets ++++++++++++++++++ + +.. note:: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``. + +.. note:: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets. + +* The argument ``sheet_name`` allows specifying the sheet or sheets to read. +* The default value for ``sheet_name`` is 0, which reads the first sheet. +* Pass a string to refer to the name of a particular sheet in the workbook. +* Pass an integer to refer to the index of a sheet. Indices follow Python + convention, beginning at 0. +* Pass a list of either strings or integers to return a dictionary of specified sheets. +* Pass ``None`` to return a dictionary of all available sheets. + +.. code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"]) + +Using the sheet index: + +.. code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.xls", 0, index_col=None, na_values=["NA"]) + +Using all default values: + +.. 
code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.xls") + +Using None to get all sheets: + +.. code-block:: python + + # Returns a dictionary of DataFrames + pd.read_excel("path_to_file.xls", sheet_name=None) + +Using a list to get multiple sheets: + +.. code-block:: python + + # Returns the 1st and 4th sheet, as a dictionary of DataFrames. + pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3]) + +``read_excel`` can read more than one sheet, by setting ``sheet_name`` to either +a list of sheet names, a list of sheet positions, or ``None`` to read all sheets. +Sheets can be specified by sheet index or sheet name, using an integer or string, +respectively. + +.. _io.excel.reading_multiindex: + +Reading a ``MultiIndex`` +++++++++++++++++++++++++ + +``read_excel`` can read a ``MultiIndex`` index, by passing a list of columns to ``index_col`` +and a ``MultiIndex`` column by passing a list of rows to ``header``. If either the ``index`` +or ``columns`` have serialized level names those will be read in as well by specifying +the rows/columns that make up the levels. + +For example, to read in a ``MultiIndex`` index without names: + +.. ipython:: python + + df = pd.DataFrame( + {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}, + index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]), + ) + df.to_excel("path_to_file.xlsx") + df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1]) + df + +If the index has level names, they will be parsed as well, using the same +parameters. + +.. ipython:: python + + df.index = df.index.set_names(["lvl1", "lvl2"]) + df.to_excel("path_to_file.xlsx") + df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1]) + df + + +If the source file has both ``MultiIndex`` index and columns, lists specifying each +should be passed to ``index_col`` and ``header``: + +.. 
ipython:: python + + df.columns = pd.MultiIndex.from_product([["a"], ["b", "d"]], names=["c1", "c2"]) + df.to_excel("path_to_file.xlsx") + df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1], header=[0, 1]) + df + +.. ipython:: python + :suppress: + + os.remove("path_to_file.xlsx") + +Missing values in columns specified in ``index_col`` will be forward filled to +allow roundtripping with ``to_excel`` for ``merge_cells=True``. To avoid forward +filling the missing values, use ``set_index`` after reading the data instead of +``index_col``. + +Parsing specific columns +++++++++++++++++++++++++ + +It is often the case that users will insert columns to do temporary computations +in Excel and you may not want to read in those columns. ``read_excel`` takes +a ``usecols`` keyword to allow you to specify a subset of columns to parse. + +You can specify a comma-delimited set of Excel columns and ranges as a string: + +.. code-block:: python + + pd.read_excel("path_to_file.xls", "Sheet1", usecols="A,C:E") + +If ``usecols`` is a list of integers, then it is assumed to be the file column +indices to be parsed. + +.. code-block:: python + + pd.read_excel("path_to_file.xls", "Sheet1", usecols=[0, 2, 3]) + +Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. + +If ``usecols`` is a list of strings, it is assumed that each string corresponds +to a column name provided either by the user in ``names`` or inferred from the +document header row(s). Those strings define which columns will be parsed: + +.. code-block:: python + + pd.read_excel("path_to_file.xls", "Sheet1", usecols=["foo", "bar"]) + +Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``. + +If ``usecols`` is callable, the callable function will be evaluated against +the column names, returning names where the callable function evaluates to ``True``. + +.. 
code-block:: python + + pd.read_excel("path_to_file.xls", "Sheet1", usecols=lambda x: x.isalpha()) + +Parsing dates ++++++++++++++ + +Datetime-like values are normally automatically converted to the appropriate +dtype when reading the Excel file. But if you have a column of strings that +*look* like dates (but are not actually formatted as dates in Excel), you can +use the ``parse_dates`` keyword to parse those strings to datetimes: + +.. code-block:: python + + pd.read_excel("path_to_file.xls", "Sheet1", parse_dates=["date_strings"]) + + +Cell converters ++++++++++++++++ + +It is possible to transform the contents of Excel cells via the ``converters`` +option. For instance, to convert a column to boolean: + +.. code-block:: python + + pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyBools": bool}) + +This option handles missing values and treats exceptions in the converters +as missing data. Transformations are applied cell by cell rather than to the +column as a whole, so the array dtype is not guaranteed. For instance, a +column of integers with missing values cannot be transformed to an array +with integer dtype, because NaN is strictly a float. You can manually mask +missing data to recover integer dtype: + +.. code-block:: python + + def cfun(x): + return int(x) if x else -1 + + + pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun}) + +Dtype specifications +++++++++++++++++++++ + +As an alternative to converters, the type for an entire column can +be specified using the ``dtype`` keyword, which takes a dictionary +mapping column names to types. To interpret data with +no type inference, use the type ``str`` or ``object``. + +.. code-block:: python + + pd.read_excel("path_to_file.xls", dtype={"MyInts": "int64", "MyText": str}) + +.. 
_io.excel_writer: + +Writing Excel files +''''''''''''''''''' + +Writing Excel files to disk ++++++++++++++++++++++++++++ + +To write a ``DataFrame`` object to a sheet of an Excel file, you can use the +``to_excel`` instance method. The arguments are largely the same as ``to_csv`` +described above, the first argument being the name of the excel file, and the +optional second argument the name of the sheet to which the ``DataFrame`` should be +written. For example: + +.. code-block:: python + + df.to_excel("path_to_file.xlsx", sheet_name="Sheet1") + +Files with a +``.xlsx`` extension will be written using ``xlsxwriter`` (if available) or +``openpyxl``. + +The ``DataFrame`` will be written in a way that tries to mimic the REPL output. +The ``index_label`` will be placed in the second +row instead of the first. You can place it in the first row by setting the +``merge_cells`` option in ``to_excel()`` to ``False``: + +.. code-block:: python + + df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False) + +In order to write separate ``DataFrames`` to separate sheets in a single Excel file, +one can pass an :class:`~pandas.io.excel.ExcelWriter`. + +.. code-block:: python + + with pd.ExcelWriter("path_to_file.xlsx") as writer: + df1.to_excel(writer, sheet_name="Sheet1") + df2.to_excel(writer, sheet_name="Sheet2") + +.. _io.excel_writing_buffer: + +When using the ``engine_kwargs`` parameter, pandas will pass these arguments to the +engine. For this, it is important to know which function pandas is using internally. + +* For the engine openpyxl, pandas is using :func:`openpyxl.Workbook` to create a new sheet and :func:`openpyxl.load_workbook` to append data to an existing sheet. The openpyxl engine writes to (``.xlsx``) and (``.xlsm``) files. + +* For the engine xlsxwriter, pandas is using :func:`xlsxwriter.Workbook` to write to (``.xlsx``) files. 
+ +* For the engine odf, pandas is using :func:`odf.opendocument.OpenDocumentSpreadsheet` to write to (``.ods``) files. + +Writing Excel files to memory ++++++++++++++++++++++++++++++ + +pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or +``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`. + +.. code-block:: python + + from io import BytesIO + + bio = BytesIO() + + # By setting the 'engine' in the ExcelWriter constructor. + writer = pd.ExcelWriter(bio, engine="xlsxwriter") + df.to_excel(writer, sheet_name="Sheet1") + + # Close the writer to save the workbook (``save()`` was removed in pandas 2.0) + writer.close() + + # Seek to the beginning and read to copy the workbook to a variable in memory + bio.seek(0) + workbook = bio.read() + +.. note:: + + ``engine`` is optional but recommended. Setting the engine determines + the version of workbook produced. Setting ``engine='xlrd'`` will produce an + Excel 2003-format workbook (xls). Using either ``'openpyxl'`` or + ``'xlsxwriter'`` will produce an Excel 2007-format workbook (xlsx). If + omitted, an Excel 2007-formatted workbook is produced. + + +.. _io.excel.writers: + +Excel writer engines +'''''''''''''''''''' + +pandas chooses an Excel writer via two methods: + +1. the ``engine`` keyword argument +2. the filename extension (via the default specified in config options) + +By default, pandas uses the `XlsxWriter`_ for ``.xlsx``, `openpyxl`_ +for ``.xlsm``. If you have multiple +engines installed, you can set the default engine through :ref:`setting the +config options ` ``io.excel.xlsx.writer`` and +``io.excel.xls.writer``. pandas will fall back on `openpyxl`_ for ``.xlsx`` +files if `Xlsxwriter`_ is not available. + +.. _XlsxWriter: https://xlsxwriter.readthedocs.io +.. _openpyxl: https://openpyxl.readthedocs.io/ + +To specify which writer you want to use, you can pass an engine keyword +argument to ``to_excel`` and to ``ExcelWriter``. 
The built-in engines are: + +* ``openpyxl``: version 2.4 or higher is required +* ``xlsxwriter`` + +.. 
code-block:: python + + # By setting the 'engine' in the DataFrame 'to_excel()' methods. + df.to_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="xlsxwriter") + + # By setting the 'engine' in the ExcelWriter constructor. + writer = pd.ExcelWriter("path_to_file.xlsx", engine="xlsxwriter") + + # Or via pandas configuration. + from pandas import options # noqa: E402 + + options.io.excel.xlsx.writer = "xlsxwriter" + + df.to_excel("path_to_file.xlsx", sheet_name="Sheet1") + +.. _io.excel.style: + +Style and formatting +'''''''''''''''''''' + +The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the ``DataFrame``'s ``to_excel`` method. + +* ``float_format`` : Format string for floating point numbers (default ``None``). +* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). + +.. note:: + + As of pandas 3.0, by default spreadsheets created with the ``to_excel`` method + will not contain any styling. Users wishing to bold text, add bordered styles, + etc in a worksheet output by ``to_excel`` can do so by using :meth:`Styler.to_excel` + to create styled excel files. For documentation on styling spreadsheets, see + `here `__. + + +.. code-block:: python + + css = "border: 1px solid black; font-weight: bold;" + df.style.map_index(lambda x: css).map_index(lambda x: css, axis=1).to_excel("myfile.xlsx") + +Using the `Xlsxwriter`_ engine provides many options for controlling the +format of an Excel worksheet created with the ``to_excel`` method. Excellent examples can be found in the +`Xlsxwriter`_ documentation here: https://xlsxwriter.readthedocs.io/working_with_pandas.html + +.. 
_io.ods: + +OpenDocument Spreadsheets +''''''''''''''''''''''''' + +The io methods for `Excel files`_ also support reading and writing OpenDocument spreadsheets +using the `odfpy `__ module. The semantics and features for reading and writing +OpenDocument spreadsheets match what can be done for `Excel files`_ using +``engine='odf'``. The optional dependency 'odfpy' needs to be installed. + +The :func:`~pandas.read_excel` method can read OpenDocument spreadsheets: + +.. code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.ods", engine="odf") + +Similarly, the :meth:`~DataFrame.to_excel` method can write OpenDocument spreadsheets: + +.. code-block:: python + + # Writes DataFrame to a .ods file + df.to_excel("path_to_file.ods", engine="odf") + +.. _io.xlsb: + +Binary Excel (.xlsb) files +'''''''''''''''''''''''''' + +The :func:`~pandas.read_excel` method can also read binary Excel files +using the ``pyxlsb`` module. The semantics and features for reading +binary Excel files mostly match what can be done for `Excel files`_ using +``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types +in files and will return floats instead (you can use :ref:`calamine` +if you need to recognize datetime types). + +.. code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.xlsb", engine="pyxlsb") + +.. note:: + + Currently pandas only supports *reading* binary Excel files. Writing + is not implemented. + +.. _io.calamine: + +Calamine (Excel and ODS files) +'''''''''''''''''''''''''''''' + +The :func:`~pandas.read_excel` method can read Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``) +and OpenDocument spreadsheets (``.ods``) using the ``python-calamine`` module. +This module is a binding for the Rust library `calamine `__ +and is faster than other engines in most cases. The optional dependency 'python-calamine' needs to be installed. + +.. 
code-block:: python + + # Returns a DataFrame + pd.read_excel("path_to_file.xlsb", engine="calamine") diff --git a/doc/source/user_guide/io/feather.rst b/doc/source/user_guide/io/feather.rst new file mode 100644 index 0000000000000..713660e7e0260 --- /dev/null +++ b/doc/source/user_guide/io/feather.rst @@ -0,0 +1,67 @@ +.. _io.feather: + +======= +Feather +======= + +Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data +frames efficient, and to make sharing data across data analysis languages easy. + +Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas +dtypes, including extension dtypes such as categorical and datetime with tz. + +Several caveats: + +* The format will NOT write an ``Index``, or ``MultiIndex`` for the + ``DataFrame`` and will raise an error if a non-default one is provided. You + can ``.reset_index()`` to store the index or ``.reset_index(drop=True)`` to + ignore it. +* Duplicate column names and non-string columns names are not supported +* Actual Python objects in object dtype columns are not supported. These will + raise a helpful error message on an attempt at serialization. + +See the `Full Documentation `__. + +.. ipython:: python + + import pytz + + df = pd.DataFrame( + { + "a": list("abc"), + "b": list(range(1, 4)), + "c": np.arange(3, 6).astype("u1"), + "d": np.arange(4.0, 7.0, dtype="float64"), + "e": [True, False, True], + "f": pd.Categorical(list("abc")), + "g": pd.date_range("20130101", periods=3), + "h": pd.date_range("20130101", periods=3, tz=pytz.timezone("US/Eastern")), + "i": pd.date_range("20130101", periods=3, freq="ns"), + } + ) + + df + df.dtypes + +Write to a feather file. + +.. ipython:: python + :okwarning: + + df.to_feather("example.feather") + +Read from a feather file. + +.. ipython:: python + :okwarning: + + result = pd.read_feather("example.feather") + result + + # we preserve dtypes + result.dtypes + +.. 
ipython:: python + :suppress: + + os.remove("example.feather") diff --git a/doc/source/user_guide/io/hdf5.rst b/doc/source/user_guide/io/hdf5.rst new file mode 100644 index 0000000000000..55457339f0179 --- /dev/null +++ b/doc/source/user_guide/io/hdf5.rst @@ -0,0 +1,1096 @@ +.. _io.hdf5: + +=============== +HDF5 (PyTables) +=============== + +``HDFStore`` is a dict-like object which reads and writes pandas using +the high performance HDF5 format using the excellent `PyTables +`__ library. See the :ref:`cookbook ` +for some advanced strategies + +.. warning:: + + pandas uses PyTables for reading and writing HDF5 files, which allows + serializing object-dtype data with pickle. Loading pickled data received from + untrusted sources can be unsafe. + + See: https://docs.python.org/3/library/pickle.html for more. + +.. ipython:: python + :suppress: + :okexcept: + + os.remove("store.h5") + +.. ipython:: python + + store = pd.HDFStore("store.h5") + print(store) + +Objects can be written to the file just like adding key-value pairs to a +dict: + +.. ipython:: python + + index = pd.date_range("1/1/2000", periods=8) + s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"]) + df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"]) + + # store.put('s', s) is an equivalent method + store["s"] = s + + store["df"] = df + + store + +In a current or later Python session, you can retrieve stored objects: + +.. ipython:: python + + # store.get('df') is an equivalent method + store["df"] + + # dotted (attribute) access provides get as well + store.df + +Deletion of the object specified by the key: + +.. ipython:: python + + # store.remove('df') is an equivalent method + del store["df"] + + store + +Closing a Store and using a context manager: + +.. ipython:: python + + store.close() + store + store.is_open + + # Working with, and automatically closing the store using a context manager + with pd.HDFStore("store.h5") as store: + store.keys() + +.. 
ipython:: python + :suppress: + + store.close() + os.remove("store.h5") + + + +Read/write API +'''''''''''''' + +``HDFStore`` supports a top-level API using ``read_hdf`` for reading and ``to_hdf`` for writing, +similar to how ``read_csv`` and ``to_csv`` work. + +.. ipython:: python + + df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))}) + df_tl.to_hdf("store_tl.h5", key="table", append=True) + pd.read_hdf("store_tl.h5", "table", where=["index>2"]) + +.. ipython:: python + :suppress: + :okexcept: + + os.remove("store_tl.h5") + + +HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting ``dropna=True``. + + +.. ipython:: python + + df_with_missing = pd.DataFrame( + { + "col1": [0, np.nan, 2], + "col2": [1, np.nan, np.nan], + } + ) + df_with_missing + + df_with_missing.to_hdf("file.h5", key="df_with_missing", format="table", mode="w") + + pd.read_hdf("file.h5", "df_with_missing") + + df_with_missing.to_hdf( + "file.h5", key="df_with_missing", format="table", mode="w", dropna=True + ) + pd.read_hdf("file.h5", "df_with_missing") + + +.. ipython:: python + :suppress: + + os.remove("file.h5") + + +.. _io.hdf5-fixed: + +Fixed format +'''''''''''' + +The examples above show storing using ``put``, which write the HDF5 to ``PyTables`` in a fixed array format, called +the ``fixed`` format. These types of stores are **not** appendable once written (though you can simply +remove them and rewrite). Nor are they **queryable**; they must be +retrieved in their entirety. They also do not support dataframes with non-unique column names. +The ``fixed`` format stores offer very fast writing and slightly faster reading than ``table`` stores. +This format is specified by default when using ``put`` or ``to_hdf`` or by ``format='fixed'`` or ``format='f'``. + +.. warning:: + + A ``fixed`` format will raise a ``TypeError`` if you try to retrieve using a ``where``: + + .. 
ipython:: python + :okexcept: + + pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", key="df") + pd.read_hdf("test_fixed.h5", "df", where="index>5") + + .. ipython:: python + :suppress: + + os.remove("test_fixed.h5") + + +.. _io.hdf5-table: + +Table format +'''''''''''' + +``HDFStore`` supports another ``PyTables`` format on disk, the ``table`` +format. Conceptually a ``table`` is shaped very much like a DataFrame, +with rows and columns. A ``table`` may be appended to in the same or +other sessions. In addition, delete and query type operations are +supported. This format is specified by ``format='table'`` or ``format='t'`` +to ``append`` or ``put`` or ``to_hdf``. + +This format can be set as an option as well ``pd.set_option('io.hdf.default_format','table')`` to +enable ``put/append/to_hdf`` to by default store in the ``table`` format. + +.. ipython:: python + :suppress: + :okexcept: + + os.remove("store.h5") + +.. ipython:: python + + store = pd.HDFStore("store.h5") + df1 = df[0:4] + df2 = df[4:] + + # append data (creates a table automatically) + store.append("df", df1) + store.append("df", df2) + store + + # select the entire object + store.select("df") + + # the type of stored data + store.root.df._v_attrs.pandas_type + +.. note:: + + You can also create a ``table`` by passing ``format='table'`` or ``format='t'`` to a ``put`` operation. + +.. _io.hdf5-keys: + +Hierarchical keys +''''''''''''''''' + +Keys to a store can be specified as a string. These can be in a +hierarchical path-name like format (e.g. ``foo/bar/bah``), which will +generate a hierarchy of sub-stores (or ``Groups`` in PyTables +parlance). Keys can be specified without the leading '/' and are **always** +absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove +everything in the sub-store and **below**, so be *careful*. + +.. 
ipython:: python + + store.put("foo/bar/bah", df) + store.append("food/orange", df) + store.append("food/apple", df) + store + + # a list of keys are returned + store.keys() + + # remove all nodes under this level + store.remove("food") + store + + +You can walk through the group hierarchy using the ``walk`` method which +will yield a tuple for each group key along with the relative keys of its contents. + +.. ipython:: python + + for (path, subgroups, subkeys) in store.walk(): + for subgroup in subgroups: + print("GROUP: {}/{}".format(path, subgroup)) + for subkey in subkeys: + key = "/".join([path, subkey]) + print("KEY: {}".format(key)) + print(store.get(key)) + + + +.. warning:: + + Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored under the root node. + + .. ipython:: python + :okexcept: + + store.foo.bar.bah + + .. ipython:: python + + # you can directly access the actual PyTables node but using the root node + store.root.foo.bar.bah + + Instead, use explicit string based keys: + + .. ipython:: python + + store["foo/bar/bah"] + + +.. _io.hdf5-types: + +Storing types +''''''''''''' + +Storing mixed types in a table +++++++++++++++++++++++++++++++ + +Storing mixed-dtype data is supported. Strings are stored as a +fixed-width using the maximum size of the appended column. Subsequent attempts +at appending longer strings will raise a ``ValueError``. + +Passing ``min_itemsize={`values`: size}`` as a parameter to append +will set a larger minimum for the string columns. Storing ``floats, +strings, ints, bools, datetime64`` are currently supported. For string +columns, passing ``nan_rep = 'nan'`` to append will change the default +nan representation on disk (which converts to/from ``np.nan``), this +defaults to ``nan``. + +.. 
ipython:: python + + df_mixed = pd.DataFrame( + { + "A": np.random.randn(8), + "B": np.random.randn(8), + "C": np.array(np.random.randn(8), dtype="float32"), + "string": "string", + "int": 1, + "bool": True, + "datetime64": pd.Timestamp("20010102"), + }, + index=list(range(8)), + ) + df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan + + store.append("df_mixed", df_mixed, min_itemsize={"values": 50}) + df_mixed1 = store.select("df_mixed") + df_mixed1 + df_mixed1.dtypes.value_counts() + + # we have provided a minimum string column size + store.root.df_mixed.table + +Storing MultiIndex DataFrames ++++++++++++++++++++++++++++++ + +Storing MultiIndex ``DataFrames`` as tables is very similar to +storing/selecting from homogeneous index ``DataFrames``. + +.. ipython:: python + + index = pd.MultiIndex( + levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], + codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + names=["foo", "bar"], + ) + df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"]) + df_mi + + store.append("df_mi", df_mi) + store.select("df_mi") + + # the levels are automatically included as data columns + store.select("df_mi", "foo=bar") + +.. note:: + The ``index`` keyword is reserved and cannot be used as a level name. + +.. _io.hdf5-query: + +Querying +'''''''' + +Querying a table +++++++++++++++++ + +``select`` and ``delete`` operations have an optional criterion that can +be specified to select/delete only a subset of the data. This allows one +to have a very large on-disk table and retrieve only a portion of the +data. + +A query is specified using the ``Term`` class under the hood, as a boolean expression. + +* ``index`` and ``columns`` are supported indexers of ``DataFrames``. +* if ``data_columns`` are specified, these can be used as additional indexers. +* level name in a MultiIndex, with default name ``level_0``, ``level_1``, … if not provided.
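The default-name rule in the last bullet above can be sketched in plain Python. This is only an illustration of the naming convention, not the code ``HDFStore`` runs: ``default_level_names`` is a hypothetical helper, and it assumes that unnamed levels are numbered by their position while named levels keep their own name.

```python
def default_level_names(names):
    """Return the names usable in a query for each MultiIndex level.

    Hypothetical helper: named levels keep their name, unnamed (None)
    levels are assumed to be exposed as ``level_<position>``.
    """
    return [
        name if name is not None else f"level_{position}"
        for position, name in enumerate(names)
    ]


# A fully unnamed two-level MultiIndex.
unnamed = default_level_names([None, None])

# A partially named MultiIndex: only the missing name is filled in.
partial = default_level_names(["foo", None])

print(unnamed)
print(partial)
```

Under that assumption, a query such as ``"level_0=foo and level_1=two"`` addresses the first and second levels of an unnamed ``MultiIndex``.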
+ +Valid comparison operators are: + +``=, ==, !=, >, >=, <, <=`` + +Valid boolean expressions are combined with: + +* ``|`` : or +* ``&`` : and +* ``(`` and ``)`` : for grouping + +These rules are similar to how boolean expressions are used in pandas for indexing. + +.. note:: + + - ``=`` will be automatically expanded to the comparison operator ``==`` + - ``~`` is the not operator, but can only be used in very limited + circumstances + - If a list/tuple of expressions is passed they will be combined via ``&`` + +The following are valid expressions: + +* ``'index >= date'`` +* ``"columns = ['A', 'D']"`` +* ``"columns in ['A', 'D']"`` +* ``'columns = A'`` +* ``'columns == A'`` +* ``"~(columns = ['A', 'B'])"`` +* ``'index > df.index[3] & string = "bar"'`` +* ``'(index > df.index[3] & index <= df.index[6]) | string = "bar"'`` +* ``"ts >= Timestamp('2012-02-01')"`` +* ``"major_axis>=20130101"`` + +The ``indexers`` are on the left-hand side of the sub-expression: + +``columns``, ``major_axis``, ``ts`` + +The right-hand side of the sub-expression (after a comparison operator) can be: + +* functions that will be evaluated, e.g. ``Timestamp('2012-02-01')`` +* strings, e.g. ``"bar"`` +* date-like, e.g. ``20130101``, or ``"20130101"`` +* lists, e.g. ``"['A', 'B']"`` +* variables that are defined in the local namespace, e.g. ``date`` + +.. note:: + + Passing a string to a query by interpolating it into the query + expression is not recommended. Simply assign the string of interest to a + variable and use that variable in an expression. For example, do this + + .. code-block:: python + + string = "HolyMoly'" + store.select("df", "index == string") + + instead of this + + .. code-block:: python + + string = "HolyMoly'" + store.select('df', f'index == {string}') + + The latter will **not** work and will raise a ``SyntaxError``. Note that + there's a single quote followed by a double quote in the ``string`` + variable.
+ + If you *must* interpolate, use the ``'%r'`` format specifier + + .. code-block:: python + + store.select("df", "index == %r" % string) + + which will quote ``string``. + + +Here are some examples: + +.. ipython:: python + + dfq = pd.DataFrame( + np.random.randn(10, 4), + columns=list("ABCD"), + index=pd.date_range("20130101", periods=10), + ) + store.append("dfq", dfq, format="table", data_columns=True) + +Use boolean expressions, with in-line function evaluation. + +.. ipython:: python + + store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']") + +Use inline column reference. + +.. ipython:: python + + store.select("dfq", where="A>0 or C>0") + +The ``columns`` keyword can be supplied to select a list of columns to be +returned; this is equivalent to passing +``'columns=list_of_columns_to_filter'``: + +.. ipython:: python + + store.select("df", "columns=['A', 'B']") + +``start`` and ``stop`` parameters can be specified to limit the total search +space. These are in terms of the total number of rows in a table. + +.. note:: + + ``select`` will raise a ``ValueError`` if the query expression has an unknown + variable reference. Usually this means that you are trying to select on a column + that is **not** a data_column. + + ``select`` will raise a ``SyntaxError`` if the query expression is not valid. + + +.. _io.hdf5-timedelta: + +Query timedelta64[ns] ++++++++++++++++++++++ + +You can store and query using the ``timedelta64[ns]`` type. Terms can be +specified in the format: ``<float>(<unit>)``, where float may be signed (and fractional), and unit can be +``D,s,ms,us,ns`` for the timedelta. Here's an example: + +.. 
ipython:: python + + from datetime import timedelta + + dftd = pd.DataFrame( + { + "A": pd.Timestamp("20130101"), + "B": [ + pd.Timestamp("20130101") + timedelta(days=i, seconds=10) + for i in range(10) + ], + } + ) + dftd["C"] = dftd["A"] - dftd["B"] + dftd + store.append("dftd", dftd, data_columns=True) + store.select("dftd", "C<'-3.5D'") + +.. _io.query_multi: + +Query MultiIndex +++++++++++++++++ + +Selecting from a ``MultiIndex`` can be achieved by using the name of the level. + +.. ipython:: python + + df_mi.index.names + store.select("df_mi", "foo=baz and bar=two") + +If the ``MultiIndex`` level names are ``None``, the levels are automatically made available via +the ``level_n`` keyword with ``n`` the level of the ``MultiIndex`` you want to select from. + +.. ipython:: python + + index = pd.MultiIndex( + levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], + codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], + ) + df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"]) + df_mi_2 + + store.append("df_mi_2", df_mi_2) + + # the levels are automatically included as data columns with keyword level_n + store.select("df_mi_2", "level_0=foo and level_1=two") + + +Indexing +++++++++ + +You can create/modify an index for a table with ``create_table_index`` +after data is already in the table (after an ``append/put`` +operation). Creating a table index is **highly** encouraged. This will +speed your queries a great deal when you use a ``select`` with the +indexed dimension as the ``where``. + +.. note:: + + Indexes are automagically created on the indexables + and any data columns you specify. This behavior can be turned off by passing + ``index=False`` to ``append``. + +.. 
ipython:: python + + # we have automagically already created an index (in the first section) + i = store.root.df.table.cols.index.index + i.optlevel, i.kind + + # change an index by passing new parameters + store.create_table_index("df", optlevel=9, kind="full") + i = store.root.df.table.cols.index.index + i.optlevel, i.kind + +Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate at the end. + +.. ipython:: python + + df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB")) + df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB")) + + st = pd.HDFStore("appends.h5", mode="w") + st.append("df", df_1, data_columns=["B"], index=False) + st.append("df", df_2, data_columns=["B"], index=False) + st.get_storer("df").table + +Then create the index when finished appending. + +.. ipython:: python + + st.create_table_index("df", columns=["B"], optlevel=9, kind="full") + st.get_storer("df").table + + st.close() + +.. ipython:: python + :suppress: + :okexcept: + + os.remove("appends.h5") + +See `here `__ for how to create a completely-sorted-index (CSI) on an existing store. + +.. _io.hdf5-query-data-columns: + +Query via data columns +++++++++++++++++++++++ + +You can designate (and index) certain columns that you want to be able +to perform queries (other than the ``indexable`` columns, which you can +always query). For instance say you want to perform this common +operation, on-disk, and return just the frame that matches this +query. You can specify ``data_columns = True`` to force all columns to +be ``data_columns``. + +.. 
ipython:: python + + df_dc = df.copy() + df_dc["string"] = "foo" + df_dc.loc[df_dc.index[4:6], "string"] = np.nan + df_dc.loc[df_dc.index[7:9], "string"] = "bar" + df_dc["string2"] = "cool" + df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0 + df_dc + + # on-disk operations + store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"]) + store.select("df_dc", where="B > 0") + + # getting creative + store.select("df_dc", "B > 0 & C > 0 & string == foo") + + # this is the in-memory version of this type of selection + df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")] + + # we have automagically created this index and the B/C/string/string2 + # columns are stored separately as ``PyTables`` columns + store.root.df_dc.table + +There is some performance degradation by making lots of columns into +``data columns``, so it is up to the user to designate these. In addition, +you cannot change data columns (nor indexables) after the first +append/put operation (of course, you can simply read in the data and +create a new table!). + +Iterator +++++++++ + +You can pass ``iterator=True`` or ``chunksize=number_in_a_chunk`` +to ``select`` and ``select_as_multiple`` to return an iterator on the results. +The default is 50,000 rows returned in a chunk. + +.. ipython:: python + + for df in store.select("df", chunksize=3): + print(df) + +.. note:: + + You can also use the iterator with ``read_hdf`` which will open, then + automatically close the store when finished iterating. + + .. code-block:: python + + for df in pd.read_hdf("store.h5", "df", chunksize=3): + print(df) + +Note that the ``chunksize`` keyword applies to the **source** rows. So if you +are doing a query, then the chunksize will subdivide the total rows in the table +and the query is applied to each subdivision, returning an iterator on potentially unequal sized chunks. + +Here is a recipe for generating a query and using it to create equal sized return +chunks. + +.. 
ipython:: python + + dfeq = pd.DataFrame({"number": np.arange(1, 11)}) + dfeq + + store.append("dfeq", dfeq, data_columns=["number"]) + + def chunks(l, n): + return [l[i: i + n] for i in range(0, len(l), n)] + + evens = [2, 4, 6, 8, 10] + coordinates = store.select_as_coordinates("dfeq", "number=evens") + for c in chunks(coordinates, 2): + print(store.select("dfeq", where=c)) + +Advanced queries +++++++++++++++++ + +Select a single column +^^^^^^^^^^^^^^^^^^^^^^ + +To retrieve a single indexable or data column, use the +method ``select_column``. This will, for example, enable you to get the index +very quickly. It returns a ``Series`` of the result, indexed by the row number. +It does not currently accept the ``where`` selector. + +.. ipython:: python + + store.select_column("df_dc", "index") + store.select_column("df_dc", "string") + +.. _io.hdf5-selecting_coordinates: + +Selecting coordinates +^^^^^^^^^^^^^^^^^^^^^ + +Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an +``Index`` of the resulting locations. These coordinates can also be passed to subsequent +``where`` operations. + +.. ipython:: python + + df_coord = pd.DataFrame( + np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000) + ) + store.append("df_coord", df_coord) + c = store.select_as_coordinates("df_coord", "index > 20020101") + c + store.select("df_coord", where=c) + +.. _io.hdf5-where_mask: + +Selecting using a where mask +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Sometimes your query can involve creating a list of rows to select. Usually this ``mask`` would +be a resulting ``index`` from an indexing operation. This example selects the rows of +a ``DatetimeIndex`` whose month is 5. + +.. 
ipython:: python + + df_mask = pd.DataFrame( + np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000) + ) + store.append("df_mask", df_mask) + c = store.select_column("df_mask", "index") + where = c[pd.DatetimeIndex(c).month == 5].index + store.select("df_mask", where=where) + +Storer object +^^^^^^^^^^^^^ + +If you want to inspect the stored object, retrieve via +``get_storer``. You could use this programmatically to say get the number +of rows in an object. + +.. ipython:: python + + store.get_storer("df_dc").nrows + + +Multiple table queries +++++++++++++++++++++++ + +The methods ``append_to_multiple`` and +``select_as_multiple`` can perform appending/selecting from +multiple tables at once. The idea is to have one table (call it the +selector table) that you index most/all of the columns, and perform your +queries. The other table(s) are data tables with an index matching the +selector table's index. You can then perform a very fast query +on the selector table, yet get lots of data back. This method is similar to +having a very wide table, but enables more efficient queries. + +The ``append_to_multiple`` method splits a given single DataFrame +into multiple tables according to ``d``, a dictionary that maps the +table names to a list of 'columns' you want in that table. If ``None`` +is used in place of a list, that table will have the remaining +unspecified columns of the given DataFrame. The argument ``selector`` +defines which table is the selector table (which you can make queries from). +The argument ``dropna`` will drop rows from the input ``DataFrame`` to ensure +tables are synchronized. This means that if a row for one of the tables +being written to is entirely ``np.nan``, that row will be dropped from all tables. + +If ``dropna`` is False, **THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES**. 
+Remember that entirely ``np.nan`` rows are not written to the HDFStore, so if +you choose to pass ``dropna=False``, some tables may have more rows than others, +and therefore ``select_as_multiple`` may not work or it may return unexpected +results. + +.. ipython:: python + + df_mt = pd.DataFrame( + np.random.randn(8, 6), + index=pd.date_range("1/1/2000", periods=8), + columns=["A", "B", "C", "D", "E", "F"], + ) + df_mt["foo"] = "bar" + df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan + + # you can also create the tables individually + store.append_to_multiple( + {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt" + ) + store + + # individual tables were created + store.select("df1_mt") + store.select("df2_mt") + + # as a multiple + store.select_as_multiple( + ["df1_mt", "df2_mt"], + where=["A>0", "B>0"], + selector="df1_mt", + ) + + +Delete from a table +''''''''''''''''''' + +You can delete from a table selectively by specifying a ``where``. In +deleting rows, it is important to understand that ``PyTables`` deletes +rows by erasing the rows, then **moving** the following data. Thus +deleting can potentially be a very expensive operation depending on the +orientation of your data. To get optimal performance, it's +worthwhile to have the dimension you are deleting be the first of the +``indexables``. + +Data is ordered (on the disk) in terms of the ``indexables``. Here's a +simple use case. You store panel-type data, with dates in the +``major_axis`` and ids in the ``minor_axis``. The data is then +interleaved like this: + +* date_1 + * id_1 + * id_2 + * . + * id_n +* date_2 + * id_1 + * . + * id_n + +It should be clear that a delete operation on the ``major_axis`` will be +fairly quick, as one chunk is removed, then the following data moved. On +the other hand a delete operation on the ``minor_axis`` will be very +expensive. In this case it would almost certainly be faster to rewrite +the table using a ``where`` that selects all but the missing data.
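The cost difference described above can be illustrated with a toy model in plain Python. This is a sketch of the interleaved layout only and does not touch ``PyTables``; ``contiguous_runs`` is a hypothetical helper that counts how many separate chunks a delete would have to erase, on the assumption that each erased chunk forces the following data to be moved.

```python
from itertools import groupby, product


def contiguous_runs(rows, should_delete):
    """Count the separate contiguous chunks of rows a delete would erase."""
    flags = [should_delete(row) for row in rows]
    # groupby collapses consecutive equal flags; count only the True groups.
    return sum(1 for flag, _ in groupby(flags) if flag)


dates = ["date_1", "date_2", "date_3"]
ids = ["id_1", "id_2", "id_3", "id_4"]

# Rows ordered by date first, then id, as in the interleaved layout above.
rows = list(product(dates, ids))

# Deleting one date (the leading dimension) erases a single chunk.
runs_by_date = contiguous_runs(rows, lambda row: row[0] == "date_2")

# Deleting one id (the trailing dimension) erases one chunk per date.
runs_by_id = contiguous_runs(rows, lambda row: row[1] == "id_2")

print(runs_by_date, runs_by_id)
```

In this model the number of chunks erased when deleting along the trailing dimension grows with the number of dates, which is why rewriting the table can be the cheaper option.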
+ +.. warning:: + + Please note that HDF5 **DOES NOT RECLAIM SPACE** in the h5 files + automatically. Thus, repeatedly deleting (or removing nodes) and adding + again, **WILL TEND TO INCREASE THE FILE SIZE**. + + To *repack and clean* the file, use :ref:`ptrepack <io.hdf5-ptrepack>`. + +.. _io.hdf5-notes: + +Notes & caveats +''''''''''''''' + + +Compression ++++++++++++ + +``PyTables`` allows the stored data to be compressed. This applies to +all kinds of stores, not just tables. Two parameters are used to +control compression: ``complevel`` and ``complib``. + +* ``complevel`` specifies if and how hard data is to be compressed. + ``complevel=0`` and ``complevel=None`` disables compression and + ``0<complevel<10`` enables compression. + +* ``complib`` specifies which compression library to use. + If nothing is specified the default library ``zlib`` is used. + The supported compression libraries are: + + - `zlib `_: The default compression library. + A classic in terms of compression, achieves good compression + rates but is somewhat slow. + - `lzo `_: Fast + compression and decompression. + - `bzip2 `_: Good compression rates. + - `blosc `_: Fast compression and + decompression. + + Support for alternative blosc compressors: + + - `blosc:blosclz `_ This is the + default compressor for ``blosc`` + - `blosc:lz4 + `_: + A compact, very popular and fast compressor. + - `blosc:lz4hc + `_: + A tweaked version of LZ4, produces better + compression ratios at the expense of speed. + - `blosc:snappy `_: + A popular compressor used in many places. + - `blosc:zlib `_: A classic; + somewhat slower than the previous ones, but + achieving better compression ratios. + - `blosc:zstd `_: An + extremely well balanced codec; it provides the best + compression ratios among the others above, and at + reasonably fast speed. + + If ``complib`` is defined as something other than the listed libraries a + ``ValueError`` exception is issued. + +.. note:: + + If the library specified with the ``complib`` option is missing on your platform, + compression defaults to ``zlib`` without further ado. + +Enable compression for all objects within the file: + +.. 
code-block:: python + + store_compressed = pd.HDFStore( + "store_compressed.h5", complevel=9, complib="blosc:blosclz" + ) + +Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled: + +.. code-block:: python + + store.append("df", df, complib="zlib", complevel=5) + +.. _io.hdf5-ptrepack: + +ptrepack +++++++++ + +``PyTables`` offers better write performance when tables are compressed after +they are written, as opposed to turning on compression at the very +beginning. You can use the supplied ``PyTables`` utility +``ptrepack``. In addition, ``ptrepack`` can change compression levels +after the fact. + +.. code-block:: console + + ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5 + +Furthermore ``ptrepack in.h5 out.h5`` will *repack* the file to allow +you to reuse previously deleted space. Alternatively, one can simply +remove the file and write again, or use the ``copy`` method. + +.. _io.hdf5-caveats: + +Caveats ++++++++ + +.. warning:: + + ``HDFStore`` is **not-threadsafe for writing**. The underlying + ``PyTables`` only supports concurrent reads (via threading or + processes). If you need reading and writing *at the same time*, you + need to serialize these operations in a single thread in a single + process. You will corrupt your data otherwise. See the (:issue:`2397`) for more information. + +* If you use locks to manage write access between multiple processes, you + may want to use :py:func:`~os.fsync` before releasing write locks. For + convenience you can use ``store.flush(fsync=True)`` to do this for you. +* Once a ``table`` is created columns (DataFrame) + are fixed; only exactly the same columns can be appended +* Be aware that timezones (e.g., ``zoneinfo.ZoneInfo('US/Eastern')``) + are not necessarily equal across timezone versions. 
So if data is + localized to a specific timezone in the HDFStore using one version + of a timezone library and that data is updated with another version, the data + will be converted to UTC since these timezones are not considered + equal. Either use the same version of timezone library or use ``tz_convert`` with + the updated timezone definition. + +.. warning:: + + ``PyTables`` will show a ``NaturalNameWarning`` if a column name + cannot be used as an attribute selector. + *Natural* identifiers contain only letters, numbers, and underscores, + and may not begin with a number. + Other identifiers cannot be used in a ``where`` clause + and are generally a bad idea. + +.. _io.hdf5-data_types: + +DataTypes +''''''''' + +``HDFStore`` will map an object dtype to the ``PyTables`` underlying +dtype. This means the following types are known to work: + +====================================================== ========================= +Type Represents missing values +====================================================== ========================= +floating : ``float64, float32, float16`` ``np.nan`` +integer : ``int64, int32, int8, uint64,uint32, uint8`` +boolean +``datetime64[ns]`` ``NaT`` +``timedelta64[ns]`` ``NaT`` +categorical : see the section below +object : ``strings`` ``np.nan`` +====================================================== ========================= + +``unicode`` columns are not supported, and **WILL FAIL**. + +.. _io.hdf5-categorical: + +Categorical data +++++++++++++++++ + +You can write data that contains ``category`` dtypes to a ``HDFStore``. +Queries work the same as if it was an object array. However, the ``category`` dtyped data is +stored in a more efficient manner. + +.. 
ipython:: python + + dfcat = pd.DataFrame( + {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)} + ) + dfcat + dfcat.dtypes + cstore = pd.HDFStore("cats.h5", mode="w") + cstore.append("dfcat", dfcat, format="table", data_columns=["A"]) + result = cstore.select("dfcat", where="A in ['b', 'c']") + result + result.dtypes + +.. ipython:: python + :suppress: + :okexcept: + + cstore.close() + os.remove("cats.h5") + + +String columns +++++++++++++++ + +**min_itemsize** + +The underlying implementation of ``HDFStore`` uses a fixed column width (itemsize) for string columns. +A string column itemsize is calculated as the maximum of the +length of data (for that column) that is passed to the ``HDFStore``, **in the first append**. Subsequent appends +may introduce a string for a column **larger** than the column can hold; in that case an Exception will be raised (otherwise you +could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and +allow a user-specified truncation to occur. + +Pass ``min_itemsize`` on the first table creation to a priori specify the minimum length of a particular string column. +``min_itemsize`` can be an integer, or a dict mapping a column name to an integer. You can pass ``values`` as a key to +allow all *indexables* or *data_columns* to have this min_itemsize. + +Passing a ``min_itemsize`` dict will cause all passed columns to be created as *data_columns* automatically. + +.. note:: + + If you are not passing any ``data_columns``, then the ``min_itemsize`` will be the maximum of the length of any string passed. + +.. 
ipython:: python + + dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5))) + dfs + + # A and B have a size of 30 + store.append("dfs", dfs, min_itemsize=30) + store.get_storer("dfs").table + + # A is created as a data_column with a size of 30 + # B's size is calculated + store.append("dfs2", dfs, min_itemsize={"A": 30}) + store.get_storer("dfs2").table + +**nan_rep** + +String columns will serialize a ``np.nan`` (a missing value) with the ``nan_rep`` string representation. This defaults to the string value ``nan``. +You could inadvertently turn an actual ``nan`` string into a missing value. + +.. ipython:: python + + dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]}) + dfss + + store.append("dfss", dfss) + store.select("dfss") + + # here you need to specify a different nan rep + store.append("dfss2", dfss, nan_rep="_nan_") + store.select("dfss2") + + +Performance +''''''''''' + +* The ``table`` format comes with a writing performance penalty as compared to + ``fixed`` stores. The benefit is the ability to append/delete and + query (potentially very large amounts of data). Write times are + generally longer as compared with regular stores. Query times can + be quite fast, especially on an indexed axis. +* You can pass ``chunksize=<int>`` to ``append``, specifying the + write chunksize (default is 50000). This will significantly lower + your memory usage on writing. +* You can pass ``expectedrows=<int>`` to the first ``append``, + to set the TOTAL number of rows that ``PyTables`` will expect. + This will optimize read/write performance. +* Duplicate rows can be written to tables, but are filtered out in + selection (with the last items being selected; thus a table is + unique on major, minor pairs) +* A ``PerformanceWarning`` will be raised if you are attempting to + store types that will be pickled by PyTables (rather than stored as + endemic types). See + `Here `__ + for more information and some solutions. + + +.. 
ipython:: python + :suppress: + + store.close() + os.remove("store.h5") diff --git a/doc/source/user_guide/io/html.rst b/doc/source/user_guide/io/html.rst new file mode 100644 index 0000000000000..879c2da281c92 --- /dev/null +++ b/doc/source/user_guide/io/html.rst @@ -0,0 +1,459 @@ +==== +HTML +==== + +.. _io.read_html: + +Reading HTML content +'''''''''''''''''''' + +.. warning:: + + We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas ` + below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers. + +The top-level :func:`~pandas.io.html.read_html` function can accept an HTML +string/file/URL and will parse HTML tables into list of pandas ``DataFrames``. +Let's look at a few examples. + +.. note:: + + ``read_html`` returns a ``list`` of ``DataFrame`` objects, even if there is + only a single table contained in the HTML content. + +Read a URL with no options: + +.. code-block:: ipython + + In [320]: url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list" + + In [321]: pd.read_html(url) + Out[321]: + [ Bank NameBank CityCity StateSt ... Acquiring InstitutionAI Closing DateClosing FundFund + 0 Almena State Bank Almena KS ... Equity Bank October 23, 2020 10538 + 1 First City Bank of Florida Fort Walton Beach FL ... United Fidelity Bank, fsb October 16, 2020 10537 + 2 The First State Bank Barboursville WV ... MVB Bank, Inc. April 3, 2020 10536 + 3 Ericson State Bank Ericson NE ... Farmers and Merchants Bank February 14, 2020 10535 + 4 City National Bank of New Jersey Newark NJ ... Industrial Bank November 1, 2019 10534 + .. ... ... ... ... ... ... ... + 558 Superior Bank, FSB Hinsdale IL ... Superior Federal, FSB July 27, 2001 6004 + 559 Malta National Bank Malta OH ... North Valley Bank May 3, 2001 4648 + 560 First Alliance Bank & Trust Co. Manchester NH ... Southern New Hampshire Bank & Trust February 2, 2001 4647 + 561 National State Bank of Metropolis Metropolis IL ... 
Banterra Bank of Marion December 14, 2000 4646 + 562 Bank of Honolulu Honolulu HI ... Bank of the Orient October 13, 2000 4645 + + [563 rows x 7 columns]] + +.. note:: + + The data from the above URL changes every Monday so the resulting data above may be slightly different. + +Read a URL while passing headers alongside the HTTP request: + +.. code-block:: ipython + + In [322]: url = 'https://www.sump.org/notes/request/' # HTTP request reflector + + In [323]: pd.read_html(url) + Out[323]: + [ 0 1 + 0 Remote Socket: 51.15.105.256:51760 + 1 Protocol Version: HTTP/1.1 + 2 Request Method: GET + 3 Request URI: /notes/request/ + 4 Request Query: NaN, + 0 Accept-Encoding: identity + 1 Host: www.sump.org + 2 User-Agent: Python-urllib/3.8 + 3 Connection: close] + + In [324]: headers = { + .....: 'User-Agent':'Mozilla Firefox v14.0', + .....: 'Accept':'application/json', + .....: 'Connection':'keep-alive', + .....: 'Auth':'Bearer 2*/f3+fe68df*4' + .....: } + + In [325]: pd.read_html(url, storage_options=headers) + Out[325]: + [ 0 1 + 0 Remote Socket: 51.15.105.256:51760 + 1 Protocol Version: HTTP/1.1 + 2 Request Method: GET + 3 Request URI: /notes/request/ + 4 Request Query: NaN, + 0 User-Agent: Mozilla Firefox v14.0 + 1 AcceptEncoding: gzip, deflate, br + 2 Accept: application/json + 3 Connection: keep-alive + 4 Auth: Bearer 2*/f3+fe68df*4] + +.. note:: + + We see above that the headers we passed are reflected in the HTTP request. + +Read in the content of the file from the above URL and pass it to ``read_html`` +as a string: + +.. ipython:: python + + html_str = """ + + + + + + + + + + + +
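+<table> + <tr> + <th>A</th> + <th>B</th> + <th>C</th> + </tr> + <tr> + <td>a</td> + <td>b</td> + <td>c</td> + </tr> +</table>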
+ """ + + with open("tmp.html", "w") as f: + f.write(html_str) + df = pd.read_html("tmp.html") + df[0] + +.. ipython:: python + :suppress: + + os.remove("tmp.html") + +You can even pass in an instance of ``StringIO`` if you so desire: + +.. ipython:: python + + from io import StringIO + + dfs = pd.read_html(StringIO(html_str)) + dfs[0] + +.. note:: + + The following examples are not run by the IPython evaluator due to the fact + that having so many network-accessing functions slows down the documentation + build. If you spot an error or an example that doesn't run, please do not + hesitate to report it over on `pandas GitHub issues page + `__. + + +Read a URL and match a table that contains specific text: + +.. code-block:: python + + match = "Metcalf Bank" + df_list = pd.read_html(url, match=match) + +Specify a header row (by default ``<th>`` or ``<td>`` elements located within a +``<thead>`` are used to form the column index, if multiple rows are contained within +``<thead>`` then a MultiIndex is created); if specified, the header row is taken +from the data minus the parsed header elements (``<th>`` elements). + +.. code-block:: python + + dfs = pd.read_html(url, header=0) + +Specify an index column: + +.. code-block:: python + + dfs = pd.read_html(url, index_col=0) + +Specify a number of rows to skip: + +.. code-block:: python + + dfs = pd.read_html(url, skiprows=0) + +Specify a number of rows to skip using a list (``range`` works +as well): + +.. code-block:: python + + dfs = pd.read_html(url, skiprows=range(2)) + +Specify an HTML attribute: + +.. code-block:: python + + dfs1 = pd.read_html(url, attrs={"id": "table"}) + dfs2 = pd.read_html(url, attrs={"class": "sortable"}) + print(np.array_equal(dfs1[0], dfs2[0])) # Should be True + +Specify values that should be converted to NaN: + +.. code-block:: python + + dfs = pd.read_html(url, na_values=["No Acquirer"]) + +Specify whether to keep the default set of NaN values: + +.. 
code-block:: python + + dfs = pd.read_html(url, keep_default_na=False) + +Specify converters for columns. This is useful for numerical text data that has +leading zeros. By default columns that are numerical are cast to numeric +types and the leading zeros are lost. To avoid this, we can convert these +columns to strings. + +.. code-block:: python + + url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code?oldid=899173761" + dfs = pd.read_html( + url_mcc, + match="Telekom Albania", + header=0, + converters={"MNC": str}, + ) + +Use some combination of the above: + +.. code-block:: python + + dfs = pd.read_html(url, match="Metcalf Bank", index_col=0) + +Read in pandas ``to_html`` output (with some loss of floating point precision): + +.. code-block:: python + + df = pd.DataFrame(np.random.randn(2, 2)) + s = df.to_html(float_format="{0:.40g}".format) + dfin = pd.read_html(s, index_col=0) + +The ``lxml`` backend will raise an error on a failed parse if that is the only +parser you provide. If you only have a single parser you can provide just a +string, but it is considered good practice to pass a list with one string if, +for example, the function expects a sequence of strings. You may use: + +.. code-block:: python + + dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"]) + +Or you could pass ``flavor='lxml'`` without a list: + +.. code-block:: python + + dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml") + +However, if you have bs4 and html5lib installed and pass ``None`` or ``['lxml', +'bs4']`` then the parse will most likely succeed. Note that *as soon as a parse +succeeds, the function will return*. + +.. code-block:: python + + dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"]) + +Links can be extracted from cells along with the text using ``extract_links="all"``. + +.. ipython:: python + + from io import StringIO + + html_table = """ + + + + + + + +
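<table><tr><th>GitHub</th></tr>
<tr><td><a href="https://github.com/pandas-dev/pandas">pandas</a></td></tr></table>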
+ """ + + df = pd.read_html( + StringIO(html_table), + extract_links="all" + )[0] + df + df[("GitHub", None)] + df[("GitHub", None)].str[1] + +.. versionadded:: 1.5.0 + +.. _io.html: + +Writing to HTML files +''''''''''''''''''''' + +``DataFrame`` objects have an instance method ``to_html`` which renders the +contents of the ``DataFrame`` as an HTML table. The function arguments are as +in the method ``to_string`` described above. + +.. note:: + + Not all of the possible options for ``DataFrame.to_html`` are shown here for + brevity's sake. See :func:`.DataFrame.to_html` for the + full set of options. + +.. note:: + + In an environment that supports HTML rendering, such as a Jupyter Notebook, ``display(HTML(...))`` + will render the raw HTML into the environment. + +.. ipython:: python + + from IPython.display import display, HTML + + df = pd.DataFrame(np.random.randn(2, 2)) + df + html = df.to_html() + print(html) # raw html + display(HTML(html)) + +The ``columns`` argument will limit the columns shown: + +.. ipython:: python + + html = df.to_html(columns=[0]) + print(html) + display(HTML(html)) + +``float_format`` takes a Python callable to control the precision of floating +point values: + +.. ipython:: python + + html = df.to_html(float_format="{0:.10f}".format) + print(html) + display(HTML(html)) + + +Row labels are bold by default; pass ``bold_rows=False`` to turn that +off: + +.. ipython:: python + + html = df.to_html(bold_rows=False) + print(html) + display(HTML(html)) + + +The ``classes`` argument provides the ability to give the resulting HTML +table CSS classes. Note that these classes are *appended* to the existing +``'dataframe'`` class. + +.. ipython:: python + + print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"])) + +The ``render_links`` argument provides the ability to add hyperlinks to cells +that contain URLs. + +.. 
ipython:: python + + url_df = pd.DataFrame( + { + "name": ["Python", "pandas"], + "url": ["https://www.python.org/", "https://pandas.pydata.org"], + } + ) + html = url_df.to_html(render_links=True) + print(html) + display(HTML(html)) + +Finally, the ``escape`` argument allows you to control whether the +"<", ">" and "&" characters are escaped in the resulting HTML (by default it is +``True``). So to get the HTML without escaped characters pass ``escape=False``. + +.. ipython:: python + + df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)}) + +Escaped: + +.. ipython:: python + + html = df.to_html() + print(html) + display(HTML(html)) + +Not escaped: + +.. ipython:: python + + html = df.to_html(escape=False) + print(html) + display(HTML(html)) + +.. note:: + + Some browsers may not show a difference in the rendering of the previous two + HTML tables. + + +.. _io.html.gotchas: + +HTML Table Parsing Gotchas +'''''''''''''''''''''''''' + +There are some versioning issues surrounding the libraries that are used to +parse HTML tables in the top-level pandas io function ``read_html``. + +**Issues with** |lxml|_ + +* Benefits + + - |lxml|_ is very fast. + + - |lxml|_ requires Cython to install correctly. + +* Drawbacks + + - |lxml|_ does *not* make any guarantees about the results of its parse + *unless* it is given |svm|_. + + - In light of the above, we have chosen to allow you, the user, to use the + |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ + fails to parse. + + - It is therefore *highly recommended* that you install both + |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid + result (provided everything else is valid) even if |lxml|_ fails. + +**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** + +* The above issues hold here as well since |BeautifulSoup4|_ is essentially + just a wrapper around a parser backend.
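
The fallback behavior described above (try a strict parser first, then fall back to a more lenient one, returning as soon as one succeeds) can be illustrated with a small standalone sketch. This is not pandas' internal code: ``parse_with_fallback``, ``strict_parse`` and ``lenient_parse`` are hypothetical names invented for illustration, and the two toy parsers only stand in for a strict and a lenient HTML backend.

```python
def parse_with_fallback(markup, parsers):
    """Try each (name, parse) pair in order and return the first success."""
    failures = []
    for name, parse in parsers:
        try:
            # Return as soon as one parser succeeds; later ones never run.
            return name, parse(markup)
        except ValueError as exc:
            failures.append(f"{name}: {exc}")
    raise ValueError("no parser succeeded: " + "; ".join(failures))


def strict_parse(markup):
    # Hypothetical strict parser: rejects markup with unbalanced <td> tags.
    if markup.count("<td>") != markup.count("</td>"):
        raise ValueError("unbalanced <td> tags")
    return "strict result"


def lenient_parse(markup):
    # Hypothetical lenient parser: accepts anything that contains a table.
    if "<table" not in markup:
        raise ValueError("no tables found")
    return "lenient result"


# Order matters: the strict parser is tried before the lenient one.
PARSERS = [("strict", strict_parse), ("lenient", lenient_parse)]

good = "<table><tr><td>1</td></tr></table>"
bad = "<table><tr><td>1</tr></table>"  # missing </td>

print(parse_with_fallback(good, PARSERS))
print(parse_with_fallback(bad, PARSERS))
```

With ``read_html`` itself, the analogous choice is made through the ``flavor`` argument shown earlier, for example ``flavor=["lxml", "bs4"]``; which parser ends up producing the result depends on which libraries are installed.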
+ +**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** + +* Benefits + + - |html5lib|_ is far more lenient than |lxml|_ and consequently deals + with *real-life markup* in a much saner way rather than just, e.g., + dropping an element without notifying you. + + - |html5lib|_ *generates valid HTML5 markup from invalid markup + automatically*. This is extremely important for parsing HTML tables, + since it guarantees a valid document. However, that does NOT mean that + it is "correct", since the process of fixing markup does not have a + single definition. + + - |html5lib|_ is pure Python and requires no additional build steps beyond + its own installation. + +* Drawbacks + + - The biggest drawback to using |html5lib|_ is that it is slow as + molasses. However consider the fact that many tables on the web are not + big enough for the parsing algorithm runtime to matter. It is more + likely that the bottleneck will be in the process of reading the raw + text from the URL over the web, i.e., IO (input-output). For very large + tables, this might not be true. + + +.. |svm| replace:: **strictly valid markup** +.. _svm: https://validator.w3.org/docs/help.html#validation_basics + +.. |html5lib| replace:: **html5lib** +.. _html5lib: https://github.com/html5lib/html5lib-python + +.. |BeautifulSoup4| replace:: **BeautifulSoup4** +.. _BeautifulSoup4: https://www.crummy.com/software/BeautifulSoup + +.. |lxml| replace:: **lxml** +.. _lxml: https://lxml.de diff --git a/doc/source/user_guide/io/index.rst b/doc/source/user_guide/io/index.rst new file mode 100644 index 0000000000000..c61ba0547f7b2 --- /dev/null +++ b/doc/source/user_guide/io/index.rst @@ -0,0 +1,273 @@ +.. _io: + +=============================== +IO tools (text, CSV, HDF5, ...) +=============================== + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + csv + json + html + latex + xml + clipboard + excel + hdf5 + feather + parquet + orc + stata + sas + spss + pickling + sql + community_packages + +The pandas I/O API is a set of top level ``reader`` functions accessed like +:func:`pandas.read_csv` that generally return a pandas object. The corresponding +``writer`` functions are object methods that are accessed like +:meth:`DataFrame.to_csv`. Below is a table containing available ``readers`` and +``writers``. + +.. csv-table:: + :header: "Format Type", "Data Description", "Reader", "Writer" + :widths: 30, 100, 60, 60 + + text,`CSV `__, :ref:`read_csv`, :ref:`to_csv` + text,Fixed-Width Text File, :ref:`read_fwf` , NA + text,`JSON `__, :ref:`read_json`, :ref:`to_json` + text,`HTML `__, :ref:`read_html`, :ref:`to_html` + text,`LaTeX `__, :ref:`Styler.to_latex` , NA + text,`XML `__, :ref:`read_xml`, :ref:`to_xml` + text, Local clipboard, :ref:`read_clipboard`, :ref:`to_clipboard` + binary,`MS Excel `__ , :ref:`read_excel`, :ref:`to_excel` + binary,`OpenDocument `__, :ref:`read_excel`, NA + binary,`HDF5 Format `__, :ref:`read_hdf`, :ref:`to_hdf` + binary,`Feather Format `__, :ref:`read_feather`, :ref:`to_feather` + binary,`Parquet Format `__, :ref:`read_parquet`, :ref:`to_parquet` + binary,`ORC Format `__, :ref:`read_orc`, :ref:`to_orc` + binary,`Stata `__, :ref:`read_stata`, :ref:`to_stata` + binary,`SAS `__, :ref:`read_sas` , NA + binary,`SPSS `__, :ref:`read_spss` , NA + binary,`Python Pickle Format `__, :ref:`read_pickle`, :ref:`to_pickle` + SQL,`SQL `__, :ref:`read_sql`,:ref:`to_sql` + +See also this list of :ref:`community-supported packages` offering support for other file formats. + + +.. _io.perf: + +Performance considerations +-------------------------- + +This is an informal comparison of various IO methods, using pandas +0.24.2. Timings are machine dependent and small differences should be +ignored. + +.. 
code-block:: ipython + + In [1]: sz = 1000000 + In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz}) + + In [3]: df.info() + <class 'pandas.core.frame.DataFrame'> + RangeIndex: 1000000 entries, 0 to 999999 + Data columns (total 2 columns): + A 1000000 non-null float64 + B 1000000 non-null int64 + dtypes: float64(1), int64(1) + memory usage: 15.3 MB + +The following test functions will be used below to compare the performance of several IO methods: + +.. code-block:: python + + import os + import sqlite3 + + import numpy as np + import pandas as pd + + sz = 1000000 + np.random.seed(42) + df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz}) + + + def test_sql_write(df): + if os.path.exists("test.sql"): + os.remove("test.sql") + sql_db = sqlite3.connect("test.sql") + df.to_sql(name="test_table", con=sql_db) + sql_db.close() + + + def test_sql_read(): + sql_db = sqlite3.connect("test.sql") + pd.read_sql_query("select * from test_table", sql_db) + sql_db.close() + + + def test_hdf_fixed_write(df): + df.to_hdf("test_fixed.hdf", key="test", mode="w") + + + def test_hdf_fixed_read(): + pd.read_hdf("test_fixed.hdf", "test") + + + def test_hdf_fixed_write_compress(df): + df.to_hdf("test_fixed_compress.hdf", key="test", mode="w", complib="blosc") + + + def test_hdf_fixed_read_compress(): + pd.read_hdf("test_fixed_compress.hdf", "test") + + + def test_hdf_table_write(df): + df.to_hdf("test_table.hdf", key="test", mode="w", format="table") + + + def test_hdf_table_read(): + pd.read_hdf("test_table.hdf", "test") + + + def test_hdf_table_write_compress(df): + df.to_hdf( + "test_table_compress.hdf", key="test", mode="w", complib="blosc", format="table" + ) + + + def test_hdf_table_read_compress(): + pd.read_hdf("test_table_compress.hdf", "test") + + + def test_csv_write(df): + df.to_csv("test.csv", mode="w") + + + def test_csv_read(): + pd.read_csv("test.csv", index_col=0) + + + def test_feather_write(df): + df.to_feather("test.feather") + + + def 
test_feather_read(): + pd.read_feather("test.feather") + + + def test_pickle_write(df): + df.to_pickle("test.pkl") + + + def test_pickle_read(): + pd.read_pickle("test.pkl") + + + def test_pickle_write_compress(df): + df.to_pickle("test.pkl.compress", compression="xz") + + + def test_pickle_read_compress(): + pd.read_pickle("test.pkl.compress", compression="xz") + + + def test_parquet_write(df): + df.to_parquet("test.parquet") + + + def test_parquet_read(): + pd.read_parquet("test.parquet") + +When writing, the top three functions in terms of speed are ``test_feather_write``, ``test_hdf_fixed_write`` and ``test_hdf_fixed_write_compress``. + +.. code-block:: ipython + + In [4]: %timeit test_sql_write(df) + 3.29 s ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [5]: %timeit test_hdf_fixed_write(df) + 19.4 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [6]: %timeit test_hdf_fixed_write_compress(df) + 19.6 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + + In [7]: %timeit test_hdf_table_write(df) + 449 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [8]: %timeit test_hdf_table_write_compress(df) + 448 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [9]: %timeit test_csv_write(df) + 3.66 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [10]: %timeit test_feather_write(df) + 9.75 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) + + In [11]: %timeit test_pickle_write(df) + 30.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + + In [12]: %timeit test_pickle_write_compress(df) + 4.29 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [13]: %timeit test_parquet_write(df) + 67.6 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + +When reading, the top three functions in terms of speed are ``test_feather_read``, ``test_pickle_read`` and +``test_hdf_fixed_read``. + + +.. 
code-block:: ipython + + In [14]: %timeit test_sql_read() + 1.77 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [15]: %timeit test_hdf_fixed_read() + 19.4 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + + In [16]: %timeit test_hdf_fixed_read_compress() + 19.5 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + + In [17]: %timeit test_hdf_table_read() + 38.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + + In [18]: %timeit test_hdf_table_read_compress() + 38.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) + + In [19]: %timeit test_csv_read() + 452 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [20]: %timeit test_feather_read() + 12.4 ms ± 99.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) + + In [21]: %timeit test_pickle_read() + 18.4 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) + + In [22]: %timeit test_pickle_read_compress() + 915 ms ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + + In [23]: %timeit test_parquet_read() + 24.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) + + +The files ``test.pkl.compress``, ``test.parquet`` and ``test.feather`` took the least space on disk (in bytes). + +.. code-block:: none + + 29519500 Oct 10 06:45 test.csv + 16000248 Oct 10 06:45 test.feather + 8281983 Oct 10 06:49 test.parquet + 16000857 Oct 10 06:47 test.pkl + 7552144 Oct 10 06:48 test.pkl.compress + 34816000 Oct 10 06:42 test.sql + 24009288 Oct 10 06:43 test_fixed.hdf + 24009288 Oct 10 06:43 test_fixed_compress.hdf + 24458940 Oct 10 06:44 test_table.hdf + 24458940 Oct 10 06:44 test_table_compress.hdf diff --git a/doc/source/user_guide/io/json.rst b/doc/source/user_guide/io/json.rst new file mode 100644 index 0000000000000..2861176cd80fb --- /dev/null +++ b/doc/source/user_guide/io/json.rst @@ -0,0 +1,618 @@ +.. 
_io.json: + +==== +JSON +==== + +Read and write ``JSON`` format files and strings. + +.. _io.json_writer: + +Writing JSON +'''''''''''' + +A ``Series`` or ``DataFrame`` can be converted to a valid JSON string. Use ``to_json`` +with optional parameters: + +* ``path_or_buf`` : the pathname or buffer to write the output. + This can be ``None`` in which case a JSON string is returned. +* ``orient`` : + + ``Series``: + * default is ``index`` + * allowed values are {``split``, ``records``, ``index``} + + ``DataFrame``: + * default is ``columns`` + * allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} + + The format of the JSON string + + .. csv-table:: + :widths: 20, 150 + + ``split``, dict like {index -> [index]; columns -> [columns]; data -> [values]} + ``records``, list like [{column -> value}; ... ] + ``index``, dict like {index -> {column -> value}} + ``columns``, dict like {column -> {index -> value}} + ``values``, just the values array + ``table``, adhering to the JSON `Table Schema`_ + +* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601. +* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10. +* ``force_ascii`` : force encoded string to be ASCII, default True. +* ``date_unit`` : The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'. +* ``default_handler`` : The handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object. +* ``lines`` : If ``records`` orient, then will write each record per line as json. +* ``mode`` : string, writer mode when writing to path. 'w' for write, 'a' for append. 
Default 'w' + +Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datetime`` objects will be converted based on the ``date_format`` and ``date_unit`` parameters. + +.. ipython:: python + + dfj = pd.DataFrame(np.random.randn(5, 2), columns=list("AB")) + json = dfj.to_json() + json + +Orient options +++++++++++++++ + +There are a number of different options for the format of the resulting JSON +file / string. Consider the following ``DataFrame`` and ``Series``: + +.. ipython:: python + + dfjo = pd.DataFrame( + dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)), + columns=list("ABC"), + index=list("xyz"), + ) + dfjo + sjo = pd.Series(dict(x=15, y=16, z=17), name="D") + sjo + +**Column oriented** (the default for ``DataFrame``) serializes the data as +nested JSON objects with column labels acting as the primary index: + +.. ipython:: python + + dfjo.to_json(orient="columns") + # Not available for Series + +**Index oriented** (the default for ``Series``) similar to column oriented +but the index labels are now primary: + +.. ipython:: python + + dfjo.to_json(orient="index") + sjo.to_json(orient="index") + +**Record oriented** serializes the data to a JSON array of column -> value records, +index labels are not included. This is useful for passing ``DataFrame`` data to plotting +libraries, for example the JavaScript library ``d3.js``: + +.. ipython:: python + + dfjo.to_json(orient="records") + sjo.to_json(orient="records") + +**Value oriented** is a bare-bones option which serializes to nested JSON arrays of +values only, column and index labels are not included: + +.. ipython:: python + + dfjo.to_json(orient="values") + # Not available for Series + +**Split oriented** serializes to a JSON object containing separate entries for +values, index and columns. Name is also included for ``Series``: + +.. 
ipython:: python + + dfjo.to_json(orient="split") + sjo.to_json(orient="split") + +**Table oriented** serializes to the JSON `Table Schema`_, allowing for the +preservation of metadata including but not limited to dtypes and index names. + +.. note:: + + Any orient option that encodes to a JSON object will not preserve the ordering of + index and column labels during round-trip serialization. If you wish to preserve + label ordering use the ``split`` option as it uses ordered containers. + +Date handling ++++++++++++++ + +Writing in ISO date format: + +.. ipython:: python + + dfd = pd.DataFrame(np.random.randn(5, 2), columns=list("AB")) + dfd["date"] = pd.Timestamp("20130101") + dfd = dfd.sort_index(axis=1, ascending=False) + json = dfd.to_json(date_format="iso") + json + +Writing in ISO date format, with microseconds: + +.. ipython:: python + + json = dfd.to_json(date_format="iso", date_unit="us") + json + +Writing to a file, with a date index and a date column: + +.. ipython:: python + + dfj2 = dfj.copy() + dfj2["date"] = pd.Timestamp("20130101") + dfj2["ints"] = list(range(5)) + dfj2["bools"] = True + dfj2.index = pd.date_range("20130101", periods=5) + dfj2.to_json("test.json", date_format="iso") + + with open("test.json") as fh: + print(fh.read()) + +Fallback behavior ++++++++++++++++++ + +If the JSON serializer cannot handle the container contents directly it will +fall back in the following manner: + +* if the dtype is unsupported (e.g. ``np.complex_``) then the ``default_handler``, if provided, will be called + for each value, otherwise an exception is raised. + +* if an object is unsupported it will attempt the following: + + + - check if the object has defined a ``toDict`` method and call it. + A ``toDict`` method should return a ``dict`` which will then be JSON serialized. + + - invoke the ``default_handler`` if one was provided. + + - convert the object to a ``dict`` by traversing its contents. 
However this will often fail + with an ``OverflowError`` or give unexpected results. + +In general the best approach for unsupported objects or dtypes is to provide a ``default_handler``. +For example: + +.. code-block:: python + + >>> pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json() # raises + RuntimeError: Unhandled numpy dtype 15 + +can be dealt with by specifying a simple ``default_handler``: + +.. ipython:: python + + pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str) + +.. _io.json_reader: + +Reading JSON +'''''''''''' + +Reading a JSON string into a pandas object can take a number of parameters. +The parser will try to parse a ``DataFrame`` if ``typ`` is not supplied or +is ``None``. To explicitly force ``Series`` parsing, pass ``typ="series"``. + +* ``filepath_or_buffer`` : a **VALID** JSON string or file handle / StringIO. The string could be + a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host + is expected. For instance, a local file could be + ``file://localhost/path/to/table.json`` +* ``typ`` : type of object to recover (series or frame), default 'frame' +* ``orient`` : + + Series : + * default is ``index`` + * allowed values are {``split``, ``records``, ``index``} + + DataFrame + * default is ``columns`` + * allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} + + The format of the JSON string + + .. csv-table:: + :widths: 20, 150 + + ``split``, dict like {index -> [index]; columns -> [columns]; data -> [values]} + ``records``, list like [{column -> value} ...] + ``index``, dict like {index -> {column -> value}} + ``columns``, dict like {column -> {index -> value}} + ``values``, just the values array + ``table``, adhering to the JSON `Table Schema`_ + + +* ``dtype`` : if ``True``, infer dtypes; if a dict of column to dtype, use those; if ``False``, don't infer dtypes at all. Default is ``True``. Applies only to the data.
+* ``convert_axes`` : boolean, try to convert the axes to the proper dtypes, default is ``True`` +* ``convert_dates`` : a list of columns to parse for dates; If ``True``, then try to parse date-like columns, default is ``True``. +* ``keep_default_dates`` : boolean, default ``True``. If parsing dates, then parse the default date-like columns. +* ``precise_float`` : boolean, default ``False``. Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (``False``) is to use fast but less precise builtin functionality. +* ``date_unit`` : string, the timestamp unit to detect if converting dates. Default + None. By default the timestamp precision will be detected, if this is not desired + then pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to + seconds, milliseconds, microseconds or nanoseconds respectively. +* ``lines`` : reads file as one json object per line. +* ``encoding`` : The encoding to use to decode py3 bytes. +* ``chunksize`` : when used in combination with ``lines=True``, return a ``pandas.api.typing.JsonReader`` which reads in ``chunksize`` lines per iteration. +* ``engine``: Either ``"ujson"``, the built-in JSON parser, or ``"pyarrow"`` which dispatches to pyarrow's ``pyarrow.json.read_json``. + The ``"pyarrow"`` is only available when ``lines=True`` + +The parser will raise one of ``ValueError/TypeError/AssertionError`` if the JSON is not parseable. + +If a non-default ``orient`` was used when encoding to JSON be sure to pass the same +option here so that decoding produces sensible results, see `Orient Options`_ for an +overview. + +Data conversion ++++++++++++++++ + +The default of ``convert_axes=True``, ``dtype=True``, and ``convert_dates=True`` +will try to parse the axes, and all of the data into appropriate types, +including dates. If you need to override specific dtypes, pass a dict to +``dtype``. 
``convert_axes`` should only be set to ``False`` if you need to +preserve string-like numbers (e.g. '1', '2') in an axes. + +.. note:: + + Large integer values may be converted to dates if ``convert_dates=True`` and the data and / or column labels appear 'date-like'. The exact threshold depends on the ``date_unit`` specified. 'date-like' means that the column label meets one of the following criteria: + + * it ends with ``'_at'`` + * it ends with ``'_time'`` + * it begins with ``'timestamp'`` + * it is ``'modified'`` + * it is ``'date'`` + +.. warning:: + + When reading JSON data, automatic coercing into dtypes has some quirks: + + * an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization + * a column that was ``float`` data will be converted to ``integer`` if it can be done safely, e.g. a column of ``1.`` + * bool columns will be converted to ``integer`` on reconstruction + + Thus there are times where you may want to specify specific dtypes via the ``dtype`` keyword argument. + +Reading from a JSON string: + +.. ipython:: python + + from io import StringIO + pd.read_json(StringIO(json)) + +Reading from a file: + +.. ipython:: python + + pd.read_json("test.json") + +Don't convert any data (but still convert axes and dates): + +.. ipython:: python + + pd.read_json("test.json", dtype=object).dtypes + +Specify dtypes for conversion: + +.. ipython:: python + + pd.read_json("test.json", dtype={"A": "float32", "bools": "int8"}).dtypes + +Preserve string indices: + +.. ipython:: python + + from io import StringIO + si = pd.DataFrame( + np.zeros((4, 4)), columns=list(range(4)), index=[str(i) for i in range(4)] + ) + si + si.index + si.columns + json = si.to_json() + + sij = pd.read_json(StringIO(json), convert_axes=False) + sij + sij.index + sij.columns + +Dates written in nanoseconds need to be read back in nanoseconds: + +.. 
ipython:: python + + from io import StringIO + json = dfj2.to_json(date_format="iso", date_unit="ns") + + # Try to parse timestamps as milliseconds -> Won't Work + dfju = pd.read_json(StringIO(json), date_unit="ms") + dfju + + # Let pandas detect the correct precision + dfju = pd.read_json(StringIO(json)) + dfju + + # Or specify that all timestamps are in nanoseconds + dfju = pd.read_json(StringIO(json), date_unit="ns") + dfju + +By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame. + +.. ipython:: python + + from io import StringIO + + data = ( + '{"a":{"0":1,"1":3},"b":{"0":2.5,"1":4.5},"c":{"0":true,"1":false},"d":{"0":"a","1":"b"},' + '"e":{"0":null,"1":6.0},"f":{"0":null,"1":7.5},"g":{"0":null,"1":true},"h":{"0":null,"1":"a"},' + '"i":{"0":"12-31-2019","1":"12-31-2019"},"j":{"0":null,"1":null}}' + ) + df = pd.read_json(StringIO(data), dtype_backend="pyarrow") + df + df.dtypes + +.. _io.json_normalize: + +Normalization +''''''''''''' + +pandas provides a utility function to take a dict or list of dicts and *normalize* this semi-structured data +into a flat table. + +.. ipython:: python + + data = [ + {"id": 1, "name": {"first": "Coleen", "last": "Volk"}}, + {"name": {"given": "Mark", "family": "Regner"}}, + {"id": 2, "name": "Faye Raker"}, + ] + pd.json_normalize(data) + +.. ipython:: python + + data = [ + { + "state": "Florida", + "shortname": "FL", + "info": {"governor": "Rick Scott"}, + "county": [ + {"name": "Dade", "population": 12345}, + {"name": "Broward", "population": 40000}, + {"name": "Palm Beach", "population": 60000}, + ], + }, + { + "state": "Ohio", + "shortname": "OH", + "info": {"governor": "John Kasich"}, + "county": [ + {"name": "Summit", "population": 1234}, + {"name": "Cuyahoga", "population": 1337}, + ], + }, + ] + + pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]]) + +The max_level parameter provides more control over which level to end normalization. 
+With ``max_level=1`` the following snippet normalizes only down to the first nesting level of the provided dict. + +.. ipython:: python + + data = [ + { + "CreatedBy": {"Name": "User001"}, + "Lookup": { + "TextField": "Some text", + "UserField": {"Id": "ID001", "Name": "Name001"}, + }, + "Image": {"a": "b"}, + } + ] + pd.json_normalize(data, max_level=1) + +.. _io.jsonl: + +Line delimited json +''''''''''''''''''' + +pandas is able to read and write line-delimited json files that are common in data processing pipelines +using Hadoop or Spark. + +For line-delimited json files, pandas can also return an iterator which reads in ``chunksize`` lines at a time. This can be useful for large files or to read from a stream. + +.. ipython:: python + + from io import StringIO + jsonl = """ + {"a": 1, "b": 2} + {"a": 3, "b": 4} + """ + df = pd.read_json(StringIO(jsonl), lines=True) + df + df.to_json(orient="records", lines=True) + + # reader is an iterator that returns ``chunksize`` lines each iteration + with pd.read_json(StringIO(jsonl), lines=True, chunksize=1) as reader: + reader + for chunk in reader: + print(chunk) + +Line-delimited json can also be read using the pyarrow reader by specifying ``engine="pyarrow"``. + +.. ipython:: python + + from io import BytesIO + df = pd.read_json(BytesIO(jsonl.encode()), lines=True, engine="pyarrow") + df + +.. versionadded:: 2.0.0 + +.. _io.table_schema: + +Table schema +'''''''''''' + +`Table Schema`_ is a spec for describing tabular datasets as a JSON +object. The JSON includes information on the field names, types, and +other attributes. You can use the orient ``table`` to build +a JSON string with two fields, ``schema`` and ``data``. + +.. 
ipython:: python + + df = pd.DataFrame( + { + "A": [1, 2, 3], + "B": ["a", "b", "c"], + "C": pd.date_range("2016-01-01", freq="D", periods=3), + }, + index=pd.Index(range(3), name="idx"), + ) + df + df.to_json(orient="table", date_format="iso") + +The ``schema`` field contains the ``fields`` key, which itself contains +a list of column name to type pairs, including the ``Index`` or ``MultiIndex`` +(see below for a list of types). +The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index +is unique. + +The second field, ``data``, contains the serialized data with the ``records`` +orient. +The index is included, and any datetimes are ISO 8601 formatted, as required +by the Table Schema spec. + +The full list of types supported are described in the Table Schema +spec. This table shows the mapping from pandas types: + +=============== ================= +pandas type Table Schema type +=============== ================= +int64 integer +float64 number +bool boolean +datetime64[ns] datetime +timedelta64[ns] duration +categorical any +object str +=============== ================= + +A few notes on the generated table schema: + +* The ``schema`` object contains a ``pandas_version`` field. This contains + the version of pandas' dialect of the schema, and will be incremented + with each revision. +* All dates are converted to UTC when serializing. Even timezone naive values, + which are treated as UTC with an offset of 0. + + .. ipython:: python + + from pandas.io.json import build_table_schema + + s = pd.Series(pd.date_range("2016", periods=4)) + build_table_schema(s) + +* datetimes with a timezone (before serializing), include an additional field + ``tz`` with the time zone name (e.g. ``'US/Central'``). + + .. ipython:: python + + s_tz = pd.Series(pd.date_range("2016", periods=12, tz="US/Central")) + build_table_schema(s_tz) + +* Periods are converted to timestamps before serialization, and so have the + same behavior of being converted to UTC. 
In addition, periods will contain + an additional field ``freq`` with the period's frequency, e.g. ``'Y-DEC'``. + + .. ipython:: python + + s_per = pd.Series(1, index=pd.period_range("2016", freq="Y-DEC", periods=4)) + build_table_schema(s_per) + +* Categoricals use the ``any`` type and an ``enum`` constraint listing + the set of possible values. Additionally, an ``ordered`` field is included: + + .. ipython:: python + + s_cat = pd.Series(pd.Categorical(["a", "b", "a"])) + build_table_schema(s_cat) + +* A ``primaryKey`` field, containing an array of labels, is included + *if the index is unique*: + + .. ipython:: python + + s_dupe = pd.Series([1, 2], index=[1, 1]) + build_table_schema(s_dupe) + +* The ``primaryKey`` behavior is the same with MultiIndexes, but in this + case the ``primaryKey`` is an array: + + .. ipython:: python + + s_multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)])) + build_table_schema(s_multi) + +* The default naming roughly follows these rules: + + - For series, the ``object.name`` is used. If that is ``None``, then the + name is ``values`` + - For ``DataFrames``, the stringified version of the column name is used + - For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a + fallback to ``index`` if that is None. + - For ``MultiIndex``, ``mi.names`` is used. If any level has no name, + then ``level_<i>`` is used. + +``read_json`` also accepts ``orient='table'`` as an argument. This allows for +the preservation of metadata such as dtypes and index names in a +round-trippable manner. + +.. 
ipython:: python + + df = pd.DataFrame( + { + "foo": [1, 2, 3, 4], + "bar": ["a", "b", "c", "d"], + "baz": pd.date_range("2018-01-01", freq="D", periods=4), + "qux": pd.Categorical(["a", "b", "c", "c"]), + }, + index=pd.Index(range(4), name="idx"), + ) + df + df.dtypes + + df.to_json("test.json", orient="table") + new_df = pd.read_json("test.json", orient="table") + new_df + new_df.dtypes + +Please note that the literal string 'index' as the name of an :class:`Index` +is not round-trippable, nor are any names beginning with ``'level_'`` within a +:class:`MultiIndex`. These are used by default in :func:`DataFrame.to_json` to +indicate missing names, and the subsequent read cannot distinguish the intent. + +.. ipython:: python + :okwarning: + + df.index.name = "index" + df.to_json("test.json", orient="table") + new_df = pd.read_json("test.json", orient="table") + print(new_df.index.name) + +.. ipython:: python + :suppress: + + os.remove("test.json") + +When using ``orient='table'`` along with a user-defined ``ExtensionArray``, +the generated schema will contain an additional ``extDtype`` key in the respective +``fields`` element. This extra key is not standard but does enable JSON roundtrips +for extension types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``). + +The ``extDtype`` key carries the name of the extension. If you have properly registered +the ``ExtensionDtype``, pandas will use said name to perform a lookup into the registry +and re-convert the serialized data into your custom dtype. + +.. _Table Schema: https://specs.frictionlessdata.io/table-schema/ diff --git a/doc/source/user_guide/io/latex.rst b/doc/source/user_guide/io/latex.rst new file mode 100644 index 0000000000000..2af006b0da9f8 --- /dev/null +++ b/doc/source/user_guide/io/latex.rst @@ -0,0 +1,35 @@ +.. _io.latex: + +===== +LaTeX +===== + +.. versionadded:: 1.3.0 + +Currently there are no methods to read from LaTeX, only output methods. 
+ +Writing to LaTeX files +'''''''''''''''''''''' + +.. note:: + + DataFrame *and* Styler objects currently have a ``to_latex`` method. We recommend + using the :func:`Styler.to_latex` method over :func:`DataFrame.to_latex` due to the + former's greater flexibility with conditional styling, and the latter's possible + future deprecation. + +Review the documentation for :func:`Styler.to_latex`, which gives examples of +conditional styling and explains the operation of its keyword arguments. + +For simple applications, the following pattern is sufficient. + +.. ipython:: python + + df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["c", "d"]) + print(df.style.to_latex()) + +To format values before output, chain the :func:`Styler.format` method. + +.. ipython:: python + + print(df.style.format("€ {}").to_latex()) diff --git a/doc/source/user_guide/io/orc.rst b/doc/source/user_guide/io/orc.rst new file mode 100644 index 0000000000000..fc1f2e671b011 --- /dev/null +++ b/doc/source/user_guide/io/orc.rst @@ -0,0 +1,62 @@ +.. _io.orc: + +=== +ORC +=== + +Similar to the :ref:`parquet ` format, the `ORC Format `__ is a binary columnar serialization +for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the +ORC format, :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc`. This requires the `pyarrow `__ library. + +.. warning:: + + * It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow. + * :func:`~pandas.DataFrame.to_orc` requires pyarrow>=7.0.0. + * :func:`~pandas.read_orc` and :func:`~pandas.DataFrame.to_orc` are not supported on Windows yet; you can find valid environments in :ref:`install optional dependencies `. + * For supported dtypes please refer to `supported ORC features in Arrow `__. + * Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files. + +.. 
ipython:: python + + df = pd.DataFrame( + { + "a": list("abc"), + "b": list(range(1, 4)), + "c": np.arange(4.0, 7.0, dtype="float64"), + "d": [True, False, True], + "e": pd.date_range("20130101", periods=3), + } + ) + + df + df.dtypes + +Write to an ORC file. + +.. ipython:: python + + df.to_orc("example_pa.orc", engine="pyarrow") + +Read from an ORC file. + +.. ipython:: python + + result = pd.read_orc("example_pa.orc") + + result.dtypes + +Read only certain columns of an ORC file. + +.. ipython:: python + + result = pd.read_orc( + "example_pa.orc", + columns=["a", "b"], + ) + result.dtypes + + +.. ipython:: python + :suppress: + + os.remove("example_pa.orc") diff --git a/doc/source/user_guide/io/parquet.rst b/doc/source/user_guide/io/parquet.rst new file mode 100644 index 0000000000000..ada07471c9aac --- /dev/null +++ b/doc/source/user_guide/io/parquet.rst @@ -0,0 +1,187 @@ +.. _io.parquet: + +======= +Parquet +======= + +`Apache Parquet `__ provides a partitioned binary columnar serialization for data frames. It is designed to +make reading and writing data frames efficient, and to make sharing data across data analysis +languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible +while still maintaining good read performance. + +Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas +dtypes, including extension dtypes such as datetime with tz. + +Several caveats apply: + +* Duplicate column names and non-string column names are not supported. +* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default + indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can + force including or omitting indexes with the ``index`` argument, regardless of the underlying engine. +* Index level names, if specified, must be strings. 
+* In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype. +* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag. +* Unsupported types include ``Interval`` and actual Python object types. These will raise a helpful error message + on an attempt at serialization. ``Period`` type is supported with pyarrow >= 0.16.0. +* The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data + type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols, + see the :ref:`extension types documentation `). + +You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, ``fastparquet``, or ``auto``. +If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``, +then ``pyarrow`` is tried, falling back to ``fastparquet``. + +See the documentation for `pyarrow `__ and `fastparquet `__. + +.. note:: + + These engines are very similar and should read/write nearly identical parquet format files. + ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes. + These libraries differ by having different underlying dependencies (``fastparquet`` uses ``numba``, while ``pyarrow`` uses a C library). + +.. ipython:: python + + df = pd.DataFrame( + { + "a": list("abc"), + "b": list(range(1, 4)), + "c": np.arange(3, 6).astype("u1"), + "d": np.arange(4.0, 7.0, dtype="float64"), + "e": [True, False, True], + "f": pd.date_range("20130101", periods=3), + "g": pd.date_range("20130101", periods=3, tz="US/Eastern"), + "h": pd.Categorical(list("abc")), + "i": pd.Categorical(list("abc"), ordered=True), + } + ) + + df + df.dtypes + +Write to a parquet file. + +.. 
ipython:: python + + df.to_parquet("example_pa.parquet", engine="pyarrow") + df.to_parquet("example_fp.parquet", engine="fastparquet") + +Read from a parquet file. + +.. ipython:: python + + result = pd.read_parquet("example_fp.parquet", engine="fastparquet") + result = pd.read_parquet("example_pa.parquet", engine="pyarrow") + + result.dtypes + +By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame. + +.. ipython:: python + + result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow") + + result.dtypes + +.. note:: + + Note that this is not supported for ``fastparquet``. + + +Read only certain columns of a parquet file. + +.. ipython:: python + + result = pd.read_parquet( + "example_fp.parquet", + engine="fastparquet", + columns=["a", "b"], + ) + result = pd.read_parquet( + "example_pa.parquet", + engine="pyarrow", + columns=["a", "b"], + ) + result.dtypes + + +.. ipython:: python + :suppress: + + os.remove("example_pa.parquet") + os.remove("example_fp.parquet") + + +Handling indexes +'''''''''''''''' + +Serializing a ``DataFrame`` to parquet may include the implicit index as one or +more columns in the output file. Thus, this code: + +.. ipython:: python + + df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}) + df.to_parquet("test.parquet", engine="pyarrow") + +creates a parquet file with *three* columns if you use ``pyarrow`` for serialization: +``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the +index `may or may not `_ +be written to the file. + +This unexpected extra column causes some databases like Amazon Redshift to reject +the file, because that column doesn't exist in the target table. + +If you want to omit a dataframe's indexes when writing, pass ``index=False`` to +:func:`~pandas.DataFrame.to_parquet`: + +.. 
ipython:: python + + df.to_parquet("test.parquet", index=False) + +This creates a parquet file with just the two expected columns, ``a`` and ``b``. +If your ``DataFrame`` has a custom index, you won't get it back when you load +this file into a ``DataFrame``. + +Passing ``index=True`` will *always* write the index, even if that's not the +underlying engine's default behavior. + +.. ipython:: python + :suppress: + + os.remove("test.parquet") + + +Partitioning Parquet files +'''''''''''''''''''''''''' + +Parquet supports partitioning of data based on the values of one or more columns. + +.. ipython:: python + + df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}) + df.to_parquet(path="test", engine="pyarrow", partition_cols=["a"], compression=None) + +The ``path`` specifies the parent directory to which data will be saved. +The ``partition_cols`` are the column names by which the dataset will be partitioned. +Columns are partitioned in the order they are given. The partition splits are +determined by the unique values in the partition columns. +The above example creates a partitioned dataset that may look like: + +.. code-block:: text + + test + ├── a=0 + │ ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet + │ └── ... + └── a=1 + ├── e6ab24a4f45147b49b54a662f0c412a3.parquet + └── ... + +.. ipython:: python + :suppress: + + from shutil import rmtree + + try: + rmtree("test") + except OSError: + pass diff --git a/doc/source/user_guide/io/pickling.rst b/doc/source/user_guide/io/pickling.rst new file mode 100644 index 0000000000000..8da5e1f96a184 --- /dev/null +++ b/doc/source/user_guide/io/pickling.rst @@ -0,0 +1,121 @@ +.. _io.pickle: + +======== +Pickling +======== + +All pandas objects are equipped with ``to_pickle`` methods which use Python's +``pickle`` module to save data structures to disk using the pickle format. + +.. 
ipython:: python + + df + df.to_pickle("foo.pkl") + +The ``read_pickle`` function in the ``pandas`` namespace can be used to load +any pickled pandas object (or any other pickled object) from a file: + + +.. ipython:: python + + pd.read_pickle("foo.pkl") + +.. ipython:: python + :suppress: + + os.remove("foo.pkl") + +.. warning:: + + Loading pickled data received from untrusted sources can be unsafe. + + See: https://docs.python.org/3/library/pickle.html + +.. warning:: + + :func:`read_pickle` is only guaranteed to be backwards compatible back to a few minor releases. + +.. _io.pickle.compression: + +Compressed pickle files +----------------------- + +:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read +and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, ``zstd`` are supported for reading and writing. +The ``zip`` file format only supports reading and must contain only one data file +to be read. + +The compression type can be an explicit parameter or be inferred from the file extension. +If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, ``zstd`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, +``'.xz'``, or ``'.zst'``, respectively. + +The compression parameter can also be a ``dict`` in order to pass options to the +compression protocol. It must have a ``'method'`` key set to the name +of the compression protocol, which must be one of +{``'zip'``, ``'gzip'``, ``'bz2'``, ``'xz'``, ``'zstd'``}. All other key-value pairs are passed to +the underlying compression library. + +.. ipython:: python + + df = pd.DataFrame( + { + "A": np.random.randn(1000), + "B": "foo", + "C": pd.date_range("20130101", periods=1000, freq="s"), + } + ) + df + +Using an explicit compression type: + +.. ipython:: python + + df.to_pickle("data.pkl.compress", compression="gzip") + rt = pd.read_pickle("data.pkl.compress", compression="gzip") + rt + +Inferring compression type from the extension: + +.. 
ipython:: python + + df.to_pickle("data.pkl.xz", compression="infer") + rt = pd.read_pickle("data.pkl.xz", compression="infer") + rt + +The default is to 'infer': + +.. ipython:: python + + df.to_pickle("data.pkl.gz") + rt = pd.read_pickle("data.pkl.gz") + rt + + df["A"].to_pickle("s1.pkl.bz2") + rt = pd.read_pickle("s1.pkl.bz2") + rt + +Passing options to the compression protocol in order to speed up compression: + +.. ipython:: python + + df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1}) + +.. ipython:: python + :suppress: + + os.remove("data.pkl.compress") + os.remove("data.pkl.xz") + os.remove("data.pkl.gz") + os.remove("s1.pkl.bz2") + +.. _io.msgpack: + +msgpack +------- + +pandas support for ``msgpack`` has been removed in version 1.0.0. It is +recommended to use :ref:`pickle ` instead. + +Alternatively, you can also use the Arrow IPC serialization format for on-the-wire +transmission of pandas objects. For documentation on pyarrow, see +`here `__. diff --git a/doc/source/user_guide/io/sas.rst b/doc/source/user_guide/io/sas.rst new file mode 100644 index 0000000000000..00a164e354758 --- /dev/null +++ b/doc/source/user_guide/io/sas.rst @@ -0,0 +1,47 @@ +.. _io.sas: + +.. _io.sas_reader: + +=========== +SAS formats +=========== + +The top-level function :func:`read_sas` can read (but not write) SAS +XPORT (.xpt) and SAS7BDAT (.sas7bdat) format files. + +SAS files only contain two value types: ASCII text and floating point +values (usually 8 bytes but sometimes truncated). For xport files, +there is no automatic type conversion to integers, dates, or +categoricals. For SAS7BDAT files, the format codes may allow date +variables to be automatically converted to dates. By default the +whole file is read and returned as a ``DataFrame``. + +Specify a ``chunksize`` or use ``iterator=True`` to obtain reader +objects (``XportReader`` or ``SAS7BDATReader``) for incrementally +reading the file. 
The reader objects also have attributes that +contain additional information about the file and its variables. + +Read a SAS7BDAT file: + +.. code-block:: python + + df = pd.read_sas("sas_data.sas7bdat") + +Obtain an iterator and read an XPORT file 100,000 lines at a time: + +.. code-block:: python + + def do_something(chunk): + pass + + + with pd.read_sas("sas_xport.xpt", chunksize=100000) as rdr: + for chunk in rdr: + do_something(chunk) + +The specification_ for the xport file format is available from the SAS +web site. + +.. _specification: https://support.sas.com/content/dam/SAS/support/en/technical-papers/record-layout-of-a-sas-version-5-or-6-data-set-in-sas-transport-xport-format.pdf + +No official documentation is available for the SAS7BDAT format. diff --git a/doc/source/user_guide/io/spss.rst b/doc/source/user_guide/io/spss.rst new file mode 100644 index 0000000000000..3f9802d2b8bfd --- /dev/null +++ b/doc/source/user_guide/io/spss.rst @@ -0,0 +1,38 @@ +.. _io.spss: + +.. _io.spss_reader: + +============ +SPSS formats +============ + +The top-level function :func:`read_spss` can read (but not write) SPSS +SAV (.sav) and ZSAV (.zsav) format files. + +SPSS files contain column names. By default the +whole file is read, categorical columns are converted into ``pd.Categorical``, +and a ``DataFrame`` with all columns is returned. + +Specify the ``usecols`` parameter to obtain a subset of columns. Specify ``convert_categoricals=False`` +to avoid converting categorical columns into ``pd.Categorical``. + +Read an SPSS file: + +.. code-block:: python + + df = pd.read_spss("spss_data.sav") + +Extract a subset of columns contained in ``usecols`` from an SPSS file and +avoid converting categorical columns into ``pd.Categorical``: + +.. code-block:: python + + df = pd.read_spss( + "spss_data.sav", + usecols=["foo", "bar"], + convert_categoricals=False, + ) + +More information about the SAV and ZSAV file formats is available here_. + +.. 
_here: https://www.ibm.com/docs/en/spss-statistics/22.0.0 diff --git a/doc/source/user_guide/io/sql.rst b/doc/source/user_guide/io/sql.rst new file mode 100644 index 0000000000000..f8fcfd5606202 --- /dev/null +++ b/doc/source/user_guide/io/sql.rst @@ -0,0 +1,523 @@ +.. _io.sql: + +=========== +SQL queries +=========== + +The :mod:`pandas.io.sql` module provides a collection of query wrappers to both +facilitate data retrieval and to reduce dependency on DB-specific APIs. + +Where available, users may first want to opt for `Apache Arrow ADBC +`_ drivers. These drivers +should provide the best performance, null handling, and type detection. + + .. versionadded:: 2.2.0 + + Added native support for ADBC drivers + +For a full list of ADBC drivers and their development status, see the `ADBC Driver +Implementation Status `_ +documentation. + +Where an ADBC driver is not available or may be missing functionality, +users should opt for installing SQLAlchemy alongside their database driver library. +Examples of such drivers are `psycopg2 `__ +for PostgreSQL or `pymysql `__ for MySQL. +For `SQLite `__ this is +included in Python's standard library by default. +You can find an overview of supported drivers for each SQL dialect in the +`SQLAlchemy docs `__. + +If SQLAlchemy is not installed, you can use a :class:`sqlite3.Connection` in place of +a SQLAlchemy engine, connection, or URI string. + +See also the :ref:`cookbook examples ` for some advanced strategies. + +The key functions are: + +.. currentmodule:: pandas +.. autosummary:: + + read_sql_table + read_sql_query + read_sql + DataFrame.to_sql + +.. note:: + + The function :func:`~pandas.read_sql` is a convenience wrapper around + :func:`~pandas.read_sql_table` and :func:`~pandas.read_sql_query` (kept for + backward compatibility) and will delegate to the specific function depending on + the provided input (database table name or SQL query). + Table names do not need to be quoted if they have special characters. 
+ +In the following example, we use the `SQLite `__ SQL database +engine. You can use a temporary SQLite database where data are stored in +"memory". + +To connect using an ADBC driver you will want to install the ``adbc_driver_sqlite`` package using your +package manager. Once installed, you can use the DBAPI interface provided by the ADBC driver +to connect to your database. + +.. code-block:: python + + import adbc_driver_sqlite.dbapi as sqlite_dbapi + + # Create the connection + with sqlite_dbapi.connect("sqlite:///:memory:") as conn: + df = pd.read_sql_table("data", conn) + +To connect with SQLAlchemy you use the :func:`create_engine` function to create an engine +object from a database URI. You only need to create the engine once per database you are +connecting to. +For more information on :func:`create_engine` and the URI formatting, see the examples +below and the SQLAlchemy `documentation `__. + +.. ipython:: python + + from sqlalchemy import create_engine + + # Create your engine. + engine = create_engine("sqlite:///:memory:") + +If you want to manage your own connections you can pass one of those instead. The example below opens a +connection to the database using a Python context manager that automatically closes the connection after +the block has completed. +See the `SQLAlchemy docs `__ +for an explanation of how the database connection is handled. + +.. code-block:: python + + with engine.connect() as conn, conn.begin(): + data = pd.read_sql_table("data", conn) + +.. warning:: + + When you open a connection to a database you are also responsible for closing it. + Side effects of leaving a connection open may include locking the database or + other breaking behaviour. + +Writing DataFrames +'''''''''''''''''' + +Assuming the following data is in a ``DataFrame`` ``data``, we can insert it into +the database using :func:`~pandas.DataFrame.to_sql`. 
+ ++-----+------------+-------+-------+-------+ +| id | Date | Col_1 | Col_2 | Col_3 | ++=====+============+=======+=======+=======+ +| 26 | 2010-10-18 | X | 27.5 | True | ++-----+------------+-------+-------+-------+ +| 42 | 2010-10-19 | Y | -12.5 | False | ++-----+------------+-------+-------+-------+ +| 63 | 2010-10-20 | Z | 5.73 | True | ++-----+------------+-------+-------+-------+ + + +.. ipython:: python + + import datetime + + c = ["id", "Date", "Col_1", "Col_2", "Col_3"] + d = [ + (26, datetime.datetime(2010, 10, 18), "X", 27.5, True), + (42, datetime.datetime(2010, 10, 19), "Y", -12.5, False), + (63, datetime.datetime(2010, 10, 20), "Z", 5.73, True), + ] + + data = pd.DataFrame(d, columns=c) + + data + data.to_sql("data", con=engine) + +With some databases, writing large DataFrames can result in errors due to +packet size limitations being exceeded. This can be avoided by setting the +``chunksize`` parameter when calling ``to_sql``. For example, the following +writes ``data`` to the database in batches of 1000 rows at a time: + +.. ipython:: python + + data.to_sql("data_chunked", con=engine, chunksize=1000) + +SQL data types +++++++++++++++ + +Ensuring consistent data type management across SQL databases is challenging. +Not every SQL database offers the same types, and even when they do the implementation +of a given type can vary in ways that have subtle effects on how types can be +preserved. + +For the best odds at preserving database types users are advised to use +ADBC drivers when available. The Arrow type system offers a wider array of +types that more closely match database types than the historical pandas/NumPy +type system. 
To illustrate, note this (non-exhaustive) listing of types +available in different databases and pandas backends: + ++-----------------+-----------------------+----------------+---------+ +|numpy/pandas |arrow |postgres |sqlite | ++=================+=======================+================+=========+ +|int16/Int16 |int16 |SMALLINT |INTEGER | ++-----------------+-----------------------+----------------+---------+ +|int32/Int32 |int32 |INTEGER |INTEGER | ++-----------------+-----------------------+----------------+---------+ +|int64/Int64 |int64 |BIGINT |INTEGER | ++-----------------+-----------------------+----------------+---------+ +|float32 |float32 |REAL |REAL | ++-----------------+-----------------------+----------------+---------+ +|float64 |float64 |DOUBLE PRECISION|REAL | ++-----------------+-----------------------+----------------+---------+ +|object |string |TEXT |TEXT | ++-----------------+-----------------------+----------------+---------+ +|bool |``bool_`` |BOOLEAN | | ++-----------------+-----------------------+----------------+---------+ +|datetime64[ns] |timestamp(us) |TIMESTAMP | | ++-----------------+-----------------------+----------------+---------+ +|datetime64[ns,tz]|timestamp(us,tz) |TIMESTAMPTZ | | ++-----------------+-----------------------+----------------+---------+ +| |date32 |DATE | | ++-----------------+-----------------------+----------------+---------+ +| |month_day_nano_interval|INTERVAL | | ++-----------------+-----------------------+----------------+---------+ +| |binary |BINARY |BLOB | ++-----------------+-----------------------+----------------+---------+ +| |decimal128 |DECIMAL [#f1]_ | | ++-----------------+-----------------------+----------------+---------+ +| |list |ARRAY [#f1]_ | | ++-----------------+-----------------------+----------------+---------+ +| |struct |COMPOSITE TYPE | | +| | | [#f1]_ | | ++-----------------+-----------------------+----------------+---------+ + +.. rubric:: Footnotes + +.. 
[#f1] Not implemented as of writing, but theoretically possible + +If you want to preserve database types as well as possible +throughout the lifecycle of your DataFrame, you are encouraged to +leverage the ``dtype_backend="pyarrow"`` argument of :func:`~pandas.read_sql`: + +.. code-block:: ipython + + # for roundtripping + with pg_dbapi.connect(uri) as conn: + df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow") + +This will prevent your data from being converted to the traditional pandas/NumPy +type system, which often converts SQL types in ways that make them impossible to +round-trip. + +In case an ADBC driver is not available, :func:`~pandas.DataFrame.to_sql` +will try to map your data to an appropriate SQL data type based on the dtype of +the data. When you have columns of dtype ``object``, pandas will try to infer +the data type. + +You can always override the default type by specifying the desired SQL type of +any of the columns by using the ``dtype`` argument. This argument needs a +dictionary mapping column names to SQLAlchemy types (or strings for the sqlite3 +fallback mode). +For example, to use the SQLAlchemy ``String`` type instead of the +default ``Text`` type for string columns: + +.. ipython:: python + + from sqlalchemy.types import String + + data.to_sql("data_dtype", con=engine, dtype={"Col_1": String}) + +.. note:: + + Due to the limited support for timedeltas in the different database + flavors, columns with type ``timedelta64`` will be written to the database + as integer nanosecond values and a warning will be raised. The only + exception to this is when using the ADBC PostgreSQL driver, in which case a + timedelta will be written to the database as an ``INTERVAL``. + +.. note:: + + Columns of ``category`` dtype will be converted to the dense representation + as you would get with ``np.asarray(categorical)`` (e.g. for string categories + this gives an array of strings). 
+ Because of this, reading the database table back in does **not** generate + a categorical. + +.. _io.sql_datetime_data: + +Datetime data types +''''''''''''''''''' + +Using ADBC or SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing +datetime data that is timezone naive or timezone aware. However, the resulting +data stored in the database ultimately depends on the supported data type +for datetime data of the database system being used. + +The following table lists supported data types for datetime data for some +common databases. Other database dialects may have different data types for +datetime data. + +=========== ============================================= =================== +Database SQL Datetime Types Timezone Support +=========== ============================================= =================== +SQLite ``TEXT`` No +MySQL ``TIMESTAMP`` or ``DATETIME`` No +PostgreSQL ``TIMESTAMP`` or ``TIMESTAMP WITH TIME ZONE`` Yes +=========== ============================================= =================== + +When writing timezone aware data to databases that do not support timezones, +the data will be written as timezone naive timestamps that are in local time +with respect to the timezone. + +:func:`~pandas.read_sql_table` is also capable of reading datetime data that is +timezone aware or naive. When reading ``TIMESTAMP WITH TIME ZONE`` types, pandas +will convert the data to UTC. + +.. _io.sql.method: + +Insertion method +++++++++++++++++ + +The parameter ``method`` controls the SQL insertion clause used. +Possible values are: + +- ``None``: Uses the standard SQL ``INSERT`` clause (one per row). +- ``'multi'``: Pass multiple values in a single ``INSERT`` clause. + It uses a *special* SQL syntax not supported by all backends. + This usually provides better performance for analytic databases + like *Presto* and *Redshift*, but has worse performance for + traditional SQL backends if the table contains many columns. 
+ For more information check the SQLAlchemy `documentation + `__. +- callable with signature ``(pd_table, conn, keys, data_iter)``: + This can be used to implement a more performant insertion method based on + specific backend dialect features. + +Example of a callable using PostgreSQL `COPY clause +`__:: + + # Alternative to_sql() *method* for DBs that support COPY FROM + import csv + from io import StringIO + + def psql_insert_copy(table, conn, keys, data_iter): + """ + Execute SQL statement inserting data + + Parameters + ---------- + table : pandas.io.sql.SQLTable + conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection + keys : list of str + Column names + data_iter : Iterable that iterates the values to be inserted + """ + # gets a DBAPI connection that can provide a cursor + dbapi_conn = conn.connection + with dbapi_conn.cursor() as cur: + s_buf = StringIO() + writer = csv.writer(s_buf) + writer.writerows(data_iter) + s_buf.seek(0) + + columns = ', '.join(['"{}"'.format(k) for k in keys]) + if table.schema: + table_name = '{}.{}'.format(table.schema, table.name) + else: + table_name = table.name + + sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format( + table_name, columns) + cur.copy_expert(sql=sql, file=s_buf) + +Reading tables +'''''''''''''' + +:func:`~pandas.read_sql_table` will read a database table given the +table name and optionally a subset of columns to read. + +.. note:: + + In order to use :func:`~pandas.read_sql_table`, you **must** have the + ADBC driver or SQLAlchemy optional dependency installed. + +.. ipython:: python + + pd.read_sql_table("data", engine) + +.. note:: + + ADBC drivers will map database types directly back to arrow types. For other drivers + note that pandas infers column dtypes from query outputs, and not by looking + up data types in the physical database schema. For example, assume ``userid`` + is an integer column in a table. 
Then, intuitively, ``select userid ...`` will + return integer-valued series, while ``select cast(userid as text) ...`` will + return object-valued (str) series. Accordingly, if the query output is empty, + then all resulting columns will be returned as object-valued (since they are + most general). If you foresee that your query will sometimes generate an empty + result, you may want to explicitly typecast afterwards to ensure dtype + integrity. + +You can also specify the name of the column as the ``DataFrame`` index, +and specify a subset of columns to be read. + +.. ipython:: python + + pd.read_sql_table("data", engine, index_col="id") + pd.read_sql_table("data", engine, columns=["Col_1", "Col_2"]) + +And you can explicitly force columns to be parsed as dates: + +.. ipython:: python + + pd.read_sql_table("data", engine, parse_dates=["Date"]) + +If needed you can explicitly specify a format string, or a dict of arguments +to pass to :func:`pandas.to_datetime`: + +.. code-block:: python + + pd.read_sql_table("data", engine, parse_dates={"Date": "%Y-%m-%d"}) + pd.read_sql_table( + "data", + engine, + parse_dates={"Date": {"format": "%Y-%m-%d %H:%M:%S"}}, + ) + + +You can check if a table exists using :func:`~pandas.io.sql.has_table` + +Schema support +'''''''''''''' + +Reading from and writing to different schemas is supported through the ``schema`` +keyword in the :func:`~pandas.read_sql_table` and :func:`~pandas.DataFrame.to_sql` +functions. Note however that this depends on the database flavor (sqlite does not +have schemas). For example: + +.. code-block:: python + + df.to_sql(name="table", con=engine, schema="other_schema") + pd.read_sql_table("table", engine, schema="other_schema") + +Querying +'''''''' + +You can query using raw SQL in the :func:`~pandas.read_sql_query` function. +In this case you must use the SQL variant appropriate for your database. 
+When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs, +which are database-agnostic. + +.. ipython:: python + + pd.read_sql_query("SELECT * FROM data", engine) + +Of course, you can specify a more "complex" query. + +.. ipython:: python + + pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine) + +The :func:`~pandas.read_sql_query` function supports a ``chunksize`` argument. +Specifying this will return an iterator through chunks of the query result: + +.. ipython:: python + + df = pd.DataFrame(np.random.randn(20, 3), columns=list("abc")) + df.to_sql(name="data_chunks", con=engine, index=False) + +.. ipython:: python + + for chunk in pd.read_sql_query("SELECT * FROM data_chunks", engine, chunksize=5): + print(chunk) + + +Engine connection examples +'''''''''''''''''''''''''' + +To connect with SQLAlchemy you use the :func:`create_engine` function to create an engine +object from a database URI. You only need to create the engine once per database you are +connecting to. + +.. code-block:: python + + from sqlalchemy import create_engine + + engine = create_engine("postgresql://scott:tiger@localhost:5432/mydatabase") + + engine = create_engine("mysql+mysqldb://scott:tiger@localhost/foo") + + engine = create_engine("oracle://scott:tiger@127.0.0.1:1521/sidname") + + engine = create_engine("mssql+pyodbc://mydsn") + + # sqlite:/// + # where is relative: + engine = create_engine("sqlite:///foo.db") + + # or absolute, starting with a slash: + engine = create_engine("sqlite:////absolute/path/to/foo.db") + +For more information see the examples in the SQLAlchemy `documentation `__ + + +Advanced SQLAlchemy queries +''''''''''''''''''''''''''' + +You can use SQLAlchemy constructs to describe your query. + +Use :func:`sqlalchemy.text` to specify query parameters in a backend-neutral way + +.. 
ipython:: python + + import sqlalchemy as sa + + pd.read_sql( + sa.text("SELECT * FROM data where Col_1=:col1"), engine, params={"col1": "X"} + ) + +If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy expressions + +.. ipython:: python + + metadata = sa.MetaData() + data_table = sa.Table( + "data", + metadata, + sa.Column("index", sa.Integer), + sa.Column("Date", sa.DateTime), + sa.Column("Col_1", sa.String), + sa.Column("Col_2", sa.Float), + sa.Column("Col_3", sa.Boolean), + ) + + pd.read_sql(sa.select(data_table).where(data_table.c.Col_3.is_(True)), engine) + +You can combine SQLAlchemy expressions with parameters passed to :func:`read_sql` using :func:`sqlalchemy.bindparam` + +.. ipython:: python + + import datetime as dt + + expr = sa.select(data_table).where(data_table.c.Date > sa.bindparam("date")) + pd.read_sql(expr, engine, params={"date": dt.datetime(2010, 10, 18)}) + + +Sqlite fallback +''''''''''''''' + +The use of sqlite is supported without using SQLAlchemy. +This mode requires a Python database adapter that respects the `Python +DB-API `__. + +You can create connections like so: + +.. code-block:: python + + import sqlite3 + + con = sqlite3.connect(":memory:") + +And then issue the following queries: + +.. code-block:: python + + data.to_sql("data", con) + pd.read_sql_query("SELECT * FROM data", con) diff --git a/doc/source/user_guide/io/stata.rst b/doc/source/user_guide/io/stata.rst new file mode 100644 index 0000000000000..89f930525d3a8 --- /dev/null +++ b/doc/source/user_guide/io/stata.rst @@ -0,0 +1,171 @@ +.. _io.stata: + +============ +STATA format +============ + +.. _io.stata_writer: + +Writing to stata format +''''''''''''''''''''''' + +The method :func:`.DataFrame.to_stata` will write a DataFrame +into a .dta file. The format version of this file is always 115 (Stata 12). + +.. 
ipython:: python + + df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB")) + df.to_stata("stata.dta") + +*Stata* data files have limited data type support; only strings with +244 or fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` +and ``float64`` can be stored in ``.dta`` files. Additionally, +*Stata* reserves certain values to represent missing data. Exporting a +non-missing value that is outside of the permitted range in Stata for +a particular data type will retype the variable to the next larger +size. For example, ``int8`` values are restricted to lie between -127 +and 100 in Stata, and so variables with values above 100 will trigger +a conversion to ``int16``. ``nan`` values in floating point data +types are stored as the basic missing data type (``.`` in *Stata*). + +.. note:: + + It is not possible to export missing data values for integer data types. + + +The *Stata* writer gracefully handles other data types including ``int64``, +``bool``, ``uint8``, ``uint16``, ``uint32`` by casting to +the smallest supported type that can represent the data. For example, data +with a type of ``uint8`` will be cast to ``int8`` if all values are less than +100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are +outside of this range, the variable is cast to ``int16``. + + +.. warning:: + + Conversion from ``int64`` to ``float64`` may result in a loss of precision + if ``int64`` values are larger than 2**53. + +.. warning:: + + :class:`~pandas.io.stata.StataWriter` and + :func:`.DataFrame.to_stata` only support fixed width + strings containing up to 244 characters, a limitation imposed by the version + 115 dta file format. Attempting to write *Stata* dta files with strings + longer than 244 characters raises a ``ValueError``. + +.. 
_io.stata_reader: + +Reading from Stata format +''''''''''''''''''''''''' + +The top-level function ``read_stata`` will read a dta file and return +either a ``DataFrame`` or a :class:`pandas.api.typing.StataReader` that can +be used to read the file incrementally. + +.. ipython:: python + + pd.read_stata("stata.dta") + +Specifying a ``chunksize`` yields a +:class:`pandas.api.typing.StataReader` instance that can be used to +read ``chunksize`` lines from the file at a time. The ``StataReader`` +object can be used as an iterator. + +.. ipython:: python + + with pd.read_stata("stata.dta", chunksize=3) as reader: + for df in reader: + print(df.shape) + +For more fine-grained control, use ``iterator=True`` and specify +``chunksize`` with each call to +:func:`~pandas.io.stata.StataReader.read`. + +.. ipython:: python + + with pd.read_stata("stata.dta", iterator=True) as reader: + chunk1 = reader.read(5) + chunk2 = reader.read(5) + +Currently the ``index`` is retrieved as a column. + +The parameter ``convert_categoricals`` indicates whether value labels should be +read and used to create a ``Categorical`` variable from them. Value labels can +also be retrieved by the function ``value_labels``, which requires :func:`~pandas.io.stata.StataReader.read` +to be called before use. + +The parameter ``convert_missing`` indicates whether missing value +representations in Stata should be preserved. If ``False`` (the default), +missing values are represented as ``np.nan``. If ``True``, missing values are +represented using ``StataMissingValue`` objects, and columns containing missing +values will have ``object`` data type. + +.. note:: + + :func:`~pandas.read_stata` and + :class:`~pandas.io.stata.StataReader` support .dta formats 113-115 + (Stata 10-12), 117 (Stata 13), and 118 (Stata 14). + +.. note:: + + Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types: + ``int64`` for all integer types and ``float64`` for floating point data. 
By default, + the Stata data types are preserved when importing. + +.. note:: + + All :class:`~pandas.io.stata.StataReader` objects, whether created by :func:`~pandas.read_stata` + (when using ``iterator=True`` or ``chunksize``) or instantiated by hand, must be used as context + managers (e.g. the ``with`` statement). + While the :meth:`~pandas.io.stata.StataReader.close` method is available, its use is unsupported. + It is not part of the public API and will be removed in the future without warning. + +.. ipython:: python + :suppress: + + os.remove("stata.dta") + +.. _io.stata-categorical: + +Categorical data +++++++++++++++++ + +``Categorical`` data can be exported to *Stata* data files as value labeled data. +The exported data consists of the underlying category codes as integer data values +and the categories as value labels. *Stata* does not have an explicit equivalent +to a ``Categorical`` and information about *whether* the variable is ordered +is lost when exporting. + +.. warning:: + + *Stata* only supports string value labels, and so ``str`` is called on the + categories when exporting data. Exporting ``Categorical`` variables with + non-string categories produces a warning, and can result in a loss of + information if the ``str`` representations of the categories are not unique. + +Labeled data can similarly be imported from *Stata* data files as ``Categorical`` +variables using the keyword argument ``convert_categoricals`` (``True`` by default). +The keyword argument ``order_categoricals`` (``True`` by default) determines +whether imported ``Categorical`` variables are ordered. + +.. note:: + + When importing categorical data, the values of the variables in the *Stata* + data file are not preserved since ``Categorical`` variables always + use integer data types between ``-1`` and ``n-1`` where ``n`` is the number + of categories. 
If the original values in the *Stata* data file are required, + these can be imported by setting ``convert_categoricals=False``, which will + import original data (but not the variable labels). The original values can + be matched to the imported categorical data since there is a simple mapping + between the original *Stata* data values and the category codes of imported + Categorical variables: missing values are assigned code ``-1``, and the + smallest original value is assigned ``0``, the second smallest is assigned + ``1`` and so on until the largest original value is assigned the code ``n-1``. + +.. note:: + + *Stata* supports partially labeled series. These series have value labels for + some but not all data values. Importing a partially labeled series will produce + a ``Categorical`` with string categories for the values that are labeled and + numeric categories for values with no label. diff --git a/doc/source/user_guide/io/xml.rst b/doc/source/user_guide/io/xml.rst new file mode 100644 index 0000000000000..aa619eeefe149 --- /dev/null +++ b/doc/source/user_guide/io/xml.rst @@ -0,0 +1,548 @@ +=== +XML +=== + +.. _io.read_xml: + +Reading XML +''''''''''' + +.. versionadded:: 1.3.0 + +The top-level :func:`~pandas.io.xml.read_xml` function can accept an XML +string/file/URL and will parse nodes and attributes into a pandas ``DataFrame``. + +.. note:: + + Since there is no standard XML structure where design types can vary in + many ways, ``read_xml`` works best with flatter, shallow versions. If + an XML document is deeply nested, use the ``stylesheet`` feature to + transform XML into a flatter version. + +Let's look at a few examples. + +Read an XML string: + +.. ipython:: python + + from io import StringIO + xml = """ + + + Everyday Italian + Giada De Laurentiis + 2005 + 30.00 + + + Harry Potter + J K. Rowling + 2005 + 29.99 + + + Learning XML + Erik T. Ray + 2003 + 39.95 + + """ + + df = pd.read_xml(StringIO(xml)) + df + +Read a URL with no options: + +.. 
ipython:: python + + df = pd.read_xml("https://www.w3schools.com/xml/books.xml") + df + +Read in the content of the "books.xml" file and pass it to ``read_xml`` +as a string: + +.. ipython:: python + + from io import StringIO + + file_path = "books.xml" + with open(file_path, "w") as f: + f.write(xml) + + with open(file_path, "r") as f: + df = pd.read_xml(StringIO(f.read())) + df + +Read in the content of the "books.xml" as an instance of ``StringIO`` or +``BytesIO`` and pass it to ``read_xml``: + +.. ipython:: python + + from io import BytesIO, StringIO + + with open(file_path, "r") as f: + sio = StringIO(f.read()) + + df = pd.read_xml(sio) + df + +.. ipython:: python + + with open(file_path, "rb") as f: + bio = BytesIO(f.read()) + + df = pd.read_xml(bio) + df + +Even read XML from AWS S3 buckets such as NIH NCBI PMC Article Datasets providing +Biomedical and Life Science Journals: + +.. code-block:: python + + >>> df = pd.read_xml( + ... "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml", + ... xpath=".//journal-meta", + ...) + >>> df + journal-id journal-title issn publisher + 0 Cardiovasc Ultrasound Cardiovascular Ultrasound 1476-7120 NaN + +With `lxml`_ as default ``parser``, you access the full-featured XML library +that extends Python's ElementTree API. One powerful tool is the ability to query +nodes selectively or conditionally with more expressive XPath: + +.. _lxml: https://lxml.de + +.. ipython:: python + + df = pd.read_xml(file_path, xpath="//book[year=2005]") + df + +Specify only elements or only attributes to parse: + +.. ipython:: python + + df = pd.read_xml(file_path, elems_only=True) + df + +.. ipython:: python + + df = pd.read_xml(file_path, attrs_only=True) + df + +.. ipython:: python + :suppress: + + os.remove("books.xml") + +XML documents can have namespaces with prefixes and default namespaces without +prefixes both of which are denoted with a special attribute ``xmlns``. 
In order +to parse by node under a namespace context, ``xpath`` must reference a prefix. + +For example, below XML contains a namespace with prefix, ``doc``, and URI at +``https://example.com``. In order to parse ``doc:row`` nodes, +``namespaces`` must be used. + +.. ipython:: python + + from io import StringIO + + xml = """ + + + square + 360 + 4.0 + + + circle + 360 + + + + triangle + 180 + 3.0 + + """ + + df = pd.read_xml(StringIO(xml), + xpath="//doc:row", + namespaces={"doc": "https://example.com"}) + df + +Similarly, an XML document can have a default namespace without prefix. Failing +to assign a temporary prefix will return no nodes and raise a ``ValueError``. +But assigning *any* temporary name to correct URI allows parsing by nodes. + +.. ipython:: python + + from io import StringIO + + xml = """ + + + square + 360 + 4.0 + + + circle + 360 + + + + triangle + 180 + 3.0 + + """ + + df = pd.read_xml(StringIO(xml), + xpath="//pandas:row", + namespaces={"pandas": "https://example.com"}) + df + +However, if XPath does not reference node names such as default, ``/*``, then +``namespaces`` is not required. + +.. note:: + + Since ``xpath`` identifies the parent of content to be parsed, only immediate + descendants which include child nodes or current attributes are parsed. + Therefore, ``read_xml`` will not parse the text of grandchildren or other + descendants and will not parse attributes of any descendant. To retrieve + lower level content, adjust xpath to lower level. For example, + + .. ipython:: python + :okwarning: + + from io import StringIO + + xml = """ + + + square + 360 + + + circle + 360 + + + triangle + 180 + + """ + + df = pd.read_xml(StringIO(xml), xpath="./row") + df + + shows the attribute ``sides`` on ``shape`` element was not parsed as + expected since this attribute resides on the child of ``row`` element + and not ``row`` element itself. In other words, ``sides`` attribute is a + grandchild level descendant of ``row`` element. 
However, the ``xpath`` + targets ``row`` element which covers only its children and attributes. + +With `lxml`_ as parser, you can flatten nested XML documents with an XSLT +script which also can be string/file/URL types. As background, `XSLT`_ is +a special-purpose language written in a special XML file that can transform +original XML documents into other XML, HTML, even text (CSV, JSON, etc.) +using an XSLT processor. + +.. _lxml: https://lxml.de +.. _XSLT: https://www.w3.org/TR/xslt/ + +For example, consider this somewhat nested structure of Chicago "L" Rides +where station and rides elements encapsulate data in their own sections. +With the XSLT below, ``lxml`` can transform the original nested document into a flatter +output (as shown below for demonstration) for easier parsing into a ``DataFrame``: + +.. ipython:: python + + from io import StringIO + + xml = """ + + + + 2020-09-01T00:00:00 + + 864.2 + 534 + 417.2 + + + + + 2020-09-01T00:00:00 + + 2707.4 + 1909.8 + 1438.6 + + + + + 2020-09-01T00:00:00 + + 2949.6 + 1657 + 1453.8 + + + """ + + xsl = """ + + + + + + + + + + + + + + + """ + + output = """ + + + 40850 + Library + 2020-09-01T00:00:00 + 864.2 + 534 + 417.2 + + + 41700 + Washington/Wabash + 2020-09-01T00:00:00 + 2707.4 + 1909.8 + 1438.6 + + + 40380 + Clark/Lake + 2020-09-01T00:00:00 + 2949.6 + 1657 + 1453.8 + + """ + + df = pd.read_xml(StringIO(xml), stylesheet=StringIO(xsl)) + df + +For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml` +supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_ +which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes +without holding the entire tree in memory. + +.. versionadded:: 1.5.0 + +.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk +.. 
_`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse + +To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument. +Files should not be compressed or point to online sources but stored on local disk. Also, ``iterparse`` should be +a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of +any element or attribute that is a descendant (i.e., child, grandchild) of the repeating node. Since XPath is not +used in this method, descendants do not need to share the same relationship with one another. Below is an example +of reading in Wikipedia's very large (12 GB+) latest article data dump. + +.. code-block:: ipython + + In [1]: df = pd.read_xml( + ... "/path/to/downloaded/enwikisource-latest-pages-articles.xml", + ... iterparse={"page": ["title", "ns", "id"]} + ... ) + ... df + Out[2]: + title ns id + 0 Gettysburg Address 0 21450 + 1 Main Page 0 42950 + 2 Declaration by United Nations 0 8435 + 3 Constitution of the United States of America 0 8435 + 4 Declaration of Independence (Israel) 0 17858 + ... ... ... ... + 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649 + 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649 + 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649 + 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291 + 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450 + + [3578765 rows x 3 columns] + +.. _io.xml: + +Writing XML +''''''''''' + +.. versionadded:: 1.3.0 + +``DataFrame`` objects have an instance method ``to_xml`` which renders the +contents of the ``DataFrame`` as an XML document. + +.. note:: + + This method does not support special properties of XML including DTD, + CData, XSD schemas, processing instructions, comments, and others. + Only namespaces at the root level are supported. 
However, ``stylesheet`` + allows design changes after initial output. + +Let's look at a few examples. + +Write an XML without options: + +.. ipython:: python + + geom_df = pd.DataFrame( + { + "shape": ["square", "circle", "triangle"], + "degrees": [360, 360, 180], + "sides": [4, np.nan, 3], + } + ) + + print(geom_df.to_xml()) + + +Write an XML with new root and row name: + +.. ipython:: python + + print(geom_df.to_xml(root_name="geometry", row_name="objects")) + +Write an attribute-centric XML: + +.. ipython:: python + + print(geom_df.to_xml(attr_cols=geom_df.columns.tolist())) + +Write a mix of elements and attributes: + +.. ipython:: python + + print( + geom_df.to_xml( + index=False, + attr_cols=['shape'], + elem_cols=['degrees', 'sides']) + ) + +Any ``DataFrames`` with hierarchical columns will be flattened for XML element names +with levels delimited by underscores: + +.. ipython:: python + + ext_geom_df = pd.DataFrame( + { + "type": ["polygon", "other", "polygon"], + "shape": ["square", "circle", "triangle"], + "degrees": [360, 360, 180], + "sides": [4, np.nan, 3], + } + ) + + pvt_df = ext_geom_df.pivot_table(index='shape', + columns='type', + values=['degrees', 'sides'], + aggfunc='sum') + pvt_df + + print(pvt_df.to_xml()) + +Write an XML with default namespace: + +.. ipython:: python + + print(geom_df.to_xml(namespaces={"": "https://example.com"})) + +Write an XML with namespace prefix: + +.. ipython:: python + + print( + geom_df.to_xml(namespaces={"doc": "https://example.com"}, + prefix="doc") + ) + +Write an XML without declaration or pretty print: + +.. ipython:: python + + print( + geom_df.to_xml(xml_declaration=False, + pretty_print=False) + ) + +Write an XML and transform with stylesheet: + +.. ipython:: python + + from io import StringIO + + xsl = """ + + + + + + + + + + + polygon + + + + + + + + """ + + print(geom_df.to_xml(stylesheet=StringIO(xsl))) + + +XML Final Notes +''''''''''''''' + +* All XML documents adhere to `W3C specifications`_. 
Both ``etree`` and ``lxml`` + parsers will fail to parse any markup document that is not well-formed or + does not follow XML syntax rules. Do be aware HTML is not an XML document unless it + follows XHTML specs. However, other popular markup types including KML, XAML, + RSS, MusicML, MathML are compliant `XML schemas`_. + +* For the above reason, if your application builds XML prior to pandas operations, + use appropriate DOM libraries like ``etree`` and ``lxml`` to build the necessary + document and not by string concatenation or regex adjustments. Always remember + XML is a *special* text file with markup rules. + +* With very large XML files (several hundred MBs to GBs), XPath and XSLT + can become memory-intensive operations. Be sure to have enough available + RAM for reading and writing to large XML files (roughly 5 times the + size of text). + +* Because XSLT is a programming language, use it with caution since such scripts + can pose a security risk in your environment and can run large or infinite + recursive operations. Always test scripts on small fragments before a full run. + +* The `etree`_ parser supports all functionality of both ``read_xml`` and + ``to_xml`` except for complex XPath and any XSLT. Though limited in features, + ``etree`` is still a reliable and capable parser and tree builder. Its + performance may trail ``lxml`` to a certain degree for larger files but + the difference is relatively unnoticeable on small to medium size files. + +.. _`W3C specifications`: https://www.w3.org/TR/xml/ +.. _`XML schemas`: https://en.wikipedia.org/wiki/List_of_types_of_XML_schemas +.. 
_`etree`: https://docs.python.org/3/library/xml.etree.elementtree.html diff --git a/doc/source/whatsnew/v1.4.0.rst b/doc/source/whatsnew/v1.4.0.rst index 7b1aef07e5f00..eba38b8b1bdf7 100644 --- a/doc/source/whatsnew/v1.4.0.rst +++ b/doc/source/whatsnew/v1.4.0.rst @@ -119,7 +119,7 @@ Multi-threaded CSV reading with a new CSV Engine based on pyarrow :func:`pandas.read_csv` now accepts ``engine="pyarrow"`` (requires at least ``pyarrow`` 1.0.1) as an argument, allowing for faster csv parsing on multicore -machines with pyarrow installed. See the :doc:`I/O docs ` for +machines with pyarrow installed. See the :doc:`I/O docs ` for more info. (:issue:`23697`, :issue:`43706`) .. _whatsnew_140.enhancements.window_rank: From 6eab1f2c74d2c877527a73390d58b79a2e227ea3 Mon Sep 17 00:00:00 2001 From: RedGuy12 Date: Mon, 10 Feb 2025 05:39:41 +0000 Subject: [PATCH 2/4] Merge branch 'main' into gh-10446 --- .github/workflows/unit-tests.yml | 32 +- .pre-commit-config.yaml | 10 +- asv_bench/benchmarks/io/style.py | 4 +- doc/make.py | 6 +- doc/source/user_guide/cookbook.rst | 4 +- doc/source/user_guide/enhancingperf.rst | 2 +- doc/source/user_guide/groupby.rst | 2 +- doc/source/user_guide/merging.rst | 4 +- doc/source/user_guide/pyarrow.rst | 2 +- doc/source/user_guide/scale.rst | 2 +- doc/source/user_guide/style.ipynb | 2 +- doc/source/user_guide/timeseries.rst | 4 +- doc/source/whatsnew/v3.0.0.rst | 7 +- pandas/_config/config.py | 14 + pandas/_libs/groupby.pyi | 5 + pandas/_libs/groupby.pyx | 99 +++++-- pandas/_libs/interval.pyx | 6 + pandas/_libs/tslibs/period.pyx | 11 + pandas/_typing.py | 4 +- pandas/core/_numba/kernels/min_max_.py | 8 +- pandas/core/_numba/kernels/var_.py | 7 +- pandas/core/apply.py | 3 +- pandas/core/arraylike.py | 2 +- pandas/core/arrays/base.py | 6 +- pandas/core/arrays/datetimes.py | 3 +- pandas/core/base.py | 5 + pandas/core/common.py | 2 +- pandas/core/computation/eval.py | 2 +- pandas/core/computation/expr.py | 2 +- pandas/core/computation/ops.py | 3 +- 
pandas/core/computation/parsing.py | 44 --- pandas/core/construction.py | 4 + pandas/core/dtypes/cast.py | 2 +- pandas/core/dtypes/common.py | 2 + pandas/core/dtypes/dtypes.py | 3 +- pandas/core/frame.py | 36 ++- pandas/core/generic.py | 3 +- pandas/core/groupby/groupby.py | 117 ++++++-- pandas/core/groupby/grouper.py | 3 +- pandas/core/indexers/objects.py | 6 +- pandas/core/indexing.py | 12 +- pandas/core/interchange/buffer.py | 3 +- pandas/core/internals/blocks.py | 3 +- pandas/core/internals/construction.py | 3 +- pandas/core/nanops.py | 11 +- pandas/core/ops/array_ops.py | 2 +- pandas/core/resample.py | 98 +++++- pandas/core/reshape/encoding.py | 3 +- pandas/core/reshape/merge.py | 103 ++++++- pandas/core/reshape/reshape.py | 2 + pandas/core/reshape/tile.py | 4 +- pandas/core/series.py | 2 +- pandas/core/tools/datetimes.py | 6 +- pandas/core/window/rolling.py | 24 +- pandas/io/excel/_odswriter.py | 2 +- pandas/io/formats/format.py | 10 +- pandas/io/formats/printing.py | 6 +- pandas/io/formats/style.py | 22 +- pandas/io/formats/style_render.py | 11 +- pandas/io/formats/xml.py | 6 +- pandas/io/json/_json.py | 2 +- pandas/io/orc.py | 7 + pandas/io/parsers/base_parser.py | 3 +- pandas/io/parsers/python_parser.py | 6 +- pandas/io/parsers/readers.py | 6 +- pandas/io/pytables.py | 9 + pandas/io/sas/sas_xport.py | 9 +- pandas/plotting/_core.py | 23 +- pandas/plotting/_matplotlib/boxplot.py | 3 +- pandas/plotting/_misc.py | 7 + pandas/tests/apply/test_frame_apply.py | 10 + pandas/tests/apply/test_numba.py | 12 +- pandas/tests/arrays/interval/test_formats.py | 4 +- pandas/tests/dtypes/cast/test_downcast.py | 4 +- pandas/tests/dtypes/test_common.py | 18 +- pandas/tests/dtypes/test_dtypes.py | 3 +- pandas/tests/dtypes/test_missing.py | 2 +- pandas/tests/extension/base/getitem.py | 4 +- pandas/tests/extension/decimal/array.py | 1 - pandas/tests/extension/json/array.py | 3 +- pandas/tests/extension/list/array.py | 3 +- pandas/tests/extension/test_arrow.py | 3 +- 
pandas/tests/frame/methods/test_info.py | 8 +- pandas/tests/frame/methods/test_join.py | 15 +- pandas/tests/frame/methods/test_sample.py | 3 +- pandas/tests/frame/methods/test_set_axis.py | 2 +- pandas/tests/frame/test_reductions.py | 38 +++ pandas/tests/frame/test_stack_unstack.py | 19 ++ pandas/tests/generic/test_finalize.py | 2 +- pandas/tests/groupby/aggregate/test_numba.py | 14 +- pandas/tests/groupby/test_api.py | 18 +- pandas/tests/groupby/test_apply.py | 2 +- pandas/tests/groupby/test_categorical.py | 2 +- pandas/tests/groupby/test_groupby.py | 2 +- pandas/tests/groupby/test_numba.py | 13 +- pandas/tests/groupby/test_raises.py | 5 +- pandas/tests/groupby/test_reductions.py | 141 +++++++++ pandas/tests/groupby/transform/test_numba.py | 12 +- .../indexes/categorical/test_indexing.py | 6 +- .../indexes/datetimes/methods/test_round.py | 6 +- .../tests/indexes/datetimes/test_formats.py | 13 +- .../tests/indexes/datetimes/test_indexing.py | 6 +- .../indexes/interval/test_constructors.py | 9 +- pandas/tests/indexes/interval/test_formats.py | 7 +- pandas/tests/indexes/multi/test_indexing.py | 3 +- pandas/tests/indexes/numeric/test_indexing.py | 3 +- pandas/tests/indexes/period/test_formats.py | 3 +- pandas/tests/indexes/period/test_indexing.py | 3 +- pandas/tests/indexes/test_base.py | 3 +- pandas/tests/indexes/test_index_new.py | 3 +- .../tests/indexes/timedeltas/test_indexing.py | 3 +- pandas/tests/indexing/test_iloc.py | 3 +- pandas/tests/io/excel/test_readers.py | 3 +- pandas/tests/io/excel/test_style.py | 6 +- pandas/tests/io/formats/style/test_style.py | 2 +- pandas/tests/io/formats/test_css.py | 3 +- pandas/tests/io/formats/test_printing.py | 3 + pandas/tests/io/formats/test_to_csv.py | 5 +- pandas/tests/io/formats/test_to_html.py | 3 +- pandas/tests/io/formats/test_to_markdown.py | 6 +- pandas/tests/io/formats/test_to_string.py | 43 +-- pandas/tests/io/json/test_pandas.py | 12 +- pandas/tests/io/json/test_readlines.py | 9 +- 
pandas/tests/io/json/test_ujson.py | 74 ++--- .../io/parser/common/test_read_errors.py | 3 +- pandas/tests/io/parser/test_mangle_dupes.py | 2 +- pandas/tests/io/parser/test_parse_dates.py | 4 +- pandas/tests/io/pytables/test_append.py | 35 ++- pandas/tests/io/pytables/test_round_trip.py | 9 +- pandas/tests/io/pytables/test_store.py | 2 +- pandas/tests/io/test_pickle.py | 49 --- pandas/tests/io/xml/test_xml.py | 3 +- pandas/tests/plotting/test_series.py | 2 +- pandas/tests/resample/test_time_grouper.py | 2 +- pandas/tests/reshape/merge/test_merge.py | 5 +- .../reshape/merge/test_merge_antijoin.py | 280 ++++++++++++++++++ .../tests/reshape/merge/test_merge_cross.py | 6 +- pandas/tests/scalar/period/test_period.py | 5 - .../scalar/timedelta/test_constructors.py | 3 +- .../scalar/timestamp/methods/test_round.py | 1 - pandas/tests/series/methods/test_between.py | 3 +- pandas/tests/series/test_constructors.py | 7 + pandas/tests/test_downstream.py | 10 +- pandas/tests/tools/test_to_datetime.py | 4 +- pandas/tests/tools/test_to_numeric.py | 6 +- pandas/tests/tseries/offsets/test_offsets.py | 6 +- pandas/tests/tseries/offsets/test_ticks.py | 3 +- pandas/tests/tslibs/test_parsing.py | 5 +- pandas/tests/window/test_numba.py | 12 +- pandas/tests/window/test_online.py | 13 +- pandas/tseries/frequencies.py | 3 +- pyproject.toml | 6 +- web/pandas/community/ecosystem.md | 32 +- 153 files changed, 1404 insertions(+), 611 deletions(-) create mode 100644 pandas/tests/reshape/merge/test_merge_antijoin.py diff --git a/.github/workflows/unit-tests.yml b/.github/workflows/unit-tests.yml index 842629ba331d6..08c41a1eeb21f 100644 --- a/.github/workflows/unit-tests.yml +++ b/.github/workflows/unit-tests.yml @@ -107,7 +107,7 @@ jobs: services: mysql: - image: mysql:8 + image: mysql:9 env: MYSQL_ALLOW_EMPTY_PASSWORD: yes MYSQL_DATABASE: pandas @@ -120,7 +120,7 @@ jobs: - 3306:3306 postgres: - image: postgres:16 + image: postgres:17 env: PGUSER: postgres POSTGRES_USER: postgres @@ -135,7 
+135,7 @@ jobs: - 5432:5432 moto: - image: motoserver/moto:5.0.0 + image: motoserver/moto:5.0.27 env: AWS_ACCESS_KEY_ID: foobar_key AWS_SECRET_ACCESS_KEY: foobar_secret @@ -242,15 +242,14 @@ jobs: - name: Build environment and Run Tests # https://github.com/numpy/numpy/issues/24703#issuecomment-1722379388 run: | - /opt/python/cp311-cp311/bin/python -m venv ~/virtualenvs/pandas-dev + /opt/python/cp313-cp313/bin/python -m venv ~/virtualenvs/pandas-dev . ~/virtualenvs/pandas-dev/bin/activate python -m pip install --no-cache-dir -U pip wheel setuptools meson[ninja]==1.2.1 meson-python==0.13.1 python -m pip install numpy -Csetup-args="-Dallow-noblas=true" python -m pip install --no-cache-dir versioneer[toml] cython python-dateutil pytest>=7.3.2 pytest-xdist>=3.4.0 hypothesis>=6.84.0 python -m pip install --no-cache-dir --no-build-isolation -e . -Csetup-args="--werror" python -m pip list --no-cache-dir - export PANDAS_CI=1 - python -m pytest -m 'not slow and not network and not clipboard and not single_cpu' pandas --junitxml=test-data.xml + PANDAS_CI=1 python -m pytest -m 'not slow and not network and not clipboard and not single_cpu' pandas --junitxml=test-data.xml concurrency: # https://github.community/t/concurrecy-not-work-for-push/183068/7 group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-32bit @@ -259,7 +258,7 @@ jobs: Linux-Musl: runs-on: ubuntu-22.04 container: - image: quay.io/pypa/musllinux_1_1_x86_64 + image: quay.io/pypa/musllinux_1_2_x86_64 steps: - name: Checkout pandas Repo # actions/checkout does not work since it requires node @@ -281,7 +280,7 @@ jobs: apk add musl-locales - name: Build environment run: | - /opt/python/cp311-cp311/bin/python -m venv ~/virtualenvs/pandas-dev + /opt/python/cp313-cp313/bin/python -m venv ~/virtualenvs/pandas-dev . 
~/virtualenvs/pandas-dev/bin/activate python -m pip install --no-cache-dir -U pip wheel setuptools meson-python==0.13.1 meson[ninja]==1.2.1 python -m pip install --no-cache-dir versioneer[toml] cython numpy python-dateutil pytest>=7.3.2 pytest-xdist>=3.4.0 hypothesis>=6.84.0 @@ -291,8 +290,7 @@ jobs: - name: Run Tests run: | . ~/virtualenvs/pandas-dev/bin/activate - export PANDAS_CI=1 - python -m pytest -m 'not slow and not network and not clipboard and not single_cpu' pandas --junitxml=test-data.xml + PANDAS_CI=1 python -m pytest -m 'not slow and not network and not clipboard and not single_cpu' pandas --junitxml=test-data.xml concurrency: # https://github.community/t/concurrecy-not-work-for-push/183068/7 group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-musl @@ -357,8 +355,7 @@ jobs: python --version python -m pip install --upgrade pip setuptools wheel meson[ninja]==1.2.1 meson-python==0.13.1 python -m pip install --pre --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple numpy - python -m pip install versioneer[toml] - python -m pip install python-dateutil tzdata cython hypothesis>=6.84.0 pytest>=7.3.2 pytest-xdist>=3.4.0 pytest-cov + python -m pip install versioneer[toml] python-dateutil tzdata cython hypothesis>=6.84.0 pytest>=7.3.2 pytest-xdist>=3.4.0 pytest-cov python -m pip install -ve . 
--no-build-isolation --no-index --no-deps -Csetup-args="--werror" python -m pip list @@ -375,7 +372,7 @@ jobs: concurrency: # https://github.community/t/concurrecy-not-work-for-push/183068/7 - group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.os }}-python-freethreading-dev + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-python-freethreading-dev cancel-in-progress: true env: @@ -396,14 +393,11 @@ jobs: nogil: true - name: Build Environment - # TODO: Once numpy 2.2.1 is out, don't install nightly version - # Tests segfault with numpy 2.2.0: https://github.com/numpy/numpy/pull/27955 run: | python --version - python -m pip install --upgrade pip setuptools wheel meson[ninja]==1.2.1 meson-python==0.13.1 - python -m pip install --pre --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple cython numpy - python -m pip install versioneer[toml] - python -m pip install python-dateutil pytz tzdata hypothesis>=6.84.0 pytest>=7.3.2 pytest-xdist>=3.4.0 pytest-cov + python -m pip install --upgrade pip setuptools wheel numpy meson[ninja]==1.2.1 meson-python==0.13.1 + python -m pip install --pre --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple cython + python -m pip install versioneer[toml] python-dateutil pytz tzdata hypothesis>=6.84.0 pytest>=7.3.2 pytest-xdist>=3.4.0 pytest-cov python -m pip install -ve . 
--no-build-isolation --no-index --no-deps -Csetup-args="--werror" python -m pip list diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 1dd8dfc54111e..77bcadf57dd2d 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -19,7 +19,7 @@ ci: skip: [pyright, mypy] repos: - repo: https://github.com/astral-sh/ruff-pre-commit - rev: v0.8.6 + rev: v0.9.4 hooks: - id: ruff args: [--exit-non-zero-on-fix] @@ -41,7 +41,7 @@ repos: pass_filenames: true require_serial: false - repo: https://github.com/codespell-project/codespell - rev: v2.3.0 + rev: v2.4.1 hooks: - id: codespell types_or: [python, rst, markdown, cython, c] @@ -70,7 +70,7 @@ repos: - id: trailing-whitespace args: [--markdown-linebreak-ext=md] - repo: https://github.com/PyCQA/isort - rev: 5.13.2 + rev: 6.0.0 hooks: - id: isort - repo: https://github.com/asottile/pyupgrade @@ -95,14 +95,14 @@ repos: - id: sphinx-lint args: ["--enable", "all", "--disable", "line-too-long"] - repo: https://github.com/pre-commit/mirrors-clang-format - rev: v19.1.6 + rev: v19.1.7 hooks: - id: clang-format files: ^pandas/_libs/src|^pandas/_libs/include args: [-i] types_or: [c, c++] - repo: https://github.com/trim21/pre-commit-mirror-meson - rev: v1.6.1 + rev: v1.7.0 hooks: - id: meson-fmt args: ['--inplace'] diff --git a/asv_bench/benchmarks/io/style.py b/asv_bench/benchmarks/io/style.py index 24fd8a0d20aba..0486cabb29845 100644 --- a/asv_bench/benchmarks/io/style.py +++ b/asv_bench/benchmarks/io/style.py @@ -13,8 +13,8 @@ class Render: def setup(self, cols, rows): self.df = DataFrame( np.random.randn(rows, cols), - columns=[f"float_{i+1}" for i in range(cols)], - index=[f"row_{i+1}" for i in range(rows)], + columns=[f"float_{i + 1}" for i in range(cols)], + index=[f"row_{i + 1}" for i in range(rows)], ) def time_apply_render(self, cols, rows): diff --git a/doc/make.py b/doc/make.py index 02deb5002fea1..9542563dc037b 100755 --- a/doc/make.py +++ b/doc/make.py @@ -260,8 +260,7 @@ def latex(self, 
force=False): for i in range(3): self._run_os("pdflatex", "-interaction=nonstopmode", "pandas.tex") raise SystemExit( - "You should check the file " - '"build/latex/pandas.pdf" for problems.' + 'You should check the file "build/latex/pandas.pdf" for problems.' ) self._run_os("make") return ret_code @@ -343,8 +342,7 @@ def main(): dest="verbosity", default=0, help=( - "increase verbosity (can be repeated), " - "passed to the sphinx build command" + "increase verbosity (can be repeated), passed to the sphinx build command" ), ) argparser.add_argument( diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst index b2b5c5cc1014e..91a0b4a4fe967 100644 --- a/doc/source/user_guide/cookbook.rst +++ b/doc/source/user_guide/cookbook.rst @@ -874,7 +874,7 @@ Timeseries `__ `Aggregation and plotting time series -`__ +`__ Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a time series. `How to rearrange a Python pandas DataFrame? @@ -1043,7 +1043,7 @@ CSV The :ref:`CSV ` docs -`read_csv in action `__ +`read_csv in action `__ `appending to a csv `__ diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst index c4721f3a6b09c..e55a6cda47ac2 100644 --- a/doc/source/user_guide/enhancingperf.rst +++ b/doc/source/user_guide/enhancingperf.rst @@ -427,7 +427,7 @@ prefer that Numba throw an error if it cannot compile a function in a way that speeds up your code, pass Numba the argument ``nopython=True`` (e.g. ``@jit(nopython=True)``). For more on troubleshooting Numba modes, see the `Numba troubleshooting page -`__. +`__. Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe behavior. 
You can first `specify a safe threading layer `__ diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 4a32381a7de47..4ec34db6ed959 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -418,7 +418,7 @@ You can also include the grouping columns if you want to operate on them. .. note:: - The ``groupby`` operation in Pandas drops the ``name`` field of the columns Index object + The ``groupby`` operation in pandas drops the ``name`` field of the columns Index object after the operation. This change ensures consistency in syntax between different column selection methods within groupby operations. diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst index cfd2f40aa93a3..fb707674b4dbf 100644 --- a/doc/source/user_guide/merging.rst +++ b/doc/source/user_guide/merging.rst @@ -586,7 +586,7 @@ A string argument to ``indicator`` will use the value as the name for the indica Overlapping value columns ~~~~~~~~~~~~~~~~~~~~~~~~~ -The merge ``suffixes`` argument takes a tuple of list of strings to append to +The merge ``suffixes`` argument takes a tuple or list of strings to append to overlapping column names in the input :class:`DataFrame` to disambiguate the result columns: @@ -979,7 +979,7 @@ nearest key rather than equal keys. For each row in the ``left`` :class:`DataFra the last row in the ``right`` :class:`DataFrame` are selected where the ``on`` key is less than the left's key. Both :class:`DataFrame` must be sorted by the key. -Optionally an :func:`merge_asof` can perform a group-wise merge by matching the +Optionally :func:`merge_asof` can perform a group-wise merge by matching the ``by`` key in addition to the nearest match on the ``on`` key. .. 
ipython:: python diff --git a/doc/source/user_guide/pyarrow.rst b/doc/source/user_guide/pyarrow.rst index aecbce0441b53..1807341530e69 100644 --- a/doc/source/user_guide/pyarrow.rst +++ b/doc/source/user_guide/pyarrow.rst @@ -22,7 +22,7 @@ Data Structure Integration A :class:`Series`, :class:`Index`, or the columns of a :class:`DataFrame` can be directly backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by -``[pyarrow]``, e.g. ``"int64[pyarrow]""`` into the ``dtype`` parameter +``[pyarrow]``, e.g. ``"int64[pyarrow]"`` into the ``dtype`` parameter .. ipython:: python diff --git a/doc/source/user_guide/scale.rst b/doc/source/user_guide/scale.rst index 29df2994fbc35..d12993f7ead4b 100644 --- a/doc/source/user_guide/scale.rst +++ b/doc/source/user_guide/scale.rst @@ -5,7 +5,7 @@ Scaling to large datasets ************************* pandas provides data structures for in-memory analytics, which makes using pandas -to analyze datasets that are larger than memory datasets somewhat tricky. Even datasets +to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies. 
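Reviewer note on the `scale.rst` hunk above: the reworded sentence says that datasets larger than memory are tricky to analyze with pandas. One common workaround is to stream the data in pieces. The snippet below is a minimal sketch of that idea, not part of this patch. The in-memory CSV text and the column names `key` and `value` are made up for illustration; `chunksize` is the existing `read_csv` parameter.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file too large to load at once;
# in practice this would be a path on disk.
csv_text = "key,value\n" + "\n".join(f"{i % 3},{i}" for i in range(10))

total = 0
row_count = 0

# With chunksize set, read_csv yields DataFrames of at most 4 rows each
# instead of returning one large DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    row_count += len(chunk)

print(total, row_count)
```

Only one chunk is held in memory at a time, so this pattern suits reductions (sums, counts) that can be combined across chunks. Operations that need all rows at once would require a different approach.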
diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb index abb7181fc8d72..9cda1486eb48b 100644 --- a/doc/source/user_guide/style.ipynb +++ b/doc/source/user_guide/style.ipynb @@ -1288,7 +1288,7 @@ "outputs": [], "source": [ "df2.loc[:4].style.highlight_max(\n", - " axis=1, props=(\"color:white; \" \"font-weight:bold; \" \"background-color:darkblue;\")\n", + " axis=1, props=(\"color:white; font-weight:bold; background-color:darkblue;\")\n", ")" ] }, diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst index 4299dca4774b9..d046d13f71daf 100644 --- a/doc/source/user_guide/timeseries.rst +++ b/doc/source/user_guide/timeseries.rst @@ -1580,7 +1580,7 @@ the pandas objects. ts = ts[:5] ts.shift(1) -The ``shift`` method accepts an ``freq`` argument which can accept a +The ``shift`` method accepts a ``freq`` argument which can accept a ``DateOffset`` class or other ``timedelta``-like object or also an :ref:`offset alias `. @@ -2570,7 +2570,7 @@ because daylight savings time (DST) in a local time zone causes some times to oc twice within one day ("clocks fall back"). The following options are available: * ``'raise'``: Raises a ``ValueError`` (the default behavior) -* ``'infer'``: Attempt to determine the correct offset base on the monotonicity of the timestamps +* ``'infer'``: Attempt to determine the correct offset based on the monotonicity of the timestamps * ``'NaT'``: Replaces ambiguous times with ``NaT`` * ``bool``: ``True`` represents a DST time, ``False`` represents non-DST time. An array-like of ``bool`` values is supported for a sequence of times. 
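Reviewer note on the `timeseries.rst` hunk above: the corrected bullet describes `'infer'` as determining the offset based on the monotonicity of the timestamps. A short sketch of how the listed `ambiguous` options behave may help readers of the patch. The dates are chosen for illustration (US/Eastern clocks fell back on 2011-11-06), and the broad `except` is an assumption made because the exception class raised by the default option has differed between pandas versions.

```python
import pandas as pd

# On 2011-11-06, US/Eastern clocks fell back at 02:00, so 01:00 occurs twice.
naive = pd.DatetimeIndex(
    ["2011-11-06 00:00", "2011-11-06 01:00", "2011-11-06 01:00", "2011-11-06 02:00"]
)

# 'infer': use the order of the timestamps to decide which 01:00 is DST.
inferred = naive.tz_localize("US/Eastern", ambiguous="infer")

# 'NaT': mark ambiguous entries as missing instead of guessing.
with_nat = naive.tz_localize("US/Eastern", ambiguous="NaT")

# Default ('raise'): refuse to guess. Caught broadly here because the
# exact exception class depends on the pandas version.
try:
    naive.tz_localize("US/Eastern")
    raised = False
except Exception:
    raised = True

print(inferred)
print(with_nat)
print(raised)
```

With `'infer'`, the first 01:00 is treated as DST and the second as standard time, which only works when the input is ordered so that the repeated hour appears consecutively.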
diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst index 64f4a66a109f5..570faa00e97a8 100644 --- a/doc/source/whatsnew/v3.0.0.rst +++ b/doc/source/whatsnew/v3.0.0.rst @@ -35,6 +35,7 @@ Other enhancements - :class:`pandas.api.typing.NoDefault` is available for typing ``no_default`` - :func:`DataFrame.to_excel` now raises an ``UserWarning`` when the character count in a cell exceeds Excel's limitation of 32767 characters (:issue:`56954`) - :func:`pandas.merge` now validates the ``how`` parameter input (merge type) (:issue:`59435`) +- :func:`pandas.merge`, :meth:`DataFrame.merge` and :meth:`DataFrame.join` now support anti joins (``left_anti`` and ``right_anti``) in the ``how`` parameter (:issue:`42916`) - :func:`read_spss` now supports kwargs to be passed to pyreadstat (:issue:`56356`) - :func:`read_stata` now returns ``datetime64`` resolutions better matching those natively stored in the stata format (:issue:`55642`) - :meth:`DataFrame.agg` called with ``axis=1`` and a ``func`` which relabels the result index now raises a ``NotImplementedError`` (:issue:`58807`). 
@@ -59,15 +60,16 @@ Other enhancements - :meth:`Series.cummin` and :meth:`Series.cummax` now supports :class:`CategoricalDtype` (:issue:`52335`) - :meth:`Series.plot` now correctly handle the ``ylabel`` parameter for pie charts, allowing for explicit control over the y-axis label (:issue:`58239`) - :meth:`DataFrame.plot.scatter` argument ``c`` now accepts a column of strings, where rows with the same string are colored identically (:issue:`16827` and :issue:`16485`) +- :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` methods ``sum``, ``mean``, ``median``, ``prod``, ``min``, ``max``, ``std``, ``var`` and ``sem`` now accept ``skipna`` parameter (:issue:`15675`) - :class:`Rolling` and :class:`Expanding` now support aggregations ``first`` and ``last`` (:issue:`33155`) - :func:`read_parquet` accepts ``to_pandas_kwargs`` which are forwarded to :meth:`pyarrow.Table.to_pandas` which enables passing additional keywords to customize the conversion to pandas, such as ``maps_as_pydicts`` to read the Parquet map data type as python dictionaries (:issue:`56842`) -- :meth:`.DataFrameGroupBy.mean`, :meth:`.DataFrameGroupBy.sum`, :meth:`.SeriesGroupBy.mean` and :meth:`.SeriesGroupBy.sum` now accept ``skipna`` parameter (:issue:`15675`) - :meth:`.DataFrameGroupBy.transform`, :meth:`.SeriesGroupBy.transform`, :meth:`.DataFrameGroupBy.agg`, :meth:`.SeriesGroupBy.agg`, :meth:`.SeriesGroupBy.apply`, :meth:`.DataFrameGroupBy.apply` now support ``kurt`` (:issue:`40139`) - :meth:`DataFrameGroupBy.transform`, :meth:`SeriesGroupBy.transform`, :meth:`DataFrameGroupBy.agg`, :meth:`SeriesGroupBy.agg`, :meth:`RollingGroupby.apply`, :meth:`ExpandingGroupby.apply`, :meth:`Rolling.apply`, :meth:`Expanding.apply`, :meth:`DataFrame.apply` with ``engine="numba"`` now supports positional arguments passed as kwargs (:issue:`58995`) - :meth:`Rolling.agg`, :meth:`Expanding.agg` and :meth:`ExponentialMovingWindow.agg` now accept :class:`NamedAgg` aggregations through ``**kwargs`` (:issue:`28333`) - 
:meth:`Series.map` can now accept kwargs to pass on to func (:issue:`59814`) - :meth:`Series.str.get_dummies` now accepts a ``dtype`` parameter to specify the dtype of the resulting DataFrame (:issue:`47872`) - :meth:`pandas.concat` will raise a ``ValueError`` when ``ignore_index=True`` and ``keys`` is not ``None`` (:issue:`59274`) +- :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`) - Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`) - Multiplying two :class:`DateOffset` objects will now raise a ``TypeError`` instead of a ``RecursionError`` (:issue:`59442`) - Restore support for reading Stata 104-format and enable reading 103-format dta files (:issue:`58554`) @@ -631,6 +633,7 @@ Datetimelike - Bug in :func:`date_range` where using a negative frequency value would not include all points between the start and end values (:issue:`56147`) - Bug in :func:`tseries.api.guess_datetime_format` would fail to infer time format when "%Y" == "%H%M" (:issue:`57452`) - Bug in :func:`tseries.frequencies.to_offset` would fail to parse frequency strings starting with "LWOM" (:issue:`59218`) +- Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` casting ``datetime64`` and ``timedelta64`` columns to ``float64`` and losing precision (:issue:`60850`) - Bug in :meth:`Dataframe.agg` with df with missing values resulting in IndexError (:issue:`58810`) - Bug in :meth:`DatetimeIndex.is_year_start` and :meth:`DatetimeIndex.is_quarter_start` does not raise on Custom business days frequencies bigger then "1C" (:issue:`58664`) - Bug in :meth:`DatetimeIndex.is_year_start` and :meth:`DatetimeIndex.is_quarter_start` returning ``False`` on double-digit frequencies (:issue:`58523`) @@ -766,6 +769,7 @@ Reshaping - Bug in :meth:`DataFrame.unstack` producing incorrect results when ``sort=False`` (:issue:`54987`, :issue:`55516`) - Bug in :meth:`DataFrame.merge` when merging two :class:`DataFrame` on ``intc`` or ``uintc`` 
types on Windows (:issue:`60091`, :issue:`58713`) - Bug in :meth:`DataFrame.pivot_table` incorrectly subaggregating results when called without an ``index`` argument (:issue:`58722`) +- Bug in :meth:`DataFrame.stack` with the new implementation where ``ValueError`` is raised when ``level=[]`` (:issue:`60740`) - Bug in :meth:`DataFrame.unstack` producing incorrect results when manipulating empty :class:`DataFrame` with an :class:`ExtentionDtype` (:issue:`59123`) Sparse @@ -789,6 +793,7 @@ Styler Other ^^^^^ - Bug in :class:`DataFrame` when passing a ``dict`` with a NA scalar and ``columns`` that would always return ``np.nan`` (:issue:`57205`) +- Bug in :class:`Series` ignoring errors when trying to convert :class:`Series` input data to the given ``dtype`` (:issue:`60728`) - Bug in :func:`eval` on :class:`ExtensionArray` on including division ``/`` failed with a ``TypeError``. (:issue:`58748`) - Bug in :func:`eval` where the names of the :class:`Series` were not preserved when using ``engine="numexpr"``. (:issue:`10239`) - Bug in :func:`eval` with ``engine="numexpr"`` returning unexpected result for float division. (:issue:`59736`) diff --git a/pandas/_config/config.py b/pandas/_config/config.py index 35139979f92fe..ce53e05608ba7 100644 --- a/pandas/_config/config.py +++ b/pandas/_config/config.py @@ -141,6 +141,10 @@ def get_option(pat: str) -> Any: """ Retrieve the value of the specified option. + This method allows users to query the current value of a given option + in the pandas configuration system. Options control various display, + performance, and behavior-related settings within pandas. + Parameters ---------- pat : str @@ -321,6 +325,11 @@ def reset_option(pat: str) -> None: """ Reset one or more options to their default value. + This method resets the specified pandas option(s) back to their default + values. 
It allows partial string matching for convenience, but users should + exercise caution to avoid unintended resets due to changes in option names + in future versions. + Parameters ---------- pat : str/regex @@ -424,6 +433,11 @@ def option_context(*args) -> Generator[None]: """ Context manager to temporarily set options in a ``with`` statement. + This method allows users to set one or more pandas options temporarily + within a controlled block. The previous options' values are restored + once the block is exited. This is useful when making temporary adjustments + to pandas' behavior without affecting the global state. + Parameters ---------- *args : str | object diff --git a/pandas/_libs/groupby.pyi b/pandas/_libs/groupby.pyi index e3909203d1f5a..163fc23535022 100644 --- a/pandas/_libs/groupby.pyi +++ b/pandas/_libs/groupby.pyi @@ -13,6 +13,7 @@ def group_median_float64( mask: np.ndarray | None = ..., result_mask: np.ndarray | None = ..., is_datetimelike: bool = ..., # bint + skipna: bool = ..., ) -> None: ... def group_cumprod( out: np.ndarray, # float64_t[:, ::1] @@ -76,6 +77,7 @@ def group_prod( mask: np.ndarray | None, result_mask: np.ndarray | None = ..., min_count: int = ..., + skipna: bool = ..., ) -> None: ... def group_var( out: np.ndarray, # floating[:, ::1] @@ -88,6 +90,7 @@ def group_var( result_mask: np.ndarray | None = ..., is_datetimelike: bool = ..., name: str = ..., + skipna: bool = ..., ) -> None: ... def group_skew( out: np.ndarray, # float64_t[:, ::1] @@ -183,6 +186,7 @@ def group_max( is_datetimelike: bool = ..., mask: np.ndarray | None = ..., result_mask: np.ndarray | None = ..., + skipna: bool = ..., ) -> None: ... def group_min( out: np.ndarray, # groupby_t[:, ::1] @@ -193,6 +197,7 @@ def group_min( is_datetimelike: bool = ..., mask: np.ndarray | None = ..., result_mask: np.ndarray | None = ..., + skipna: bool = ..., ) -> None: ... 
def group_idxmin_idxmax( out: npt.NDArray[np.intp], diff --git a/pandas/_libs/groupby.pyx b/pandas/_libs/groupby.pyx index 70af22f514ce0..16a104a46ed3d 100644 --- a/pandas/_libs/groupby.pyx +++ b/pandas/_libs/groupby.pyx @@ -62,7 +62,12 @@ cdef enum InterpolationEnumType: INTERPOLATION_MIDPOINT -cdef float64_t median_linear_mask(float64_t* a, int n, uint8_t* mask) noexcept nogil: +cdef float64_t median_linear_mask( + float64_t* a, + int n, + uint8_t* mask, + bint skipna=True +) noexcept nogil: cdef: int i, j, na_count = 0 float64_t* tmp @@ -77,7 +82,7 @@ cdef float64_t median_linear_mask(float64_t* a, int n, uint8_t* mask) noexcept n na_count += 1 if na_count: - if na_count == n: + if na_count == n or not skipna: return NaN tmp = malloc((n - na_count) * sizeof(float64_t)) @@ -104,7 +109,8 @@ cdef float64_t median_linear_mask(float64_t* a, int n, uint8_t* mask) noexcept n cdef float64_t median_linear( float64_t* a, int n, - bint is_datetimelike=False + bint is_datetimelike=False, + bint skipna=True, ) noexcept nogil: cdef: int i, j, na_count = 0 @@ -125,7 +131,7 @@ cdef float64_t median_linear( na_count += 1 if na_count: - if na_count == n: + if na_count == n or not skipna: return NaN tmp = malloc((n - na_count) * sizeof(float64_t)) @@ -186,6 +192,7 @@ def group_median_float64( const uint8_t[:, :] mask=None, uint8_t[:, ::1] result_mask=None, bint is_datetimelike=False, + bint skipna=True, ) -> None: """ Only aggregates on axis=0 @@ -229,7 +236,7 @@ def group_median_float64( for j in range(ngroups): size = _counts[j + 1] - result = median_linear_mask(ptr, size, ptr_mask) + result = median_linear_mask(ptr, size, ptr_mask, skipna) out[j, i] = result if result != result: @@ -244,7 +251,7 @@ def group_median_float64( ptr += _counts[0] for j in range(ngroups): size = _counts[j + 1] - out[j, i] = median_linear(ptr, size, is_datetimelike) + out[j, i] = median_linear(ptr, size, is_datetimelike, skipna) ptr += size @@ -804,17 +811,18 @@ def group_prod( const uint8_t[:, ::1] 
mask, uint8_t[:, ::1] result_mask=None, Py_ssize_t min_count=0, + bint skipna=True, ) -> None: """ Only aggregates on axis=0 """ cdef: Py_ssize_t i, j, N, K, lab, ncounts = len(counts) - int64float_t val + int64float_t val, nan_val int64float_t[:, ::1] prodx int64_t[:, ::1] nobs Py_ssize_t len_values = len(values), len_labels = len(labels) - bint isna_entry, uses_mask = mask is not None + bint isna_entry, isna_result, uses_mask = mask is not None if len_values != len_labels: raise ValueError("len(index) != len(labels)") @@ -823,6 +831,7 @@ def group_prod( prodx = np.ones((out).shape, dtype=(out).base.dtype) N, K = (values).shape + nan_val = _get_na_val(0, False) with nogil: for i in range(N): @@ -836,12 +845,23 @@ def group_prod( if uses_mask: isna_entry = mask[i, j] + isna_result = result_mask[lab, j] else: isna_entry = _treat_as_na(val, False) + isna_result = _treat_as_na(prodx[lab, j], False) + + if not skipna and isna_result: + # If prod is already NA, no need to update it + continue if not isna_entry: nobs[lab, j] += 1 prodx[lab, j] *= val + elif not skipna: + if uses_mask: + result_mask[lab, j] = True + else: + prodx[lab, j] = nan_val _check_below_mincount( out, uses_mask, result_mask, ncounts, K, nobs, min_count, prodx @@ -862,6 +882,7 @@ def group_var( uint8_t[:, ::1] result_mask=None, bint is_datetimelike=False, str name="var", + bint skipna=True, ) -> None: cdef: Py_ssize_t i, j, N, K, lab, ncounts = len(counts) @@ -869,7 +890,7 @@ def group_var( floating[:, ::1] mean int64_t[:, ::1] nobs Py_ssize_t len_values = len(values), len_labels = len(labels) - bint isna_entry, uses_mask = mask is not None + bint isna_entry, isna_result, uses_mask = mask is not None bint is_std = name == "std" bint is_sem = name == "sem" @@ -898,19 +919,34 @@ def group_var( if uses_mask: isna_entry = mask[i, j] + isna_result = result_mask[lab, j] elif is_datetimelike: # With group_var, we cannot just use _treat_as_na bc # datetimelike dtypes get cast to float64 instead of # to 
int64. isna_entry = val == NPY_NAT + isna_result = out[lab, j] == NPY_NAT else: isna_entry = _treat_as_na(val, is_datetimelike) + isna_result = _treat_as_na(out[lab, j], is_datetimelike) + + if not skipna and isna_result: + # If aggregate is already NA, don't add to it. This is important for + # datetimelike because adding a value to NPY_NAT may not result + # in a NPY_NAT + continue if not isna_entry: nobs[lab, j] += 1 oldmean = mean[lab, j] mean[lab, j] += (val - oldmean) / nobs[lab, j] out[lab, j] += (val - mean[lab, j]) * (val - oldmean) + elif not skipna: + nobs[lab, j] = 0 + if uses_mask: + result_mask[lab, j] = True + else: + out[lab, j] = NAN for i in range(ncounts): for j in range(K): @@ -1164,7 +1200,7 @@ def group_mean( mean_t[:, ::1] sumx, compensation int64_t[:, ::1] nobs Py_ssize_t len_values = len(values), len_labels = len(labels) - bint isna_entry, uses_mask = mask is not None + bint isna_entry, isna_result, uses_mask = mask is not None assert min_count == -1, "'min_count' only used in sum and prod" @@ -1194,25 +1230,24 @@ def group_mean( for j in range(K): val = values[i, j] - if not skipna and ( - (uses_mask and result_mask[lab, j]) or - (is_datetimelike and sumx[lab, j] == NPY_NAT) or - _treat_as_na(sumx[lab, j], False) - ): - # If sum is already NA, don't add to it. This is important for - # datetimelike because adding a value to NPY_NAT may not result - # in NPY_NAT - continue - if uses_mask: isna_entry = mask[i, j] + isna_result = result_mask[lab, j] elif is_datetimelike: # With group_mean, we cannot just use _treat_as_na bc # datetimelike dtypes get cast to float64 instead of # to int64. isna_entry = val == NPY_NAT + isna_result = sumx[lab, j] == NPY_NAT else: isna_entry = _treat_as_na(val, is_datetimelike) + isna_result = _treat_as_na(sumx[lab, j], is_datetimelike) + + if not skipna and isna_result: + # If sum is already NA, don't add to it. 
This is important for + # datetimelike because adding a value to NPY_NAT may not result + # in NPY_NAT + continue if not isna_entry: nobs[lab, j] += 1 @@ -1806,6 +1841,7 @@ cdef group_min_max( bint compute_max=True, const uint8_t[:, ::1] mask=None, uint8_t[:, ::1] result_mask=None, + bint skipna=True, ): """ Compute minimum/maximum of columns of `values`, in row groups `labels`. @@ -1833,6 +1869,8 @@ cdef group_min_max( result_mask : ndarray[bool, ndim=2], optional If not None, these specify locations in the output that are NA. Modified in-place. + skipna : bool, default True + If True, ignore nans in `values`. Notes ----- @@ -1841,17 +1879,18 @@ cdef group_min_max( """ cdef: Py_ssize_t i, j, N, K, lab, ngroups = len(counts) - numeric_t val + numeric_t val, nan_val numeric_t[:, ::1] group_min_or_max int64_t[:, ::1] nobs bint uses_mask = mask is not None - bint isna_entry + bint isna_entry, isna_result if not len(values) == len(labels): raise AssertionError("len(index) != len(labels)") min_count = max(min_count, 1) nobs = np.zeros((out).shape, dtype=np.int64) + nan_val = _get_na_val(0, is_datetimelike) group_min_or_max = np.empty_like(out) group_min_or_max[:] = _get_min_or_max(0, compute_max, is_datetimelike) @@ -1870,8 +1909,15 @@ cdef group_min_max( if uses_mask: isna_entry = mask[i, j] + isna_result = result_mask[lab, j] else: isna_entry = _treat_as_na(val, is_datetimelike) + isna_result = _treat_as_na(group_min_or_max[lab, j], + is_datetimelike) + + if not skipna and isna_result: + # If current min/max is already NA, it will always be NA + continue if not isna_entry: nobs[lab, j] += 1 @@ -1881,6 +1927,11 @@ cdef group_min_max( else: if val < group_min_or_max[lab, j]: group_min_or_max[lab, j] = val + elif not skipna: + if uses_mask: + result_mask[lab, j] = True + else: + group_min_or_max[lab, j] = nan_val _check_below_mincount( out, uses_mask, result_mask, ngroups, K, nobs, min_count, group_min_or_max @@ -2012,6 +2063,7 @@ def group_max( bint 
is_datetimelike=False, const uint8_t[:, ::1] mask=None, uint8_t[:, ::1] result_mask=None, + bint skipna=True, ) -> None: """See group_min_max.__doc__""" group_min_max( @@ -2024,6 +2076,7 @@ def group_max( compute_max=True, mask=mask, result_mask=result_mask, + skipna=skipna, ) @@ -2038,6 +2091,7 @@ def group_min( bint is_datetimelike=False, const uint8_t[:, ::1] mask=None, uint8_t[:, ::1] result_mask=None, + bint skipna=True, ) -> None: """See group_min_max.__doc__""" group_min_max( @@ -2050,6 +2104,7 @@ def group_min( compute_max=False, mask=mask, result_mask=result_mask, + skipna=skipna, ) diff --git a/pandas/_libs/interval.pyx b/pandas/_libs/interval.pyx index 564019d7c0d8c..5d0876591a151 100644 --- a/pandas/_libs/interval.pyx +++ b/pandas/_libs/interval.pyx @@ -209,6 +209,12 @@ cdef class IntervalMixin: """ Indicates if an interval is empty, meaning it contains no points. + An interval is considered empty if its `left` and `right` endpoints + are equal, and it is not closed on both sides. This means that the + interval does not include any real points. In the case of an + :class:`pandas.arrays.IntervalArray` or :class:`IntervalIndex`, the + property returns a boolean array indicating the emptiness of each interval. + Returns ------- bool or ndarray diff --git a/pandas/_libs/tslibs/period.pyx b/pandas/_libs/tslibs/period.pyx index f697180da5eeb..bef1956996b4f 100644 --- a/pandas/_libs/tslibs/period.pyx +++ b/pandas/_libs/tslibs/period.pyx @@ -2140,6 +2140,12 @@ cdef class _Period(PeriodMixin): """ Get day of the month that a Period falls on. + The `day` property provides a simple way to access the day component + of a `Period` object, which represents time spans in various frequencies + (e.g., daily, hourly, monthly). If the period's frequency does not include + a day component (e.g., yearly or quarterly periods), the returned day + corresponds to the first day of that period. 
+ Returns ------- int @@ -2836,6 +2842,11 @@ class Period(_Period): """ Represents a period of time. + A `Period` represents a specific time span rather than a point in time. + Unlike `Timestamp`, which represents a single instant, a `Period` defines a + duration, such as a month, quarter, or year. The exact representation is + determined by the `freq` parameter. + Parameters ---------- value : Period, str, datetime, date or pandas.Timestamp, default None diff --git a/pandas/_typing.py b/pandas/_typing.py index b515305fb6903..4365ee85f72e3 100644 --- a/pandas/_typing.py +++ b/pandas/_typing.py @@ -442,7 +442,9 @@ def closed(self) -> bool: AnyAll = Literal["any", "all"] # merge -MergeHow = Literal["left", "right", "inner", "outer", "cross"] +MergeHow = Literal[ + "left", "right", "inner", "outer", "cross", "left_anti", "right_anti" +] MergeValidate = Literal[ "one_to_one", "1:1", diff --git a/pandas/core/_numba/kernels/min_max_.py b/pandas/core/_numba/kernels/min_max_.py index 59d36732ebae6..d56453e4e5abf 100644 --- a/pandas/core/_numba/kernels/min_max_.py +++ b/pandas/core/_numba/kernels/min_max_.py @@ -88,6 +88,7 @@ def grouped_min_max( ngroups: int, min_periods: int, is_max: bool, + skipna: bool = True, ) -> tuple[np.ndarray, list[int]]: N = len(labels) nobs = np.zeros(ngroups, dtype=np.int64) @@ -97,13 +98,16 @@ def grouped_min_max( for i in range(N): lab = labels[i] val = values[i] - if lab < 0: + if lab < 0 or (nobs[lab] >= 1 and np.isnan(output[lab])): continue if values.dtype.kind == "i" or not np.isnan(val): nobs[lab] += 1 else: - # NaN value cannot be a min/max value + if not skipna: + # If skipna is False and we encounter a NaN, + # both min and max of the group will be NaN + output[lab] = np.nan continue if nobs[lab] == 1: diff --git a/pandas/core/_numba/kernels/var_.py b/pandas/core/_numba/kernels/var_.py index 69aec4d6522c4..5d720c877815d 100644 --- a/pandas/core/_numba/kernels/var_.py +++ b/pandas/core/_numba/kernels/var_.py @@ -176,6 +176,7 @@ def 
grouped_var( ngroups: int, min_periods: int, ddof: int = 1, + skipna: bool = True, ) -> tuple[np.ndarray, list[int]]: N = len(labels) @@ -190,7 +191,11 @@ def grouped_var( lab = labels[i] val = values[i] - if lab < 0: + if lab < 0 or np.isnan(output[lab]): + continue + + if not skipna and np.isnan(val): + output[lab] = np.nan continue mean_x = means[lab] diff --git a/pandas/core/apply.py b/pandas/core/apply.py index af513d49bcfe0..f36fc82fb1a11 100644 --- a/pandas/core/apply.py +++ b/pandas/core/apply.py @@ -1645,8 +1645,7 @@ def reconstruct_func( # GH 28426 will raise error if duplicated function names are used and # there is no reassigned name raise SpecificationError( - "Function names must be unique if there is no new column names " - "assigned" + "Function names must be unique if there is no new column names assigned" ) if func is None: # nicer error message diff --git a/pandas/core/arraylike.py b/pandas/core/arraylike.py index 43ac69508d1a4..51ddd9e91b227 100644 --- a/pandas/core/arraylike.py +++ b/pandas/core/arraylike.py @@ -329,7 +329,7 @@ def array_ufunc(self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any) reconstruct_axes = dict(zip(self._AXIS_ORDERS, self.axes)) if self.ndim == 1: - names = {getattr(x, "name") for x in inputs if hasattr(x, "name")} + names = {x.name for x in inputs if hasattr(x, "name")} name = names.pop() if len(names) == 1 else None reconstruct_kwargs = {"name": name} else: diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py index e831883998098..33745438e2aea 100644 --- a/pandas/core/arrays/base.py +++ b/pandas/core/arrays/base.py @@ -1791,9 +1791,11 @@ def take(self, indices, allow_fill=False, fill_value=None): # type for the array, to the physical storage type for # the data, before passing to take. 
- result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill) + result = take( + data, indices, fill_value=fill_value, allow_fill=allow_fill + ) return self._from_sequence(result, dtype=self.dtype) - """ # noqa: E501 + """ # Implementer note: The `fill_value` parameter should be a user-facing # value, an instance of self.dtype.type. When passed `fill_value=None`, # the default of `self.dtype.na_value` should be used. diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py index 43cc492f82885..df40c9c11b117 100644 --- a/pandas/core/arrays/datetimes.py +++ b/pandas/core/arrays/datetimes.py @@ -2707,8 +2707,7 @@ def _maybe_infer_tz(tz: tzinfo | None, inferred_tz: tzinfo | None) -> tzinfo | N pass elif not timezones.tz_compare(tz, inferred_tz): raise TypeError( - f"data is already tz-aware {inferred_tz}, unable to " - f"set specified tz: {tz}" + f"data is already tz-aware {inferred_tz}, unable to set specified tz: {tz}" ) return tz diff --git a/pandas/core/base.py b/pandas/core/base.py index 61a7c079d87f8..a64cd8633c1db 100644 --- a/pandas/core/base.py +++ b/pandas/core/base.py @@ -506,6 +506,11 @@ def array(self) -> ExtensionArray: """ The ExtensionArray of the data backing this Series or Index. + This property provides direct access to the underlying array data of a + Series or Index without requiring conversion to a NumPy array. It + returns an ExtensionArray, which is the native storage format for + pandas extension dtypes. 
+ Returns ------- ExtensionArray diff --git a/pandas/core/common.py b/pandas/core/common.py index 9788ec972ba1b..100ad312bd839 100644 --- a/pandas/core/common.py +++ b/pandas/core/common.py @@ -359,7 +359,7 @@ def is_full_slice(obj, line: int) -> bool: def get_callable_name(obj): # typical case has name if hasattr(obj, "__name__"): - return getattr(obj, "__name__") + return obj.__name__ # some objects don't; could recurse if isinstance(obj, partial): return get_callable_name(obj.func) diff --git a/pandas/core/computation/eval.py b/pandas/core/computation/eval.py index 9d844e590582a..f8e3200ef2ba0 100644 --- a/pandas/core/computation/eval.py +++ b/pandas/core/computation/eval.py @@ -204,7 +204,7 @@ def eval( By default, with the numexpr engine, the following operations are supported: - - Arthimetic operations: ``+``, ``-``, ``*``, ``/``, ``**``, ``%`` + - Arithmetic operations: ``+``, ``-``, ``*``, ``/``, ``**``, ``%`` - Boolean operations: ``|`` (or), ``&`` (and), and ``~`` (not) - Comparison operators: ``<``, ``<=``, ``==``, ``!=``, ``>=``, ``>`` diff --git a/pandas/core/computation/expr.py b/pandas/core/computation/expr.py index 010fad1bbf0b6..14a393b02409c 100644 --- a/pandas/core/computation/expr.py +++ b/pandas/core/computation/expr.py @@ -698,7 +698,7 @@ def visit_Call(self, node, side=None, **kwargs): if not isinstance(key, ast.keyword): # error: "expr" has no attribute "id" raise ValueError( - "keyword error in function call " f"'{node.func.id}'" # type: ignore[attr-defined] + f"keyword error in function call '{node.func.id}'" # type: ignore[attr-defined] ) if key.arg: diff --git a/pandas/core/computation/ops.py b/pandas/core/computation/ops.py index 9b26de42e119b..f06ded6d9f98e 100644 --- a/pandas/core/computation/ops.py +++ b/pandas/core/computation/ops.py @@ -512,8 +512,7 @@ def __init__(self, op: Literal["+", "-", "~", "not"], operand) -> None: self.func = _unary_ops_dict[op] except KeyError as err: raise ValueError( - f"Invalid unary operator {op!r}, " 
- f"valid operators are {UNARY_OPS_SYMS}" + f"Invalid unary operator {op!r}, valid operators are {UNARY_OPS_SYMS}" ) from err def __call__(self, env) -> MathCall: diff --git a/pandas/core/computation/parsing.py b/pandas/core/computation/parsing.py index 35a6d1c6ad269..8441941797a6e 100644 --- a/pandas/core/computation/parsing.py +++ b/pandas/core/computation/parsing.py @@ -123,16 +123,6 @@ def clean_column_name(name: Hashable) -> Hashable: ------- name : hashable Returns the name after tokenizing and cleaning. - - Notes - ----- - For some cases, a name cannot be converted to a valid Python identifier. - In that case :func:`tokenize_string` raises a SyntaxError. - In that case, we just return the name unmodified. - - If this name was used in the query string (this makes the query call impossible) - an error will be raised by :func:`tokenize_backtick_quoted_string` instead, - which is not caught and propagates to the user level. """ try: # Escape backticks @@ -145,40 +135,6 @@ def clean_column_name(name: Hashable) -> Hashable: return name -def tokenize_backtick_quoted_string( - token_generator: Iterator[tokenize.TokenInfo], source: str, string_start: int -) -> tuple[int, str]: - """ - Creates a token from a backtick quoted string. - - Moves the token_generator forwards till right after the next backtick. - - Parameters - ---------- - token_generator : Iterator[tokenize.TokenInfo] - The generator that yields the tokens of the source string (Tuple[int, str]). - The generator is at the first token after the backtick (`) - - source : str - The Python source code string. - - string_start : int - This is the start of backtick quoted string inside the source string. - - Returns - ------- - tok: Tuple[int, str] - The token that represents the backtick quoted string. - The integer is equal to BACKTICK_QUOTED_STRING (100). 
- """ - for _, tokval, start, _, _ in token_generator: - if tokval == "`": - string_end = start[1] - break - - return BACKTICK_QUOTED_STRING, source[string_start:string_end] - - class ParseState(Enum): DEFAULT = 0 IN_BACKTICK = 1 diff --git a/pandas/core/construction.py b/pandas/core/construction.py index 50088804e0245..ada492787a179 100644 --- a/pandas/core/construction.py +++ b/pandas/core/construction.py @@ -81,6 +81,10 @@ def array( """ Create an array. + This method constructs an array using pandas extension types when possible. + If `dtype` is specified, it determines the type of array returned. Otherwise, + pandas attempts to infer the appropriate dtype based on `data`. + Parameters ---------- data : Sequence of objects diff --git a/pandas/core/dtypes/cast.py b/pandas/core/dtypes/cast.py index 02b9291da9b31..94531c2ac87e8 100644 --- a/pandas/core/dtypes/cast.py +++ b/pandas/core/dtypes/cast.py @@ -1651,7 +1651,7 @@ def maybe_cast_to_integer_array(arr: list | np.ndarray, dtype: np.dtype) -> np.n # (test_constructor_coercion_signed_to_unsigned) so safe to ignore. 
warnings.filterwarnings( "ignore", - "NumPy will stop allowing conversion of " "out-of-bound Python int", + "NumPy will stop allowing conversion of out-of-bound Python int", DeprecationWarning, ) casted = np.asarray(arr, dtype=dtype) diff --git a/pandas/core/dtypes/common.py b/pandas/core/dtypes/common.py index b0c8ec1ffc083..e8881ff014a0c 100644 --- a/pandas/core/dtypes/common.py +++ b/pandas/core/dtypes/common.py @@ -1836,6 +1836,8 @@ def pandas_dtype(dtype) -> DtypeObj: # raise a consistent TypeError if failed try: with warnings.catch_warnings(): + # TODO: warnings.catch_warnings can be removed when numpy>2.2.2 + # is the minimum version # GH#51523 - Series.astype(np.integer) doesn't show # numpy deprecation warning of np.integer # Hence enabling DeprecationWarning diff --git a/pandas/core/dtypes/dtypes.py b/pandas/core/dtypes/dtypes.py index 1eb1a630056a2..d8dd6441913b5 100644 --- a/pandas/core/dtypes/dtypes.py +++ b/pandas/core/dtypes/dtypes.py @@ -605,8 +605,7 @@ def update_dtype(self, dtype: str_type | CategoricalDtype) -> CategoricalDtype: return self elif not self.is_dtype(dtype): raise ValueError( - f"a CategoricalDtype must be passed to perform an update, " - f"got {dtype!r}" + f"a CategoricalDtype must be passed to perform an update, got {dtype!r}" ) else: # from here on, dtype is a CategoricalDtype diff --git a/pandas/core/frame.py b/pandas/core/frame.py index 3669d8249dd27..57a7b9467a05e 100644 --- a/pandas/core/frame.py +++ b/pandas/core/frame.py @@ -315,7 +315,8 @@ ----------%s right : DataFrame or named Series Object to merge with. -how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner' +how : {'left', 'right', 'outer', 'inner', 'cross', 'left_anti', 'right_anti'}, + default 'inner' Type of merge to be performed. * left: use only keys from left frame, similar to a SQL left outer join; @@ -328,6 +329,10 @@ join; preserve the order of the left keys. 
* cross: creates the cartesian product from both frames, preserves the order of the left keys. + * left_anti: use only keys from left frame that are not in right frame, similar + to SQL left anti join; preserve key order. + * right_anti: use only keys from right frame that are not in left frame, similar + to SQL right anti join; preserve key order. on : label or list Column or index level names to join on. These must be found in both DataFrames. If `on` is None and not merging on indexes then this defaults @@ -1016,6 +1021,10 @@ def shape(self) -> tuple[int, int]: """ Return a tuple representing the dimensionality of the DataFrame. + Unlike the `len()` method, which only returns the number of rows, `shape` + provides both row and column counts, making it a more informative method for + understanding dataset size. + See Also -------- numpy.ndarray.shape : Tuple of array dimensions. @@ -3205,9 +3214,13 @@ def to_html( Convert the characters <, >, and & to HTML-safe sequences. notebook : {True, False}, default False Whether the generated HTML is for IPython Notebook. - border : int - A ``border=border`` attribute is included in the opening - `<table>` tag. Default ``pd.options.display.html.border``. + border : int or bool + When an integer value is provided, it sets the border attribute in + the opening tag, specifying the thickness of the border. + If ``False`` or ``0`` is passed, the border attribute will not + be present in the ``<table>`` tag. + The default value for this parameter is governed by + ``pd.options.display.html.border``. table_id : str, optional A css id is included in the opening `<table>` tag if specified. render_links : bool, default False @@ -4789,6 +4802,10 @@ def select_dtypes(self, include=None, exclude=None) -> DataFrame: """ Return a subset of the DataFrame's columns based on the column dtypes. + This method allows for filtering columns based on their data types. + It is useful when working with heterogeneous DataFrames where operations + need to be performed on a specific subset of data types. + Parameters ---------- include, exclude : scalar or list-like @@ -10605,7 +10622,8 @@ def join( values given, the `other` DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation. - how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'left' + how : {'left', 'right', 'outer', 'inner', 'cross', 'left_anti', 'right_anti'}, + default 'left' How to handle the operation of the two objects. * left: use calling frame's index (or column if on is specified) @@ -10617,6 +10635,10 @@ def join( of the calling's one. * cross: creates the cartesian product from both frames, preserves the order of the left keys. + * left_anti: use set difference of calling frame's index and `other`'s + index. + * right_anti: use set difference of `other`'s index and calling frame's + index. lsuffix : str, default '' Suffix to use from left frame's overlapping columns. rsuffix : str, default '' @@ -13673,6 +13695,10 @@ def isin_(x): doc=""" The column labels of the DataFrame. + This property holds the column names as a pandas ``Index`` object. + It provides an immutable sequence of column labels that can be + used for data selection, renaming, and alignment in DataFrame operations.
+ Returns ------- pandas.Index diff --git a/pandas/core/generic.py b/pandas/core/generic.py index e0a4f9d9c546a..f376518d4d3b8 100644 --- a/pandas/core/generic.py +++ b/pandas/core/generic.py @@ -5537,8 +5537,7 @@ def filter( nkw = common.count_not_none(items, like, regex) if nkw > 1: raise TypeError( - "Keyword arguments `items`, `like`, or `regex` " - "are mutually exclusive" + "Keyword arguments `items`, `like`, or `regex` are mutually exclusive" ) if axis is None: diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py index f9059e6e8896f..d0c0ed29b6d44 100644 --- a/pandas/core/groupby/groupby.py +++ b/pandas/core/groupby/groupby.py @@ -570,6 +570,13 @@ def indices(self) -> dict[Hashable, npt.NDArray[np.intp]]: """ Dict {group name -> group indices}. + The dictionary keys represent the group labels (e.g., timestamps for a + time-based resampling operation), and the values are arrays of integer + positions indicating where the elements of each group are located in the + original data. This property is particularly useful when working with + resampled data, as it provides insight into how the original time-series data + has been grouped. + See Also -------- core.groupby.DataFrameGroupBy.indices : Provides a mapping of group rows to @@ -2163,8 +2170,7 @@ def mean( numeric_only no longer accepts ``None`` and defaults to ``False``. skipna : bool, default True - Exclude NA/null values. If an entire row/column is NA, the result - will be NA. + Exclude NA/null values. If an entire group is NA, the result will be NA. .. versionadded:: 3.0.0 @@ -2248,7 +2254,7 @@ def mean( return result.__finalize__(self.obj, method="groupby") @final - def median(self, numeric_only: bool = False) -> NDFrameT: + def median(self, numeric_only: bool = False, skipna: bool = True) -> NDFrameT: """ Compute median of groups, excluding missing values. 
@@ -2263,6 +2269,11 @@ def median(self, numeric_only: bool = False) -> NDFrameT: numeric_only no longer accepts ``None`` and defaults to False. + skipna : bool, default True + Exclude NA/null values. If an entire group is NA, the result will be NA. + + .. versionadded:: 3.0.0 + Returns ------- Series or DataFrame @@ -2335,8 +2346,11 @@ def median(self, numeric_only: bool = False) -> NDFrameT: """ result = self._cython_agg_general( "median", - alt=lambda x: Series(x, copy=False).median(numeric_only=numeric_only), + alt=lambda x: Series(x, copy=False).median( + numeric_only=numeric_only, skipna=skipna + ), numeric_only=numeric_only, + skipna=skipna, ) return result.__finalize__(self.obj, method="groupby") @@ -2349,6 +2363,7 @@ def std( engine: Literal["cython", "numba"] | None = None, engine_kwargs: dict[str, bool] | None = None, numeric_only: bool = False, + skipna: bool = True, ): """ Compute standard deviation of groups, excluding missing values. @@ -2387,6 +2402,11 @@ def std( numeric_only now defaults to ``False``. + skipna : bool, default True + Exclude NA/null values. If an entire group is NA, the result will be NA. + + .. versionadded:: 3.0.0 + Returns ------- Series or DataFrame @@ -2441,14 +2461,16 @@ def std( engine_kwargs, min_periods=0, ddof=ddof, + skipna=skipna, ) ) else: return self._cython_agg_general( "std", - alt=lambda x: Series(x, copy=False).std(ddof=ddof), + alt=lambda x: Series(x, copy=False).std(ddof=ddof, skipna=skipna), numeric_only=numeric_only, ddof=ddof, + skipna=skipna, ) @final @@ -2460,6 +2482,7 @@ def var( engine: Literal["cython", "numba"] | None = None, engine_kwargs: dict[str, bool] | None = None, numeric_only: bool = False, + skipna: bool = True, ): """ Compute variance of groups, excluding missing values. @@ -2497,6 +2520,11 @@ def var( numeric_only now defaults to ``False``. + skipna : bool, default True + Exclude NA/null values. If an entire group is NA, the result will be NA. + + .. 
versionadded:: 3.0.0 + Returns ------- Series or DataFrame @@ -2550,13 +2578,15 @@ def var( engine_kwargs, min_periods=0, ddof=ddof, + skipna=skipna, ) else: return self._cython_agg_general( "var", - alt=lambda x: Series(x, copy=False).var(ddof=ddof), + alt=lambda x: Series(x, copy=False).var(ddof=ddof, skipna=skipna), numeric_only=numeric_only, ddof=ddof, + skipna=skipna, ) @final @@ -2598,8 +2628,7 @@ def _value_counts( doesnt_exist = subsetted - unique_cols if doesnt_exist: raise ValueError( - f"Keys {doesnt_exist} in subset do not " - f"exist in the DataFrame." + f"Keys {doesnt_exist} in subset do not exist in the DataFrame." ) else: subsetted = unique_cols @@ -2686,7 +2715,9 @@ def _value_counts( return result.__finalize__(self.obj, method="value_counts") @final - def sem(self, ddof: int = 1, numeric_only: bool = False) -> NDFrameT: + def sem( + self, ddof: int = 1, numeric_only: bool = False, skipna: bool = True + ) -> NDFrameT: """ Compute standard error of the mean of groups, excluding missing values. @@ -2706,6 +2737,11 @@ def sem(self, ddof: int = 1, numeric_only: bool = False) -> NDFrameT: numeric_only now defaults to ``False``. + skipna : bool, default True + Exclude NA/null values. If an entire group is NA, the result will be NA. + + .. versionadded:: 3.0.0 + Returns ------- Series or DataFrame @@ -2780,9 +2816,10 @@ def sem(self, ddof: int = 1, numeric_only: bool = False) -> NDFrameT: ) return self._cython_agg_general( "sem", - alt=lambda x: Series(x, copy=False).sem(ddof=ddof), + alt=lambda x: Series(x, copy=False).sem(ddof=ddof, skipna=skipna), numeric_only=numeric_only, ddof=ddof, + skipna=skipna, ) @final @@ -2959,7 +2996,9 @@ def sum( return result @final - def prod(self, numeric_only: bool = False, min_count: int = 0) -> NDFrameT: + def prod( + self, numeric_only: bool = False, min_count: int = 0, skipna: bool = True + ) -> NDFrameT: """ Compute prod of group values. 
@@ -2976,6 +3015,11 @@ def prod(self, numeric_only: bool = False, min_count: int = 0) -> NDFrameT: The required number of valid values to perform the operation. If fewer than ``min_count`` non-NA values are present the result will be NA. + skipna : bool, default True + Exclude NA/null values. If an entire group is NA, the result will be NA. + + .. versionadded:: 3.0.0 + Returns ------- Series or DataFrame @@ -3024,17 +3068,22 @@ def prod(self, numeric_only: bool = False, min_count: int = 0) -> NDFrameT: 2 30 72 """ return self._agg_general( - numeric_only=numeric_only, min_count=min_count, alias="prod", npfunc=np.prod + numeric_only=numeric_only, + min_count=min_count, + skipna=skipna, + alias="prod", + npfunc=np.prod, ) @final @doc( - _groupby_agg_method_engine_template, + _groupby_agg_method_skipna_engine_template, fname="min", no=False, mc=-1, e=None, ek=None, + s=True, example=dedent( """\ For SeriesGroupBy: @@ -3074,6 +3123,7 @@ def min( self, numeric_only: bool = False, min_count: int = -1, + skipna: bool = True, engine: Literal["cython", "numba"] | None = None, engine_kwargs: dict[str, bool] | None = None, ): @@ -3086,23 +3136,26 @@ def min( engine_kwargs, min_periods=min_count, is_max=False, + skipna=skipna, ) else: return self._agg_general( numeric_only=numeric_only, min_count=min_count, + skipna=skipna, alias="min", npfunc=np.min, ) @final @doc( - _groupby_agg_method_engine_template, + _groupby_agg_method_skipna_engine_template, fname="max", no=False, mc=-1, e=None, ek=None, + s=True, example=dedent( """\ For SeriesGroupBy: @@ -3142,6 +3195,7 @@ def max( self, numeric_only: bool = False, min_count: int = -1, + skipna: bool = True, engine: Literal["cython", "numba"] | None = None, engine_kwargs: dict[str, bool] | None = None, ): @@ -3154,11 +3208,13 @@ def max( engine_kwargs, min_periods=min_count, is_max=True, + skipna=skipna, ) else: return self._agg_general( numeric_only=numeric_only, min_count=min_count, + skipna=skipna, alias="max", npfunc=np.max, ) 
@@ -3180,8 +3236,7 @@ def first( The required number of valid values to perform the operation. If fewer than ``min_count`` valid values are present the result will be NA. skipna : bool, default True - Exclude NA/null values. If an entire row/column is NA, the result - will be NA. + Exclude NA/null values. If an entire group is NA, the result will be NA. .. versionadded:: 2.2.1 @@ -3267,8 +3322,7 @@ def last( The required number of valid values to perform the operation. If fewer than ``min_count`` valid values are present the result will be NA. skipna : bool, default True - Exclude NA/null values. If an entire row/column is NA, the result - will be NA. + Exclude NA/null values. If an entire group is NA, the result will be NA. .. versionadded:: 2.2.1 @@ -3605,10 +3659,10 @@ def rolling( Parameters ---------- window : int, timedelta, str, offset, or BaseIndexer subclass - Size of the moving window. + Interval of the moving window. - If an integer, the fixed number of observations used for - each window. + If an integer, the delta between the start and end of each window. + The number of points in the window depends on the ``closed`` argument. If a timedelta, str, or offset, the time period of each window. Each window will be a variable sized based on the observations included in @@ -3654,14 +3708,22 @@ def rolling( an integer index is not used to calculate the rolling window. closed : str, default None - If ``'right'``, the first point in the window is excluded from calculations. + Determines the inclusivity of points in the window + + If ``'right'``, uses the window (first, last] meaning the last point + is included in the calculations. + + If ``'left'``, uses the window [first, last) meaning the first point + is included in the calculations. - If ``'left'``, the last point in the window is excluded from calculations. + If ``'both'``, uses the window [first, last] meaning all points in + the window are included in the calculations. 
- If ``'both'``, no points in the window are excluded from calculations. + If ``'neither'``, uses the window (first, last) meaning the first + and last points in the window are excluded from calculations. - If ``'neither'``, the first and last points in the window are excluded - from calculations. + () and [] are referencing open and closed set + notation respectively. Default ``None`` (``'right'``). @@ -5461,8 +5523,7 @@ def _idxmax_idxmin( numeric_only : bool, default False Include only float, int, boolean columns. skipna : bool, default True - Exclude NA/null values. If an entire row/column is NA, the result - will be NA. + Exclude NA/null values. If an entire group is NA, the result will be NA. ignore_unobserved : bool, default False When True and an unobserved group is encountered, do not raise. This used for transform where unobserved groups do not play an impact on the result. diff --git a/pandas/core/groupby/grouper.py b/pandas/core/groupby/grouper.py index 5f9ebdcea4a2d..c9d874fc08dbe 100644 --- a/pandas/core/groupby/grouper.py +++ b/pandas/core/groupby/grouper.py @@ -516,8 +516,7 @@ def __init__( ): grper = pprint_thing(grouping_vector) errmsg = ( - "Grouper result violates len(labels) == " - f"len(data)\nresult: {grper}" + f"Grouper result violates len(labels) == len(data)\nresult: {grper}" ) raise AssertionError(errmsg) diff --git a/pandas/core/indexers/objects.py b/pandas/core/indexers/objects.py index 0064aa91056e8..88379164534f2 100644 --- a/pandas/core/indexers/objects.py +++ b/pandas/core/indexers/objects.py @@ -478,9 +478,9 @@ def get_window_bounds( ) start = start.astype(np.int64) end = end.astype(np.int64) - assert len(start) == len( - end - ), "these should be equal in length from get_window_bounds" + assert len(start) == len(end), ( + "these should be equal in length from get_window_bounds" + ) # Cannot use groupby_indices as they might not be monotonic with the object # we're rolling over window_indices = np.arange( diff --git
a/pandas/core/indexing.py b/pandas/core/indexing.py index 656ee54cbc5d4..8a493fef54d3b 100644 --- a/pandas/core/indexing.py +++ b/pandas/core/indexing.py @@ -975,8 +975,7 @@ def _validate_tuple_indexer(self, key: tuple) -> tuple: self._validate_key(k, i) except ValueError as err: raise ValueError( - "Location based indexing can only have " - f"[{self._valid_types}] types" + f"Location based indexing can only have [{self._valid_types}] types" ) from err return key @@ -1589,8 +1588,7 @@ def _validate_key(self, key, axis: AxisInt) -> None: "is not available" ) raise ValueError( - "iLocation based boolean indexing cannot use " - "an indexable as a mask" + "iLocation based boolean indexing cannot use an indexable as a mask" ) return @@ -1994,8 +1992,7 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str): return self._setitem_with_indexer((pi, info_axis[0]), value[0]) raise ValueError( - "Must have equal len keys and value " - "when setting with an iterable" + "Must have equal len keys and value when setting with an iterable" ) elif lplane_indexer == 0 and len(value) == len(self.obj.index): @@ -2023,8 +2020,7 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str): else: raise ValueError( - "Must have equal len keys and value " - "when setting with an iterable" + "Must have equal len keys and value when setting with an iterable" ) else: diff --git a/pandas/core/interchange/buffer.py b/pandas/core/interchange/buffer.py index 62bf396256f2a..8953360a91c8e 100644 --- a/pandas/core/interchange/buffer.py +++ b/pandas/core/interchange/buffer.py @@ -31,8 +31,7 @@ def __init__(self, x: np.ndarray, allow_copy: bool = True) -> None: x = x.copy() else: raise RuntimeError( - "Exports cannot be zero-copy in the case " - "of a non-contiguous buffer" + "Exports cannot be zero-copy in the case of a non-contiguous buffer" ) # Store the numpy array in which the data resides as a private diff --git a/pandas/core/internals/blocks.py 
b/pandas/core/internals/blocks.py index f44ad926dda5c..d1a9081b234de 100644 --- a/pandas/core/internals/blocks.py +++ b/pandas/core/internals/blocks.py @@ -2264,8 +2264,7 @@ def check_ndim(values, placement: BlockPlacement, ndim: int) -> None: if values.ndim > ndim: # Check for both np.ndarray and ExtensionArray raise ValueError( - "Wrong number of dimensions. " - f"values.ndim > ndim [{values.ndim} > {ndim}]" + f"Wrong number of dimensions. values.ndim > ndim [{values.ndim} > {ndim}]" ) if not is_1d_only_ea_dtype(values.dtype): diff --git a/pandas/core/internals/construction.py b/pandas/core/internals/construction.py index dfff34656f82b..69da2be0306f6 100644 --- a/pandas/core/internals/construction.py +++ b/pandas/core/internals/construction.py @@ -907,8 +907,7 @@ def _validate_or_indexify_columns( if not is_mi_list and len(columns) != len(content): # pragma: no cover # caller's responsibility to check for this... raise AssertionError( - f"{len(columns)} columns passed, passed data had " - f"{len(content)} columns" + f"{len(columns)} columns passed, passed data had {len(content)} columns" ) if is_mi_list: # check if nested list column, length of each sub-list should be equal diff --git a/pandas/core/nanops.py b/pandas/core/nanops.py index d6154e2352c63..d1dc0ff809497 100644 --- a/pandas/core/nanops.py +++ b/pandas/core/nanops.py @@ -1093,11 +1093,14 @@ def reduction( if values.size == 0: return _na_for_min_count(values, axis) + dtype = values.dtype values, mask = _get_values( values, skipna, fill_value_typ=fill_value_typ, mask=mask ) result = getattr(values, meth)(axis) - result = _maybe_null_out(result, axis, mask, values.shape) + result = _maybe_null_out( + result, axis, mask, values.shape, datetimelike=dtype.kind in "mM" + ) return result return reduction @@ -1499,6 +1502,7 @@ def _maybe_null_out( mask: npt.NDArray[np.bool_] | None, shape: tuple[int, ...], min_count: int = 1, + datetimelike: bool = False, ) -> np.ndarray | float | NaTType: """ Returns @@ 
-1520,7 +1524,10 @@ def _maybe_null_out( null_mask = np.broadcast_to(below_count, new_shape) if np.any(null_mask): - if is_numeric_dtype(result): + if datetimelike: + # GH#60646 For datetimelike, no need to cast to float + result[null_mask] = iNaT + elif is_numeric_dtype(result): if np.iscomplexobj(result): result = result.astype("c16") elif not is_float_dtype(result): diff --git a/pandas/core/ops/array_ops.py b/pandas/core/ops/array_ops.py index 983a3df57e369..3a466b6fc7fc8 100644 --- a/pandas/core/ops/array_ops.py +++ b/pandas/core/ops/array_ops.py @@ -164,7 +164,7 @@ def _masked_arith_op(x: np.ndarray, y, op) -> np.ndarray: else: if not is_scalar(y): raise TypeError( - f"Cannot broadcast np.ndarray with operand of type { type(y) }" + f"Cannot broadcast np.ndarray with operand of type {type(y)}" ) # mask is only meaningful for x diff --git a/pandas/core/resample.py b/pandas/core/resample.py index 4b3b7a72b5a5c..1cfc75ea11725 100644 --- a/pandas/core/resample.py +++ b/pandas/core/resample.py @@ -1269,8 +1269,53 @@ def last( ) @final - @doc(GroupBy.median) def median(self, numeric_only: bool = False): + """ + Compute median of groups, excluding missing values. + + For multiple groupings, the result index will be a MultiIndex + + Parameters + ---------- + numeric_only : bool, default False + Include only float, int, boolean columns. + + .. versionchanged:: 2.0.0 + + numeric_only no longer accepts ``None`` and defaults to False. + + Returns + ------- + Series or DataFrame + Median of values within each group. + + See Also + -------- + Series.groupby : Apply a function groupby to a Series. + DataFrame.groupby : Apply a function groupby to each row or column of a + DataFrame. + + Examples + -------- + + >>> ser = pd.Series( + ... [1, 2, 3, 3, 4, 5], + ... index=pd.DatetimeIndex( + ... [ + ... "2023-01-01", + ... "2023-01-10", + ... "2023-01-15", + ... "2023-02-01", + ... "2023-02-10", + ... "2023-02-15", + ... ] + ... ), + ... 
) + >>> ser.resample("MS").median() + 2023-01-01 2.0 + 2023-02-01 4.0 + Freq: MS, dtype: float64 + """ return self._downsample("median", numeric_only=numeric_only) @final @@ -1450,12 +1495,61 @@ def var( return self._downsample("var", ddof=ddof, numeric_only=numeric_only) @final - @doc(GroupBy.sem) def sem( self, ddof: int = 1, numeric_only: bool = False, ): + """ + Compute standard error of the mean of groups, excluding missing values. + + For multiple groupings, the result index will be a MultiIndex. + + Parameters + ---------- + ddof : int, default 1 + Degrees of freedom. + + numeric_only : bool, default False + Include only `float`, `int` or `boolean` data. + + .. versionadded:: 1.5.0 + + .. versionchanged:: 2.0.0 + + numeric_only now defaults to ``False``. + + Returns + ------- + Series or DataFrame + Standard error of the mean of values within each group. + + See Also + -------- + DataFrame.sem : Return unbiased standard error of the mean over requested axis. + Series.sem : Return unbiased standard error of the mean over requested axis. + + Examples + -------- + + >>> ser = pd.Series( + ... [1, 3, 2, 4, 3, 8], + ... index=pd.DatetimeIndex( + ... [ + ... "2023-01-01", + ... "2023-01-10", + ... "2023-01-15", + ... "2023-02-01", + ... "2023-02-10", + ... "2023-02-15", + ... ] + ... ), + ... 
) + >>> ser.resample("MS").sem() + 2023-01-01 0.577350 + 2023-02-01 1.527525 + Freq: MS, dtype: float64 + """ return self._downsample("sem", ddof=ddof, numeric_only=numeric_only) @final diff --git a/pandas/core/reshape/encoding.py b/pandas/core/reshape/encoding.py index 33ff182f5baee..6a590ee5b227e 100644 --- a/pandas/core/reshape/encoding.py +++ b/pandas/core/reshape/encoding.py @@ -495,8 +495,7 @@ def from_dummies( if col_isna_mask.any(): raise ValueError( - "Dummy DataFrame contains NA value in column: " - f"'{col_isna_mask.idxmax()}'" + f"Dummy DataFrame contains NA value in column: '{col_isna_mask.idxmax()}'" ) # index data with a list of all columns that are dummies diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py index 5fddd9f9aca5b..09be82c59a5c6 100644 --- a/pandas/core/reshape/merge.py +++ b/pandas/core/reshape/merge.py @@ -180,7 +180,8 @@ def merge( First pandas object to merge. right : DataFrame or named Series Second pandas object to merge. - how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner' + how : {'left', 'right', 'outer', 'inner', 'cross', 'left_anti', 'right_anti'}, + default 'inner' Type of merge to be performed. * left: use only keys from left frame, similar to a SQL left outer join; @@ -193,6 +194,10 @@ def merge( join; preserve the order of the left keys. * cross: creates the cartesian product from both frames, preserves the order of the left keys. + * left_anti: use only keys from left frame that are not in right frame, similar + to SQL left anti join; preserve key order. + * right_anti: use only keys from right frame that are not in left frame, similar + to SQL right anti join; preserve key order. on : label or list Column or index level names to join on. These must be found in both DataFrames.
If `on` is None and not merging on indexes then this defaults @@ -953,7 +958,7 @@ def __init__( self, left: DataFrame | Series, right: DataFrame | Series, - how: JoinHow | Literal["asof"] = "inner", + how: JoinHow | Literal["left_anti", "right_anti", "asof"] = "inner", on: IndexLabel | AnyArrayLike | None = None, left_on: IndexLabel | AnyArrayLike | None = None, right_on: IndexLabel | AnyArrayLike | None = None, @@ -968,7 +973,7 @@ def __init__( _right = _validate_operand(right) self.left = self.orig_left = _left self.right = self.orig_right = _right - self.how = how + self.how, self.anti_join = self._validate_how(how) self.on = com.maybe_make_list(on) @@ -998,14 +1003,6 @@ def __init__( ) raise MergeError(msg) - # GH 59435: raise when "how" is not a valid Merge type - merge_type = {"left", "right", "inner", "outer", "cross", "asof"} - if how not in merge_type: - raise ValueError( - f"'{how}' is not a valid Merge type: " - f"left, right, inner, outer, cross, asof" - ) - self.left_on, self.right_on = self._validate_left_right_on(left_on, right_on) ( @@ -1035,6 +1032,37 @@ def __init__( if validate is not None: self._validate_validate_kwd(validate) + @final + def _validate_how( + self, how: JoinHow | Literal["left_anti", "right_anti", "asof"] + ) -> tuple[JoinHow | Literal["asof"], bool]: + """ + Validate the 'how' parameter and return the actual join type and whether + this is an anti join. 
+ """ + # GH 59435: raise when "how" is not a valid Merge type + merge_type = { + "left", + "right", + "inner", + "outer", + "left_anti", + "right_anti", + "cross", + "asof", + } + if how not in merge_type: + raise ValueError( + f"'{how}' is not a valid Merge type: " + f"left, right, inner, outer, left_anti, right_anti, cross, asof" + ) + anti_join = False + if how in {"left_anti", "right_anti"}: + how = how.split("_")[0] # type: ignore[assignment] + anti_join = True + how = cast(JoinHow | Literal["asof"], how) + return how, anti_join + def _maybe_require_matching_dtypes( self, left_join_keys: list[ArrayLike], right_join_keys: list[ArrayLike] ) -> None: @@ -1405,6 +1433,11 @@ def _get_join_info( n = len(left_ax) if left_indexer is None else len(left_indexer) join_index = default_index(n) + if self.anti_join: + join_index, left_indexer, right_indexer = self._handle_anti_join( + join_index, left_indexer, right_indexer + ) + return join_index, left_indexer, right_indexer @final @@ -1447,6 +1480,48 @@ def _create_join_index( return index.copy() return index.take(indexer) + @final + def _handle_anti_join( + self, + join_index: Index, + left_indexer: npt.NDArray[np.intp] | None, + right_indexer: npt.NDArray[np.intp] | None, + ) -> tuple[Index, npt.NDArray[np.intp] | None, npt.NDArray[np.intp] | None]: + """ + Handle anti join by returning the correct join index and indexers + + Parameters + ---------- + join_index : Index + join index + left_indexer : np.ndarray[np.intp] or None + left indexer + right_indexer : np.ndarray[np.intp] or None + right indexer + + Returns + ------- + Index, np.ndarray[np.intp] or None, np.ndarray[np.intp] or None + """ + # Make sure indexers are not None + if left_indexer is None: + left_indexer = np.arange(len(self.left)) + if right_indexer is None: + right_indexer = np.arange(len(self.right)) + + assert self.how in {"left", "right"} + if self.how == "left": + # Filter to rows where left keys are not in right keys + filt = right_indexer == -1 
+ else: + # Filter to rows where right keys are not in left keys + filt = left_indexer == -1 + join_index = join_index[filt] + left_indexer = left_indexer[filt] + right_indexer = right_indexer[filt] + + return join_index, left_indexer, right_indexer + @final def _get_merge_keys( self, @@ -1929,9 +2004,9 @@ def get_join_indexers( np.ndarray[np.intp] or None Indexer into the right_keys. """ - assert len(left_keys) == len( - right_keys - ), "left_keys and right_keys must be the same length" + assert len(left_keys) == len(right_keys), ( + "left_keys and right_keys must be the same length" + ) # fast-path for empty left/right left_n = len(left_keys[0]) diff --git a/pandas/core/reshape/reshape.py b/pandas/core/reshape/reshape.py index 9b7b768fe7adb..c60fe71a7ff28 100644 --- a/pandas/core/reshape/reshape.py +++ b/pandas/core/reshape/reshape.py @@ -929,6 +929,8 @@ def _reorder_for_extension_array_stack( def stack_v3(frame: DataFrame, level: list[int]) -> Series | DataFrame: if frame.columns.nunique() != len(frame.columns): raise ValueError("Columns with duplicate values are not supported in stack") + if not len(level): + return frame set_levels = set(level) stack_cols = frame.columns._drop_level_numbers( [k for k in range(frame.columns.nlevels - 1, -1, -1) if k not in set_levels] diff --git a/pandas/core/reshape/tile.py b/pandas/core/reshape/tile.py index b3f946f289891..034b861a83f43 100644 --- a/pandas/core/reshape/tile.py +++ b/pandas/core/reshape/tile.py @@ -73,7 +73,7 @@ def cut( Parameters ---------- - x : array-like + x : 1d ndarray or Series The input array to be binned. Must be 1-dimensional. bins : int, sequence of scalars, or IntervalIndex The criteria to bin by. @@ -126,7 +126,7 @@ def cut( Categorical for all other inputs. The values stored within are whatever the type in the sequence is. - * False : returns an ndarray of integers. + * False : returns a 1d ndarray or Series of integers. bins : numpy.ndarray or IntervalIndex. The computed or specified bins. 
Only returned when `retbins=True`. diff --git a/pandas/core/series.py b/pandas/core/series.py index 4fa8b86fa4c16..351622135b31f 100644 --- a/pandas/core/series.py +++ b/pandas/core/series.py @@ -500,7 +500,7 @@ def __init__( # create/copy the manager if isinstance(data, SingleBlockManager): if dtype is not None: - data = data.astype(dtype=dtype, errors="ignore") + data = data.astype(dtype=dtype) elif copy: data = data.copy() else: diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py index 30487de7bafd5..0a10001a3113f 100644 --- a/pandas/core/tools/datetimes.py +++ b/pandas/core/tools/datetimes.py @@ -192,9 +192,9 @@ def should_cache( else: check_count = 500 else: - assert ( - 0 <= check_count <= len(arg) - ), "check_count must be in next bounds: [0; len(arg)]" + assert 0 <= check_count <= len(arg), ( + "check_count must be in next bounds: [0; len(arg)]" + ) if check_count == 0: return False diff --git a/pandas/core/window/rolling.py b/pandas/core/window/rolling.py index 631ab15464942..69fce8cf2137e 100644 --- a/pandas/core/window/rolling.py +++ b/pandas/core/window/rolling.py @@ -881,10 +881,10 @@ class Window(BaseWindow): Parameters ---------- window : int, timedelta, str, offset, or BaseIndexer subclass - Size of the moving window. + Interval of the moving window. - If an integer, the fixed number of observations used for - each window. + If an integer, the delta between the start and end of each window. + The number of points in the window depends on the ``closed`` argument. If a timedelta, str, or offset, the time period of each window. Each window will be a variable sized based on the observations included in @@ -929,14 +929,22 @@ class Window(BaseWindow): an integer index is not used to calculate the rolling window. closed : str, default None - If ``'right'``, the first point in the window is excluded from calculations. 
+ Determines the inclusivity of points in the window - If ``'left'``, the last point in the window is excluded from calculations. + If ``'right'``, uses the window (first, last] meaning the last point + is included in the calculations. - If ``'both'``, no point in the window is excluded from calculations. + If ``'left'``, uses the window [first, last) meaning the first point + is included in the calculations. - If ``'neither'``, the first and last points in the window are excluded - from calculations. + If ``'both'``, uses the window [first, last] meaning all points in + the window are included in the calculations. + + If ``'neither'``, uses the window (first, last) meaning the first + and last points in the window are excluded from calculations. + + () and [] are referencing open and closed set + notation respectively. Default ``None`` (``'right'``). diff --git a/pandas/io/excel/_odswriter.py b/pandas/io/excel/_odswriter.py index 10a06aec72a57..ba4919c9298ed 100644 --- a/pandas/io/excel/_odswriter.py +++ b/pandas/io/excel/_odswriter.py @@ -270,7 +270,7 @@ def _process_style(self, style: dict[str, Any] | None) -> str | None: style_key = json.dumps(style) if style_key in self._style_dict: return self._style_dict[style_key] - name = f"pd{len(self._style_dict)+1}" + name = f"pd{len(self._style_dict) + 1}" self._style_dict[style_key] = name odf_style = Style(name=name, family="table-cell") if "font" in style: diff --git a/pandas/io/formats/format.py b/pandas/io/formats/format.py index 46ecb2b9a8f12..b7fbc4e5e22b7 100644 --- a/pandas/io/formats/format.py +++ b/pandas/io/formats/format.py @@ -897,9 +897,13 @@ def to_html( ``
<table>`` tag, in addition to the default "dataframe". notebook : {True, False}, optional, default False Whether the generated HTML is for IPython Notebook. - border : int - A ``border=border`` attribute is included in the opening - ``<table>`` tag. Default ``pd.options.display.html.border``. + border : int or bool + When an integer value is provided, it sets the border attribute in + the opening tag, specifying the thickness of the border. + If ``False`` or ``0`` is passed, the border attribute will not + be present in the ``<table>`` tag. + The default value for this parameter is governed by + ``pd.options.display.html.border``. table_id : str, optional A css id is included in the opening `<table>
` tag if specified. render_links : bool, default False diff --git a/pandas/io/formats/printing.py b/pandas/io/formats/printing.py index a9936ba8c8f2c..5a52ee78cb9be 100644 --- a/pandas/io/formats/printing.py +++ b/pandas/io/formats/printing.py @@ -111,6 +111,8 @@ def _pprint_seq( """ if isinstance(seq, set): fmt = "{{{body}}}" + elif isinstance(seq, frozenset): + fmt = "frozenset({body})" else: fmt = "[{body}]" if hasattr(seq, "__setitem__") else "({body})" @@ -336,8 +338,8 @@ def format_object_summary( if indent_for_name: name_len = len(name) - space1 = f'\n{(" " * (name_len + 1))}' - space2 = f'\n{(" " * (name_len + 2))}' + space1 = f"\n{(' ' * (name_len + 1))}" + space2 = f"\n{(' ' * (name_len + 2))}" else: space1 = "\n" space2 = "\n " # space for the opening '[' diff --git a/pandas/io/formats/style.py b/pandas/io/formats/style.py index 3f37556867954..f2ec41d2c6a43 100644 --- a/pandas/io/formats/style.py +++ b/pandas/io/formats/style.py @@ -2021,7 +2021,7 @@ def apply( more details. """ self._todo.append( - (lambda instance: getattr(instance, "_apply"), (func, axis, subset), kwargs) + (lambda instance: instance._apply, (func, axis, subset), kwargs) ) return self @@ -2128,7 +2128,7 @@ def apply_index( """ self._todo.append( ( - lambda instance: getattr(instance, "_apply_index"), + lambda instance: instance._apply_index, (func, axis, level, "apply"), kwargs, ) @@ -2157,7 +2157,7 @@ def map_index( ) -> Styler: self._todo.append( ( - lambda instance: getattr(instance, "_apply_index"), + lambda instance: instance._apply_index, (func, axis, level, "map"), kwargs, ) @@ -2230,9 +2230,7 @@ def map(self, func: Callable, subset: Subset | None = None, **kwargs) -> Styler: See `Table Visualization <../../user_guide/style.ipynb>`_ user guide for more details. 
""" - self._todo.append( - (lambda instance: getattr(instance, "_map"), (func, subset), kwargs) - ) + self._todo.append((lambda instance: instance._map, (func, subset), kwargs)) return self def set_table_attributes(self, attributes: str) -> Styler: @@ -2588,7 +2586,7 @@ def set_sticky( for i, level in enumerate(levels_): styles.append( { - "selector": f"thead tr:nth-child({level+1}) th", + "selector": f"thead tr:nth-child({level + 1}) th", "props": props + ( f"top:{i * pixel_size}px; height:{pixel_size}px; " @@ -2599,7 +2597,7 @@ def set_sticky( if not all(name is None for name in self.index.names): styles.append( { - "selector": f"thead tr:nth-child({obj.nlevels+1}) th", + "selector": f"thead tr:nth-child({obj.nlevels + 1}) th", "props": props + ( f"top:{(len(levels_)) * pixel_size}px; " @@ -2619,7 +2617,7 @@ def set_sticky( styles.extend( [ { - "selector": f"thead tr th:nth-child({level+1})", + "selector": f"thead tr th:nth-child({level + 1})", "props": props_ + "z-index:3 !important;", }, { @@ -4214,8 +4212,10 @@ def css_bar(start: float, end: float, color: str) -> str: if end > start: cell_css += "background: linear-gradient(90deg," if start > 0: - cell_css += f" transparent {start*100:.1f}%, {color} {start*100:.1f}%," - cell_css += f" {color} {end*100:.1f}%, transparent {end*100:.1f}%)" + cell_css += ( + f" transparent {start * 100:.1f}%, {color} {start * 100:.1f}%," + ) + cell_css += f" {color} {end * 100:.1f}%, transparent {end * 100:.1f}%)" return cell_css def css_calc(x, left: float, right: float, align: str, color: str | list | tuple): diff --git a/pandas/io/formats/style_render.py b/pandas/io/formats/style_render.py index 2d1218b007d19..482ed316c7ce4 100644 --- a/pandas/io/formats/style_render.py +++ b/pandas/io/formats/style_render.py @@ -850,10 +850,7 @@ def _generate_body_row( data_element = _element( "td", - ( - f"{self.css['data']} {self.css['row']}{r} " - f"{self.css['col']}{c}{cls}" - ), + (f"{self.css['data']} {self.css['row']}{r} 
{self.css['col']}{c}{cls}"), value, data_element_visible, attributes="", @@ -973,7 +970,7 @@ def concatenated_visible_rows(obj): idx_len = d["index_lengths"].get((lvl, r), None) if idx_len is not None: # i.e. not a sparsified entry d["clines"][rn + idx_len].append( - f"\\cline{{{lvln+1}-{len(visible_index_levels)+data_len}}}" + f"\\cline{{{lvln + 1}-{len(visible_index_levels) + data_len}}}" # noqa: E501 ) def format( @@ -1557,7 +1554,7 @@ def relabel_index( >>> df = pd.DataFrame({"samples": np.random.rand(10)}) >>> styler = df.loc[np.random.randint(0, 10, 3)].style - >>> styler.relabel_index([f"sample{i+1} ({{}})" for i in range(3)]) + >>> styler.relabel_index([f"sample{i + 1} ({{}})" for i in range(3)]) ... # doctest: +SKIP samples sample1 (5) 0.315811 @@ -2520,7 +2517,7 @@ def color(value, user_arg, command, comm_arg): if value[0] == "#" and len(value) == 7: # color is hex code return command, f"[HTML]{{{value[1:].upper()}}}{arg}" if value[0] == "#" and len(value) == 4: # color is short hex code - val = f"{value[1].upper()*2}{value[2].upper()*2}{value[3].upper()*2}" + val = f"{value[1].upper() * 2}{value[2].upper() * 2}{value[3].upper() * 2}" return command, f"[HTML]{{{val}}}{arg}" elif value[:3] == "rgb": # color is rgb or rgba r = re.findall("(?<=\\()[0-9\\s%]+(?=,)", value)[0].strip() diff --git a/pandas/io/formats/xml.py b/pandas/io/formats/xml.py index 47f162e93216d..febf43b9a1018 100644 --- a/pandas/io/formats/xml.py +++ b/pandas/io/formats/xml.py @@ -260,7 +260,7 @@ def _other_namespaces(self) -> dict: nmsp_dict: dict[str, str] = {} if self.namespaces: nmsp_dict = { - f"xmlns{p if p=='' else f':{p}'}": n + f"xmlns{p if p == '' else f':{p}'}": n for p, n in self.namespaces.items() if n != self.prefix_uri[1:-1] } @@ -404,7 +404,7 @@ def _get_prefix_uri(self) -> str: f"{self.prefix} is not included in namespaces" ) from err elif "" in self.namespaces: - uri = f'{{{self.namespaces[""]}}}' + uri = f"{{{self.namespaces['']}}}" else: uri = "" @@ -502,7 +502,7 @@ 
def _get_prefix_uri(self) -> str: f"{self.prefix} is not included in namespaces" ) from err elif "" in self.namespaces: - uri = f'{{{self.namespaces[""]}}}' + uri = f"{{{self.namespaces['']}}}" else: uri = "" diff --git a/pandas/io/json/_json.py b/pandas/io/json/_json.py index 237518b3c8d92..703a2b3656c9c 100644 --- a/pandas/io/json/_json.py +++ b/pandas/io/json/_json.py @@ -917,7 +917,7 @@ def _combine_lines(self, lines) -> str: Combines a list of JSON objects into one JSON object. """ return ( - f'[{",".join([line for line in (line.strip() for line in lines) if line])}]' + f"[{','.join([line for line in (line.strip() for line in lines) if line])}]" ) @overload diff --git a/pandas/io/orc.py b/pandas/io/orc.py index a945f3dc38d35..1a2d564d5b44d 100644 --- a/pandas/io/orc.py +++ b/pandas/io/orc.py @@ -45,6 +45,13 @@ def read_orc( """ Load an ORC object from the file path, returning a DataFrame. + This method reads an ORC (Optimized Row Columnar) file into a pandas + DataFrame using the `pyarrow.orc` library. ORC is a columnar storage format + that provides efficient compression and fast retrieval for analytical workloads. + It allows reading specific columns, handling different filesystem + types (such as local storage, cloud storage via fsspec, or pyarrow filesystem), + and supports different data type backends, including `numpy_nullable` and `pyarrow`. 
+ Parameters ---------- path : str, path object, or file-like object diff --git a/pandas/io/parsers/base_parser.py b/pandas/io/parsers/base_parser.py index e263c69376d05..c283f600eb971 100644 --- a/pandas/io/parsers/base_parser.py +++ b/pandas/io/parsers/base_parser.py @@ -112,8 +112,7 @@ def __init__(self, kwds) -> None: parse_dates = bool(parse_dates) elif not isinstance(parse_dates, list): raise TypeError( - "Only booleans and lists are accepted " - "for the 'parse_dates' parameter" + "Only booleans and lists are accepted for the 'parse_dates' parameter" ) self.parse_dates: bool | list = parse_dates self.date_parser = kwds.pop("date_parser", lib.no_default) diff --git a/pandas/io/parsers/python_parser.py b/pandas/io/parsers/python_parser.py index db9547a18b600..e7b5c7f06a79a 100644 --- a/pandas/io/parsers/python_parser.py +++ b/pandas/io/parsers/python_parser.py @@ -595,8 +595,7 @@ def _infer_columns( joi = list(map(str, header[:-1] if have_mi_columns else header)) msg = f"[{','.join(joi)}], len of {len(joi)}, " raise ValueError( - f"Passed header={msg}" - f"but only {self.line_pos} lines in file" + f"Passed header={msg}but only {self.line_pos} lines in file" ) from err # We have an empty file, so check @@ -1219,8 +1218,7 @@ def _rows_to_cols(self, content: list[list[Scalar]]) -> list[np.ndarray]: for row_num, actual_len in bad_lines: msg = ( - f"Expected {col_len} fields in line {row_num + 1}, saw " - f"{actual_len}" + f"Expected {col_len} fields in line {row_num + 1}, saw {actual_len}" ) if ( self.delimiter diff --git a/pandas/io/parsers/readers.py b/pandas/io/parsers/readers.py index 54877017f76fc..67193f930b4dc 100644 --- a/pandas/io/parsers/readers.py +++ b/pandas/io/parsers/readers.py @@ -1219,8 +1219,7 @@ def _get_options_with_defaults(self, engine: CSVEngine) -> dict[str, Any]: and value != getattr(value, "value", default) ): raise ValueError( - f"The {argname!r} option is not supported with the " - f"'pyarrow' engine" + f"The {argname!r} option is not 
supported with the 'pyarrow' engine" ) options[argname] = value @@ -1396,8 +1395,7 @@ def _clean_options( if not is_integer(skiprows) and skiprows is not None: # pyarrow expects skiprows to be passed as an integer raise ValueError( - "skiprows argument must be an integer when using " - "engine='pyarrow'" + "skiprows argument must be an integer when using engine='pyarrow'" ) else: if is_integer(skiprows): diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py index e18db2e53113f..b4c78b063c180 100644 --- a/pandas/io/pytables.py +++ b/pandas/io/pytables.py @@ -3524,6 +3524,12 @@ def validate(self, other) -> None: # Value of type "Optional[Any]" is not indexable [index] oax = ov[i] # type: ignore[index] if sax != oax: + if c == "values_axes" and sax.kind != oax.kind: + raise ValueError( + f"Cannot serialize the column [{oax.values[0]}] " + f"because its data contents are not [{sax.kind}] " + f"but [{oax.kind}] object dtype" + ) raise ValueError( f"invalid combination of [{c}] on appending data " f"[{sax}] vs current table [{oax}]" @@ -5136,6 +5142,9 @@ def _maybe_convert_for_string_atom( data = bvalues.copy() data[mask] = nan_rep + if existing_col and mask.any() and len(nan_rep) > existing_col.itemsize: + raise ValueError("NaN representation is too large for existing column size") + # see if we have a valid string type inferred_type = lib.infer_dtype(data, skipna=False) if inferred_type != "string": diff --git a/pandas/io/sas/sas_xport.py b/pandas/io/sas/sas_xport.py index 89dbdab64c23c..a9c45e720fd56 100644 --- a/pandas/io/sas/sas_xport.py +++ b/pandas/io/sas/sas_xport.py @@ -33,19 +33,16 @@ ReadBuffer, ) _correct_line1 = ( - "HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!" 
- "000000000000000000000000000000 " + "HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!000000000000000000000000000000 " ) _correct_header1 = ( "HEADER RECORD*******MEMBER HEADER RECORD!!!!!!!000000000000000001600000000" ) _correct_header2 = ( - "HEADER RECORD*******DSCRPTR HEADER RECORD!!!!!!!" - "000000000000000000000000000000 " + "HEADER RECORD*******DSCRPTR HEADER RECORD!!!!!!!000000000000000000000000000000 " ) _correct_obs_header = ( - "HEADER RECORD*******OBS HEADER RECORD!!!!!!!" - "000000000000000000000000000000 " + "HEADER RECORD*******OBS HEADER RECORD!!!!!!!000000000000000000000000000000 " ) _fieldkeys = [ "ntype", diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py index aee872f9ae50a..9670b5439c87e 100644 --- a/pandas/plotting/_core.py +++ b/pandas/plotting/_core.py @@ -247,11 +247,14 @@ def hist_frame( .. plot:: :context: close-figs - >>> data = {"length": [1.5, 0.5, 1.2, 0.9, 3], "width": [0.7, 0.2, 0.15, 0.2, 1.1]} + >>> data = { + ... "length": [1.5, 0.5, 1.2, 0.9, 3], + ... "width": [0.7, 0.2, 0.15, 0.2, 1.1], + ... } >>> index = ["pig", "rabbit", "duck", "chicken", "horse"] >>> df = pd.DataFrame(data, index=index) >>> hist = df.hist(bins=3) - """ # noqa: E501 + """ plot_backend = _get_plot_backend(backend) return plot_backend.hist_frame( data, @@ -845,7 +848,10 @@ class PlotAccessor(PandasObject): :context: close-figs >>> df = pd.DataFrame( - ... {"length": [1.5, 0.5, 1.2, 0.9, 3], "width": [0.7, 0.2, 0.15, 0.2, 1.1]}, + ... { + ... "length": [1.5, 0.5, 1.2, 0.9, 3], + ... "width": [0.7, 0.2, 0.15, 0.2, 1.1], + ... }, ... index=["pig", "rabbit", "duck", "chicken", "horse"], ... 
) >>> plot = df.plot(title="DataFrame Plot") @@ -866,7 +872,7 @@ class PlotAccessor(PandasObject): >>> df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["A", "B", "A", "B"]}) >>> plot = df.groupby("col2").plot(kind="bar", title="DataFrameGroupBy Plot") - """ # noqa: E501 + """ _common_kinds = ("line", "bar", "barh", "kde", "density", "area", "hist", "box") _series_kinds = ("pie",) @@ -993,8 +999,7 @@ def __call__(self, *args, **kwargs): if kind not in self._all_kinds: raise ValueError( - f"{kind} is not a valid plot kind " - f"Valid plot kinds: {self._all_kinds}" + f"{kind} is not a valid plot kind Valid plot kinds: {self._all_kinds}" ) data = self._parent @@ -1630,7 +1635,9 @@ def area( ... "signups": [5, 5, 6, 12, 14, 13], ... "visits": [20, 42, 28, 62, 81, 50], ... }, - ... index=pd.date_range(start="2018/01/01", end="2018/07/01", freq="ME"), + ... index=pd.date_range( + ... start="2018/01/01", end="2018/07/01", freq="ME" + ... ), ... ) >>> ax = df.plot.area() @@ -1662,7 +1669,7 @@ def area( ... } ... ) >>> ax = df.plot.area(x="day") - """ # noqa: E501 + """ return self(kind="area", x=x, y=y, stacked=stacked, **kwargs) def pie(self, y: IndexLabel | None = None, **kwargs) -> PlotAccessor: diff --git a/pandas/plotting/_matplotlib/boxplot.py b/pandas/plotting/_matplotlib/boxplot.py index 5ad30a68ae3c9..af77972da8634 100644 --- a/pandas/plotting/_matplotlib/boxplot.py +++ b/pandas/plotting/_matplotlib/boxplot.py @@ -123,8 +123,7 @@ def _validate_color_args(self, color, colormap): if colormap is not None: warnings.warn( - "'color' and 'colormap' cannot be used " - "simultaneously. Using 'color'", + "'color' and 'colormap' cannot be used simultaneously. 
Using 'color'", stacklevel=find_stack_level(), ) diff --git a/pandas/plotting/_misc.py b/pandas/plotting/_misc.py index 3f839cefe798e..0e0fb23d924bc 100644 --- a/pandas/plotting/_misc.py +++ b/pandas/plotting/_misc.py @@ -30,6 +30,13 @@ def table(ax: Axes, data: DataFrame | Series, **kwargs) -> Table: """ Helper function to convert DataFrame and Series to matplotlib.table. + This method provides an easy way to visualize tabular data within a Matplotlib + figure. It automatically extracts index and column labels from the DataFrame + or Series, unless explicitly specified. This function is particularly useful + when displaying summary tables alongside other plots or when creating static + reports. It utilizes the `matplotlib.pyplot.table` backend and allows + customization through various styling options available in Matplotlib. + Parameters ---------- ax : Matplotlib axes object diff --git a/pandas/tests/apply/test_frame_apply.py b/pandas/tests/apply/test_frame_apply.py index d36d723c4be6a..b9e407adc3051 100644 --- a/pandas/tests/apply/test_frame_apply.py +++ b/pandas/tests/apply/test_frame_apply.py @@ -4,6 +4,8 @@ import numpy as np import pytest +from pandas.compat import is_platform_arm + from pandas.core.dtypes.dtypes import CategoricalDtype import pandas as pd @@ -16,6 +18,7 @@ ) import pandas._testing as tm from pandas.tests.frame.common import zip_frames +from pandas.util.version import Version @pytest.fixture @@ -65,6 +68,13 @@ def test_apply(float_frame, engine, request): @pytest.mark.parametrize("raw", [True, False]) @pytest.mark.parametrize("nopython", [True, False]) def test_apply_args(float_frame, axis, raw, engine, nopython): + numba = pytest.importorskip("numba") + if ( + engine == "numba" + and Version(numba.__version__) == Version("0.61") + and is_platform_arm() + ): + pytest.skip(f"Segfaults on ARM platforms with numba {numba.__version__}") engine_kwargs = {"nopython": nopython} result = float_frame.apply( lambda x, y: x + y, diff --git 
a/pandas/tests/apply/test_numba.py b/pandas/tests/apply/test_numba.py index d6cd9c321ace6..75bc3f5b74b9d 100644 --- a/pandas/tests/apply/test_numba.py +++ b/pandas/tests/apply/test_numba.py @@ -1,6 +1,7 @@ import numpy as np import pytest +from pandas.compat import is_platform_arm import pandas.util._test_decorators as td import pandas as pd @@ -9,8 +10,17 @@ Index, ) import pandas._testing as tm +from pandas.util.version import Version -pytestmark = [td.skip_if_no("numba"), pytest.mark.single_cpu] +pytestmark = [td.skip_if_no("numba"), pytest.mark.single_cpu, pytest.mark.skipif()] + +numba = pytest.importorskip("numba") +pytestmark.append( + pytest.mark.skipif( + Version(numba.__version__) == Version("0.61") and is_platform_arm(), + reason=f"Segfaults on ARM platforms with numba {numba.__version__}", + ) +) @pytest.fixture(params=[0, 1]) diff --git a/pandas/tests/arrays/interval/test_formats.py b/pandas/tests/arrays/interval/test_formats.py index 535efee519374..88c9bf81d718c 100644 --- a/pandas/tests/arrays/interval/test_formats.py +++ b/pandas/tests/arrays/interval/test_formats.py @@ -6,8 +6,6 @@ def test_repr(): arr = IntervalArray.from_tuples([(0, 1), (1, 2)]) result = repr(arr) expected = ( - "\n" - "[(0, 1], (1, 2]]\n" - "Length: 2, dtype: interval[int64, right]" + "\n[(0, 1], (1, 2]]\nLength: 2, dtype: interval[int64, right]" ) assert result == expected diff --git a/pandas/tests/dtypes/cast/test_downcast.py b/pandas/tests/dtypes/cast/test_downcast.py index 9430ba2c478ae..69200b2e5fc96 100644 --- a/pandas/tests/dtypes/cast/test_downcast.py +++ b/pandas/tests/dtypes/cast/test_downcast.py @@ -33,9 +33,9 @@ ( # This is a judgement call, but we do _not_ downcast Decimal # objects - np.array([decimal.Decimal(0.0)]), + np.array([decimal.Decimal("0.0")]), "int64", - np.array([decimal.Decimal(0.0)]), + np.array([decimal.Decimal("0.0")]), ), ( # GH#45837 diff --git a/pandas/tests/dtypes/test_common.py b/pandas/tests/dtypes/test_common.py index 
fa48393dd183e..2bda2fddec2ff 100644 --- a/pandas/tests/dtypes/test_common.py +++ b/pandas/tests/dtypes/test_common.py @@ -22,6 +22,7 @@ import pandas._testing as tm from pandas.api.types import pandas_dtype from pandas.arrays import SparseArray +from pandas.util.version import Version # EA & Actual Dtypes @@ -788,11 +789,18 @@ def test_validate_allhashable(): def test_pandas_dtype_numpy_warning(): # GH#51523 - with tm.assert_produces_warning( - DeprecationWarning, - check_stacklevel=False, - match="Converting `np.integer` or `np.signedinteger` to a dtype is deprecated", - ): + if Version(np.__version__) <= Version("2.2.2"): + ctx = tm.assert_produces_warning( + DeprecationWarning, + check_stacklevel=False, + match=( + "Converting `np.integer` or `np.signedinteger` to a dtype is deprecated" + ), + ) + else: + ctx = tm.external_error_raised(TypeError) + + with ctx: pandas_dtype(np.integer) diff --git a/pandas/tests/dtypes/test_dtypes.py b/pandas/tests/dtypes/test_dtypes.py index b7e37ff270e60..621217a8c9317 100644 --- a/pandas/tests/dtypes/test_dtypes.py +++ b/pandas/tests/dtypes/test_dtypes.py @@ -660,8 +660,7 @@ def test_construction_generic(self, subtype): def test_construction_not_supported(self, subtype): # GH 19016 msg = ( - "category, object, and string subtypes are not supported " - "for IntervalDtype" + "category, object, and string subtypes are not supported for IntervalDtype" ) with pytest.raises(TypeError, match=msg): IntervalDtype(subtype) diff --git a/pandas/tests/dtypes/test_missing.py b/pandas/tests/dtypes/test_missing.py index 73c462d492d2d..c61cda83cf6e0 100644 --- a/pandas/tests/dtypes/test_missing.py +++ b/pandas/tests/dtypes/test_missing.py @@ -321,7 +321,7 @@ def test_period(self): def test_decimal(self): # scalars GH#23530 - a = Decimal(1.0) + a = Decimal("1.0") assert isna(a) is False assert notna(a) is True diff --git a/pandas/tests/extension/base/getitem.py b/pandas/tests/extension/base/getitem.py index 27fa1206f6f7f..1f3680bf67e90 100644 
--- a/pandas/tests/extension/base/getitem.py +++ b/pandas/tests/extension/base/getitem.py @@ -139,8 +139,8 @@ def test_getitem_invalid(self, data): "index out of bounds", # pyarrow "Out of bounds access", # Sparse f"loc must be an integer between -{ub} and {ub}", # Sparse - f"index {ub+1} is out of bounds for axis 0 with size {ub}", - f"index -{ub+1} is out of bounds for axis 0 with size {ub}", + f"index {ub + 1} is out of bounds for axis 0 with size {ub}", + f"index -{ub + 1} is out of bounds for axis 0 with size {ub}", ] ) with pytest.raises(IndexError, match=msg): diff --git a/pandas/tests/extension/decimal/array.py b/pandas/tests/extension/decimal/array.py index 59f313b4c9edb..2ee6a73ec4054 100644 --- a/pandas/tests/extension/decimal/array.py +++ b/pandas/tests/extension/decimal/array.py @@ -125,7 +125,6 @@ def to_numpy( return result def __array_ufunc__(self, ufunc: np.ufunc, method: str, *inputs, **kwargs): - # if not all( isinstance(t, self._HANDLED_TYPES + (DecimalArray,)) for t in inputs ): diff --git a/pandas/tests/extension/json/array.py b/pandas/tests/extension/json/array.py index a68c8a06e1d18..b110911bda400 100644 --- a/pandas/tests/extension/json/array.py +++ b/pandas/tests/extension/json/array.py @@ -176,8 +176,7 @@ def take(self, indexer, allow_fill=False, fill_value=None): # an ndarary. indexer = np.asarray(indexer) msg = ( - "Index is out of bounds or cannot do a " - "non-empty take from an empty array." + "Index is out of bounds or cannot do a non-empty take from an empty array." ) if allow_fill: diff --git a/pandas/tests/extension/list/array.py b/pandas/tests/extension/list/array.py index da53bdcb4e37e..8b4728c7d6292 100644 --- a/pandas/tests/extension/list/array.py +++ b/pandas/tests/extension/list/array.py @@ -81,8 +81,7 @@ def take(self, indexer, allow_fill=False, fill_value=None): # an ndarary. indexer = np.asarray(indexer) msg = ( - "Index is out of bounds or cannot do a " - "non-empty take from an empty array." 
+ "Index is out of bounds or cannot do a non-empty take from an empty array." ) if allow_fill: diff --git a/pandas/tests/extension/test_arrow.py b/pandas/tests/extension/test_arrow.py index 4fccf02e08bd6..d6f428f4938a6 100644 --- a/pandas/tests/extension/test_arrow.py +++ b/pandas/tests/extension/test_arrow.py @@ -964,8 +964,7 @@ def _get_arith_xfail_marker(self, opname, pa_dtype): mark = pytest.mark.xfail( raises=TypeError, reason=( - f"{opname} not supported between" - f"pd.NA and {pa_dtype} Python scalar" + f"{opname} not supported betweenpd.NA and {pa_dtype} Python scalar" ), ) elif opname == "__rfloordiv__" and ( diff --git a/pandas/tests/frame/methods/test_info.py b/pandas/tests/frame/methods/test_info.py index 74e4383950174..de6737ec3bc39 100644 --- a/pandas/tests/frame/methods/test_info.py +++ b/pandas/tests/frame/methods/test_info.py @@ -11,6 +11,7 @@ HAS_PYARROW, IS64, PYPY, + is_platform_arm, ) from pandas import ( @@ -23,6 +24,7 @@ option_context, ) import pandas._testing as tm +from pandas.util.version import Version @pytest.fixture @@ -522,7 +524,7 @@ def test_info_int_columns(using_infer_string): 0 1 2 non-null int64 1 2 2 non-null int64 dtypes: int64(2) - memory usage: {'50.0' if using_infer_string and HAS_PYARROW else '48.0+'} bytes + memory usage: {"50.0" if using_infer_string and HAS_PYARROW else "48.0+"} bytes """ ) assert result == expected @@ -544,7 +546,9 @@ def test_memory_usage_empty_no_warning(using_infer_string): @pytest.mark.single_cpu def test_info_compute_numba(): # GH#51922 - pytest.importorskip("numba") + numba = pytest.importorskip("numba") + if Version(numba.__version__) == Version("0.61") and is_platform_arm(): + pytest.skip(f"Segfaults on ARM platforms with numba {numba.__version__}") df = DataFrame([[1, 2], [3, 4]]) with option_context("compute.use_numba", True): diff --git a/pandas/tests/frame/methods/test_join.py b/pandas/tests/frame/methods/test_join.py index 479ea7d7ba692..aaa9485cab580 100644 --- 
a/pandas/tests/frame/methods/test_join.py +++ b/pandas/tests/frame/methods/test_join.py @@ -277,7 +277,20 @@ def test_join_index(float_frame): tm.assert_index_equal(joined.index, float_frame.index.sort_values()) tm.assert_index_equal(joined.columns, expected_columns) - join_msg = "'foo' is not a valid Merge type: left, right, inner, outer, cross, asof" + # left anti + joined = f.join(f2, how="left_anti") + tm.assert_index_equal(joined.index, float_frame.index[:5]) + tm.assert_index_equal(joined.columns, expected_columns) + + # right anti + joined = f.join(f2, how="right_anti") + tm.assert_index_equal(joined.index, float_frame.index[10:][::-1]) + tm.assert_index_equal(joined.columns, expected_columns) + + join_msg = ( + "'foo' is not a valid Merge type: left, right, inner, outer, " + "left_anti, right_anti, cross, asof" + ) with pytest.raises(ValueError, match=re.escape(join_msg)): f.join(f2, how="foo") diff --git a/pandas/tests/frame/methods/test_sample.py b/pandas/tests/frame/methods/test_sample.py index 91d735a8b2fa7..a9d56cbfd2b46 100644 --- a/pandas/tests/frame/methods/test_sample.py +++ b/pandas/tests/frame/methods/test_sample.py @@ -198,8 +198,7 @@ def test_sample_upsampling_without_replacement(self, frame_or_series): obj = tm.get_obj(obj, frame_or_series) msg = ( - "Replace has to be set to `True` when " - "upsampling the population `frac` > 1." + "Replace has to be set to `True` when upsampling the population `frac` > 1." 
) with pytest.raises(ValueError, match=msg): obj.sample(frac=2, replace=False) diff --git a/pandas/tests/frame/methods/test_set_axis.py b/pandas/tests/frame/methods/test_set_axis.py index 1967941bca9f0..7b75bcf4f348d 100644 --- a/pandas/tests/frame/methods/test_set_axis.py +++ b/pandas/tests/frame/methods/test_set_axis.py @@ -93,7 +93,7 @@ def test_set_axis_setattr_index_wrong_length(self, obj): # wrong length msg = ( f"Length mismatch: Expected axis has {len(obj)} elements, " - f"new values have {len(obj)-1} elements" + f"new values have {len(obj) - 1} elements" ) with pytest.raises(ValueError, match=msg): obj.index = np.arange(len(obj) - 1) diff --git a/pandas/tests/frame/test_reductions.py b/pandas/tests/frame/test_reductions.py index 04b1456cdbea6..64e686d25faa7 100644 --- a/pandas/tests/frame/test_reductions.py +++ b/pandas/tests/frame/test_reductions.py @@ -1544,6 +1544,44 @@ def test_min_max_dt64_with_NaT(self): exp = Series([pd.NaT], index=["foo"]) tm.assert_series_equal(res, exp) + def test_min_max_dt64_with_NaT_precision(self): + # GH#60646 Make sure the reduction doesn't cast input timestamps to + # float and lose precision. + df = DataFrame( + {"foo": [pd.NaT, pd.NaT, Timestamp("2012-05-01 09:20:00.123456789")]}, + dtype="datetime64[ns]", + ) + + res = df.min(axis=1) + exp = df.foo.rename(None) + tm.assert_series_equal(res, exp) + + res = df.max(axis=1) + exp = df.foo.rename(None) + tm.assert_series_equal(res, exp) + + def test_min_max_td64_with_NaT_precision(self): + # GH#60646 Make sure the reduction doesn't cast input timedeltas to + # float and lose precision. 
+ df = DataFrame( + { + "foo": [ + pd.NaT, + pd.NaT, + to_timedelta("10000 days 06:05:01.123456789"), + ], + }, + dtype="timedelta64[ns]", + ) + + res = df.min(axis=1) + exp = df.foo.rename(None) + tm.assert_series_equal(res, exp) + + res = df.max(axis=1) + exp = df.foo.rename(None) + tm.assert_series_equal(res, exp) + def test_min_max_dt64_with_NaT_skipna_false(self, request, tz_naive_fixture): # GH#36907 tz = tz_naive_fixture diff --git a/pandas/tests/frame/test_stack_unstack.py b/pandas/tests/frame/test_stack_unstack.py index abc14d10514fa..22fdfd3a01408 100644 --- a/pandas/tests/frame/test_stack_unstack.py +++ b/pandas/tests/frame/test_stack_unstack.py @@ -1452,6 +1452,25 @@ def test_stack_empty_frame(dropna, future_stack): tm.assert_series_equal(result, expected) +@pytest.mark.filterwarnings("ignore:The previous implementation of stack is deprecated") +@pytest.mark.parametrize("dropna", [True, False, lib.no_default]) +def test_stack_empty_level(dropna, future_stack, int_frame): + # GH 60740 + if future_stack and dropna is not lib.no_default: + with pytest.raises(ValueError, match="dropna must be unspecified"): + DataFrame(dtype=np.int64).stack(dropna=dropna, future_stack=future_stack) + else: + expected = int_frame + result = int_frame.copy().stack( + level=[], dropna=dropna, future_stack=future_stack + ) + tm.assert_frame_equal(result, expected) + + expected = DataFrame() + result = DataFrame().stack(level=[], dropna=dropna, future_stack=future_stack) + tm.assert_frame_equal(result, expected) + + @pytest.mark.filterwarnings("ignore:The previous implementation of stack is deprecated") @pytest.mark.parametrize("dropna", [True, False, lib.no_default]) @pytest.mark.parametrize("fill_value", [None, 0]) diff --git a/pandas/tests/generic/test_finalize.py b/pandas/tests/generic/test_finalize.py index 433e559ef620e..a88090b00499d 100644 --- a/pandas/tests/generic/test_finalize.py +++ b/pandas/tests/generic/test_finalize.py @@ -644,7 +644,7 @@ def 
test_timedelta_methods(method): operator.methodcaller("add_categories", ["c"]), operator.methodcaller("as_ordered"), operator.methodcaller("as_unordered"), - lambda x: getattr(x, "codes"), + lambda x: x.codes, operator.methodcaller("remove_categories", "a"), operator.methodcaller("remove_unused_categories"), operator.methodcaller("rename_categories", {"a": "A", "b": "B"}), diff --git a/pandas/tests/groupby/aggregate/test_numba.py b/pandas/tests/groupby/aggregate/test_numba.py index ca265a1d1108b..afddc90fdd055 100644 --- a/pandas/tests/groupby/aggregate/test_numba.py +++ b/pandas/tests/groupby/aggregate/test_numba.py @@ -1,6 +1,7 @@ import numpy as np import pytest +from pandas.compat import is_platform_arm from pandas.errors import NumbaUtilError from pandas import ( @@ -11,8 +12,17 @@ option_context, ) import pandas._testing as tm +from pandas.util.version import Version -pytestmark = pytest.mark.single_cpu +pytestmark = [pytest.mark.single_cpu] + +numba = pytest.importorskip("numba") +pytestmark.append( + pytest.mark.skipif( + Version(numba.__version__) == Version("0.61") and is_platform_arm(), + reason=f"Segfaults on ARM platforms with numba {numba.__version__}", + ) +) def test_correct_function_signature(): @@ -186,7 +196,7 @@ def test_multifunc_numba_vs_cython_frame(agg_kwargs): tm.assert_frame_equal(result, expected) -@pytest.mark.parametrize("func", ["sum", "mean"]) +@pytest.mark.parametrize("func", ["sum", "mean", "var", "std", "min", "max"]) def test_multifunc_numba_vs_cython_frame_noskipna(func): pytest.importorskip("numba") data = DataFrame( diff --git a/pandas/tests/groupby/test_api.py b/pandas/tests/groupby/test_api.py index cc69de2581a79..215e627abb018 100644 --- a/pandas/tests/groupby/test_api.py +++ b/pandas/tests/groupby/test_api.py @@ -174,16 +174,13 @@ def test_frame_consistency(groupby_func): elif groupby_func in ("nunique",): exclude_expected = {"axis"} elif groupby_func in ("max", "min"): - exclude_expected = {"axis", "kwargs", "skipna"} + 
exclude_expected = {"axis", "kwargs"} exclude_result = {"min_count", "engine", "engine_kwargs"} - elif groupby_func in ("sum", "mean"): + elif groupby_func in ("sum", "mean", "std", "var"): exclude_expected = {"axis", "kwargs"} exclude_result = {"engine", "engine_kwargs"} - elif groupby_func in ("std", "var"): - exclude_expected = {"axis", "kwargs", "skipna"} - exclude_result = {"engine", "engine_kwargs"} elif groupby_func in ("median", "prod", "sem"): - exclude_expected = {"axis", "kwargs", "skipna"} + exclude_expected = {"axis", "kwargs"} elif groupby_func in ("bfill", "ffill"): exclude_expected = {"inplace", "axis", "limit_area"} elif groupby_func in ("cummax", "cummin"): @@ -235,16 +232,13 @@ def test_series_consistency(request, groupby_func): if groupby_func in ("any", "all"): exclude_expected = {"kwargs", "bool_only", "axis"} elif groupby_func in ("max", "min"): - exclude_expected = {"axis", "kwargs", "skipna"} + exclude_expected = {"axis", "kwargs"} exclude_result = {"min_count", "engine", "engine_kwargs"} - elif groupby_func in ("sum", "mean"): + elif groupby_func in ("sum", "mean", "std", "var"): exclude_expected = {"axis", "kwargs"} exclude_result = {"engine", "engine_kwargs"} - elif groupby_func in ("std", "var"): - exclude_expected = {"axis", "kwargs", "skipna"} - exclude_result = {"engine", "engine_kwargs"} elif groupby_func in ("median", "prod", "sem"): - exclude_expected = {"axis", "kwargs", "skipna"} + exclude_expected = {"axis", "kwargs"} elif groupby_func in ("bfill", "ffill"): exclude_expected = {"inplace", "axis", "limit_area"} elif groupby_func in ("cummax", "cummin"): diff --git a/pandas/tests/groupby/test_apply.py b/pandas/tests/groupby/test_apply.py index 294ab14c96de8..5bf16ee9ad0b8 100644 --- a/pandas/tests/groupby/test_apply.py +++ b/pandas/tests/groupby/test_apply.py @@ -1390,7 +1390,7 @@ def test_empty_df(method, op): # GH 47985 empty_df = DataFrame({"a": [], "b": []}) gb = empty_df.groupby("a", group_keys=True) - group = getattr(gb, 
"b") + group = gb.b result = getattr(group, method)(op) expected = Series( diff --git a/pandas/tests/groupby/test_categorical.py b/pandas/tests/groupby/test_categorical.py index 20309e852a556..e49be8c00b426 100644 --- a/pandas/tests/groupby/test_categorical.py +++ b/pandas/tests/groupby/test_categorical.py @@ -990,7 +990,7 @@ def test_sort(): # self.cat.groupby(['value_group'])['value_group'].count().plot(kind='bar') df = DataFrame({"value": np.random.default_rng(2).integers(0, 10000, 10)}) - labels = [f"{i} - {i+499}" for i in range(0, 10000, 500)] + labels = [f"{i} - {i + 499}" for i in range(0, 10000, 500)] cat_labels = Categorical(labels, labels) df = df.sort_values(by=["value"], ascending=True) diff --git a/pandas/tests/groupby/test_groupby.py b/pandas/tests/groupby/test_groupby.py index 5bae9b1fd9882..d0ce27b4a22f8 100644 --- a/pandas/tests/groupby/test_groupby.py +++ b/pandas/tests/groupby/test_groupby.py @@ -264,7 +264,7 @@ def test_attr_wrapper(ts): # make sure raises error msg = "'SeriesGroupBy' object has no attribute 'foo'" with pytest.raises(AttributeError, match=msg): - getattr(grouped, "foo") + grouped.foo def test_frame_groupby(tsframe): diff --git a/pandas/tests/groupby/test_numba.py b/pandas/tests/groupby/test_numba.py index 3e32031e51138..082319d8479f0 100644 --- a/pandas/tests/groupby/test_numba.py +++ b/pandas/tests/groupby/test_numba.py @@ -1,15 +1,24 @@ import pytest +from pandas.compat import is_platform_arm + from pandas import ( DataFrame, Series, option_context, ) import pandas._testing as tm +from pandas.util.version import Version -pytestmark = pytest.mark.single_cpu +pytestmark = [pytest.mark.single_cpu] -pytest.importorskip("numba") +numba = pytest.importorskip("numba") +pytestmark.append( + pytest.mark.skipif( + Version(numba.__version__) == Version("0.61") and is_platform_arm(), + reason=f"Segfaults on ARM platforms with numba {numba.__version__}", + ) +) @pytest.mark.filterwarnings("ignore") diff --git 
a/pandas/tests/groupby/test_raises.py b/pandas/tests/groupby/test_raises.py index ba13d3bd7278f..864b9e5d55991 100644 --- a/pandas/tests/groupby/test_raises.py +++ b/pandas/tests/groupby/test_raises.py @@ -263,10 +263,7 @@ def test_groupby_raises_string_np( if using_infer_string: if groupby_func_np is np.mean: klass = TypeError - msg = ( - f"Cannot perform reduction '{groupby_func_np.__name__}' " - "with string dtype" - ) + msg = f"Cannot perform reduction '{groupby_func_np.__name__}' with string dtype" _call_and_check(klass, msg, how, gb, groupby_func_np, ()) diff --git a/pandas/tests/groupby/test_reductions.py b/pandas/tests/groupby/test_reductions.py index 1db12f05e821f..ea876cfdf4933 100644 --- a/pandas/tests/groupby/test_reductions.py +++ b/pandas/tests/groupby/test_reductions.py @@ -514,6 +514,147 @@ def test_sum_skipna_object(skipna): tm.assert_series_equal(result, expected) +@pytest.mark.parametrize( + "func, values, dtype, result_dtype", + [ + ("prod", [0, 1, 3, np.nan, 4, 5, 6, 7, -8, 9], "float64", "float64"), + ("prod", [0, -1, 3, 4, 5, np.nan, 6, 7, 8, 9], "Float64", "Float64"), + ("prod", [0, 1, 3, -4, 5, 6, 7, -8, np.nan, 9], "Int64", "Int64"), + ("prod", [np.nan] * 10, "float64", "float64"), + ("prod", [np.nan] * 10, "Float64", "Float64"), + ("prod", [np.nan] * 10, "Int64", "Int64"), + ("var", [0, -1, 3, 4, np.nan, 5, 6, 7, 8, 9], "float64", "float64"), + ("var", [0, 1, 3, -4, 5, 6, 7, -8, 9, np.nan], "Float64", "Float64"), + ("var", [0, -1, 3, 4, 5, -6, 7, np.nan, 8, 9], "Int64", "Float64"), + ("var", [np.nan] * 10, "float64", "float64"), + ("var", [np.nan] * 10, "Float64", "Float64"), + ("var", [np.nan] * 10, "Int64", "Float64"), + ("std", [0, 1, 3, -4, 5, 6, 7, -8, np.nan, 9], "float64", "float64"), + ("std", [0, -1, 3, 4, 5, -6, 7, np.nan, 8, 9], "Float64", "Float64"), + ("std", [0, 1, 3, -4, 5, 6, 7, -8, 9, np.nan], "Int64", "Float64"), + ("std", [np.nan] * 10, "float64", "float64"), + ("std", [np.nan] * 10, "Float64", "Float64"), + ("std", 
[np.nan] * 10, "Int64", "Float64"), + ("sem", [0, -1, 3, 4, 5, -6, 7, np.nan, 8, 9], "float64", "float64"), + ("sem", [0, 1, 3, -4, 5, 6, 7, -8, np.nan, 9], "Float64", "Float64"), + ("sem", [0, -1, 3, 4, 5, -6, 7, 8, 9, np.nan], "Int64", "Float64"), + ("sem", [np.nan] * 10, "float64", "float64"), + ("sem", [np.nan] * 10, "Float64", "Float64"), + ("sem", [np.nan] * 10, "Int64", "Float64"), + ("min", [0, -1, 3, 4, 5, -6, 7, np.nan, 8, 9], "float64", "float64"), + ("min", [0, 1, 3, -4, 5, 6, 7, -8, np.nan, 9], "Float64", "Float64"), + ("min", [0, -1, 3, 4, 5, -6, 7, 8, 9, np.nan], "Int64", "Int64"), + ( + "min", + [0, 1, np.nan, 3, 4, 5, 6, 7, 8, 9], + "timedelta64[ns]", + "timedelta64[ns]", + ), + ( + "min", + pd.to_datetime( + [ + "2019-05-09", + pd.NaT, + "2019-05-11", + "2019-05-12", + "2019-05-13", + "2019-05-14", + "2019-05-15", + "2019-05-16", + "2019-05-17", + "2019-05-18", + ] + ), + "datetime64[ns]", + "datetime64[ns]", + ), + ("min", [np.nan] * 10, "float64", "float64"), + ("min", [np.nan] * 10, "Float64", "Float64"), + ("min", [np.nan] * 10, "Int64", "Int64"), + ("max", [0, -1, 3, 4, 5, -6, 7, np.nan, 8, 9], "float64", "float64"), + ("max", [0, 1, 3, -4, 5, 6, 7, -8, np.nan, 9], "Float64", "Float64"), + ("max", [0, -1, 3, 4, 5, -6, 7, 8, 9, np.nan], "Int64", "Int64"), + ( + "max", + [0, 1, np.nan, 3, 4, 5, 6, 7, 8, 9], + "timedelta64[ns]", + "timedelta64[ns]", + ), + ( + "max", + pd.to_datetime( + [ + "2019-05-09", + pd.NaT, + "2019-05-11", + "2019-05-12", + "2019-05-13", + "2019-05-14", + "2019-05-15", + "2019-05-16", + "2019-05-17", + "2019-05-18", + ] + ), + "datetime64[ns]", + "datetime64[ns]", + ), + ("max", [np.nan] * 10, "float64", "float64"), + ("max", [np.nan] * 10, "Float64", "Float64"), + ("max", [np.nan] * 10, "Int64", "Int64"), + ("median", [0, -1, 3, 4, 5, -6, 7, np.nan, 8, 9], "float64", "float64"), + ("median", [0, 1, 3, -4, 5, 6, 7, -8, np.nan, 9], "Float64", "Float64"), + ("median", [0, -1, 3, 4, 5, -6, 7, 8, 9, np.nan], "Int64", 
"Float64"), + ( + "median", + [0, 1, np.nan, 3, 4, 5, 6, 7, 8, 9], + "timedelta64[ns]", + "timedelta64[ns]", + ), + ( + "median", + pd.to_datetime( + [ + "2019-05-09", + pd.NaT, + "2019-05-11", + "2019-05-12", + "2019-05-13", + "2019-05-14", + "2019-05-15", + "2019-05-16", + "2019-05-17", + "2019-05-18", + ] + ), + "datetime64[ns]", + "datetime64[ns]", + ), + ("median", [np.nan] * 10, "float64", "float64"), + ("median", [np.nan] * 10, "Float64", "Float64"), + ("median", [np.nan] * 10, "Int64", "Float64"), + ], +) +def test_multifunc_skipna(func, values, dtype, result_dtype, skipna): + # GH#15675 + df = DataFrame( + { + "val": values, + "cat": ["A", "B"] * 5, + } + ).astype({"val": dtype}) + # We need to recast the expected values to the result_dtype as some operations + # change the dtype + expected = ( + df.groupby("cat")["val"] + .apply(lambda x: getattr(x, func)(skipna=skipna)) + .astype(result_dtype) + ) + result = getattr(df.groupby("cat")["val"], func)(skipna=skipna) + tm.assert_series_equal(result, expected) + + def test_cython_median(): arr = np.random.default_rng(2).standard_normal(1000) arr[::2] = np.nan diff --git a/pandas/tests/groupby/transform/test_numba.py b/pandas/tests/groupby/transform/test_numba.py index 969df8ef4c52b..e19b7592f75b3 100644 --- a/pandas/tests/groupby/transform/test_numba.py +++ b/pandas/tests/groupby/transform/test_numba.py @@ -1,6 +1,7 @@ import numpy as np import pytest +from pandas.compat import is_platform_arm from pandas.errors import NumbaUtilError from pandas import ( @@ -9,8 +10,17 @@ option_context, ) import pandas._testing as tm +from pandas.util.version import Version -pytestmark = pytest.mark.single_cpu +pytestmark = [pytest.mark.single_cpu] + +numba = pytest.importorskip("numba") +pytestmark.append( + pytest.mark.skipif( + Version(numba.__version__) == Version("0.61") and is_platform_arm(), + reason=f"Segfaults on ARM platforms with numba {numba.__version__}", + ) +) def test_correct_function_signature(): diff --git 
a/pandas/tests/indexes/categorical/test_indexing.py b/pandas/tests/indexes/categorical/test_indexing.py index 49eb79da616e7..25232075a07d9 100644 --- a/pandas/tests/indexes/categorical/test_indexing.py +++ b/pandas/tests/indexes/categorical/test_indexing.py @@ -64,8 +64,7 @@ def test_take_fill_value(self): tm.assert_categorical_equal(result.values, expected.values) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) @@ -103,8 +102,7 @@ def test_take_fill_value_datetime(self): tm.assert_index_equal(result, expected) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) diff --git a/pandas/tests/indexes/datetimes/methods/test_round.py b/pandas/tests/indexes/datetimes/methods/test_round.py index cde4a3a65804d..b023542ba0a4c 100644 --- a/pandas/tests/indexes/datetimes/methods/test_round.py +++ b/pandas/tests/indexes/datetimes/methods/test_round.py @@ -216,6 +216,6 @@ def test_round_int64(self, start, index_freq, periods, round_freq): assert (mod == 0).all(), f"round not a {round_freq} multiple" assert (diff <= unit // 2).all(), "round error" if unit % 2 == 0: - assert ( - result.asi8[diff == unit // 2] % 2 == 0 - ).all(), "round half to even error" + assert (result.asi8[diff == unit // 2] % 2 == 0).all(), ( + "round half to even error" + ) diff --git a/pandas/tests/indexes/datetimes/test_formats.py b/pandas/tests/indexes/datetimes/test_formats.py index 4551fdf073193..f4e0a63043335 100644 --- a/pandas/tests/indexes/datetimes/test_formats.py +++ b/pandas/tests/indexes/datetimes/test_formats.py @@ -205,12 +205,7 @@ def test_dti_representation_to_series(self, 
unit): exp3 = "0 2011-01-01\n1 2011-01-02\ndtype: datetime64[ns]" - exp4 = ( - "0 2011-01-01\n" - "1 2011-01-02\n" - "2 2011-01-03\n" - "dtype: datetime64[ns]" - ) + exp4 = "0 2011-01-01\n1 2011-01-02\n2 2011-01-03\ndtype: datetime64[ns]" exp5 = ( "0 2011-01-01 09:00:00+09:00\n" @@ -226,11 +221,7 @@ def test_dti_representation_to_series(self, unit): "dtype: datetime64[ns, US/Eastern]" ) - exp7 = ( - "0 2011-01-01 09:00:00\n" - "1 2011-01-02 10:15:00\n" - "dtype: datetime64[ns]" - ) + exp7 = "0 2011-01-01 09:00:00\n1 2011-01-02 10:15:00\ndtype: datetime64[ns]" with pd.option_context("display.width", 300): for idx, expected in zip( diff --git a/pandas/tests/indexes/datetimes/test_indexing.py b/pandas/tests/indexes/datetimes/test_indexing.py index bfbcdcff51ee6..c44345273466c 100644 --- a/pandas/tests/indexes/datetimes/test_indexing.py +++ b/pandas/tests/indexes/datetimes/test_indexing.py @@ -338,8 +338,7 @@ def test_take_fill_value(self): tm.assert_index_equal(result, expected) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) @@ -375,8 +374,7 @@ def test_take_fill_value_with_timezone(self): tm.assert_index_equal(result, expected) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) diff --git a/pandas/tests/indexes/interval/test_constructors.py b/pandas/tests/indexes/interval/test_constructors.py index 8db483751438c..90423149658ab 100644 --- a/pandas/tests/indexes/interval/test_constructors.py +++ b/pandas/tests/indexes/interval/test_constructors.py @@ -154,8 +154,7 @@ def test_constructor_empty(self, constructor, breaks, closed): def 
test_constructor_string(self, constructor, breaks): # GH 19016 msg = ( - "category, object, and string subtypes are not supported " - "for IntervalIndex" + "category, object, and string subtypes are not supported for IntervalIndex" ) with pytest.raises(TypeError, match=msg): constructor(**self.get_kwargs_from_breaks(breaks)) @@ -224,8 +223,7 @@ def test_constructor_errors(self): # GH 19016: categorical data data = Categorical(list("01234abcde"), ordered=True) msg = ( - "category, object, and string subtypes are not supported " - "for IntervalIndex" + "category, object, and string subtypes are not supported for IntervalIndex" ) with pytest.raises(TypeError, match=msg): IntervalIndex.from_arrays(data[:-1], data[1:]) @@ -297,8 +295,7 @@ def test_constructor_errors(self): # GH 19016: categorical data data = Categorical(list("01234abcde"), ordered=True) msg = ( - "category, object, and string subtypes are not supported " - "for IntervalIndex" + "category, object, and string subtypes are not supported for IntervalIndex" ) with pytest.raises(TypeError, match=msg): IntervalIndex.from_breaks(data) diff --git a/pandas/tests/indexes/interval/test_formats.py b/pandas/tests/indexes/interval/test_formats.py index 73bbfc91028b3..d45d894c485c9 100644 --- a/pandas/tests/indexes/interval/test_formats.py +++ b/pandas/tests/indexes/interval/test_formats.py @@ -21,12 +21,7 @@ class TestIntervalIndexRendering: [ ( Series, - ( - "(0.0, 1.0] a\n" - "NaN b\n" - "(2.0, 3.0] c\n" - "dtype: object" - ), + ("(0.0, 1.0] a\nNaN b\n(2.0, 3.0] c\ndtype: object"), ), (DataFrame, (" 0\n(0.0, 1.0] a\nNaN b\n(2.0, 3.0] c")), ], diff --git a/pandas/tests/indexes/multi/test_indexing.py b/pandas/tests/indexes/multi/test_indexing.py index d82203a53a60f..f098690be2afa 100644 --- a/pandas/tests/indexes/multi/test_indexing.py +++ b/pandas/tests/indexes/multi/test_indexing.py @@ -259,8 +259,7 @@ def test_get_indexer(self): def test_get_indexer_nearest(self): midx = MultiIndex.from_tuples([("a", 1), ("b", 2)]) 
msg = ( - "method='nearest' not implemented yet for MultiIndex; " - "see GitHub issue 9365" + "method='nearest' not implemented yet for MultiIndex; see GitHub issue 9365" ) with pytest.raises(NotImplementedError, match=msg): midx.get_indexer(["a"], method="nearest") diff --git a/pandas/tests/indexes/numeric/test_indexing.py b/pandas/tests/indexes/numeric/test_indexing.py index 43adc09774914..3c1b98d57b2a0 100644 --- a/pandas/tests/indexes/numeric/test_indexing.py +++ b/pandas/tests/indexes/numeric/test_indexing.py @@ -479,8 +479,7 @@ def test_take_fill_value_float64(self): tm.assert_index_equal(result, expected) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) diff --git a/pandas/tests/indexes/period/test_formats.py b/pandas/tests/indexes/period/test_formats.py index 9f36eb1e7a1d1..dc95e19523842 100644 --- a/pandas/tests/indexes/period/test_formats.py +++ b/pandas/tests/indexes/period/test_formats.py @@ -63,8 +63,7 @@ def test_representation(self, method): exp3 = "PeriodIndex(['2011-01-01', '2011-01-02'], dtype='period[D]')" exp4 = ( - "PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], " - "dtype='period[D]')" + "PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='period[D]')" ) exp5 = "PeriodIndex(['2011', '2012', '2013'], dtype='period[Y-DEC]')" diff --git a/pandas/tests/indexes/period/test_indexing.py b/pandas/tests/indexes/period/test_indexing.py index 2683e25eda618..00e8262ddfa4c 100644 --- a/pandas/tests/indexes/period/test_indexing.py +++ b/pandas/tests/indexes/period/test_indexing.py @@ -700,8 +700,7 @@ def test_take_fill_value(self): tm.assert_index_equal(result, expected) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices 
must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) diff --git a/pandas/tests/indexes/test_base.py b/pandas/tests/indexes/test_base.py index 608158d40cf23..5b75bd9afd6df 100644 --- a/pandas/tests/indexes/test_base.py +++ b/pandas/tests/indexes/test_base.py @@ -1112,8 +1112,7 @@ def test_take_fill_value(self): def test_take_fill_value_none_raises(self): index = Index(list("ABC"), name="xxx") msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): diff --git a/pandas/tests/indexes/test_index_new.py b/pandas/tests/indexes/test_index_new.py index 4a31ae88a757a..dd228e6b713b5 100644 --- a/pandas/tests/indexes/test_index_new.py +++ b/pandas/tests/indexes/test_index_new.py @@ -419,8 +419,7 @@ class TestIndexConstructionErrors: def test_constructor_overflow_int64(self): # see GH#15832 msg = ( - "The elements provided in the data cannot " - "all be casted to the dtype int64" + "The elements provided in the data cannot all be casted to the dtype int64" ) with pytest.raises(OverflowError, match=msg): Index([np.iinfo(np.uint64).max - 1], dtype="int64") diff --git a/pandas/tests/indexes/timedeltas/test_indexing.py b/pandas/tests/indexes/timedeltas/test_indexing.py index e411555c65bea..426083cb6b67c 100644 --- a/pandas/tests/indexes/timedeltas/test_indexing.py +++ b/pandas/tests/indexes/timedeltas/test_indexing.py @@ -262,8 +262,7 @@ def test_take_fill_value(self): tm.assert_index_equal(result, expected) msg = ( - "When allow_fill=True and fill_value is not None, " - "all indices must be >= -1" + "When allow_fill=True and fill_value is not None, all indices must be >= -1" ) with pytest.raises(ValueError, match=msg): idx.take(np.array([1, 0, -2]), fill_value=True) diff --git a/pandas/tests/indexing/test_iloc.py b/pandas/tests/indexing/test_iloc.py index 
dc95e1bb1b8a0..2f6998a85c80b 100644 --- a/pandas/tests/indexing/test_iloc.py +++ b/pandas/tests/indexing/test_iloc.py @@ -763,8 +763,7 @@ def test_iloc_mask(self): "(index of the boolean Series and of the " "indexed object do not match).", ("locs", ".iloc"): ( - "iLocation based boolean indexing on an " - "integer type is not available" + "iLocation based boolean indexing on an integer type is not available" ), } diff --git a/pandas/tests/io/excel/test_readers.py b/pandas/tests/io/excel/test_readers.py index 34824f0a67985..140cf39b26556 100644 --- a/pandas/tests/io/excel/test_readers.py +++ b/pandas/tests/io/excel/test_readers.py @@ -910,8 +910,7 @@ def test_corrupt_bytes_raises(self, engine): error = XLRDError msg = ( - "Unsupported format, or corrupt file: Expected BOF " - "record; found b'foo'" + "Unsupported format, or corrupt file: Expected BOF record; found b'foo'" ) elif engine == "calamine": from python_calamine import CalamineError diff --git a/pandas/tests/io/excel/test_style.py b/pandas/tests/io/excel/test_style.py index 71ef1201e523f..0e13b2f94ed58 100644 --- a/pandas/tests/io/excel/test_style.py +++ b/pandas/tests/io/excel/test_style.py @@ -356,6 +356,6 @@ def test_format_hierarchical_rows_periodindex(merge_cells): for cell in formatted_cells: if cell.row != 0 and cell.col == 0: - assert isinstance( - cell.val, Timestamp - ), "Period should be converted to Timestamp" + assert isinstance(cell.val, Timestamp), ( + "Period should be converted to Timestamp" + ) diff --git a/pandas/tests/io/formats/style/test_style.py b/pandas/tests/io/formats/style/test_style.py index ff8a1b9f570ab..b7dcfde327b83 100644 --- a/pandas/tests/io/formats/style/test_style.py +++ b/pandas/tests/io/formats/style/test_style.py @@ -933,7 +933,7 @@ def test_trim(self, df): def test_export(self, df, styler): f = lambda x: "color: red" if x > 0 else "color: blue" - g = lambda x, z: f"color: {z}" if x > 0 else f"color: {z}" + g = lambda x, z: f"color: {z}" style1 = styler 
style1.map(f).map(g, z="b").highlight_max()._compute() # = render result = style1.export() diff --git a/pandas/tests/io/formats/test_css.py b/pandas/tests/io/formats/test_css.py index c4ecb48006cb1..642a562704344 100644 --- a/pandas/tests/io/formats/test_css.py +++ b/pandas/tests/io/formats/test_css.py @@ -193,8 +193,7 @@ def test_css_border_shorthands(prop, expected): ( "margin: 1px; margin-top: 2px", "", - "margin-left: 1px; margin-right: 1px; " - "margin-bottom: 1px; margin-top: 2px", + "margin-left: 1px; margin-right: 1px; margin-bottom: 1px; margin-top: 2px", ), ("margin-top: 2px", "margin: 1px", "margin: 1px; margin-top: 2px"), ("margin: 1px", "margin-top: 2px", "margin: 1px"), diff --git a/pandas/tests/io/formats/test_printing.py b/pandas/tests/io/formats/test_printing.py index 3b63011bf862e..f86b4af2647f8 100644 --- a/pandas/tests/io/formats/test_printing.py +++ b/pandas/tests/io/formats/test_printing.py @@ -82,6 +82,9 @@ def test_repr_dict(self): def test_repr_mapping(self): assert printing.pprint_thing(MyMapping()) == "{'a': 4, 'b': 4}" + def test_repr_frozenset(self): + assert printing.pprint_thing(frozenset([1, 2])) == "frozenset(1, 2)" + class TestFormatBase: def test_adjoin(self): diff --git a/pandas/tests/io/formats/test_to_csv.py b/pandas/tests/io/formats/test_to_csv.py index 7bf041a50b745..6d762fdeb8d79 100644 --- a/pandas/tests/io/formats/test_to_csv.py +++ b/pandas/tests/io/formats/test_to_csv.py @@ -482,10 +482,7 @@ def test_to_csv_string_with_crlf(self): # case 3: CRLF as line terminator # 'lineterminator' should not change inner element expected_crlf = ( - b"int,str_crlf\r\n" - b"1,abc\r\n" - b'2,"d\r\nef"\r\n' - b'3,"g\r\nh\r\n\r\ni"\r\n' + b'int,str_crlf\r\n1,abc\r\n2,"d\r\nef"\r\n3,"g\r\nh\r\n\r\ni"\r\n' ) df.to_csv(path, lineterminator="\r\n", index=False) with open(path, "rb") as f: diff --git a/pandas/tests/io/formats/test_to_html.py b/pandas/tests/io/formats/test_to_html.py index b1a437bfdbd8a..9c75314b66fa2 100644 --- 
a/pandas/tests/io/formats/test_to_html.py +++ b/pandas/tests/io/formats/test_to_html.py @@ -94,8 +94,7 @@ def test_to_html_with_column_specific_col_space_raises(): ) msg = ( - "Col_space length\\(\\d+\\) should match " - "DataFrame number of columns\\(\\d+\\)" + "Col_space length\\(\\d+\\) should match DataFrame number of columns\\(\\d+\\)" ) with pytest.raises(ValueError, match=msg): df.to_html(col_space=[30, 40]) diff --git a/pandas/tests/io/formats/test_to_markdown.py b/pandas/tests/io/formats/test_to_markdown.py index 7aa7cebb5120f..f3d9b88cc91e2 100644 --- a/pandas/tests/io/formats/test_to_markdown.py +++ b/pandas/tests/io/formats/test_to_markdown.py @@ -35,8 +35,7 @@ def test_empty_frame(): df.to_markdown(buf=buf) result = buf.getvalue() assert result == ( - "| id | first_name | last_name |\n" - "|------|--------------|-------------|" + "| id | first_name | last_name |\n|------|--------------|-------------|" ) @@ -65,8 +64,7 @@ def test_series(): s.to_markdown(buf=buf) result = buf.getvalue() assert result == ( - "| | foo |\n|---:|------:|\n| 0 | 1 " - "|\n| 1 | 2 |\n| 2 | 3 |" + "| | foo |\n|---:|------:|\n| 0 | 1 |\n| 1 | 2 |\n| 2 | 3 |" ) diff --git a/pandas/tests/io/formats/test_to_string.py b/pandas/tests/io/formats/test_to_string.py index af3cdf2d44af3..63c975fd831e7 100644 --- a/pandas/tests/io/formats/test_to_string.py +++ b/pandas/tests/io/formats/test_to_string.py @@ -132,20 +132,17 @@ def test_to_string_with_formatters_unicode(self): ) assert result == expected - def test_to_string_index_formatter(self): - df = DataFrame([range(5), range(5, 10), range(10, 15)]) - - rs = df.to_string(formatters={"__index__": lambda x: "abc"[x]}) - - xp = dedent( - """\ - 0 1 2 3 4 - a 0 1 2 3 4 - b 5 6 7 8 9 - c 10 11 12 13 14\ - """ - ) - assert rs == xp + def test_to_string_index_formatter(self): + df = DataFrame([range(5), range(5, 10), range(10, 15)]) + rs = df.to_string(formatters={"__index__": lambda x: "abc"[x]}) + xp = dedent( + """\ + 0 1 2 3 4 + a 0 1 2 3 
4 + b 5 6 7 8 9 + c 10 11 12 13 14""" + ) + assert rs == xp def test_no_extra_space(self): # GH#52690: Check that no extra space is given @@ -380,17 +377,11 @@ def test_to_string_small_float_values(self): # sadness per above if _three_digit_exp(): expected = ( - " a\n" - "0 1.500000e+000\n" - "1 1.000000e-017\n" - "2 -5.500000e-007" + " a\n0 1.500000e+000\n1 1.000000e-017\n2 -5.500000e-007" ) else: expected = ( - " a\n" - "0 1.500000e+00\n" - "1 1.000000e-17\n" - "2 -5.500000e-07" + " a\n0 1.500000e+00\n1 1.000000e-17\n2 -5.500000e-07" ) assert result == expected @@ -1213,13 +1204,7 @@ def test_to_string_float_na_spacing(self): ser[::2] = np.nan result = ser.to_string() - expected = ( - "0 NaN\n" - "1 1.5678\n" - "2 NaN\n" - "3 -3.0000\n" - "4 NaN" - ) + expected = "0 NaN\n1 1.5678\n2 NaN\n3 -3.0000\n4 NaN" assert result == expected def test_to_string_with_datetimeindex(self): diff --git a/pandas/tests/io/json/test_pandas.py b/pandas/tests/io/json/test_pandas.py index 5dc1272880c9b..144b36166261b 100644 --- a/pandas/tests/io/json/test_pandas.py +++ b/pandas/tests/io/json/test_pandas.py @@ -1267,9 +1267,7 @@ def test_default_handler_numpy_unsupported_dtype(self): columns=["a", "b"], ) expected = ( - '[["(1+0j)","(nan+0j)"],' - '["(2.3+0j)","(nan+0j)"],' - '["(4-5j)","(1.2+0j)"]]' + '[["(1+0j)","(nan+0j)"],["(2.3+0j)","(nan+0j)"],["(4-5j)","(1.2+0j)"]]' ) assert df.to_json(default_handler=str, orient="values") == expected @@ -1372,11 +1370,7 @@ def test_tz_is_naive(self): ) def test_tz_range_is_utc(self, tz_range): exp = '["2013-01-01T05:00:00.000Z","2013-01-02T05:00:00.000Z"]' - dfexp = ( - '{"DT":{' - '"0":"2013-01-01T05:00:00.000Z",' - '"1":"2013-01-02T05:00:00.000Z"}}' - ) + dfexp = '{"DT":{"0":"2013-01-01T05:00:00.000Z","1":"2013-01-02T05:00:00.000Z"}}' assert ujson_dumps(tz_range, iso_dates=True) == exp dti = DatetimeIndex(tz_range) @@ -1775,7 +1769,7 @@ def test_read_json_with_url_value(self, url): ) def test_read_json_with_very_long_file_path(self, 
compression): # GH 46718 - long_json_path = f'{"a" * 1000}.json{compression}' + long_json_path = f"{'a' * 1000}.json{compression}" with pytest.raises( FileNotFoundError, match=f"File {long_json_path} does not exist" ): diff --git a/pandas/tests/io/json/test_readlines.py b/pandas/tests/io/json/test_readlines.py index 3c843479b446a..d482eb5fa1a06 100644 --- a/pandas/tests/io/json/test_readlines.py +++ b/pandas/tests/io/json/test_readlines.py @@ -236,9 +236,9 @@ def test_readjson_chunks_closes(chunksize): ) with reader: reader.read() - assert ( - reader.handles.handle.closed - ), f"didn't close stream with chunksize = {chunksize}" + assert reader.handles.handle.closed, ( + f"didn't close stream with chunksize = {chunksize}" + ) @pytest.mark.parametrize("chunksize", [0, -1, 2.2, "foo"]) @@ -435,8 +435,7 @@ def test_to_json_append_mode(mode_): # Test ValueError when mode is not supported option df = DataFrame({"col1": [1, 2], "col2": ["a", "b"]}) msg = ( - f"mode={mode_} is not a valid option." - "Only 'w' and 'a' are currently supported." + f"mode={mode_} is not a valid option.Only 'w' and 'a' are currently supported." 
) with pytest.raises(ValueError, match=msg): df.to_json(mode=mode_, lines=False, orient="records") diff --git a/pandas/tests/io/json/test_ujson.py b/pandas/tests/io/json/test_ujson.py index c5ccc3b3f7184..d2bf9bdb139bd 100644 --- a/pandas/tests/io/json/test_ujson.py +++ b/pandas/tests/io/json/test_ujson.py @@ -53,60 +53,24 @@ def orient(request): class TestUltraJSONTests: @pytest.mark.skipif(not IS64, reason="not compliant on 32-bit, xref #15865") - def test_encode_decimal(self): - sut = decimal.Decimal("1337.1337") - encoded = ujson.ujson_dumps(sut, double_precision=15) - decoded = ujson.ujson_loads(encoded) - assert decoded == "1337.1337" - - sut = decimal.Decimal("0.95") - encoded = ujson.ujson_dumps(sut, double_precision=1) - assert encoded == '"0.95"' - - decoded = ujson.ujson_loads(encoded) - assert decoded == "0.95" - - sut = decimal.Decimal("0.94") - encoded = ujson.ujson_dumps(sut, double_precision=1) - assert encoded == '"0.94"' - - decoded = ujson.ujson_loads(encoded) - assert decoded == "0.94" - - sut = decimal.Decimal("1.95") - encoded = ujson.ujson_dumps(sut, double_precision=1) - assert encoded == '"1.95"' - - decoded = ujson.ujson_loads(encoded) - assert decoded == "1.95" - - sut = decimal.Decimal("-1.95") - encoded = ujson.ujson_dumps(sut, double_precision=1) - assert encoded == '"-1.95"' - - decoded = ujson.ujson_loads(encoded) - assert decoded == "-1.95" - - sut = decimal.Decimal("0.995") - encoded = ujson.ujson_dumps(sut, double_precision=2) - assert encoded == '"0.995"' - - decoded = ujson.ujson_loads(encoded) - assert decoded == "0.995" - - sut = decimal.Decimal("0.9995") - encoded = ujson.ujson_dumps(sut, double_precision=3) - assert encoded == '"0.9995"' - - decoded = ujson.ujson_loads(encoded) - assert decoded == "0.9995" - - sut = decimal.Decimal("0.99999999999999944") - encoded = ujson.ujson_dumps(sut, double_precision=15) - assert encoded == '"0.99999999999999944"' - + @pytest.mark.parametrize( + "value, double_precision", + [ + 
("1337.1337", 15), + ("0.95", 1), + ("0.94", 1), + ("1.95", 1), + ("-1.95", 1), + ("0.995", 2), + ("0.9995", 3), + ("0.99999999999999944", 15), + ], + ) + def test_encode_decimal(self, value, double_precision): + sut = decimal.Decimal(value) + encoded = ujson.ujson_dumps(sut, double_precision=double_precision) decoded = ujson.ujson_loads(encoded) - assert decoded == "0.99999999999999944" + assert decoded == value @pytest.mark.parametrize("ensure_ascii", [True, False]) def test_encode_string_conversion(self, ensure_ascii): @@ -991,7 +955,7 @@ def test_decode_array(self, arr): def test_decode_extreme_numbers(self, extreme_num): assert extreme_num == ujson.ujson_loads(str(extreme_num)) - @pytest.mark.parametrize("too_extreme_num", [f"{2**64}", f"{-2**63-1}"]) + @pytest.mark.parametrize("too_extreme_num", [f"{2**64}", f"{-(2**63) - 1}"]) def test_decode_too_extreme_numbers(self, too_extreme_num): with pytest.raises( ValueError, @@ -1006,7 +970,7 @@ def test_decode_with_trailing_non_whitespaces(self): with pytest.raises(ValueError, match="Trailing data"): ujson.ujson_loads("{}\n\t a") - @pytest.mark.parametrize("value", [f"{2**64}", f"{-2**63-1}"]) + @pytest.mark.parametrize("value", [f"{2**64}", f"{-(2**63) - 1}"]) def test_decode_array_with_big_int(self, value): with pytest.raises( ValueError, diff --git a/pandas/tests/io/parser/common/test_read_errors.py b/pandas/tests/io/parser/common/test_read_errors.py index ed2e729430b01..a73327beea8bb 100644 --- a/pandas/tests/io/parser/common/test_read_errors.py +++ b/pandas/tests/io/parser/common/test_read_errors.py @@ -131,8 +131,7 @@ def test_catch_too_many_names(all_parsers): msg = ( "Too many columns specified: expected 4 and found 3" if parser.engine == "c" - else "Number of passed names did not match " - "number of header fields in the file" + else "Number of passed names did not match number of header fields in the file" ) with pytest.raises(ValueError, match=msg): diff --git 
a/pandas/tests/io/parser/test_mangle_dupes.py b/pandas/tests/io/parser/test_mangle_dupes.py index d3789cd387c05..55c8bbc4bb9e1 100644 --- a/pandas/tests/io/parser/test_mangle_dupes.py +++ b/pandas/tests/io/parser/test_mangle_dupes.py @@ -136,7 +136,7 @@ def test_mangled_unnamed_placeholders(all_parsers): expected = DataFrame(columns=Index([], dtype="str")) for j in range(i + 1): - col_name = "Unnamed: 0" + f".{1*j}" * min(j, 1) + col_name = "Unnamed: 0" + f".{1 * j}" * min(j, 1) expected.insert(loc=0, column=col_name, value=[0, 1, 2]) expected[orig_key] = orig_value diff --git a/pandas/tests/io/parser/test_parse_dates.py b/pandas/tests/io/parser/test_parse_dates.py index 1411ed5019766..9a15d9bc84a2e 100644 --- a/pandas/tests/io/parser/test_parse_dates.py +++ b/pandas/tests/io/parser/test_parse_dates.py @@ -228,7 +228,7 @@ def test_parse_tz_aware(all_parsers): def test_read_with_parse_dates_scalar_non_bool(all_parsers, kwargs): # see gh-5636 parser = all_parsers - msg = "Only booleans and lists " "are accepted for the 'parse_dates' parameter" + msg = "Only booleans and lists are accepted for the 'parse_dates' parameter" data = """A,B,C 1,2,2003-11-1""" @@ -239,7 +239,7 @@ def test_read_with_parse_dates_scalar_non_bool(all_parsers, kwargs): @pytest.mark.parametrize("parse_dates", [(1,), np.array([4, 5]), {1, 3}]) def test_read_with_parse_dates_invalid_type(all_parsers, parse_dates): parser = all_parsers - msg = "Only booleans and lists " "are accepted for the 'parse_dates' parameter" + msg = "Only booleans and lists are accepted for the 'parse_dates' parameter" data = """A,B,C 1,2,2003-11-1""" diff --git a/pandas/tests/io/pytables/test_append.py b/pandas/tests/io/pytables/test_append.py index 47658c0eb9012..04241a78bff5f 100644 --- a/pandas/tests/io/pytables/test_append.py +++ b/pandas/tests/io/pytables/test_append.py @@ -823,12 +823,9 @@ def test_append_raise(setup_path): store.append("df", df) df["foo"] = "bar" msg = re.escape( - "invalid combination of 
[values_axes] on appending data " - "[name->values_block_1,cname->values_block_1," - "dtype->bytes24,kind->string,shape->(1, 30)] " - "vs current table " - "[name->values_block_1,cname->values_block_1," - "dtype->datetime64[s],kind->datetime64[s],shape->None]" + "Cannot serialize the column [foo] " + "because its data contents are not [string] " + "but [datetime64[s]] object dtype" ) with pytest.raises(ValueError, match=msg): store.append("df", df) @@ -997,3 +994,29 @@ def test_append_to_multiple_min_itemsize(setup_path): ) result = store.select_as_multiple(["index", "nums", "strs"]) tm.assert_frame_equal(result, expected, check_index_type=True) + + +def test_append_string_nan_rep(setup_path): + # GH 16300 + df = DataFrame({"A": "a", "B": "foo"}, index=np.arange(10)) + df_nan = df.copy() + df_nan.loc[0:4, :] = np.nan + msg = "NaN representation is too large for existing column size" + + with ensure_clean_store(setup_path) as store: + # string column too small + store.append("sa", df["A"]) + with pytest.raises(ValueError, match=msg): + store.append("sa", df_nan["A"]) + + # nan_rep too big + store.append("sb", df["B"], nan_rep="bars") + with pytest.raises(ValueError, match=msg): + store.append("sb", df_nan["B"]) + + # smaller modified nan_rep + store.append("sc", df["A"], nan_rep="n") + store.append("sc", df_nan["A"]) + result = store["sc"] + expected = concat([df["A"], df_nan["A"]]) + tm.assert_series_equal(result, expected) diff --git a/pandas/tests/io/pytables/test_round_trip.py b/pandas/tests/io/pytables/test_round_trip.py index 6b98a720e4299..875a792467828 100644 --- a/pandas/tests/io/pytables/test_round_trip.py +++ b/pandas/tests/io/pytables/test_round_trip.py @@ -213,12 +213,9 @@ def test_table_values_dtypes_roundtrip(setup_path): # incompatible dtype msg = re.escape( - "invalid combination of [values_axes] on appending data " - "[name->values_block_0,cname->values_block_0," - "dtype->float64,kind->float,shape->(1, 3)] vs " - "current table 
[name->values_block_0," - "cname->values_block_0,dtype->int64,kind->integer," - "shape->None]" + "Cannot serialize the column [a] " + "because its data contents are not [float] " + "but [integer] object dtype" ) with pytest.raises(ValueError, match=msg): store.append("df_i8", df1) diff --git a/pandas/tests/io/pytables/test_store.py b/pandas/tests/io/pytables/test_store.py index a6fe9529c594a..2bfe9e33a6235 100644 --- a/pandas/tests/io/pytables/test_store.py +++ b/pandas/tests/io/pytables/test_store.py @@ -311,7 +311,7 @@ def test_getattr(setup_path): # test attribute access result = store.a tm.assert_series_equal(result, s) - result = getattr(store, "a") + result = store.a tm.assert_series_equal(result, s) df = DataFrame( diff --git a/pandas/tests/io/test_pickle.py b/pandas/tests/io/test_pickle.py index 5fe0f1265edff..bab2c1561eb99 100644 --- a/pandas/tests/io/test_pickle.py +++ b/pandas/tests/io/test_pickle.py @@ -383,55 +383,6 @@ def test_pickle_buffer_roundtrip(): tm.assert_frame_equal(df, result) -# --------------------- -# tests for URL I/O -# --------------------- - - -@pytest.mark.parametrize( - "mockurl", ["http://url.com", "ftp://test.com", "http://gzip.com"] -) -def test_pickle_generalurl_read(monkeypatch, mockurl): - def python_pickler(obj, path): - with open(path, "wb") as fh: - pickle.dump(obj, fh, protocol=-1) - - class MockReadResponse: - def __init__(self, path) -> None: - self.file = open(path, "rb") - if "gzip" in path: - self.headers = {"Content-Encoding": "gzip"} - else: - self.headers = {"Content-Encoding": ""} - - def __enter__(self): - return self - - def __exit__(self, *args): - self.close() - - def read(self): - return self.file.read() - - def close(self): - return self.file.close() - - with tm.ensure_clean() as path: - - def mock_urlopen_read(*args, **kwargs): - return MockReadResponse(path) - - df = DataFrame( - 1.1 * np.arange(120).reshape((30, 4)), - columns=Index(list("ABCD"), dtype=object), - index=Index([f"i-{i}" for i in range(30)], 
dtype=object), - ) - python_pickler(df, path) - monkeypatch.setattr("urllib.request.urlopen", mock_urlopen_read) - result = pd.read_pickle(mockurl) - tm.assert_frame_equal(df, result) - - def test_pickle_fsspec_roundtrip(): pytest.importorskip("fsspec") with tm.ensure_clean(): diff --git a/pandas/tests/io/xml/test_xml.py b/pandas/tests/io/xml/test_xml.py index 5c07a56c9fb3f..d897d251909fe 100644 --- a/pandas/tests/io/xml/test_xml.py +++ b/pandas/tests/io/xml/test_xml.py @@ -1503,8 +1503,7 @@ def test_bad_xml(parser): with pytest.raises( SyntaxError, match=( - "Extra content at the end of the document|" - "junk after document element" + "Extra content at the end of the document|junk after document element" ), ): read_xml( diff --git a/pandas/tests/plotting/test_series.py b/pandas/tests/plotting/test_series.py index 9675b936c171e..c3b0219971446 100644 --- a/pandas/tests/plotting/test_series.py +++ b/pandas/tests/plotting/test_series.py @@ -427,7 +427,7 @@ def test_pie_series_autopct_and_fontsize(self): ax = _check_plot_works( series.plot.pie, colors=color_args, autopct="%.2f", fontsize=7 ) - pcts = [f"{s*100:.2f}" for s in series.values / series.sum()] + pcts = [f"{s * 100:.2f}" for s in series.values / series.sum()] expected_texts = list(chain.from_iterable(zip(series.index, pcts))) _check_text_labels(ax.texts, expected_texts) for t in ax.texts: diff --git a/pandas/tests/resample/test_time_grouper.py b/pandas/tests/resample/test_time_grouper.py index 30e2c9dfe3d30..3cc95922e7f2f 100644 --- a/pandas/tests/resample/test_time_grouper.py +++ b/pandas/tests/resample/test_time_grouper.py @@ -353,7 +353,7 @@ def test_groupby_resample_interpolate_raises(groupy_test_df): for df in dfs: with pytest.raises( NotImplementedError, - match="Direct interpolation of MultiIndex data frames is " "not supported", + match="Direct interpolation of MultiIndex data frames is not supported", ): df.groupby("volume").resample("1D").interpolate(method="linear") diff --git 
a/pandas/tests/reshape/merge/test_merge.py b/pandas/tests/reshape/merge/test_merge.py index f0abc1afc6ab0..f0f67aebd85ec 100644 --- a/pandas/tests/reshape/merge/test_merge.py +++ b/pandas/tests/reshape/merge/test_merge.py @@ -1464,7 +1464,10 @@ def test_merge_how_validation(self): data2 = DataFrame( np.arange(20).reshape((5, 4)) + 1, columns=["a", "b", "x", "y"] ) - msg = "'full' is not a valid Merge type: left, right, inner, outer, cross, asof" + msg = ( + "'full' is not a valid Merge type: left, right, inner, outer, " + "left_anti, right_anti, cross, asof" + ) with pytest.raises(ValueError, match=re.escape(msg)): data1.merge(data2, how="full") diff --git a/pandas/tests/reshape/merge/test_merge_antijoin.py b/pandas/tests/reshape/merge/test_merge_antijoin.py new file mode 100644 index 0000000000000..006622c6e5e94 --- /dev/null +++ b/pandas/tests/reshape/merge/test_merge_antijoin.py @@ -0,0 +1,280 @@ +import numpy as np +import pytest + +import pandas.util._test_decorators as td + +import pandas as pd +from pandas import ( + DataFrame, + MultiIndex, +) +import pandas._testing as tm +from pandas.core.reshape.merge import merge + + +def test_merge_antijoin(): + # GH#42916 + left = DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"]) + right = DataFrame({"B": [1, 2, 4]}, index=["a", "b", "d"]) + + result = merge(left, right, how="left_anti", left_index=True, right_index=True) + expected = DataFrame({"A": [3], "B": [np.nan]}, index=["c"]) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_index=True, right_index=True) + expected = DataFrame({"A": [np.nan], "B": [4]}, index=["d"]) + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_on_different_columns(): + left = DataFrame({"A": [1.0, 2.0, 3.0], "B": ["a", "b", "c"]}).astype({"B": object}) + right = DataFrame({"C": [1.0, 2.0, 4.0], "D": ["a", "d", "b"]}).astype( + {"D": object} + ) + + result = merge(left, right, how="left_anti", left_on="B", right_on="D") + 
expected = DataFrame( + { + "A": [3.0], + "B": ["c"], + "C": [np.nan], + "D": [np.nan], + }, + index=[2], + ).astype({"B": object, "D": object}) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_on="B", right_on="D") + expected = DataFrame( + { + "A": [np.nan], + "B": [np.nan], + "C": [2.0], + "D": ["d"], + }, + index=[1], + ).astype({"B": object, "D": object}) + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_nonunique_keys(): + left = DataFrame({"A": [1.0, 2.0, 3.0], "B": ["a", "b", "b"]}).astype({"B": object}) + right = DataFrame({"C": [1.0, 2.0, 4.0], "D": ["b", "d", "d"]}).astype( + {"D": object} + ) + + result = merge(left, right, how="left_anti", left_on="B", right_on="D") + expected = DataFrame( + { + "A": [1.0], + "B": ["a"], + "C": [np.nan], + "D": [np.nan], + }, + index=[0], + ).astype({"B": object, "D": object}) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_on="B", right_on="D") + expected = DataFrame( + { + "A": [np.nan, np.nan], + "B": [np.nan, np.nan], + "C": [2.0, 4.0], + "D": ["d", "d"], + }, + index=[2, 3], + ).astype({"B": object, "D": object}) + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_same_df(): + left = DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"], dtype=np.int64) + result = merge(left, left, how="left_anti", left_index=True, right_index=True) + expected = DataFrame([], columns=["A_x", "A_y"], dtype=np.int64) + tm.assert_frame_equal(result, expected, check_index_type=False) + + +def test_merge_antijoin_nans(): + left = DataFrame({"A": [1.0, 2.0, np.nan], "C": ["a", "b", "c"]}).astype( + {"C": object} + ) + right = DataFrame({"A": [3.0, 2.0, np.nan], "D": ["d", "e", "f"]}).astype( + {"D": object} + ) + result = merge(left, right, how="left_anti", on="A") + expected = DataFrame({"A": [1.0], "C": ["a"], "D": [np.nan]}).astype( + {"C": object, "D": object} + ) + tm.assert_frame_equal(result, 
expected) + + +def test_merge_antijoin_on_datetime64tz(): + # GH11405 + left = DataFrame( + { + "key": pd.date_range("20151010", periods=2, tz="US/Eastern"), + "value": [1.0, 2.0], + } + ) + right = DataFrame( + { + "key": pd.date_range("20151011", periods=3, tz="US/Eastern"), + "value": [1.0, 2.0, 3.0], + } + ) + + expected = DataFrame( + { + "key": pd.date_range("20151010", periods=1, tz="US/Eastern"), + "value_x": [1.0], + "value_y": [np.nan], + }, + index=[0], + ) + result = merge(left, right, on="key", how="left_anti") + tm.assert_frame_equal(result, expected) + + expected = DataFrame( + { + "key": pd.date_range("20151012", periods=2, tz="US/Eastern"), + "value_x": [np.nan, np.nan], + "value_y": [2.0, 3.0], + }, + index=[1, 2], + ) + result = merge(left, right, on="key", how="right_anti") + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_multiindex(): + left = DataFrame( + { + "A": [1, 2, 3], + "B": [4, 5, 6], + }, + index=MultiIndex.from_tuples( + [("a", "x"), ("b", "y"), ("c", "z")], names=["first", "second"] + ), + ) + right = DataFrame( + { + "C": [7, 8, 9], + "D": [10, 11, 12], + }, + index=MultiIndex.from_tuples( + [("a", "x"), ("b", "y"), ("c", "w")], names=["first", "second"] + ), + ) + + result = merge(left, right, how="left_anti", left_index=True, right_index=True) + expected = DataFrame( + { + "A": [3], + "B": [6], + "C": [np.nan], + "D": [np.nan], + }, + index=MultiIndex.from_tuples([("c", "z")], names=["first", "second"]), + ) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_index=True, right_index=True) + expected = DataFrame( + { + "A": [np.nan], + "B": [np.nan], + "C": [9], + "D": [12], + }, + index=MultiIndex.from_tuples([("c", "w")], names=["first", "second"]), + ) + tm.assert_frame_equal(result, expected) + + +@pytest.mark.parametrize( + "dtype", + [ + "Int64", + pytest.param("int64[pyarrow]", marks=td.skip_if_no("pyarrow")), + pytest.param("timestamp[s][pyarrow]", 
marks=td.skip_if_no("pyarrow")), + pytest.param("string[pyarrow]", marks=td.skip_if_no("pyarrow")), + ], +) +def test_merge_antijoin_extension_dtype(dtype): + left = DataFrame( + { + "join_col": [1, 3, 5], + "left_val": [1, 2, 3], + } + ) + right = DataFrame( + { + "join_col": [2, 3, 4], + "right_val": [1, 2, 3], + } + ) + left = left.astype({"join_col": dtype}) + right = right.astype({"join_col": dtype}) + result = merge(left, right, how="left_anti", on="join_col") + expected = DataFrame( + { + "join_col": [1, 5], + "left_val": [1, 3], + "right_val": [np.nan, np.nan], + }, + index=[0, 2], + ) + expected = expected.astype({"join_col": dtype}) + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_empty_dataframe(): + left = DataFrame({"A": [], "B": []}) + right = DataFrame({"C": [], "D": []}) + + result = merge(left, right, how="left_anti", left_on="A", right_on="C") + expected = DataFrame({"A": [], "B": [], "C": [], "D": []}) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_on="A", right_on="C") + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_no_common_elements(): + left = DataFrame({"A": [1, 2, 3]}) + right = DataFrame({"B": [4, 5, 6]}) + + result = merge(left, right, how="left_anti", left_on="A", right_on="B") + expected = DataFrame({"A": [1, 2, 3], "B": [np.nan, np.nan, np.nan]}) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_on="A", right_on="B") + expected = DataFrame({"A": [np.nan, np.nan, np.nan], "B": [4, 5, 6]}) + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_with_null_values(): + left = DataFrame({"A": [1.0, 2.0, None, 4.0]}) + right = DataFrame({"B": [2.0, None, 5.0]}) + + result = merge(left, right, how="left_anti", left_on="A", right_on="B") + expected = DataFrame({"A": [1.0, 4.0], "B": [np.nan, np.nan]}, index=[0, 3]) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, 
how="right_anti", left_on="A", right_on="B") + expected = DataFrame({"A": [np.nan], "B": [5.0]}, index=[2]) + tm.assert_frame_equal(result, expected) + + +def test_merge_antijoin_with_mixed_dtypes(): + left = DataFrame({"A": [1, "2", 3.0]}) + right = DataFrame({"B": ["2", 3.0, 4]}) + + result = merge(left, right, how="left_anti", left_on="A", right_on="B") + expected = DataFrame({"A": [1], "B": [np.nan]}, dtype=object) + tm.assert_frame_equal(result, expected) + + result = merge(left, right, how="right_anti", left_on="A", right_on="B") + expected = DataFrame({"A": [np.nan], "B": [4]}, dtype=object, index=[2]) + tm.assert_frame_equal(result, expected) diff --git a/pandas/tests/reshape/merge/test_merge_cross.py b/pandas/tests/reshape/merge/test_merge_cross.py index 14f9036e43fce..6ab80cf0e0823 100644 --- a/pandas/tests/reshape/merge/test_merge_cross.py +++ b/pandas/tests/reshape/merge/test_merge_cross.py @@ -42,8 +42,7 @@ def test_merge_cross_error_reporting(kwargs): left = DataFrame({"a": [1, 3]}) right = DataFrame({"b": [3, 4]}) msg = ( - "Can not pass on, right_on, left_on or set right_index=True or " - "left_index=True" + "Can not pass on, right_on, left_on or set right_index=True or left_index=True" ) with pytest.raises(MergeError, match=msg): merge(left, right, how="cross", **kwargs) @@ -94,8 +93,7 @@ def test_join_cross_error_reporting(): left = DataFrame({"a": [1, 3]}) right = DataFrame({"a": [3, 4]}) msg = ( - "Can not pass on, right_on, left_on or set right_index=True or " - "left_index=True" + "Can not pass on, right_on, left_on or set right_index=True or left_index=True" ) with pytest.raises(MergeError, match=msg): left.join(right, how="cross", on="a") diff --git a/pandas/tests/scalar/period/test_period.py b/pandas/tests/scalar/period/test_period.py index fe51817a78be8..baaedaa853565 100644 --- a/pandas/tests/scalar/period/test_period.py +++ b/pandas/tests/scalar/period/test_period.py @@ -991,7 +991,6 @@ def test_properties_quarterly(self): qedec_date = 
Period(freq="Q-DEC", year=2007, quarter=1) qejan_date = Period(freq="Q-JAN", year=2007, quarter=1) qejun_date = Period(freq="Q-JUN", year=2007, quarter=1) - # for x in range(3): for qd in (qedec_date, qejan_date, qejun_date): assert (qd + x).qyear == 2007 @@ -1016,7 +1015,6 @@ def test_properties_monthly(self): def test_properties_weekly(self): # Test properties on Periods with daily frequency. w_date = Period(freq="W", year=2007, month=1, day=7) - # assert w_date.year == 2007 assert w_date.quarter == 1 assert w_date.month == 1 @@ -1046,7 +1044,6 @@ def test_properties_daily(self): # Test properties on Periods with daily frequency. with tm.assert_produces_warning(FutureWarning, match=bday_msg): b_date = Period(freq="B", year=2007, month=1, day=1) - # assert b_date.year == 2007 assert b_date.quarter == 1 assert b_date.month == 1 @@ -1089,7 +1086,6 @@ def test_properties_hourly(self): def test_properties_minutely(self): # Test properties on Periods with minutely frequency. t_date = Period(freq="Min", year=2007, month=1, day=1, hour=0, minute=0) - # assert t_date.quarter == 1 assert t_date.month == 1 assert t_date.day == 1 @@ -1108,7 +1104,6 @@ def test_properties_secondly(self): s_date = Period( freq="Min", year=2007, month=1, day=1, hour=0, minute=0, second=0 ) - # assert s_date.year == 2007 assert s_date.quarter == 1 assert s_date.month == 1 diff --git a/pandas/tests/scalar/timedelta/test_constructors.py b/pandas/tests/scalar/timedelta/test_constructors.py index e029dfc3b2703..45caeb1733590 100644 --- a/pandas/tests/scalar/timedelta/test_constructors.py +++ b/pandas/tests/scalar/timedelta/test_constructors.py @@ -353,8 +353,7 @@ def test_construction(): Timedelta("foo") msg = ( - "cannot construct a Timedelta from " - "the passed arguments, allowed keywords are " + "cannot construct a Timedelta from the passed arguments, allowed keywords are " ) with pytest.raises(ValueError, match=msg): Timedelta(day=10) diff --git 
a/pandas/tests/scalar/timestamp/methods/test_round.py b/pandas/tests/scalar/timestamp/methods/test_round.py index 944aa55727217..6b27e5e6c5554 100644 --- a/pandas/tests/scalar/timestamp/methods/test_round.py +++ b/pandas/tests/scalar/timestamp/methods/test_round.py @@ -165,7 +165,6 @@ def test_round_dst_border_ambiguous(self, method, unit): # GH 18946 round near "fall back" DST ts = Timestamp("2017-10-29 00:00:00", tz="UTC").tz_convert("Europe/Madrid") ts = ts.as_unit(unit) - # result = getattr(ts, method)("h", ambiguous=True) assert result == ts assert result._creso == getattr(NpyDatetimeUnit, f"NPY_FR_{unit}").value diff --git a/pandas/tests/series/methods/test_between.py b/pandas/tests/series/methods/test_between.py index e67eafbd118ce..f035767e2ce0e 100644 --- a/pandas/tests/series/methods/test_between.py +++ b/pandas/tests/series/methods/test_between.py @@ -66,8 +66,7 @@ def test_between_error_args(self, inclusive): left, right = series[[2, 7]] value_error_msg = ( - "Inclusive has to be either string of 'both'," - "'left', 'right', or 'neither'." + "Inclusive has to be either string of 'both','left', 'right', or 'neither'." 
) series = Series(date_range("1/1/2000", periods=10)) diff --git a/pandas/tests/series/test_constructors.py b/pandas/tests/series/test_constructors.py index 69f42b5e42878..a2be698c0ec28 100644 --- a/pandas/tests/series/test_constructors.py +++ b/pandas/tests/series/test_constructors.py @@ -90,6 +90,13 @@ def test_unparsable_strings_with_dt64_dtype(self): with pytest.raises(ValueError, match=msg): Series(np.array(vals, dtype=object), dtype="datetime64[ns]") + def test_invalid_dtype_conversion_datetime_to_timedelta(self): + # GH#60728 + vals = Series([NaT, Timestamp(2025, 1, 1)], dtype="datetime64[ns]") + msg = r"^Cannot cast DatetimeArray to dtype timedelta64\[ns\]$" + with pytest.raises(TypeError, match=msg): + Series(vals, dtype="timedelta64[ns]") + @pytest.mark.parametrize( "constructor", [ diff --git a/pandas/tests/test_downstream.py b/pandas/tests/test_downstream.py index 18df76ddd8ed8..76fad35304fe6 100644 --- a/pandas/tests/test_downstream.py +++ b/pandas/tests/test_downstream.py @@ -20,6 +20,7 @@ TimedeltaIndex, ) import pandas._testing as tm +from pandas.util.version import Version @pytest.fixture @@ -222,7 +223,7 @@ def test_missing_required_dependency(): assert name in output -def test_frame_setitem_dask_array_into_new_col(): +def test_frame_setitem_dask_array_into_new_col(request): # GH#47128 # dask sets "compute.use_numexpr" to False, so catch the current value @@ -230,7 +231,14 @@ def test_frame_setitem_dask_array_into_new_col(): olduse = pd.get_option("compute.use_numexpr") try: + dask = pytest.importorskip("dask") da = pytest.importorskip("dask.array") + if Version(dask.__version__) <= Version("2025.1.0") and Version( + np.__version__ + ) >= Version("2.1"): + request.applymarker( + pytest.mark.xfail(reason="loc.__setitem__ incorrectly mutated column c") + ) dda = da.array([1, 2]) df = DataFrame({"a": ["a", "b"]}) diff --git a/pandas/tests/tools/test_to_datetime.py b/pandas/tests/tools/test_to_datetime.py index 74b051aec71a4..566fd8d901569 100644 --- 
a/pandas/tests/tools/test_to_datetime.py +++ b/pandas/tests/tools/test_to_datetime.py @@ -1935,7 +1935,7 @@ def test_to_datetime_unit_na_values(self): @pytest.mark.parametrize("bad_val", ["foo", 111111111]) def test_to_datetime_unit_invalid(self, bad_val): if bad_val == "foo": - msg = "Unknown datetime string format, unable to parse: " f"{bad_val}" + msg = f"Unknown datetime string format, unable to parse: {bad_val}" else: msg = "cannot convert input 111111111 with the unit 'D'" with pytest.raises(ValueError, match=msg): @@ -2258,7 +2258,7 @@ def test_to_datetime_iso8601_exact_fails(self, input, format): [ '^unconverted data remains when parsing with format ".*": ".*". ' f"{PARSING_ERR_MSG}$", - f'^time data ".*" doesn\'t match format ".*". ' f"{PARSING_ERR_MSG}$", + f'^time data ".*" doesn\'t match format ".*". {PARSING_ERR_MSG}$', ] ) with pytest.raises( diff --git a/pandas/tests/tools/test_to_numeric.py b/pandas/tests/tools/test_to_numeric.py index f3645bf0649bd..893f526fb3eb0 100644 --- a/pandas/tests/tools/test_to_numeric.py +++ b/pandas/tests/tools/test_to_numeric.py @@ -192,7 +192,7 @@ def test_numeric_df_columns(columns): # see gh-14827 df = DataFrame( { - "a": [1.2, decimal.Decimal(3.14), decimal.Decimal("infinity"), "0.1"], + "a": [1.2, decimal.Decimal("3.14"), decimal.Decimal("infinity"), "0.1"], "b": [1.0, 2.0, 3.0, 4.0], } ) @@ -207,10 +207,10 @@ def test_numeric_df_columns(columns): "data,exp_data", [ ( - [[decimal.Decimal(3.14), 1.0], decimal.Decimal(1.6), 0.1], + [[decimal.Decimal("3.14"), 1.0], decimal.Decimal("1.6"), 0.1], [[3.14, 1.0], 1.6, 0.1], ), - ([np.array([decimal.Decimal(3.14), 1.0]), 0.1], [[3.14, 1.0], 0.1]), + ([np.array([decimal.Decimal("3.14"), 1.0]), 0.1], [[3.14, 1.0], 0.1]), ], ) def test_numeric_embedded_arr_likes(data, exp_data): diff --git a/pandas/tests/tseries/offsets/test_offsets.py b/pandas/tests/tseries/offsets/test_offsets.py index d0192c12f9518..7480b99595066 100644 --- a/pandas/tests/tseries/offsets/test_offsets.py +++ 
b/pandas/tests/tseries/offsets/test_offsets.py @@ -798,9 +798,9 @@ def test_get_offset(): for name, expected in pairs: offset = _get_offset(name) - assert ( - offset == expected - ), f"Expected {name!r} to yield {expected!r} (actual: {offset!r})" + assert offset == expected, ( + f"Expected {name!r} to yield {expected!r} (actual: {offset!r})" + ) def test_get_offset_legacy(): diff --git a/pandas/tests/tseries/offsets/test_ticks.py b/pandas/tests/tseries/offsets/test_ticks.py index f91230e1460c4..46b6846ad1ec2 100644 --- a/pandas/tests/tseries/offsets/test_ticks.py +++ b/pandas/tests/tseries/offsets/test_ticks.py @@ -289,8 +289,7 @@ def test_tick_rdiv(cls): td64 = delta.to_timedelta64() instance__type = ".".join([cls.__module__, cls.__name__]) msg = ( - "unsupported operand type\\(s\\) for \\/: 'int'|'float' and " - f"'{instance__type}'" + f"unsupported operand type\\(s\\) for \\/: 'int'|'float' and '{instance__type}'" ) with pytest.raises(TypeError, match=msg): diff --git a/pandas/tests/tslibs/test_parsing.py b/pandas/tests/tslibs/test_parsing.py index 07425af8ed37a..bc5cd5fcccbf8 100644 --- a/pandas/tests/tslibs/test_parsing.py +++ b/pandas/tests/tslibs/test_parsing.py @@ -134,10 +134,7 @@ def test_does_not_convert_mixed_integer(date_string, expected): ( "2013Q1", {"freq": "INVLD-L-DEC-SAT"}, - ( - "Unable to retrieve month information " - "from given freq: INVLD-L-DEC-SAT" - ), + ("Unable to retrieve month information from given freq: INVLD-L-DEC-SAT"), ), ], ) diff --git a/pandas/tests/window/test_numba.py b/pandas/tests/window/test_numba.py index 120dbe788a23f..887aeca6590dc 100644 --- a/pandas/tests/window/test_numba.py +++ b/pandas/tests/window/test_numba.py @@ -1,6 +1,7 @@ import numpy as np import pytest +from pandas.compat import is_platform_arm from pandas.errors import NumbaUtilError import pandas.util._test_decorators as td @@ -11,8 +12,17 @@ to_datetime, ) import pandas._testing as tm +from pandas.util.version import Version -pytestmark = 
pytest.mark.single_cpu +pytestmark = [pytest.mark.single_cpu] + +numba = pytest.importorskip("numba") +pytestmark.append( + pytest.mark.skipif( + Version(numba.__version__) == Version("0.61") and is_platform_arm(), + reason=f"Segfaults on ARM platforms with numba {numba.__version__}", + ) +) @pytest.fixture(params=["single", "table"]) diff --git a/pandas/tests/window/test_online.py b/pandas/tests/window/test_online.py index 14d3a39107bc4..43d55a7992b3c 100644 --- a/pandas/tests/window/test_online.py +++ b/pandas/tests/window/test_online.py @@ -1,15 +1,24 @@ import numpy as np import pytest +from pandas.compat import is_platform_arm + from pandas import ( DataFrame, Series, ) import pandas._testing as tm +from pandas.util.version import Version -pytestmark = pytest.mark.single_cpu +pytestmark = [pytest.mark.single_cpu] -pytest.importorskip("numba") +numba = pytest.importorskip("numba") +pytestmark.append( + pytest.mark.skipif( + Version(numba.__version__) == Version("0.61") and is_platform_arm(), + reason=f"Segfaults on ARM platforms with numba {numba.__version__}", + ) +) @pytest.mark.filterwarnings("ignore") diff --git a/pandas/tseries/frequencies.py b/pandas/tseries/frequencies.py index 9a01568971af8..88ea1bfa3c6ed 100644 --- a/pandas/tseries/frequencies.py +++ b/pandas/tseries/frequencies.py @@ -145,8 +145,7 @@ def infer_freq( pass elif isinstance(index.dtype, PeriodDtype): raise TypeError( - "PeriodIndex given. Check the `freq` attribute " - "instead of using infer_freq." + "PeriodIndex given. Check the `freq` attribute instead of using infer_freq." 
) elif lib.is_np_dtype(index.dtype, "m"): # Allow TimedeltaIndex and TimedeltaArray diff --git a/pyproject.toml b/pyproject.toml index 7ab9cd2c17669..b7d53b0d8934a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -272,8 +272,6 @@ ignore = [ "B007", # controversial "B008", - # setattr is used to side-step mypy - "B009", # getattr is used to side-step mypy "B010", # tests use comparisons but not their returned value @@ -362,8 +360,6 @@ ignore = [ "PLR1733", # 5 errors, it seems like we wannt to ignore these # Unnecessary lookup of list item by index "PLR1736", # 4 errors, we're currently having inline pylint ignore - # empty-comment - "PLR2044", # autofixable # Unpacking a dictionary in iteration without calling `.items()` "PLE1141", # autofixable # import-outside-toplevel @@ -746,5 +742,5 @@ exclude_lines = [ directory = "coverage_html_report" [tool.codespell] -ignore-words-list = "blocs, coo, hist, nd, sav, ser, recuse, nin, timere, expec, expecs, indext, SME, NotIn, tructures, tru" +ignore-words-list = "blocs, coo, hist, nd, sav, ser, recuse, nin, timere, expec, expecs, indext, SME, NotIn, tructures, tru, indx, abd, ABD" ignore-regex = 'https://([\w/\.])+' diff --git a/web/pandas/community/ecosystem.md b/web/pandas/community/ecosystem.md index dc7b9bc947214..876e6e5b298ea 100644 --- a/web/pandas/community/ecosystem.md +++ b/web/pandas/community/ecosystem.md @@ -8,7 +8,7 @@ developers to build powerful and more focused data tools. The creation of libraries that complement pandas' functionality also allows pandas development to remain focused around its original requirements. -This is an community-maintained list of projects that build on pandas in order +This is a community-maintained list of projects that build on pandas in order to provide tools in the PyData space. The pandas core development team does not necessarily endorse any particular project on this list or have any knowledge of the maintenance status of any particular library. 
For a more complete list of projects that depend on pandas, see the [libraries.io usage page for @@ -496,17 +496,29 @@ You can find more information about the Hugging Face Dataset Hub in the [documen ## Out-of-core -### [Bodo](https://bodo.ai/) +### [Bodo](https://github.com/bodo-ai/Bodo) -Bodo is a high-performance Python computing engine that automatically parallelizes and -optimizes your code through compilation using HPC (high-performance computing) techniques. -Designed to operate with native pandas dataframes, Bodo compiles your pandas code to execute -across multiple cores on a single machine or distributed clusters of multiple compute nodes efficiently. -Bodo also makes distributed pandas dataframes queryable with SQL. -The community edition of Bodo is free to use on up to 8 cores. Beyond that, Bodo offers a paid -enterprise edition. Free licenses of Bodo (for more than 8 cores) are available -[upon request](https://www.bodo.ai/contact) for academic and non-profit use. +Bodo is a high-performance compute engine for Python data processing. +Using an auto-parallelizing just-in-time (JIT) compiler, Bodo simplifies scaling Pandas +workloads from laptops to clusters without major code changes. +Under the hood, Bodo relies on MPI-based high-performance computing (HPC) technology—making it +both easier to use and often much faster than alternatives. +Bodo also provides a SQL engine that can query distributed pandas dataframes efficiently. 
+ +```python +import pandas as pd +import bodo + +@bodo.jit +def process_data(): + df = pd.read_parquet("my_data.pq") + df2 = pd.DataFrame({"A": df.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)}) + df2.to_parquet("out.pq") + +process_data() +``` + ### [Cylon](https://cylondata.org/) From 39589fcd8ef40f69e4170ea4395fe36fed8b7384 Mon Sep 17 00:00:00 2001 From: RedGuy12 Date: Mon, 10 Feb 2025 05:40:59 +0000 Subject: [PATCH 3/4] Backport typo fix --- doc/source/user_guide/io/csv.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/user_guide/io/csv.rst b/doc/source/user_guide/io/csv.rst index 829457f45880c..108299511f346 100644 --- a/doc/source/user_guide/io/csv.rst +++ b/doc/source/user_guide/io/csv.rst @@ -1001,7 +1001,7 @@ Thousand separators For large numbers that have been written with a thousands separator, you can set the ``thousands`` keyword to a string of length 1 so that integers will be parsed -correctly: +correctly. By default, numbers with a thousands separator will be parsed as strings: From a8bb717bf651783d72768389cac04fae60a85ad6 Mon Sep 17 00:00:00 2001 From: RedGuy12 Date: Sat, 22 Feb 2025 23:08:21 +0000 Subject: [PATCH 4/4] add missing `import os`s --- doc/source/user_guide/io/excel.rst | 1 + doc/source/user_guide/io/feather.rst | 1 + doc/source/user_guide/io/hdf5.rst | 1 + doc/source/user_guide/io/html.rst | 1 + doc/source/user_guide/io/json.rst | 1 + doc/source/user_guide/io/orc.rst | 1 + doc/source/user_guide/io/parquet.rst | 5 +++-- doc/source/user_guide/io/pickling.rst | 1 + doc/source/user_guide/io/stata.rst | 1 + doc/source/user_guide/io/xml.rst | 1 + 10 files changed, 12 insertions(+), 2 deletions(-) diff --git a/doc/source/user_guide/io/excel.rst b/doc/source/user_guide/io/excel.rst index 41ff0e7477235..26e9c16ce27fe 100644 --- a/doc/source/user_guide/io/excel.rst +++ b/doc/source/user_guide/io/excel.rst @@ -221,6 +221,7 @@ should be passed to ``index_col`` and ``header``: .. 
ipython:: python :suppress: + import os os.remove("path_to_file.xlsx") Missing values in columns specified in ``index_col`` will be forward filled to diff --git a/doc/source/user_guide/io/feather.rst b/doc/source/user_guide/io/feather.rst index 713660e7e0260..3a3e34eabe4f9 100644 --- a/doc/source/user_guide/io/feather.rst +++ b/doc/source/user_guide/io/feather.rst @@ -64,4 +64,5 @@ Read from a feather file. .. ipython:: python :suppress: + import os os.remove("example.feather") diff --git a/doc/source/user_guide/io/hdf5.rst b/doc/source/user_guide/io/hdf5.rst index 55457339f0179..79aec64026c0f 100644 --- a/doc/source/user_guide/io/hdf5.rst +++ b/doc/source/user_guide/io/hdf5.rst @@ -21,6 +21,7 @@ for some advanced strategies :suppress: :okexcept: + import os os.remove("store.h5") .. ipython:: python diff --git a/doc/source/user_guide/io/html.rst b/doc/source/user_guide/io/html.rst index 879c2da281c92..efc3958b00e94 100644 --- a/doc/source/user_guide/io/html.rst +++ b/doc/source/user_guide/io/html.rst @@ -120,6 +120,7 @@ as a string: .. ipython:: python :suppress: + import os os.remove("tmp.html") You can even pass in an instance of ``StringIO`` if you so desire: diff --git a/doc/source/user_guide/io/json.rst b/doc/source/user_guide/io/json.rst index 2861176cd80fb..c745c59f684da 100644 --- a/doc/source/user_guide/io/json.rst +++ b/doc/source/user_guide/io/json.rst @@ -604,6 +604,7 @@ indicate missing values and the subsequent read cannot distinguish the intent. .. ipython:: python :suppress: + import os os.remove("test.json") When using ``orient='table'`` along with user-defined ``ExtensionArray``, diff --git a/doc/source/user_guide/io/orc.rst b/doc/source/user_guide/io/orc.rst index fc1f2e671b011..d49c55efa347a 100644 --- a/doc/source/user_guide/io/orc.rst +++ b/doc/source/user_guide/io/orc.rst @@ -59,4 +59,5 @@ Read only certain columns of an orc file. .. 
ipython:: python :suppress: + import os os.remove("example_pa.orc") diff --git a/doc/source/user_guide/io/parquet.rst b/doc/source/user_guide/io/parquet.rst index ada07471c9aac..efe10bfa18bf3 100644 --- a/doc/source/user_guide/io/parquet.rst +++ b/doc/source/user_guide/io/parquet.rst @@ -105,8 +105,9 @@ Read only certain columns of a parquet file. .. ipython:: python - :suppress: + :okexcept: + import os os.remove("example_pa.parquet") os.remove("example_fp.parquet") @@ -145,7 +146,7 @@ Passing ``index=True`` will *always* write the index, even if that's not the underlying engine's default behavior. .. ipython:: python - :suppress: + :okexcept: os.remove("test.parquet") diff --git a/doc/source/user_guide/io/pickling.rst b/doc/source/user_guide/io/pickling.rst index 8da5e1f96a184..38c6ec1f60e89 100644 --- a/doc/source/user_guide/io/pickling.rst +++ b/doc/source/user_guide/io/pickling.rst @@ -23,6 +23,7 @@ any pickled pandas object (or any other pickled object) from file: .. ipython:: python :suppress: + import os os.remove("foo.pkl") .. warning:: diff --git a/doc/source/user_guide/io/stata.rst b/doc/source/user_guide/io/stata.rst index 89f930525d3a8..d76800eb1ee8c 100644 --- a/doc/source/user_guide/io/stata.rst +++ b/doc/source/user_guide/io/stata.rst @@ -124,6 +124,7 @@ values will have ``object`` data type. .. ipython:: python :suppress: + import os os.remove("stata.dta") .. _io.stata-categorical: diff --git a/doc/source/user_guide/io/xml.rst b/doc/source/user_guide/io/xml.rst index aa619eeefe149..67b423a0ec831 100644 --- a/doc/source/user_guide/io/xml.rst +++ b/doc/source/user_guide/io/xml.rst @@ -133,6 +133,7 @@ Specify only elements or only attributes to parse: .. ipython:: python :suppress: + import os os.remove("books.xml") XML documents can have namespaces with prefixes and default namespaces without