
Retrieve a query as a NumPy structured array #1156

Open

wants to merge 39 commits into master

Conversation

ilanschnell

In this PR, we add a .fetchdictarray() method to the pyodbc.Cursor object. This adds numpy as an optional build and runtime dependency. The extension src/npcontainer.cpp is compiled only when numpy is available at build time. In addition, WITH_NUMPY will be defined so that src/cursor.cpp can add the method and src/pyodbcmodule.cpp can initialize numpy on import.

Here is the docstring of the .fetchdictarray() method:

fetchdictarray(size=-1, return_nulls=False, null_suffix='_isnull')
                               --> a dictionary of column arrays.

Fetch as many rows as specified by size into a dictionary of NumPy
ndarrays (dictarray). The dictionary will contain a key for each column,
with its value being a NumPy ndarray holding that column's values for
the fetched rows. Optionally, extra columns will be added to signal
nulls on nullable columns.

Parameters
----------
size : int, optional
    The number of rows to fetch. Use -1 (the default) to fetch all
    remaining rows.
return_nulls : boolean, optional
    If True, null-value information will be included by adding, for
    each nullable column, a boolean array whose key is the column name
    concatenated with null_suffix.
null_suffix : string, optional
    A string used as a suffix when building the key for null values.
    Only used if return_nulls is True.

Returns
-------
out: dict
    A dictionary mapping column names to ndarrays holding the column
    values for the fetched rows. Optionally, null information for
    nullable columns is provided in additional boolean arrays, keyed by
    the nullable column name concatenated with null_suffix.

Remarks
-------
Similar to fetchmany(size), but returning a dictionary of NumPy ndarrays
for the results instead of a Python list of tuples of objects, reducing
memory footprint as well as improving performance.
fetchdictarray is overall more efficient than fetchsarray.
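
A minimal usage sketch (the DSN and table are hypothetical; fetchdictarray only exists when pyodbc was built with numpy available):

```python
import pyodbc

# Hypothetical DSN and table names.
cnxn = pyodbc.connect("DSN=mydsn")
cursor = cnxn.cursor()

# The method is only present when the extension was built with numpy.
assert hasattr(cursor, "fetchdictarray")

cursor.execute("SELECT id, price FROM products")
data = cursor.fetchdictarray(return_nulls=True)
print(data["id"])            # ndarray of id values
print(data["price"])         # ndarray of price values
print(data["price_isnull"])  # boolean ndarray flagging NULL prices
```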

Note: The code is based on https://github.com/ContinuumIO/TextAdapter (which was released in 2017 by Anaconda, Inc. under the BSD license). The original authors of the numpy container are Francesc Alted and Oscar Villellas.

@ndmlny-qs
Contributor

Pinging the PR to see if there is anything I can help with to merge this feature.

mkleehammer and others added 28 commits May 9, 2023 00:31
I ran the unit tests with a debug version of Python and it complained that the Unicode string I
was building was invalid.  The original code was a modified version of an older Python tuple
repr implementation, so I looked at doing that again.  However, cpython now uses an internal
_PyUnicode_Writer class we don't have access to, so I'm cheating by creating a tuple.

Since repr should not be in the critical path of most performance sensitive DB jobs, this will
do for now.
It is easier to build debug versions of Python now.
The "raw" encoding was Python 2.7 only.

I originally created ODBCCHAR to replace SQLWCHAR because unixODBC would define it as wchar_t
even when that was 4 bytes.  The data in the buffer of 4-byte wchar_t's was still 2-byte data.
Now I've just simplified to uint16_t.  I added this to HACKING.md.

Deleted Tuple wrapper.  Use the Object wrapper and the PyTuple_ functions.  This is to prepare
for possibly using the ABI which would not allow me to access the internal item pointers directly,
so I could not use operator[] to set items.  (Python has __getitem__ and __setitem__, but to
overload __setitem__ in C++ you can only return a reference to the internal data.)
Used subprocess in setup.py to eliminate warnings about the process still running.

Removed connect() ansi parameter.

Updated SQLWChar to allow it to be declared on the stack and initialized later.  Turned into a
class with an operator to convert to SQLWCHAR*.
Somehow I lost some changes.
This is a fix for GitHub security advisory GHSA-pm6v-h62r-rwx8.  The old code had a hardcoded
buffer of 100 bytes (and a comment asking why it was hardcoded!) and fetching a decimal greater
than 100 digits would cause a buffer overflow.

Author arturxedex128 supplied a very simple snippet to reproduce the error, which was put into the
3 PostgreSQL unit tests as test_large_decimal.  (Thank you arturxedex128!)

Unfortunately the strategy is still that we have to parse decimals, but now Python strings /
Unicode objects are used so there is no arbitrary limit.
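
A hedged sketch of such a repro (`cursor` is assumed to be an open cursor on a PostgreSQL connection; the actual test code lives in the unit tests as test_large_decimal):

```python
from decimal import Decimal

# A decimal wider than the old hardcoded 100-byte buffer.
big = Decimal("9" * 120)
row = cursor.execute("SELECT CAST(? AS numeric(130, 0))", big).fetchone()
assert row[0] == big  # fetching this used to overflow the fixed buffer
```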
I have not ported this code path and I'm not as familiar with it as I need to be.  To allow me
to complete porting and testing the rest, I've temporarily commented it out.

I will look into consolidating the binding for the two code paths.  Also, I'd like to consider
renaming it to "array binding" or "row wise binding" instead of "fast executemany".  While the
latter does tell us the goal of it, it is too generic.  For one thing, what if we wanted to
supply both row- and column-wise binding -- they are both "fast".
I'm not sure where the minor fixes came from, like PyEval_ -> PyObject.  I'll need to test with
older 3.x versions.

I am going to use the test file naming convention of xxx_tests.py to make it easier to use tab
completion in shells and editors.
I accidentally deleted it.  It is required for simple local pytest.  (See comment in the file.)
I'm porting the tests one at a time and want to ensure the ones ported are successful.
I've only tested on Linux so far.  Next step is to get the Windows tests working on a local
machine and/or AppVeyor.
I uncommented some sections and the indentation was off.  Perhaps it had tabs.
I missed a version.  While in there, I simplified this code and used the year as an int and
consolidated the "not SQL Server" code into the _get_sqlserver_year function.
I also added flake8 and pylint to the dev requirements.
ilanschnell and others added 10 commits August 25, 2023 12:37
This commit modifies the `params.cpp` file to check whether items can
actually be retrieved from the given iterable when a custom iterable
object is used. This way, when executing the code below

```python
import collections
import pyodbc

class MySequence(collections.abc.Sequence):
    def __getitem__(self, index):
        raise Exception

    def __len__(self):
        return 1

# 'connection' is assumed to be an open pyodbc.Connection.
connection.execute("SELECT ?, ?", 123, MySequence()).fetchone()
```

a Python exception is raised instead of a segfault.
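
With the check in place, the failure surfaces as an ordinary Python exception rather than a crash (a sketch of the expected behavior; the exact exception type depends on pyodbc's error mapping):

```python
try:
    connection.execute("SELECT ?, ?", 123, MySequence()).fetchone()
except Exception as exc:
    # The interpreter stays alive; the bad sequence is reported as an error.
    print(f"parameter binding failed: {exc!r}")
```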
...which have never worked.  Maybe this will work with CIBUILDWHEEL eventually but until it does, drop them.
@ndmlny-qs
Contributor

@mkleehammer this has been rebased against the py3 branch

Merge conflict markers were kept in `.github/workflows/ubuntu_build.yml`
for some reason. This commit removes them.
@ndmlny-qs
Contributor

@mkleehammer you will need to close out this PR so I can make a different one against the py3 branch. I'll resolve any tests that fail in a new PR against that branch.

@ndmlny-qs
Contributor

@mkleehammer this PR can be closed in favor of #1270 where I am still working on updating tests

@ndmlny-qs
Contributor

@ilanschnell You can close this PR, see #1270 for a comment about why this can be closed.

@mkleehammer if Ilan does not close out this PR, you can close it out as you clean out stale PRs.
