Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional filtering to LLM Engines #1550

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
13bc97d
use cudf deep copy for get_data_frame
efajardo-nv Feb 28, 2024
dc6a570
revert test updates
efajardo-nv Feb 28, 2024
5538556
remove comment
efajardo-nv Feb 28, 2024
de8348f
pr feedback updates and clang-format fixes
efajardo-nv Feb 29, 2024
551b893
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Feb 29, 2024
e54b89e
fix expected results in test_llm_retriever_node
efajardo-nv Feb 29, 2024
1ce60f3
pylint fix
efajardo-nv Feb 29, 2024
86d5222
Merge branch 'branch-24.03' into messagemeta-get-dataframe-update
efajardo-nv Mar 1, 2024
d92a966
Adding retry logic and proxy support to the NeMo LLM Service
mdemoret-nv Mar 4, 2024
ee9f569
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Mar 6, 2024
1470198
revert previous commits and add struct support to cudf helpers
efajardo-nv Mar 6, 2024
b26731c
Merge branch 'messagemeta-get-dataframe-update' of github.com:efajard…
dagardner-nv Mar 6, 2024
f5a1af5
Support filtering for llm
dagardner-nv Mar 6, 2024
cf8c45d
update make_table_from_table_with_metadata for structs
efajardo-nv Mar 6, 2024
70b5593
Merge branch 'messagemeta-get-dataframe-update' of github.com:efajard…
dagardner-nv Mar 6, 2024
95fa664
Uncomment per Eli
dagardner-nv Mar 6, 2024
8bbe27d
Remove old debug code
dagardner-nv Mar 6, 2024
1214419
Pull the 'as_column() call for the row selector out of the loop
dagardner-nv Mar 6, 2024
974e903
use imported update_struct_field_names
efajardo-nv Mar 6, 2024
d6b179d
First pass at setting the row mask to the context
dagardner-nv Mar 7, 2024
1ba2615
Move the row mask to the state
dagardner-nv Mar 7, 2024
423b248
Test filter_fn arg to the extractor node
dagardner-nv Mar 8, 2024
1307622
Remove duplicate tests for the other constructors as these were moved…
dagardner-nv Mar 8, 2024
71b4000
Add test for row mask functionality
dagardner-nv Mar 8, 2024
22c9d67
Fix mis-named test, fix accidental usage of the source df which is be…
dagardner-nv Mar 8, 2024
5796300
Add row_mask to task handler test
dagardner-nv Mar 8, 2024
c211db1
update_column_struct_field_names ignoring str columns for now
efajardo-nv Mar 8, 2024
6819969
remove pandas import
efajardo-nv Mar 8, 2024
ebc332c
Add end-to-end test with a filter function
dagardner-nv Mar 8, 2024
ebe25ce
Update IWYU mappings to not recomend the attr header
dagardner-nv Mar 8, 2024
e74bcdb
IWYU fixes
dagardner-nv Mar 8, 2024
7f14ed1
Pylint fixes
dagardner-nv Mar 8, 2024
b81f625
fix expected results in test_llm_retriever_node
efajardo-nv Mar 9, 2024
72434de
pylint fix
efajardo-nv Mar 9, 2024
9c0df8d
remove unused import
efajardo-nv Mar 9, 2024
077347e
remove unused import
efajardo-nv Mar 9, 2024
943d9b1
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Mar 11, 2024
072f53e
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Mar 12, 2024
f3d0e46
update assert_df_equals to support list columns
efajardo-nv Mar 12, 2024
8951296
add struct and list column data to tests
efajardo-nv Mar 12, 2024
be515f1
change variable name
efajardo-nv Mar 12, 2024
91c1a15
revert python_file_regex change
efajardo-nv Mar 13, 2024
71856f2
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Mar 13, 2024
230326d
fix get_meta slice issue by slicing dataframe directly instead of via…
efajardo-nv Mar 17, 2024
c546157
Merge branch 'branch-24.03' of https://github.com/nv-morpheus/Morpheu…
efajardo-nv Mar 17, 2024
c6f93e2
clang-format fix
efajardo-nv Mar 17, 2024
04e18d0
revert last commit to cudf_helpers
efajardo-nv Mar 18, 2024
b4b1071
move get_py_object to messagemeta
efajardo-nv Mar 18, 2024
f5b5c07
use test_dataframe.jsonlines in all messagemeta tests
efajardo-nv Mar 20, 2024
0968cb8
revert change to make_table_from_table_with_metadata
efajardo-nv Mar 20, 2024
2665d32
Merge branch 'messagemeta-get-dataframe-update' of github.com:efajard…
dagardner-nv Apr 10, 2024
2e6af62
Merge branch 'branch-24.06' of github.com:nv-morpheus/Morpheus into d…
dagardner-nv Apr 10, 2024
665b56a
Update stub after recompile
dagardner-nv Apr 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions ci/iwyu/mappings.imp
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@
# pybind11
{ "include": [ "<pybind11/detail/common.h>", "private", "<pybind11/pytypes.h>", "public" ] },
{ "include": [ "<pybind11/cast.h>", "private", "<pybind11/pybind11.h>", "public" ] },
{ "include": [ "<pybind11/attr.h>", "private", "<pybind11/pybind11.h>", "public" ] },

# rxcpp
# Hide includes that are exported by <rxcpp/rx.hpp>
Expand Down
77 changes: 76 additions & 1 deletion morpheus/_lib/cudf_helpers.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -14,12 +14,15 @@
# limitations under the License.

import cudf
from cudf.core.dtypes import StructDtype

from libcpp.string cimport string
from libcpp.utility cimport move
from libcpp.vector cimport vector

from cudf._lib.column cimport Column
from cudf._lib.cpp.io.types cimport column_name_info
from cudf._lib.cpp.io.types cimport table_metadata
from cudf._lib.cpp.io.types cimport table_with_metadata
from cudf._lib.cpp.table.table_view cimport table_view
from cudf._lib.cpp.types cimport size_type
Expand Down Expand Up @@ -56,6 +59,18 @@ cdef public api:

object make_table_from_table_info_data(TableInfoData table_info, object owner):

cdef table_metadata tbl_meta

num_index_cols_meta = 0
cdef column_name_info child_info
for i, name in enumerate(owner._column_names, num_index_cols_meta):
child_info.name = name.encode()
tbl_meta.schema_info.push_back(child_info)
_set_col_children_metadata(
owner[name]._column,
tbl_meta.schema_info[i]
)

index_names = None

if (table_info.index_names.size() > 0):
Expand Down Expand Up @@ -89,7 +104,11 @@ cdef public api:
import traceback
print("error while converting libcudf table to cudf dataframe:", traceback.format_exc())

return cudf.DataFrame._from_data(data, index)
df = cudf.DataFrame._from_data(data, index)

update_struct_field_names(df, tbl_meta.schema_info)

return df


TableInfoData make_table_info_data_from_table(object table):
Expand Down Expand Up @@ -173,3 +192,59 @@ cdef public api:
source_column_idx += 1

return dict(zip(column_names, data_columns)), index

cdef _set_col_children_metadata(Column col,
column_name_info& col_meta):
cdef column_name_info child_info
if isinstance(col.dtype, cudf.StructDtype):
for i, (child_col, name) in enumerate(
zip(col.children, list(col.dtype.fields))
):
child_info.name = name.encode()
col_meta.children.push_back(child_info)
_set_col_children_metadata(
child_col, col_meta.children[i]
)
elif isinstance(col.dtype, cudf.ListDtype):
for i, child_col in enumerate(col.children):
col_meta.children.push_back(child_info)
_set_col_children_metadata(
child_col, col_meta.children[i]
)
else:
return

cdef update_struct_field_names(
table,
vector[column_name_info]& schema_info
):
for i, (name, col) in enumerate(table._data.items()):
table._data[name] = update_column_struct_field_names(
col, schema_info[i]
)


cdef Column update_column_struct_field_names(
Column col,
column_name_info& info
):
cdef vector[string] field_names

if col.dtype != "object" and col.children:
children = list(col.children)
for i, child in enumerate(children):
children[i] = update_column_struct_field_names(
child,
info.children[i]
)
col.set_base_children(tuple(children))

if isinstance(col.dtype, StructDtype):
field_names.reserve(len(col.base_children))
for i in range(info.children.size()):
field_names.push_back(info.children[i].name)
col = col._rename_fields(
field_names
)

return col
2 changes: 2 additions & 0 deletions morpheus/_lib/cudf_helpers/__init__.pyi
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
from __future__ import annotations
import morpheus._lib.cudf_helpers
import typing
from cudf.core.dtypes import StructDtype
import cudf

__all__ = [
"StructDtype",
"cudf"
]

Expand Down
33 changes: 33 additions & 0 deletions morpheus/_lib/include/morpheus/llm/llm_context.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -29,12 +29,20 @@
#include <string>
#include <vector>

// IWYU mistakenly believes that we could use the forward declares of LLMTask and ControlMessage in fwd.hpp, however an
// an incomplete type decl cannot be used in a shared_ptr, and in the case of LLMTask in a struct that is used in a
// shared_ptr.
// IWYU pragma: no_include "morpheus/llm/fwd.hpp"

namespace morpheus::llm {

struct LLMContextState
{
LLMTask task;
std::shared_ptr<ControlMessage> message;

// Optional row mask to be applied to the Dataframe by the extractor and task handler to filter rows
std::vector<bool> row_mask;
};

/**
Expand Down Expand Up @@ -64,6 +72,7 @@ class MORPHEUS_EXPORT LLMContext : public std::enable_shared_from_this<LLMContex
* @param parent parent context
* @param name new context name
* @param inputs input mappings for new context
* @param row_mask row mask for new context
*/
LLMContext(std::shared_ptr<LLMContext> parent, std::string name, input_mappings_t inputs);

Expand Down Expand Up @@ -190,6 +199,30 @@ class MORPHEUS_EXPORT LLMContext : public std::enable_shared_from_this<LLMContex
*/
const mrc::pymrc::JSONValues& view_outputs() const;

/**
* @brief Set the row mask indicating which rows of the dataframe are being used to populate the inputs.
* This should only be called by the first node in an LLM Engine, typically the Extractor node.
*
* @param row_mask vector of bools
*/
void set_row_mask(std::vector<bool>&& row_mask);

/**
* @brief Check if the row mask has been set.
*
* @return true if row mask has been set
* @return false if row mask has not been set
*/
bool has_row_mask() const;

/**
* @brief Get the row mask indicating which rows of the dataframe the outputs should be written to.
* This should only be called by the task handler.
*
* @return vector of bools
*/
const std::vector<bool>& get_row_mask() const;

private:
input_mappings_t::const_iterator find_input(const std::string& node_name, bool throw_if_not_found = true) const;

Expand Down
2 changes: 2 additions & 0 deletions morpheus/_lib/include/morpheus/messages/meta.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,8 @@ class MessageMeta
*/
virtual std::optional<std::string> ensure_sliceable_index();

pybind11::object get_py_object() const;

/**
* @brief Create MessageMeta cpp object from a python object
*
Expand Down
3 changes: 3 additions & 0 deletions morpheus/_lib/llm/__init__.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -68,12 +68,15 @@ class LLMContext():
@typing.overload
def get_input(self, node_name: str) -> object: ...
def get_inputs(self) -> object: ...
def get_row_mask(self) -> typing.List[bool]: ...
def has_row_mask(self) -> bool: ...
def message(self) -> morpheus._lib.messages.ControlMessage: ...
def push(self, name: str, inputs: typing.List[InputMap]) -> LLMContext: ...
@typing.overload
def set_output(self, output_name: str, output: object) -> None: ...
@typing.overload
def set_output(self, outputs: object) -> None: ...
def set_row_mask(self, row_mask: typing.List[bool]) -> None: ...
def task(self) -> LLMTask: ...
@property
def full_name(self) -> str:
Expand Down
5 changes: 4 additions & 1 deletion morpheus/_lib/llm/module.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,10 @@ PYBIND11_MODULE(llm, _module)
py::overload_cast<const std::string&, mrc::pymrc::JSONValues&&>(&LLMContext::set_output),
py::arg("output_name"),
py::arg("output"))
.def("push", &LLMContext::push, py::arg("name"), py::arg("inputs"));
.def("push", &LLMContext::push, py::arg("name"), py::arg("inputs"))
.def("set_row_mask", &LLMContext::set_row_mask, py::arg("row_mask"))
.def("has_row_mask", &LLMContext::has_row_mask)
.def("get_row_mask", &LLMContext::get_row_mask);

py::class_<LLMNodeBase, PyLLMNodeBase<>, std::shared_ptr<LLMNodeBase>>(_module, "LLMNodeBase")
.def(py::init_alias<>())
Expand Down
29 changes: 29 additions & 0 deletions morpheus/_lib/src/llm/llm_context.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -224,4 +224,33 @@ const mrc::pymrc::JSONValues& LLMContext::view_outputs() const
return m_outputs;
}

void LLMContext::set_row_mask(std::vector<bool>&& row_mask)
{
if (m_parent)
{
return m_parent->set_row_mask(std::move(row_mask));
}

m_state->row_mask = std::move(row_mask);
}

bool LLMContext::has_row_mask() const
{
if (m_parent)
{
return m_parent->has_row_mask();
}

return !m_state->row_mask.empty();
}

const std::vector<bool>& LLMContext::get_row_mask() const
{
if (m_parent)
{
return m_parent->get_row_mask();
}
return m_state->row_mask;
}

} // namespace morpheus::llm
5 changes: 5 additions & 0 deletions morpheus/_lib/src/messages/meta.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,11 @@ TableInfo MessageMeta::get_info() const
return this->m_data->get_info();
}

py::object MessageMeta::get_py_object() const
{
return this->m_data->get_py_object();
}

TableInfo MessageMeta::get_info(const std::string& col_name) const
{
auto full_info = this->m_data->get_info();
Expand Down
41 changes: 12 additions & 29 deletions morpheus/_lib/src/messages/multi.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -264,50 +264,33 @@ std::vector<std::string> MultiMessageInterfaceProxy::get_meta_column_names(const

pybind11::object MultiMessageInterfaceProxy::get_meta(MultiMessage& self)
{
// Need to release the GIL before calling `get_meta()`
pybind11::gil_scoped_release no_gil;

// Get the column and convert to cudf
auto info = self.get_meta();

// Convert to a python datatable. Automatically gets the GIL
return CudfHelper::table_from_table_info(info);
return MultiMessageInterfaceProxy::get_meta(self, std::vector<std::string>{});
}

pybind11::object MultiMessageInterfaceProxy::get_meta(MultiMessage& self, std::string col_name)
{
TableInfo info;

{
// Need to release the GIL before calling `get_meta()`
pybind11::gil_scoped_release no_gil;

// Get the column and convert to cudf
info = self.get_meta();
}

auto py_table = CudfHelper::table_from_table_info(info);

// Now convert it to a series by selecting only the column
return py_table[col_name.c_str()];
return MultiMessageInterfaceProxy::get_meta(self)[col_name.c_str()];
}

pybind11::object MultiMessageInterfaceProxy::get_meta(MultiMessage& self, std::vector<std::string> columns)
{
// Need to release the GIL before calling `get_meta()`
pybind11::gil_scoped_release no_gil;
pybind11::object df = self.meta->get_py_object();

auto row_indexer = pybind11::slice(
pybind11::int_(self.mess_offset), pybind11::int_(self.mess_offset + self.mess_count), pybind11::none());

// Get the column and convert to cudf
auto info = self.get_meta(columns);
if (columns.empty())
{
return df.attr("iloc")[row_indexer];
}

// Convert to a python datatable. Automatically gets the GIL
return CudfHelper::table_from_table_info(info);
return df.attr("iloc")[row_indexer][py::cast(columns)];
}

pybind11::object MultiMessageInterfaceProxy::get_meta(MultiMessage& self, pybind11::none none_obj)
{
// Just offload to the overload without columns. This overload is needed to match the python interface
return MultiMessageInterfaceProxy::get_meta(self);
return MultiMessageInterfaceProxy::get_meta(self, std::vector<std::string>{});
}

pybind11::object MultiMessageInterfaceProxy::get_meta_list(MultiMessage& self, pybind11::object col_name)
Expand Down
Loading
Loading