Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

main cuml umap #397

Closed
wants to merge 9 commits into from
Closed

main cuml umap #397

wants to merge 9 commits into from

Conversation

dcolinmorgan
Copy link
Contributor

  • a few stray prints (couldnt get logger to work) but good for testing
  • now main branch has cuml umap with g.umap(engine='cuml'), otherwise defaults to umap-learn (also takes that as engine arg)

Copy link
Contributor

@lmeyerov lmeyerov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • engine : UMAPEngine (public interfaces) + engine_resolved : UMAPEngineCOncrete (internal calls) so users can pick which engine, default to "auto". Match how featurize() does it.
  • print -> logger
  • un-delete feature threading through params (needed for memoization)

In a follow-on PR, let's add an ipynb demo!

graphistry/umap_utils.py Outdated Show resolved Hide resolved
graphistry/umap_utils.py Outdated Show resolved Hide resolved
@@ -119,10 +132,12 @@ def umap_lazy_init(
negative_sample_rate=5,
n_components: int = 2,
metric: str = "euclidean",
# engine: str = "auto",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need this b/c dicates which gets init'd below

@@ -134,6 +149,7 @@ def umap_lazy_init(
local_connectivity=local_connectivity,
repulsion_strength=repulsion_strength,
negative_sample_rate=negative_sample_rate,
# engine=engine,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bring back all these engine calls, we need


# self._umap = umap.UMAP(**umap_kwargs)
# if self.engine == 'cuml':
if 'cuml' in sys.modules:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def concrete_umap_engine(engine):
   if engine == "auto":
      if 'cuml' in ...

and then here:

def lazy_init(engine : UMAPEngine):

    engine_concrete : UMAPEngineConcrete = concretize_umap_engine(engine)

   if engine_concrete == 'cuml':
   ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume 'conrete_umap_engine' here == 'resolve_umap_engine' you described above? the example feature_util has no 'conrete_feature_engine' analog

@@ -181,6 +210,8 @@ def umap_fit(self, X: pd.DataFrame, y: Union[pd.DataFrame, None] = None):
mins = (time() - t) / 60
logger.info(f"-UMAP-ing took {mins:.2f} minutes total")
logger.info(f" - or {X.shape[0]/mins:.2f} rows per minute")
print(f"-UMAP-ing took {mins:.2f} minutes total")
print(f" - or {X.shape[0]/mins:.2f} rows per minute")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert all printf before final landing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# res._umap = umap.UMAP(**umap_kwargs)

if 'cuml' in sys.modules:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass in engine selection; should be concretized by this point

graphistry/umap_utils.py Outdated Show resolved Hide resolved
# engine='umap-learn'
print(engine)
has_umap=assert_imported(engine)
self.umap_lazy_init(has_umap=has_umap)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

engine_resolved = resolve_umap)engine(engine)

"`pip install cuml`")
raise import_cuml_exn

UMAPEngineConcrete = Literal["none", "cuml", "umap_learn"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"none" seems like an illegal value in this case -- how do we do a umap w/ no engine?

(it's valid for feature engineering b/c that means a no-op passthrough of features)

self.engine = engine

if engine_resolved == "umap_learn":
del umap_kwargs['engine']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confusing for dels to be here, can be part of umap_kwargs construction

self.umap_initialized = True
self.engine = engine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save resolved engine here

@@ -361,7 +434,10 @@ def umap(
:return: self, with attributes set with new data
"""
assert_imported()
assert_imported_cuml()
self.umap_lazy_init()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass engine

self.umap_lazy_init()
engine_resolved = resolve_umap_engine(engine)


self.suffix = suffix
Copy link
Contributor

@lmeyerov lmeyerov Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all assignments should be on self, and after if inplace: res = self else res = self.bind()... includes lazy init which seems to mutate a bunch too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i missed that one point and have changed it in the wimpiest commit ever

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah guess I was unclear

engine and suffix should be set on res , currently defined like 430, not mutate the original self

For end-user simplicity and internal optimizations (memoization, ...), an important guarantee is the API is functional/pure

Copy link
Contributor

@lmeyerov lmeyerov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes:

  • ensure def umap(self, ...) immediately switches to res = .., so subsequent calls do not mutate original self (lazy_init...) in typical case of in_place=False. RN, I think it'll self._umap = ... .
  • memoization seems to be dropping impactful params like metric & engine

Recommended:

  • surprisingly typed (and unnecessary?) has_umap : str = "True"
  • notes on ipynb
  • add to CHANGELOG and to README

Follow-on PR?

  • if input is a pandas df yet engine is cuml, can we auto-coerce pdf to cudf (and keep cudf after)?

@@ -167,28 +160,30 @@ def umap_lazy_init(
negative_sample_rate=5,
n_components: int = 2,
metric: str = "euclidean",
engine: str = "auto", #this seems to override user input
# has_umap: str = "True",
has_umap: str = "True",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of has_umap? Or at least make a bool?

I think it is there just for throwing exns when no engine, so we can do those early vs propagating

engine_resolved = resolve_umap_engine(engine)

self.engine=engine_resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more on a future PR, we can probably automate pandas->cudf (and back) when the engine is cuml

self.umap_lazy_init()
engine_resolved = resolve_umap_engine(engine)


self.suffix = suffix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah guess I was unclear

engine and suffix should be set on res , currently defined like 430, not mutate the original self

For end-user simplicity and internal optimizations (memoization, ...), an important guarantee is the API is functional/pure

inplace: bool = False,
feature_engine: str = "auto",
memoize: bool = True,
**featurize_kwargs,
):
"""
UMAP the featurized node or edges data,
or pass in your own X, y (optional).
or pass in your own fX, y (optional).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

umap_kwargs = dict(
n_components=n_components,
metric=metric,
# metric=metric,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove metric?

if a problem for cuml, we can throw a targeted exn if engine is cuml

Copy link
Contributor Author

@dcolinmorgan dcolinmorgan Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I can tell this flag is still an open issue, with some hacks to get it working rapidsai/cuml#1653 (comment). From what I can tell there is no such flag listed in the cuml UMAP function https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP and since metric='euclidean' by default in sklearn anyway I thought we could bypass the need for it. although i can see why the user could want to keep the functionality

Copy link
Contributor

@lmeyerov lmeyerov Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was my guess why you did it. it ends up big for umap users, so to support mult engines, a pattern can be like:

opts = {
  'a': 'b',
  'c': 'd',
  **({'x': 'y'} if engine='umap' else {})
}

graphistry/umap_utils.py Show resolved Hide resolved
README.md Outdated

```python
g = graphistry.nodes(G)
g.umap(engine='cuml',metric='euclidean')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g.umap(engine='cum').plot() :)

README.md Show resolved Hide resolved
README.md Outdated
@@ -178,6 +178,13 @@ It is easy to turn arbitrary data into insightful graphs. PyGraphistry comes wit
graphistry.edges(edges, 'src', 'dst').plot()
```

* GPU [RAPIDS.ai](https://www.rapids.ai) cuml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuML

README.md Outdated
@@ -384,6 +391,12 @@ See `help(g.featurize)` for more options
g.umap(kind='nodes', X=['col_1', ..., 'col_n'], y=['label', ..., 'other_targets'], ...)
```

* UMAP supports different backend `engine`, which by default exploit an availabile GPU, resulting in marked speed increases, and falling back to CPU processing:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umap(engine="...") supports multiple implementations. It defaults to using the GPU-accelerated engine="cuml" when a GPU is available, resulting in orders-of-magnitude speedups, and falls back to CPU processing via engine="umap".

"!pip install -U pygraphistry/ --quiet\n",
"\n",
"import graphistry \n",
"graphistry.register(api=3,protocol=\"https\", server=\"hub.graphistry.com\", username='dcolinmorgan', password='f5UwthGEF@F@xnP')\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcolinmorgan can you scrub your pwd and reset your hub acct?

@lmeyerov
Copy link
Contributor

FYI, a bunch of fixes on main branch in case your CI has a lot of false positives. Still a false positive from py3.10 around mypi, but that should be it.

@dcolinmorgan dcolinmorgan closed this by deleting the head repository Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants