
[Frontend][Tensorflow] Add unique operator #7441

Merged: 16 commits merged into apache:main on Feb 26, 2021

Conversation

ymwangg
Contributor

@ymwangg ymwangg commented Feb 11, 2021

This PR adds the tensorflow unique operator as described in https://www.tensorflow.org/api_docs/python/tf/unique.

I'm not sure I'm following best practices, so comments and suggestions are welcome. @yongwww @kevinthesun @codeislife99

@ymwangg ymwangg changed the title [Relay][Tensorflow] Add unique operator [Frontend][Tensorflow] Add unique operator Feb 11, 2021
@masahi masahi self-assigned this Feb 11, 2021
@masahi
Member

masahi commented Feb 11, 2021

Thanks, I was planning to work on unique next week, happy to collaborate.

I can add a TIR unique impl for both CPU and GPU later. We can add the Relay boilerplate, a temporary impl in C++, and tests in this PR.

@ymwangg
Contributor Author

ymwangg commented Feb 11, 2021

Thanks, I was planning to work on unique next week, happy to collaborate.

I can add a TIR unique impl for both CPU and GPU later. We can add the Relay boilerplate, a temporary impl in C++, and tests in this PR.

That would be great!

@masahi
Member

masahi commented Feb 11, 2021

@ymwangg For a general op like unique, we should follow the numpy API rather than being too specific to TF. PyTorch unique should be supported by the same API; framework-specific details should go into the frontend.

Numpy and PyTorch support a dim argument to do unique on multidimensional input, but I don't think it's a good idea. So restricting to 1D, at least for the first implementation, sounds good to me.

We can implement unique via sorting and cumsum (without a hash table). Implemented this way, the same code works on both CPU and GPU. That's what I'm planning to do, but if you feel brave, you can try it in this PR 🙂 It is likely not going to be faster than a hash-table-based implementation, since it requires multiple passes over the input, but it could be useful when the hash-based impl cannot be used for some reason.

@ymwangg
Contributor Author

ymwangg commented Feb 17, 2021

@masahi Thanks for your comment.
Here's the algorithm that I came up with based on your suggestions.

import numpy as np

# topi
def unique(data, data_sorted, data_argsorted):
    output = [0] * len(data)
    count = [0] * len(data)
    first_occurrence = [len(data)] * len(data)
    inverse_indices = [0] * len(data)
    num_unique = 0
    # ir_builder
    for i in range(len(data)):
        if i == 0 or data_sorted[i] != data_sorted[i-1]:
            num_unique += 1
            output[num_unique-1] = data_sorted[i]
            first_occurrence[num_unique-1] = min(first_occurrence[num_unique-1], data_argsorted[i])
        count[num_unique-1] += 1
        inverse_indices[data_argsorted[i]] = num_unique - 1
    return output, count, first_occurrence, inverse_indices, num_unique

# tf front end
def tf_unique(data):
    output, count, first_occurrence, inverse_indices, num_unique = unique(data, np.sort(data), np.argsort(data))
    sorted_occurrence_indices = np.argsort(first_occurrence) # relay.argsort
    new_output = [output[sorted_occurrence_indices[i]] for i in range(num_unique)] # relay.take
    index_converter = np.argsort(sorted_occurrence_indices) # relay.argsort
    new_inverse_indices = [index_converter[i] for i in inverse_indices] # relay.take
    return new_output, new_inverse_indices

It defines a topi function similar to np.unique, but one that takes the sorted data and the argsort of the data as inputs. In the frontend, we need to argsort twice if we want to keep the unique elements in the order of their first occurrence, as illustrated below.
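
A quick check of the pseudocode with plain NumPy (the input values are made up; the expected results follow tf.unique's documented behavior):

data = [4, 1, 4, 9, 1]
new_output, new_inverse_indices = tf_unique(data)
# new_output: [4, 1, 9] -- unique values in order of first occurrence
# new_inverse_indices: [0, 1, 0, 2, 1] -- same as tf.unique's idx output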

Does this look good to you?

@masahi
Member

masahi commented Feb 17, 2021

It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan; if that statement isn't clear, the following example should help. We have exclusive scan for both CPU (the cumsum op with exclusive=True) and GPU (see #7303).

If we implement unique this way, the same code runs on both CPU and GPU.

import numpy as np


def exclusive_scan(arr):
    return np.cumsum(arr) - arr


inp = np.random.randint(0, 10, size=(15,))
argsort_indices = np.argsort(inp)
sorted_inp = np.array([inp[i] for i in argsort_indices])
print("sorted input:", sorted_inp)

adj_diff = np.concatenate([[1],  np.diff(sorted_inp)])
print("adjacent difference:", adj_diff)

non_zero = adj_diff != 0
ex_scan = exclusive_scan(non_zero)
print("exclusive scan:", ex_scan)

unique = np.zeros(inp.shape[0], dtype=np.int64)

for i in range(inp.shape[0]):
    if non_zero[i] != 0:
        unique[ex_scan[i]] = inp[argsort_indices[i]]

print("num unique element:", ex_scan[-1] + 1)
print("unique:", unique)

Output:

sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
adjacent difference: [1 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
num unique elements: 7
unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]

@codeislife99
Contributor

Hey @masahi, can your example be extended to provide counts as well? https://www.tensorflow.org/api_docs/python/tf/unique_with_counts

@masahi
Member

masahi commented Feb 17, 2021

Yes, it's possible but a bit complicated. PyTorch also has a return_counts option: https://pytorch.org/docs/stable/generated/torch.unique.html

I think for the first PR, not all options need to be implemented. We can follow up later.

I'm using the PyTorch GPU impl as a reference; see, for example, the code below on how they support counts:
https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68

@codeislife99
Contributor

I see. I was interested in the counts option, and you might be as well, because SparseSegmentSqrtN and other variants like SparseLengthSum / EmbeddingBag can be written as a combination of take, expand, repeat, scatter_add, and unique_count (roughly sketched below). So I was interested in whether we could do this either in this PR or in a parallel PR (and later merge them).
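
A rough NumPy illustration of that composition claim (hypothetical shapes and values; np.unique with return_counts stands in for the proposed count support, and np.add.at plays the role of scatter_add):

import numpy as np

table = np.arange(12.0).reshape(4, 3)  # toy embedding table
indices = np.array([0, 2, 3, 1])       # rows to gather (take)
segment_ids = np.array([0, 0, 1, 1])   # output row each gathered row adds to

rows = table[indices]                  # take
out = np.zeros((2, 3))
np.add.at(out, segment_ids, rows)      # scatter_add: per-segment sum
counts = np.unique(segment_ids, return_counts=True)[1]
out = out / np.sqrt(counts)[:, None]   # SparseSegmentSqrtN-style normalization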

@ymwangg
Contributor Author

ymwangg commented Feb 17, 2021

@masahi Thanks for the explanation, it is very helpful!
It looks like the main thing we need to do is to implement a topi.adjacent_difference op similar to thrust::adjacent_difference. Then in the frontend we'd do something like:

sorted_data = relay.sort(data)
argsort_indices = relay.argsort(data)
adj_diff = relay.adjacent_difference(sorted_data, first_value=0, op="not_equal")
ex_scan = relay.cumsum(adj_diff, exclusive=True)
inverse_indices = relay.scatter(data, argsort_indices, ex_scan)
unique = relay.scatter(data, ex_scan, sorted_data)
unique_sliced = relay.strided_slice(unique, [0], relay.take(ex_scan,[-1]), slice_mode="size")
return unique_sliced, inverse_indices

I saw PyTorch uses thrust::unique to get the unique array; I think we can use relay.scatter to do the same thing (a NumPy mock-up follows below).

To support counting, it looks like we need to implement a topi.unique_by_key op similar to thrust::unique_by_key. Maybe we can do that in a different PR and focus on adjacent_difference in this one.
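
Here is a rough NumPy mock-up of the scatter-based composition (the relay ops named in the comments are only proposed at this point, so this is an emulation under assumptions; note the per-element ids correspond to an inclusive scan of the not_equal flags):

import numpy as np

data = np.array([4, 1, 4, 9, 1])
argsort_indices = np.argsort(data)   # relay.argsort
sorted_data = data[argsort_indices]  # relay.sort
# adjacent difference with a "not_equal" op and first_value=0
not_equal = np.concatenate([[0], sorted_data[1:] != sorted_data[:-1]])
ids = np.cumsum(not_equal)           # unique id of each element in sorted order

inverse_indices = np.zeros_like(data)
inverse_indices[argsort_indices] = ids  # relay.scatter
unique = np.zeros_like(data)
unique[ids] = sorted_data               # relay.scatter; duplicates write the same value
num_unique = ids[-1] + 1                # analogous to relay.take(ex_scan, [-1]) for the slice size

print(unique[:num_unique])  # [1 4 9]
print(inverse_indices)      # [1 0 1 2 0]

Unlike tf.unique, the ids here follow sorted order; the double-argsort reordering discussed earlier maps them to first-occurrence order.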

@masahi
Member

masahi commented Feb 17, 2021

For your first implementation, the combination-based approach is OK. But unique is important enough that I think it deserves its own operator. An implementation directly in ir builder will also likely be faster, and supporting other options will be easier if we write it in ir builder.

So use ir builder if you are comfortable with it; otherwise a combination of relay ops is fine. Performance + support for options can be done later (by me).

Don't worry about unique_by_key. Last time I checked the pytorch implementation, I concluded that we can do everything pytorch does via ir builder.

@masahi
Member

masahi commented Feb 20, 2021

Looks good 👍 GPU is not supported, right?

@masahi
Member

masahi commented Feb 20, 2021

Can you also add the pytorch frontend? Not all options need to be supported; it's likely the same as the tf conversion.

@ymwangg
Contributor Author

ymwangg commented Feb 20, 2021

@masahi Yeah, I only added the CPU version in this PR. I'm not very familiar with the GPU IR yet, but I can do it later. If the overall structure looks good, I can add unique_with_counts, since their implementations are very similar.

I'll add the pytorch frontend in this PR.

@masahi
Member

masahi commented Feb 20, 2021

I can do the GPU version. It will likely require ir builder, but let me know if you want to do GPU as well; you can certainly do it. The idea is identical to the CPU version, just with different parallelization.

If unique_with_counts can be supported by adding another option to unique, that sounds good. We shouldn't add a separate relay.unique_with_counts or topi.unique_with_counts.

@ymwangg
Contributor Author

ymwangg commented Feb 20, 2021

@masahi I added the return_counts option for the topi.unique operator, and I also added the pytorch frontend. Interestingly, it looks like pytorch returns unique elements in random order when sorted=False.

I'll work on the GPU version of unique next week.

@ymwangg
Contributor Author

ymwangg commented Feb 23, 2021

@masahi I added the GPU version and it's ready for review.

@masahi
Member

masahi commented Feb 23, 2021

@ymwangg @codeislife99 I found a neat trick PyTorch uses for counts: https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68

Basically, after you get the exclusive scan, instead of copying from the original input you copy from an array [0, 1, 2, ...]. This gives you something like [0, 2, 5], and taking the adjacent difference of it (with the input length appended) directly gives the counts. Does this make sense? It should be much faster than atomics.
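
A minimal NumPy sketch of that trick, continuing the style of the earlier example (input values are made up; first_pos plays the role of the scattered [0, 1, 2, ...] array):

import numpy as np

def exclusive_scan(arr):
    return np.cumsum(arr) - arr

inp = np.array([5, 0, 0, 6, 6, 9, 6])
sorted_inp = np.sort(inp)
adj_diff = np.concatenate([[1], np.diff(sorted_inp)])
non_zero = adj_diff != 0
ex_scan = exclusive_scan(non_zero)
num_unique = ex_scan[-1] + 1

# Scatter the positions [0, 1, 2, ...] instead of the input values:
# first_pos[j] is where the j-th unique element first appears in sorted_inp.
first_pos = np.zeros(num_unique, dtype=np.int64)
for i in range(inp.shape[0]):
    if non_zero[i]:
        first_pos[ex_scan[i]] = i

# Adjacent difference (with the array length appended) yields the counts.
counts = np.diff(np.concatenate([first_pos, [inp.shape[0]]]))
print(first_pos)  # [0 2 3 6]
print(counts)     # [2 1 3 1] -- counts for the unique values [0 5 6 9]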

@ymwangg
Contributor Author

ymwangg commented Feb 24, 2021

@masahi thanks. I'll try using arange and adjacent_difference to compute the counts rather than counting with atomic adds.

Member

@masahi masahi left a comment


LGTM 👍

@masahi masahi merged commit 09b0c8e into apache:main Feb 26, 2021
@masahi
Member

masahi commented Feb 26, 2021

Thanks @ymwangg @codeislife99, this is really great work!

@ymwangg
Contributor Author

ymwangg commented Feb 26, 2021

@masahi thanks for making this such an interesting project!

Lokiiiiii pushed a commit to Lokiiiiii/tvm that referenced this pull request Mar 2, 2021
* Initial commit of the unique operator

Add unit tests for unique operator

* Add tensorflow unique op

* Refactor unique to use sort-based algorithm

* Change relay.unique test to run only on cpu

* Change topi.unique test to run only on cpu

* Change range to parallel for parallelizable loops

* Add return_counts option for relay.unique and topi.unique, add pytorch frontend

* Fix pylint

* Patch pytorch frontend

* Initial support of topi.cuda.unique

* Refactor to use ir_builder directly

* Modularize adjacent difference

* Refactor to simplify

* Fix typo

* Combine _unique and _unique_with_counts

* Reuse indices_ptr to remove arange_ptr

Co-authored-by: Yanming Wang <yanmwang@amazon.com>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Mar 2, 2021