
Conversation

elokrainz commented Oct 23, 2025

Summary:
Because torch.index_select performs atomic adds in its backward pass, its backward performance is sometimes worse than torch.gather's. This diff gives users control over the indexing operation so they can select the operator that suits their specific case.
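For dim=0 row selection the two operators can produce identical forward results, differing only in how the indices are shaped and how gradients are scattered in backward. A minimal sketch of the equivalence (shapes mirror the benchmark configs below; not code from this diff):

```python
import torch

x = torch.randn(1_000_000, 256, device="cuda", requires_grad=True)
idx = torch.randint(0, x.size(0), (100_000,), device="cuda")

# index_select: backward scatters gradients into x with atomic adds,
# which gets slower as index repetition grows
out_select = torch.index_select(x, 0, idx)

# gather: the same forward result for dim=0 row selection, but the
# indices must first be expanded to the full output shape
idx_2d = idx.unsqueeze(1).expand(-1, x.size(1))
out_gather = torch.gather(x, 0, idx_2d)

assert torch.equal(out_select, out_gather)
```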

Perf comparison on pure operators (forward + backward)

2D Embedding, No Repetition
Config: shape=(1000000, 256), dim=0, indices=100000, unique=95300 (95.3%)
Method                    Time (s)     Speedup    Status
torch.gather              0.9439       1.00x      🏆
torch.index_select        1.0509       0.90x

2D Embedding, Low Repetition
Config: shape=(1000000, 256), dim=0, indices=100000, unique=48732 (48.7%)
Method                    Time (s)     Speedup    Status
torch.gather              0.9076       1.00x      🏆
torch.index_select        1.0415       0.87x

2D Embedding, High Repetition
Config: shape=(1000000, 256), dim=0, indices=250000, unique=9957 (4.0%)
Method                    Time (s)     Speedup    Status
torch.gather              1.2385       1.00x      🏆
torch.index_select        1.6225       0.76x

Small Vocab, Low Repetition
Config: shape=(1000, 256), dim=0, indices=2000, unique=635 (31.8%)
Method                    Time (s)     Speedup    Status
torch.gather              0.1502       1.00x      🏆
torch.index_select        0.1763       0.85x

Small Vocab, Very High Repetition
Config: shape=(1000, 256), dim=0, indices=100000, unique=625 (0.6%)
Method                    Time (s)     Speedup    Status
torch.gather              0.2626       1.00x      🏆
torch.index_select        0.4126       0.64x

Large Vocab, No Repetition
Config: shape=(10000000, 256), dim=0, indices=10000, unique=9996 (100.0%)
Method                    Time (s)     Speedup    Status
torch.gather              5.8014       1.00x      🏆
torch.index_select        5.8184       1.00x

Large Vocab, Low Repetition
Config: shape=(10000000, 256), dim=0, indices=10000, unique=5000 (50.0%)
Method                    Time (s)     Speedup    Status
torch.gather              5.7912       1.00x      🏆
torch.index_select        5.8137       1.00x

Large Vocab, High Repetition
Config: shape=(10000000, 256), dim=0, indices=10000, unique=400 (4.0%)
Method                    Time (s)     Speedup    Status
torch.gather              5.7784       1.00x      🏆
torch.index_select        5.8100       0.99x
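
The benchmark script itself isn't included in this PR; a rough sketch of how one of the high-repetition configs above could be timed (forward + backward, illustrative names and iteration count):

```python
import torch
from torch.utils.benchmark import Timer

def bench(label, fn, x, idx, iters=100):
    # Time forward + backward together, as in the tables above.
    def step():
        out = fn(x, idx)
        out.sum().backward()
        x.grad = None  # reset the accumulated gradient between runs
    t = Timer(stmt="step()", globals={"step": step}).timeit(iters)
    print(f"{label:<20} {t.mean:.4f} s")

x = torch.randn(1_000_000, 256, device="cuda", requires_grad=True)
# ~4% unique indices, mirroring the "2D Embedding, High Repetition" config
idx = torch.randint(0, 10_000, (250_000,), device="cuda")

bench("torch.gather",
      lambda x, i: torch.gather(x, 0, i.unsqueeze(1).expand(-1, x.size(1))), x, idx)
bench("torch.index_select",
      lambda x, i: torch.index_select(x, 0, i), x, idx)
```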

Differential Revision: D85309309

meta-cla bot added the CLA Signed label Oct 23, 2025
meta-codesync bot commented Oct 23, 2025

@elokrainz has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85309309.

elokrainz pushed a commit to elokrainz/torchrec that referenced this pull request Oct 24, 2025
Summary:

Because torch.index_select performs atomic adds in its backward pass, its backward performance is sometimes worse than torch.gather's. This diff gives users control over the indexing operation so they can select the operator that suits their specific case.


Perf comparison on pure operators (forward + backward)

2D Embedding, No Repetition
Config: shape=(1000000, 256), dim=0, indices=100000, unique=95300 (95.3%)
Method                    Time (s)     Speedup    Status
torch.gather              0.9439       1.00x      🏆
torch.index_select        1.0509       0.90x

2D Embedding, Low Repetition
Config: shape=(1000000, 256), dim=0, indices=100000, unique=48732 (48.7%)
Method                    Time (s)     Speedup    Status
torch.gather              0.9076       1.00x      🏆
torch.index_select        1.0415       0.87x

2D Embedding, High Repetition
Config: shape=(1000000, 256), dim=0, indices=250000, unique=9957 (4.0%)
Method                    Time (s)     Speedup    Status
torch.gather              1.2385       1.00x      🏆
torch.index_select        1.6225       0.76x

Small Vocab, Low Repetition
Config: shape=(1000, 256), dim=0, indices=2000, unique=635 (31.8%)
Method                    Time (s)     Speedup    Status
torch.gather              0.1502       1.00x      🏆
torch.index_select        0.1763       0.85x

Small Vocab, Very High Repetition
Config: shape=(1000, 256), dim=0, indices=100000, unique=625 (0.6%)
Method                    Time (s)     Speedup    Status
torch.gather              0.2626       1.00x      🏆
torch.index_select        0.4126       0.64x

Large Vocab, No Repetition
Config: shape=(10000000, 256), dim=0, indices=10000, unique=9996 (100.0%)
Method                    Time (s)     Speedup    Status
torch.gather              5.8014       1.00x      🏆
torch.index_select        5.8184       1.00x

Large Vocab, Low Repetition
Config: shape=(10000000, 256), dim=0, indices=10000, unique=5000 (50.0%)
Method                    Time (s)     Speedup    Status
torch.gather              5.7912       1.00x      🏆
torch.index_select        5.8137       1.00x

Large Vocab, High Repetition
Config: shape=(10000000, 256), dim=0, indices=10000, unique=400 (4.0%)
Method                    Time (s)     Speedup    Status
torch.gather              5.7784       1.00x      🏆
torch.index_select        5.8100       0.99x



Mast Job Test:
baseline: fire-jingchang-f816557933
torch.index_select backward takes ~37ms
{F1982939713}

exp: fire-jingchang-f816355728
torch.gather backward takes ~10ms
{F1982939742}

Reviewed By: TroyGarden

Differential Revision: D85309309
elokrainz pushed a commit to elokrainz/torchrec that referenced this pull request Oct 24, 2025
Summary:
Pull Request resolved: meta-pytorch#3479