Dot script changes #159

Merged
merged 32 commits into eric-haibin-lin:sparse on Aug 16, 2017

Conversation


@anirudh2290 anirudh2290 commented Aug 10, 2017

I have refactored the dot benchmarking script. I ran the benchmarks on the kdda, avazu, and criteo real datasets, as well as on synthetic datasets with uniform and power-law distributions. The benchmarking results are posted below.

@eric-haibin-lin @reminisce @stefanhenneking @cjolivier01 @madjam

Summary

For the kdda dataset, the speedup is highest for dot(csr, default), followed by dot(csr, rsp) and then dot(csr^T, default).
The same pattern can be observed for the avazu and criteo datasets. dot(csr^T, default) is much slower than the other two
operations, with a speedup one to two orders of magnitude lower.
For the synthetic as well as the real datasets, the speedup for dot(csr, default) keeps decreasing as density increases. The same pattern can be observed for
the other two operations, dot(csr^T, default) and dot(csr, rsp).
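For reference, a minimal sketch of the three operations being benchmarked, assuming the sparse-branch API where mx.nd.dot accepts csr/row_sparse NDArrays and using the rand_ndarray helper that the benchmark script itself uses (shapes here are scaled down for illustration):

import mxnet as mx
from mxnet.test_utils import rand_ndarray  # same helper the benchmark script uses

# Shapes scaled way down from the benchmarked sizes, just to show the three ops.
batch_size, feature_dim, output_dim = 128, 100000, 100

data_csr   = rand_ndarray((batch_size, feature_dim), 'csr', density=0.001)         # sparse data batch
weight_dns = mx.nd.zeros((feature_dim, output_dim))                                # dense weight
weight_rsp = rand_ndarray((feature_dim, output_dim), 'row_sparse', density=0.05)   # row_sparse weight
ograd_dns  = mx.nd.zeros((batch_size, output_dim))                                 # dense "output gradient"

fwd  = mx.nd.dot(data_csr, weight_dns)                    # dot(csr, default)   -> (batch, output)
bwd  = mx.nd.dot(data_csr, ograd_dns, transpose_a=True)   # dot(csr^T, default) -> (feature, output)
out  = mx.nd.dot(data_csr, weight_rsp)                    # dot(csr, rsp)       -> (batch, output)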

Uniform Distribution

For the uniform distribution, dot(csr, default) has a speedup of less than 1 for densities >= 20% with
smaller feature dimensions and densities >= 5% with larger feature dimensions. For dot(csr^T, default), the speedup is less than 1 for densities greater than 0.5%.
For dot(csr, rsp), the speedup is less than 1 for densities >= 5%.

Powerlaw Distribution

For the powerlaw distribution, dot(csr, default) has a speedup of less than 1 for densities >= 2%.
For dot(csr^T, default), the speedup is less than 1 for densities >= 0.5% with smaller feature dimensions and densities >= 2% with larger feature dimensions.
For dot(csr, rsp), the speedup is less than 1 for densities >= 10% with smaller feature dimensions and densities >= 5% with larger
feature dimensions.

Comparison between Uniform and Powerlaw Distributions

For both distributions, the best speedup for dot(csr, default) occurs at 0.1% density: about 34 for the uniform distribution and around 14 for the powerlaw distribution.
The uniform distribution turns out to be much faster in this case because the powerlaw distribution leads to unequal workloads across rows.
Across different combinations of feature_dim, output_dim and batch_size, the uniform distribution shows much better speedup than the powerlaw distribution for dot(csr, default).
For dot(csr^T, default), there is no significant difference between the speedups of the uniform and powerlaw distributions.
For dot(csr, rsp), the speedup is much better for the uniform distribution than for the powerlaw distribution.
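To illustrate the workload imbalance, here is a small, hypothetical sketch (not the script's actual generator) of how the number of non-zeros per row differs between a uniform and a Zipf-like power-law distribution at the same overall density:

import numpy as np

num_rows, num_cols, density = 128, 1_000_000, 0.01
total_nnz = int(num_rows * num_cols * density)

# Uniform: every row carries roughly the same number of non-zeros.
uniform_nnz = np.full(num_rows, total_nnz // num_rows)

# Power law: a handful of rows carry most of the non-zeros, so per-row work is unbalanced.
weights = 1.0 / np.arange(1, num_rows + 1)                     # Zipf-like row weights
powerlaw_nnz = np.round(total_nnz * weights / weights.sum()).astype(int)

print(uniform_nnz.max(), uniform_nnz.min())      # nearly equal per-row workloads
print(powerlaw_nnz.max(), powerlaw_nnz.min())    # heavily skewed per-row workloads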

Speedup Comparison between Scipy and MXNet for Uniform Distribution

For dot(csr, default), the speedup obtained by the mxnet dot operator for sparse ndarrays with a uniform distribution is much better than the speedup
obtained by the scipy dot operator with the same feature_dim, output_dim and batch_size. For example, at a csr density of 0.1% with a batch_size of 128, 1M features and an output size of 1000, mxnet achieves a speedup of 33.55 over the dense dot operation, while scipy achieves only 11.26.
A similar pattern holds for other values of batch_size, output_dim and feature_dim, as well as for increasing density. For example, at 20% density with 8M features, a batch_size of 128 and an output_dim of 32, scipy's speedup drops to 0.22, while mxnet's speedup is still 0.28 even at the highest density tested (65%, batch_size: 128, output_dim: 32).
For dot(csr^T, default), the scipy dot operator does much better: the mxnet dot operator achieves a speedup greater than 1 in only 2 cases, while the scipy dot operator achieves a speedup greater than 1 in many more cases.

Speedup Comparison between Scipy and MXNet for Powerlaw Distribution

Overall, the speedup values for dot(csr, default) are better for scipy than for mxnet. Although mxnet's individual runtimes for the sparse dot operation are much better (not an apples-to-apples comparison, since the mxnet dot implementation uses all cores), scipy has better speedups than mxnet in most cases. We see a similar pattern for dot(csr^T, default): the speedups are overall better for the scipy sparse dot operation, but the individual sparse dot operations are faster in mxnet.

Runtime Comparison between MXNet and Scipy

MXNet's runtime, which leverages multiple cores for parallelism, is much better than scipy's.
Below are the average runtime speedups (the mean of t_sparse_scipy / t_sparse_mxnet over the comparison tables) for different batch_sizes, feature_dims and output_dims for the Uniform and Powerlaw distributions. Refer to the tables below for detailed results.

Uniform Distribution

Average runtime speedup dot(csr, default): 15.8
Average runtime speedup dot(csr^T, default): 2.4

Powerlaw Distribution

Average runtime speedup dot(csr, default): 3.4
Average runtime speedup dot(csr^T, default): 1.6
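To make the three speedup figures in the tables below concrete, here is how they relate, using the 0.1% density, 1M-feature, output_dim 1000 rows from the uniform-distribution tables:

t_sparse_mxnet, t_dense_mxnet = 13.63, 457.28   # ms, from the mxnet dot(csr, default) table
t_sparse_scipy, t_dense_scipy = 76.09, 856.90   # ms, from the scipy dot(csr, default) table

mxnet_speedup = t_dense_mxnet / t_sparse_mxnet    # ~33.55, the "speedup" column in the mxnet tables
scipy_speedup = t_dense_scipy / t_sparse_scipy    # ~11.26, the "speedup" column in the scipy tables
runtime_ratio = t_sparse_scipy / t_sparse_mxnet   # ~5.58, the "Speedup" column in the comparison tables
print(mxnet_speedup, scipy_speedup, runtime_ratio)

The average runtime speedups listed above are the mean of runtime_ratio across the rows of the corresponding comparison table.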

Benchmarking Result (Real Dataset)

Hardware: r4.8xlarge

kdda

density(%) batch_size output_dim feature_dim t_dense/t_sparse t_dense(ms) t_sparse(ms) is_transpose rhs_rsp
0.0002 64 1 20216830 573.04 121.38 0.21 0 0
0.0002 64 1 20216830 3.13 141.74 45.23 1 0
0.0002 64 1 20216830 199.53 111.94 0.56 0 1
0.0002 64 8 20216830 1954.38 206.46 0.11 0 0
0.0002 64 8 20216830 3.19 216.52 67.77 1 0
0.0002 64 8 20216830 434.89 249.45 0.57 0 1
0.0002 64 32 20216830 1571.66 171.96 0.11 0 0
0.0002 64 32 20216830 2.14 315.84 147.86 1 0
0.0002 64 32 20216830 588.90 341.58 0.58 0 1
0.0002 64 32 20216830 1601.46 176.80 0.11 0 0
0.0002 64 32 20216830 1.82 268.67 148.02 1 0
0.0002 64 32 20216830 660.38 401.98 0.61 0 1

avazu

density(%) batch_size output_dim feature_dim t_dense/t_sparse t_dense(ms) t_sparse(ms) is_transpose rhs_rsp
0.0015 128 1 1000000 108.95 12.94 0.12 0 0
0.0015 128 1 1000000 4.25 12.39 2.92 1 0
0.0015 128 1 1000000 18.45 11.52 0.62 0 1
0.0015 128 1000 1000000 2528.95 437.05 0.17 0 0
0.0015 128 1000 1000000 1.82 305.99 168.42 1 0
0.0015 128 1000 1000000 974.26 590.94 0.61 0 1
0.0015 128 2000 1000000 4261.91 941.41 0.22 0 0
0.0015 128 2000 1000000 1.87 625.34 334.30 1 0
0.0015 128 2000 1000000 1927.53 1168.51 0.61 0 1
0.0015 128 1000 1000000 2456.45 382.65 0.16 0 0
0.0015 128 1000 1000000 1.83 309.10 168.90 1 0
0.0015 128 2000 1000000 1931.53 1172.55 0.61 0 1
0.0015 256 1000 1000000 2623.19 515.91 0.20 0 0
0.0015 256 1000 1000000 3.65 611.20 167.38 1 0
0.0015 128 2000 1000000 1953.81 1164.68 0.60 0 1

criteo

density(%) batch_size output_dim feature_dim t_dense/t_sparse t_dense(ms) t_sparse(ms) is_transpose rhs_rsp
0.0004 128 1 8388621 841.17 89.54 0.11 0 0
0.0004 128 1 8388621 4.88 109.99 22.55 1 0
0.0004 128 1 8388621 26.91 104.99 3.90 0 1
0.0004 128 8 8388621 1268.97 143.42 0.11 0 0
0.0004 128 8 8388621 4.25 130.48 30.72 1 0
0.0004 128 8 8388621 39.90 146.97 3.68 0 1
0.0004 128 16 8388621 1733.79 191.58 0.11 0 0
0.0004 128 16 8388621 4.60 191.21 41.58 1 0
0.0004 128 16 8388621 65.68 250.90 3.82 0 1
0.0004 128 32 8388621 1931.13 226.05 0.12 0 0
0.0004 128 32 8388621 2.87 183.80 64.12 1 0
0.0004 128 32 8388621 60.88 223.33 3.67 0 1
0.0004 128 64 8388621 1862.07 208.08 0.11 0 0
0.0004 128 64 8388621 2.53 272.53 107.89 1 0
0.0004 128 64 8388621 87.61 340.99 3.89 0 1
0.0004 64 32 8388621 781.79 90.40 0.12 0 0
0.0004 64 32 8388621 2.08 132.15 63.44 1 0
0.0004 128 64 8388621 89.49 339.97 3.80 0 1
0.0004 128 32 8388621 1343.73 152.76 0.11 0 0
0.0004 128 32 8388621 3.39 213.96 63.10 1 0
0.0004 128 64 8388621 93.88 357.50 3.81 0 1

Benchmarking Result (Synthetic Dataset)

Uniform Distribution

mxnet sparse dot benchmark: dot(csr, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 34.36 127.76 3.72
1.0 100.0 cpu(0) 128 1000000 1000 90.97 456.53 5.02
1.0 100.0 cpu(0) 128 1000000 1000 90.40 612.76 6.78
1.0 100.0 cpu(0) 64 1000000 1000 46.79 233.11 4.98
1.0 100.0 cpu(0) 128 1000000 1000 91.23 651.58 7.14
0.1 100.0 cpu(0) 128 1000000 1000 13.63 457.28 33.55
0.5 100.0 cpu(0) 128 1000000 1000 50.97 490.24 9.62
1.0 100.0 cpu(0) 128 1000000 1000 92.12 405.94 4.41
2.0 100.0 cpu(0) 128 1000000 1000 153.89 415.90 2.70
5.0 100.0 cpu(0) 128 1000000 1000 237.39 490.02 2.06
10.0 100.0 cpu(0) 128 1000000 1000 337.41 360.70 1.07
20.0 100.0 cpu(0) 128 1000000 1000 542.96 360.44 0.66
50.0 100.0 cpu(0) 128 1000000 1000 1201.81 361.72 0.30
65.0 100.0 cpu(0) 128 1000000 1000 1692.96 384.63 0.23
1.0 100.0 cpu(0) 128 8000000 1 10.33 91.08 8.82
1.0 100.0 cpu(0) 128 8000000 32 40.71 215.09 5.28
1.0 100.0 cpu(0) 128 8000000 32 40.42 214.28 5.30
1.0 100.0 cpu(0) 128 16000000 32 83.71 392.13 4.68
1.0 100.0 cpu(0) 64 8000000 32 20.31 113.74 5.60
1.0 100.0 cpu(0) 128 8000000 32 42.15 216.12 5.13
0.1 100.0 cpu(0) 128 8000000 32 4.68 215.89 46.10
0.5 100.0 cpu(0) 128 8000000 32 21.60 150.56 6.97
1.0 100.0 cpu(0) 128 8000000 32 40.39 199.57 4.94
2.0 100.0 cpu(0) 128 8000000 32 64.22 121.49 1.89
5.0 100.0 cpu(0) 128 8000000 32 122.33 121.23 0.99
10.0 100.0 cpu(0) 128 8000000 32 199.28 124.17 0.62
20.0 100.0 cpu(0) 128 8000000 32 371.97 207.33 0.56
50.0 100.0 cpu(0) 128 8000000 32 659.60 207.04 0.31
65.0 100.0 cpu(0) 128 8000000 32 752.78 207.24 0.28

mxnet sparse dot benchmark: dot(csr^T, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 323.08 147.16 0.46
1.0 100.0 cpu(0) 128 1000000 1000 1143.93 395.79 0.35
1.0 100.0 cpu(0) 128 1000000 1000 1124.62 433.82 0.39
1.0 100.0 cpu(0) 64 1000000 1000 775.82 211.87 0.27
1.0 100.0 cpu(0) 128 1000000 1000 1122.62 472.84 0.42
0.1 100.0 cpu(0) 128 1000000 1000 326.17 452.38 1.39
0.5 100.0 cpu(0) 128 1000000 1000 757.96 368.65 0.49
1.0 100.0 cpu(0) 128 1000000 1000 1103.66 427.88 0.39
2.0 100.0 cpu(0) 128 1000000 1000 1493.98 312.12 0.21
5.0 100.0 cpu(0) 128 1000000 1000 1834.69 300.24 0.16
10.0 100.0 cpu(0) 128 1000000 1000 2841.92 295.42 0.10
20.0 100.0 cpu(0) 128 1000000 1000 4963.78 348.55 0.07
50.0 100.0 cpu(0) 128 1000000 1000 11090.89 339.55 0.03
65.0 100.0 cpu(0) 128 1000000 1000 14015.80 530.20 0.04
1.0 100.0 cpu(0) 128 8000000 1 207.22 119.92 0.58
1.0 100.0 cpu(0) 128 8000000 32 422.96 173.00 0.41
1.0 100.0 cpu(0) 128 8000000 32 420.41 263.78 0.63
1.0 100.0 cpu(0) 128 16000000 32 812.35 365.46 0.45
1.0 100.0 cpu(0) 64 8000000 32 324.94 148.61 0.46
1.0 100.0 cpu(0) 128 8000000 32 421.15 261.64 0.62
0.1 100.0 cpu(0) 128 8000000 32 172.95 257.02 1.49
0.5 100.0 cpu(0) 128 8000000 32 325.56 244.22 0.75
1.0 100.0 cpu(0) 128 8000000 32 420.59 258.96 0.62
2.0 100.0 cpu(0) 128 8000000 32 561.90 262.29 0.47
5.0 100.0 cpu(0) 128 8000000 32 872.31 166.86 0.19
10.0 100.0 cpu(0) 128 8000000 32 1394.09 188.70 0.14
20.0 100.0 cpu(0) 128 8000000 32 2737.59 167.18 0.06
50.0 100.0 cpu(0) 128 8000000 32 4648.24 158.13 0.03
65.0 100.0 cpu(0) 128 8000000 32 5251.28 174.13 0.03

mxnet sparse dot benchmark: dot(csr, rsp)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 5.0 cpu(0) 128 1000000 256 1.50 130.26 86.80
1.0 5.0 cpu(0) 128 1000000 1000 7.90 408.37 51.67
1.0 5.0 cpu(0) 128 1000000 1000 8.02 575.67 71.74
1.0 5.0 cpu(0) 64 1000000 1000 3.69 404.90 109.70
1.0 5.0 cpu(0) 128 1000000 1000 7.61 738.18 96.98
0.1 5.0 cpu(0) 128 1000000 1000 1.90 435.13 228.76
0.5 5.0 cpu(0) 128 1000000 1000 2.78 454.27 163.62
1.0 5.0 cpu(0) 128 1000000 1000 7.09 386.84 54.57
2.0 5.0 cpu(0) 128 1000000 1000 11.65 417.21 35.81
5.0 5.0 cpu(0) 128 1000000 1000 18.32 748.16 40.84
10.0 5.0 cpu(0) 128 1000000 1000 22.26 386.74 17.37
20.0 5.0 cpu(0) 128 1000000 1000 32.97 435.76 13.22
50.0 5.0 cpu(0) 128 1000000 1000 57.46 707.50 12.31
65.0 5.0 cpu(0) 128 1000000 1000 71.22 753.46 10.58
1.0 5.0 cpu(0) 128 8000000 1 9.12 83.51 9.16
1.0 5.0 cpu(0) 128 8000000 32 9.68 121.18 12.52
1.0 5.0 cpu(0) 128 8000000 32 10.23 208.76 20.41
1.0 5.0 cpu(0) 128 16000000 32 21.38 382.26 17.88
1.0 5.0 cpu(0) 64 8000000 32 4.90 74.07 15.12
1.0 5.0 cpu(0) 128 8000000 32 9.71 121.24 12.49
0.1 5.0 cpu(0) 128 8000000 32 4.22 121.23 28.76
0.5 5.0 cpu(0) 128 8000000 32 6.46 121.20 18.75
1.0 5.0 cpu(0) 128 8000000 32 9.79 205.61 21.01
2.0 5.0 cpu(0) 128 8000000 32 16.05 202.98 12.64
5.0 5.0 cpu(0) 128 8000000 32 28.63 137.61 4.81
10.0 5.0 cpu(0) 128 8000000 32 39.29 121.18 3.08
20.0 5.0 cpu(0) 128 8000000 32 55.07 121.09 2.20
50.0 5.0 cpu(0) 128 8000000 32 105.46 121.18 1.15
65.0 5.0 cpu(0) 128 8000000 32 137.26 151.88 1.11

scipy sparse dot benchmark: dot(csr, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 294.42 306.39 1.04
1.0 100.0 cpu(0) 128 1000000 1000 757.86 866.75 1.14
1.0 100.0 cpu(0) 128 1000000 1000 752.21 856.69 1.14
1.0 100.0 cpu(0) 64 1000000 1000 378.16 3540.17 9.36
1.0 100.0 cpu(0) 128 1000000 1000 747.98 857.05 1.15
0.1 100.0 cpu(0) 128 1000000 1000 76.09 856.90 11.26
0.5 100.0 cpu(0) 128 1000000 1000 377.69 856.95 2.27
1.0 100.0 cpu(0) 128 1000000 1000 750.94 857.45 1.14
2.0 100.0 cpu(0) 128 1000000 1000 1492.61 856.06 0.57
5.0 100.0 cpu(0) 128 1000000 1000 3691.37 857.25 0.23
10.0 100.0 cpu(0) 128 1000000 1000 7305.37 857.43 0.12
20.0 100.0 cpu(0) 128 1000000 1000 14391.96 860.41 0.06
50.0 100.0 cpu(0) 128 1000000 1000 33326.49 859.13 0.03
65.0 100.0 cpu(0) 128 1000000 1000 43346.13 857.97 0.02
1.0 100.0 cpu(0) 128 8000000 1 38.82 67.82 1.75
1.0 100.0 cpu(0) 128 8000000 32 690.54 1931.28 2.80
1.0 100.0 cpu(0) 128 8000000 32 695.46 1931.95 2.78
1.0 100.0 cpu(0) 128 16000000 32 1403.05 3870.88 2.76
1.0 100.0 cpu(0) 64 8000000 32 345.57 1069.43 3.09
1.0 100.0 cpu(0) 128 8000000 32 693.35 1931.48 2.79
0.1 100.0 cpu(0) 128 8000000 32 74.12 1913.50 25.81
0.5 100.0 cpu(0) 128 8000000 32 358.13 1916.25 5.35
1.0 100.0 cpu(0) 128 8000000 32 678.78 1917.35 2.82
2.0 100.0 cpu(0) 128 8000000 32 1211.03 1908.81 1.58
5.0 100.0 cpu(0) 128 8000000 32 2982.13 1911.39 0.64
10.0 100.0 cpu(0) 128 8000000 32 5408.13 1910.07 0.35
20.0 100.0 cpu(0) 128 8000000 32 8713.70 1905.89 0.22
50.0 100.0 cpu(0) 128 8000000 32 13011.79 1899.54 0.15
65.0 100.0 cpu(0) 128 8000000 32 14268.57 1904.14 0.13

scipy sparse dot benchmark: dot(csr^T, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 562.64 672.50 1.20
1.0 100.0 cpu(0) 128 1000000 1000 1710.93 1930.76 1.13
1.0 100.0 cpu(0) 128 1000000 1000 1722.20 1932.09 1.12
1.0 100.0 cpu(0) 64 1000000 1000 1307.40 1662.83 1.27
1.0 100.0 cpu(0) 128 1000000 1000 1780.13 1978.76 1.11
0.1 100.0 cpu(0) 128 1000000 1000 973.27 1933.86 1.99
0.5 100.0 cpu(0) 128 1000000 1000 1334.74 1961.99 1.47
1.0 100.0 cpu(0) 128 1000000 1000 1717.18 1930.40 1.12
2.0 100.0 cpu(0) 128 1000000 1000 2554.72 1936.31 0.76
5.0 100.0 cpu(0) 128 1000000 1000 4989.81 1931.02 0.39
10.0 100.0 cpu(0) 128 1000000 1000 9039.25 1929.08 0.21
20.0 100.0 cpu(0) 128 1000000 1000 16908.19 1932.57 0.11
50.0 100.0 cpu(0) 128 1000000 1000 38582.01 1932.06 0.05
65.0 100.0 cpu(0) 128 1000000 1000 48508.11 1933.40 0.04
1.0 100.0 cpu(0) 128 8000000 1 49.72 78.53 1.58
1.0 100.0 cpu(0) 128 8000000 32 867.99 3041.72 3.50
1.0 100.0 cpu(0) 128 8000000 32 863.88 3033.73 3.51
1.0 100.0 cpu(0) 128 16000000 32 1738.52 6232.28 3.58
1.0 100.0 cpu(0) 64 8000000 32 539.29 1817.33 3.37
1.0 100.0 cpu(0) 128 8000000 32 865.95 3040.42 3.51
0.1 100.0 cpu(0) 128 8000000 32 296.71 3038.15 10.24
0.5 100.0 cpu(0) 128 8000000 32 562.37 3042.12 5.41
1.0 100.0 cpu(0) 128 8000000 32 861.22 3041.54 3.53
2.0 100.0 cpu(0) 128 8000000 32 1396.91 3038.00 2.17
5.0 100.0 cpu(0) 128 8000000 32 3287.59 3082.61 0.94
10.0 100.0 cpu(0) 128 8000000 32 6024.43 3081.17 0.51
20.0 100.0 cpu(0) 128 8000000 32 10416.74 3077.87 0.30
50.0 100.0 cpu(0) 128 8000000 32 16211.84 3078.74 0.19
65.0 100.0 cpu(0) 128 8000000 32 16259.50 3029.88 0.19

scipy mxnet comparison: dot(csr, default)

lhs_density rhs_density context batch_size feature_dim output_dim t_sparse_scipy(ms) t_dense_scipy(ms) t_sparse_mxnet(ms) t_dense_mxnet(ms) Speedup = t_sparse_scipy/t_sparse_mxnet
1 100 cpu(0) 128 1000000 256 294.42 306.39 34.36 127.76 8.568684517
1 100 cpu(0) 128 1000000 1000 757.86 866.75 90.97 456.53 8.330878312
1 100 cpu(0) 128 1000000 1000 752.21 856.69 90.4 612.76 8.32090708
1 100 cpu(0) 64 1000000 1000 378.16 3540.17 46.79 233.11 8.082068818
1 100 cpu(0) 128 1000000 1000 747.98 857.05 91.23 651.58 8.198838102
0.1 100 cpu(0) 128 1000000 1000 76.09 856.9 13.63 457.28 5.582538518
0.5 100 cpu(0) 128 1000000 1000 377.69 856.95 50.97 490.24 7.410045125
1 100 cpu(0) 128 1000000 1000 750.94 857.45 92.12 405.94 8.151758576
2 100 cpu(0) 128 1000000 1000 1492.61 856.06 153.89 415.9 9.699200728
5 100 cpu(0) 128 1000000 1000 3691.37 857.25 237.39 490.02 15.54981254
10 100 cpu(0) 128 1000000 1000 7305.37 857.43 337.41 360.7 21.65131442
20 100 cpu(0) 128 1000000 1000 14391.96 860.41 542.96 360.44 26.50648298
50 100 cpu(0) 128 1000000 1000 33326.49 859.13 1201.81 361.72 27.73024854
65 100 cpu(0) 128 1000000 1000 43346.13 857.97 1692.96 384.63 25.60375319
1 100 cpu(0) 128 8000000 1 38.82 67.82 10.33 91.08 3.757986447
1 100 cpu(0) 128 8000000 32 690.54 1931.28 40.71 215.09 16.9624171
1 100 cpu(0) 128 8000000 32 695.46 1931.95 40.42 214.28 17.20583869
1 100 cpu(0) 128 16000000 32 1403.05 3870.88 83.71 392.13 16.760841
1 100 cpu(0) 64 8000000 32 345.57 1069.43 20.31 113.74 17.01477105
1 100 cpu(0) 128 8000000 32 693.35 1931.48 42.15 216.12 16.44958482
0.1 100 cpu(0) 128 8000000 32 74.12 1913.5 4.68 215.89 15.83760684
0.5 100 cpu(0) 128 8000000 32 358.13 1916.25 21.6 150.56 16.58009259
1 100 cpu(0) 128 8000000 32 678.78 1917.35 40.39 199.57 16.80564496
2 100 cpu(0) 128 8000000 32 1211.03 1908.81 64.22 121.49 18.85752102
5 100 cpu(0) 128 8000000 32 2982.13 1911.39 122.33 121.23 24.37774871
10 100 cpu(0) 128 8000000 32 5408.13 1910.07 199.28 124.17 27.13834805
20 100 cpu(0) 128 8000000 32 8713.7 1905.89 371.97 207.33 23.42581391
50 100 cpu(0) 128 8000000 32 13011.79 1899.54 659.6 207.04 19.72678896
65 100 cpu(0) 128 8000000 32 14268.57 1904.14 752.78 207.24 18.95450198

scipy mxnet comparison: dot(csr^T, default)

lhs_density rhs_density context batch_size feature_dim output_dim t_sparse_scipy(ms) t_dense_scipy(ms) t_sparse_mxnet(ms) t_dense_mxnet(ms) Speedup = t_sparse_scipy/t_sparse_mxnet
1 100 cpu(0) 128 1000000 256 562.64 672.5 323.08 147.16 1.741488176
1 100 cpu(0) 128 1000000 1000 1710.93 1930.76 1143.93 395.79 1.495659699
1 100 cpu(0) 128 1000000 1000 1722.2 1932.09 1124.62 433.82 1.531361704
1 100 cpu(0) 64 1000000 1000 1307.4 1662.83 775.82 211.87 1.685184708
1 100 cpu(0) 128 1000000 1000 1780.13 1978.76 1122.62 472.84 1.585692398
0.1 100 cpu(0) 128 1000000 1000 973.27 1933.86 326.17 452.38 2.983934758
0.5 100 cpu(0) 128 1000000 1000 1334.74 1961.99 757.96 368.65 1.760963639
1 100 cpu(0) 128 1000000 1000 1717.18 1930.4 1103.66 427.88 1.555895837
2 100 cpu(0) 128 1000000 1000 2554.72 1936.31 1493.98 312.12 1.710009505
5 100 cpu(0) 128 1000000 1000 4989.81 1931.02 1834.69 300.24 2.719701966
10 100 cpu(0) 128 1000000 1000 9039.25 1929.08 2841.92 295.42 3.180684185
20 100 cpu(0) 128 1000000 1000 16908.19 1932.57 4963.78 348.55 3.406313334
50 100 cpu(0) 128 1000000 1000 38582.01 1932.06 11090.89 339.55 3.478711808
65 100 cpu(0) 128 1000000 1000 48508.11 1933.4 14015.8 530.2 3.46095906
1 100 cpu(0) 128 8000000 1 49.72 78.53 207.22 119.92 0.23993823
1 100 cpu(0) 128 8000000 32 867.99 3041.72 422.96 173 2.052179875
1 100 cpu(0) 128 8000000 32 863.88 3033.73 420.41 263.78 2.054851217
1 100 cpu(0) 128 16000000 32 1738.52 6232.28 812.35 365.46 2.140112021
1 100 cpu(0) 64 8000000 32 539.29 1817.33 324.94 148.61 1.659660245
1 100 cpu(0) 128 8000000 32 865.95 3040.42 421.15 261.64 2.056155764
0.1 100 cpu(0) 128 8000000 32 296.71 3038.15 172.95 257.02 1.715582538
0.5 100 cpu(0) 128 8000000 32 562.37 3042.12 325.56 244.22 1.7273928
1 100 cpu(0) 128 8000000 32 861.22 3041.54 420.59 258.96 2.047647353
2 100 cpu(0) 128 8000000 32 1396.91 3038 561.9 262.29 2.486047339
5 100 cpu(0) 128 8000000 32 3287.59 3082.61 872.31 166.86 3.768832181
10 100 cpu(0) 128 8000000 32 6024.43 3081.17 1394.09 188.7 4.321406796
20 100 cpu(0) 128 8000000 32 10416.74 3077.87 2737.59 167.18 3.805076728
50 100 cpu(0) 128 8000000 32 16211.84 3078.74 4648.24 158.13 3.487737294
65 100 cpu(0) 128 8000000 32 16259.5 3029.88 5251.28 174.13 3.096292713

Powerlaw Distribution

mxnet sparse dot benchmark: dot(csr, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 146.62 129.15 0.88
1.0 100.0 cpu(0) 128 1000000 1000 424.80 740.36 1.74
1.0 100.0 cpu(0) 128 1000000 1000 426.33 544.31 1.28
1.0 100.0 cpu(0) 64 1000000 1000 165.02 408.33 2.47
1.0 100.0 cpu(0) 128 1000000 1000 427.06 714.32 1.67
0.1 100.0 cpu(0) 128 1000000 1000 51.54 730.98 14.18
0.5 100.0 cpu(0) 128 1000000 1000 211.82 692.08 3.27
1.0 100.0 cpu(0) 128 1000000 1000 428.49 454.91 1.06
2.0 100.0 cpu(0) 128 1000000 1000 696.45 536.87 0.77
5.0 100.0 cpu(0) 128 1000000 1000 1449.20 476.58 0.33
10.0 100.0 cpu(0) 128 1000000 1000 1631.12 572.32 0.35
20.0 100.0 cpu(0) 128 1000000 1000 1916.38 363.78 0.19
50.0 100.0 cpu(0) 128 1000000 1000 2325.93 414.93 0.18
65.0 100.0 cpu(0) 128 1000000 1000 2384.55 744.94 0.31
1.0 100.0 cpu(0) 128 8000000 1 63.38 90.12 1.42
1.0 100.0 cpu(0) 128 8000000 32 166.35 213.85 1.29
1.0 100.0 cpu(0) 128 8000000 32 175.26 142.55 0.81
1.0 100.0 cpu(0) 128 16000000 32 337.00 392.15 1.16
1.0 100.0 cpu(0) 64 8000000 32 101.33 112.96 1.11
1.0 100.0 cpu(0) 128 8000000 32 165.73 215.64 1.30
0.1 100.0 cpu(0) 128 8000000 32 16.92 156.07 9.23
0.5 100.0 cpu(0) 128 8000000 32 82.57 171.59 2.08
1.0 100.0 cpu(0) 128 8000000 32 178.22 181.41 1.02
2.0 100.0 cpu(0) 128 8000000 32 319.20 163.89 0.51
5.0 100.0 cpu(0) 128 8000000 32 448.13 133.87 0.30
10.0 100.0 cpu(0) 128 8000000 32 538.82 219.42 0.41
20.0 100.0 cpu(0) 128 8000000 32 634.89 198.67 0.31
50.0 100.0 cpu(0) 128 8000000 32 778.20 138.63 0.18
65.0 100.0 cpu(0) 128 8000000 32 787.64 216.83 0.28

mxnet sparse dot benchmark: dot(csr^T, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 211.46 141.43 0.67
1.0 100.0 cpu(0) 128 1000000 1000 675.08 527.31 0.78
1.0 100.0 cpu(0) 128 1000000 1000 678.83 348.99 0.51
1.0 100.0 cpu(0) 64 1000000 1000 444.87 215.36 0.48
1.0 100.0 cpu(0) 128 1000000 1000 663.76 344.94 0.52
0.1 100.0 cpu(0) 128 1000000 1000 247.62 434.33 1.75
0.5 100.0 cpu(0) 128 1000000 1000 435.31 373.59 0.86
1.0 100.0 cpu(0) 128 1000000 1000 678.43 335.81 0.49
2.0 100.0 cpu(0) 128 1000000 1000 1105.32 529.71 0.48
5.0 100.0 cpu(0) 128 1000000 1000 1710.71 527.61 0.31
10.0 100.0 cpu(0) 128 1000000 1000 2691.18 466.08 0.17
20.0 100.0 cpu(0) 128 1000000 1000 4676.14 396.06 0.08
50.0 100.0 cpu(0) 128 1000000 1000 10621.35 437.99 0.04
65.0 100.0 cpu(0) 128 1000000 1000 13605.84 448.83 0.03
1.0 100.0 cpu(0) 128 8000000 1 127.24 111.08 0.87
1.0 100.0 cpu(0) 128 8000000 32 258.88 261.50 1.01
1.0 100.0 cpu(0) 128 8000000 32 257.73 261.19 1.01
1.0 100.0 cpu(0) 128 16000000 32 512.35 403.70 0.79
1.0 100.0 cpu(0) 64 8000000 32 166.22 117.38 0.71
1.0 100.0 cpu(0) 128 8000000 32 259.65 235.60 0.91
0.1 100.0 cpu(0) 128 8000000 32 94.06 256.92 2.73
0.5 100.0 cpu(0) 128 8000000 32 170.52 257.13 1.51
1.0 100.0 cpu(0) 128 8000000 32 258.10 262.60 1.02
2.0 100.0 cpu(0) 128 8000000 32 429.89 264.77 0.62
5.0 100.0 cpu(0) 128 8000000 32 638.32 199.07 0.31
10.0 100.0 cpu(0) 128 8000000 32 965.87 264.99 0.27
20.0 100.0 cpu(0) 128 8000000 32 1639.92 264.22 0.16
50.0 100.0 cpu(0) 128 8000000 32 3753.93 264.82 0.07
65.0 100.0 cpu(0) 128 8000000 32 4828.76 261.33 0.05

mxnet sparse dot benchmark: dot(csr, rsp)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 5.0 cpu(0) 128 8000000 1 18.79 97.75 5.20
1.0 5.0 cpu(0) 128 8000000 32 25.64 215.40 8.40
1.0 5.0 cpu(0) 128 8000000 32 24.35 167.95 6.90
1.0 5.0 cpu(0) 128 16000000 32 58.14 237.90 4.09
1.0 5.0 cpu(0) 64 8000000 32 10.71 109.15 10.19
1.0 5.0 cpu(0) 128 8000000 32 25.64 188.89 7.37
0.1 5.0 cpu(0) 128 8000000 32 2.62 210.09 80.18
0.5 5.0 cpu(0) 128 8000000 32 10.36 209.87 20.27
1.0 5.0 cpu(0) 128 8000000 32 24.53 208.54 8.50
2.0 5.0 cpu(0) 128 8000000 32 58.01 210.49 3.63
5.0 5.0 cpu(0) 128 8000000 32 81.26 203.26 2.50
10.0 5.0 cpu(0) 128 8000000 32 96.81 212.56 2.20
20.0 5.0 cpu(0) 128 8000000 32 99.95 212.09 2.12
50.0 5.0 cpu(0) 128 8000000 32 130.57 152.60 1.17
65.0 5.0 cpu(0) 128 8000000 32 145.60 198.69 1.36
1.0 5.0 cpu(0) 128 1000000 256 7.38 121.50 16.46
1.0 5.0 cpu(0) 128 1000000 1000 26.43 363.80 13.76
1.0 5.0 cpu(0) 128 1000000 1000 25.86 729.43 28.21
1.0 5.0 cpu(0) 64 1000000 1000 7.65 243.18 31.79
1.0 5.0 cpu(0) 128 1000000 1000 25.80 446.95 17.33
0.1 5.0 cpu(0) 128 1000000 1000 2.62 504.00 192.22
0.5 5.0 cpu(0) 128 1000000 1000 13.73 478.17 34.81
1.0 5.0 cpu(0) 128 1000000 1000 25.81 381.48 14.78
2.0 5.0 cpu(0) 128 1000000 1000 46.92 715.47 15.25
5.0 5.0 cpu(0) 128 1000000 1000 98.15 409.76 4.17
10.0 5.0 cpu(0) 128 1000000 1000 100.63 430.34 4.28
20.0 5.0 cpu(0) 128 1000000 1000 112.29 417.95 3.72
50.0 5.0 cpu(0) 128 1000000 1000 140.14 567.76 4.05
65.0 5.0 cpu(0) 128 1000000 1000 141.44 369.54 2.61

scipy sparse dot benchmark: dot(csr, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 161.17 304.40 1.89
1.0 100.0 cpu(0) 128 1000000 1000 627.08 863.86 1.38
1.0 100.0 cpu(0) 128 1000000 1000 626.75 864.25 1.38
1.0 100.0 cpu(0) 64 1000000 1000 311.86 3681.30 11.80
1.0 100.0 cpu(0) 128 1000000 1000 625.43 865.74 1.38
0.1 100.0 cpu(0) 128 1000000 1000 60.72 863.56 14.22
0.5 100.0 cpu(0) 128 1000000 1000 311.92 854.28 2.74
1.0 100.0 cpu(0) 128 1000000 1000 626.41 857.31 1.37
2.0 100.0 cpu(0) 128 1000000 1000 1247.63 857.63 0.69
5.0 100.0 cpu(0) 128 1000000 1000 3140.50 859.42 0.27
10.0 100.0 cpu(0) 128 1000000 1000 6278.92 856.56 0.14
20.0 100.0 cpu(0) 128 1000000 1000 12515.63 857.40 0.07
50.0 100.0 cpu(0) 128 1000000 1000 31241.86 856.87 0.03
65.0 100.0 cpu(0) 128 1000000 1000 40585.85 866.96 0.02
1.0 100.0 cpu(0) 128 8000000 1 12.81 69.52 5.43
1.0 100.0 cpu(0) 128 8000000 32 163.09 1916.94 11.75
1.0 100.0 cpu(0) 128 8000000 32 161.78 1914.93 11.84
1.0 100.0 cpu(0) 128 16000000 32 324.77 3834.26 11.81
1.0 100.0 cpu(0) 64 8000000 32 80.03 1060.65 13.25
1.0 100.0 cpu(0) 128 8000000 32 165.13 1927.85 11.67
0.1 100.0 cpu(0) 128 8000000 32 15.62 1932.43 123.73
0.5 100.0 cpu(0) 128 8000000 32 80.60 1927.99 23.92
1.0 100.0 cpu(0) 128 8000000 32 161.30 1920.71 11.91
2.0 100.0 cpu(0) 128 8000000 32 324.41 1916.63 5.91
5.0 100.0 cpu(0) 128 8000000 32 806.08 1914.76 2.38
10.0 100.0 cpu(0) 128 8000000 32 1621.83 1911.87 1.18
20.0 100.0 cpu(0) 128 8000000 32 3246.37 1909.35 0.59
50.0 100.0 cpu(0) 128 8000000 32 8089.40 1910.03 0.24
65.0 100.0 cpu(0) 128 8000000 32 10491.50 1908.70 0.18

scipy sparse dot benchmark: dot(csr^T, default)

lhs_density(%) rhs_density(%) context batch_size feature_dim output_dim t_sparse(ms) t_dense(ms) speedup
1.0 100.0 cpu(0) 128 1000000 256 275.71 668.89 2.43
1.0 100.0 cpu(0) 128 1000000 1000 1057.33 1934.36 1.83
1.0 100.0 cpu(0) 128 1000000 1000 1083.78 1973.98 1.82
1.0 100.0 cpu(0) 64 1000000 1000 544.27 1715.48 3.15
1.0 100.0 cpu(0) 128 1000000 1000 1058.37 1926.86 1.82
0.1 100.0 cpu(0) 128 1000000 1000 110.96 1983.23 17.87
0.5 100.0 cpu(0) 128 1000000 1000 525.80 1928.72 3.67
1.0 100.0 cpu(0) 128 1000000 1000 1052.81 1936.02 1.84
2.0 100.0 cpu(0) 128 1000000 1000 2078.79 1935.97 0.93
5.0 100.0 cpu(0) 128 1000000 1000 4003.07 1924.15 0.48
10.0 100.0 cpu(0) 128 1000000 1000 7250.62 1929.69 0.27
20.0 100.0 cpu(0) 128 1000000 1000 13820.41 1933.50 0.14
50.0 100.0 cpu(0) 128 1000000 1000 34872.17 1978.84 0.06
65.0 100.0 cpu(0) 128 1000000 1000 44829.00 1975.22 0.04
1.0 100.0 cpu(0) 128 8000000 1 15.43 78.47 5.09
1.0 100.0 cpu(0) 128 8000000 32 285.26 3040.07 10.66
1.0 100.0 cpu(0) 128 8000000 32 285.11 3035.40 10.65
1.0 100.0 cpu(0) 128 16000000 32 588.41 6380.72 10.84
1.0 100.0 cpu(0) 64 8000000 32 145.21 1861.71 12.82
1.0 100.0 cpu(0) 128 8000000 32 285.98 3045.31 10.65
0.1 100.0 cpu(0) 128 8000000 32 24.42 3042.09 124.57
0.5 100.0 cpu(0) 128 8000000 32 142.37 3047.59 21.41
1.0 100.0 cpu(0) 128 8000000 32 286.20 3043.24 10.63
2.0 100.0 cpu(0) 128 8000000 32 563.47 3038.89 5.39
5.0 100.0 cpu(0) 128 8000000 32 1106.09 3037.70 2.75
10.0 100.0 cpu(0) 128 8000000 32 2022.37 3042.22 1.50
20.0 100.0 cpu(0) 128 8000000 32 3859.23 3044.10 0.79
50.0 100.0 cpu(0) 128 8000000 32 9309.22 3036.03 0.33
65.0 100.0 cpu(0) 128 8000000 32 12391.34 3083.39 0.25

scipy mxnet comparison: dot(csr, default)

lhs_density rhs_density context batch_size feature_dim output_dim t_sparse_scipy t_dense_scipy t_sparse_mxnet t_dense_mxnet Speedup = t_sparse_scipy/t_sparse_mxnet
1 100 cpu(0) 128 1000000 256 161.17 304.4 146.62 129.15 1.099236121
1 100 cpu(0) 128 1000000 1000 627.08 863.86 424.8 740.36 1.476177024
1 100 cpu(0) 128 1000000 1000 626.75 864.25 426.33 544.31 1.470105317
1 100 cpu(0) 64 1000000 1000 311.86 3681.3 165.02 408.33 1.889831536
1 100 cpu(0) 128 1000000 1000 625.43 865.74 427.06 714.32 1.464501475
0.1 100 cpu(0) 128 1000000 1000 60.72 863.56 51.54 730.98 1.178114086
0.5 100 cpu(0) 128 1000000 1000 311.92 854.28 211.82 692.08 1.472571051
1 100 cpu(0) 128 1000000 1000 626.41 857.31 428.49 454.91 1.461901095
2 100 cpu(0) 128 1000000 1000 1247.63 857.63 696.45 536.87 1.791413598
5 100 cpu(0) 128 1000000 1000 3140.5 859.42 1449.2 476.58 2.167057687
10 100 cpu(0) 128 1000000 1000 6278.92 856.56 1631.12 572.32 3.849453136
20 100 cpu(0) 128 1000000 1000 12515.63 857.4 1916.38 363.78 6.530870704
50 100 cpu(0) 128 1000000 1000 31241.86 856.87 2325.93 414.93 13.43198635
65 100 cpu(0) 128 1000000 1000 40585.85 866.96 2384.55 744.94 17.02033927
1 100 cpu(0) 128 8000000 1 12.81 69.52 63.38 90.12 0.202114232
1 100 cpu(0) 128 8000000 32 163.09 1916.94 166.35 213.85 0.980402765
1 100 cpu(0) 128 8000000 32 161.78 1914.93 175.26 142.55 0.923085701
1 100 cpu(0) 128 16000000 32 324.77 3834.26 337 392.15 0.963709199
1 100 cpu(0) 64 8000000 32 80.03 1060.65 101.33 112.96 0.789795717
1 100 cpu(0) 128 8000000 32 165.13 1927.85 165.73 215.64 0.996379654
0.1 100 cpu(0) 128 8000000 32 15.62 1932.43 16.92 156.07 0.923167849
0.5 100 cpu(0) 128 8000000 32 80.6 1927.99 82.57 171.59 0.976141456
1 100 cpu(0) 128 8000000 32 161.3 1920.71 178.22 181.41 0.90506116
2 100 cpu(0) 128 8000000 32 324.41 1916.63 319.2 163.89 1.016322055
5 100 cpu(0) 128 8000000 32 806.08 1914.76 448.13 133.87 1.798763752
10 100 cpu(0) 128 8000000 32 1621.83 1911.87 538.82 219.42 3.009966222
20 100 cpu(0) 128 8000000 32 3246.37 1909.35 634.89 198.67 5.113279466
50 100 cpu(0) 128 8000000 32 8089.4 1910.03 778.2 138.63 10.39501414
65 100 cpu(0) 128 8000000 32 10491.5 1908.7 787.64 216.83 13.32017165

scipy mxnet comparison: dot(csr^T, default)

lhs_density rhs_density context batch_size feature_dim output_dim t_sparse_scipy t_dense_scipy t_sparse_mxnet t_dense_mxnet Speedup = t_sparse_scipy/t_sparse_mxnet
1 100 cpu(0) 128 1000000 256 275.71 668.89 211.46 141.43 1.30383997
1 100 cpu(0) 128 1000000 1000 1057.33 1934.36 675.08 527.31 1.566229188
1 100 cpu(0) 128 1000000 1000 1083.78 1973.98 678.83 348.99 1.596541107
1 100 cpu(0) 64 1000000 1000 544.27 1715.48 444.87 215.36 1.22343606
1 100 cpu(0) 128 1000000 1000 1058.37 1926.86 663.76 344.94 1.594507051
0.1 100 cpu(0) 128 1000000 1000 110.96 1983.23 247.62 434.33 0.448105969
0.5 100 cpu(0) 128 1000000 1000 525.8 1928.72 435.31 373.59 1.207874848
1 100 cpu(0) 128 1000000 1000 1052.81 1936.02 678.43 335.81 1.551832908
2 100 cpu(0) 128 1000000 1000 2078.79 1935.97 1105.32 529.71 1.880713278
5 100 cpu(0) 128 1000000 1000 4003.07 1924.15 1710.71 527.61 2.340005027
10 100 cpu(0) 128 1000000 1000 7250.62 1929.69 2691.18 466.08 2.69421592
20 100 cpu(0) 128 1000000 1000 13820.41 1933.5 4676.14 396.06 2.95551673
50 100 cpu(0) 128 1000000 1000 34872.17 1978.84 10621.35 437.99 3.283214469
65 100 cpu(0) 128 1000000 1000 44829 1975.22 13605.84 448.83 3.294835159
1 100 cpu(0) 128 8000000 1 15.43 78.47 127.24 111.08 0.121266897
1 100 cpu(0) 128 8000000 32 285.26 3040.07 258.88 261.5 1.101900494
1 100 cpu(0) 128 8000000 32 285.11 3035.4 257.73 261.19 1.106235207
1 100 cpu(0) 128 16000000 32 588.41 6380.72 512.35 403.7 1.148453206
1 100 cpu(0) 64 8000000 32 145.21 1861.71 166.22 117.38 0.873601251
1 100 cpu(0) 128 8000000 32 285.98 3045.31 259.65 235.6 1.101405738
0.1 100 cpu(0) 128 8000000 32 24.42 3042.09 94.06 256.92 0.259621518
0.5 100 cpu(0) 128 8000000 32 142.37 3047.59 170.52 257.13 0.834916725
1 100 cpu(0) 128 8000000 32 286.2 3043.24 258.1 262.6 1.10887253
2 100 cpu(0) 128 8000000 32 563.47 3038.89 429.89 264.77 1.310730652
5 100 cpu(0) 128 8000000 32 1106.09 3037.7 638.32 199.07 1.732814262
10 100 cpu(0) 128 8000000 32 2022.37 3042.22 965.87 264.99 2.093832503
20 100 cpu(0) 128 8000000 32 3859.23 3044.1 1639.92 264.22 2.35330382
50 100 cpu(0) 128 8000000 32 9309.22 3036.03 3753.93 264.82 2.479859774
65 100 cpu(0) 128 8000000 32 12391.34 3083.39 4828.76 261.33 2.56615363

Graphs

Uniform Distribution

MXNet dot(csr, default) with Feature_Dim: 8M, Output_dim: 32, Batch_size: 128

[graph: uniform_feature_1m]

MXNet dot(csr, default) with Feature_Dim: 1M, Output_dim: 1000, Batch_size: 128

[graph: uniform_feature_dim8m]

Powerlaw Distribution

MXNet dot(csr, default) with Feature_Dim: 8M, Output_dim: 32, Batch_size: 128

[graph: powerlaw_featuredim1m]

MXNet dot(csr, default) with Feature_Dim: 1M, Output_dim: 1000, Batch_size: 128

[graph: powerlawfeaturedim8m]

Owner

@eric-haibin-lin eric-haibin-lin left a comment

I see lots of code commented out. Is this PR ready? Usually we put [WIP] in the title of the PR so that it's not reviewed too early, while it's not ready.

'data_name': 'criteo.t',
'data_origin_name': 'criteo.t.bz2',
'url' : "https://s3-us-west-2.amazonaws.com/sparse-dataset/criteo.t.bz2",
'feature_dim': 16000000,
Owner

Could you double-check the dimension? It seems to be "2^23 + 13" now, according to Madhav.

def run_benchmark(mini_path):
"""Run benchmarks
"""
#print("Running Benchmarking on %r data") % mini_file_name
Owner

Consider adding a log_verbose option and print these when it's turned on?

# One warm up run, verify correctness
out = mx.nd.dot(lhs_nd, rhs_dns, trans_lhs)
out_expected = mx.nd.dot(lhs_dns, rhs_dns, trans_lhs)
# ONe warm up run, verify correctness
Owner

ONe -> One

col_max = col_max * 2

if unused_nnz > 0:
#return mx.nd.array(sp.random(num_rows, num_cols, density).toarray()).tostype("csr")
Owner

remove unused code?

Author

This PR is still WIP. I need to clean up the code, add criteo, and remove some comments. Will add [WIP].

@anirudh2290 anirudh2290 changed the title Dot script changes [WIP] Dot script changes Aug 10, 2017
@anirudh2290 anirudh2290 changed the title [WIP] Dot script changes Dot script changes Aug 11, 2017
set_default_context(ctx)
assert fw == "mxnet" or fw == "scipy"
Owner

what does fw stand for?

Author

framework.

rsp=False):

def create_mini_path(mini_path, path, num_batches):
"""Create mini path for sparse"""
Owner

The comment really doesn't say anything. Maybe a sentence explaining why we're creating the mini file would be good for first-time users (to sample some batches from a large dataset).

Author

will change
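For context, a hypothetical sketch of what create_mini_path could do based on the comment above (sample the first few batches of a large libsvm dataset into a smaller file); the actual helper and its signature may differ:

import os

def create_mini_path(mini_path, path, num_batches, batch_size=128):
    """Sample the first num_batches * batch_size examples of a large libsvm
    file into a smaller 'mini' file so the benchmark doesn't scan the full dataset."""
    if not os.path.exists(mini_path):
        num_lines = num_batches * batch_size      # one libsvm line per example
        with open(path) as src, open(mini_path, 'w') as dst:
            for i, line in enumerate(src):
                if i >= num_lines:
                    break
                dst.write(line)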

# Start benchmarking
lhs_nd = rand_ndarray(lhs_shape, lhs_stype, density=lhs_den, distribution=distribution)
# only uniform distribution supported for rhs
rhs_nd = rand_ndarray(rhs_shape, rhs_stype, density=rhs_den, distribution="uniform")
lhs_nd.wait_to_read()
rhs_nd.wait_to_read()
Owner

Since you have mx.nd.waitall() in measure_cost, these wait_to_read() calls can be removed.

Author

will change
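For reference, a rough sketch of the measure_cost pattern referred to above (the actual helper in the script may differ): mx.nd.waitall() blocks until all asynchronously queued work finishes, which is why the extra wait_to_read() calls are redundant.

import time
import mxnet as mx

def measure_cost(repeat, f, *args, **kwargs):
    """Average wall-clock time of f over `repeat` runs, flushing MXNet's async engine."""
    mx.nd.waitall()                  # don't bill previously queued work to this measurement
    start = time.time()
    for _ in range(repeat):
        f(*args, **kwargs)
    mx.nd.waitall()                  # make sure all queued ops actually completed
    return (time.time() - start) / repeat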

else:
lhs_dns = np.transpose(lhs_nd.asnumpy()) if trans_lhs else lhs_nd.asnumpy()
rhs_dns = rhs_nd.asnumpy()
lhs_nd = sp.spmatrix.transpose(sp.csr_matrix(lhs_nd.asnumpy())) if trans_lhs else sp.csr_matrix(lhs_nd.asnumpy())
Owner

Did you check whether sp.spmatrix.transpose only returns a new view of the layout, or actually shuffles data to transpose the entire matrix? Does it affect the overall time spent in scipy?
When we run dot(csr.T, dns), the inputs are actually just csr (without the transpose) and dns, and the cost should include the time to do both the transpose and the dot. That way it will be a fair comparison with mxnet.dot, since the input to mx.nd.dot is also csr (without the transpose).

Author

sp.spmatrix.transpose just changes the view of the layout by reversing the dimensions: https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.spmatrix.transpose.html
Yes, I agree that in one case we are doing the transpose early on, while for mx.nd.dot we do the transpose when we pass it to measure_cost. I can do the transpose before I call measure_cost for mxnet. Do you think this cost would be significant enough to affect the benchmarking results?
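A quick illustrative check of that point (not from the PR):

import numpy as np
import scipy.sparse as sp

csr = sp.csr_matrix(np.eye(3))
csr_t = csr.transpose()                         # same as sp.spmatrix.transpose(csr)
print(csr_t.format)                             # 'csc': the CSR layout reinterpreted with dims reversed
print(np.shares_memory(csr_t.data, csr.data))   # True with the default copy=False: buffers are reused, no shuffle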

assert_almost_equal(out.asnumpy(), out_expected.asnumpy(), rtol=1e-1, atol=1e-1)
sparse_cost = measure_cost(num_repeat, dot_func_sparse, lhs_nd, rhs_nd, trans_lhs)
dense_cost = measure_cost(num_repeat, dot_func_dense, lhs_dns, rhs_dns, trans_lhs)
sparse_cost = measure_cost(num_repeat, dot_func_sparse, lhs_nd, rhs_nd)
Owner

No, this is the opposite of what I meant. We should move the transpose inside measure_cost for both scipy and mxnet.
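A sketch of what moving the transpose inside the timed region could look like for both frameworks, reusing the measure_cost sketch above; the names and shapes are illustrative, not the script's exact code:

import mxnet as mx
import scipy.sparse as sp
from mxnet.test_utils import rand_ndarray

lhs_nd = rand_ndarray((128, 1000), 'csr', density=0.01)   # small shapes just for illustration
rhs_nd = mx.nd.ones((128, 32))
lhs_sp = sp.csr_matrix(lhs_nd.asnumpy())
rhs_np = rhs_nd.asnumpy()

def scipy_dot_t():
    return lhs_sp.transpose().dot(rhs_np)                 # transpose paid inside the timed call

def mxnet_dot_t():
    return mx.nd.dot(lhs_nd, rhs_nd, transpose_a=True)    # mxnet's dot takes the transpose flag

sparse_cost_scipy = measure_cost(10, scipy_dot_t)         # measure_cost as sketched earlier
sparse_cost_mxnet = measure_cost(10, mxnet_dot_t)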

@eric-haibin-lin eric-haibin-lin merged commit f2a0852 into eric-haibin-lin:sparse Aug 16, 2017
eric-haibin-lin added a commit that referenced this pull request Aug 23, 2017
* [WIP] Sparse Tensor  (apache#5800)

* squash

merge with 38f7c55

compiles on GPU

update check alloc:

Checkpoint. Pass elem-sum gpu test

bug fix for copyfromto. sparse sgd test pass on gpu

inefficient implementation for csr copy

update submodule

fix lint

Simple bind with infer storage type (#32)

* Symbol binding for sparse tensor development. (#31)

* Initial checkin

* Add init functions for simple bind in graph_executor

* Add simple_bind c_api

* Add simple bind c-api

* Assign zeros to in_args, arg_grads, and aux_states

* Add simple_bind2 python interface

* Fix python interface bugs

* Interface changes

* Fix

* Fix core dump

* Add bind_ith_exec c_api

* Change simple_bind2

* Fix seg fault

* Finish simple_bind

* Change _bind_ith_exec

* Refactor simple_bind initialization flow for bind

* Consolidate bind and simple_bind graph init flow

* Fix bug

* Clean up

* Add comments

* Clean up

* Clean up

* Minor correction

* Rename APIs in graph executor

* Refactor

* Rebase

* Delete deprecated functions

* Move more front-end work to backend

* Bug fix

* Fix failed tests

* Minor fix

* Fix lint

* Fix lint

* Revert unnecessary changes

* Revert

* Revert

* Clean up

* Fix lint

Conflicts:
	python/mxnet/symbol.py
	src/executor/graph_executor.cc

* Add inferstorage to graph executor

* re-enable tests for sparse embedding with simple_bind

* type switch fix in sparse embedding"
;

change `default` to `default_storage` for cast storage op (#33)

* change default to default_storage

* disable cpp test build temporarily

attempt to fix windows build error, and fix lint (#34)

update nnvm submodule (#37)

Scipy build (#38)

* update nnvm submodule

* add scipy pip install for dockerfile

Python3 unit tests (#39)

* change xrange to range for python3 compatiblity"

* remove more xrange from tests

replace long with int for python3 (#40)

fix the rest of TShape constructor errors (#41)

fix lint (#42)

fix wrong usage of mshadow::Shape1" (#43)

implementation for Csr slice on cpu (#36)

* CPU implementation for CSR

remove seg_len from csr slice

add some docs for slice csr

change indptr, values, etc to be private member

bug fix in sparse embedding

update nnvm submoduel

fix lint

update unit test for sparse nd"

* add const for SliceCsrIndPtr kernel

Fix sparse dot according to the new RSP definition (#35)

* Fix csr dot dns

* Fix sparse dot

* Add fallback and test cases for dot(csr, dns)=dns

* Add int type switch

* Fix

* Fix

* Fix

update mshadow submodule (#44)

Fix dns to rsp (#46)

fix lint (#47)

add runtime storage fallback detection" (#48)

* add runtime storage fallback detection"

* replace cast storage ex with cast storage impl

Fm example (#45)

* update csr slice logic to avoid confusion. add more exmaples.

* add hint to module.update

* more testcases(fallback) for sparse_nd

* add to_csr() and to_rsp() method. More unit test (fallback now)

* add fm test. fix lint

* register sparse sgd under Optim.SGD

* update dmlc-core submoduel

* change indptr to _indptr temporarily. add const ref to fname

fix lint

fix lint; (#51)

Guard gpu cast storage (#50)

* Clean up

* Fix typo

Rearrange unit test files (#52)

fix lint. add scipy for python_test. fix scipy.sparse import error. fix truediv for python3

fix travis test (#54)

* remove pyc files

* add verbose for travis nosetests

cleanup some testing code and enums (#57)

* update Makefile

* refactor test_sparse_operator

* change `default_storage` back to `default`

* remove unused cpp tests

port libsvm parser to mxnet as libsvm iter (#55)

* copied csv iter to libsvm iter

test

libsvm iter draft

handle round batch == false for csr batch loader

code refactoring

add get stype, shape interface to iiter

separate class for sparse iter

add missing file

fix mem corruption'

rename variables

add comments

also read label from libsvm

add test. update docs. update submodule

Conflicts:
	python/mxnet/sparse_ndarray.py

* update submodule

* fix lint

* update test

* revert naming change

add benchmark scritp for dot (#59)

* add benchmark scritp for dot

add gpu option for bench

add get_data funciton for benchmark

print t_sparse, too;

add comment

change nnz to dnesity

add backward

* add comment

update fm test (#62)

introduce CSRNDarray and rowsparseNDarray to python frontend api (#58)

* introduce CSRNDarray and rowsparseNDarray to python frontend api

* temporarily disable fm_module test

fix lint (#64)

fix typo. disable libsvm io test (#65)

Improve dot (#61)

* Init checkin

* Fix

* Adjust dot parallelization methods

* Set num_omp_threads for benchmark from command line

* Fix omp thread number

* Clean up

* Add scipy as dot baseline

* Fix format

sparse_retain op (#66)

* Initial checkin

* Fix bugs

* Add unit test for sparse_retain

* Add example and modify test

add storage cast for outputs that have non-default storage (#67)

fix gpu build (#69)

Fix test_sparse_retain python3 issue (#68)

revert nnvm version

* draft for sgd rsp rsp (#75)

support sgd(rsp, rsp)

support dot(csr, rsp) when rsp is full

add ref to const ndarray params

support sparse embedding with rsp weight'

fix lint

modify embedding backward to produce dense grad

remove invalid_rid for rsp->dns

remove previous embedding op changes

pass sparse embedding test

add STORAGE_TYPE_ASSIGN_CHECK

remove backward storage infer

* fix lint (#78)

* fix lint (#79)

* serial elemwise sum impl (#80)

update module kvstore interface

add other missing params and functions

revert some interface changes

revert some more changes

reomve explicit casting for gradients on kvstore

update Comm interface

update fm example

Conflicts:
	python/mxnet/model.py
	python/mxnet/ndarray.py

* bug fix for initializing module with row_sparse weight (#81)

* bug fix for initializing module with row_sparse weight

* update log message

* Sparse ndarray serialization and deserialization (#77)

* Initial checkin

* Add unit tests

* Fix lint

* Fix lint (#84)

* Sgd with row_sparse weight, dns gradient (#83)

* sgd rsp dns draft

* support sgd_mom(rsp, dns, rsp)

* update doc

* remove cast storage for kv updater

* code refactoring

* update mshadow version (#88)

* csr slice bug fix (#90)

* benchmark dot code refactor (#87)

* q^x6x add some code in benchmark

* refactor

* minor fixes

* fix

* lint fix

* Add unit test (#91)

* add unittest

* minor fix

* remove commented lines

* change test func name

* add test rsp

* kvstore push row sparse (#93)

* Add multi-thread cpu elemwise sum for rsps

* Minor fix

* Add flag to switch between serial and multi-thread kvstore push

* Fix lint in sparse_ndarray.py

* Revert "Fix lint in sparse_ndarray.py"

This reverts commit d7225ec.

* Fix ndarray init in copy(ctx)

* Add env var to control the flow of serial/parallel reduce

* Refactor

* Fix copy ndarray bug

* Fix lint

* Refactor

* Fix windows openmp build failure (#94)

* update mshadow submoduel (#95)

* Revert "update mshadow submoduel (#95)" (#96)

This reverts commit 1a129e4.

* Refactor sparse tensor code (#99)

* Initial checkin test_sparse_ndarray passes

* Fix test failure

* Clean up

* Clean up

* Move init backend op to ndarray_utils

* Fix lint

* Eliminate circular dependency on headers

* More refactor

* Fix gpu build and consolidate Slice for dense and sparse

* Clean up

* More refactor

* Clean up

* Fix gpu build

* Fix comment

* fix pylint (#100)

* Fix refactor sparse gpu test (#104)

* Fix gpu build

* Fix

* Fix gpu test failure

* change idx types from int32 to int64 (#101)

Conflicts:
	python/mxnet/test_utils.py
	tests/python/unittest/test_sparse_operator.py

update mshadow submodule

fix extra quotes in test script

change indptr type to int64

better err message for rsp"

* revert LOG(DEBUG) change (#105)

* fix undefined zeros in optimizer.py (#106)

* move init dns zeros to init_op.h for kvstore to use (#107)

* Refactor cast storage (#109)

* Refactor cast_storage

* Add cast_storage cc and cu files

* Remove redundant comments

* Replace std::accumulate with ParallelAccumulate

* Clean up

* Fix windows build

* Rowsparse kv (#111)

* update kvstore unit test

Conflicts:
	tests/python/unittest/test_kvstore.py

update model/module.py

Conflicts:
	python/mxnet/model.py
	python/mxnet/module/module.py

fix lint

resolve conflict

remove int keys in kvstore

update cast to str function

* fix failed dist_sync_kv test

* bug fix in comm to ensure merged gradient is of the right type

bug fix in comm

* row sparse dist kvstore draft (push only)

row_sparse pull

* add ndarray row sparse shared mem constructor

* code refactoring

* add test for row_sparse weight

bug fix for kv server slicing

add async support

rsolve race condition in kvstore

* resolve error after reb ase

* fix lint (#113)

* rename some python funciton (#114)

* _to_rsp

* _to_csr. raise NotImplementedError

* todense

* fix lint (#115)

enable libsvm uniit test (apache#6839)

remove shared mem slice for csr

add csr ndarray iter test

make osx nose test verbose

disable libsvm iter test

Move InferAttr to mxnet from nnvm (apache#6830)

* Move InferAttr to mxnet from nnvm

Replace nnvm infer attr functions in c_api

Initial checkin

Clean up

Remove nnvm namespace for FInferShape, FInferType, and FInferStorageType

Add new interface for InferStorageType

Revert "Remove nnvm namespace for FInferShape, FInferType, and FInferStorageType"

This reverts commit 8aedf05.

Fix and clean up

Fix lint

Add nnvm changes

Change infer function interface to accept only rvalue reference of graph

Clean up

Flush commits to show up in PR

Add error handling for storage type inference failure

Update nnvm

* Fix pylint

Change idx type switch for aux data (apache#6860)

* Change idx type switch for aux data

* Add mshadow commit

Sparse dot enhancement (apache#6842)

* Initial checkin

Initial checkin

Fix sparse dot test

Fix unitest and add fallback for sparse dot

* Add benchmark code

* Revert "Add benchmark code"

This reverts commit be009fe.

* Fix bug

* Fix storage shape

* Remove unnecessary test code

* Use idx type switch

Implement dot(csr, rsp)=dns and dot(csr.T, rsp)=rsp and refactor (apache#6902)

* Initial checkin

Add dot(csr.T, rsp)=rsp2

Add infer storage for dot(csr, rsp)=dns and dot(csr.T, rsp)=rsp2

* Fix comments

* Replace std::lower_bound with own impl for gpu use too

* Add time profiling

* Revert "Add time profiling"

This reverts commit 8f5bb98.

* Move dot and batch_dot to a single file

* Move dot gpu impl to a .cuh file

* More refactor

* Fix include error

LibsvmIter fix (apache#6898)

* fix bug in libsvm iter which causes mem corruption

* add test for news dataset

* fix wrong path in test

* fix import error for urllib

* update url

* replace bz command with bz module

Optimized gpu dot kernels (apache#6937)

* pulled update to mshadow

* mshadow update

* added optimized gpu kernels for dot(csr,dns)=dns and dot(csr.T,dns)=dns, and unit test

* added __syncwarp to vector kernel and reduced number of writes to shared memory

Refactor sparse tensor code (apache#6955)

* Save stype in frontend to avoid c-api call for stype

* Change storage_type to stype

* Revert "Change storage_type to stype"

This reverts commit 90db7d1.

* Revert "Revert "Change storage_type to stype""

This reverts commit 0932838.

Move ndarray.py, sparse_ndarray.py, ndarray_utils.py, and _ndarray_internal to ndarrary folder

More refactor

Move elementwise sum for rsp to ndarray_function.cc

Remove unnecessary import in ndarray module

Fix pylint

Remove redundant code

Remove _stype from slots

Fix cpp-package build error caused by the change to imperative invoke interface

Use relative import

Remove print line

Rename _ndarray_internal.py to _internal.py

* Relaunch test...

minor bug fix in warp synchronous code (apache#7029)

* move storage type vector from nnvm to mxnet (apache#7054)

* move storage type vector from nnvm to mxnet

* update nnvm

* update nnvm

* Improve copy sparse tensors (apache#7003)

* Use cast_storage when copying ndarrays of different stypes on same context

* Relaunch test

* fix failed tests. add back 64bit support for dot

fix lint

* bug fix for IdentityComputeRsp

* fix lint

fix lint

fix lint

* add data partition for libsvm iter (apache#7027)

* remove sparse embedding (apache#7165)

* fix ndarray namespace

* remove untested gpu operators (apache#7172)

* skip sparse dot gpu tset. add sparse_nd_zeros gpu test

* remove sparse_retain gpu

Conflicts:
	tests/python/gpu/test_operator_gpu.py

* Fix ndarray aux data issue (apache#7098)

* Fix getting sparse ndarray data/aux_data issues

* Add tests for func csr and row_sparse

* Make get/set data/aux_data thread safe

* Fix a bug

* Fix typo and comment

* More comments

* Correct comment

Conflicts:
	tests/python/gpu/test_operator_gpu.py

* Support K-dimensional row-sparse tensor (apache#7179)

* remove check for k dimensional rowsparse tensor

* change var name for rsp sgd operator

* add checks for sparse dot

* bug fix for kdim rowsparse cast storage cpu

* update IdentityLikeRhsComputeEx interface

* remove set_storage_shape from ndarray. support elemwise_add with kdim row_sparse tensor

* use get_with_shape instead of reshape

* update according to comments

Conflicts:
	src/operator/tensor/elemwise_unary_op.h

* Improve sparse ndarray error message (apache#7181)

* add test for broadcast_to

* add comments

Conflicts:
	python/mxnet/base.py

* construct row_sparse ndarray for dist-async

fix bug in rsp add

rsp sync push

race condition for push

fix bug in rsp pull. refactor test

cleanup comments

refactor dist server

fix lint

fix storage shape issue with the new ndarray constructor

data sharding draft;

fix lint. add comment

add support for zeros gradients

use std::upper_bound/lower_bound

remove special init function for rowsparse dist kvstore

temporary support for inplace operators for sparse

add test. fix return type

store kRowSparseNDArray in kv server

remove fcomp_ex sgd with dns weight and rsp gradient

bug fix in sparse retain

sparse pull c_api

revise rowsparse pull api

use engine to compute unique to ensure thread safety

add rowsparse pull to dist-kv

fix lint

add example for rsp_pull

remove name2idx;

add sparse_pull_dict param to module

fix unit test and  c rowid conversion

support str key type in kvstore (apache#6765)

* update kvstore unit test

* update model/module.py

* fix lint

* remove int keys in kvstore

* update cast to str function

* remove _cast_to_str_keys

* fix lint

* always cast to str

Conflicts:
	include/mxnet/c_api.h
	include/mxnet/kvstore.h
	python/mxnet/kvstore.py
	python/mxnet/model.py
	python/mxnet/module/module.py
	src/c_api/c_api.cc
	src/kvstore/kvstore_local.h
	tests/python/unittest/test_kvstore.py

update module API for other submodules

update stypes in kvstore after refactoring

change type of size from size_t to int64_t

add sparse linear regression example

remove sparse_pull_dict from module

fix init_optim for seq_module. update sparse example

resolve conflict for binary add rsp rsp

Conflicts:
	python/mxnet/kvstore.py
	tests/python/unittest/test_kvstore.py

* fix DotCsrRspRspImpl error message (apache#7191)

* GPU implementation of cast_storage (dense to csr) (apache#7081)

* Added gpu implementation for cast_storage dense to csr, unit tests, and benchmark. Additionally, cast_storage interface change to accommodate the need of temporary storage in cuda kernels.

* fixed whitespace

* minor unittest update

* removed whitespace

* add cast storage benchmark params info

Conflicts:
	tests/python/gpu/test_operator_gpu.py

* Sparse square sum (apache#7206)

* Add square_sum op

* Add unit test and fix check_numeric_gradient

* Add .cu file and example

* Fix lint

* Remove gpu registration

* Use square_sum in test_module_fm

* Modify and Add documentation for mx.nd.zeros (apache#7197)

* Modify and Add documentation for mx.nd.zeros

* Change context to cpu

* Change stype to optional

* Change ordering and remove optional for _zeros_sparse_ndarray

* Expose kWriteInplace for imperative execution (fcompute_ex and fstatefulcompute_ex) (#133)

* expose kWriteInplace to FComputeEx and FStatefulComputeEx

* refactor ccode

* remove duplicated test

* Operator add_n for row sparse ndarrays (apache#7244)

* Add add_n op for row-sparse ndarrays and identity FComputeEx

* Fix bug in square_sum

* Remove test_cast_storage_ex from gpu test since it's not implemented yet

* Fix according to the cr

Conflicts:
	src/operator/tensor/elemwise_sum.cc
	src/operator/tensor/elemwise_unary_op.cc
	tests/python/gpu/test_operator_gpu.py

resolve conflict

* GPU implementation of cast_storage (dense to rsp) (apache#7223)

* CastStorageDnsRsp GPU Implementation

* updating function doc and some variable types and names

* adding cuda_get_device_prop() util function

* added rand_shape function for n-dimensional tensors

* updated cast storage unit test

* added dns_to_rsp to cast storage benchmark script

* removing redundant unit test

* fix lint

* minor change in benchmark script

* fix lint

* correct function description

* change storage_type to stype

* changed scope of using namespaces

* changed variable types from index_t to dim_t

* resolve merge conflict in ndarray.load

* Improve StatefulOp/FCompute storage fallback (#134)

* test for fcomp fallback

add storage fallback test and optimize fallback logic

rename function, add comments

use std size()

* add autograd test with sparse inputs

* update sparse ndarray api (#139)

* support mx.nd.empty for sparse ndarray

Change SparseNDArray to BaseSparseNDArray

support mx.nd.array with BaseSparseNDArray inputs. Update documentation with explicit subclasses of NDArrays

Conflicts:
	python/mxnet/ndarray/__init__.py
	python/mxnet/ndarray/ndarray.py
	python/mxnet/ndarray/sparse_ndarray.py
	tests/python/unittest/test_sparse_ndarray.py

* fix print msg in test

* Handle ograd_stype='row_sparse' for square_sum backward (#143)

* Add one kernel for square_sum backward pass to take rsp ograd

* Add kNullOp and change to use type_assign in infer stype fallback

* Sparse retain improvement (#138)

* Add one more kernel for sparse retain

* Fix compile

* Change STORAGE_TYPE_ASSIGN_CHECK to type_assign for fallback

* Fix

* Add gpu compile

* ignoring variables in SimpleBind that are used on python's sparse branch for now. (#135)

* add bias term to fm test (#145)

* update ndarray.nd, remove `invoke` from excluded members (#137)

remove __weakref__ from SparseNDArray

add data indices to doc

revert dlpack update

revert mxdoc changes

move methods from BaseSparseNDArray to CSRNDArray and RowSparseNDArray

* support storage fallback with mutable inputs (#147)

* include mutable inputs in storage fallback. refactor executor

add fallback test for rms prop and adam

fix lint

fix lint

fix test in optimizer

* update according to comments

* fix unit tests

* fix gpu compilation err

* Code changes based on reviews (#144)

* code changes according to review comments

remove executor debug. add doc to optimizer

update sparse sgd test

add dtype option to rand_sparse_ndarray

* overhauled reqs for sparse operators

* patch FCompExFallback with mutable inputs. update test_optimizer with more fallback cases

* change executor debug macro to env var

* add comment

* update doc

* change ndarray.aux_shape() to return const reference

* remove todense to_rsp to_csr. replace with tostype

* replace manual calls to cast_storage with tostype

* disable gpu fallback test for optimizer

* fix lint

* add backward pass for cast_storage. refactor cast_storage test

* rand_sparse_ndarray bug fix

* fix cast_storage for gpu

* disable csr test for fp16

* update row sparse ndarray doc

* update doc
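
The tostype conversion that replaces to_dense/to_rsp/to_csr in the items above can be sketched as:

```python
# Sketch: storage conversion via tostype instead of to_dense/to_rsp/to_csr.
import mxnet as mx

dense = mx.nd.array([[0, 1], [2, 0]])
csr = dense.tostype('csr')            # dense -> CSR
rsp = dense.tostype('row_sparse')     # dense -> row sparse
back = csr.tostype('default')         # back to dense storage
print(csr.stype, rsp.stype, back.stype)
```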

* small edits according to reviews (#151)

* fix lint (#152)

* add license to all new files in sparse branch (#154)

* Allocate temp data on the fly for some casting operations (#149)

* fix utf8 encoding in sparse ndarray

* Extending the GPU dot operator (apache#7226)

* Added GPU DotCsrRspDnsImpl declaration and TODOs

* cleaning up function doc, variable types, and code-style

* minor bug fixes

* enable GPU dot(csr,rsp)=dns unit test

* extend sparse dot unit test

* adding GPU impl of DotCsrRspDns and its kernels

* add TODO

* changed variable types from index_t to dim_t

* fix function description

* added DotCsrRspRspImpl and its kernels (baseline, functionality)

* added DotCsrDnsRspImpl and its kernels (baseline, functionality); plus code documentation

* refactored dot benchmark

* optimized DotCsrTransDnsRsp GPU kernel

* change of dot impl interface to include OpContext, for temp storage

* removing __device__ flag from CPU kernels

* minor fixes and changing variable data types

* minor fixes based on code reviews

Conflicts:
	benchmark/python/sparse_op.py
	tests/python/gpu/test_operator_gpu.py
	tests/python/unittest/test_sparse_operator.py

* Add get_synthetic_dataset function to util (#146)

* Add get_synthetic_datasets

* Move to test_utils

* Remove _get_uniform_dataset

* Move validation to its own function

* Refactor the validation code for csr generation

* Make test_powerlaw a nested function

* Change SparseNDArray to CSRNDArray

* Merge with dtype specific changes in test_utils
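
Assuming the distribution parameter added to the test_utils helpers keeps the names used in this branch, generating uniform vs. power-law CSR data looks roughly like this (the exact helper signature is an assumption, not verified against the merged API):

```python
# Sketch (assumed signature): random CSR data with uniform vs. power-law row densities.
from mxnet.test_utils import rand_ndarray

uniform_csr = rand_ndarray(shape=(256, 10000), stype='csr',
                           density=0.001, distribution='uniform')
powerlaw_csr = rand_ndarray(shape=(256, 10000), stype='csr',
                            density=0.001, distribution='powerlaw')
print(uniform_csr.stype, powerlaw_csr.stype)
```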

* temporary fix for batch norm storage fallback (#156)

* support random_uniform/normal/gamma with row_sparse output (#155)

* add support for initializer with row_sparse output

* add scalar assignment to row_sparse

* add setitem test to gpu

* Revert "add scalar assignment to row_sparse"

This reverts commit 8aef7a5.

* Revert "add setitem test to gpu"

This reverts commit 3b969ac.

* Square sum backward support one more case (#161)

* Add documentation for sparse ops (#148)

* draft doc for sparse op

* add more stype doc for operators

* add doc for cast_storage

* see also cast_storage. remove base sparse ndarray. fix aux_types comment

* grammar / spelling fix

* A few fixes (#163)

* fix batch norm gpu kernel. register random operators on gpu

* register sparse random op on gpu, too

* Minor fixes sparse ops (#160)

* change CPU kernel inline directives, data types, and function doc

* update dot dtype switch to use 32 and 64bit floating point only

* use type_assign instead of STORAGE_TYPE_ASSIGN_CHECK

* added tensor_util-inl.cuh file for common tensor operator GPU kernels

* sparse Adam optimizer (#164)

* add sparse adam

* register gpu op

* add comments

* cr comments
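
A rough sketch of the sparse Adam path with row_sparse weight and gradient; variable names are illustrative and the dispatch note is an expectation rather than something verified here:

```python
# Sketch: Adam update with row_sparse weight and gradient.
import mxnet as mx

weight = mx.nd.ones((100, 8)).tostype('row_sparse')
grad = mx.nd.ones((100, 8)).tostype('row_sparse') * 0.1
adam = mx.optimizer.Adam(learning_rate=0.01)
updater = mx.optimizer.get_updater(adam)
updater(0, grad, weight)   # should hit the sparse adam update for rsp weight/grad
print(weight.stype)
```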

* kvstore.row_sparse_pull for GPU and end-to-end benchmark: CPU vs. multi-GPUs (#150)

* Add gpu support for BroadcastRowSparse

* Fix bugs

* Add benchmark script

* Increase output dim size

* Update weight on CPU using single GPU for sparse tensors

* More fix

* Optimize sparse_retain for special case

* Change row sparse pull locations

* Avoid sparse retain on cpu if possible

* Use acc for metric

* Fix misc
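
The row_sparse_pull path benchmarked in #150 can be sketched as below; the key name, shapes, and row ids are illustrative:

```python
# Sketch: pull only the rows listed in row_ids from a row_sparse value in kvstore.
import mxnet as mx

kv = mx.kv.create('local')
kv.init('w', mx.nd.ones((10, 4)).tostype('row_sparse'))
out = mx.nd.zeros((10, 4), stype='row_sparse')
row_ids = mx.nd.array([0, 3, 7], dtype='int64')
kv.row_sparse_pull('w', out=out, row_ids=row_ids)
print(out.indices.asnumpy())
```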

* fix bug in adam update (#167)

fix a bug in adam update

* change sparse example from regression to classification (#165)

* fix python import (#166)

* Add waitall to sparse_end2end.py (#169)

* Add waitall()

* Add dummy metric option

* Add header license

* Dot script changes (#159)

* Add get_synthetic_datasets

* Move to test_utils

* Remove _get_uniform_dataset

* Move validation to its own function

* Refactor the validation code for csr generation

* Make test_powerlaw a nested function

* Change SparseNDArray to CSRNDArray

* Refactoring changes to dot.py

* Fix mxnet test_utils changes

* Remove pdb statement

* Add distribution parameter

* Refactor benchmarking script

* Remove unused code

* Make style changes and remove unused code

* Change typo in comment

* Add transpose support

* Change typo

* 4 decimal places needed for density

* Add rsp support for real datasets

* Correct variable name mini_file_name

* Move wait_to_read outside if

* Separate out scipy and mxnet logic in bench_dot

* Fix lhs_trans issue

* Move transpose outside measure_cost

* Compute transpose inside measure_cost

* Remove unused variables
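
The measure_cost timing pattern around which the refactor is organized looks roughly like this; the mx.nd calls are real API, while the helper and its arguments are placeholders:

```python
# Sketch of the timing pattern used when benchmarking mx.nd.dot on sparse inputs.
import time
import mxnet as mx

def measure_cost(repeat, f, *args, **kwargs):
    """Average seconds per call, draining the async engine before and after timing."""
    mx.nd.waitall()
    start = time.time()
    for _ in range(repeat):
        f(*args, **kwargs)
    mx.nd.waitall()
    return (time.time() - start) / repeat

csr = mx.nd.zeros((128, 10000), stype='csr')
dense = mx.nd.ones((10000, 32))
print('avg seconds per dot:', measure_cost(10, mx.nd.dot, csr, dense))
```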

* Transpose only if trans_lhs (#171)

* fix default val for distribution (#172)

* fix lint (#175)

* avoid cast_storage in dist-kvstore-server (#174)

* avoid cast_storage in dist-kvstore-server

* add stream arg to mshadow::Copy

* fix copy order

* Add sparse namespace to ndarray and symbol (#177)

* Register dot, cast_storage, and sparse_retain under mxnet.ndarray.sparse

* Add sparse to symbol namespace

* Delete commented code

* mv sparse_ndarray.py sparse.py

* Clean up

* Change docstring
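
The mxnet.ndarray.sparse namespace registered in #177 can be exercised roughly as follows; the three operators are those named in the commit title, and shapes are arbitrary:

```python
# Sketch: operators exposed under the mx.nd.sparse namespace.
import mxnet as mx

csr = mx.nd.array([[0, 1, 0], [2, 0, 3]]).tostype('csr')
dense = mx.nd.ones((3, 2))

prod = mx.nd.sparse.dot(csr, dense)                    # dot with a CSR lhs
rsp = mx.nd.sparse.cast_storage(prod, 'row_sparse')    # cast_storage under the namespace
kept = mx.nd.sparse.retain(rsp, mx.nd.array([0]))      # sparse retain keeps listed rows
print(prod.stype, rsp.stype, kept.stype)
```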

* changes based on code reviews (#176)

* remove scipy dependency

* move kvstore checks to backend

* add const to lambda

* temp fix to ndarray.md (#178)

* Fix sparse namespace pylint (#179)

* add comments and error msg (#181)

* add clarification for csr (#182)

* add clarification for csr

* cr comments

* revert change in test util (#183)

* fix amalgamation (#184)

* fix lint