AI News, Benchmark TensorFlow #66

Benchmark TensorFlow #66

I'm kind of curious whether there's any support for automatically compounding operations together, or for leveraging kernels that have some compounding built in (like the alpha/beta params of gemm).
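For readers unfamiliar with the alpha/beta parameters mentioned above: a BLAS-style gemm computes alpha * (A x B) + beta * C, so the scale and the accumulate are compounded into the same pass that writes the output. Below is a minimal pure-Python sketch of that idea. The function names `gemm_fused` and `gemm_unfused` are hypothetical and for illustration only; this is not how any particular framework or kernel implements it.

```python
def gemm_fused(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C, touching each output element once."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            # The scale and accumulate are folded into the final store.
            out[i][j] = alpha * acc + beta * C[i][j]
    return out


def gemm_unfused(alpha, A, B, beta, C):
    """Same result as three separate operations, each a full pass over the output."""
    m, k, n = len(A), len(B), len(B[0])
    prod = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]                                   # pass 1: matmul
    scaled = [[alpha * v for v in row] for row in prod]          # pass 2: scale
    return [[scaled[i][j] + beta * C[i][j] for j in range(n)]    # pass 3: add
            for i in range(m)]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
fused = gemm_fused(2.0, A, B, 0.5, C)
unfused = gemm_unfused(2.0, A, B, 0.5, C)
```

Both versions give the same numbers; the difference on real hardware is that the unfused form reads and writes the intermediate result extra times, which is the cost that compounding avoids.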

You can find new benchmarks of my latest Winograd kernels in the updated paper. What I'll be working on next is basically taking a lot of what I learned implementing Winograd and refreshing all of my conv/pooling/gemm kernels to support very small minibatches at near-full utilization.
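As background on what a Winograd kernel computes, here is a minimal sketch of the standard 1D F(2,3) minimal filtering algorithm, which produces two outputs of a 3-tap filter with four multiplies instead of six. It is a textbook illustration only, not the kernels benchmarked in the paper; the names `winograd_f23` and `direct` are hypothetical.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over a 4-element tile, using 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: can be computed once per filter and reused across tiles.
    u0, u1, u2, u3 = g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2
    # Input transform followed by the four elementwise multiplies.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct(d, g):
    """Reference: sliding dot product (correlation), 6 multiplies."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]


tile = [1.0, 2.0, 3.0, 4.0]
taps = [1.0, 2.0, 3.0]
fast = winograd_f23(tile, taps)
slow = direct(tile, taps)
```

The 2D variants used for convolution layers nest this transform in both spatial dimensions, which is where the larger multiply savings come from.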

Nervana's Neon and Winograd #93

I already have a fully fused version of that kernel that I should finish debugging this weekend.

On the weight update side, fusion probably isn't possible due to the extremely strided memory access pattern required and no shared memory left for mitigating that.

I guess there's a chance in NCHW that the overlaps in the super-tiling might make full fusion possible in update, but on the downside you're slower on fprop/bprop for smallish HW because your effective tile size needs to be much bigger and you end up with a lot of zero overlap.

There's plenty of shared memory around to efficiently transpose in place at no cost, and having C as the inner dimension means that you minimize the slicing logic for all values of N, not just larger ones.
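To make the "C as the inner dimension" point concrete, the sketch below computes flat memory offsets of the channel vector at one (n, h, w) position under two dense row-major layouts. The choice of NHWC and NCHW as the two layouts, and the helper names `strides` and `channel_offsets`, are assumptions for illustration; the comment above does not name the exact layout the author has in mind.

```python
def strides(shape):
    """Row-major element strides for a dense tensor of the given shape."""
    out = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        out[i] = out[i + 1] * shape[i + 1]
    return out


def channel_offsets(layout, dims, n, h, w):
    """Flat offsets of every channel at position (n, h, w) for a layout string."""
    shape = [dims[axis] for axis in layout]
    st = dict(zip(layout, strides(shape)))
    base = n * st["N"] + h * st["H"] + w * st["W"]
    return [base + c * st["C"] for c in range(dims["C"])]


dims = {"N": 2, "C": 4, "H": 3, "W": 3}
nhwc = channel_offsets("NHWC", dims, n=1, h=2, w=0)  # C innermost: stride 1
nchw = channel_offsets("NCHW", dims, n=1, h=2, w=0)  # C stride is H * W
```

With C innermost the channel vector is one contiguous run whatever N is, while in NCHW consecutive channels sit H * W elements apart.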