# Release Notes
## v0.2.0

### New Features
- **Python version support**: Added Python 3.14 support.
- **FFT Conv1D**: New non-causal 1D convolution using real FFT.
  - Forward and backward pass with full autograd support.
  - Supports FFT sizes from 32 to 8192.
  - Automatic input/filter padding and FFT size validation.
  - Supports float16, bfloat16, float32, and float64 data types.
- **Causal Conv1D NHC layout**: Added support for the channel-last memory format in causal convolution.
- **Causal Conv1D NCH layout**: Improved performance for workloads with 128-byte aligned sequences and 16-byte data types.
- **FFT Conv2D optimizations**: Performance improvements to the 2D FFT convolution kernels, which support 2D depthwise convolution with `'same'` padding using FFT.
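The FFT-convolution idea behind these Conv1D/Conv2D kernels can be sketched in a few lines of pure PyTorch. This is a minimal illustration of the technique, not the library's API; the function name and argument layout below are assumptions:

```python
import torch

def fft_conv1d_same(x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Non-causal depthwise 1D convolution via real FFT, 'same'-padded.

    x: (batch, channels, length); h: (channels, filter_len).
    Shapes and names are illustrative only.
    """
    L, K = x.shape[-1], h.shape[-1]
    n = L + K - 1                        # full linear-convolution length
    X = torch.fft.rfft(x, n=n)           # real FFT of input
    H = torch.fft.rfft(h, n=n)           # real FFT of filter (broadcast over batch)
    y = torch.fft.irfft(X * H, n=n)      # pointwise product in frequency = convolution
    start = (K - 1) // 2                 # crop back to 'same' output length
    return y[..., start:start + L]
```

Note that the FFT computes true convolution, whereas `torch.nn.functional.conv1d` computes cross-correlation, so the filter must be flipped along its last axis when comparing the two.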
### Bug Fixes
- Fixed `int32` index overflow for large tensors when `batch_size * hidden_dim * seq_dim` exceeds 2^31 elements. All kernel layout computations now use 64-bit indexing.
- Fixed the causal conv1d backward kernel producing incorrect weight gradients for certain sequence lengths due to uninitialized values.
- Fixed the padding calculation in `fft_conv2d.py` for correct same-padding semantics.
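To see why the 64-bit indexing fix matters: a flat NCH index computed in 32-bit arithmetic wraps around once the element count passes 2^31. A minimal NumPy illustration with hypothetical shapes (this is not the kernel code; NumPy may emit an overflow warning at the wrapping multiply):

```python
import numpy as np

# Hypothetical shapes whose product is exactly 2^32 elements.
batch, hidden, seq = 8, 4096, 131072
b, h, s = batch - 1, hidden - 1, seq - 1       # index of the last element

# 32-bit flat-index arithmetic wraps past 2^31 - 1 ...
idx32 = np.int32(b) * np.int32(hidden) * np.int32(seq) \
        + np.int32(h) * np.int32(seq) + np.int32(s)

# ... while 64-bit indexing stays correct.
idx64 = np.int64(b) * hidden * seq + np.int64(h) * seq + s

print(int(idx32), int(idx64))   # the 32-bit result has wrapped to a negative index
```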
## v0.1.1
First official release of `subquadratic_ops_torch`, providing GPU-accelerated CUDA kernels for subquadratic operations with PyTorch bindings.
### New Features
- **B2B Causal Conv1D**: Back-to-back depthwise causal 1D convolution fused kernel for the Striped Hyena 2 architecture used in the Evo2 model.
  - Fuses projection convolution, element-wise gating, mixer convolution, and skip connection into a single kernel launch.
  - Forward and backward pass with full autograd support.
  - Supports float16, bfloat16, float32, and float64 data types.
- **Causal Conv1D**: Depthwise causal 1D convolution with optional bias and activation.
  - Forward and backward pass with full autograd support.
  - Built-in SiLU activation support.
  - Supports float16, bfloat16, float32, and float64 data types.
  - NCH layout support.
- **FFT Causal Conv1D**: Causal 1D convolution using real FFT and IFFT for long filter sizes.
  - Short-FFT path for filter lengths up to 16384 (SM 9.0+) or 8192 (older architectures).
  - Long-FFT path for arbitrary filter lengths.
  - Forward and backward pass with full autograd support.
  - Supports float16, bfloat16, float32, and float64 data types.
- **FFT Conv2D**: FFT-based 2D depthwise separable convolution with same-padding semantics.
  - Forward and backward pass with full autograd support.
  - Supports float32 and float64 data types.
- **Implicit Filter**: Implicit modal filter generation for the Hyena architecture.
  - Memory-efficient implementation of the implicit modal filter.
  - Forward and backward pass with full autograd support.
  - Built on NVIDIA Warp for tiled kernel launches.
- **Rearrange**: CUDA-accelerated tensor layout transposition between `(B, H, L)` and `(L, B, H)` formats with autograd support.
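As a reference for what the Causal Conv1D kernel computes, here is a pure-PyTorch sketch of depthwise causal 1D convolution in NCH layout with optional SiLU. The function name and arguments are illustrative assumptions; the actual fused CUDA kernel differs:

```python
import torch
import torch.nn.functional as F

def causal_conv1d_ref(x, weight, bias=None, activation=None):
    """Reference depthwise causal 1D convolution (NCH layout).

    x: (batch, channels, seq_len); weight: (channels, filter_len).
    Left-pads by filter_len - 1 so output[..., t] depends only on x[..., :t+1].
    Illustrative pure-PyTorch version, not the fused CUDA kernel.
    """
    channels, K = weight.shape
    x = F.pad(x, (K - 1, 0))                   # causal (left-only) padding
    y = F.conv1d(x, weight.unsqueeze(1), bias=bias, groups=channels)
    return F.silu(y) if activation == "silu" else y
```

The left-only padding is what makes the convolution causal: output position `t` never reads inputs beyond `t`.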
### Platform Support
- CUDA-compatible NVIDIA GPUs (Ampere and newer).
- CUDA Toolkit 12.0+.
- Python 3.11–3.13.