Release Notes

v0.2.0

New Features

  • Python version support: Added Python 3.14 support.

  • FFT Conv1D: New non-causal 1D convolution using real FFT.

    • Forward and backward pass with full autograd support.

    • Supports FFT sizes from 32 to 8192.

    • Automatic input/filter padding and FFT size validation.

    • Supports float16, bfloat16, float32, and float64 data types.
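The math behind an FFT-based non-causal 1D convolution can be sketched in a few lines of NumPy. This is a reference for the semantics only; the function name, signature, and power-of-two FFT sizing below are illustrative assumptions, not the library's API:

```python
import numpy as np

def fft_conv1d_same(x, h):
    """Non-causal 1D convolution with 'same' output length via real FFT.

    Pure-NumPy sketch of the math only. x: (length,), h: (filter_len,).
    """
    n = len(x) + len(h) - 1            # linear-convolution length
    nfft = 1 << (n - 1).bit_length()   # next power-of-two FFT size
    # Zero-pad both signals so circular convolution equals linear convolution.
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)[:n]
    start = (len(h) - 1) // 2          # center-crop to 'same' length
    return y[start:start + len(x)]

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.25, 0.5, 0.25])
y = fft_conv1d_same(x, h)
```

For filter lengths well below the signal length this matches `np.convolve(x, h, mode="same")` up to floating-point rounding.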

  • Causal Conv1D NHC layout: Added support for channel-last memory format in causal convolution.

  • Causal Conv1D NCH layout: Improved performance for workloads with 128-byte aligned sequences and 16-bit data types.

  • FFT Conv2D optimizations: Improved performance of the 2D FFT convolution kernels; 2D depthwise convolution with 'same' padding via FFT is now supported.

Bug Fixes

  • Fixed int32 index overflow for large tensors when batch_size * hidden_dim * seq_dim exceeds 2^31 elements. All kernel layout computations now use 64-bit indexing.
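For context, the overflow boundary is easy to hit: with 32-bit indexing the flat element index wraps once batch_size * hidden_dim * seq_dim passes 2^31 - 1. A NumPy sketch with hypothetical sizes:

```python
import numpy as np

# Hypothetical shape whose element count just exceeds 2**31 - 1.
sizes = np.array([8, 4096, 65537], dtype=np.int64)  # batch, hidden, seq

n64 = int(sizes.prod())                                  # 64-bit indexing: correct count
n32 = int(sizes.astype(np.int32).prod(dtype=np.int32))   # 32-bit arithmetic: wraps negative
```

Here `n64` is 2,147,516,416 (just past 2^31 - 1) while the int32 product silently wraps to a negative value, which is the failure mode the 64-bit indexing fix removes.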

  • Fixed causal conv1d backward kernel producing incorrect weight gradients for certain sequence lengths due to uninitialized values.

  • Fixed padding calculation in fft_conv2d.py for correct same-padding semantics.

v0.1.1

First official release of subquadratic_ops_torch, providing GPU-accelerated CUDA kernels for subquadratic operations with PyTorch bindings.

New Features

  • B2B Causal Conv1D: Back-to-back depthwise causal 1D convolution fused kernel for the Striped Hyena 2 architecture used in the Evo2 model.

    • Fuses projection convolution, element-wise gating, mixer convolution, and skip connection into a single kernel launch.

    • Forward and backward pass with full autograd support.

    • Supports float16, bfloat16, float32, and float64 data types.
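The fused op's dataflow can be sketched in NumPy. Everything below is a reference for the composition only: the gating form, function names, and signatures are illustrative assumptions, and the real kernel performs all four stages in a single launch:

```python
import numpy as np

def causal_conv1d(x, w):
    """Depthwise causal 1D conv: x (channels, length), w (channels, taps).

    Left-padding by taps-1 makes y[t] depend only on x[t], x[t-1], ...
    """
    taps = w.shape[1]
    xp = np.pad(x, ((0, 0), (taps - 1, 0)))
    return np.stack([np.convolve(xp[c], w[c], mode="valid")
                     for c in range(x.shape[0])])

def b2b_causal_conv1d(x, w_proj, w_mix):
    """Sketch of the fused op's composition (not the library's API)."""
    u = causal_conv1d(x, w_proj)   # projection convolution
    g = u * x                      # element-wise gating (assumed form)
    y = causal_conv1d(g, w_mix)    # mixer convolution
    return y + x                   # skip connection
```

With identity filters (weight 1 on the current sample, 0 elsewhere) the sketch reduces to `x * x + x`, which makes the gating and skip stages easy to check in isolation.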

  • Causal Conv1D: Depthwise causal 1D convolution with optional bias and activation.

    • Forward and backward pass with full autograd support.

    • Built-in SiLU activation support.

    • Supports float16, bfloat16, float32, and float64 data types.

    • NCH layout support.

  • FFT Causal Conv1D: Causal 1D convolution using real FFT and IFFT for long filter sizes.

    • Short-FFT path for filter lengths up to 16384 (SM 9.0+) or 8192 (older architectures).

    • Long-FFT path for arbitrary filter lengths.

    • Forward and backward pass with full autograd support.

    • Supports float16, bfloat16, float32, and float64 data types.
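The causal variant differs from a non-causal FFT convolution only in where the linear-convolution result is cropped: keeping the first len(x) samples makes y[t] depend only on inputs at times up to t. A NumPy sketch of the math (illustrative, not the library's short-/long-FFT kernels):

```python
import numpy as np

def fft_causal_conv1d(x, h):
    """Causal conv via real FFT: y[t] = sum_k h[k] * x[t - k], with x[<0] = 0."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()   # zero-pad so circular conv == linear conv
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)
    return y[:len(x)]                  # causal crop: keep the first len(x) samples
```

Taking `y[:len(x)]` rather than a center crop is exactly what makes the output causal; the result matches the prefix of a full linear convolution.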

  • FFT Conv2D: FFT-based 2D depthwise separable convolution with same-padding semantics.

    • Forward and backward pass with full autograd support.

    • Supports float32 and float64 data types.
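The 'same'-padding semantics can be sketched with NumPy's 2D real FFT. This is a reference for the math only; the function name and signature are illustrative, not the library's API:

```python
import numpy as np

def fft_depthwise_conv2d_same(x, w):
    """Depthwise 2D convolution with 'same' padding via 2D real FFT.

    x: (channels, H, W), w: (channels, kh, kw). NumPy sketch of the semantics.
    """
    c, H, W = x.shape
    kh, kw = w.shape[1:]
    fh, fw = H + kh - 1, W + kw - 1            # linear-convolution extent
    X = np.fft.rfft2(x, s=(fh, fw))
    Wf = np.fft.rfft2(w, s=(fh, fw))
    y = np.fft.irfft2(X * Wf, s=(fh, fw))
    top, left = (kh - 1) // 2, (kw - 1) // 2   # center-crop to 'same'
    return y[:, top:top + H, left:left + W]
```

A quick sanity check: a 3x3 kernel with a 1 at its center is the identity under 'same' padding, so the op should return the input unchanged.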

  • Implicit Filter: Implicit modal filter generation for the Hyena architecture.

    • Memory-efficient implementation of implicit modal filter.

    • Forward and backward pass with full autograd support.

    • Built on NVIDIA Warp for tiled kernel launches.
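As a sketch of what an implicit modal filter is (a long filter generated on the fly from a few per-mode parameters rather than stored explicitly), one common parameterization sums damped complex exponentials. The formulation, names, and signature below are illustrative assumptions, not necessarily the library's exact formulation:

```python
import numpy as np

def implicit_modal_filter(residues, poles, length):
    """Generate a length-L filter h[t] = Re( sum_k R_k * p_k**t ).

    Hypothetical modal parameterization; the library's actual formulation
    may differ. residues, poles: (K,) complex arrays, one entry per mode.
    """
    t = np.arange(length)
    return (residues[:, None] * poles[:, None] ** t[None, :]).sum(axis=0).real

R = np.array([0.5 + 0.1j, 0.2 - 0.3j])                            # residues
p = np.array([0.95 * np.exp(1j * 0.3), 0.8 * np.exp(-1j * 1.1)])  # poles, |p| < 1
h = implicit_modal_filter(R, p, 64)
```

Keeping pole magnitudes below 1 gives a decaying filter, and only the 2K complex parameters need to be stored, which is the memory-efficiency argument for the implicit form.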

  • Rearrange: CUDA-accelerated tensor layout transposition between (B, H, L) and (L, B, H) formats with autograd support.
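In NumPy terms the Rearrange op is a transpose plus a copy to contiguous memory; the CUDA kernel's value is performing that copy with coalesced accesses and providing autograd support:

```python
import numpy as np

x_bhl = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)  # (B, H, L)
# (B, H, L) -> (L, B, H): the sequence axis moves to the front.
x_lbh = np.ascontiguousarray(x_bhl.transpose(2, 0, 1))
```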

Platform Support

  • CUDA-compatible NVIDIA GPUs (Ampere and newer).

  • CUDA Toolkit 12.0+.

  • Python 3.11–3.13.