Release Notes

v0.2.0

New Features

  • Python version support: Added Python 3.14 support.

  • FFT Conv1D: New non-causal 1D convolution using real FFT.

    • Forward and backward pass with full autograd support.

    • Supports FFT sizes from 32 to 8192.

    • Automatic input/filter padding and FFT size validation.

    • Supports float16, bfloat16, float32, and float64 data types.
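The math behind an FFT-based non-causal 1D convolution can be sketched in a few lines of NumPy. This is a reference for the semantics only; the function name, signature, and power-of-two FFT sizing below are illustrative assumptions, not the library's API:

```python
import numpy as np

def fft_conv1d_same(x, h):
    """Non-causal 1D convolution with 'same' output length via real FFT.

    Pure-NumPy sketch of the math only. x: (length,), h: (filter_len,).
    """
    n = len(x) + len(h) - 1            # linear-convolution length
    nfft = 1 << (n - 1).bit_length()   # next power-of-two FFT size
    # Zero-pad both signals so circular convolution equals linear convolution.
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)[:n]
    start = (len(h) - 1) // 2          # center-crop to 'same' length
    return y[start:start + len(x)]

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.25, 0.5, 0.25])
y = fft_conv1d_same(x, h)
```

For filter lengths well below the signal length this matches `np.convolve(x, h, mode="same")` up to floating-point rounding.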

  • Causal Conv1D NHC layout: Added support for channel-last memory format in causal convolution.

  • Causal Conv1D NCH layout: Improved performance for workloads with 128-byte aligned sequences and 16-bit data types.

  • FFT Conv2D optimizations: Improved performance of the 2D FFT convolution kernels; 2D depthwise convolution with 'same' padding via FFT is now supported.

Bug Fixes

  • Fixed int32 index overflow for large tensors when batch_size * hidden_dim * seq_dim exceeds 2^31 elements. All kernel layout computations now use 64-bit indexing.
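For context, the overflow boundary is easy to hit: with 32-bit indexing the flat element index wraps once batch_size * hidden_dim * seq_dim passes 2^31 - 1. A NumPy sketch with hypothetical sizes:

```python
import numpy as np

# Hypothetical shape whose element count just exceeds 2**31 - 1.
sizes = np.array([8, 4096, 65537], dtype=np.int64)  # batch, hidden, seq

n64 = int(sizes.prod())                                  # 64-bit indexing: correct count
n32 = int(sizes.astype(np.int32).prod(dtype=np.int32))   # 32-bit arithmetic: wraps negative
```

Here `n64` is 2,147,516,416 (just past 2^31 - 1) while the int32 product silently wraps to a negative value, which is the failure mode the 64-bit indexing fix removes.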

  • Fixed causal conv1d backward kernel producing incorrect weight gradients for certain sequence lengths due to uninitialized values.

  • Fixed padding calculation in fft_conv2d.py for correct same-padding semantics.

v0.1.1

First official release of subquadratic_ops_torch, providing GPU-accelerated CUDA kernels for subquadratic operations with PyTorch bindings.

New Features

  • B2B Causal Conv1D: Back-to-back depthwise causal 1D convolution fused kernel for the Striped Hyena 2 architecture used in the Evo2 model.

    • Fuses projection convolution, element-wise gating, mixer convolution, and skip connection into a single kernel launch.

    • Forward and backward pass with full autograd support.

    • Supports float16, bfloat16, float32, and float64 data types.
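The fused op's dataflow can be sketched in NumPy. Everything below is a reference for the composition only: the gating form, function names, and signatures are illustrative assumptions, and the real kernel performs all four stages in a single launch:

```python
import numpy as np

def causal_conv1d(x, w):
    """Depthwise causal 1D conv: x (channels, length), w (channels, taps).

    Left-padding by taps-1 makes y[t] depend only on x[t], x[t-1], ...
    """
    taps = w.shape[1]
    xp = np.pad(x, ((0, 0), (taps - 1, 0)))
    return np.stack([np.convolve(xp[c], w[c], mode="valid")
                     for c in range(x.shape[0])])

def b2b_causal_conv1d(x, w_proj, w_mix):
    """Sketch of the fused op's composition (not the library's API)."""
    u = causal_conv1d(x, w_proj)   # projection convolution
    g = u * x                      # element-wise gating (assumed form)
    y = causal_conv1d(g, w_mix)    # mixer convolution
    return y + x                   # skip connection
```

With identity filters (weight 1 on the current sample, 0 elsewhere) the sketch reduces to `x * x + x`, which makes the gating and skip stages easy to check in isolation.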

  • Causal Conv1D: Depthwise causal 1D convolution with optional bias and activation.

    • Forward and backward pass with full autograd support.

    • Built-in SiLU activation support.

    • Supports float16, bfloat16, float32, and float64 data types.

    • NCH layout support.

  • FFT Causal Conv1D: Causal 1D convolution using real FFT and IFFT for long filter sizes.

    • Short-FFT path for filter lengths up to 16384 (SM 9.0+) or 8192 (older architectures).

    • Long-FFT path for arbitrary filter lengths.

    • Forward and backward pass with full autograd support.

    • Supports float16, bfloat16, float32, and float64 data types.
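The causal variant differs from a non-causal FFT convolution only in where the linear-convolution result is cropped: keeping the first len(x) samples makes y[t] depend only on inputs at times up to t. A NumPy sketch of the math (illustrative, not the library's short-/long-FFT kernels):

```python
import numpy as np

def fft_causal_conv1d(x, h):
    """Causal conv via real FFT: y[t] = sum_k h[k] * x[t - k], with x[<0] = 0."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()   # zero-pad so circular conv == linear conv
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)
    return y[:len(x)]                  # causal crop: keep the first len(x) samples
```

Taking `y[:len(x)]` rather than a center crop is exactly what makes the output causal; the result matches the prefix of a full linear convolution.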

  • FFT Conv2D: FFT-based 2D depthwise separable convolution with same-padding semantics.

    • Forward and backward pass with full autograd support.

    • Supports float32 and float64 data types.
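The 'same'-padding semantics can be sketched with NumPy's 2D real FFT. This is a reference for the math only; the function name and signature are illustrative, not the library's API:

```python
import numpy as np

def fft_depthwise_conv2d_same(x, w):
    """Depthwise 2D convolution with 'same' padding via 2D real FFT.

    x: (channels, H, W), w: (channels, kh, kw). NumPy sketch of the semantics.
    """
    c, H, W = x.shape
    kh, kw = w.shape[1:]
    fh, fw = H + kh - 1, W + kw - 1            # linear-convolution extent
    X = np.fft.rfft2(x, s=(fh, fw))
    Wf = np.fft.rfft2(w, s=(fh, fw))
    y = np.fft.irfft2(X * Wf, s=(fh, fw))
    top, left = (kh - 1) // 2, (kw - 1) // 2   # center-crop to 'same'
    return y[:, top:top + H, left:left + W]
```

A quick sanity check: a 3x3 kernel with a 1 at its center is the identity under 'same' padding, so the op should return the input unchanged.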

  • Implicit Filter: Implicit modal filter generation for the Hyena architecture.

    • Memory-efficient implementation of implicit modal filter.

    • Forward and backward pass with full autograd support.

    • Built on NVIDIA Warp for tiled kernel launches.
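As a sketch of what an implicit modal filter is (a long filter generated on the fly from a few per-mode parameters rather than stored explicitly), one common parameterization sums damped complex exponentials. The formulation, names, and signature below are illustrative assumptions, not necessarily the library's exact formulation:

```python
import numpy as np

def implicit_modal_filter(residues, poles, length):
    """Generate a length-L filter h[t] = Re( sum_k R_k * p_k**t ).

    Hypothetical modal parameterization; the library's actual formulation
    may differ. residues, poles: (K,) complex arrays, one entry per mode.
    """
    t = np.arange(length)
    return (residues[:, None] * poles[:, None] ** t[None, :]).sum(axis=0).real

R = np.array([0.5 + 0.1j, 0.2 - 0.3j])                            # residues
p = np.array([0.95 * np.exp(1j * 0.3), 0.8 * np.exp(-1j * 1.1)])  # poles, |p| < 1
h = implicit_modal_filter(R, p, 64)
```

Keeping pole magnitudes below 1 gives a decaying filter, and only the 2K complex parameters need to be stored, which is the memory-efficiency argument for the implicit form.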

  • Rearrange: CUDA-accelerated tensor layout transposition between (B, H, L) and (L, B, H) formats with autograd support.
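In NumPy terms the Rearrange op is a transpose plus a copy to contiguous memory; the CUDA kernel's value is performing that copy with coalesced accesses and providing autograd support:

```python
import numpy as np

x_bhl = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)  # (B, H, L)
# (B, H, L) -> (L, B, H): the sequence axis moves to the front.
x_lbh = np.ascontiguousarray(x_bhl.transpose(2, 0, 1))
```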

Platform Support

  • CUDA-compatible NVIDIA GPUs (Ampere and newer).

  • CUDA Toolkit 12.0+.

  • Python 3.11–3.13.