SEC Filings to Alpha

Reasoning-Augmented Factor Extraction from SEC Filings

QLoRA · GRPO · vLLM · DeepSeek-R1-Distill-Qwen-14B · SLURM · ACCRE DGX A100

DSI Capstone with AllianceBernstein | Vanderbilt University | Spring 2026

An end-to-end system that converts 10 years of SEC 10-K and 10-Q filings across 80 industrial tickers into a filing-level sentiment signal. The signal powers a sector-neutral long-short equity strategy delivering an Information Ratio of 2.02 over 318 active trading days. A four-stage pipeline combines iXBRL parsing, chunked question-answering over a 60-question taxonomy, QLoRA fine-tuning of a 14B reasoning model, and GRPO reinforcement-learning alignment.

Headline Result

2.02

Information Ratio, sector-neutral long-short, 318 active days

+5.31pp

Ladder lift from base model to final cohort

11/12

Sector-neutral cells where OOS IR met or beat IS

The Challenge

Institutional investors pay for analyst summaries of SEC filings because the underlying documents are long, inconsistently structured, and written in deliberately hedged language. A decade of 10-Ks and 10-Qs across 80 tickers is roughly 2,500 filings and tens of millions of tokens, far beyond manual review. The question is whether a fine-tuned reasoning model can extract filing-level sentiment that survives honest backtesting as a tradeable factor, not just look good in cross-sectional correlations.

Can a reasoning LLM, fine-tuned and RL-aligned on a 60-question factor taxonomy, produce sentiment signals that deliver a monotone cohort-to-return ladder under sector-neutral long-short construction?

Four-Stage Pipeline

Each filing flows through four stages, chained as resume-safe jobs on an HPC cluster:

  1. Parse iXBRL and legacy HTML 10-K/10-Q filings into MD&A and Risk Factors sections. A BeautifulSoup parser resolves TOC anchors with three fallback boundary strategies (next-anchor regex, body-text regex, 500 KB slice) to handle a decade of inconsistent SEC toolchains.
  2. Apply a 60-question factor taxonomy across 14 categories (11 universal plus 3 sub-sector-specific for airlines, defense, and industrial). A keyword router narrows which questions fire per subsection, cutting LLM calls while preserving coverage.
  3. Fine-tune DeepSeek-R1-Distill-Qwen-14B with QLoRA on 5,217 Claude-Opus-relabeled annotations (80/20 stratified split). Merge the adapter and deploy via vLLM for 50 to 100 times faster inference than PEFT streaming.
  4. Train a GRPO-aligned sentiment policy with a composite ordinal-plus-anti-neutral reward (to fight 30 percent neutral-class dominance), then layer Best-of-N self-consistency voting at T=0.8, N=3 as a test-time-compute booster.
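After parsing each sampled completion into a sentiment label, the Best-of-N booster in stage 4 reduces to a majority vote over the N labels. A minimal sketch of that voting step (function name and the tie-break fallback are illustrative, not taken from the project code):

```python
from collections import Counter

def self_consistency_vote(labels, fallback="neutral"):
    """Majority vote over N sampled sentiment labels (Wang et al. 2022).

    `labels` holds the parsed labels from N independent samples (here
    N=3 at T=0.8); an exact tie falls back to `fallback`.
    """
    ranked = Counter(labels).most_common()
    # Tie between the top two labels: no majority, return the fallback.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]
```

With N=3 a tie can only occur when all three samples disagree, so the fallback fires rarely in practice.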

Results

Monotone Cohort Ladder

Top-quintile minus bottom-quintile return, sector-neutral construction. Every training stage strictly improves on the prior cohort.

Base model

2.78%

Off-the-shelf reasoning LLM, zero fine-tuning

SFT (QLoRA)

4.88% (+2.10pp)

Supervised fine-tune on relabeled annotations

SFT + GRPO + Best-of-N

8.09% (+3.21pp)

Final model, monotone lift over every prior cohort

Net lift of +5.31pp from untouched reasoning model to fully aligned policy. Newey-West HAC standard errors applied; returns CAPM-decomposed; transaction-cost sensitivity swept.
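For context, the annualized IR and a Newey-West HAC standard error on the mean daily long-short return could be computed along these lines. This is a numpy-only sketch with a Bartlett kernel; the lag count and function names are assumptions, not the project's actual code:

```python
import numpy as np

def info_ratio(daily_ls_returns, periods_per_year=252):
    """Annualized IR of a daily long-short return series."""
    r = np.asarray(daily_ls_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def newey_west_se(daily_ls_returns, lags=5):
    """HAC standard error of the mean return (Bartlett-kernel weights),
    robust to autocorrelation in the daily long-short series."""
    e = np.asarray(daily_ls_returns, dtype=float)
    e = e - e.mean()
    n = len(e)
    s = e @ e / n  # lag-0 autocovariance
    for k in range(1, lags + 1):
        w = 1 - k / (lags + 1)          # Bartlett taper
        s += 2 * w * (e[k:] @ e[:-k] / n)
    return np.sqrt(s / n)
```

Dividing the mean daily return by `newey_west_se` gives a HAC t-stat comparable to a statsmodels OLS-on-a-constant fit with `cov_type="HAC"`.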

48-Cell Backtest Grid

4 strategies × 4 model variants × 3 horizons, evaluated on 605 held-out filings. Each square is one backtest cell.

44 cells passed; 4 cells muted. In 11 of 12 sector-neutral cells, out-of-sample IR met or exceeded in-sample IR.

Candor

Defense tickers showed a wrong-sign signal: a Spearman correlation of −0.146 between predicted sentiment and forward returns (p = 0.007, n = 337).

Rather than sweep it under the rug, I surfaced it as a production rule: invert or exclude defense from long-short books, and document the pattern with IC-ranked factor attribution. An honest backtest beats a flattering one.
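The per-sector IC check that surfaced the defense anomaly can be sketched as a grouped Spearman correlation. Column and function names here are hypothetical, not from the project code:

```python
import pandas as pd
from scipy.stats import spearmanr

def sector_ic(df, score_col="pred_sentiment", ret_col="fwd_return"):
    """Spearman IC of predicted sentiment vs. forward returns, per sector.

    A significantly negative rho (as observed for defense: rho = -0.146,
    p = 0.007, n = 337) triggers the invert-or-exclude production rule.
    """
    rows = []
    for sector, g in df.groupby("sector"):
        rho, p = spearmanr(g[score_col], g[ret_col])
        rows.append({"sector": sector, "rho": rho, "p": p, "n": len(g)})
    return pd.DataFrame(rows)
```

Ranking sectors by this IC also feeds the factor-attribution report mentioned above.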

Hardest Problems Solved

  1. Failed learned ordinal verifier. The original Best-of-N scorer used a CORN ordinal-regression head. Validation revealed silent-failure modes. Pivoted to zero-parameter self-consistency voting (Wang et al. 2022), preserving the +3.2 percentage-point lift over SFT-only while cutting inference compute and removing the silent-failure risk.
  2. OOM incident on vLLM cluster. Under higher concurrency on DGX A100 40GB, the Task 2 pipeline OOMed at max_tokens=1024. Rewrote it end-to-end as resume-safe: per-filing checkpointing, bounded concurrency semaphores, 5-attempt exponential backoff (2 to 32 seconds), 120-second per-request timeouts, and live progress bars.
  3. Inconsistent SEC document structure. A decade of filings mixed iXBRL, legacy HTML, and a shifting mix of <a name> and <div id> anchor styles. Built a three-tier boundary-resolution fallback so section extraction survived the worst-formatted filings without dropping coverage.
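The bounded-concurrency and backoff pattern from the OOM fix can be sketched with asyncio. Names are illustrative and per-filing checkpointing is omitted for brevity; the retry schedule matches the 5-attempt, 2-to-32-second, 120-second-timeout design above:

```python
import asyncio

async def call_with_retries(fn, *, attempts=5, base_delay=2.0,
                            max_delay=32.0, timeout=120.0):
    """Retry an async LLM call with exponential backoff (2, 4, 8, 16, 32 s)
    and a hard per-request timeout."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout)
        except Exception:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the failure
            await asyncio.sleep(min(base_delay * 2 ** attempt, max_delay))

async def run_filings(filings, worker, max_concurrency=8):
    """Process filings under a bounded-concurrency semaphore so the
    vLLM server never sees more than `max_concurrency` in-flight requests."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(filing):
        async with sem:
            return await call_with_retries(lambda: worker(filing))

    return await asyncio.gather(*(guarded(f) for f in filings))
```

In the real pipeline each completed filing would also write a checkpoint so a preempted SLURM job resumes where it left off.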

Scale and Metrics

2,441

SEC Filings

67,741

Factor Observations

80

Industrial Tickers

10 yrs

Coverage (2015 to 2025)

60

Taxonomy Questions

14B

Model Parameters

48

Backtest Cells

5,217

Labeled Annotations

Infrastructure

Inference ran on a vLLM-on-ACCRE stack (DGX A100 40GB and RTX A6000) with OpenAI-compatible clients, bounded concurrency, 5-attempt exponential backoff, and per-request 120s timeouts. Authored SLURM sbatch launch scripts and a Singularity-containerized vLLM startup, tuned separately for A100 40GB and 80GB nodes, with ACCRE-specific path resolution and HuggingFace cache placement on fast scratch storage.

Tech Stack

Python · PyTorch · Transformers · PEFT (QLoRA) · TRL (GRPO) · vLLM · DeepSeek-R1 · Claude Opus · BeautifulSoup · Pydantic · pandas · NumPy · SciPy · statsmodels · Jupyter · CUDA · Singularity · SLURM · Git

Team and Role

Three-person capstone team with per-person ticker partitioning. I led the project as sole author of 58 of 73 repository commits, drove the pipeline architecture, fine-tuning and RL alignment, and the backtest framework, and authored the handoff documentation and final technical report.

Partner: AllianceBernstein · Academic Sponsor: Vanderbilt Data Science Institute