08/16/25
Columbia Data Management at VLDB 2025 in London!

This August, Columbia University’s Data Management Group will be participating in VLDB 2025 in London! We are presenting three research papers and are involved in two panels (as organizer and participant). Highlights below.

Main Conference Papers

Suna: Scalable Causal Confounder Discovery over Relational Data. Liu et al.

Jerry developed a new system called Suna for scalable causal confounder discovery, appearing at VLDB 2025. Causal analysis is notoriously fragile when key confounders are missing, which leads to wrong conclusions and bad decisions. While vast relational data repositories could supply these missing variables, existing methods either require a complete causal diagram (unrealistic), don’t scale, or fail when a causal effect actually exists. Suna changes the game: it directly identifies an admissible set of confounders for a causal query without needing the full causal diagram, and it scales to millions of attributes.

Jerry’s key insight is a new connection between confounder existence and unconfounded ancestors of the treatment variable, which lets us iteratively uncover just the minimal confounders needed. Combining this theorem with factorized learning and GPU acceleration lets Suna find high-quality, interpretable confounders >100× faster than any alternative system. For the first time, it’s possible to have accurate causal answers interactively, even over massive repositories.

FaDE: More Than a Million What-ifs Per Second. Mohammed et al.

Haneen’s FaDE brings what-if analytics into the realm of true interactivity. Asking “what if we removed this region’s sales?” or “what if we doubled ad spend?” is central to decision-making, root-cause analysis, and explanations. However, answering such questions has traditionally been prohibitively slow: each hypothetical change (intervention) often requires re-running complex queries, and even small scenarios can involve huge search spaces, making interactive exploration nearly impossible. Prior work needs to make careful algorithmic trade-offs or rely on prior knowledge, which limits its applicability.

FaDE changes that! Building on our recent SmokedDuck extension to DuckDB, which provides nearly-free provenance capture, FaDE introduces a provenance-based intervention engine that can evaluate up to 1 million what-if questions per second. Instead of approximations or pruning, it simply brute-forces through massive intervention spaces: fast enough to support sensitivity analysis, rich explanations, and exploratory what-if modeling on real SQL queries, all at interactive speed!
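
To make the idea of provenance-based what-ifs concrete, here is a toy Python sketch. It is not FaDE's implementation and does not use SmokedDuck or DuckDB; the table, the function names (capture_lineage, what_if_delete, what_if_remove_region), and the data are all made up for illustration. It only shows the general pattern: run the query once while recording which input rows feed each output group, then answer deletion-style what-ifs from that lineage instead of re-running the query.

```python
from collections import defaultdict

# Hypothetical toy table: (region, product, amount). Values are invented.
SALES = [
    ("east", "widget", 100.0),
    ("west", "widget", 250.0),
    ("east", "gadget", 50.0),
    ("north", "gadget", 75.0),
]


def capture_lineage(rows):
    """Simulate running SUM(amount) GROUP BY product once, recording
    which input row ids contribute to each output group."""
    lineage = defaultdict(list)
    for row_id, (_region, product, _amount) in enumerate(rows):
        lineage[product].append(row_id)
    return dict(lineage)


def what_if_delete(rows, lineage, deleted_ids):
    """Recompute each group's sum as if the given rows were deleted,
    touching only the rows listed in the captured lineage."""
    deleted = set(deleted_ids)
    return {
        product: sum(rows[i][2] for i in ids if i not in deleted)
        for product, ids in lineage.items()
    }


def what_if_remove_region(rows, lineage, region):
    """'What if we removed this region's sales?' as a deletion intervention."""
    deleted = [i for i, row in enumerate(rows) if row[0] == region]
    return what_if_delete(rows, lineage, deleted)


LINEAGE = capture_lineage(SALES)
BASELINE = what_if_delete(SALES, LINEAGE, [])

# Brute force: evaluate one what-if per region, reusing the same lineage.
REGIONS = sorted({row[0] for row in SALES})
ALL_WHAT_IFS = {r: what_if_remove_region(SALES, LINEAGE, r) for r in REGIONS}
```

In this sketch every intervention is a cheap pass over the captured lineage, which is why brute-forcing one what-if per region stays simple. A real engine would presumably batch and vectorize this work, but those details are beyond what this post describes.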

HoliPaxos: Towards More Predictable Performance in State Machine Replication. Liang et al.

Vahab contributed to HoliPaxos before starting his PhD at Columbia. HoliPaxos revisits classical MultiPaxos with the goal of improving performance reliability in state machine replication. Rather than redesigning consensus protocols from scratch, HoliPaxos introduces lightweight mechanisms that make Paxos-based systems more robust under non-ideal conditions, such as node slowdowns, partial network partitions, and log management overheads. The work proposes a self-monitoring leader mechanism that detects overload or degraded performance and initiates graceful leader handoff, providing 1-slowdown tolerance without the complexity of dual-leader protocols. It further introduces new failure-detection rules that address leader churn in partial partitions without imposing additional runtime costs, ensuring stable leadership in challenging connectivity scenarios. Finally, it advocates an on-demand recovery approach to log management that reduces performance variability and resource waste compared to traditional periodic snapshotting. By focusing on practical improvements that preserve the simplicity and stability of MultiPaxos, HoliPaxos offers a more predictable and resilient foundation for modern replicated systems.

DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. Shankar et al.

LLMs have been a game changer for structured extraction from PDFs, and there have been a huge number of academic projects and startups tackling this problem. A lot of the focus has been on beating benchmarks, and while the progress has been tremendous, LLMs are so new that the actual ecosystem of document tasks and applications is still evolving rapidly. And so, it’s not even clear which benchmark designs and metrics are actually correlated with real needs.

Now, I could talk up DocETL’s system abstractions, workflow patterns, and rewrite directives, but IMO the coolest part of Shreya’s DocETL work is that it embodies a litany of subtle engineering decisions that make the framework useful. Further, she’s created a flywheel of a community that brings their documents and tasks to the platform, which motivates and benefits the research work, which then folds back to the user community. I’m grateful to have joined in a bit while visiting Berkeley during my sabbatical.

Panels

VLDB Panel: “Where Does Academic Database Research Go From Here? 2nd Edition”

We already did the work for SIGMOD, so when the VLDB panel organizers asked us to organize a panel, Raul and I readily agreed. The panel’s goal is again to explore the comparative advantage of academic database research in today’s budgetary, industry, and technology climate. It picks up where the SIGMOD panel left off. While the SIGMOD panel consisted of more senior members of the community, the VLDB panel is considerably more varied in seniority, ranging from PhDs to ACM Fellows.

PhD Workshop Panel: “Adventure and Beauty in Data Management Research”

Eugene is excited to participate in a panel at the VLDB PhD Workshop, largely to learn from the other panelists and attend some of the student talks!