Genomics

From CSVs to Iceberg: Scaling a Genomics ETL Pipeline for ML Training on a Budget

14 December 2025·7 mins

Data-Engineering Apache-Iceberg Aws-Athena Parquet Genomics

How replacing a CSV-join pipeline with Apache Iceberg and a long-format data model cut ETL run time from ~15 minutes to about one minute.

Using Learning Curves to Know Whether More Data Will Help

10 March 2025·4 mins

Machine-Learning Sample-Size Genomics Methodology

Learning curves won’t tell you exactly how many samples to collect, but they will tell you whether collecting more is worth it at all. In domains where each sample costs real money, that’s the question that actually matters.

When Random Features Work Just as Well

20 January 2025·3 mins

Machine-Learning Feature-Selection Genomics Dimensionality

On the counterintuitive finding that randomly selecting features from high-dimensional genomic data often matches the performance of careful feature engineering, and why that makes mathematical sense.
