[{"content":"","date":"14 December 2025","externalUrl":null,"permalink":"/tags/apache-iceberg/","section":"Tags","summary":"","title":"Apache-Iceberg","type":"tags"},{"content":"","date":"14 December 2025","externalUrl":null,"permalink":"/tags/aws-athena/","section":"Tags","summary":"","title":"Aws-Athena","type":"tags"},{"content":"","date":"14 December 2025","externalUrl":null,"permalink":"/tags/data-engineering/","section":"Tags","summary":"","title":"Data-Engineering","type":"tags"},{"content":"We had a data ingestion pipeline to create wide-format matrix that worked perfectly well for hundreds of samples. But scaling this to thousands of samples would have broken it entirely. As a start-up, services like Snowflake and Databricks were out of our budget. So, I had to make the most of comparatively cheaper, native solutions in AWS. This post covers how I used long-formats and Apache Iceberg to build a solution that could scale.\nThe Data # The domain is DNA methylation: measuring chemical modifications at specific positions (CpG sites) across the genome. Each sample produces a file with roughly 4 million rows: one row per genomic coordinate, with a beta value (0–100) representing the methylation level at that position. Each file is about 100 MB as a CSV.\nThe typical output would produce a perculiar wide-format matrix: CpG coordinates as rows, samples as columns, beta values as cells.\nThe Legacy Approach # The existing pipeline worked like this:\nRead individual per-sample CSV files from S3 via temporary Athena external tables Run a CTAS (Create Table As Select) query that pivots the data into wide format: CpG loci as rows, samples as columns Write the result as a static Parquet file This produced a correct output, but had fundamental scaling problems:\nSchema explosion on new samples. Every new sample added a new column to the wide-format output. Adding one sample meant re-running the entire CTAS, rewriting the full file. 
The SQL queries themselves grew so long they hit API limits, requiring intermediate Parquet files to be created before a final join.\nNo incremental updates. The output was a static snapshot. One file, fixed schema, fixed set of samples. To add a single sample, you had to recreate it from scratch.\nAthena performance degrades with wide format. Like relational databases, Athena is optimised for tall, narrow tables, not matrices with hundreds or thousands of columns. As sample count grew, the wide-format queries became increasingly slow. At scale, the query engine was fighting the shape of the data.\nFor defined client projects, where we analysed a few hundred samples at most and could \u0026ldquo;run once, analyse, archive\u0026rdquo;, this was adequate. For continuous sample ingestion to train ML pipelines, it was not.\nThe Architecture # The new design replaces the wide-format-first approach with a three-stage pipeline:\n1. Ingest CSVs into Parquet # Rather than querying raw CSVs at analysis time, the first stage converts each sample\u0026rsquo;s CSV into Parquet as soon as it arrives.\n2. Store in Long Format with Iceberg # The core table uses long format rather than wide format:\ncpg_coordinate sample_id beta_value\nchr1:100-102 sample1 73.33\nchr1:102-104 sample1 100.0\nchr1:104-106 sample1 68.91\nOne row per locus per sample. The schema is fixed: it never changes, no matter how many samples exist. Adding a new sample means inserting new rows, not adding new columns.\nThis table is managed by Apache Iceberg, which provides the metadata and transaction layer on top of Parquet files in S3.\n3. Generate Wide Format On Demand # When a wide-format matrix is needed for ML training, a single GROUP BY aggregation query pivots from long to wide. Athena\u0026rsquo;s distributed engine touches each row exactly once regardless of sample count: no multi-stage joins, no intermediate files.\nWhy Iceberg # Iceberg isn\u0026rsquo;t just a storage format. 
It\u0026rsquo;s a metadata and transaction layer that sits between the query engine (Athena) and the data files (Parquet in S3). The architecture has three layers:\nCatalog (AWS Glue in our case): holds a single mutable pointer per table: the path to the current metadata file. This is the only thing that changes during a write.\nMetadata Layer: consists of three levels of files, all written to S3 alongside the data:\nMetadata file (.metadata.json): the table\u0026rsquo;s complete state at a point in time. Contains the schema, partition spec, and a reference to a manifest list. Every write (INSERT, DELETE, OPTIMIZE) creates a new metadata file. The catalog atomically swaps its pointer to the new file - this is what makes writes atomic. Manifest list (one per snapshot S0, S1, \u0026hellip;): a list of all the manifest files that make up this snapshot. Crucially, snapshots share manifest files. When you insert a new sample, only a new manifest file is added for those new data files; all existing manifest files from the previous snapshot are reused. This is how time-travel works without duplicating data. Manifest files: each describes a subset of Parquet data files: which partition they belong to, their row counts, and min/max statistics for every column. Athena uses these statistics to skip files that can\u0026rsquo;t satisfy a query\u0026rsquo;s WHERE clause without opening them (partition pruning). Data layer — the actual Parquet files, never modified in place. Deletes write a separate delete file; the original data file remains.\nThis architecture gives us several things that plain Athena-over-Parquet cannot provide:\nACID Transactions # Concurrent reads and writes are safe. 
No risk of reading a half-written table or corrupting data with overlapping queries.\nRow-Level Deletes # A customer right-to-erasure request becomes a SQL statement:\nDELETE FROM methylation WHERE sample_id = \u0026#39;sample1\u0026#39;\nNo ETL pipeline to find the original Parquet file, no manual manifest management, no risk of deleting the wrong data. Iceberg writes a delete file; the original data is preserved until explicitly vacuumed.\nHidden Partitioning # Iceberg partitions data physically (grouping rows with the same partition key into the same files) but manages the mapping through metadata. Unlike Hive-style partitioning in Athena, users don\u0026rsquo;t need to include partition columns in their WHERE clauses for pruning to work. Athena consults the manifest statistics and skips irrelevant files automatically - the metadata layer handles everything.\nThis means we can change the partitioning strategy (e.g., switching from batch-based to chromosome-based) without rewriting queries or telling users anything changed.\nTime Travel # Every write creates a new snapshot. Previous snapshots remain accessible, making it possible to audit what the table looked like before a deletion. It\u0026rsquo;s useful for regulatory compliance, though not sufficient on its own for data compliance (you\u0026rsquo;d still need CloudTrail-level audit logging for who, when, and why).\nPerformance Results # The benchmark compared the two approaches on the same data: merging 148 samples into a single wide-format Parquet file for analysis.\nApproach Time\nLegacy 857 seconds (14 min)\nIceberg approach 62 seconds\nA 14x speedup — and the gap widens with sample count because the Iceberg approach doesn\u0026rsquo;t re-read or re-join existing data when new samples are added.\nThe performance difference comes from three compounding factors:\nParquet vs CSV at read time. Columnar format with compression and predicate pushdown vs. line-by-line text parsing. Single aggregation vs. 
cascading joins. The wide-format pivot is a single GROUP BY over the long-format table, not a chain of pairwise sample joins that grow quadratically.\nMetadata-driven file skipping. Athena uses Iceberg manifest statistics to skip files that can\u0026rsquo;t match the query, scanning only the relevant partition.\nTrade-offs and Limitations # Iceberg adds operational complexity. The metadata layer needs to be understood by the team. Concepts like snapshots, manifest files, and vacuum are new to anyone used to \u0026ldquo;just put Parquet files in S3.\u0026rdquo;\nVACUUM is required for true GDPR compliance. Row-level deletes mark data as deleted in the metadata, but the underlying Parquet bytes remain until a VACUUM operation physically removes expired snapshots. For a right-to-erasure request, you need both the DELETE and a subsequent VACUUM.\nWide-format generation is now a query, not a file. The legacy approach produced a static file that analysts could download and work with offline. The Iceberg approach generates wide format on demand via a query. For ML training pipelines that expect a file as input, this means adding a materialisation step.\nLong-format tables are larger. Storing sample_id on every row is more verbose than a single column header in wide format. Parquet compression mitigates this significantly, but the raw row count scales as samples × loci (in our case, 4 million × sample count).\nKey Takeaways # Data model matters. Switching from wide-format-first to long-format-first was the biggest win. Long format eliminates schema changes, enables row-level operations, and turns a quadratic join problem into a linear scan.\nInvest in format conversion early. Converting CSVs to Parquet on arrival is a small upfront cost that pays dividends on every downstream query. Don\u0026rsquo;t let raw text formats persist in analytical paths.\nIceberg is worth the complexity for mutable data. 
If your data only ever grows and is never deleted or updated, plain Parquet with Hive-style partitioning may be sufficient. The moment you need deletes, updates, or schema evolution, Iceberg earns its keep. ","date":"14 December 2025","externalUrl":null,"permalink":"/projects/csv-to-iceberg-methylation-analytics/","section":"Projects","summary":"How replacing a CSV-join pipeline with Apache Iceberg and a long-format data model cut an ETL pipeline from ~15 minutes to a minute","title":"From CSVs to Iceberg: Scaling a Genomics ETL Pipeline for ML Training on a budget","type":"projects"},{"content":"","date":"14 December 2025","externalUrl":null,"permalink":"/tags/genomics/","section":"Tags","summary":"","title":"Genomics","type":"tags"},{"content":"I am a computational biologist with a PhD in metagenomics and antimicrobial resistance, with experience building production ML systems and data infrastructure for biotech and genomics. I have worked at the forefront of pathogen genomic and microbiome research at institutions including the Wellcome Sanger Institute and The Alan Turing Institute, before joining a Khosla Ventures-backed epigenetics startup. 
My current interest focuses on applying transformer-based deep learning to metagenomic sequence data for mobile antimicrobial resistance detection.\n","date":"14 December 2025","externalUrl":null,"permalink":"/","section":"Home","summary":"","title":"Home","type":"page"},{"content":"","date":"14 December 2025","externalUrl":null,"permalink":"/tags/parquet/","section":"Tags","summary":"","title":"Parquet","type":"tags"},{"content":"Selected projects covering ML engineering, deep learning, and applied bioinformatics.\n","date":"14 December 2025","externalUrl":null,"permalink":"/projects/","section":"Projects","summary":"","title":"Projects","type":"projects"},{"content":"","date":"14 December 2025","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"1 September 2025","externalUrl":null,"permalink":"/tags/aws/","section":"Tags","summary":"","title":"Aws","type":"tags"},{"content":"When our data science team outgrew ad hoc model training on shared EC2 instances, I built an internal MLOps platform from scratch. A central piece was deploying MLflow as a self-hosted, authenticated experiment tracking service with Terraform and a Python toolkit wrapping it for daily use.\nWhy Self-Hosted MLflow # At the beginning, the data team had no experiment tracking at all. Training results lived in notebooks, Google docs, Confluence, or Slack messages. There was no way to reproduce a run, compare model versions, or trace a deployed model back to the data and parameters that produced it.\nMLflow was the natural choice: open-source, framework-agnostic, and well-integrated with the Python ML ecosystem the team already used. But a managed MLflow service wasn\u0026rsquo;t an option: the data includes patient-derived biological samples. 
Self-hosting was the only option that met both the security requirements and our budget.\nInfrastructure # I packaged the entire deployment as a reusable Terraform module: ECS Fargate running the MLflow server, RDS PostgreSQL for experiment metadata, S3 for model artifacts, and Secrets Manager for credentials.\nThe Tracking Library # Rather than having scientists write raw mlflow.log_param() and mlflow.log_metric() calls, I included a tracking library in the ML Toolkit package I built that wraps MLflow with decorators and helpers designed for our specific workflow.\nInitialisation # A single mlflow_init function handles all setup: tracking URI, experiment selection, and autologging configuration:\nfrom mitrabio.ml_tools.tracking.mlflow_tools import mlflow_init\nmlflow_init(\n    tracking_server_host=\u0026#34;\u0026lt;analytics url\u0026gt;\u0026#34;,\n    experiment_name=\u0026#34;clock_v3_training\u0026#34;,\n    autolog=True,\n    big_schema=True,\n)\nThe big_schema flag is critical for our use case. With 500,000 methylation features per sample, MLflow\u0026rsquo;s default autologging tries to capture model signatures and input examples, which fails or produces enormous payloads. When big_schema=True, the library disables signature inference, input example logging, and dataset metadata logging. Instead, it logs a compressed JSON schema artifact containing column names and dtypes. 
The model itself is logged with a dummy placeholder signature.\nTracking Decorators # The library provides decorators that wrap training and evaluation functions with MLflow tracking:\nfrom mitrabio.ml_tools.tracking.mlflow_tools import mlflow_track_regressor\n@mlflow_track_regressor(run_name=\u0026#34;elasticnet_cv\u0026#34;, big_schema=True, evaluate=True)\ndef train_model(X_train, y_train, X_test, y_test):\n    model = ElasticNetCV(l1_ratio=0.5, n_alphas=100, cv=5)\n    model.fit(X_train, y_train)\n    return model\nThe decorator handles:\nRun lifecycle: creates a nested MLflow run, captures the return value\nModel detection: inspects whether the returned object is a fitted estimator or a GridSearchCV result (extracting best_estimator_ if so)\nModel logging: logs the model with a placeholder signature when big_schema=True, and saves the input schema as a compressed artifact\nEvaluation: after the run closes, runs mlflow.models.evaluate against the test set, producing regression or classification metrics and diagnostic plots\nModel Retrieval # The library includes helpers for locating logged models by run ID or model ID, returning the S3 artifact path needed for downstream scoring pipelines:\nfrom mitrabio.ml_tools.tracking.mlflow_tools import get_model_s3_path\nartifact_path, run_id = get_model_s3_path(run_id=\u0026#34;abc123\u0026#34;)\nThis is used by the production scoring pipeline to load approved model versions without needing to know the underlying S3 structure.\n","date":"1 September 2025","externalUrl":null,"permalink":"/projects/private-mlops-platform-aws/","section":"Projects","summary":"How I deployed MLflow as an authenticated experiment tracking server on AWS and integrated it into a reusable ML toolkit.","title":"Building a Private MLOps Platform on AWS","type":"projects"},{"content":"","date":"1 September 2025","externalUrl":null,"permalink":"/tags/machine-learning/","section":"Tags","summary":"","title":"Machine-Learning","type":"tags"},{"content":"","date":"1 September 2025","externalUrl":null,"permalink":"/tags/mlflow/","section":"Tags","summary":"","title":"Mlflow","type":"tags"},{"content":"","date":"1 September 2025","externalUrl":null,"permalink":"/tags/mlops/","section":"Tags","summary":"","title":"Mlops","type":"tags"},{"content":"","date":"1 September 2025","externalUrl":null,"permalink":"/tags/terraform/","section":"Tags","summary":"","title":"Terraform","type":"tags"},{"content":"","date":"23 June 2025","externalUrl":null,"permalink":"/tags/ci-cd/","section":"Tags","summary":"","title":"Ci-Cd","type":"tags"},{"content":"I work on the methylation sequencing data-processing pipeline at my current company. It takes raw sequencing reads and produces methylation calls: the quantitative measurements that feed every downstream model and product. If this pipeline breaks, everything downstream breaks. If it produces subtly wrong results, every model trained on that data is compromised.\nWhen I joined, the pipeline existed but the CI/CD around it didn\u0026rsquo;t. Deployments were manual, testing was ad hoc, and there was no separation between development and production environments. This post describes the CI/CD system I built to change that.\nThe Pipeline # The pipeline is written in Nextflow (DSL2), a workflow manager designed for computational genomics. It orchestrates roughly a dozen bioinformatics tools, each running in its own Docker container. The controller (the Nextflow process itself) runs as an ECS task on AWS, triggered by a Lambda function. Each bioinformatics step runs in a separate container pulled from ECR.\nThis architecture means a deployment involves multiple container images: one controller image and ten or more module images, each independently versioned. The CI/CD system has to handle all of them.\nThe Testing Strategy # The first thing I established was a layered testing strategy. 
Every pull request to main triggers four levels of tests, running in parallel.\nCompliance Checks # Pre-commit hooks run across the entire codebase on every PR. (Useful when a developer forgets to set up pre-commit locally.) These cover:\nPython linting and formatting (Ruff) — catches style issues and common errors\nType checking (mypy) — enforces type annotations on the Python modules in the pipeline\nSecurity scanning (Bandit) — static analysis for common Python security issues like hardcoded credentials or unsafe deserialization\nNextflow linting — validates DSL2 syntax across all pipeline modules and config files\nGeneral hygiene — trailing whitespace, YAML/TOML validation, large file detection\nThis catches the majority of trivial issues before any of the following compute-intensive tests run.\nUnit Tests # Python unit tests run inside the pipeline\u0026rsquo;s own Docker container, pulled from ECR. This ensures tests execute in the same environment as production. Coverage reports are uploaded to Codecov for tracking.\nRunning tests inside the production container is a deliberate choice. It catches dependency mismatches that would slip through if tests ran in a clean CI environment with separately installed packages.\nIntegration Tests # Each pipeline module has its own nf-test suite: a testing framework purpose-built for Nextflow. Integration tests run the actual bioinformatics tools against small test datasets stored in S3, verifying that each module produces the expected outputs.\nThese tests run as a matrix build: one parallel job per module, each pulling its container from ECR and executing against the development AWS account. This parallelisation keeps the total test time manageable despite the number of modules.\nSmoke Tests # A final smoke test validates that the full pipeline can be parsed and initialised with manifest inputs. 
This catches configuration errors, missing parameters, and broken module imports that wouldn\u0026rsquo;t surface in isolated unit or integration tests.\nContainer Security Scanning # In parallel with the functional tests, Trivy scans the controller Docker image for HIGH and CRITICAL vulnerabilities. This runs on every PR, blocking merge if fixable vulnerabilities are found. The scan uses a centralised Trivy wrapper action that pins the scanner version across all repositories, so vulnerability detection is consistent and version upgrades happen in one place.\nThe Deployment Model # The pipeline deploys across three environments, each in its own AWS account (development, staging, and production), with different triggers and gates at each stage.\nDevelopment: Automatic on Merge # Every merge to main triggers an automatic deployment to development. The workflow builds the controller image, tags it with git-{short_sha} (an immutable tag tied to the exact commit), pushes it to the development ECR registry, and updates the ECS task definition. The Lambda function that triggers pipeline runs automatically picks up the new task definition revision.\nThere\u0026rsquo;s a deliberate design choice here: the development deployment only updates the ECS task definition - it doesn\u0026rsquo;t touch infrastructure. Infrastructure changes go through a separate Terraform workflow. This decoupling means application deployments are fast (under 2 minutes) and can\u0026rsquo;t accidentally break networking, IAM, or storage configuration.\nStaging: Automatic on Release Tag # Creating a semantic version tag (e.g., v2.1.0) triggers the release workflow. 
This:\nValidates the tag — confirms it\u0026rsquo;s a valid semver tag pointing to a commit that\u0026rsquo;s reachable from main (preventing releases from feature branches)\nValidates the release manifest — checks that image_versions.json (which pins every module image version) is present and valid\nEnsures the image exists — looks for the git-{sha} image in ECR, building it if somehow missing\nRetags the image — adds the semantic version tag to the existing image without rebuilding (the same image digest, just a new tag)\nPromotes to staging — promotes the controller image and all module images from the development ECR registry to the staging ECR registry using Skopeo*\nRuns end-to-end tests — executes the full pipeline against staging infrastructure with real test data\nDeploys to staging — updates the staging ECS task definition\n*The image promotion step deserves detail. Each AWS environment has its own ECR registry in a separate AWS account. Promoting an image means copying it between registries without rebuilding — preserving the exact image digest. The promotion action uses Skopeo for digest-safe, multi-architecture copies and includes safety checks: it refuses to overwrite an existing tag that points to a different digest, preventing accidental image replacement.\nProduction: Manual with Gates # Production deployment is triggered manually via workflow_dispatch, selecting the target environment and running from a semantic version tag. The workflow:\nValidates the tag\nPromotes all images (controller + modules) from the source registry to the production registry\nUpdates the production ECS task definition\nThe manual trigger is intentional. Production deployments should be a conscious decision, not an automatic side effect of tagging. 
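The tag validation used by both the staging and production workflows reduces to a semver pattern check. A simplified sketch of what such a check looks like (illustrative only; the real parse-semver-tag action presumably handles more, such as pre-release suffixes):

```python
import re

# v<major>.<minor>.<patch>, e.g. v2.1.0 -- the release tag shape used here
SEMVER_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def parse_release_tag(tag):
    """Return (major, minor, patch) for a valid release tag, else None."""
    m = SEMVER_TAG.match(tag)
    return tuple(int(g) for g in m.groups()) if m else None
```

Anything that does not match is rejected before any image is promoted.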
GitHub environment protection rules provide an additional approval gate.\nReusable Actions # A key architectural decision was extracting common CI/CD logic into a shared .github repository of reusable composite actions and reusable workflows. This means every pipeline and service in the organisation uses the same building blocks:\nsetup-aws-ecr: configures OIDC-based AWS credentials and logs into ECR\nbuild-and-push: builds a Docker image with layer caching, optional Trivy scanning, and image pushes (skips if the tag already exists)\npromote-ecr-image-skopeo: copies images between ECR registries across AWS accounts\nretag-image: adds semantic version tags to existing images without rebuilding\nrun-trivy: centralised Trivy wrapper with pinned version\nparse-semver-tag: validates semantic version tags\necs-task-update: updates ECS task definitions without touching infrastructure\nThis shared library means bug fixes and security patches to CI/CD logic propagate automatically to every repository. It also enforces consistency: every team\u0026rsquo;s deployment uses the same image promotion logic, the same security scanning, the same tag validation.\nWhat This Solved # Before this system:\nDeployments were manual operations\nThere was no way to know if a change broke a pipeline module without running the full pipeline on real data\nContainer images were rebuilt in each environment, introducing the possibility of non-reproducible builds\nThere was no security scanning\nRolling back meant trying to remember what was deployed before\nAfter:\nEvery PR gets automated testing (compliance, smoke, unit and integration) before it can be deployed to development, and each release to staging gets an E2E test\nEvery container image is scanned for known vulnerabilities\nDeployments are \u0026ldquo;one-click\u0026rdquo; (staging/production) or automatic (development)\nThe same image digest flows from development through staging to production\nRolling back means redeploying a previous semantic version tag\nAll CI/CD logic is shared and centrally maintained ","date":"23 June 2025","externalUrl":null,"permalink":"/projects/ci-cd-genomic-data-pipeline/","section":"Projects","summary":"How I built a CI/CD system for a Nextflow methylation sequencing pipeline: from pre-commit linting through four layers of testing to automated promotion across development, staging, and production, all backed by reusable GitHub Actions and container image promotion via ECR.","title":"CI/CD for a Genomic Data Pipeline: Testing, Security, and Multi-Environment Deployment","type":"projects"},{"content":"","date":"23 June 2025","externalUrl":null,"permalink":"/tags/devops/","section":"Tags","summary":"","title":"Devops","type":"tags"},{"content":"","date":"23 June 2025","externalUrl":null,"permalink":"/tags/docker/","section":"Tags","summary":"","title":"Docker","type":"tags"},{"content":"","date":"23 June 2025","externalUrl":null,"permalink":"/tags/github-actions/","section":"Tags","summary":"","title":"Github-Actions","type":"tags"},{"content":"","date":"23 June 2025","externalUrl":null,"permalink":"/tags/nextflow/","section":"Tags","summary":"","title":"Nextflow","type":"tags"},{"content":"Thoughts, learnings, and notes on ML engineering, data infrastructure, and computational biology.\n","date":"10 March 2025","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":"","date":"10 March 2025","externalUrl":null,"permalink":"/tags/methodology/","section":"Tags","summary":"","title":"Methodology","type":"tags"},{"content":"","date":"10 March 2025","externalUrl":null,"permalink":"/tags/sample-size/","section":"Tags","summary":"","title":"Sample-Size","type":"tags"},{"content":"In applied machine learning, especially in fields like genomics where data collection is expensive, one of the most consequential decisions you make is whether to invest in more data or a better model.\nLearning curves are a diagnostic that helps you answer this 
empirically. They won\u0026rsquo;t predict exactly what happens at a sample size you haven\u0026rsquo;t reached, but they will tell you whether you\u0026rsquo;re in a regime where more data is likely to help, and that\u0026rsquo;s usually the decision that matters.\nWhat a Learning Curve Is # A learning curve plots model performance as a function of training set size. You train the same model on progressively larger subsets of your data, recording both the training score and the validation score at each size.\nThe relationship between these two lines is the diagnostic. Plotting only the validation score tells you how well the model performs. Plotting both tells you why — and more importantly, what to do about it.\nReading the Bias-Variance Gap # High Variance: The Gap is Wide # If the training error is low but the validation error is substantially higher, the model is overfitting, memorising training data rather than learning generalisable patterns. In high-dimensional settings like methylation data (hundreds of thousands of features, hundreds of samples), this is the default starting point.\nThe key question is whether the gap narrows as you add data. If it does, more samples will help. The model has the capacity to learn the signal - it just needs more examples to separate signal from noise.\nIf the gap stays wide regardless of sample size, more data alone won\u0026rsquo;t help. You need stronger regularisation or dimensionality reduction first.\nHigh Bias: Both Errors Are High # If both training and validation errors are high and close together, the model can\u0026rsquo;t capture the signal. It\u0026rsquo;s underfitting. More data won\u0026rsquo;t fix this. You need a more expressive model, better features, or to revisit whether the signal exists in this feature space at all.\nConvergence: Where They Meet # The ideal pattern: the training error gradually increases (overfitting becomes harder) and the validation error gradually decreases (the model learns more). 
The two curves converge.\nWhere they converge is the performance ceiling for your current model and features. If they\u0026rsquo;ve nearly met at 300 samples, collecting another 500 is unlikely to close the remaining gap. If there\u0026rsquo;s still a visible gap at your current sample size, there\u0026rsquo;s room to improve by collecting more.\nIn summary:\nWide gap, narrowing — collect more data.\nWide gap, static — fix the model first.\nNarrow gap, high errors — the model is underfitting. Change approach.\nNarrow gap, acceptable errors — you have enough data.\nWhat Learning Curves Can and Can\u0026rsquo;t Tell You # Learning curves are a diagnostic, not a forecast. They show you the trend within the data you already have, whether performance is still climbing, whether the model is overfitting or underfitting, and whether the gap between training and validation is closing. What they can\u0026rsquo;t do is reliably extrapolate. If you have 300 samples and the curve is still climbing, you know more data would help, but you can\u0026rsquo;t precisely predict what performance looks like at 1,000. The curve could plateau at 400 or keep climbing to 800.\nThat said, the diagnostic is still enormously useful for decision-making. Define what \u0026ldquo;good enough\u0026rdquo; means before generating the curves, for example, R^2=0.4 for the model to be commercially viable. Then ask: is the curve still climbing at my current sample size, or has it flattened? Is the training-validation gap wide or narrow?\nThis converts a data science analysis into a directional business case: \u0026ldquo;Performance is still improving and the model is clearly overfitting. More data is the right investment\u0026rdquo; or \u0026ldquo;The curve has flattened and both errors are high. 
We should invest in better features or a different model class before collecting more samples.\u0026rdquo;\n","date":"10 March 2025","externalUrl":null,"permalink":"/blog/learning-curves-sample-size/","section":"Blog","summary":"Learning curves won’t tell you exactly how many samples to collect, but they will tell you whether collecting more is worth it at all. In domains where each sample costs real money, that’s the question that actually matters.","title":"Using Learning Curves to Know Whether More Data Will Help","type":"blog"},{"content":"As the number of ML models at my company grew beyond a single epigenetic age clock, the team kept solving the same problems: handling grouped cross-validation, wiring up evaluation metrics, and getting models from training into production scoring. Each new model started with copy-pasted code from the last one.\nI built an internal Python package, the ML Toolkit, to consolidate these into a library of tested, reusable components. The design goal was a toolbox that data scientists configure rather than code: define a pipeline in YAML, point it at data, and get a trained, tracked, evaluated model.\nYAML-Configured Pipelines # The core of the toolkit is a dynamic pipeline builder that constructs scikit-learn Pipeline objects from YAML configuration files. 
Each step is specified as a fully-qualified Python class path with its initialisation arguments:\npipeline: name: clock_v3 model_version: \u0026#34;3.0.1\u0026#34; steps: - name: drop_sex_chromosomes class: mitrabio.ml_tools.preprocessing.DropCpgsFromChrom init_args: chromosomes_to_drop: [\u0026#34;chrX\u0026#34;, \u0026#34;chrY\u0026#34;] - name: qc_filter class: mitrabio.ml_tools.preprocessing.QcFilterCpGs init_args: min_coverage: 10 - name: normalise class: mitrabio.ml_tools.preprocessing.PeakAlignedNormalizer init_args: strategy: split_shift - name: feature_selection class: mitrabio.ml_tools.preprocessing.SelectFromModelPercentile init_args: estimator: class: sklearn.linear_model.ElasticNet init_args: l1_ratio: 0.5 percentile: 95 - name: model class: mitrabio.ml_tools.training.ElasticNetCVWithGroups init_args: n_alphas: 100 cv: 5 A DynamicClassLoader handles importing classes from string paths, resolving nested class definitions (like the estimator inside feature selection), and importing callable functions (like custom scoring functions). The PipelineBuilder validates the config, instantiates each step, and assembles them into a scikit-learn Pipeline.\nThis means experimenting with a different normaliser, feature selection threshold, or model type is a YAML edit rather than a code change. The same pipeline builder is used for the age clock, melanoma prediction, and every other endpoint the team trains. The builder doesn\u0026rsquo;t care where the class comes from as long as it follows the transformer or estimator interface.\nPrediction CLI # The toolkit ships a CLI entry point (mitrabio-predict) for scoring new samples against a trained model. 
It loads the model from MLflow by run ID, downloads the input schema artifact, validates the input data against the schema, runs the pipeline, and writes predictions.\n","date":"1 March 2025","externalUrl":null,"permalink":"/projects/ml-toolkit/","section":"Projects","summary":"A Python package with a YAML-driven pipeline builder and a prediction CLI.","title":"Building a Reusable ML Toolkit for Genomic Models","type":"projects"},{"content":"","date":"1 March 2025","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"1 March 2025","externalUrl":null,"permalink":"/tags/scikit-learn/","section":"Tags","summary":"","title":"Scikit-Learn","type":"tags"},{"content":"","date":"20 January 2025","externalUrl":null,"permalink":"/tags/dimensionality/","section":"Tags","summary":"","title":"Dimensionality","type":"tags"},{"content":"","date":"20 January 2025","externalUrl":null,"permalink":"/tags/feature-selection/","section":"Tags","summary":"","title":"Feature-Selection","type":"tags"},{"content":" Early in my work building an epigenetic age prediction model, I needed to reduce the input space from roughly 500,000 CpG sites down to something tractable. A colleague spent considerable time on feature engineering, ranking sites by variance, filtering by biological relevance, running correlation analyses. I tried something simpler: randomly sampling a few thousand features.\nThe results were essentially the same.\nWhy It Happens # In DNA methylation data you\u0026rsquo;re firmly in the p \u0026gt; n regime: hundreds of thousands of features, hundreds to thousands of samples. The signal isn\u0026rsquo;t concentrated in a small number of sites. 
Age-related methylation changes occur across thousands of positions throughout the genome, each contributing weakly.\nIn this setting, a random subset of 10,000 features drawn from 500,000 will, with high probability, contain enough weakly predictive sites to reconstruct the signal almost as well as the full set. The model doesn\u0026rsquo;t need the best features, it needs enough, and random sampling delivers that reliably.\nTrain the same Elastic Net on (1) all ~500,000 features, (2) the top 10,000 by correlation with the target, or (3) a random 10,000. The difference between options 2 and 3 is often negligible. Sometimes the random subset edges ahead, likely by avoiding overfitting to the noisiest univariate correlations.\nThe Mathematics # Spurious correlations are inevitable. With 500,000 features and a few hundred samples, testing at p \u0026lt; 0.05 yields ~25,000 false positives. Any method that ranks by univariate association is fishing in a pool where signal and noise are thoroughly mixed. Your carefully engineered feature set may just be selecting the loudest noise.\nThe signal is diffuse. If 5% of features carry meaningful age signal, a random draw of 10,000 will contain roughly 500 genuinely informative ones, more than enough for penalised regression.\nThe Hughes phenomenon. With fixed training samples, predictive power first increases with features then deteriorates. There\u0026rsquo;s an optimal feature set size, and both random and engineered subsets of similar size land in the same performance neighbourhood. The binding constraint isn\u0026rsquo;t which features, it\u0026rsquo;s how many relative to sample size.\nThe Practical Lesson # I watched a colleague invest weeks into a multi-stage pipeline: variance filtering, biological annotation filtering, recursive feature elimination. 
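That benchmark takes minutes to set up. A sketch on synthetic data with a diffuse signal (the shapes and the closed-form ridge here are illustrative stand-ins, not the pipeline we actually used):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 400, 5000, 1000
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
informative = rng.choice(p, 250, replace=False)       # diffuse signal in 5% of features
w_true[informative] = rng.normal(0, 1, informative.size)
y = X @ w_true + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

def ridge_r2(cols, lam=10.0):
    """Fit closed-form ridge on a feature subset, return R^2 on the held-out split."""
    A, B = X_tr[:, cols], X_te[:, cols]
    w = np.linalg.solve(A.T @ A + lam * np.eye(len(cols)), A.T @ y_tr)
    return 1 - np.var(y_te - B @ w) / np.var(y_te)

# "Engineered": top-k features by absolute correlation with the target
Xc, yc = X_tr - X_tr.mean(axis=0), y_tr - y_tr.mean()
corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
top_idx = np.argsort(corr)[-k:]

# Baseline: the same number of features chosen at random
rand_idx = rng.choice(p, k, replace=False)

r2_top, r2_rand = ridge_r2(top_idx), ridge_r2(rand_idx)
print(f"top-{k} by correlation: R^2={r2_top:.3f}  random {k}: R^2={r2_rand:.3f}")
```

If the correlation-ranked subset can't clearly beat the random draw on a setup like this, the same logic applies with more force to real methylation data, where univariate rankings are even noisier.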
The final model performed within a fraction of a percent of one trained on a random subset that took minutes to generate.\nFeature engineering still matters when signals are genuinely sparse, when you have strong prior knowledge, or when the goal is interpretability. But in high-dimensional omics data where the signal is diffuse, random selection is a competitive baseline. If your engineered features don\u0026rsquo;t substantially outperform a random draw, that tells you the signal is spread too broadly for targeted selection to help.\nMy rule of thumb: always benchmark against random feature selection. The bottleneck in high-dimensional biological data is rarely which features you pick - it\u0026rsquo;s sample size, label quality, and validation strategy.\n","date":"20 January 2025","externalUrl":null,"permalink":"/blog/when-random-features-work-just-as-well/","section":"Blog","summary":"On the counterintuitive finding that randomly selecting features from high-dimensional genomic data often matches the performance of careful feature engineering and why that makes mathematical sense.","title":"When Random Features Work Just as Well","type":"blog"},{"content":"","date":"15 November 2024","externalUrl":null,"permalink":"/tags/sagemaker/","section":"Tags","summary":"","title":"Sagemaker","type":"tags"},{"content":"My company published a machine learning model in Nature Aging that predicts age from DNA methylation data collected from facial skin. This post covers the engineering side: scaling the training pipeline to handle hundreds of thousands of genomic features and building the ML toolkit that supports it.\nThe Data # The input data comes from enzymatic methyl sequencing of DNA extracted from non-invasive skin samples (forehead tape strips). 
Each sample produces quality-filtered methylation measurements at roughly 500,000 CpG sites, positions in the genome where a cytosine is followed by a guanine, with beta values ranging from 0 to 100 representing the proportion of methylated DNA at each position.\nThe training dataset comprised thousands of samples, each with a known chronological age label. The target is a regression problem: predict a person\u0026rsquo;s age from their methylation profile.\nHalf a million features and thousands of samples is a moderately large tabular dataset. It fits in memory on a single machine, but the computational cost of hyperparameter tuning, fitting dozens or hundreds of model configurations with cross-validation, scales with both dimensions.\nThe Training Problem # The model uses Elastic Net regression — a linear model that combines L1 (Lasso) and L2 (Ridge) regularisation. This is a well-suited choice for high-dimensional epigenomic data:\nL1 regularisation drives coefficients to exactly zero, performing implicit feature selection. With 500,000 features, most of which carry no age-related signal, sparsity is essential. L2 regularisation handles multicollinearity. Neighbouring CpG sites are often highly correlated, and Ridge-style shrinkage stabilises the coefficient estimates. The l1_ratio parameter controls the balance between the two penalties, and the alpha parameter controls overall regularisation strength. Both need to be tuned. The initial approach ran GridSearchCV on a single EC2 instance: iterating over a grid of l1_ratio values and 100 alpha values, with 5-fold cross-validation using GroupKFold to prevent data leakage from repeated measurements of the same individual.\nThis worked, but was slow and expensive. 
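In outline, the single-instance baseline looked like this (a toy-sized sketch; the real search swept 100 alphas over ~500,000 features):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, GroupKFold

# Toy stand-in data: the real matrix was ~500,000 CpG features by thousands of samples
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 200))
y = X[:, :20].sum(axis=1) + rng.normal(0, 1, 120)
groups = np.repeat(np.arange(40), 3)       # three repeated measurements per individual

search = GridSearchCV(
    ElasticNet(max_iter=5000),
    param_grid={
        "l1_ratio": [0.1, 0.5, 0.9],       # the real search swept more values
        "alpha": np.logspace(-2, 1, 10),   # ...and 100 alphas
    },
    cv=GroupKFold(n_splits=5),             # an individual never spans train and test
    scoring="r2",
    n_jobs=-1,
)
search.fit(X, y, groups=groups)            # groups are threaded through to GroupKFold
```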
A single grid search across the full hyperparameter space took hours on a large instance, and any change to the feature set, preprocessing, or cross-validation strategy meant re-running the entire search.\nParallelising on SageMaker # The fix was to decompose the grid search into independent jobs and run them in parallel on SageMaker.\nEach combination of hyperparameters is an independent fitting problem. There are no dependencies between grid points. This makes hyperparameter tuning embarrassingly parallel. Instead of running a single GridSearchCV on one large instance, the training scripts were modified to:\nPartition the hyperparameter grid into individual configurations Submit each configuration as a separate SageMaker training job on a smaller, cheaper instance Collect results across all jobs and select the best-performing configuration SageMaker handles the instance provisioning, container management, and job scheduling. The training code itself didn\u0026rsquo;t change. It\u0026rsquo;s still scikit-learn\u0026rsquo;s ElasticNet and GridSearchCV under the hood, but the outer loop that iterates over hyperparameter configurations was lifted from a single-machine for loop to a distributed submission layer.\nThe result was a 100x speedup in hyperparameter tuning time. What previously took hours on a single large instance now completed in minutes across many smaller ones, at lower total cost because the instances run only for the duration of each individual fit.\nKey Takeaways # Parallel problems should be treated as such. Hyperparameter tuning has no inter-job dependencies. Lifting the outer loop from a single machine to a managed compute service (SageMaker) gave a 100x speedup with minimal code changes. 
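The outer-loop lift itself is mechanical: expand the grid into independent configurations, then hand each one to the submission layer. A sketch (`submit_job` is a hypothetical stand-in for the SageMaker submission code, not a real API):

```python
from itertools import product

def partition_grid(param_grid):
    """Expand a grid dict into one flat hyperparameter config per combination."""
    keys = sorted(param_grid)
    return [dict(zip(keys, combo)) for combo in product(*(param_grid[k] for k in keys))]

configs = partition_grid({"l1_ratio": [0.1, 0.5, 0.9], "alpha": [0.01, 0.1, 1.0]})

# Each config then becomes its own training job on a small instance
# (submit_job is illustrative only):
# for i, cfg in enumerate(configs):
#     submit_job(name=f"enet-{i:03d}", hyperparameters=cfg, instance_type="ml.m5.large")
```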
","date":"15 November 2024","externalUrl":null,"permalink":"/projects/epigenetic-age-prediction/","section":"Projects","summary":"How parallelising hyperparameter tuning on SageMaker turned a single-instance grid search into a 100x faster training workflow.","title":"Scaling ML Training for Epigenetic Age Prediction","type":"projects"},{"content":"Data scientists need powerful compute. This post covers how I built a CLI tool that gives scientists self-service access to EC2 instances while quietly solving the operational problems they don\u0026rsquo;t think about: data persistence, cost control, and environment reproducibility.\nThe Problem # Our data science team needed on-demand cloud compute for analysis, sometimes genomics workloads that could run for hours on large-memory instances. The initial approach was manual: someone with AWS access would spin up an EC2 instance, configure it, and hand over SSH credentials.\nThis had the usual problems:\nEnvironment drift: each instance was configured slightly differently depending on who set it up and when Data loss risk: work stored on EBS volumes would be lost if an instance was terminated without backup Cost leaks: instances left running overnight or over weekends, sometimes for days Bottleneck: scientists couldn\u0026rsquo;t spin up their own environments without asking an engineer The goal was to productise this workflow into a tool that scientists could use independently.\nThe Design # I built a Python CLI using Typer and Rich that wraps the full lifecycle of a research EC2 instance into five commands:\ncreate → provision a new instance with persistent storage start → resume a stopped instance, restore data, configure tools stop → sync data to S3, then stop the instance destroy → terminate with confirmation guards list → show all instances with state and ownership Architecture # Each instance has three layers of state:\nEphemeral compute — the EC2 instance itself, which can be stopped and started Persistent volume 
— an EBS volume mounted at /data that survives stop/start cycles Durable backup — automatic S3 sync so data survives even instance termination Data Persistence # The trickiest engineering problem wasn\u0026rsquo;t provisioning, it was making sure scientists never lost work.\nThe S3 Sync Model # Every instance is tagged with an S3 bucket and prefix at creation time. The stop command syncs the /data volume to S3 before shutting down, and the start command restores it. This means scientists can:\nStop an instance at the end of the day (saving money) Start it the next morning and pick up where they left off Even destroy and recreate an instance and get their data back The sync uses s5cmd for high-throughput parallel transfers: significantly faster than the AWS CLI\u0026rsquo;s s3 sync for the large genomics datasets we work with.\nGuard Rails # Data sync has several safety checks:\nVolume size validation at creation: the CLI checks the size of existing data in S3 and ensures the requested EBS volume is large enough, with headroom Cloud-init completion check before stop: ensures the instance has finished its bootstrap before attempting any sync Empty sync protection: if the local volume is empty (which would be a bug, not a real state), the sync is blocked to prevent overwriting good S3 data with nothing SSM readiness gate: sync runs via SSM commands on the instance; the CLI waits for the SSM agent to be healthy before attempting any remote operation If sync fails, the CLI warns but lets the user decide whether to proceed with the stop; the data is still on the EBS volume either way.\nInstance Bootstrap # When a new instance is created, a user_data shell script runs on first boot. 
This script provisions the entire research environment automatically:\nSystem packages: Python, Docker, AWS CLI, GitHub CLI, s5cmd, CloudWatch agent Storage setup: partitions and mounts the EBS data volume, configures Docker to use a separate volume for image storage Data restoration: syncs the user\u0026rsquo;s project data from S3 to /data Monitoring: configures the CloudWatch agent for operational logs Shell defaults: sets up the user\u0026rsquo;s environment for internal package workflows The start command (for resuming stopped instances) does a lighter version of this, remounting volumes, restarting Docker, and reapplying Git configuration, rather than re-running the full bootstrap.\nSSH Configuration # The CLI auto-generates SSH config entries using SSM Session Manager as a proxy:\nHost my-analysis-node HostName i-0abc123... User ubuntu ProxyCommand sh -c \u0026#34;aws ssm start-session --target %h ...\u0026#34; ForwardAgent yes This means scientists can connect via ssh my-analysis-node or use VS Code\u0026rsquo;s Remote-SSH extension with a friendly hostname. SSH agent forwarding is enabled by default, so their local GitHub keys work on the instance without copying credentials.\nThe CLI offers to append this entry automatically, and the destroy command offers to remove it, keeping ~/.ssh/config clean.\nKey Pair Management # Instance creation walks users through key pair selection interactively, listing existing AWS key pairs, offering to create new ones, saving the private key with correct permissions, and prompting for the local path. This eliminates the \u0026ldquo;I lost my key\u0026rdquo; support request.\nGit Identity Propagation # The CLI has a configure command for persisting settings like GitHub username and email. On create and start, these are pushed to the instance via SSM so git commit works immediately without manual setup. 
The fallback chain is: explicit CLI flags → persisted config → local git config on the user\u0026rsquo;s machine.\nInstance Resize # Scientists often discover mid-project that they need more memory. The change-type command lets them resize a stopped instance without reprovisioning: just stop, change type, start.\nCost Control # The biggest operational win wasn\u0026rsquo;t the CLI itself - it was the cost automation built around it.\nCron-Based Auto-Shutdown # Sometimes the data scientists would forget to stop their instances, leaving idle instances quietly running up costs. However, they had variable schedules and long-running programmes. So I documented a local cron-job pattern they could adopt:\n4pm — discover all running instances owned by the user, broadcast a warning 5pm — stop all running instances that haven\u0026rsquo;t been snoozed The snooze mechanism is deliberately simple: touch /tmp/snooze_\u0026lt;instance-name\u0026gt;. If the file exists, the 5pm job skips that instance. 
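The 5pm job's core check comes down to a few lines. A hypothetical sketch of the pattern (not the documented script itself; the `ec2-tool` crontab entries are also illustrative):

```python
from pathlib import Path

def instances_to_stop(running, snooze_dir="/tmp"):
    """The 5pm job: stop every running instance without a snooze file."""
    return [
        name for name in running
        if not (Path(snooze_dir) / f"snooze_{name}").exists()
    ]

# Crontab entries driving the pattern (illustrative names):
#   0 16 * * 1-5  ec2-tool warn-running     # 4pm: broadcast warning, clear old snoozes
#   0 17 * * 1-5  ec2-tool stop-unsnoozed   # 5pm: stop anything not snoozed
```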
The 4pm job clears all snooze files daily, so the default is always \u0026ldquo;shut down unless you actively say otherwise.\u0026rdquo;\n","date":"15 March 2024","externalUrl":null,"permalink":"/projects/self-service-ec2-platform/","section":"Projects","summary":"Designing and building a Python CLI that lets data scientists create, manage, and safely shut down cloud research environments without needing to know Terraform or the AWS console.","title":"Building a Self-Service Analysis Environment for Data Scientists","type":"projects"},{"content":"","date":"15 March 2024","externalUrl":null,"permalink":"/tags/cli/","section":"Tags","summary":"","title":"Cli","type":"tags"},{"content":"","date":"15 March 2024","externalUrl":null,"permalink":"/tags/platform-engineering/","section":"Tags","summary":"","title":"Platform-Engineering","type":"tags"},{"content":"","date":"10 October 2023","externalUrl":null,"permalink":"/tags/algorithms/","section":"Tags","summary":"","title":"Algorithms","type":"tags"},{"content":"","date":"10 October 2023","externalUrl":null,"permalink":"/tags/bioinformatics/","section":"Tags","summary":"","title":"Bioinformatics","type":"tags"},{"content":"","date":"10 October 2023","externalUrl":null,"permalink":"/tags/c++/","section":"Tags","summary":"","title":"C++","type":"tags"},{"content":"","date":"10 October 2023","externalUrl":null,"permalink":"/tags/open-source/","section":"Tags","summary":"","title":"Open-Source","type":"tags"},{"content":" Insertion sequences are among the simplest and most abundant mobile genetic elements in bacterial genomes. They\u0026rsquo;re short, typically 700 to 2,500 base pairs, and contain little more than the genes needed for their own transposition. 
But their simplicity belies their importance: insertion sequences are a primary vehicle for the horizontal transfer of antimicrobial resistance genes between bacteria, and they\u0026rsquo;re dramatically underrepresented in existing reference databases.\nDuring my PhD I built Palidis (Palindromic Detection of Insertion Sequences) — a bioinformatics pipeline that discovers novel insertion sequences directly from metagenomic sequencing data without relying on reference databases. At its core is a C++ algorithm called pal-MEM that I designed for efficiently finding inverted terminal repeats in large genomic datasets. The work was published in Microbial Genomics and later found an unexpected second life in gene therapy manufacturing.\nThe Problem # Insertion sequences are flanked by inverted terminal repeats (ITRs), short DNA sequences of 10–50 base pairs at each end that are reverse complements of each other. These ITRs are the binding sites for transposases, the enzymes that mediate the element\u0026rsquo;s movement between genomic locations. Detecting ITRs is therefore the key to finding insertion sequences.\nBut there are several reasons why existing approaches struggle:\nReference databases are incomplete. The main database, ISfinder, catalogues known insertion sequences, but it\u0026rsquo;s small and biased toward well-studied organisms. You can\u0026rsquo;t find what isn\u0026rsquo;t in the database.\nAssembly algorithms break on repeats. Short-read assemblers struggle to resolve repeated elements. They may collapse them, include only one copy, or omit them entirely. An insertion sequence that\u0026rsquo;s misassembled or incomplete won\u0026rsquo;t be found by tools that work on assembled genomes.\nExact palindrome detection is too strict. ITR pairs are not always perfect reverse complements. 
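For concreteness, an exact inverted-repeat check is two lines of Python:

```python
# Map each base to its Watson-Crick complement, then reverse the string
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def is_itr_pair(left, right):
    """Exact test: do two candidate repeats read as reverse complements?"""
    return right == reverse_complement(left)
```

For example, `reverse_complement("GAATTC")` returns `"GAATTC"`: a perfect palindrome in the biological sense.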
Tools like EMBOSS that search for exact palindromes miss many real insertion sequences that have accumulated mutations since their insertion.\nBrute-force search is too slow. Metagenomic datasets contain millions of reads, each ~100 base pairs. Checking every pair of reads for inverted repeat relationships is computationally prohibitive.\nThe Algorithm: pal-MEM # The core challenge is finding pairs of reads that contain reverse-complementary subsequences, potential ITRs, across a dataset of millions of sequences. I needed an algorithm that was both fast enough for metagenomic-scale data and sensitive enough to find the biologically relevant matches.\npal-MEM (palindromic Maximal Exact Matching) is based on the E-MEM algorithm for computing maximal exact matches, but substantially modified for the specific problem of finding inverted repeats in short-read metagenomic data.\nTwo-Bit Encoding # The first optimisation is at the representation level. DNA has a four-letter alphabet (A, C, G, T), which maps naturally to two bits per nucleotide:\nNucleotide Encoding A 00 C 01 G 10 T 11 This means a 15-mer (the default k-mer length) occupies just 30 bits, half a 64-bit integer. An entire 100 bp read fits in four 64-bit integers. The encoding reduces memory by 4x compared to character-based storage. More importantly, it makes k-mer comparison a single bitwise operation rather than a character-by-character string comparison.\nThe algorithm stores all reads as a continuous array of unsigned 64-bit integers, with each integer holding 32 nucleotides. Random 20-bit separator sequences mark read boundaries. A secondary data structure tracks the start and end positions of each read within this continuous array.\nHash Table for k-mer Lookup # The algorithm builds a hash table from the reference sequences (which, for inverted repeat detection, are the reverse complements of the input reads). 
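With two bits per base, a k-mer packs into a plain integer that can key a hash table directly. An illustrative Python sketch of the scheme (pal-MEM itself does this in C++, packing reads 32 bases to a 64-bit word and, following E-MEM, indexing only a sampled subset of positions):

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_kmer(kmer):
    """Pack a k-mer into an int, two bits per base: a 15-mer fits in 30 bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

def index_reference(seq, k=15):
    """Hash table from packed k-mer to the positions where it occurs."""
    table = {}
    for pos in range(len(seq) - k + 1):
        table.setdefault(pack_kmer(seq[pos:pos + k]), []).append(pos)
    return table
```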
k-mers are the keys; their positions in the reference are the values.\nA key insight from E-MEM is that not every k-mer needs to be stored. For a minimum match length L and k-mer size k, a k-mer only needs to be indexed if its position p satisfies:\n$$p \\equiv 0 \\pmod{(L - k) + 1}$$This guarantees that any MEM of length L will contain at least one indexed k-mer, while reducing the number of stored k-mers, and therefore memory usage, substantially.\nThe hash table uses open addressing with double hashing for collision resolution. Hash table sizes are pre-computed prime numbers chosen to maintain load factors that keep collision chains short.\nExtension with Interval Halving # When a k-mer match is found, the algorithm extends it in both directions to find the maximal exact match. Rather than extending one nucleotide at a time, pal-MEM uses an interval halving approach:\nExtend by the maximum possible distance (to the boundary of the shortest sequence) Compare the two extended regions using bitwise operations on the 64-bit integer representation If they don\u0026rsquo;t match, halve the extension distance and try again Once a match is found, extend one nucleotide at a time until a mismatch is reached This approach converges in O(log n) comparison steps rather than O(n), with each comparison itself being a constant-time bitwise operation on 64-bit integers.\nOptimisations for Metagenomic Data # I made three modifications specific to the metagenomic use case:\nReverse complement by default. E-MEM searches for direct matches between sequences, with reverse complement matching as an option. Since we\u0026rsquo;re specifically looking for inverted repeats (which are reverse complements), pal-MEM transforms the reference into reverse complements at the outset, converting the problem to direct matching.\nEarly termination per read. A short read (~100 bp) from Illumina sequencing will contain at most one ITR. 
Once a MEM is found within a read, pal-MEM skips to the next read rather than continuing to scan the remainder. This dramatically reduces the search space for large metagenomic libraries.\nTechnical repeat filtering. Sequencing libraries are dominated by technical reverse complements — overlapping fragments from both strands of a double-stranded DNA molecule. These produce MEMs at the prefix of one read and the suffix of another, mimicking biological inverted repeats. pal-MEM excludes MEMs whose start or end positions are within a buffer distance of the read boundaries, filtering out the most common class of false positives without an expensive separate deduplication step.\nThe Palidis Pipeline # pal-MEM finds reads containing potential inverted repeats. Palidis wraps it in a Nextflow pipeline that goes from raw sequencing data to validated insertion sequences in five steps:\nPre-process and find repeat sequences: Convert FASTQ to FASTA, run pal-MEM to identify reads containing inverted terminal repeats.\nMap and filter by proximity: Map the repeat-containing reads against assembled contigs using Bowtie2. A Python script identifies candidate ITRs where the mapped positions of the repeats fall within the expected distance range of an insertion sequence (500–3,000 bp by default).\nCluster and validate: Cluster candidate ITRs using CD-HIT-EST (sequence identity threshold, alignment coverage). 
Putative insertion sequences must have ITRs from the same cluster that are reverse complements of each other — confirmed by BLASTn alignment showing Strand=Plus/Minus with identity above the minimum ITR length.\nAnnotate transposases: Run InterProScan on predicted protein sequences from Prodigal to confirm the presence of transposase, integrase-like, or RNase H domains — the enzymatic machinery required for transposition.\nGenerate outputs: Produce a FASTA file of insertion sequences and a tab-delimited information file with coordinates, sample IDs, contig mappings, and protein family annotations.\nThe entire pipeline runs in a single Docker/Singularity container, with Nextflow handling parallelisation across samples and HPC job scheduling via nf-core institutional configs.\nResults # Applied to 264 human oral and gut metagenomes from the Human Microbiome Project, Palidis identified 2,517 insertion sequences from 1,837 contigs across 218 samples. After clustering to remove redundancy, this produced an Insertion Sequence Catalogue (ISC) of 879 unique insertion sequences.\nOf these:\n519 (59%) were novel — not found in ISfinder, the main reference database 360 (41%) matched ISfinder entries, with 60 having strong homology (e-value \u0026lt; 1e-50) and 300 having loose homology The catalogue contained 87 unique transposases across elements ranging from 524 to 2,999 bp Querying the ISC against a database of 661,405 bacterial genomes revealed evidence of horizontal gene transfer across bacterial classes: the same insertion sequences appearing in genomes from different genera, sometimes spanning 21 genera and 46 species. 
Several genera (Bacteroides, Corynebacterium, Prevotella) had significant numbers of insertion sequences not represented in ISfinder at all, highlighting the gap in existing databases.\nImpact Beyond AMR # The original motivation was antimicrobial resistance surveillance: understanding how resistance genes spread between bacteria via mobile genetic elements. My PhD work showed that AMR genes are commonly linked to insertion sequences, and that country-specific resistance profiles can be traced partly through the mobile element landscape.\nBut the tool found an unexpected second application.\nGene therapy manufacturing relies on adeno-associated virus (AAV) vectors, which use ITRs, the same structures Palidis detects, for viral DNA replication and packaging. These 145 bp ITR sequences form T-shaped hairpin structures that are essential for the vector to function. During plasmid production in bacteria, these complex ITR structures are unstable and prone to spontaneous deletions and mutations. Damaged ITRs severely reduce or eliminate vector effectiveness.\nPalidis can verify ITR integrity in both plasmid DNA and finished AAV vectors by checking for a single dominant cluster of ITRs, indicating uniformity. 
This quality control application is documented in a WIPO patent application, a use case I never anticipated when building a tool to find bacterial transposable elements.\n","date":"10 October 2023","externalUrl":null,"permalink":"/projects/palidis-insertion-sequence-discovery/","section":"Projects","summary":"How I built a maximal exact matching algorithm in C++ with two-bit encoding to discover novel mobile genetic elements from metagenomic sequencing data and how it found applications from antimicrobial resistance surveillance to gene therapy manufacturing.","title":"Palidis: A C++ Algorithm for Discovering Insertion Sequences in Metagenomic Data","type":"projects"},{"content":"","date":"1 September 2023","externalUrl":null,"permalink":"/tags/biotech/","section":"Tags","summary":"","title":"Biotech","type":"tags"},{"content":"","date":"1 September 2023","externalUrl":null,"permalink":"/tags/dashboards/","section":"Tags","summary":"","title":"Dashboards","type":"tags"},{"content":"The most common feature request I get from scientists and lab teams is some variation of: \u0026ldquo;Can we add a table to the dashboard where I can filter by X, sort by Y, and search for Z?\u0026rdquo;\nEvery time, I have the same internal reaction: you\u0026rsquo;re asking me to build a worse version of Excel.\nThe Pattern # A pipeline produces data. Someone asks for a dashboard. The first iteration has some charts. Then the requests start: add a column, filter by date, sort by this metric, export the filtered view, conditional formatting, pivot by batch.\nEach request is reasonable in isolation. Taken together, they describe a spreadsheet. And there is nothing I can build — no React table component, no ag-Grid config — that will beat what Excel or Google Sheets already do. 
These are tools with decades of development behind them, used daily by the people requesting the feature.\nWhat I Do Instead # When someone asks for a filterable table, I ask what they\u0026rsquo;re actually trying to do once they\u0026rsquo;ve found the rows they care about:\n\u0026ldquo;I want to check specific samples\u0026rdquo; — a search bar is faster to build and use than a table. \u0026ldquo;I want to explore the data\u0026rdquo; — a \u0026ldquo;Download CSV\u0026rdquo; button and five minutes in their spreadsheet of choice gets them further than anything I could build in a sprint. \u0026ldquo;I want to spot anomalies\u0026rdquo; — they need charts with thresholds or alerting, not a sortable column. The CSV download is the one I reach for most. It\u0026rsquo;s trivial to implement, it respects that scientists already have workflows in Excel or R or Python, and it avoids the maintenance burden of a custom table UI that will never be as good as the tools it\u0026rsquo;s imitating.\nA filterable, sortable, paginated table with search sounds simple. In practice it means pagination, server-side filtering, state management, shareable URLs, accessibility, performance testing, and an endless stream of requests to tweak columns and filters. That\u0026rsquo;s not a feature — that\u0026rsquo;s a product. And it\u0026rsquo;s a product that already exists.\nWhen a Table Is the Right Call # Tables make sense when the data is live and changes too frequently for a static export, when clicking a row navigates somewhere, when it combines data from multiple sources, or when access control matters. But when the request is \u0026ldquo;I want to look at my data and filter it,\u0026rdquo; the honest answer is usually: here\u0026rsquo;s a download button.\n","date":"1 September 2023","externalUrl":null,"permalink":"/blog/glorified-excel/","section":"Blog","summary":"In biotech, the most common dashboard feature request is a filterable, sortable table. 
The fastest solution is usually a CSV download and the spreadsheet software people already know.","title":"Glorified Excel: The Dashboard Feature Request You Should Push Back On","type":"blog"},{"content":"","date":"1 September 2023","externalUrl":null,"permalink":"/tags/product/","section":"Tags","summary":"","title":"Product","type":"tags"},{"content":"","date":"1 September 2023","externalUrl":null,"permalink":"/tags/software-engineering/","section":"Tags","summary":"","title":"Software-Engineering","type":"tags"},{"content":"This was a weekend hackathon project built with three teammates. The idea: given a prompt like \u0026ldquo;a moisturiser with hyaluronic acid,\u0026rdquo; generate a complete formulation, ingredient list, assembly protocol, and allergen warnings, by chaining together an LLM, a product ingredient database, and Wikipedia.\nThe result was FormulAI, a Streamlit app built on LangChain and GPT-3.5.\nHow It Works # The app takes a text prompt (e.g. \u0026ldquo;a cream with retinol\u0026rdquo;) and a formulation type (cream, spray, or any), then runs four chained steps:\nCSV Agent: A LangChain CSV agent queries a dataset of ~200 real skincare products with their full ingredient lists, scraped and cleaned from product labels. The agent finds products with similar ingredients to the request, giving the LLM real-world formulation context rather than generating from pure parametric knowledge.\nIngredients Chain: An LLM chain takes the user\u0026rsquo;s requested chemical, the CSV agent\u0026rsquo;s findings, and the selected formulation type, then generates a full ingredient list with each ingredient\u0026rsquo;s function (humectant, emulsifier, preservative, etc.). 
The prompt tells the model to leverage the real formulations from the dataset and suggest similar chemicals if the exact one isn\u0026rsquo;t present.\nProtocol Chain: A second LLM chain takes the generated ingredients plus Wikipedia research on the requested chemical and writes an assembly protocol — the order of operations for combining the ingredients.\nAllergen Chain: A final chain takes the ingredient list and identifies potential allergens.\nEach chain has its own ConversationBufferMemory, so the conversation history is preserved and visible in the sidebar for debugging. The Wikipedia lookup runs via LangChain\u0026rsquo;s WikipediaAPIWrapper to pull in reference material on the target chemical.\nThe Dataset # The ingredient database (skincare_products_clean.csv) contains real product formulations across moisturisers, serums, oils, mists, balms, masks, peels, eye care products, cleansers, and exfoliators. Each row has a product type and its full ingredient list. This grounds the LLM\u0026rsquo;s suggestions in actual commercial formulations rather than hallucinated ingredient combinations.\n","date":"11 June 2023","externalUrl":null,"permalink":"/blog/formulai-langchain-formulations/","section":"Blog","summary":"A hackathon project that chains LLM calls with a product ingredient database and Wikipedia to generate skincare formulations — ingredients, assembly protocols, and allergen warnings.","title":"FormulAI: Using LangChain to Generate Skincare Formulations","type":"blog"},{"content":"","date":"11 June 2023","externalUrl":null,"permalink":"/tags/hackathon/","section":"Tags","summary":"","title":"Hackathon","type":"tags"},{"content":"","date":"11 June 2023","externalUrl":null,"permalink":"/tags/langchain/","section":"Tags","summary":"","title":"Langchain","type":"tags"},{"content":"","date":"11 June 2023","externalUrl":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"Llm","type":"tags"},{"content":"","date":"11 June 
2023","externalUrl":null,"permalink":"/tags/streamlit/","section":"Tags","summary":"","title":"Streamlit","type":"tags"},{"content":"Being able to build pipelines is now a requirement for many bioinformaticians. A pipeline can range from two commands in a bash script to a complex suite of steps.\nBut what makes a good one?\nThis post is short, non-technical and to the point, and is aimed at experienced bioinformaticians, those starting out and those who are just curious.\nThe Pipeline Must Be Useable and User-Friendly # Your pipeline has to work and work well for your user. Ultimately, to the user, it doesn\u0026rsquo;t matter whether it\u0026rsquo;s written in bash or a workflow manager, or whether it\u0026rsquo;s modular or a monolith. There are tools that will help you get to the finish line, but they help more with longer-term strategy.\nBefore diving in, you must do the research:\nWhat does the user want to do? What output do they want to see? What inputs do they want to use? If it goes wrong, what messages do they need to see? Can the user use the command line or do they need a GUI? Identify the Potential Pitfalls Early # Once you\u0026rsquo;ve gathered the requirements, you need to know where the pipeline will be run and explore potential resource limitations. Specifically:\nHow large are the dependency files (e.g. databases) and do these have an impact on how quickly the pipeline can be shared (i.e. internet speeds and storage)? How long does the pipeline run for? Do the compute resources allow for this and will it impact delivery? How many files and intermediate files will be produced in a given time? Is there enough storage for this and should there be a maximum limit at a given time? You should communicate clearly with the user about these challenges and work with them to find a solution if necessary (rather than hacking your way through it yourself). 
Sometimes the user may have alternative suggestions that would be sufficient for their needs, e.g. using a smaller database.\nFlexibility and Robustness # Unless the methods embedded in the pipeline are standard or published, it is common for the user to request changes in the future. If you are building a pipeline with more than a couple of steps, decide what tools you will use to isolate each step (with obvious inputs and outputs) so that it is easy for somebody else to swap in another step without breaking the whole pipeline. Workflow managers are a good option here.\nReproducibility # A pipeline can be considered useless if the data it generates is not reproducible. Therefore, you must test whether the outputs are as expected. Building a suite of tests (unit tests, regression tests, system tests and continuous integration tests) that can also be automated as part of version control will save you a lot of heartache down the line, such as when you need to make changes to a step but need to keep part of the final output the same. The extent of your testing will depend on the requirements of the pipeline.\nCan the Pipeline Afford to Fail? # Like any software project, the effort to meet the requirements above depends on how reliable a pipeline has to be. For example, if it\u0026rsquo;s going to be used a few times by one collaborator that you have quite close contact with, then you may be able to afford to patch fixes, iterate and release often. If you\u0026rsquo;re working on a pipeline that\u0026rsquo;s going to be run thousands of times in one go, creating crucial reports for public health agencies, then you\u0026rsquo;ve got to make sure your first release is the best it can be.\nMy Take on Where We Are # It\u0026rsquo;s fair to say that research institutions lag behind in the software development practices that would benefit pipeline development (and even laboratory information management systems), and this gap has contributed to a reproducibility crisis. 
To help tackle this, I also hope it will become common practice to review and test scientific software before its release and publication. Until funders and institutions fund training and retention of skills to nurture best software practices, we\u0026rsquo;ve got a lot of pipelines to fix.\n","date":"2 April 2023","externalUrl":null,"permalink":"/blog/how-to-impress-someone-with-a-bioinformatics-pipeline/","section":"Blog","summary":"What makes a good bioinformatics pipeline? A short, non-technical take on the things that matter — from user requirements to reproducibility to knowing when good enough is good enough.","title":"How to Impress Someone with a Bioinformatics Pipeline","type":"blog"},{"content":"","date":"2 April 2023","externalUrl":null,"permalink":"/tags/pipelines/","section":"Tags","summary":"","title":"Pipelines","type":"tags"},{"content":"","date":"2 April 2023","externalUrl":null,"permalink":"/tags/reproducibility/","section":"Tags","summary":"","title":"Reproducibility","type":"tags"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"}]