Illumina NovaSeq 6000 server log parsing and analysis for BaseSpace and bcl2fastq pipelines.

VerisimilitudeX/IlluminaLogVision

IlluminaLogVision: Extended Epigenomic Analytics for NovaSeq 6000

IlluminaLogVision is a Java toolkit for parsing and interpreting Illumina NovaSeq 6000 log files. It extracts fields such as HPC node usage, system serial numbers, Q30 statistics, and pass-filter counts. Originally designed for epigenomic research pipelines, it summarizes these fields into run-level metrics that help optimize library preparation and HPC scheduling and monitor data quality.

Background

  • Comprehensive Error Analytics: By capturing and standardizing error-rate measurements, we offer in-depth insight into basecalling quality, which is crucial for sensitive epigenetic assays like WGBS (whole-genome bisulfite sequencing) or histone ChIP-seq.
  • HPC Load Balancing: Multi-node HPC infrastructures often run demultiplexing or alignment tasks in parallel. Tracking HPC node usage and run distribution helps researchers identify bottlenecks and optimize resource allocation across large-scale epigenomic projects.
  • Q30 and Pass-Filter Metrics: The proportion of bases above Q30 and clusters passing filter are established indicators for run success. By aggregating these measures per lane, researchers can more quickly refine library prep conditions or revisit experimental design.
  • Yield and Cluster Density: Understanding how yield in gigabases correlates with cluster density is essential for tuning loading concentrations, which is especially beneficial for epigenetic workflows that rely on high coverage.
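
To illustrate the per-lane aggregation mentioned in the Q30 bullet above, here is a minimal Java sketch that averages %Q30 values by lane. The class LaneQ30Summary, its nested Obs type, and the sample values in main are hypothetical and are not taken from the repository's source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: averages %Q30 per lane. Not the repository's actual class. */
final class LaneQ30Summary {

    /** Hypothetical minimal observation: a lane label and its %Q30 value. */
    static final class Obs {
        final String lane;
        final double q30;

        Obs(String lane, double q30) {
            this.lane = lane;
            this.q30 = q30;
        }
    }

    /** Returns lane -> mean %Q30, with lanes sorted by label. */
    static Map<String, Double> meanQ30ByLane(List<Obs> observations) {
        // Accumulate {sum, count} per lane in a single pass.
        Map<String, double[]> acc = new TreeMap<>();
        for (Obs o : observations) {
            double[] a = acc.computeIfAbsent(o.lane, k -> new double[2]);
            a[0] += o.q30;
            a[1] += 1;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        // Illustrative values only; not measurements from a real run.
        List<Obs> obs = new ArrayList<>();
        obs.add(new Obs("Lane1", 88.0));
        obs.add(new Obs("Lane1", 90.0));
        obs.add(new Obs("Lane2", 92.0));
        meanQ30ByLane(obs).forEach(
                (lane, mean) -> System.out.printf("%s mean Q30 = %.2f%n", lane, mean));
    }
}
```

The same accumulation pattern would apply to pass-filter values; only the field being summed changes.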

Features

  • Extended Parsing Logic: Reads HPC node details, run folder paths, Q30 figures, indexing barcodes, pipeline versions, and additional fields beyond basic logs.
  • Rich Analytics: Computes average error rate, error-rate standard deviation, lane-specific HPC usage frequency, total yield, and more.
  • Multi-File Compatibility: Accepts varying log formats, from minimal 7-field lines to larger lines featuring HPC node and pass-filter references.
  • Epigenetic Application: Intended for use within bioinformatics pipelines for genomic and methylation assays, with detailed reporting of HPC usage and read quality.
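
The Rich Analytics item lists an average error rate and an error-rate standard deviation. The sketch below shows one way to compute both in Java. It assumes the population form of the standard deviation (dividing by n), which the README does not specify; the class name ErrorRateStats and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for error-rate statistics. Not the repository's actual class. */
final class ErrorRateStats {

    /** Arithmetic mean; rejects an empty list rather than returning NaN. */
    static double mean(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Population standard deviation (divides by n, an assumption). */
    static double populationStdDev(List<Double> values) {
        double m = mean(values);
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - m) * (v - m);
        }
        return Math.sqrt(squaredDiffs / values.size());
    }

    public static void main(String[] args) {
        // Illustrative error rates only; not measurements from a real run.
        List<Double> rates = Arrays.asList(0.0020, 0.0030, 0.0040);
        System.out.printf("mean = %.4f, std dev = %.5f%n",
                mean(rates), populationStdDev(rates));
    }
}
```

If a sample standard deviation is wanted instead, the divisor becomes n - 1 and the method needs at least two values.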

Usage

Clone or download this repository, then build and run it either with Gradle or directly with javac and java. Place your log files in the assets/ directory.

gradle build
gradle run --args="real_runA.txt"
    

Alternatively, compile and run manually:

javac *.java
java Main real_runA.txt
    

If no command-line argument is provided, the default file is real_runA.txt.
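
As a rough sketch of the default-file behavior described above, the snippet below picks the first argument when present and falls back to real_runA.txt otherwise. It assumes the file name is resolved relative to assets/, and the class name LogFileArgs is hypothetical; the repository's Main may handle this differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical argument handling. Not the repository's actual Main class. */
final class LogFileArgs {

    static final String DEFAULT_LOG = "real_runA.txt";

    /** Resolves the log file under assets/ (an assumed layout). */
    static Path resolve(String[] args) {
        String name = args.length > 0 ? args[0] : DEFAULT_LOG;
        return Paths.get("assets").resolve(name);
    }

    public static void main(String[] args) {
        System.out.println("Reading " + resolve(args));
    }
}
```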

Extended Log Example

RUN-20250903 Lane1 HPC-Node4 SN3000123456 /seqdata/210801_M04281_0123_000000000-A1B2C 2025-09-03T09:10:22Z 42000000 0.0030 315 38.2 91.5 Q30=88.9 Index=ACTG NGS-v2.2.1 bcl2fastq2.20
    

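The line above contains 15 whitespace-separated fields, in this order: Run ID, Lane, HPC Node, Machine Serial, Run Folder Path, Timestamp, Read Count, Error Rate, Cluster Density, Yield (Gb), Pass-Filter Count, Q30, Index, Pipeline Version, and Analysis Software. The Q30 and Index fields carry a key prefix (Q30=, Index=); all other fields are bare values.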
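
The following Java sketch shows how a line in this 15-field layout could be split into typed values. The class ExtendedLogEntry and its field names are hypothetical and only mirror the field list above; the sketch rejects any other field count, so the shorter 7-field format would need its own handling.

```java
import java.time.Instant;

/** Hypothetical holder for one extended log line. Not the repository's actual class. */
final class ExtendedLogEntry {
    final String runId, lane, hpcNode, machineSerial, runFolder;
    final Instant timestamp;
    final long readCount;
    final double errorRate, clusterDensity, yieldGb, passFilter, q30;
    final String index, pipelineVersion, analysisSoftware;

    private ExtendedLogEntry(String[] f) {
        runId = f[0];
        lane = f[1];
        hpcNode = f[2];
        machineSerial = f[3];
        runFolder = f[4];
        timestamp = Instant.parse(f[5]);            // ISO-8601 UTC, e.g. 2025-09-03T09:10:22Z
        readCount = Long.parseLong(f[6]);
        errorRate = Double.parseDouble(f[7]);
        clusterDensity = Double.parseDouble(f[8]);
        yieldGb = Double.parseDouble(f[9]);
        passFilter = Double.parseDouble(f[10]);
        q30 = Double.parseDouble(stripKey(f[11], "Q30="));
        index = stripKey(f[12], "Index=");
        pipelineVersion = f[13];
        analysisSoftware = f[14];
    }

    /** Parses one whitespace-separated line with exactly 15 fields. */
    static ExtendedLogEntry parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 15) {
            throw new IllegalArgumentException("expected 15 fields, got " + f.length);
        }
        return new ExtendedLogEntry(f);
    }

    /** Removes a required key prefix such as "Q30=" from a field. */
    private static String stripKey(String field, String key) {
        if (!field.startsWith(key)) {
            throw new IllegalArgumentException("expected prefix " + key + " in " + field);
        }
        return field.substring(key.length());
    }

    public static void main(String[] args) {
        ExtendedLogEntry e = parse(
                "RUN-20250903 Lane1 HPC-Node4 SN3000123456 "
                + "/seqdata/210801_M04281_0123_000000000-A1B2C 2025-09-03T09:10:22Z "
                + "42000000 0.0030 315 38.2 91.5 Q30=88.9 Index=ACTG NGS-v2.2.1 bcl2fastq2.20");
        System.out.println(e.runId + " " + e.lane + " reads=" + e.readCount + " q30=" + e.q30);
    }
}
```

Failing fast on an unexpected field count or a missing key prefix keeps malformed lines from silently shifting values into the wrong columns.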


Planned Research-Focused Updates

  • Dynamic Epigenomic Reporting: Automated generation of QC charts for methylation coverage vs. read error distribution, enabling real-time assessment of CpG-specific data quality.
  • Integrative HPC Metrics: Collect HPC node performance stats (CPU load, memory usage) to refine scheduling across batch-based or containerized workflows.
  • Hybrid Cloud Support: Real-time synchronization with off-site analysis clusters for massive epigenome projects.

License

MIT License