Curriculum
Module 01 · 45 min

The Human Genome at Scale

3.1 Gb, ~20,000 protein-coding genes, and ~98% of the sequence outside their coding exons.

Foundations · Clinical · Research
Core topics

What's covered

  • T01 Coding vs noncoding DNA: exons, introns, UTRs, regulatory elements
  • T02 Repeat landscape: SINEs, LINEs, LTRs, satellite DNA, segmental duplications
  • T03 Centromeres, telomeres, and the T2T-CHM13 completion (2022)
  • T04 Pangenome reference (HPRC, 2023): moving beyond GRCh38
  • T05 Mitochondrial genome and heteroplasmy
  • T06 Chromatin organization: nucleosomes, TADs, A/B compartments
Learning objectives

By the end of this module you will be able to

  • L01 State the size and gene count of the human nuclear genome and explain why ENCODE's 80% 'functional' figure is contested.
  • L02 Distinguish GRCh38 from T2T-CHM13 and the HPRC pangenome and explain when each matters clinically.
  • L03 Describe heteroplasmy and its implications for mitochondrial disease severity.
Key takeaways

What you should walk away believing

  • Only ~1.5% of the genome encodes protein; most clinically meaningful noncoding variants act through cis-regulatory elements.
  • T2T-CHM13 closed centromeric and acrocentric gaps; variant calls in and around those regions were prone to systematic errors on GRCh38.
  • The pangenome reduces reference bias for non-European ancestries; expect it to migrate into clinical pipelines through 2026.
  • Mitochondrial disease severity tracks heteroplasmy fraction with tissue-specific thresholds — a single 'pathogenic' label can mean very different things.
Lesson · Foundations emphasis

What this means at your level

Foundations

The human genome is about 3.1 billion base pairs across 22 autosomes, the X and Y, and a small circular mitochondrial chromosome. Roughly 20,000 genes code for proteins, but most of the genome is regulatory, repetitive, or of unknown function. The first truly complete human genome (T2T-CHM13) was finished in 2022, and a multi-individual pangenome reference followed in 2023.
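To make the scale concrete, here is a minimal arithmetic sketch in Python. It uses only the rounded figures quoted in this module (3.1 Gb total, ~1.5% protein-coding); the function names are illustrative, and the results are back-of-envelope estimates rather than annotation-derived counts.

```python
# Back-of-envelope arithmetic with the rounded figures from this module.
# These constants are approximations quoted in the text, not measured values.
GENOME_BP = 3.1e9        # ~3.1 billion base pairs
CODING_FRACTION = 0.015  # ~1.5% of the genome encodes protein


def coding_bp(genome_bp: float = GENOME_BP,
              fraction: float = CODING_FRACTION) -> float:
    """Approximate number of protein-coding base pairs."""
    return genome_bp * fraction


def noncoding_bp(genome_bp: float = GENOME_BP,
                 fraction: float = CODING_FRACTION) -> float:
    """Approximate number of base pairs that do not encode protein."""
    return genome_bp - coding_bp(genome_bp, fraction)


if __name__ == "__main__":
    print(f"protein-coding: ~{coding_bp() / 1e6:.1f} Mb")
    print(f"everything else: ~{noncoding_bp() / 1e9:.2f} Gb")
```

With these inputs the coding portion comes to roughly 46.5 Mb, which is why most sequenced bases in a whole-genome test fall outside protein-coding sequence.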

Clinician deep-dive

When you order or interpret a genomic test, you're almost always working against GRCh38, which still has reference gaps and ancestry bias. Variants in segmental duplications, centromeres, and HLA can be miscalled. Mitochondrial reports should include heteroplasmy percentage; without it, severity cannot be reliably predicted. The HPRC pangenome will make underrepresented-ancestry variant calling materially more accurate as labs adopt it.
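As a rough illustration of why the heteroplasmy percentage matters, the sketch below computes it from read counts at a single mitochondrial position and compares it with a threshold. The class name, the read counts, and the 60% threshold are all hypothetical values chosen for illustration; they are not clinical cut-offs, and real pipelines also account for sequencing error and coverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MtVariantCall:
    """Read support for one mtDNA variant in one tissue (hypothetical type)."""
    tissue: str
    variant_reads: int
    total_reads: int

    def heteroplasmy_percent(self) -> float:
        """Percentage of reads carrying the variant allele."""
        if self.total_reads <= 0:
            raise ValueError("total_reads must be positive")
        if not 0 <= self.variant_reads <= self.total_reads:
            raise ValueError("variant_reads must be between 0 and total_reads")
        return 100.0 * self.variant_reads / self.total_reads


def exceeds_threshold(call: MtVariantCall, threshold_percent: float) -> bool:
    """True if the heteroplasmy level meets or exceeds the given threshold."""
    return call.heteroplasmy_percent() >= threshold_percent


# ASSUMPTION: 60% is an arbitrary illustrative threshold, not a clinical value.
ILLUSTRATIVE_THRESHOLD = 60.0

# The same variant measured in two tissues of one (made-up) patient.
blood = MtVariantCall("blood", variant_reads=300, total_reads=1000)
muscle = MtVariantCall("muscle", variant_reads=850, total_reads=1000)

if __name__ == "__main__":
    for call in (blood, muscle):
        print(call.tissue,
              f"{call.heteroplasmy_percent():.1f}%",
              exceeds_threshold(call, ILLUSTRATIVE_THRESHOLD))
```

In this made-up case the variant sits at 30% in blood and 85% in muscle, so the same 'pathogenic' variant falls on opposite sides of the illustrative threshold depending on the tissue sampled.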

Research note

T2T-CHM13 added nearly 200 Mb of previously unresolved sequence, including all centromeric satellite arrays, and corrected errors in GRCh38. Long-read assemblies (PacBio HiFi, ONT ultra-long) underpin the HPRC graph genome. Watch for graph-aware aligners and variant callers (such as vg and GraphAligner) and Minigraph-Cactus graph-construction pipelines becoming clinical-grade.
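The idea of a graph genome can be sketched with a toy data structure: nodes hold sequence, and each haplotype is a path through the nodes. The example below is a deliberately simplified Python illustration with made-up sequences and names; it does not reflect the file formats or internal representations used by vg, GraphAligner, or Minigraph-Cactus.

```python
from typing import Dict, List, Set, Tuple

# Toy variation graph: node id -> sequence (made-up data).
NODES: Dict[int, str] = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}

# Each haplotype is an ordered walk through node ids.
PATHS: Dict[str, List[int]] = {
    "hap_ref": [1, 2, 4],
    "hap_alt": [1, 3, 4],
}


def spell(path: List[int], nodes: Dict[int, str] = NODES) -> str:
    """Concatenate node sequences along a path to recover the haplotype."""
    return "".join(nodes[node_id] for node_id in path)


def edges_from_paths(paths: Dict[str, List[int]]) -> Set[Tuple[int, int]]:
    """Collect every directed edge used by at least one haplotype path."""
    edges: Set[Tuple[int, int]] = set()
    for path in paths.values():
        edges.update(zip(path, path[1:]))
    return edges


def variable_nodes(paths: Dict[str, List[int]]) -> Set[int]:
    """Nodes not shared by all haplotypes, i.e. the sites of variation."""
    node_sets = [set(path) for path in paths.values()]
    return set.union(*node_sets) - set.intersection(*node_sets)


if __name__ == "__main__":
    for name, path in PATHS.items():
        print(name, spell(path))
    print("edges:", sorted(edges_from_paths(PATHS)))
    print("variable nodes:", sorted(variable_nodes(PATHS)))
```

Here nodes 2 and 3 form a single-base 'bubble': both haplotypes share the flanking sequence, so a read carrying either allele can align without being penalized for differing from one linear reference.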

Myth-buster

We have 'finished' sequencing the human genome.

Reality

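The 2001 HGP draft, the 2003 'finished' HGP sequence, and the 2022 T2T-CHM13 assembly are different milestones. The 2003 sequence still left gaps, mostly in heterochromatic and highly repetitive regions. T2T finished one effectively haploid genome; the pangenome era is working toward representing the species. Population-scale haplotype resolution and structural-variant catalogs are still active work.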

Evidence-graded claims

What the data say

Quick check

Test yourself

Q1 Approximately how many protein-coding genes does the human genome contain?
Q2 What did T2T-CHM13 add that GRCh38 was missing?
Q3 Heteroplasmy refers to:
Glossary

Key terms & abbreviations

Pangenome
A graph-based reference representing genomic variation across many individuals rather than a single linear sequence.
T2T-CHM13
Telomere-to-telomere assembly of CHM13, an effectively haploid cell line derived from a complete hydatidiform mole, completed by the T2T consortium in 2022.
Heteroplasmy
Coexistence of wild-type and variant mitochondrial DNA within a single cell or tissue, expressed as a percentage.
TAD (Topologically Associating Domain)
Self-interacting genomic region (~1 Mb) within which regulatory contacts preferentially occur. TAD boundary disruption can cause disease via enhancer hijacking.
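The enhancer-hijacking idea in the TAD entry can be illustrated with a toy model in which TADs are the intervals between sorted boundary coordinates. The coordinates and function names below are hypothetical, and the model ignores everything about real chromatin contact data except domain membership.

```python
from bisect import bisect_right
from typing import List


def tad_index(position: int, boundaries: List[int]) -> int:
    """Index of the domain containing a position.

    `boundaries` must be sorted; domain 0 lies before the first boundary.
    """
    return bisect_right(boundaries, position)


def same_tad(pos_a: int, pos_b: int, boundaries: List[int]) -> bool:
    """True if both positions fall inside the same domain."""
    return tad_index(pos_a, boundaries) == tad_index(pos_b, boundaries)


# Hypothetical coordinates on one chromosome (illustration only).
BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]
ENHANCER = 1_800_000
GENE_PROMOTER = 2_200_000

if __name__ == "__main__":
    print("intact:", same_tad(ENHANCER, GENE_PROMOTER, BOUNDARIES))

    # Simulate a deletion that removes the boundary at 2 Mb.
    disrupted = [b for b in BOUNDARIES if b != 2_000_000]
    print("boundary deleted:", same_tad(ENHANCER, GENE_PROMOTER, disrupted))
```

With the boundary intact, the enhancer and promoter sit in different domains; once the boundary is removed they share a domain, which is the configuration in which an enhancer could contact a gene it does not normally regulate.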
Further reading

Anchor references