Assembly with De Bruijn graphs in practice | bioinformatics


When you learn about genome assembly in classes or lectures, they usually show you a tidy illustration of how a sequence can be converted into a De Bruijn graph. I had never really thought twice about this until I sat down and started developing an assembler myself, at which point I realized my original understanding of De Bruijn-based assembly graphs was massively oversimplified. One of my first questions: how do assemblers deal with the two strands of DNA?

If an assembler ignores the double-strandedness of DNA, it would essentially produce two assembly graphs (which may or may not intersect) that are roughly mirror images of each other, each corresponding to one strand. Though you could theoretically still do graph cleaning, phasing, etc. with the knowledge that the graph is doubled, this is impractical and would consume an unnecessary amount of memory and runtime, as well as add complexity to the later stages of assembly.

So what De Bruijn graph-based assemblers do instead is have each node in the graph correspond to both the forward sequence and the reverse complement of a given k-mer. This means that the edges of the graph are now bidirected instead of directed—traversing a node from one direction gives you the forward sequence, and reading it in the opposite direction gives you the reverse complement.

As an example, suppose you are representing the sequence CAGCAT (and its reverse complement ATGCTG) in a De Bruijn graph with k = 4. You would have the following nodes:

  Node 1 Node 2 Node 3
Forward (+) CAGC AGCA GCAT
Reverse (-) GCTG TGCT ATGC

So reading the path +1 +2 +3 (node 1 forward, then node 2 forward, then node 3 forward) would give you the original forward sequence CAGCAT. Note that if we want the reverse complement, we can traverse the path in the opposite direction: -3 -2 -1 yields ATGCTG (i.e., flip both the direction and order of the nodes). This gives us a neat way of representing both strands as a single set of nodes.

Of course, when you are actually processing the reads, you don’t know whether the sequence came from the forward or reverse strand. So you instead note one of the sequences as the “canonical” sequence and track the node orientations when adding edges between nodes.

For instance, one way to take the canonical k-mer is to take the lexicographically smaller k-mer. So our same example as before might now look like this (note the node 3 sequences are swapped):

  Node 1 Node 2 Node 3
Canonical (+) CAGC AGCA ATGC
Non-canonical (-) GCTG TGCT GCAT

So now, the path that gives us the forward sequence CAGCAT is +1 +2 -3, and the path that gives us the reverse sequence ATGCTG is +3 -2 -1.

This leads to another detail that you don’t usually learn about in lecture: the canonical k-mer is also another abstraction. Many assemblers (and read aligners) don’t manipulate the actual strings of the sequences. Instead, they hash each k-mer into an integer and store and manipulate that value.

This useful for a couple of reasons: first, storing an integer instead of an actual string substantially reduces the memory usage (e.g., a 32-mer would be represented in >= 32 bytes as a string, but only ((32 * 2) = 64 bits) = 8 bytes if you encode it as an integer). Second, this makes repeated string comparisons faster—you can find the lexicographical ordering between two strings with a single integer comparison instead of an O(k) comparison.

But storing both k-mer hashes is redundant, since having the knowledge of one strand allows you to recover the other. So you can represent each node as just the canonical hash by setting your hash function to hash both the forward and reverse strands to the same value. You can then use the information of whether you used the original or “reversed” hash to track the orientations of both nodes when adding an edge.

One last note: there has been some work on minimizer-space De Bruijn graphs (Ekim et al. 2021, Benoit et al. 2024). In these graphs, each node is not a single k-mer, but rather an ordered list of k’ consecutive minimizers. And so each edge between nodes represents an overlap of k’-1 k-mers (rather than overlap between two (k-1)-mers, as in a conventional De Bruijn graph). The way of representing the reverse complement sequence is similar, but with another layer of “negation”—when traversing in the reverse direction along a path, you must also reverse the order and orientations of the constituent k-mers in each node.

But crucially, it is not enough to make sure that an arbitrary reverse complement sequence hashes to the same value as its forward sequence—you have to make sure that the same corresponding minimizers are selected from the full forward and reverse sequences as well. Otherwise, you will still end up with two separate graphs, one per strand, because the k-mers being hashed are not the same. I add this note because I was reminded of it the hard way :)


March 1, 2025
← Back to posts