Repeated DNA Sequences In Java

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA. Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example, given String s = “AAAAAACCCCCAAAAAACCCCCCAAAAAAGGGGTTTT”, return: [“AAAAACCCCC”, “CCCCCAAAAA”,”CCCCAAAAAA”].

Solution:
The key to solving the problem is that each of the 4 nucleotides can be stored in 2 bits. So the 10 letter long sequence can be converted to 20 bits long integer. The following is a Java solution. You may use an example to manually execute the program and see how it works.

package com.algorithm.dna;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RepeatedDNA {

    public static void main(String[] args) {
        String s = "AAAAAACCCCCAAAAAACCCCCCAAAAAAGGGGTTTT";
        for (String output : findRepeatedDnaSequences(s)) {
            System.out.println(output);
        }
    }

    public static List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();
        int len = s.length();
        if (len < 10) {
            return result;
        }

        Map<Character, Integer> map = new HashMap<Character, Integer>();
        map.put('A', 0);
        map.put('C', 1);
        map.put('G', 2);
        map.put('T', 3);

        Set<Integer> temp = new HashSet<Integer>();
        Set<Integer> added = new HashSet<Integer>();

        int hash = 0;
        for (int i = 0; i < len; i++) {
            if (i < 9) {
                hash = (hash << 2) + map.get(s.charAt(i));
            } else {
                hash = (hash << 2) + map.get(s.charAt(i));
                hash = hash & (1 << 20) - 1;
                if (temp.contains(hash) && !added.contains(hash)) {
                    result.add(s.substring(i - 9, i + 1));
                    added.add(hash);
                } else {
                    temp.add(hash);
                }
            }
        }
        return result;
    }
}

Advertisement