All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA. Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example, given String s = “AAAAAACCCCCAAAAAACCCCCCAAAAAAGGGGTTTT”, return: [“AAAAACCCCC”, “CCCCCAAAAA”,”CCCCAAAAAA”].
Solution:
The key to solving the problem is that each of the 4 nucleotides can be stored in 2 bits. So the 10 letter long sequence can be converted to 20 bits long integer. The following is a Java solution. You may use an example to manually execute the program and see how it works.
package com.algorithm.dna;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
public class RepeatedDNA {
public static void main(String[] args) {
String s = "AAAAAACCCCCAAAAAACCCCCCAAAAAAGGGGTTTT";
for (String output : findRepeatedDnaSequences(s)) {
System.out.println(output);
}
}
public static List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<String>();
int len = s.length();
if (len < 10) {
return result;
}
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
Set<Integer> temp = new HashSet<Integer>();
Set<Integer> added = new HashSet<Integer>();
int hash = 0;
for (int i = 0; i < len; i++) {
if (i < 9) {
hash = (hash << 2) + map.get(s.charAt(i));
} else {
hash = (hash << 2) + map.get(s.charAt(i));
hash = hash & (1 << 20) - 1;
if (temp.contains(hash) && !added.contains(hash)) {
result.add(s.substring(i - 9, i + 1));
added.add(hash);
} else {
temp.add(hash);
}
}
}
return result;
}
}