Problem
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example, given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”, return: [“AAAAACCCCC”, “CCCCCAAAAA”].
Java Solution
The key to solving this problem is that each of the 4 nucleotides can be stored in 2 bits, so a 10-letter sequence can be converted to a 20-bit integer. Masking with (1<<20)-1 keeps only the lowest 20 bits, which drops the oldest letter as the window slides forward. The following is a Java solution. Tracing the program by hand on a small example is a good way to see how it works.
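import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public List<String> findRepeatedDnaSequences(String s) {
    List<String> result = new ArrayList<>();
    if (s == null || s.length() < 10) {
        return result;
    }

    // 2-bit code for each nucleotide
    HashMap<Character, Integer> dict = new HashMap<>();
    dict.put('A', 0);
    dict.put('C', 1);
    dict.put('G', 2);
    dict.put('T', 3);

    int hash = 0;
    int mask = (1 << 20) - 1; // keeps the lowest 20 bits (10 letters)

    HashSet<Integer> added = new HashSet<>(); // windows already put in result
    HashSet<Integer> temp = new HashSet<>();  // every window seen so far

    for (int i = 0; i < s.length(); i++) {
        // shift in the next letter's 2 bits
        hash = (hash << 2) + dict.get(s.charAt(i));

        if (i >= 9) {
            // drop the letter that fell out of the 10-letter window
            hash &= mask;
            // report a repeated window only the first time it repeats
            if (temp.contains(hash) && !added.contains(hash)) {
                result.add(s.substring(i - 9, i + 1));
                added.add(hash);
            }
            temp.add(hash);
        }
    }

    return result;
}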
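To make the 2-bit encoding concrete, here is a minimal sketch of the packing and the sliding-window update on their own. The class name DnaEncodingDemo and the helpers code, encode, and roll are illustrative names, not part of the solution above; the mapping A=0, C=1, G=2, T=3 is the same one the solution uses.

```java
public class DnaEncodingDemo {
    static final int MASK = (1 << 20) - 1; // lowest 20 bits = 10 letters

    // Maps a nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("Not a nucleotide: " + c);
        }
    }

    // Packs a sequence of at most 10 nucleotides into the low 20 bits of an int.
    static int encode(String seq) {
        int hash = 0;
        for (int i = 0; i < seq.length(); i++) {
            hash = (hash << 2) + code(seq.charAt(i));
        }
        return hash & MASK;
    }

    // Slides a full 10-letter window one position: shift in the new letter,
    // then mask away the 2 bits of the letter that left the window.
    static int roll(int hash, char next) {
        return ((hash << 2) + code(next)) & MASK;
    }

    public static void main(String[] args) {
        // A C G T -> 00 01 10 11; leading zeros are not printed
        System.out.println(Integer.toBinaryString(encode("ACGT"))); // prints 11011
        System.out.println(encode("AAAAACCCCC"));                   // prints 341
        // Sliding "TAAAAACCCC" forward by 'C' gives the window "AAAAACCCCC"
        System.out.println(roll(encode("TAAAAACCCC"), 'C'));        // prints 341
    }
}
```

Because every 10-letter sequence maps to a different 20-bit value, two windows have the same integer only if they are the same sequence, so comparing integers is safe here.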
A JavaScript Solution:
Chinese: https://www.youtube.com/watch?v=z9F6N8Hh8dI
English: https://www.youtube.com/watch?v=ETUhHnvW2iU
Facebook: https://www.facebook.com/groups/2094071194216385/
Why do you have both a temp and an added HashSet? If you're calculating the hash code, shouldn't you just have one set that contains all the hash codes seen so far to find duplicates?
The hash code will sometimes give you the same value for two different sequences. Example:
ACCCCTGAGG
CTGTTCGTTG
Both return hashCode: 1406448045
In Java, at least. So you can't rely on Java's built-in hashCode implementation to tell sequences apart unless you implement your own version. The code above implements its own hash code in 20 bits.
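A small sketch of the collision point, using the well-known colliding pair "Aa" and "BB" rather than the DNA strings quoted above. The class and method names are made up for illustration. It also shows that a HashSet<String> still keeps colliding strings apart, because the set falls back on equals() when hash codes match; only a set of raw hashCode() integers would confuse them.

```java
import java.util.HashSet;
import java.util.Set;

public class HashCollisionDemo {
    // True when two strings share the same String.hashCode() value.
    static boolean sameHashCode(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    // Number of distinct strings a HashSet<String> keeps.
    static int distinctStrings(String... values) {
        Set<String> set = new HashSet<>();
        for (String v : values) {
            set.add(v);
        }
        return set.size();
    }

    // Number of distinct values when only the hash codes are stored.
    static int distinctHashCodes(String... values) {
        Set<Integer> set = new HashSet<>();
        for (String v : values) {
            set.add(v.hashCode());
        }
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());               // prints 2112
        System.out.println("BB".hashCode());               // prints 2112
        System.out.println(distinctStrings("Aa", "BB"));   // prints 2
        System.out.println(distinctHashCodes("Aa", "BB")); // prints 1
    }
}
```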
You may need more memory if you store each 10-letter string instead of an integer.
Why do you need to generate your own hash code? The String class has its own hashCode method that returns a unique hash for each unique string. So if you just take each 10-letter string, check whether it exists in the Set, and if so add it to the list, wouldn't that work?
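For comparison, here is a sketch of the approach this question describes, storing the 10-letter substrings themselves in a HashSet<String>. It is not the post's solution, and the class name and the set names seen and reported are illustrative. It works even though String.hashCode() is not unique, since the set compares strings with equals(); the cost is one substring object per window instead of one int. It still uses two sets, so a sequence that occurs three or more times is added to the result only once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RepeatedDnaWithStrings {
    static List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<>();
        if (s == null || s.length() < 10) {
            return result;
        }

        Set<String> seen = new HashSet<>();     // every window seen so far
        Set<String> reported = new HashSet<>(); // windows already in result

        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // Set.add returns false when the element was already present,
            // so this fires only the first time a window is seen again.
            if (!seen.add(window) && reported.add(window)) {
                result.add(window);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```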