LeetCode – Repeated DNA Sequences (Java)

Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example, given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”, return: [“AAAAACCCCC”, “CCCCCAAAAA”].

Java Solution

The key to solving this problem is that each of the 4 nucleotides can be stored in 2 bits, so a 10-letter sequence can be converted to a 20-bit integer. The following is a Java solution. You may want to trace the program by hand on an example to see how it works.

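import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public List<String> findRepeatedDnaSequences(String s) {
    List<String> result = new ArrayList<>();
    if (s == null || s.length() < 10) {
        return result;
    }

    // Map each nucleotide to a 2-bit code.
    HashMap<Character, Integer> dict = new HashMap<>();
    dict.put('A', 0);
    dict.put('C', 1);
    dict.put('G', 2);
    dict.put('T', 3);

    int hash = 0;
    // Keeps only the low 20 bits, i.e. the last 10 letters at 2 bits each.
    int mask = (1 << 20) - 1;

    // Codes of sequences already added to the result.
    HashSet<Integer> added = new HashSet<>();
    // Codes of every 10-letter window seen so far.
    HashSet<Integer> temp = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        // Shift in the 2 bits of the next letter.
        hash = (hash << 2) + dict.get(s.charAt(i));

        // A full 10-letter window is available from index 9 onward.
        if (i >= 9) {
            // Drop the bits of the letter that slid out of the window.
            hash &= mask;
            // Report a sequence only the first time it repeats.
            if (temp.contains(hash) && !added.contains(hash)) {
                result.add(s.substring(i - 9, i + 1));
                added.add(hash);
            }

            temp.add(hash);
        }
    }

    return result;
}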
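
To make the 2-bit encoding concrete, here is a minimal sketch that isolates it from the solution above. The class name DnaEncodingSketch and the helpers bitsOf, encode, and slide are hypothetical names introduced only for this illustration; they are not part of the original solution. The sketch assumes the input contains only A, C, G, and T.

```java
class DnaEncodingSketch {
    // Keeps the low 20 bits: 10 letters at 2 bits each.
    static final int MASK = (1 << 20) - 1;

    // Hypothetical helper: the same A/C/G/T -> 0/1/2/3 mapping as the dict above.
    static int bitsOf(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs a sequence of up to 10 letters into an int, 2 bits per letter.
    static int encode(String seq) {
        int code = 0;
        for (int i = 0; i < seq.length(); i++) {
            code = (code << 2) | bitsOf(seq.charAt(i));
        }
        return code;
    }

    // Slides a 10-letter window one position to the right:
    // shift in the new letter, then mask off the letter that left the window.
    static int slide(int code, char next) {
        return ((code << 2) | bitsOf(next)) & MASK;
    }

    public static void main(String[] args) {
        System.out.println(encode("AAAAACCCCC")); // 341 (binary 0101010101)
        System.out.println(encode("TTTTTTTTTT")); // 1048575 (all 20 bits set)
        // Sliding the window gives the same code as encoding the new window directly.
        System.out.println(slide(encode("AAAAACCCCC"), 'A') == encode("AAAACCCCCA")); // true
    }
}
```

Because every distinct 10-letter sequence maps to a distinct 20-bit value, two windows have the same code only when they are the same sequence, which is why the solution can compare integers instead of strings.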

5 thoughts on “LeetCode – Repeated DNA Sequences (Java)”

Yang Delia

August 2, 2019 at 8:41 pm

A JavaScript Solution:
Chinese: https://www.youtube.com/watch?v=z9F6N8Hh8dI
English: https://www.youtube.com/watch?v=ETUhHnvW2iU
Facebook: https://www.facebook.com/groups/2094071194216385/
Darewreck

August 14, 2016 at 12:00 pm

Why do you have both a temp and an added HashSet? If you're calculating the hash code, shouldn't you just have one set that contains all the seen hash codes to find duplicates?
Darewreck

August 14, 2016 at 4:39 am

The hashcode sometimes will give you the same value for sequences that are not valid. Example

ACCCCTGAGG
CTGTTCGTTG

Both return hashCode: 1406448045

In Java at least. So you can't rely on Java's built-in hashCode implementation unless you implement your own version. In the code, the author implements a custom 20-bit hash code.
Jerome Liu

January 27, 2016 at 6:21 pm

You may need more memory for a 10-letter string.
Salil Surendran

November 28, 2015 at 7:03 pm

Why do you need to generate your own hash code? The String class has its own hashCode method that returns a unique hash for each unique string. So if you just take each 10-letter string, check whether it exists in the Set, and if so add it to the list, wouldn't that work?
