Question:
Help with Java-strings and frequencies?
2009-04-30 21:39:38 UTC
Hello,
I need help with a Java program I'm making for a certain part of my research. I'm fairly new to Java so I'm not sure if the method I'm thinking of would be the best option. All I'm asking is for direction in regards to which method would be best. Here's my situation:

My problem is that I have to input a DNA sequence, divide it into nucleotide base pairs and then determine the frequency. For example:
input=ATGGCCT
I have to divide that in nucleotide base pairs, which would be: AT, TG, GG, GC, CC, CT. Then I have to determine how many of each base pairs were found. I was thinking of dividing the string in sub strings, putting that into an array and then use some sort of search to determine the frequency. However, that could take forever because DNA sequences are very long! And if I then plan to modify the program so that I can divide strings in sets of 15 nucleotide bases that would not work. Is there any other advanced programming method that I am not aware of that could be used for this??? I would appreciate any direction you could give me. Thanks.
Six answers:
.
2009-05-01 02:25:59 UTC
UPDATE: glad it worked :) If you 'd like to have more functionality added to the program, let me know. For now: you 're welcome & have a nice day Lisa :)

PS: if you 'd like to mail me the written report when it's done, my email is anthony_ve*hotm*ail*com (replace the 3 *'s first though ;), but you shouldn't feel obliged to do so, of course. Im glad i could help.



EDIT3: System.out is the default "outputstream", i.e. to write to the command prompt & System.in is the default "inputstream", i.e. to read from the command prompt. The Scanner class is just a good help to get a line (i.e. until the user presses enter) from a File, or inputstream (in this case: the default inputstream, i.e. command prompt input). Ok, & now im off to bed :)



EDIT2: Ah, i think i understand the problem now :) I 've made the inputting "interactive", so that the user has to input it at run-time. While you probably prefer to just put it into the code & then run the program? If so, look for this line in the main method: "String userInput = getStrandFromUser();" & replace it with something like: String userInput = "AGCTTCAGAT";

Hope this helps. Have a nice day :)



EDIT: i don't know if you use an editor or IDE or just the command prompt to run the program. But e.g. in the command prompt, it should give you a line saying "Please enter your strand: ". Then just start typing there. Or in the Eclipse IDE, there is a window on the bottom, called "Console", which acts much the same way as the command prompt. If you 're still unsure about where to input it, just send me a message or leave another comment here. Or if you 'd like, i could add a method to the program, which would allow you to read the input from a file, so that you would just have to create a simple text file, run the program, & get a text file containing the frequency table as a result. But it's 10:30pm here, so that 'd be something for tomorrow after i get back from work.

About the last thing: i surely wouldn't mind you doing that. In fact, that 'd be very kind of you Lisa. Btw, when you 're done with the research, would it be possible to get to read it maybe? :)



UPDATE: hello again Lisa. Below is an updated class. I 've added a way to export the frequency table to a file & put some things u might want to change in a section called "SETTINGS". There are 2 ways for input: let the pc generate a random strand or let the user input a strand. In the main method, i 've just put some example code to show what 's possible with the actual code. In the declaration of DEFAULT_OUTPUT_FILE, i 've had to include a space because the line got too long. Normally this shouldn't give problems, but if it gives you an error on that line, just remove the space :) If there are things in the code that are not clear, don't hesitate to ask. Hope this helps. (PS: the name 's Anthony :)





import java.io.*;

import java.text.NumberFormat;

import java.util.*;



import javax.swing.filechooser.FileSystemView;



public class Main {



/************************* MAIN *************************/



public static void main(String[] args) {

String generatedInput = generateRandomStrand(100);

System.out.println("Generated input: "+generatedInput);

printFreqTable( generateFreqTable(generatedInput, 2));



String userInput = getStrandFromUser();

System.out.println("User input: "+userInput);

saveFreqTable( generateFreqTable(userInput, 15), DEFAULT_OUTPUT_FILE);

}



/************************* SETTINGS *************************/



/**

* A File object that can be used to save a frequency table

* = "freqTable.txt", on your desktop

*/

private static final File DEFAULT_OUTPUT_FILE = new File( FileSystemView .getFileSystemView().getHomeDirectory(), "freqTable.txt");



private static final String EXPORT_FIELD_SEPARATOR = " - ";

private static final boolean EXPORT_AMOUNTS = true;

private static final boolean EXPORT_PERCENTAGES = true;

private static final int EXPORT_PERCENTAGE_FRACTION_DIGITS = 2;



/************************* INPUT *************************/



/**

* Lets the user input a strand

*/

public static String getStrandFromUser() {

System.out.print("Please enter your strand: ");

return new Scanner(System.in).nextLine();

}



/**

* Generates a random strand of the given length

*/

public static String generateRandomStrand(int length) {

StringBuilder builder = new StringBuilder(length);

for(int i=0;i
int element = (int) (Math.random()*4);

switch(element) {

case 0 : builder.append('A'); break;

case 1 : builder.append('C'); break;

case 2 : builder.append('G'); break;

case 3 : builder.append('T'); break;

}

}

return builder.toString();

}



/************************* PROCESSING *************************/



public static Map generateFreqTable(String input, int partLength) {

if(partLength > input.length())

throw new IllegalArgumentException();



SortedMap result = new TreeMap();

for(int i=0;i
String part = input.substring(i, i+partLength);



int currentFreq = (result.get(part) == null) ? 0 : result.get(part);

result.put(part, currentFreq+1);

}

return result;

}



/************************* OUTPUT *************************/



/**

* Prints the given frequency table to the default output (command prompt)

*/

public static void printFreqTable(Map freqTable) {

saveFreqTable(freqTable, System.out);

}



/**

* Saves the given frequency table to the given file

*/

public static void saveFreqTable(Map freqTable, File output) {

try {

saveFreqTable(freqTable, new FileOutputStream(output));

} catch(FileNotFoundException e) {

System.err.println("Saving failed: "+e.getMessage());

}

}



/**

* Helps the other output methods

*/

private static void saveFreqTable(Map freqTable, OutputStream out) {

PrintWriter pw = null;

try {

pw = new PrintWriter(new OutputStreamWriter(out, "UTF-8"), true);

for(String key : freqTable.keySet()) {

String line = key;

if(EXPORT_AMOUNTS)

line += EXPORT_FIELD_SEPARATOR +freqTable.get(key);

if(EXPORT_PERCENTAGES)

line += EXPORT_FIELD_SEPARATOR +getPct(freqTable.get(key), freqTable);

pw.println(line);

}

} catch(IOException e) {

System.err.println("Saving failed: "+e.getMessage());

} finally {

if(pw!=null && out!=System.out)

pw.close();

}

}



/**

* Gets the percentage as a String

*/

private static String getPct(Integer value, Map freqTable) {

int totalNbSets = 0;

for(String key : freqTable.keySet())

totalNbSets += freqTable.get(key);



NumberFormat fmt = NumberFormat.getPercentInstance();

fmt.setMinimumFractionDigits( EXPORT_PERCENTAGE_FRACTION_DIGITS);

return fmt.format(1.0*value/totalNbSets);

}



}







EDIT6 (please do read my other edits below too, if you don't mind :)): this is a class i 've written which solves your problem.





import java.util.*;



public class Main {



/*Main method*/



public static void main(String[] args) {

String input = generateRandomStrand(100);

System.out.println("Used input: "+input);

printFreqTable( generateFreqTable(input, 2));

printFreqTable( generateFreqTable(input, 3));



}



/*This method will generate a random strand of the given length*/



public static String generateRandomStrand(int length) {

StringBuilder builder = new StringBuilder(length);

for(int i=0;i
int element = (int) (Math.random()*4);

switch(element) {

case 0 : builder.append('A'); break;

case 1 : builder.append('C'); break;

case 2 : builder.append('G'); break;

case 3 : builder.append('T'); break;

}

}

return builder.toString();

}



/*The following method will generate a frequency table*/



public static Map generateFreqTable(String input, int partLength) {

if(partLength > input.length())

throw new IllegalArgumentException();



SortedMap result = new TreeMap();

for(int i=0;i
String part = input.substring(i, i+partLength);



int currentFreq = (result.get(part) == null) ? 0 : result.get(part);

result.put(part, currentFreq+1);

}

return result;

}



/*The following method will print a frequency table*/



public static void printFreqTable(Map freqTable) {

for(String key : freqTable.keySet())

System.out.println(key+" - "+freqTable.get(key));

}



}



EDIT5bis: it's possible to encode a strand of length 4 with 1 byte. This encoding takes 8 times less memory & makes it very easy to read from & write to files.



EDIT5: what would be the maximum input length that should be handled? one million, one billion...?

& what 's the maximum "set length" that should be handled? 15?



EDIT4: where does the input strand come from? from a file or from input of the user?



EDIT3: what is the main goal of your program? is it:

1) a simple input-output program

input = a huge string + the length of the parts

output = a frequency table

2) a rather complex program, i.e.: this is just one of the many things your program should be able to do. You want it to be easily extensible, so that later on you can add things like "check if 2 strands can be paired", "concatenate two strands" etc...



EDIT2: so if you 'd have a String of length 30, & would want to have it split into sets of 15, you would end up with 16 sets, right?



EDIT: now i don't mean to be annoying, but: you say you 're dealing with RNA strands, but doe
videobobkart
2009-04-30 22:45:14 UTC
If I am understanding the problem correctly, the above answer has underestimated the size of the problem when strings of 15 nucleotide bases are involved rather than just 2. Since there are just 4 possible letters (A, C, G, T), there are 4*4 possible two-letter sequences, but with a string of 15 letters the number of possible combinations goes up to 4^15 (about one billion).



Two representations come to mind: if the length of the strings is a constant (2, or 15, or whatever), that can be built into the solution, and an N-dimensional array can be used, with the size of each dimension being 4 (one for each letter), where each letter in the sequence is used to index into each dimension of the array, and the frequency of that sequence of letters is stored in the resulting array element. This has the advantage of fast access to the frequency for each sequence, but the disadvantage of taking up a lot of potentially unnecessary space. In the case of 15 dimensions (one billion array elements) you WILL need a 64-bit address space.



The second obvious representation for this problem is a map data structure (Java's HashMap for example), the keys to the map are strings and the range of the map is the integer frequency for each string. Yes it will slow down somewhat when millions of entries are put in such a map, but you will not use space for sequences that don't ever occur.



EDIT: In response to your statement: "My input is not fixed length.", that I understood, my question about what was fixed was the length of the subsequences (the "2" in your "pos+2"). By the way you have to adjust that code slightly to not go right to the end of the string or you will be asking for indices past the end when you add 2 (or whatever the subsequence length is). But even if that length is not constant, then Map solution works fine. Yes strings carry an overhead that could be reduced if you devised a more compact representation of the ACGT sequences, and used that as the key to the Map. But I suggest starting out with a HashMap to hold the frequencies of each unique subsequence. So in your code that iterates through the subsequences in a longer sequence of "letters", the simple addition of a line like this would count each one:



freqMap.put ( substr, freqMap.get ( substr ) + 1 ) ;



where freqMap is declared/initialized like this:



HashMap freqMap = new HashMap ( ) ;



Documentation for Java's HashMap class: http://java.sun.com/j2se/1.4.2/docs/api/java/util/HashMap.html
synnethrelmn
2009-04-30 21:54:36 UTC
I think the most efficient way to do that would be to make an array to store the frequency of each type of base pair (in your case, a 6 integer array) and set each entry to 0. Then run a for-loop that checks every two letters in the string. For whatever pair it detects, (AT, TG, whatever), increment the corresponding array entry by one.



At the end you'll be left with an array that has the frequency of each type of pair. Expanding this to 15 pairs only involves expanding the array to size 15 and adding new possibilities to your search loop.
deonejuan
2009-04-30 23:58:59 UTC
I have seen this as a packed integer. Visualize a poker game with 7 cards (Texas hold 'em). There we examine the bits in a 31-slot stripe of booleans. We can manipulate the status of the bits using bitwise operators to represent card hands. We can see pairs and high hands -- jacks over queen vs. jacks over ten.



I would strive for a Linux 64-bit hardware using Java SE u12. Sounds to me like you are in proximity to academic CIS departments. Ask them about packed integer and bitwise masking.



For frequency, the framework for data in Java is HashSet.



I find your DNA quest intriguing, but I just get glazed over eyes trying to map biology to OOPs.



If you use Strings, you are screwed. You will have to back off and redesign the processing pattern. 112 BYTES of memory that never goes away with each iteneration. Strings are immutable. Compare that to the 4 bytes of an integer that does garbage collect.



Finally, database. Database is much more abstract than OOPs. But Database can model life more accurately. DB admin is a whole other branch of Computer Science that I'm not very good at. DBA is what I should have studied, but I stuck it out with the math side of CIS.
2016-04-07 04:11:06 UTC
it's been a while, but it looks like since you have j
Neeraj Yadav♄
2009-05-01 02:27:45 UTC
Have coded it for you ...







import java.util.ArrayList;

import java.util.Arrays;



public class DnaStringBreaks {



static String[] dnaArray = null;

static int count = 0;

static String dna = null;



static ArrayList ls = new ArrayList();



public static void main(String[] args) {

dna = "ATGGCCTATGATG";

dnaArray = new String[dna.length() - 1];



getDNAArray(dna);

for (int i = 0; i < dnaArray.length; i++) {

count = frequency(dnaArray, dnaArray[i]);

show(count, dnaArray[i]);



}

get15Nucleotide();



}



private static void show(int count2, String string) {



if (!ls.contains(string)) {

ls.add(string);

System.out.print(string);

System.out.println("\t Frequency is " + count);

}



}



static void getDNAArray(String dna) {

StringBuffer strbuf = new StringBuffer();

int k = dna.length();

for (int i = 0; i <= k - 1 && i + 2 <= k; i++) {

strbuf.append(dna.substring(i, i + 2));

dnaArray[i] = strbuf.toString();

strbuf.delete(0, strbuf.length());



}



}



public static int frequency(String[] arr, String strToFind) {

int occurence = 0;

for (int i = 0; i < arr.length; i++) {

if (arr[i].equals(strToFind))

occurence++;

}

return occurence;

}



static void get15Nucleotide() {

String[] array = new String[15];

int count2 = 0;

for (int i = 0; i < dnaArray.length; i += 15) {

count2++;

array = Arrays.copyOfRange(dnaArray, i, i + 15);

System.out.println("\n \t " + count2

+ "Th GROUP OF FIFTEEN NUCLEOTIDE \n");

if (/* dnaArray.length - 15 > 15 && */array.length <= 15) {

System.out.print("[");

for (int k = 0; k < array.length; k++) {



if (array[k] != null)

System.out.print(array[k] + ",");

else

System.out.print("--,");

}

System.out.print("]");

System.out.println();

}

}



}



}





tested for any kinda string length..



Hope this resolves your issues...


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.
Loading...