Does a string consist of unique characters —- A conversation

Algorithm is one the fundamental components that a programmer should master. But every time I read a book about it, I see intrusive instructions about how to solve a problem using several algorithms. Well they just tell you that you should use this and that. Sure, they are brilliant ideas, but how do those ideas come up? If we do not know this process, it is hard to say we have really learned.

I’ve been reading the book “How to solve it” by George Pólya. It is a great book that shows you how to approach a problem using various steps. You can ask about certain questions and answer them to get a better chance to obtain the final solution. Although this book is about mathematics, I think it also applies for algorithms or even just problems in general. So I’d like to write the conversation I made myself to show the process of how we get an algorithm without giving intrusive instructions. These are just my humble ideas and suggestions are always welcome.

====================================================================

Implement an algorithm to determine if a string has all unique characters. What if you can not use additional data structures?
Q: Let’s look at the first part of the question. What is the unknown? What is our goal?
A: To find a way to determine if a string has all unique characters.
Q: What are the data?
A: Just a string that may or may not have unique characters. Can we assume that all these characters are ASCII characters? (Note that this is a question you have to ask…)
Q: Sure, let’s do that. Now, could you restate the problem since you know they are ASCII characters?
A: OK. to design an algorithm to determine if a string consists of unique ASCII characters.
Q: Have you seen a similar problem before?
A: Not really…But I can imagine the case: for example a string “abc” is unique while “sdsty” is not.
Q: What does it mean by unique character?
A: It means a character only appear once in the string.
Q: How do you know if a character appears only once or more?
A: Eh I guess I just need to keep track one the number of occurrence of different characters.
Q: Good call. To do that, what do you need?
A: Eh I guess I could use a hashtable to keep track of the frequency.
Q: OK, can you think of another way to do it?
A: ….
Q: Look at the unknown. Have you used all the conditions?
A: …
Q: What kind of characters are they?
A: ASCII characters.
Q: How many are there…?
A: 256. Ah I see where this is going. The character set is limited so we can also use something with a constant size other than hashtable. Hmm, I would say an array of size 256.
Q: Good call. Remember that this is a very useful trick when dealing with characters and strings. So how do you use this array?
A: Eh array works with index. If you check the ASCII table, you will see that every character corresponds to a decimal number from 0 to 255. In fact, if you assign ‘A’ to an integer variable n, n will get the value 64, which is the index of A in the ASCII table.
Q: Good. Now tell me some details. How would you solve this problem with your hastable or array?
A: Well, since I need to check the occurrence of each character, I guess I should scan the character in the string one by one. At each position, I’ll check if this character shows up in my record (the hashtable or array). If it’s the second time one shows up, then we find a duplicate.
Q: Nice, now you have a plan. Be sure to test your code thoroughly after implementing the algorithm.
=============================================================================
Q: Now what if you cannot use additional data structures like hashtable or array?
A: Say what…? Then how am I be able to keep track of the occurrence of characters?
Q: Sorry, that’s rule. So you can try think about how to do it without keeping a record.
A: Hmm if I don’t have anything as reference when I’m dealing with the current character, then I guess the program has to check the uniqueness of each character one by one. Otherwise it will have no idea if the current character has shown up or not.
Q: Good. What exactly do you need to do to check the uniqueness of the current character?
A: I’ll have to iterate over the characters after the current one and compare them one by one. If we hit a duplicate, end of story. Otherwise, we will move on to the next character and repeat this process until we hit the end of the string.
Q: OK. This sounds like a plan.
=============================================================================
Q: Now what is the running time of your former algorithm?
A: If I can use additional data structure, it is O(n). If I can’t, then O(n^2).
Q: OK, let’s stick to the last one. It is O(n^2), can you think about a faster one? We also narrow down the characters to only ‘a’ to ‘z’ (OK, this is just the solution on the book. And this condition just comes out from nowhere. Nothing I can do…).
A: Still no additional data structure?
Q: Yep.
A: I don’t know…. (dude why are you doing this?!)
Q: OK previously you used hashtable or array. What did you do with them?
A: I used them to keep track of the occurrence of the characters. But now I cannot use them.
Q: Don’t limit yourself to only two options. Can you think of another way to do the job?
A: Well if I cannot use other data structures, what options do I have? I think only variables are left. I don’t think it is possible. A variable can only represent one value. We will need many variables…
Q: How many do you think we will need?
A: Eh…26? OK this is not feasible. I can’t define that many variables…doing that is the same as defining an array…
Q: You are right. Think about just one variable.
A: Seriously? One? One variable can only represent one value.
Q: Is it? Let me ask this question. How is everything represented so that computer can understand what we mean?
A: binary bits….OK…Now I start to see where this is heading (OK, who really does this in everyday life?). If we represent a variable using bit format, we get more stuff to work with.
Q: That’s right. Let’s think about an integer (in Java) 1. How is it represented in bits?
A: 0000 0000 0000 0000 0000 0000 0000 0001
Q: So is there anyway to use it to distinguish several characters like ‘a’, ‘b’ and ‘d’?
A: ….
Q: OK Suppose ‘a’ is represented as 0000 * 7 0001, can you think another way to represent ‘b’?
A: Eh sure, 0000*7 0010 or 0000*7 0100… Wait I see what is going on. When the bit 1 is on different position, we can use it to represent different characters. And since we have limited the characters from a to z, we have enough space (32 > 26).
Q: Great. Now if you have seen a, b and c 3 characters in the string, how would you represent them all together?
A: Hmm a could be 0001
                 b could be 0010
                  c could be 0100….so if I have seen all 3, then probably 0111?
Q: Good call.  Now what kind of operation do you think we should use to get 0111 from 0001, 0010 and 0100?
A: Eh… add them up?
Q: That’s one way. Think about the bit operations we usually use. What do you have in mind?
A: Probably &, | and <<(>>). Hmm I would say bit-wise or.
Q: Good. Now suppose you have seen abc (0111). What happens if you see a ‘d’ next?
A: Eh d is new so I’ll probably change the record to 1111 since d is 1000.
Q: Well, you know ‘d’ is new. But the computer is not that smart, You have to program the computer to make the decision.
A: I’m not quite sure.
Q: OK, let’s say the next character is ‘a’ again. Write down the current record, d and a in bit format.
A: record is 0111
     char a  is 0001
     char d is 1000
Q: A decision should be made based on the interaction between the record and the character in consideration. What kind of interaction do you think is the right choice?
A: Hmm, like you said, in bit format we can try &, | and << (>>). We’ve tried |. Let’s try & this time…a & 0111 is 0001 while d $ 0111 is 0000. Hmm I guess if the character is new, the result from & operation is always 0000. But if it is not, the result must be greater than 0000.
Q: Here you go. One more thing, how do you convert the characters to the bit format we are using here? ‘a’ to 0001. ‘b’ to 0010 and so on.
A: ‘a’ 0001
     ‘b’ 0010
     ‘c’ 0100
     ‘d’ 1000
Hmm so the difference is the position of the bit 1. in ‘b’, the bit 1 moves 1 position to the left, in ‘c’ it is 2, and in ‘d’ it is 3.
Hmm b-1, c-2, d-3…Ah so b-a = 1, c-a=2 and d-a=3. And a-a=0 (In ASCII). So if a character x is represented by 1 << (x-‘a’)
Q: You got it! Now you have the plan. Carry it out carefully.
Link to the Java Code on GitHub:
===============================================================
A few more things to keep in mind:
  • It’s always about asking yourself the right questions. Ask questions to clarify ambiguity in the problem statement. Make sure you know the definition of each word. Sometimes the definition provides more information or leads for you (Like ASCII table has 256 characters). Never assume anything. Ask!
  • Bit format provides more stuff for us to work with. Well one variable => multiple bits. If you cannot use other data structures, take a look at this option.
Enhanced by Zemanta
Advertisements
Does a string consist of unique characters —- A conversation

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s