Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.

MarkCollette

macrumors 68000
Mar 6, 2003
1,559
36
Toronto, Canada
Code:
char s[] = "amajjjfgy, torsrfew";
for (i=0; i++; s[i]){
  if(isalpha(s[i]){
    strcpy(&s[i], &s[i+1]);
  }
}

Good example of in-place string manipulation, and standard library use. But, may I suggest using two index variables, to reduce redundant copies, and migrating to strncpy or strlcpy. Of course strcpy is completely safe in your example, but it sets a good example to copy&pasters. Oh, and do you mean: if( ! isalpha(s )
 

MarkCollette

macrumors 68000
Mar 6, 2003
1,559
36
Toronto, Canada
Not really relevant here, but you could remember it in the future.

When manipulating characters from Strings in Java, if the String is longer than about 300 characters, getting an array using toCharArray and iterating over the array is faster than using charAt. charAt is slightly faster than toCharArray with shorter Strings. (That's since JDK1.4 - with earlier versions the limit was about 15 characters)

This whole discussion reminded me of a change in Java 1.5. Apparently the new Unicode 4 standard adds more characters than would fit in 16 bits. So, Java 1.5 now supports UTF-16. So basically, if you want to be completely correct, you can no longer assume that a char primitive is sufficient to hold a character. In retrospect, everyone else using UTF-8 now sounds pretty smart. The point is, that the optimisation of getting the source char[] is no longer a good idea. If someone had used the simple String.charAt(int) and Character.isLetter(char) approach, then they could simply replace those with String.codePointAt(int) and Character.isLetter(int). Well, to be honest, they'd also have to change the looping to depend on Character.isSupplementaryCodePoint(int). Of course, you could just use the Character methods on the char[], but I think it makes the porting issue less straightforward.

String.codePointAt(int)
Character.isLetter(int)
Character.isSupplementaryCodePoint(int)
 

gnasher729

Suspended
Nov 25, 2005
17,980
5,566
I'll write it in C as compactly as possible you can translate to whatever you like

char s[] = "amajjjfgy, torsrfew";
for (i=0; i++; s){
if(isalpha(s){
strcpy(&s, &s[i+1]);
}
}

Yes I actually do use code like the above in real life. It has uses for example to remove prohibited characters from strings before using them to build an SQL query.


In real life, this has a good chance of crashing. strcpy for overlapping strings invokes undefined behavior. It might work on your particular machine today, but it can crash any time.
 

MarkCollette

macrumors 68000
Mar 6, 2003
1,559
36
Toronto, Canada
In real life, this has a good chance of crashing. strcpy for overlapping strings invokes undefined behavior. It might work on your particular machine today, but it can crash any time.

Another reason why Java is better: System.arraycopy guarantees proper behaviour with intra-array copying :)
 

gnasher729

Suspended
Nov 25, 2005
17,980
5,566
I'll write it in C as compactly as possible you can translate to whatever you like

char s[] = "amajjjfgy, torsrfew";
for (i=0; i++; s){
if(isalpha(s){
strcpy(&s, &s[i+1]);
}
}

Yes I actually do use code like the above in real life. It has uses for example to remove prohibited characters from strings before using them to build an SQL query.


It is worse than I thought. Apart from the undefined behavior in the call to strcpy, and the error in the test (it should have been ! isalpha (s )), execution time is quadratic, and it allows SQL code injections.

Try this first:

char* s = malloc (1000001);
memset (s, '*', 1000000);
s [1000000] = '\0';

Your code does one million calls to strcpy, each call copies on the average 500000 bytes, which means a total of 500 billion bytes are copied. Expect to wait for a few minutes.

But it actually runs twice as fast because there is a bug, and that bug is absolutely fatal: If s is a character that you want to reject, then you move all characters starting at s [i+1] forward by one byte, and the character that was previously stored at s [i+1] is not checked. In other words, your code will let any character through if I just put it into the input twice.

Lets say I am a registered user on your website, and I enter username and password (username = gnasher729 and password = abcdef), and you use this to construct a query like

username = "gnasher729" and password = "abcdef".

A hacker that wants to get into my account without knowing my password would enter gnasher729 as the user name. He enters the username X. As the password he enters

x"" or username == ""gnasher729"" and ""A"" == ""A

Your code strips out only the first of two consecutive non-letters, so its result is

x" or username = "gnasher729" and "A" = "A

If you put that into your query, the query looks like

username = "X" and password = "x" or username = "gnasher729" and "A" = "A"

In other words, since "A" = "A" is true, he will get into my account and create all kinds of trouble. Of course he can get into anyones account as long as he can guess the username. Classical SQL code injection.
 
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.