Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.

dejo

Moderator emeritus
Original poster
Sep 2, 2004
15,982
452
The Centennial State
I have some code that processes a String and replaces the path for SRC attributes (from HTML) with a new path. But with certain attribute values the replaceAll method seems to go into an infinite loop rather than just not matching (or matching). If you run the following Java, rather than getting output for the modifiedString, it just appears to hang. Is this a legitimate bug with Java's replaceAll method or am I doing something wrong?

Code:
class ReplaceAllBugApp {
    public static void main(String[] args) {
        String originalString = new String("<img src=\"/images/subdir/company_logo_final.jpg />");
        System.out.println("Original String: " + originalString); // Display the original string.
        String modifiedString = originalString.replaceAll("(?i)src ?= ?\"((/?[\\w\\d\\-@%]+)*/)?([\\w\\d\\.\\_\\-@]+\\.[\\w\\d]+)\"", "src=\"/webapp/UserFiles/Images/$3\"");
        System.out.println("Modified String: " + modifiedString); // Display the modified string.
    }
}

P.S. I am running Java 5 on Windows, in case it matters.
 

robbieduncan

Moderator emeritus
Jul 24, 2002
25,611
893
Harrogate
I'd not be surprised if it could end up in an infinite loop if the string you replace ends up inserting the string you are replacing...
 

dejo

Moderator emeritus
Original poster
Sep 2, 2004
15,982
452
The Centennial State
I'd not be surprised if it could end up in an infinite loop if the string you replace ends up inserting the string you are replacing...
But the strings are definitely different in this example case. We're going from src="/images/subdir/company_logo_final.jpg to src="/webapp/UserFiles/Images/company_logo_final.jpg".

And here's some fun stuff: if I add a (proper) closing " to the end of the src value (it's missing in the original example), the code works as hoped. But it shouldn't hang in the original case. It should just not match and do no replacing.
 

kylos

macrumors 6502a
Nov 8, 2002
948
4
MI
But the strings are definitely different in this example case. We're going from src="/images/subdir/company_logo_final.jpg to src="/webapp/UserFiles/Images/company_logo_final.jpg".

And here's some fun stuff: if I add a (proper) closing " to the end of the src value (it's missing in the original example), the code works as hoped. But it shouldn't hang in the original case. It should just not match and do no replacing.

Run it through a debugger and see where it hangs. I'd recommend Eclipse, as the version I'm using has the source for java.util.regex (used by String.replaceAll()), so you can step through the code and possibly see the actual problem.

robbieduncan said:
I'd not be surprised if it could end up in an infinite loop if the string you replace ends up inserting the string you are replacing...

From Matcher.replaceAll (used by String.replaceAll()):
This method first resets this matcher. It then scans the input sequence looking for matches of the pattern. Characters that are not part of any match are appended directly to the result string; each match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences as in the appendReplacement method.

This para seems to indicate a sequential match, not a recursive one, so there should be no issues with rematching an already matched tag.
 

jeremy.king

macrumors 603
Jul 23, 2002
5,479
1
Holly Springs, NC

Does this help at all?

Code:
String modifiedString = originalString.replaceAll("(?i)src ?= ?\"(/?[^/]+/)*([^\"]+\")", "src=\"/webapp/UserFiles/Images/$2");

I tend to approach negative logic approach when using regexp (ie. everything but some character)

Tested fine in my limited test cases...

I think the where the regexp gets all weird is when you don't have that closing quote in your test case, you expose the closing slash of the image tag to the matcher and it bombs somehow. I suggest you submit that test program to sun as a bug.
 

dejo

Moderator emeritus
Original poster
Sep 2, 2004
15,982
452
The Centennial State
Does this help at all?

Code:
String modifiedString = originalString.replaceAll("(?i)src ?= ?\"(/?[^/]+/)*([^\"]+\")", "src=\"/webapp/UserFiles/Images/$2");

I tend to approach negative logic approach when using regexp (ie. everything but some character)

Tested fine in my limited test cases...

I think the where the regexp gets all weird is when you don't have that closing quote in your test case, you expose the closing slash of the image tag to the matcher and it bombs somehow. I suggest you submit that test program to sun as a bug.
Thanks, kingjr3, that should do it! I think I will heed your advice and revisit some of my other regular-expressions and see if I can rework with the negative logic approach. And I will submit this to Sun as a bug. I know it's not the best regex but it should never hang. I expect no matching when I mess up my regex.

P.S. Now to just adjust the regex so that the closing double-quote is not the only allowed delimiter of the src= value but also whitespace too. That way, replacement WILL occur for either src="..." or src=... (or even src="... or src=...").

P.P.S. How does this look?
Code:
String modifiedString = originalString.replaceAll("(?i)src ?= ?\"?(/?[^/]+/)*([^\\s|\"]+)\"?\\s", "src=\"/webapp/UserFiles/Images/$2\" ");
This way I force the new src value to be surrounded by "s.
 

kylos

macrumors 6502a
Nov 8, 2002
948
4
MI
dejo said:
P.S. Now to just adjust the regex so that the closing double-quote is not the only allowed delimiter of the src= value but also whitespace too. That way, replacement WILL occur for either src="..." or src=... (or even src="... or src=...").

P.P.S. How does this look?

That'll work better in case your site isn't yet up to snuff with quotes. Technically, it accepts invalid input, but then it make such input valid, so there's nothing wrong with that.
 

dejo

Moderator emeritus
Original poster
Sep 2, 2004
15,982
452
The Centennial State
Re: Proposed Solution

Code:
String modifiedString = originalString.replaceAll("(?i)src ?= ?\"(/?[^/]+/)*([^\"]+\")", "src=\"/webapp/UserFiles/Images/$2");
So, seems that once I try to apply this regex to a much larger HTML string (220K), I'm getting a stack overflow exception. Thoughts?
 

kylos

macrumors 6502a
Nov 8, 2002
948
4
MI
So, seems that once I try to apply this regex to a much larger HTML string (220K), I'm getting a stack overflow exception. Thoughts?

Does it work on a much smaller file? Something with more than one src attribute but say only 1k in size? The reason I ask is because you are using greedy operators in your regex which read in the entire file and then back off until they find an acceptable match. Depending on the implementation of the regex engine, this could result in a lot of data being pushed on to the stack.

Check the Sun tutorial on quantifiers for more info.

I would recommend using the following

Code:
"(?i)src ?= ?\"(/?[^/]+?/)*?([^\"]+?\")"

instead of

Code:
"(?i)src ?= ?\"(/?[^/]+/)*([^\"]+\")"

The question marks following the (+,*) operators turn them into reluctant operators, meaning they will start with the smallest possible match and then increase until they fail, instead of starting with the largest string and stopping when it becomes a valid match.
 

SC68Cal

macrumors 68000
Feb 23, 2006
1,642
0
So, seems that once I try to apply this regex to a much larger HTML string (220K), I'm getting a stack overflow exception. Thoughts?

Welcome to the limitations of Java. Your VM is probably too small to handle the task. Allocate some more memory.
 

kylos

macrumors 6502a
Nov 8, 2002
948
4
MI
Welcome to the limitations of Java. Your VM is probably too small to handle the task. Allocate some more memory.

While increasing the VM stack using the VM option -Xss may help avoid stack overflows for this situation, the stack overflow is only indirectly related to the file size While local references to objects are stored on the stack, the actual objects are stored on the heap. In other words, the String data associated with the file is not stored on the stack, so the file size does not directly cause the stack overflow.

In Java, only method calls, primitive types, and references are stored on the the stack. Thus, in Java, stack overflows can only plausibly be caused by deep recursion. Using greedy quantifiers on large files when only a small portion of the data will be matched will cause massive backtracking, which will not only take much longer to process, but will also potentially cause stack overflows due to apparently deep recursion in the regex engine.

More information on Java memory structure.

It would be interesting if the OP would post a stack trace to see what is causing the overflow.

In the meantime, I would wager that using reluctant quantifiers would not only dramatically speed up execution, but would also eliminate stack problems.
 

jeremy.king

macrumors 603
Jul 23, 2002
5,479
1
Holly Springs, NC
It would be interesting if the OP would post a stack trace to see what is causing the overflow.

In the meantime, I would wager that using reluctant quantifiers would not only dramatically speed up execution, but would also eliminate stack problems.

I'll put my money on Pattern.matches or Matcher.matches!

Overly ambitious regexp will almost always cause stack overflow because of the recursive nature of the Pattern/Matcher classes.

I don't think its unreasonable to try the reluctant quantifiers, modify the pattern to include end of line terminator, or simply read the file line by line (or tag by tag) and apply the pattern accordingly...

Several ways to skin this cat :)
 
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.