
ace2600

macrumors member
Original poster
Mar 16, 2008
71
0
Austin, Texas
Hi,

How would I efficiently remove all HTML tags from an NS(Mutable)String?

For example:
Code:
<h1>Header</h1><p>Hello world</p>
Would become:
Code:
Header Hello world
 

wintergreen

macrumors newbie
Jun 30, 2007
4
0
Not sure what environment you are in. sed is always there for me in the darkest hour of need.

# cat /path/to/file | sed -e 's/<[^>]*>//g'
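
For anyone who cannot shell out to sed, roughly the same pattern can be applied from C with the POSIX regex functions. This is only a sketch: the function name strip_tags_regex is made up for illustration, and it assumes the text is a writable, NUL-terminated C string.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Removes every match of <[^>]*> from s, in place.
   Returns 0 on success, or -1 if the pattern fails to compile.
   strip_tags_regex is a hypothetical name, not part of any library. */
static int strip_tags_regex(char *s)
{
    assert(s != NULL);
    regex_t re;
    if (regcomp(&re, "<[^>]*>", REG_EXTENDED) != 0)
        return -1;

    char *out = s;           /* next position to write kept text */
    const char *in = s;      /* start of the part not yet scanned */
    regmatch_t m;
    while (regexec(&re, in, 1, &m, 0) == 0) {
        memmove(out, in, (size_t)m.rm_so);  /* keep the text before the match */
        out += m.rm_so;
        in += m.rm_eo;                      /* skip over the matched tag */
    }
    memmove(out, in, strlen(in) + 1);       /* keep the tail and its terminator */
    regfree(&re);
    return 0;
}
```

Like the sed one-liner, this removes anything that looks like a tag and nothing more, so script and style contents and character entities are left in the output.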
 

hhas

macrumors regular
Oct 15, 2007
126
0
More robust command line solution (10.4+):

Code:
textutil -convert txt -output foo.txt foo.html

Also take a look at the TextEdit source at /Developer/Examples/AppKit/TextEdit
 

ace2600

macrumors member
Original poster
Mar 16, 2008
71
0
Austin, Texas
Sorry, I should have made it clear. I need to do this using only Objective-C and Cocoa. It will actually be on the iPhone, but I figured this was more of a general Cocoa question, so I put it in Mac Programming.
 

Soulstorm

macrumors 68000
Feb 1, 2005
1,887
1
You will need to implement regular expression support, and learn about regular expressions. I have written a good article about that here
 

hhas

macrumors regular
Oct 15, 2007
126
0
TextEdit is written in Cocoa. Look at the source for ideas. Check the WebKit API to see if there's anything useful there. Do a web search for other suggestions. Assuming you want the resulting plain text reasonably formatted, you'll need to use some sort of HTML parser; just stripping tags with a regex won't cut it.

HTH
 

kainjow

Moderator emeritus
Jun 15, 2000
7,958
7
If it's eventually going to end up in an iPhone app, you should post it in the iPhone section. Even though the iPhone uses Cocoa, it's still a very different environment from the Mac, and you are much more limited in what you can do. For example, command-line utilities, the WebKit API, and even NSAttributedString are all good suggestions, but they are all unavailable on the iPhone.
 

ace2600

macrumors member
Original poster
Mar 16, 2008
71
0
Austin, Texas
Thanks everyone for the help. I looked at the methods mentioned and did more searching, but most did not work on the iPhone platform. Next time I will post something like this to that forum.

I tried using NSXMLParser first, but it often failed on malformed HTML. I tried a couple of other ways with direct string manipulation. I ended up with the approach below. I had to trim the text between tags because I ended up with lots of whitespace. The code below is definitely not the most efficient and I'm not too proud of it, but it seems to work.
Code:
+ (NSString *)extractTextFromXML:(NSString *)xml{
	//Will hold just the text
	NSMutableString *text = [NSMutableString string];
	NSInteger startOfSubstring = 0;
	//Finds first instance of "<"
	NSRange startTagRange = [xml rangeOfString:@"<"];
	while(startTagRange.location != NSNotFound){
		//Extracts text from last location up to "<"
		NSString *substring = [xml substringWithRange:NSMakeRange(startOfSubstring, startTagRange.location-startOfSubstring)];
		//Removes whitespace from substring
		[text appendString:[substring stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]];
		
		//Searches for ">" from "<" to end of string
		NSRange startTagToEndRange = NSMakeRange(startTagRange.location, [xml length]-startTagRange.location);
		NSRange endTagRange = [xml rangeOfString:@">" options:0 range:startTagToEndRange];
		//If ">" found, then sets next location of substring to after that
		if(endTagRange.location != NSNotFound){
			startOfSubstring = endTagRange.location+1;
		}
		//If no ">", then appends rest of string and returns
		else{
			[text appendString:[xml substringFromIndex:startTagRange.location]];
			return text;
		}
		//Finds next "<" in string
		NSRange endTagToEndRange = NSMakeRange(startOfSubstring, [xml length]-startOfSubstring);
		startTagRange = [xml rangeOfString:@"<" options:0 range:endTagToEndRange];
	}
	
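	//Appends any text left after the last tag (the whole string if there were no tags)
	NSString *rest = [xml substringFromIndex:startOfSubstring];
	[text appendString:[rest stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]];
	return text;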
}
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
Erm, what about embedded JavaScript or CSS? What if a tag has a property that contains a > in quotes? Not to be a black cloud, just trying to point things out.
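
To illustrate the quoted > case, here is a rough sketch in plain C of a scanner that remembers whether it is inside a quoted attribute value and only ends the tag on a > outside quotes. The name strip_tags_quoted is hypothetical, and the sketch still does nothing about script or style contents.

```c
#include <assert.h>
#include <stddef.h>

/* Strips tags in place, treating a '>' inside a quoted attribute value
   as part of the tag. strip_tags_quoted is a hypothetical helper name. */
static void strip_tags_quoted(char *s)
{
    assert(s != NULL);
    int in_tag = 0;
    char quote = '\0';           /* the open quote character, or 0 if none */
    char *out = s;               /* next position to write kept text */

    for (const char *in = s; *in != '\0'; in++) {
        char c = *in;
        if (!in_tag) {
            if (c == '<')
                in_tag = 1;      /* a tag starts: stop copying */
            else
                *out++ = c;      /* ordinary text: keep it */
        } else if (quote != '\0') {
            if (c == quote)
                quote = '\0';    /* closing quote: '>' can end the tag again */
        } else if (c == '"' || c == '\'') {
            quote = c;           /* opening quote inside the tag */
        } else if (c == '>') {
            in_tag = 0;          /* the tag ends */
        }
    }
    *out = '\0';
}
```

The trade-off is that a stray, unbalanced quote inside a broken tag will swallow everything up to the next matching quote, so this is only safer on reasonably well-formed input.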

-Lee
 

davedelong

macrumors member
Sep 9, 2007
50
0
Right here.
I'm doing some similar text processing in an app I'm writing, and I'm using the RegexKit framework (actually the Lite version):

http://regexkit.sourceforge.net/

I love it. It's really easy to use, because it uses categories to add methods to NS*String and NS*Array classes, so you don't have to deal with funky regex objects or anything.

Dave
 

robbieduncan

Moderator emeritus
Jul 24, 2002
25,611
893
Harrogate
That's all fine and dandy, but you can't use frameworks in an iPhone application, which the OP has said he is intending this to be. He could, of course, take the source code and compile the files he needs directly into the app. The only issue might be licensing. What is the license of RegexKit?
 

davedelong

macrumors member
Sep 9, 2007
50
0
Right here.
It's BSD. The RegexKitLite version doesn't actually give you anything other than an extra class, which contains the additional methods on NS*String and NS*Array. I also just add the linker flag "-licucore" to get the RKL to work. It works great. :)

Dave
 

robbieduncan

Moderator emeritus
Jul 24, 2002
25,611
893
Harrogate
That sounds like it would work fine then :)
 

ChrisA

macrumors G5
Jan 5, 2006
12,919
2,172
Redondo Beach, California
I don't know your application but if you are reading HTML you have to allow for broken, syntactically invalid HTML. At least make sure you don't go off in some infinite loop or crash.

I just worked this out on paper. You can do this in plain old C without using any libraries; it takes about 12 lines of code. I think you guys are working too hard at this. Make a for loop that walks the string from left to right. When you see a "<", set in_tag to 1. If in_tag is set, remove the current character from the string. If the character just removed is a ">", reset in_tag. That will work most of the time; you have to watch for escaped angle brackets.
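
A minimal sketch of the loop described above, in plain C. It assumes a writable, NUL-terminated string and edits it in place; the function name strip_tags is made up for illustration, and only the in_tag flag comes from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Drops every character from a '<' through the next '>' in s, in place.
   strip_tags is a hypothetical helper name. */
static void strip_tags(char *s)
{
    assert(s != NULL);
    int in_tag = 0;
    char *out = s;               /* next position to write a kept character */

    for (const char *in = s; *in != '\0'; in++) {
        if (*in == '<') {
            in_tag = 1;          /* entering a tag: start dropping characters */
        } else if (*in == '>' && in_tag) {
            in_tag = 0;          /* the tag is closed: resume copying */
        } else if (!in_tag) {
            *out++ = *in;        /* ordinary text: keep it */
        }
    }
    *out = '\0';
}
```

The loop visits each character once, so it cannot spin forever on broken input. An unclosed "<" simply drops the rest of the string, and a stray ">" outside a tag is kept as text.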
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
That would be nice, if not for JavaScript, CSS, chevrons in quotes in properties, etc. Just stripping tags is easy. Essentially what the OP needs is a light HTML parser. I'm betting they also want &lt; to show as <, etc.
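
For the entity part, a small table-driven pass is probably enough for the common cases. This is a sketch in plain C that covers only five named entities; decode_entities is a made-up name, and a real page will contain many more entities plus numeric forms such as &#38;.

```c
#include <assert.h>
#include <string.h>

/* Replaces a handful of named entities in place; anything else is copied
   through unchanged. decode_entities is a hypothetical helper name. */
static void decode_entities(char *s)
{
    static const struct { const char *name; char ch; } table[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' },
    };
    assert(s != NULL);
    char *out = s;               /* next position to write */
    const char *in = s;          /* next position to read */

    while (*in != '\0') {
        size_t i, n = sizeof table / sizeof table[0];
        for (i = 0; i < n; i++) {
            size_t len = strlen(table[i].name);
            if (strncmp(in, table[i].name, len) == 0) {
                *out++ = table[i].ch;   /* write the decoded character */
                in += len;              /* skip the whole entity */
                break;
            }
        }
        if (i == n)
            *out++ = *in++;      /* not a known entity: keep the character */
    }
    *out = '\0';
}
```

This should run after the tags have been stripped. Decoding first would turn &lt; into a real < that a tag stripper would then treat as the start of a tag.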

-Lee
 