GRMrGecko

macrumors member
Original poster
Jun 7, 2008
89
0
Nowhere and everywhere
Hello, I am trying to find links in an NSString that contains HTML. I am working on a web crawler and need to find all the links so that I can add them to my database.
I've tried to use RegexKit, but it didn't seem to work at all for me.

I know how to do this in PHP using preg_match_all:
Code:
<?php
// Fetch the page with cURL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://example.com/");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_5; en-us) AppleWebKit/528.5+ (KHTML, like Gecko) Version/4.0dp1 Safari/526.11.2");
$result = curl_exec($ch);
curl_close($ch);

// Match every <a ... href=...>...</a>; $links[2] holds the captured URLs.
$links = array();
preg_match_all("/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>.*<\/a>/siU", $result, $links);
print_r($links);
?>

Thanks for any help.
 

kainjow

Moderator emeritus
Jun 15, 2000
7,958
7
A quick and dirty way is to use the rangeOfString:options:range: method with NSCaseInsensitiveSearch. You can then use the returned NSRange in a loop to find all instances of your search string.
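
As a rough sketch of that loop (LocationsOfString is a made-up helper name, not anything from Foundation, and this only finds where the search string occurs; pulling out the actual URL after each hit is left to you):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the start index (as NSNumber) of every
// case-insensitive occurrence of `needle` inside `haystack`.
static NSArray *LocationsOfString(NSString *haystack, NSString *needle) {
    NSMutableArray *locations = [NSMutableArray array];
    NSRange searchRange = NSMakeRange(0, [haystack length]);

    while (searchRange.length > 0) {
        NSRange found = [haystack rangeOfString:needle
                                        options:NSCaseInsensitiveSearch
                                          range:searchRange];
        if (found.location == NSNotFound) {
            break; // no more matches in the remaining text
        }
        [locations addObject:[NSNumber numberWithUnsignedInteger:found.location]];

        // Shrink the search range so it starts just past this match.
        NSUInteger next = NSMaxRange(found);
        searchRange = NSMakeRange(next, [haystack length] - next);
    }
    return locations;
}
```

Calling LocationsOfString(html, @"<a ") would give you the offset of each anchor tag, upper or lower case, which you could then scan forward from to find the href value.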
 

GRMrGecko

macrumors member
Original poster
Jun 7, 2008
89
0
Nowhere and everywhere
I've decided to use NSXML to parse the HTML and XPath to get all links.
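
For anyone following along, the NSXML route could look something like this. It is only a sketch: LinksInHTML is a hypothetical helper name, the code assumes macOS (NSXMLDocument is not available on iOS) and ARC, and the XPath assumes that the tidied document exposes plain lowercase `a` elements.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the href value of every <a> element in `html`.
// Assumes ARC; under manual reference counting, release `doc` before returning.
static NSArray *LinksInHTML(NSString *html) {
    NSError *error = nil;

    // NSXMLDocumentTidyHTML asks NSXML to clean up real-world HTML
    // into well-formed XML before parsing it.
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
                                                          options:NSXMLDocumentTidyHTML
                                                            error:&error];
    if (doc == nil) {
        return [NSArray array]; // parsing failed; `error` describes why
    }

    // Select the href attribute node of every anchor element.
    NSArray *nodes = [doc nodesForXPath:@"//a/@href" error:&error];

    NSMutableArray *links = [NSMutableArray array];
    for (NSXMLNode *node in nodes) {
        NSString *value = [node stringValue];
        if (value != nil) {
            [links addObject:value];
        }
    }
    return links;
}
```

The returned strings are whatever was in the attribute, so relative links still need to be resolved against the page URL before they go into the database.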

But it would still be nice to know how to use regex like in preg_match_all.
 

kainjow

Moderator emeritus
Jun 15, 2000
7,958
7
I've used several Cocoa ways of scraping webpages and they're all fairly slow (including NSXML with the tidy option, NSRanges, etc).

What I used to do was use NSTask to pipe the HTML to a Perl script, which would then use regex. That was the fastest method I found, even over C-based regex libraries (maybe I wasn't using them right?). It's a bit of a hassle, though, and writing Perl is no fun ;)
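
A minimal sketch of the NSTask approach, for illustration only. RunPerlOnHTML and kLinkExtractorPerl are made-up names, the one-line Perl program is just an example and far cruder than a real scraper, and the code assumes ARC and that Perl lives at /usr/bin/perl.

```objc
#import <Foundation/Foundation.h>

// Illustrative Perl one-liner: prints each double-quoted href value on its own line.
static NSString * const kLinkExtractorPerl =
    @"while (<STDIN>) { print \"$1\\n\" while /href=\"([^\"]*)\"/gi; }";

// Hypothetical helper: feeds `html` to `perl -e perlProgram` on stdin
// and returns whatever the script writes to stdout.
static NSString *RunPerlOnHTML(NSString *html, NSString *perlProgram) {
    NSTask *task = [[NSTask alloc] init];
    NSPipe *inPipe = [NSPipe pipe];
    NSPipe *outPipe = [NSPipe pipe];

    // Assumption: system Perl at /usr/bin/perl.
    // Newer SDKs prefer executableURL over setLaunchPath:, which is deprecated.
    [task setLaunchPath:@"/usr/bin/perl"];
    [task setArguments:[NSArray arrayWithObjects:@"-e", perlProgram, nil]];
    [task setStandardInput:inPipe];
    [task setStandardOutput:outPipe];
    [task launch];

    // Write the HTML, then close the handle so Perl sees end-of-file.
    // For very large pages, do this write on another thread to avoid
    // both sides blocking on full pipe buffers.
    NSFileHandle *writer = [inPipe fileHandleForWriting];
    [writer writeData:[html dataUsingEncoding:NSUTF8StringEncoding]];
    [writer closeFile];

    NSData *output = [[outPipe fileHandleForReading] readDataToEndOfFile];
    [task waitUntilExit];

    return [[NSString alloc] initWithData:output encoding:NSUTF8StringEncoding];
}
```

RunPerlOnHTML(html, kLinkExtractorPerl) returns the links as newline-separated text, which can be split with componentsSeparatedByString:.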
 