
glossywhite

macrumors 65816
Original poster
Feb 28, 2008
1,120
3
Hi there. I just started downloading the WHOLE of an online catalogue, by using:

Code:
wget -rH -Dserver.com http://www.server.com/

(just an example URL)

I now have a folder FULL of HTML pages, one for each product. I need to pull the same pieces of data out of EVERY product page and send them to a CSV file.

Can anyone tell me how I would do this, giving examples for someone who has SOME Unix command line experience (mediocre) but is not a "Guru"?

Thanks
 
You could find a module for a language such as Perl that interacts with the DOM structure of the page and use that to parse things out. Otherwise, you could use regular expressions to get at the information.
 

Document Object Model, right? As for Perl, forget that. I know nothing about it, and am unwilling to learn a new language just for one job. :D Thanks... could you explain how I would do what you suggest? Thank you.
 

Yup, Document Object Model. Perl was just one example. I don't know what languages you know. I haven't needed to parse HTML like this, so I don't know of any good tools offhand. I have messed with XSLT some, but it would require the HTML to be well-formed XML to work correctly, and most web sites are not.
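For illustration only, here is a minimal sketch of the parser-based route in Python (picked just as an example language, not something anyone above has committed to). It uses the `html.parser` module from Python's standard library, which does not need the page to be well-formed XML. The class name `HeadingCollector` and the sample markup are made up for the example; a real catalogue page would need the tag names changed to whatever actually holds the product data.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every <h1>..<h6> element (hypothetical helper)."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # (tag, text) pairs found so far
        self._current = None  # heading tag we are currently inside, if any
        self._buffer = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        if tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a heading.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._current = None


# Made-up sample; the unclosed <p> would trip up a strict XML tool.
sample = "<h1>Heading</h1>\n<h2 class='sub'>Other</h2>\n<p>paragraph"

collector = HeadingCollector()
collector.feed(sample)
collector.close()
print(collector.headings)
```

The same idea extends to product pages: watch for the start tag that marks a field, collect the text until its end tag, and store the result.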

Regular expressions take some time to learn. Here's an example though.
HTML:
<h1>Heading</h1>
<h2>Other</h2>
<p>paragraph</p>
Regex:
Code:
/<(h[1-6]).*?>(.*?)<\/\1>/gi
That regex would capture all the headings on the page. The heading text would be stored in capture group 2 (the contents of the second set of parentheses). There are many online resources for learning regular expressions. I even created an online regular expression testing tool. It has some resources at the bottom that you would want to look at as well if you want to try learning them.
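To connect this back to the CSV question, here is a rough sketch of the regex route, again in Python purely as an example language. It assumes the pages sit directly inside one folder and end in `.html`, and it only pulls out headings; the function names `extract_headings` and `folder_to_csv` and the column names are invented for the illustration. Python has no `/.../` delimiters, so the slash in the closing tag needs no escaping there.

```python
import csv
import re
from pathlib import Path

# The heading pattern in Python syntax: group 1 is the tag name,
# group 2 is the heading text. IGNORECASE stands in for the "i" flag,
# and findall() plays the role of the "g" flag.
HEADING_RE = re.compile(r"<(h[1-6]).*?>(.*?)</\1>", re.IGNORECASE)


def extract_headings(html):
    """Return (tag, text) pairs for every heading found in the string."""
    return [(tag.lower(), text.strip()) for tag, text in HEADING_RE.findall(html)]


def folder_to_csv(folder, out):
    """Write one CSV row per heading for every .html file in `folder`."""
    writer = csv.writer(out)
    writer.writerow(["file", "tag", "text"])  # assumed column names
    for page in sorted(Path(folder).glob("*.html")):
        html = page.read_text(encoding="utf-8", errors="replace")
        for tag, text in extract_headings(html):
            writer.writerow([page.name, tag, text])


# The sample HTML from the example above.
sample = "<h1>Heading</h1>\n<h2>Other</h2>\n<p>paragraph</p>"
print(extract_headings(sample))
```

To run it over the downloaded pages, something like `with open("products.csv", "w", newline="") as out: folder_to_csv("pages", out)` should do, where `pages` and `products.csv` are placeholder names. For real product data the pattern would have to be changed to match whatever tags surround each field.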
 