macrumors 6502a
Original poster
Jun 20, 2006
Denver, CO
Hey all,

I'm trying to teach myself proper Perl parsing methods, but I'm running into issues, especially on this one.

I am trying to parse content from a website that is stored in $contents. I am specifically looking for a series of 6 digits. I have an array filled with 6-digit elements (for example, 990146). What I want to do is parse the $contents variable line by line and pull out the lines that contain one of those 6-digit numbers.

What functions/methods/routines should/can I use to do this?

EDIT: this is what I have so far:

sub ContentParser($$)
{
    my @ContentArray = "";
    my $content = $_[0];
    my $model = $_[1];
    while(<$_[0]>)
    {
        chomp($_[0]);
        if($_[0] =~ /($model)/)
        {
            print "Item model is: $model\tItem is: $_\n";
        }
    }
}

Its output is:

Item model is: 991779	Item is: <!DOCTYPE
Item model is: 991779	Item is: html
Item model is: 991779	Item is: PUBLIC
Item model is: 991779	Item is: -//W3C//DTD XHTML 1.0 Transitional//EN
Item model is: 991779	Item is:>
Item model is: 991779	Item is: <html
Item model is: 991779	Item is: xmlns=>
Item model is: 991779	Item is: <head

I am quite confused about what's going on with this. Any help is greatly appreciated.
I've tried a regex a little bit, but it's either too greedy or not greedy enough.

I'll try the split function and see what it can do. Thanks.
Well, I suggested split because you said that the entire content was in a scalar variable. Your while loop is sort of doing that for you, but not quite, as it seems to be splitting on whitespace rather than on newlines...
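
For what it's worth, the whitespace splitting is most likely because <$_[0]> is not a plain filehandle or simple scalar, so Perl treats it as a glob() call, and glob() breaks its pattern apart on whitespace. Below is a minimal sketch of that behaviour, plus one way to really loop over a scalar line by line using an in-memory filehandle. The sub name lines_matching and the sample $page string are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# glob() splits its argument on whitespace; with no wildcard characters
# the words come back unchanged, which is what the while loop was seeing.
my @words = glob("alpha beta gamma");   # three separate items

# Hypothetical helper: read a scalar line by line through an
# in-memory filehandle and collect the lines that mention $model.
sub lines_matching {
    my ($content, $model) = @_;
    my @hits;
    open(my $fh, '<', \$content) or die "Cannot open in-memory handle: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # \Q...\E keeps any regex metacharacters in $model literal
        push @hits, $line if $line =~ /\Q$model\E/;
    }
    close $fh;
    return @hits;
}

# Sample data (assumed, not real page content)
my $page = "<td>Model 991779</td>\n<td>Other item</td>\n";
print "Line: $_\n" for lines_matching($page, '991779');
```

The while loop here has the same shape as the original one, but it reads from a real filehandle, so each iteration gets a whole line.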
Thanks for the help!

I think I got it working now.

sub ContentParser($$)
{
    my @ContentArray;
    my $content = $_[0];
    my $model = $_[1];

    # Break the page into lines, then print every line that mentions the model
    @ContentArray = split(/\n/,$content);

    foreach my $line (@ContentArray)
    {
        if($line =~ /($model)/)
        {
            print "Model: $model\t Line: $line\n";
        }
    }

#   while(<$_[0]>)
#   {   
#       chomp($_[0]);
#       $line = split(/\n/);
#       if($line =~ /($model)/)
#       {   
#           print "Item model is: $model\tItem is: $_\n";
#       }
#   }
}
You could also do something like this:

sub getContentLines {
 my ($content,$model) = @_;
 return grep /$model/, split /\n/, $content;
}

Which you can then use in your main program like:

my @contentLines = getContentLines($yourPage, '991779');

But then you might as well just do this and skip the whole subroutine:

my @contentLines = grep /$model/, split /\n/, $yourPage;

And if you are looking for multiple models:

 my $model = join '|', ('991779','991780','991781');
 my @contentLines = grep /$model/, split /\n/, $yourPage;
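
If the models are already sitting in an array, the same idea can be wrapped up roughly like this. It is only a sketch: the sub name lines_for_models and the sample data are invented for the example, and the lookarounds are an extra assumption so that a 6-digit model does not match inside a longer number.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $content that contain any of
# @models as a standalone number (not as part of a longer run of digits).
sub lines_for_models {
    my ($content, @models) = @_;
    return unless @models;   # an empty alternation would match every line

    # quotemeta keeps each model literal inside the pattern
    my $alternation = join '|', map { quotemeta } @models;
    my $re = qr/(?<!\d)(?:$alternation)(?!\d)/;

    return grep { $_ =~ $re } split /\n/, $content;
}

# Sample data (assumed, not real page content)
my @models = ('991779', '991780');
my $page   = "Item 991779 in stock\nOrder 19917791\nItem 991780 sold out\n";
print "Line: $_\n" for lines_for_models($page, @models);
```

With the sample data, the "Order 19917791" line is skipped because the digits there are part of a longer number.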

Loads of ways :D
