How do I write a Perl regular expression that extracts the text from an anchor tag with a pattern?

Question:

confiteor_deo

2008-03-10 07:45:30 UTC

I am trying to extract some of the text from a Web page that has a large number of URLs on it. The URLs look like this:

$23.45 first text
$37.25 second text
$48.32 third text

etc.

I want to extract only the text in the tag, not the URL, and I want to exclude the price. Also, there may be more than one space between the price and the text (though only one space shows up when the browser parses the HTML). This is a standard Web page, not an XML feed. I am using Perl to write the regex. Any suggestions? Thanks!

Four answers:

martinthurn

2008-03-10 12:12:57 UTC

David D is right: you should use HTML::TreeBuilder. The other answer has the ? and * in the wrong places. If you really want a quick-and-dirty solution, try this:

$s is your HTML:

    while ($s =~ m!>\$[.0-9]+\s+([^<]+)!g) {
        my $text = $1;
        # play with your $text here
    }
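For anyone who wants to try the loop above as a complete script, here is a minimal sketch. The sample markup, the href values, and the sub name extract_link_texts are made up for illustration; only the regex comes from the answer.

```perl
use strict;
use warnings;

# Return the link texts that follow a leading price such as "$23.45".
# (extract_link_texts is a hypothetical helper name.)
sub extract_link_texts {
    my ($html) = @_;
    my @texts;
    # ">" closes the opening tag, then comes the price, some whitespace,
    # and finally the text up to the next "<".
    while ($html =~ m!>\$[.0-9]+\s+([^<]+)!g) {
        push @texts, $1;
    }
    return @texts;
}

# Hypothetical sample markup; the real page will look different.
my $s = <<'HTML';
<a href="/item/1">$23.45 first text</a>
<a href="/item/2">$37.25   second text</a>
<a href="/item/3">$48.32 third text</a>
HTML

print "$_\n" for extract_link_texts($s);
```

Note that this only works while the price immediately follows the ">" of the opening tag; any extra markup inside the link will break it.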

record

2016-10-22 11:27:15 UTC

A word about Perl's typelessness. You can force variable declaration with "use strict", and I advise you to do so. I have run into problems from mistyped variable names, and they can be hard to spot. For example, $MyFavoriteVariable and $MyFavorIteVariable are distinct. If you use $MyFavoriteVariable in one place and later mistype it as $MyFavorIteVariable, Perl won't warn you about it unless you use "strict". That kind of mistake can be very hard to spot. Still, the declaration is as simple as: my $MyFavoriteVariable; there is no need to say whether it holds a string, a number, or anything else.
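A small sketch of that point about "strict". The mistyped snippet is compiled inside a string eval purely so the script can show the error instead of dying; in a normal script the typo would simply stop compilation.

```perl
use strict;
use warnings;

my $error = '';

# Compile the snippet with the typo at run time so the failure can be inspected.
my $ok = eval q{
    my $MyFavoriteVariable = 1;
    $MyFavorIteVariable = 2;   # typo: capital "I" instead of "i"
    1;
};
$error = $@ unless $ok;

print "strict refused the typo:\n$error" if $error;
```

Without "use strict", the second assignment would quietly create a new global variable and the script would carry on.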

anonymous

2008-03-10 07:55:25 UTC

\$\d+\.\d+\s+(.*)*?

This should (I haven't tested it) capture what you're looking for into capture group 1.

A few things to note here. I'm assuming prices have a decimal point:

\$\d+\.\d+

If they don't, the match will fail, so change that part if it doesn't work for you. Also note that the capture group (.*)*? is meant to be non-greedy. I think this is correct here, but YMMV. I also think I got all the escaping right, but there might be a \ missing, so double-check that if it fails.
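For comparison, here is one way the pattern could be written so that the decimal part is optional and the capture is lazy. This is a sketch and not the poster's pattern: the sub name text_after_price and the sample strings are invented, and it assumes the text runs up to the next "<".

```perl
use strict;
use warnings;

# Price with an optional decimal part, then whitespace, then a lazy capture
# that stops at the next "<" (normally the closing </a>).
my $price_then_text = qr/\$\d+(?:\.\d+)?\s+(.*?)\s*</;

# Hypothetical helper: returns the text after the price, or undef if no match.
sub text_after_price {
    my ($line) = @_;
    return $line =~ $price_then_text ? $1 : undef;
}

print text_after_price('<a href="/item/2">$37.25   second text</a>'), "\n";
print text_after_price('<a href="/item/4">$50 no decimals</a>'), "\n";
```

The lazy quantifier goes inside the group, as (.*?), and it needs something after it (here the "<") to stop against; otherwise it matches nothing at all.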

Good luck!

David D

2008-03-10 07:49:20 UTC

Parsing HTML with regular expressions is hard and painful. I'd strongly suggest not doing it and using a module such as HTML::TreeBuilder (which you can find on the CPAN: http://search.cpan.org/ ) instead.
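A minimal sketch of the HTML::TreeBuilder route, assuming the module is installed from CPAN (it is not part of core Perl). The sub name anchor_texts_without_price and the sample markup are invented for illustration. The parser finds the links, and a small substitution removes the leading price.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, must be installed separately

# Hypothetical helper: text of every <a> element, minus a leading price.
sub anchor_texts_without_price {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @texts;
    for my $link ($tree->look_down(_tag => 'a')) {
        my $text = $link->as_text;
        # Drop the price and the whitespace that follows it.
        $text =~ s/^\s*\$[.0-9]+\s+//;
        push @texts, $text;
    }
    $tree->delete;   # release the parse tree when done
    return @texts;
}

# Hypothetical sample markup.
my $page = '<p><a href="/item/1">$23.45 first text</a> '
         . '<a href="/item/2">$37.25   second text</a></p>';

print "$_\n" for anchor_texts_without_price($page);
```

The regex still handles the price, but the parser takes care of finding the tags, so extra attributes or odd spacing in the markup no longer matter.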


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.
