[Question] Extracting HTML tags

Author: yingwan (yingwan)

Board: RegExp

Title: [Question] Extracting HTML tags

Time: Sat Nov 8 07:39:43 2008

I want to pull the opening tags (<>) and the closing tags (</>) out of a web page and list them separately.

The source is:

<HTML > <HEAD><TITLE> Hello World </TITLE></HEAD > <BODY> <H1>Greetings</H1> <a href="index.html" targe=_self > Homepage </a ><p> <strong >Tat Tval Asi</strong> </BODY> </HTML>

After extraction it should look like this:

These are the opening tags:
<HTML>
<HEAD>
<TITLE>
<BODY>
<H1>
<a href="index.html" targe=_self>
<p>
<strong>

These are the closing tags:
</TITLE>
</HEAD>
</H1>
</a>
</strong>
</BODY>
</HTML>

This is the Perl I wrote:

open(IN, $file) || die "can't read $file";
@file = <IN>;

print "These are the opening tags:\n";
foreach $line (@file){
    find_opening_tags($line);
}
print "\n";
print "These are the closing tags:\n";
foreach $line (@file){
    find_closing_tags($line);
}
close IN;
# end of main

#-------------------
# subroutines
#-------------------
sub find_opening_tags {
    my $line = $_[0];
    if ($line=~ /(\<[^\/].*\>)/){
        print "$1\n";
    }
}

sub find_closing_tags {
    my $line = $_[0];
    if ($line =~ /(\<\/.*\>)/) {
        print "$1\n";
    }
}

The result is:

These are the opening tags:
<HTML >
<HEAD>
<BODY>
<H1>
<p>
<strong >

These are the closing tags:
</TITLE></HEAD >
</H1>
</a ><p>
</strong>
</BODY>
</HTML>

I'd appreciate any pointers from the experts here. Thanks.

--
※ Origin: PTT (ptt.cc)
◆ From: 149.159.132.73

→ supertitler: Use *? so the match doesn't swallow the rest of the string 11/08 13:37
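
To illustrate the *? suggestion above, here is a minimal sketch comparing the greedy pattern from the question with a non-greedy variant. The sample line and the variable names ($greedy, $lazy, @all_opening) are made up for illustration, and the /g match in list context is an extra step that the comment itself does not mention.

```perl
use strict;
use warnings;

# Hypothetical sample line, modeled on the HTML in the question.
my $line = '<HEAD><TITLE> Hello World </TITLE></HEAD >';

# Greedy: .* runs on to the last '>' in the line.
my ($greedy) = $line =~ /(<[^\/].*>)/;

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
my ($lazy) = $line =~ /(<[^\/].*?>)/;

# With /g in list context, every opening tag on the line is returned,
# not just the first one.
my @all_opening = $line =~ /(<[^\/].*?>)/g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    @all_opening\n";
```

The greedy match returns the whole line, while the non-greedy one returns only <HEAD>. Note that an `if` test still reports one tag per line, which is why the sketch also shows the /g form.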

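> -------------------------------------------------------------------------- <
Author: giacch (小ａ)  Board: RegExp
Title: Re: [Question] Extracting HTML tags
Time: Sat Nov 8 13:46:02 2008

※ Quoting yingwan (yingwan):
(snipped...)
: This is the Perl I wrote:
: open(IN, $file) || die "can't read $file";
: @file = <IN>;

undef *TMP;
for(@file) {
    # prepend any unfinished tag carried over from the previous line
    $_ = $TMP . $_ if($TMP);
    # remove each complete tag from the line and collect it in @TMP
    while(/<[^>]+>/) { push(@TMP, $1) if(s/(<[^>]+>)//); }
    # a tag that is still open at the end of the line is saved for the next one
    /(<[^>]+)/ ? $TMP = $1 : undef $TMP;
}
# tidy each tag: strip a space before ">", delete a newline, squeeze runs of spaces
@file = map { s/ >/>/; s/\n//; s/ +/ /g; $_ } @TMP;

: print "These are the opening tags:\n";
: foreach $line (@file){
: find_opening_tags($line);
(snipped...)

Add that block and the output will match the expected result...

--
※ Origin: PTT (ptt.cc)
◆ From: 118.232.236.185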
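
For comparison, the same idea can be sketched as a single pass over the whole input instead of patching @file line by line. This is only an illustration under some assumptions: split_tags is a hypothetical helper, the line breaks in the sample HTML are guessed, and the pattern assumes no '>' appears inside attribute values or comments, so it is not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@opening, \@closing) for an HTML string.
sub split_tags {
    my ($html) = @_;
    my (@opening, @closing);

    # [^>]+ also matches newlines, so tags that span lines are found too.
    my @tags = $html =~ /<[^>]+>/g;

    for my $tag (@tags) {
        $tag =~ s/\s+/ /g;     # collapse whitespace runs, including newlines
        $tag =~ s/\s+>$/>/;    # drop whitespace before the final '>'
        if ($tag =~ m{^</}) { push @closing, $tag }
        else                { push @opening, $tag }
    }
    return (\@opening, \@closing);
}

# Sample input modeled on the question; the line breaks are an assumption.
my $html = <<'HTML';
<HTML >
<HEAD><TITLE> Hello World </TITLE></HEAD >
<BODY>
<H1>Greetings</H1>
<a href="index.html"
   targe=_self > Homepage </a ><p>
<strong >Tat Tval Asi</strong>
</BODY>
</HTML>
HTML

my ($opening, $closing) = split_tags($html);

print "These are the opening tags:\n";
print "$_\n" for @$opening;
print "\nThese are the closing tags:\n";
print "$_\n" for @$closing;
```

Matching against one string avoids carrying a partial tag from one line to the next, at the cost of reading the whole file into memory.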

Push yingwan: You're amazing, thank you! 11/09 08:09