Author: yingwan (yingwan)
Board: RegExp
Title: [Question] Extracting HTML tags
Time: Sat Nov 8 07:39:43 2008
I want to pull the opening tags (<...>) and the closing tags (</...>) out of a web page as two separate lists.
The source is:
<HTML >
<HEAD><TITLE> Hello World </TITLE></HEAD >
<BODY>
<H1>Greetings</H1>
<a href="index.html"
targe=_self > Homepage </a ><p>
<strong >Tat Tval Asi</strong>
</BODY>
</HTML>
After extraction, the output should be:
These are the opening tags:
<HTML>
<HEAD>
<TITLE>
<BODY>
<H1>
<a href="index.html" targe=_self>
<p>
<strong>
These are the closing tags:
</TITLE>
</HEAD>
</H1>
</a>
</strong>
</BODY>
</HTML>
This is what I wrote in Perl:
open(IN, $file) || die "can't read $file";
@file = <IN>;
print "These are the opening tags:\n";
foreach $line (@file){
find_opening_tags($line);
}
print "\n";
print "These are the closing tags:\n";
foreach $line (@file){
find_closing_tags($line);
}
close IN;
# end of main
#-------------------
# subroutines
#-------------------
sub find_opening_tags {
my $line = $_[0];
if ($line=~ /(\<[^\/].*\>)/){
print "$1\n";
}
}
sub find_closing_tags {
my $line = $_[0];
if ($line =~ /(\<\/.*\>)/) {
print "$1\n";
}
}
But the result I get is:
These are the opening tags:
<HTML >
<HEAD>
<BODY>
<H1>
<p>
<strong >
These are the closing tags:
</TITLE></HEAD >
</H1>
</a ><p>
</strong>
</BODY>
</HTML>
I'd appreciate any pointers from the experts here. Thanks.
--
※ Origin: PTT (批踢踢實業坊, ptt.cc)
◆ From: 149.159.132.73
→ supertitler: Use *? (non-greedy) so the match doesn't swallow the rest of the string. 11/08 13:37
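To illustrate the tip above, here is a minimal sketch comparing the greedy pattern from the question with a non-greedy variant on one line of the sample HTML. The variable names ($line, $greedy, $lazy, @all) are only for this demo and are not from the original script.

```perl
use strict;
use warnings;

# One line taken from the question's HTML source.
my $line = '<HEAD><TITLE> Hello World </TITLE></HEAD >';

# Greedy: .* runs to the end of the line, then backs off to the LAST ">".
my ($greedy) = $line =~ /(<[^\/].*>)/;

# Non-greedy: .*? stops at the FIRST ">" that lets the match succeed.
my ($lazy) = $line =~ /(<[^\/].*?>)/;

# With /g in list context, the non-greedy pattern returns every opening tag.
my @all = $line =~ /(<[^\/].*?>)/g;

print "greedy: $greedy\n";   # the whole line
print "lazy:   $lazy\n";     # <HEAD>
print "all:    @all\n";      # <HEAD> <TITLE>
```

Note that *? only fixes the over-long match. A single `if` still reports just the first tag on a line, so /g (or a loop) is needed as well, and a tag that is split across two lines is not handled by this change alone.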
> -------------------------------------------------------------------------- <
Author: giacch (小a) Board: RegExp
Title: Re: [Question] Extracting HTML tags
Time: Sat Nov 8 13:46:02 2008
※ Quoting yingwan (yingwan):
(snipped...)
: This is what I wrote in Perl:
: open(IN, $file) || die "can't read $file";
: @file = <IN>;
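undef *TMP;                               # clear both $TMP (carry-over) and @TMP (tags found)
for (@file) {
    $_ = $TMP . $_ if ($TMP);             # prepend a tag left unfinished on the previous line
    while (s/(<[^>]+>)//) {               # cut out each complete tag, one at a time
        push(@TMP, $1);
    }
    /(<[^>]+)/ ? $TMP = $1 : undef $TMP;  # remember a tag that is still open at end of line
}
# One tag per element: join wrapped lines, collapse spaces, drop the space before ">".
@file = map { s/\n/ /g; s/ +/ /g; s/ >/>/; $_ } @TMP;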
: print "These are the opening tags:\n";
: foreach $line (@file){
: find_opening_tags($line);
(snipped...)
Add that block and the output will match the expected result...
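
For comparison, here is a minimal sketch of another route to the same output, assuming the whole page fits in memory: treat the file as one string so that a tag broken across lines is matched in one piece, then collect every tag with /g. The sub names extract_tags and normalize_tag are hypothetical, and the sample page is inlined (with the href written as index.html, as in the expected output) instead of being read from $file. Like the code above, it assumes no ">" appears inside attribute values or comments; for real pages a proper parser such as the CPAN module HTML::Parser is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy one raw tag, e.g. "</a >" becomes "</a>".
sub normalize_tag {
    my ($tag) = @_;
    $tag =~ s/\s+/ /g;    # newlines and runs of blanks become one space
    $tag =~ s/ >$/>/;     # drop the space before the closing ">"
    return $tag;
}

# Hypothetical helper: return (\@opening, \@closing) for an HTML string.
sub extract_tags {
    my ($html) = @_;
    my (@opening, @closing);
    # [^>]+ also matches newlines, so multi-line tags are found whole.
    for my $raw ($html =~ /(<[^>]+>)/g) {
        my $tag = normalize_tag($raw);
        if ($tag =~ m{^</}) { push @closing, $tag; }
        else                { push @opening, $tag; }
    }
    return (\@opening, \@closing);
}

# The sample page from the question, inlined for the demo.
my $html = <<'END_HTML';
<HTML >
<HEAD><TITLE> Hello World </TITLE></HEAD >
<BODY>
<H1>Greetings</H1>
<a href="index.html"
targe=_self > Homepage </a ><p>
<strong >Tat Tval Asi</strong>
</BODY>
</HTML>
END_HTML

my ($opening, $closing) = extract_tags($html);
print "These are the opening tags:\n";
print "$_\n" for @$opening;
print "\nThese are the closing tags:\n";
print "$_\n" for @$closing;
```

Because each tag is classified after it has been isolated, the opening/closing test reduces to checking whether the tag starts with "</".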
--
※ Origin: PTT (批踢踢實業坊, ptt.cc)
◆ From: 118.232.236.185
Push yingwan: You're amazing, thank you! 11/09 08:09