2009-05-01

【P】TIDY - Fixing Broken HTML, XHTML, XML

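Recently I noticed that some of the RSS feeds I fetch come back with malformed markup.

I asked our resident guru, 部落小波, and he pointed me to the 【HTML Tidy Library Project】.

(部落小波 really does know a lot~~)

Tidy repairs markup by adding missing tags and dropping superfluous ones.

Many blog editors now let you edit the underlying HTML source yourself~

But it is easy to leave that source in a mess, and that is when Tidy comes in handy:

it fixes these errors for you~~

Luckily, PHP supports tidy as well~

Installation goes as follows -- (shamelessly pasted from 部落小波's post)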

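(1) Install tidy

I use SuSE, so I tracked down the two rpms, libtidy and libtidy_devel, and installed them.

(2) Build the tidy extension for PHP

./configure --with-tidy=/path/to/libtidy

That is, add --with-tidy to your existing build options.

(3) Once the build finishes, you can happily start using it.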
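
After step (3), a quick check along these lines can confirm that the extension really is loaded. This is only a minimal sketch: the helper name tidy_is_available is made up for illustration, while extension_loaded() and tidy_get_release() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the tidy
// extension was compiled in and is loaded in this PHP build.
function tidy_is_available(): bool
{
    return extension_loaded('tidy') && class_exists('tidy');
}

if (tidy_is_available()) {
    // tidy_get_release() returns the release date of the linked libtidy.
    echo "tidy is available, libtidy release: ", tidy_get_release(), "\n";
} else {
    echo "tidy is not available; rebuild PHP with --with-tidy\n";
}
```

Running it from the command line (php check.php, where check.php is whatever name you saved it under) is usually enough to tell whether the build picked up libtidy.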

Here are some examples.

First, repairing XML~

In the input below, every tag after 【.....醜醜風的老頭~~】 has been deleted:

<?php
header("Content-Type:text/xml");
ob_start();
?>
<rss version="2.0">
  <channel>
    <title>My Lief</title>
    <link>http://blog.xuite.net/chingwei/blog</link>
    <description>爽快過生活!!</description>
    <item>
      <title>【圖】畫了個醜醜風的老頭</title>
      <link>http://blog.xuite.net/chingwei/blog/23691532</link>
      <description>醜醜風的老頭~~
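<?php
// Grab the deliberately truncated XML captured by the output buffer.
$buffer = ob_get_clean();
$tidy_options = array(
    'input-xml'    => true,   // parse the input as XML, not HTML
    'output-xml'   => true,   // write the result as XML
    'indent'       => true,   // indent nested elements
    'wrap'         => false,  // do not wrap long lines
);
$tidy = new tidy();
$tidy->parseString($buffer, $tidy_options, 'utf8');
$tidy->cleanRepair();  // close the tags that were left open
echo $tidy;
?>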

In the output, it filled in all the missing tags for me. Great.

<rss version="2.0">
  <channel>
    <title>My Lief</title>
    <link>http://blog.xuite.net/chingwei/blog</link>
    <description>爽快過生活!!</description>
    <item>
      <title>【圖】畫了個醜醜風的老頭</title>
      <link>http://blog.xuite.net/chingwei/blog/23691532</link>
      <description>醜醜風的老頭~~</description>
    </item>
  </channel>
</rss>

Next, let's append an unfinished <Error tag right after 【.....醜醜風的老頭~~】:

<?php
header("Content-Type:text/xml");
ob_start();
?>
<rss version="2.0">
  <channel>
    <title>My Lief</title>
    <link>http://blog.xuite.net/chingwei/blog</link>
    <description>爽快過生活!!</description>
    <item>
      <title>【圖】畫了個醜醜風的老頭</title>
      <link>http://blog.xuite.net/chingwei/blog/23691532</link>
      <description>醜醜風的老頭~~<Error
<?php
// Same repair as before; this time the input ends in a half-written tag.
$buffer = ob_get_clean();
$tidy_options = array(
    'input-xml'    => true,
    'output-xml'   => true,
    'indent'       => true,
    'wrap'         => false,
);
$tidy = new tidy();
$tidy->parseString($buffer, $tidy_options, 'utf8');
$tidy->cleanRepair();
echo $tidy;
?>

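This time the result contains an extra Error element.
That is not what I wanted, although what Tidy did is arguably not wrong either: it cannot know the tag was a mistake.
I can't ask for too much~ it is already very capable~~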

<rss version="2.0">
  <channel>
    <title>My Lief</title>
    <link>http://blog.xuite.net/chingwei/blog</link>
    <description>爽快過生活!!</description>
    <item>
      <title>【圖】畫了個醜醜風的老頭</title>
      <link>http://blog.xuite.net/chingwei/blog/23691532</link>
      <description>醜醜風的老頭~~
      <Error></Error></description>
    </item>
  </channel>
</rss>
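
Since Tidy has to guess in a case like this, it may help to look at its diagnostics before trusting the repaired output. The sketch below is one possible way to do that, not the approach from the original post: the helper name repair_xml_with_report and the shortened sample feed are made up for illustration, while tidy_error_count(), tidy_warning_count() and the errorBuffer property come from PHP's tidy extension.

```php
<?php
// Hypothetical helper: repairs an XML string with tidy and also returns
// tidy's diagnostics so the caller can decide whether to trust the result.
function repair_xml_with_report(string $xml): array
{
    $tidy = new tidy();
    $tidy->parseString($xml, [
        'input-xml'  => true,
        'output-xml' => true,
        'indent'     => true,
        'wrap'       => 0,   // 0 disables line wrapping
    ], 'utf8');
    $tidy->cleanRepair();

    return [
        'xml'      => (string) $tidy,
        'errors'   => tidy_error_count($tidy),
        'warnings' => tidy_warning_count($tidy),
        'log'      => (string) $tidy->errorBuffer,  // empty when nothing was reported
    ];
}

// Shortened, made-up feed that ends in a half-written tag.
$broken = '<rss version="2.0"><channel><title>My Lief</title>'
        . '<item><description>text<Error';
$report = repair_xml_with_report($broken);

if ($report['errors'] + $report['warnings'] > 0) {
    // Tidy had to repair something; show what it complained about.
    echo $report['log'], "\n";
}
echo $report['xml'];
```

For fetched RSS, one option is to log the report and skip feeds with a non-zero count rather than publish whatever Tidy guessed.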

Below is the sample from the official PHP site, which repairs HTML (tidy_repair_string):

<?php
ob_start();
?>
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <p>error</i>
  </body>
</html>
<?php
$buffer = ob_get_clean();
$tidy = tidy_repair_string($buffer);
echo $tidy;
?>

As a result, it fixed both the missing tag and the mismatched one.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</p>
</body>
</html>
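
The call above relies on Tidy's defaults. tidy_repair_string() also accepts an option array and an encoding as its second and third arguments, so a fragment can be repaired without the surrounding html/head/body wrapper. A minimal sketch, where the particular options are only one reasonable choice:

```php
<?php
$html = '<p>error</i>';

// Same function as in the sample, now with an option array and an
// explicit encoding.
$fixed = tidy_repair_string($html, [
    'output-xhtml'   => true,  // emit XHTML instead of HTML
    'show-body-only' => true,  // return only the contents of <body>
    'wrap'           => 0,     // do not wrap long lines
], 'utf8');

echo $fixed;
```

show-body-only is handy when the repaired markup is going to be embedded in an existing page, such as a post body or an RSS description.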

References:

HTML Tidy Library Project

PHP: Tidy

不太會寫程式 (blog)

PHP+Tidy-完美的XHTML纠错+过滤 (PHP + Tidy: XHTML error correction and filtering)

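P.S.
I rewrote this post three times, and this editor nearly drove me crazy~ Whenever I press Ctrl+V, it replaces the < characters in my textarea. Oh my~~~~ So I won't be updating this post again~~~~

Attached: HTML Tidy Configuration Options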


HTML, XHTML, XML Options
Option                       Type       Default
add-xml-decl                 Boolean    no
add-xml-space                Boolean    no
alt-text                     String     -
anchor-as-name               Boolean    yes
assume-xml-procins           Boolean    no
bare                         Boolean    no
clean                        Boolean    no
css-prefix                   String     -
decorate-inferred-ul         Boolean    no
doctype                      DocType    auto
drop-empty-paras             Boolean    yes
drop-font-tags               Boolean    no
drop-proprietary-attributes  Boolean    no
enclose-block-text           Boolean    no
enclose-text                 Boolean    no
escape-cdata                 Boolean    no
fix-backslash                Boolean    yes
fix-bad-comments             Boolean    yes
fix-uri                      Boolean    yes
hide-comments                Boolean    no
hide-endtags                 Boolean    no
indent-cdata                 Boolean    no
input-xml                    Boolean    no
join-classes                 Boolean    no
join-styles                  Boolean    yes
literal-attributes           Boolean    no
logical-emphasis             Boolean    no
lower-literals               Boolean    yes
merge-divs                   AutoBool   auto
merge-spans                  AutoBool   auto
ncr                          Boolean    yes
new-blocklevel-tags          Tag names  -
new-empty-tags               Tag names  -
new-inline-tags              Tag names  -
new-pre-tags                 Tag names  -
numeric-entities             Boolean    no
output-html                  Boolean    no
output-xhtml                 Boolean    no
output-xml                   Boolean    no
preserve-entities            Boolean    no
quote-ampersand              Boolean    yes
quote-marks                  Boolean    no
quote-nbsp                   Boolean    yes
repeated-attributes          enum       keep-last
replace-color                Boolean    no
show-body-only               AutoBool   no
uppercase-attributes         Boolean    no
uppercase-tags               Boolean    no
word-2000                    Boolean    no
 
Diagnostics Options
Option                       Type       Default
accessibility-check          enum       0 (Tidy Classic)
show-errors                  Integer    6
show-warnings                Boolean    yes
 
Pretty Print Options
Option                       Type       Default
break-before-br              Boolean    no
indent                       AutoBool   no
indent-attributes            Boolean    no
indent-spaces                Integer    2
markup                       Boolean    yes
punctuation-wrap             Boolean    no
sort-attributes              enum       none
split                        Boolean    no
tab-size                     Integer    8
vertical-space               Boolean    no
wrap                         Integer    68
wrap-asp                     Boolean    yes
wrap-attributes              Boolean    no
wrap-jste                    Boolean    yes
wrap-php                     Boolean    yes
wrap-script-literals         Boolean    no
wrap-sections                Boolean    yes
 
Character Encoding Options
Option                       Type       Default
ascii-chars                  Boolean    no
char-encoding                Encoding   ascii
input-encoding               Encoding   latin1
language                     String     -
newline                      enum       Platform dependent
output-bom                   AutoBool   auto
output-encoding              Encoding   ascii
 
Miscellaneous Options
Option                       Type       Default
error-file                   String     -
force-output                 Boolean    no
gnu-emacs                    Boolean    no
gnu-emacs-file               String     -
keep-time                    Boolean    no
output-file                  String     -
quiet                        Boolean    no
slide-style                  String     -
tidy-mark                    Boolean    yes
write-back                   Boolean    no
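
The list above reflects the Tidy release the post was written against, and newer libtidy builds may add or drop options. To see what your own build supports and which values are in effect, something like the following may help. It is only a sketch: getOpt() and getConfig() are methods of PHP's tidy class, and the sample markup and chosen options are arbitrary.

```php
<?php
$tidy = new tidy();
$tidy->parseString('<p>hello', ['indent' => true, 'wrap' => 0], 'utf8');

// getOpt() returns the effective value of a single option.
var_dump($tidy->getOpt('indent'));
var_dump($tidy->getOpt('wrap'));
var_dump($tidy->getOpt('tidy-mark'));

// getConfig() returns every option as an associative array
// keyed by option name.
$config = $tidy->getConfig();
echo count($config), " options known to this libtidy build\n";
```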

 
