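Lately I've noticed that some of the RSS feeds I fetch come back malformed.
I asked our resident guru, 部落小波, and he pointed me to the 【HTML Tidy Library Project】.
(I really think 部落小波 knows an awful lot~~)
Tidy can repair markup by adding missing tags and dropping stray ones.
Many blog editors now let you edit the HTML source yourself,
and people sometimes leave that source in a real mess. That's when this tool comes in handy:
it fixes those errors for you~~
Luckily, PHP supports tidy too~
Here's how to install it -- (shamelessly pasted from 部落小波's post)
(1) Install tidy
I use SuSE, so I tracked down the libtidy and libtidy_devel RPMs and installed them.
(2) Build the tidy extension into PHP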
./configure --with-tidy=/path/to/libtidy
That is, add --with-tidy to your existing build options.
(3) Once the build finishes, you're ready to use it.
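Before going further, it may be worth confirming that the rebuilt PHP really has tidy compiled in. A minimal sketch, where the helper name `tidyIsAvailable` is my own invention and not part of the extension:

```php
<?php
// Hypothetical helper: true when the tidy extension is loaded in this PHP build.
function tidyIsAvailable(): bool
{
    return extension_loaded('tidy') && class_exists('tidy');
}

if (tidyIsAvailable()) {
    // tidy_get_release() reports the release of the libtidy that PHP was linked against.
    echo 'tidy is available, libtidy release: ', tidy_get_release(), "\n";
} else {
    echo "tidy is not available; rebuild PHP with --with-tidy\n";
}
```

Running `php -m` on the command line and looking for `tidy` in the list should tell you the same thing.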
Next, some examples.
First, repairing XML~
In this sample I deleted all the closing tags that should follow 【.....醜醜風的老頭~~】
<?php
header("Content-Type:text/xml");
ob_start();
?>
<rss version="2.0">
<channel>
<title>My Lief</title>
<link>http://blog.xuite.net/chingwei/blog</link>
<description>爽快過生活!!</description>
<item>
<title>【圖】畫了個醜醜風的老頭</title>
<link>http://blog.xuite.net/chingwei/blog/23691532</link>
<description>醜醜風的老頭~~
<?php
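// Grab the buffered (deliberately broken) XML.
$buffer = ob_get_clean();
$tidy_options = array(
'input-xml' => true,  // treat the input as XML, not HTML
'output-xml' => true, // write the result as XML
'indent' => true,     // indent nested elements
'wrap' => false,      // do not wrap long lines
);
$tidy = new tidy();
$tidy->parseString($buffer, $tidy_options, 'utf8');
$tidy->cleanRepair();
echo $tidy;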
?>
In the output, it filled in all the missing tags for me. Very nice.
<rss version="2.0">
<channel>
<title>My Lief</title>
<link>http://blog.xuite.net/chingwei/blog</link>
<description>爽快過生活!!</description>
<item>
<title>【圖】畫了個醜醜風的老頭</title>
<link>http://blog.xuite.net/chingwei/blog/23691532</link>
<description>醜醜風的老頭~~</description>
</item>
</channel>
</rss>
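If you fetch feeds as strings instead of building them with ob_start(), the same steps could be wrapped in a small function. This is only a sketch: the name `repairXml` and the shortened feed are made up for illustration, and it uses `'wrap' => 0` because `wrap` is an Integer option where 0 turns wrapping off.

```php
<?php
// Hypothetical wrapper around the parseString() / cleanRepair() steps shown above.
function repairXml(string $xml): string
{
    $options = [
        'input-xml'  => true, // parse the input as XML
        'output-xml' => true, // emit XML
        'indent'     => true, // indent nested elements
        'wrap'       => 0,    // 0 disables line wrapping
    ];
    $tidy = new tidy();
    $tidy->parseString($xml, $options, 'utf8');
    $tidy->cleanRepair();
    return tidy_get_output($tidy);
}

// Usage: a feed whose closing tags were cut off.
if (extension_loaded('tidy')) {
    $broken = '<rss version="2.0"><channel><title>My Lief</title>'
            . '<item><description>text';
    echo repairXml($broken);
}
```

Returning the string rather than echoing it makes it easier to hand the repaired feed to whatever parser comes next.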
Next, let's append an unfinished <Error tag right after 【.....醜醜風的老頭~~】
<?php
header("Content-Type:text/xml");
ob_start();
?>
<rss version="2.0">
<channel>
<title>My Lief</title>
<link>http://blog.xuite.net/chingwei/blog</link>
<description>爽快過生活!!</description>
<item>
<title>【圖】畫了個醜醜風的老頭</title>
<link>http://blog.xuite.net/chingwei/blog/23691532</link>
<description>醜醜風的老頭~~<Error
<?php
$buffer = ob_get_clean();
$tidy_options = array(
'input-xml' => true,
'output-xml' => true,
'indent' => true,
'wrap' => false,
);
$tidy = new tidy();
$tidy->parseString($buffer, $tidy_options,'utf8');
$tidy->cleanRepair();
echo $tidy;
?>
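This time the result ends up with an extra Error tag,
which isn't what I wanted, though what tidy did is probably reasonable.
Can't ask for too much~ it's already very capable~~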
<rss version="2.0">
<channel>
<title>My Lief</title>
<link>http://blog.xuite.net/chingwei/blog</link>
<description>爽快過生活!!</description>
<item>
<title>【圖】畫了個醜醜風的老頭</title>
<link>http://blog.xuite.net/chingwei/blog/23691532</link>
<description>醜醜風的老頭~~
<Error></Error></description>
</item>
</channel>
</rss>
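Since tidy can only guess at what a broken tag was meant to be, it might help to look at what it complained about before trusting the repaired feed. A rough sketch, assuming you want the repaired text plus tidy's own diagnostics; the function name `repairWithReport` and the array keys are my own, and what tidy actually reports for a given input may vary between libtidy versions:

```php
<?php
// Hypothetical helper: repair XML and return tidy's diagnostics alongside the output.
function repairWithReport(string $xml): array
{
    $tidy = new tidy();
    $tidy->parseString($xml, ['input-xml' => true, 'output-xml' => true], 'utf8');
    $tidy->cleanRepair();

    return [
        'output'   => tidy_get_output($tidy),    // the repaired markup
        'errors'   => tidy_error_count($tidy),   // number of errors tidy counted
        'warnings' => tidy_warning_count($tidy), // number of warnings tidy counted
        'log'      => (string) $tidy->errorBuffer, // tidy's messages, empty if none
    ];
}

// Usage: inspect the report for the feed with the dangling <Error tag.
if (extension_loaded('tidy')) {
    $report = repairWithReport('<rss version="2.0"><channel><item><description>text<Error');
    echo $report['log'], "\n";
    echo 'errors: ', $report['errors'], ', warnings: ', $report['warnings'], "\n";
}
```

If the log mentions something you did not expect, you could skip that feed item instead of publishing a guessed repair.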
Below is a sample from the official PHP site that repairs HTML (tidy_repair_string)
<?php
ob_start();
?>
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
<?php
$buffer = ob_get_clean();
$tidy = tidy_repair_string($buffer);
echo $tidy;
?>
As a result, it fixed both the missing tag and the wrong one.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</p>
</body>
</html>
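For a snippet of post content, as opposed to a whole page, the DOCTYPE and the html/head/body wrapper are probably unwanted. A minimal sketch that passes a config array to tidy_repair_string, using the `show-body-only`, `output-xhtml`, and `wrap` options from the options table; the function name `repairHtmlFragment` is hypothetical:

```php
<?php
// Hypothetical helper: repair an HTML fragment and return only the body content.
function repairHtmlFragment(string $html): string
{
    $config = [
        'show-body-only' => true, // leave out DOCTYPE, <html>, <head>, <body>
        'output-xhtml'   => true, // write the result as XHTML
        'wrap'           => 0,    // 0 disables line wrapping
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');

    // tidy_repair_string() returns false on failure; fall back to the input then.
    return $fixed === false ? $html : $fixed;
}

// Usage: the same mismatched tag as in the sample above.
if (extension_loaded('tidy')) {
    echo repairHtmlFragment('<p>error</i>');
}
```

Falling back to the original string on failure is just one choice; you might prefer to throw or log instead.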
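PS.
I rewrote this post three times and this editor is driving me crazy~ Every time I press Ctrl+V, it replaces the < characters in my textarea. Argh~~~~ So I won't be updating this post again~~~~
Attached: HTML Tidy Configuration Options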
| HTML, XHTML, XML Options | | |
| Option | Type | Default |
| add-xml-decl | Boolean | no |
| add-xml-space | Boolean | no |
| alt-text | String | - |
| anchor-as-name | Boolean | yes |
| assume-xml-procins | Boolean | no |
| bare | Boolean | no |
| clean | Boolean | no |
| css-prefix | String | - |
| decorate-inferred-ul | Boolean | no |
| doctype | DocType | auto |
| drop-empty-paras | Boolean | yes |
| drop-font-tags | Boolean | no |
| drop-proprietary-attributes | Boolean | no |
| enclose-block-text | Boolean | no |
| enclose-text | Boolean | no |
| escape-cdata | Boolean | no |
| fix-backslash | Boolean | yes |
| fix-bad-comments | Boolean | yes |
| fix-uri | Boolean | yes |
| hide-comments | Boolean | no |
| hide-endtags | Boolean | no |
| indent-cdata | Boolean | no |
| input-xml | Boolean | no |
| join-classes | Boolean | no |
| join-styles | Boolean | yes |
| literal-attributes | Boolean | no |
| logical-emphasis | Boolean | no |
| lower-literals | Boolean | yes |
| merge-divs | AutoBool | auto |
| merge-spans | AutoBool | auto |
| ncr | Boolean | yes |
| new-blocklevel-tags | Tag names | - |
| new-empty-tags | Tag names | - |
| new-inline-tags | Tag names | - |
| new-pre-tags | Tag names | - |
| numeric-entities | Boolean | no |
| output-html | Boolean | no |
| output-xhtml | Boolean | no |
| output-xml | Boolean | no |
| preserve-entities | Boolean | no |
| quote-ampersand | Boolean | yes |
| quote-marks | Boolean | no |
| quote-nbsp | Boolean | yes |
| repeated-attributes | enum | keep-last |
| replace-color | Boolean | no |
| show-body-only | AutoBool | no |
| uppercase-attributes | Boolean | no |
| uppercase-tags | Boolean | no |
| word-2000 | Boolean | no |
| Diagnostics Options | | |
| Option | Type | Default |
| accessibility-check | enum | 0 (Tidy Classic) |
| show-errors | Integer | 6 |
| show-warnings | Boolean | yes |
| Pretty Print Options | | |
| Option | Type | Default |
| break-before-br | Boolean | no |
| indent | AutoBool | no |
| indent-attributes | Boolean | no |
| indent-spaces | Integer | 2 |
| markup | Boolean | yes |
| punctuation-wrap | Boolean | no |
| sort-attributes | enum | none |
| split | Boolean | no |
| tab-size | Integer | 8 |
| vertical-space | Boolean | no |
| wrap | Integer | 68 |
| wrap-asp | Boolean | yes |
| wrap-attributes | Boolean | no |
| wrap-jste | Boolean | yes |
| wrap-php | Boolean | yes |
| wrap-script-literals | Boolean | no |
| wrap-sections | Boolean | yes |
| Character Encoding Options | | |
| Option | Type | Default |
| ascii-chars | Boolean | no |
| char-encoding | Encoding | ascii |
| input-encoding | Encoding | latin1 |
| language | String | - |
| newline | enum | Platform dependent |
| output-bom | AutoBool | auto |
| output-encoding | Encoding | ascii |
| Miscellaneous Options | | |
| Option | Type | Default |
| error-file | String | - |
| force-output | Boolean | no |
| gnu-emacs | Boolean | no |
| gnu-emacs-file | String | - |
| keep-time | Boolean | no |
| output-file | String | - |
| quiet | Boolean | no |
| slide-style | String | - |
| tidy-mark | Boolean | yes |
| write-back | Boolean | no |
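The option names in this table are the same keys that go into the config array in PHP. To see what value an option actually has after parsing, something like the sketch below might help. The helper name `readTidyOptions` is made up, and the defaults you see may differ from this table depending on your libtidy version:

```php
<?php
// Hypothetical helper: parse a tiny document with $config, then read back some options.
function readTidyOptions(array $config, array $names): array
{
    $tidy = new tidy();
    $tidy->parseString('<p>hello</p>', $config, 'utf8');

    $values = [];
    foreach ($names as $name) {
        // getOpt() returns the current value of a single option.
        $values[$name] = $tidy->getOpt($name);
    }
    return $values;
}

// Usage: override wrap, leave the other options at their defaults.
if (extension_loaded('tidy')) {
    var_export(readTidyOptions(['wrap' => 0], ['wrap', 'indent-spaces', 'input-xml']));
    echo "\n";
}
```

Options that are not in the config array come back with whatever default your libtidy uses.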