Extracting img sources from HTML and stripping HTML tags - 白红宇

Published: 2019-06-09

C#:

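using System.Text.RegularExpressions; // needed by both methods below

public string RemoveHTML(string html)
{
    // Drop <script> blocks; Singleline lets .*? span line breaks.
    html = Regex.Replace(html, @"<script[^>]*?>.*?</script>", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Drop every remaining tag. This already covers <img>, so the original's
    // separate <img> pattern (which also carried a stray ';') is not needed.
    html = Regex.Replace(html, @"<(.[^>]*)>", "", RegexOptions.IgnoreCase);
    // Drop line breaks together with the whitespace that follows them.
    html = Regex.Replace(html, @"([\r\n])[\s]+", "", RegexOptions.IgnoreCase);
    // Drop leftover comment delimiters.
    html = Regex.Replace(html, @"-->", "", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"<!--.*", "", RegexOptions.IgnoreCase);
    // Drop stray angle brackets and CRLF pairs. String.Replace returns a new
    // string, so the result must be assigned back. This runs before entity
    // decoding so that a decoded "<" or ">" survives.
    html = html.Replace("<", "");
    html = html.Replace(">", "");
    html = html.Replace("\r\n", "");
    // Decode common entities.
    html = Regex.Replace(html, @"&(quot|#34);", "\"", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(lt|#60);", "<", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(gt|#62);", ">", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(nbsp|#160);", " ", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(iexcl|#161);", "\u00A1", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(cent|#162);", "\u00A2", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(pound|#163);", "\u00A3", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"&(copy|#169);", "\u00A9", RegexOptions.IgnoreCase);
    // Drop any other numeric entity, except &#38;, which is decoded next.
    html = Regex.Replace(html, @"&#(?!38;)\d+;", "", RegexOptions.IgnoreCase);
    // Decode &amp; last so that "&amp;lt;" yields "&lt;" instead of being decoded twice.
    html = Regex.Replace(html, @"&(amp|#38);", "&", RegexOptions.IgnoreCase);
    return html;
}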

public static string[] GetHtmlImageUrlList(string sHtmlText)
{
    // Regular expression that matches img tags and captures the src value as "imgUrl"
    Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
    // Find every match in the input
    MatchCollection matches = regImg.Matches(sHtmlText);
    int i = 0;
    string[] sUrlList = new string[matches.Count];
    // Copy the captured URLs into the result array
    foreach (Match match in matches)
    {
        sUrlList[i++] = match.Groups["imgUrl"].Value;
    }
    return sUrlList;
}

JavaScript:

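function getimgsrc(htmlstr) {
    // Capture the src value of each <img>, quoted or not. The original pattern
    // required whitespace or '>' right after the closing quote, so it missed
    // tags such as <img src="a.png"/>.
    var reg = /<img\b[^>]*?\ssrc\s*=\s*['"]?([^'"\s>]+)/gi;
    var arr = [];
    var tem; // declared locally; the original leaked it as an implicit global
    while ((tem = reg.exec(htmlstr)) !== null) {
        arr.push(tem[1]);
    }
    return arr;
}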
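
As a quick cross-check of the extraction logic, the sketch below uses String.prototype.matchAll instead of a reg.exec loop. The function name extractImgSrcs, the pattern, and the sample markup are illustrative assumptions and do not come from the original post. Like any regex-based approach, it is only a rough approximation of real HTML parsing.

```javascript
// Hypothetical helper (not from the original post): collect every <img> src value.
function extractImgSrcs(html) {
  // The alternation keeps quoted values intact (spaces allowed) and
  // falls back to a bare token for unquoted attributes.
  var pattern = /<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  return Array.from(html.matchAll(pattern), function (m) {
    // Exactly one of the three capture groups is set for each match.
    return m[1] !== undefined ? m[1] : m[2] !== undefined ? m[2] : m[3];
  });
}

// Sample markup (made up for illustration): double-quoted, single-quoted
// with a self-closing slash, and unquoted upper-case variants.
var sample =
  '<p>intro</p>' +
  '<img src="a.png" alt="first">' +
  "<img class='x' src='b.jpg'/>" +
  '<IMG SRC=c.gif>';

console.log(extractImgSrcs(sample)); // [ 'a.png', 'b.jpg', 'c.gif' ]
```

matchAll requires the g flag on the pattern and is available in current browsers and in Node.js 12 or later.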

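function removeHTMLTag(str) {
    str = str.replace(/<\/?[^>]*>/g, ''); // remove HTML tags
    str = str.replace(/[ \t]*\n/g, '\n'); // remove trailing whitespace on each line
    // str = str.replace(/\n\s*\r/g, '\n'); // remove extra blank lines (left disabled, as in the original)
    str = str.replace(/&nbsp;/gi, ''); // remove &nbsp; entities
    return str;
}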
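
The C# RemoveHTML above decodes a handful of character entities; a comparable step in JavaScript might look like the sketch below. The names ENTITY_MAP and decodeBasicEntities are hypothetical, and the entity list simply mirrors the one in the C# method. Unlike the C# version, which deletes unrecognized numeric entities, this sketch converts them to the corresponding character, which is a design choice rather than something the original post prescribes.

```javascript
// Hypothetical lookup table (not from the original post), mirroring the
// entities handled by the C# method.
var ENTITY_MAP = {
  quot: '"', amp: '&', lt: '<', gt: '>', nbsp: ' ',
  iexcl: '\u00a1', cent: '\u00a2', pound: '\u00a3', copy: '\u00a9'
};

// Hypothetical helper: decode named and decimal numeric entities.
function decodeBasicEntities(text) {
  // A single pass over the input: each entity is decoded exactly once,
  // so "&amp;lt;" becomes "&lt;" rather than "<".
  return text.replace(/&(?:#(\d+)|([a-z]+));/gi, function (match, code, name) {
    if (code !== undefined) {
      var n = Number(code);
      // Leave out-of-range references untouched instead of throwing.
      return n <= 0x10ffff ? String.fromCodePoint(n) : match;
    }
    var key = name.toLowerCase();
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(ENTITY_MAP, key) ? ENTITY_MAP[key] : match;
  });
}

console.log(decodeBasicEntities('&lt;b&gt; &amp;lt; &copy; &#65; &bogus;'));
// <b> &lt; © A &bogus;
```

Doing the replacement in one pass with a lookup table avoids the ordering problem that chained replace calls have, where decoding &amp; too early lets a later step decode the result a second time.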

Reposted from: https://www.cnblogs.com/codeloves/p/3461539.html
