What regex does Douban use to extract URLs from the profile bio? It seems pretty powerful. - V2EX

This topic was created 4298 days ago; the information in it may have since evolved or changed.

http://xxx.xxxx.com
xxx.xxxx.com
xxxx.com
xxxx.me
xxxx.it

I tried these, and pretty much all of them get matched.

By comparison, the one I'm using now is hopelessly weak:

import re
def replace_links(s):
    # Pass flags by keyword: the fourth positional argument of re.sub is count, not flags.
    return re.sub(r'(http://[^\s]+)', r'<a rel="nofollow" href="\1">\1</a>', s, flags=re.M)

Any pointers or suggestions for improvement would be appreciated.
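For the URL forms listed above (with or without `http://`), here is a minimal sketch of a more permissive `replace_links`. It is not Douban's actual regex, which the thread never reveals; the names `URL_RE` and `to_anchor` are made up for illustration, and the pattern is an assumption about what "a bare domain" should look like.

```python
import re
from html import escape

# Hypothetical pattern (not Douban's): optional http(s):// scheme,
# dot-separated host labels, an alphabetic TLD, and an optional path.
URL_RE = re.compile(
    r'(?<![\w@.-])'                  # do not start inside a word or an e-mail address
    r'((?:https?://)?'               # optional scheme
    r'(?:[a-z0-9-]+\.)+[a-z]{2,}\b'  # host labels followed by a TLD
    r'(?:/[^\s<>"]*)?)',             # optional path
    re.IGNORECASE,
)

def replace_links(s):
    def to_anchor(m):
        text = m.group(1)
        # Bare domains get an http:// prefix so the href is absolute.
        has_scheme = text.lower().startswith(('http://', 'https://'))
        href = text if has_scheme else 'http://' + text
        return '<a rel="nofollow" href="%s">%s</a>' % (escape(href), escape(text))
    return URL_RE.sub(to_anchor, s)
```

A pattern this loose will also link things like `file.txt`, ignores ports and query strings that have no path, and keeps trailing punctuation that follows a path, so treat it as a starting point only.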

4 replies

1

rankjie

2013-02-23 11:31:30 +08:00 via iPad

Don't use regex to parse HTML:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

2

Mutoo

2013-02-23 11:37:31 +08:00

@rankjie Parsing URLs and parsing HTML are two completely different things.

OP, you could look at some ready-made regexes:
http://regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2&AspxAutoDetectCookieSupport=1

Or build your own from the URI definition in RFC 3986 (see page 50):
http://www.ietf.org/rfc/rfc3986.txt
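As a rough illustration of the RFC route: Appendix B of RFC 3986 gives a regular expression for splitting a URI reference into its components. The sketch below wraps that expression in a small helper; `URI_RE` and `split_uri` are names invented here, not part of any library.

```python
import re

# Regex quoted from RFC 3986, Appendix B
# ("Parsing a URI Reference with a Regular Expression").
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Hypothetical helper: split a URI reference into the RFC's five components."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }
```

Note that this expression splits a string that is already known to be a URI; it does not find URLs inside free text. A bare `xxxx.me`, for example, comes back as a path with no scheme or authority, so it would still need a detection step in front of it.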

3

rankjie

2013-02-23 12:04:24 +08:00 via iPhone

@Mutoo I saw a </a> in the OP's regex call, so it looked to me like it was parsing HTML. I don't know regex =_= so please correct me if I got it wrong.

4

CoX

2013-02-23 13:46:00 +08:00

OP, you could try tornado.escape.linkify
Its regex is a bit more complex: _URL_RE = re.compile(ur"""\b((?:([\w-]+):(/{1,3})|www[.])(?:(?:(?:[^\s&()]|&amp;|&quot;)*(?:[^!"#$%&'()*+,.:;<=>?@\[\]^`{|}~\s]))|(?:\((?:[^\s&()]|&amp;|&quot;)*\)))+)""")
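To see what that pattern actually picks up, here is a small standalone sketch using only the `re` module. It assumes the entities in Tornado's source are spelled `&amp;` and `&quot;` (linkify works on HTML-escaped text), and it uses a plain raw string instead of the Python 2 `ur""` prefix; `find_urls` and `sample` are names made up for this example.

```python
import re

# Pattern from the reply above, with the HTML entities spelled out
# (&amp; and &quot;) and a raw string in place of the Python 2 ur"" prefix.
_URL_RE = re.compile(r"""\b((?:([\w-]+):(/{1,3})|www[.])(?:(?:(?:[^\s&()]|&amp;|&quot;)*(?:[^!"#$%&'()*+,.:;<=>?@\[\]^`{|}~\s]))|(?:\((?:[^\s&()]|&amp;|&quot;)*\)))+)""")

def find_urls(text):
    """Hypothetical helper: return every URL the pattern finds, in order."""
    return [m.group(1) for m in _URL_RE.finditer(text)]

sample = "see http://xxx.xxxx.com, www.xxxx.com and xxxx.me."
urls = find_urls(sample)  # ['http://xxx.xxxx.com', 'www.xxxx.com']
```

The trailing comma is left out of the first match, but the bare `xxxx.me` is not matched at all, because the pattern requires either a scheme or a `www.` prefix. If Tornado is installed, calling `tornado.escape.linkify(text, extra_params='rel="nofollow"')` should produce the anchor tags directly.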
