Python Cookbook, 3rd Edition (Chinese Edition): 6.7 Parsing XML Documents with Namespaces


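This excerpt is from Chapter 6, Section 6.7 of Python Cookbook, 3rd Edition (Chinese Edition), published through 异步社区, by David Beazley and Brian K. Jones, translated by 陈舸 (Chen Ge). More chapters are available through the 异步社区 official account on 云栖社区.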

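6.7 Parsing XML Documents with Namespaces

6.7.1 Problem

You need to parse an XML document, and the document uses XML namespaces.

6.7.2 Solution

Consider the following XML document, which uses a namespace: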

<?xml version="1.0" encoding="utf-8"?>
<top>
  <author>David Beazley</author>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Hello World</title>
      </head>
      <body>
        <h1>Hello World!</h1>
      </body>
    </html>
  </content>
</top>

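If you parse this document (the parsed result is called doc here) and try to run ordinary queries, you will find it is not so easy: every element inside the namespace must be addressed by its fully qualified name, so the queries become extremely verbose: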

>>> # Some queries that work
>>> doc.findtext('author')
'David Beazley'
>>> doc.find('content')
<Element 'content' at 0x100776ec0>
>>> # A query involving a namespace (doesn't work)
>>> doc.find('content/html')
>>> # Works if fully qualified
>>> doc.find('content/{http://www.w3.org/1999/xhtml}html')
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> # Doesn't work
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/head/title')
>>> # Fully qualified
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/'
...              '{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')
'Hello World'
>>>
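The session above assumes a document that has already been parsed. As a minimal, self-contained sketch, the same queries can be reproduced by parsing the sample from a byte string with fromstring(). The names XML_BYTES, XHTML, and root are illustrative and do not come from the original recipe.

```python
import xml.etree.ElementTree as ET

# Sample document from this recipe, kept as bytes so the encoding
# declaration on the first line is handled by the parser.
XML_BYTES = b"""<?xml version="1.0" encoding="utf-8"?>
<top>
  <author>David Beazley</author>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Hello World</title></head>
      <body><h1>Hello World!</h1></body>
    </html>
  </content>
</top>
"""

# Illustrative constant: the XHTML namespace in ElementTree's '{uri}' form.
XHTML = '{http://www.w3.org/1999/xhtml}'

root = ET.fromstring(XML_BYTES)   # root is the <top> element

# Elements outside the namespace can be found by their plain names.
author = root.findtext('author')

# An unqualified name does not match an element inside the namespace.
unqualified = root.find('content/html')

# The fully qualified name matches.
qualified = root.find('content/' + XHTML + 'html')

# Every step below <html> must be qualified as well, so this finds nothing.
partial = root.findtext('content/' + XHTML + 'html/head/title')

# Fully qualified at every step.
title = root.findtext(
    'content/' + XHTML + 'html/' + XHTML + 'head/' + XHTML + 'title')

print(author)         # David Beazley
print(unqualified)    # None
print(qualified.tag)  # {http://www.w3.org/1999/xhtml}html
print(partial)        # None
print(title)          # Hello World
```

Note that fromstring() returns the root Element rather than an ElementTree object; for these relative paths the two behave the same way.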

You can often save yourself some trouble by wrapping the namespace handling in a general-purpose utility class:

class XMLNamespaces:
    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)

    def register(self, name, uri):
        self.namespaces[name] = '{'+uri+'}'

    def __call__(self, path):
        return path.format_map(self.namespaces)

To use this class, proceed as follows:

>>> ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')
>>> doc.find(ns('content/{html}html'))
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> doc.findtext(ns('content/{html}html/{html}head/{html}title'))
'Hello World'
>>>
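To make the mechanism visible, the following sketch repeats the class and prints what a call actually returns. It also shows what happens with a short name that was never registered. The svg name and its URI are an illustrative addition, not part of the original recipe.

```python
class XMLNamespaces:
    """Expand short names such as {html} into ElementTree's '{uri}' form."""

    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)

    def register(self, name, uri):
        # Store the URI already wrapped in braces, ready for substitution.
        self.namespaces[name] = '{' + uri + '}'

    def __call__(self, path):
        # str.format_map() replaces each {name} field with its stored value.
        return path.format_map(self.namespaces)


ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')

# The call is plain string substitution; no XML is involved at this point.
expanded = ns('content/{html}html/{html}head/{html}title')
print(expanded)

# An unregistered short name makes format_map() raise KeyError.
try:
    ns('{svg}rect')
    missing = None
except KeyError as exc:
    missing = exc.args[0]
print(missing)              # svg

# Registering the name afterwards makes the same path expand normally.
ns.register('svg', 'http://www.w3.org/2000/svg')
svg_path = ns('{svg}rect')
print(svg_path)             # {http://www.w3.org/2000/svg}rect
```

Because the expansion relies on format_map(), paths passed to the object should contain only registered short names inside braces.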

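6.7.3 Discussion

Parsing XML documents that contain namespaces is tedious. The XMLNamespaces class only simplifies the process slightly: it lets you use shortened namespace names in subsequent operations instead of fully qualified URIs.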
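As a point of comparison, the find(), findall(), and findtext() methods in xml.etree.ElementTree also accept a namespaces mapping from prefix to URI, with paths written as prefix:tag. The sketch below is an alternative to the helper class rather than part of the original recipe; XML_BYTES and NSMAP are illustrative names.

```python
import xml.etree.ElementTree as ET

# Sample document from this recipe (illustrative constant name).
XML_BYTES = b"""<?xml version="1.0" encoding="utf-8"?>
<top>
  <author>David Beazley</author>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Hello World</title></head>
      <body><h1>Hello World!</h1></body>
    </html>
  </content>
</top>
"""

root = ET.fromstring(XML_BYTES)

# Prefix-to-URI mapping. The prefix is chosen by the caller and does not
# have to match whatever prefix (if any) the document itself uses.
NSMAP = {'html': 'http://www.w3.org/1999/xhtml'}

# 'html:html' means: the element named html in the namespace mapped to 'html'.
html_elem = root.find('content/html:html', namespaces=NSMAP)
title = root.findtext('content/html:html/html:head/html:title',
                      namespaces=NSMAP)

print(html_elem.tag)   # {http://www.w3.org/1999/xhtml}html
print(title)           # Hello World
```

The helper class and the namespaces mapping solve the same problem; the class has the advantage of producing plain strings that can be reused anywhere a fully qualified path is expected.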

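Unfortunately, the basic ElementTree parser offers no mechanism for obtaining further information about namespaces. If you are willing to use the iterparse() function, however, you can get some information about the scope of the namespaces being processed. For example: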

>>> from xml.etree.ElementTree import iterparse
>>> for evt, elem in iterparse('ns2.xml', ('end', 'start-ns', 'end-ns')):
...     print(evt, elem)
...
end <Element 'author' at 0x10110de10>
start-ns ('', 'http://www.w3.org/1999/xhtml')
end <Element '{http://www.w3.org/1999/xhtml}title' at 0x1011131b0>
end <Element '{http://www.w3.org/1999/xhtml}head' at 0x1011130a8>
end <Element '{http://www.w3.org/1999/xhtml}h1' at 0x101113310>
end <Element '{http://www.w3.org/1999/xhtml}body' at 0x101113260>
end <Element '{http://www.w3.org/1999/xhtml}html' at 0x10110df70>
end-ns None
end <Element 'content' at 0x10110de68>
end <Element 'top' at 0x10110dd60>
>>> elem    # This is the topmost element
<Element 'top' at 0x10110dd60>
>>>
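One way to put the start-ns events to use is to collect the namespace declarations of a document into a dictionary. The following is a minimal sketch under stated assumptions: collect_namespaces is a hypothetical helper name, and the document is read from an in-memory io.BytesIO object instead of the ns2.xml file so that the example is self-contained.

```python
import io
from xml.etree.ElementTree import iterparse

# Sample document from this recipe (illustrative constant name).
XML_BYTES = b"""<?xml version="1.0" encoding="utf-8"?>
<top>
  <author>David Beazley</author>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Hello World</title></head>
      <body><h1>Hello World!</h1></body>
    </html>
  </content>
</top>
"""


def collect_namespaces(source):
    """Return a {prefix: uri} dict of the namespace declarations in source.

    source may be a filename or a binary file object. If a prefix is
    declared more than once, the last declaration wins in this sketch.
    """
    namespaces = {}
    # Requesting only 'start-ns' means element events are not reported.
    for event, payload in iterparse(source, events=('start-ns',)):
        prefix, uri = payload   # 'start-ns' delivers a (prefix, uri) tuple
        namespaces[prefix] = uri
    return namespaces


found = collect_namespaces(io.BytesIO(XML_BYTES))
print(found)   # {'': 'http://www.w3.org/1999/xhtml'}
```

The empty-string key corresponds to the default namespace declared with a bare xmlns attribute, matching the start-ns line in the session above.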

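Finally, if the text you are parsing uses advanced XML features beyond namespaces, you are better off using the lxml library. For instance, lxml offers better support for validating documents against a DTD, more complete XPath support, and other advanced XML features. The technique in this recipe is only a small adjustment that makes parsing slightly easier.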
