模式匹配算法总结

xiaoxiao2022-07-13 217

前言

读书笔记，整理自 [美] Goodrich et al. 所著《Data Structures and Algorithms in Python》。

模式匹配

模式匹配是数据结构中字符串的一种基本运算场景，给定一个子串，要求在某个字符串中找出与该子串相同的所有子串。尽管早已可以通过 Python 下的 re 库使用正则表达式高效而简洁地实现模式匹配，但了解相关算法背后机理亦不失其学习的意义。

1. Brute-Force算法

又称为暴风算法，核心思想在于从 T 的第一个字符开始遍历每一个字符，依次匹配 P。是最简单，也最低效的匹配方法。

def find_brute(T, P): """Return the lowest index of T at which substring P begins (or else -1).""" n, m = len(T), len(P) # introduce convenient notations for i in range(n-m+1): # try every potential starting index within T k = 0 # an index into pattern P while k < m and T[i + k] == P[k]: # kth character of P matches k += 1 if k == m: # if we reached the end of pattern, return i # substring T[i:i+m] matches P return -1 # failed to find a match starting with any i

2. Boyer-Moore算法

通过跳跃启发式算法避免大量无用的比较。每次逐字符匹配从 P 最后一个字符开始。

def find_boyer_moore(T, P): """Return the lowest index of T at which substring P begins (or else -1).""" n, m = len(T), len(P) # introduce convenient notations if m == 0: return 0 # trivial search for empty string last = {} # build 'last' dictionary for k in range(m): last[ P[k] ] = k # later occurrence overwrites # align end of pattern at index m-1 of text i = m-1 # an index into T k = m-1 # an index into P while i < n: if T[i] == P[k]: # a matching character if k == 0: return i # pattern begins at index i of text else: i -= 1 # examine previous character k -= 1 # of both T and P else: j = last.get(T[i], -1) # last(T[i]) is -1 if not found i += m - min(k, j + 1) # case analysis for jump step k = m - 1 # restart at end of pattern return -1

3. Knuth-Morris-Pratt算法

穷举算法和 Boyer-Moore 算法在完全匹配中必然进行 $\text{len}(P)$ 次匹配，KMP 算法充分利用 $P$ 内部的字符串重叠，做进一步优化。

def find_kmp(T, P): """Return the lowest index of T at which substring P begins (or else -1).""" n, m = len(T), len(P) # introduce convenient notations if m == 0: return 0 # trivial search for empty string fail = compute_kmp_fail(P) # rely on utility to precompute j = 0 # index into text k = 0 # index into pattern while j < n: if T[j] == P[k]: # P[0:1+k] matched thus far if k == m - 1: # match is complete return j - m + 1 j += 1 # try to extend match k += 1 elif k > 0: k = fail[k-1] # reuse suffix of P[0:k] else: j += 1 return -1 # reached end without match def compute_kmp_fail(P): """Utility that computes and returns KMP 'fail' list.""" m = len(P) fail = [0] * m # by default, presume overlap of 0 everywhere j = 1 k = 0 while j < m: # compute f(j) during this pass, if nonzero if P[j] == P[k]: # k + 1 characters match thus far fail[j] = k + 1 j += 1 k += 1 elif k > 0: # k follows a matching prefix k = fail[k-1] else: # no match found starting at j j += 1 return fail

最新回复(0)