Nutch Study Notes 7 --- A Study of the URL Regex Filtering Mechanism


    Ran into a problem today: the URL regex filtering kept acting up. Annoyed, I opened the source code once again.

    Crawl.java contains this loop:

    for (i = 0; i < depth; i++) {  // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN,
          System.currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads);  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true);  // update crawldb
    }
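    For reference, this loop is what runs under the one-shot crawl command in Nutch 1.x; an invocation along these lines (paths illustrative) feeds -depth and -topN straight into the loop above:

    bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 1000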

    As you can see, the list of URLs to be fetched in the next round is produced by this call:

    Path[] segs = generator.generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());

    Tracing into org.apache.nutch.crawl.Generator:

    public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
        long curTime) throws IOException {
      JobConf job = new NutchJob(getConf());
      boolean filter = job.getBoolean(GENERATOR_FILTER, true);
      boolean normalise = job.getBoolean(GENERATOR_NORMALISE, true);
      return generate(dbDir, segments, numLists, topN, curTime, filter,
          normalise, false, 1);
    }

    Stepping further into generate(dbDir, segments, numLists, topN, curTime, filter, normalise, false, 1),

    we find this:

    job.setMapperClass(Selector.class);
    job.setPartitionerClass(Selector.class);
    job.setReducerClass(Selector.class);

    Clearly this runs as a Hadoop map/reduce job, with Selector.class serving as mapper, partitioner, and reducer.
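    Incidentally, back in Generator.generate, the filter flag is read via job.getBoolean(GENERATOR_FILTER, true). If I remember correctly, that constant maps to the generate.filter property (and GENERATOR_NORMALISE to generate.normalise), so generate-time filtering can be switched off in conf/nutch-site.xml, e.g.:

    <property>
      <name>generate.filter</name>
      <value>false</value>
    </property>

    With generate.filter set to false, the map function below would skip the URLFilters check entirely.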

    Tracing into Selector.java,

    the map function looks like this:

    /** Select & invert subset due for fetch. */
    public void map(Text key, CrawlDatum value,
        OutputCollector<FloatWritable,SelectorEntry> output, Reporter reporter)
        throws IOException {
      Text url = key;
      if (filter) {
        // If filtering is on don't generate URLs that don't pass
        // URLFilters
        try {
          if (filters.filter(url.toString()) == null) return;
        } catch (URLFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
          }
        }
      }
      CrawlDatum crawlDatum = value;

      // check fetch schedule
      if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
        LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
            + crawlDatum.getFetchTime() + ", curTime=" + curTime);
        return;
      }

      LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData().get(
          Nutch.WRITABLE_GENERATE_TIME_KEY);
      if (oldGenTime != null) { // awaiting fetch & update
        if (oldGenTime.get() + genDelay > curTime) // still wait for update
          return;
      }
      float sort = 1.0f;
      try {
        sort = scfilters.generatorSortValue(key, crawlDatum, sort);
      } catch (ScoringFilterException sfe) {
        if (LOG.isWarnEnabled()) {
          LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);
        }
      }

      if (restrictStatus != null
          && !restrictStatus.equalsIgnoreCase(CrawlDatum.getStatusName(crawlDatum.getStatus())))
        return;

      // consider only entries with a score superior to the threshold
      if (scoreThreshold != Float.NaN && sort < scoreThreshold) return;

      // consider only entries with a retry (or fetch) interval lower than threshold
      if (intervalThreshold != -1
          && crawlDatum.getFetchInterval() > intervalThreshold) return;

      // sort by decreasing score, using DecreasingFloatComparator
      sortValue.set(sort);
      // record generation time
      crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
      entry.datum = crawlDatum;
      entry.url = key;
      output.collect(sortValue, entry); // invert for sort by score
    }

    The key line is:

    filters.filter(url.toString())

    Looking at that filter function:

    public String filter(String urlString) throws URLFilterException {
      for (int i = 0; i < this.filters.length; i++) {
        if (urlString == null) return null;
        urlString = this.filters[i].filter(urlString);
      }
      return urlString;
    }
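    To make the chain semantics concrete, here is a minimal, self-contained Java sketch (the two toy filters are hypothetical, not real Nutch plugins) showing how a single null verdict vetoes a URL:

    // Toy stand-in for Nutch's URLFilter interface: return the URL to let it
    // through, or null to reject it.
    interface SimpleFilter {
      String filter(String url);
    }

    public class FilterChainDemo {
      public static void main(String[] args) {
        SimpleFilter[] filters = {
            url -> url.startsWith("http://") ? url : null, // only http URLs
            url -> url.contains("?") ? null : url          // no query strings
        };
        String[] tests = { "http://example.com/a.html",
                           "http://example.com/a?b=1",
                           "ftp://example.com/a" };
        for (String url : tests) {
          String result = url;
          for (SimpleFilter f : filters) {
            if (result == null) break; // one rejection is final
            result = f.filter(result);
          }
          System.out.println(url + " -> " + (result == null ? "REJECTED" : "ACCEPTED"));
        }
      }
    }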

    In other words:

    every URL must pass through all N active URL filters in turn; as soon as any one of them rejects it (returns null), the URL is rejected for good.

    These URL filters are, needless to say, plugins again.

    Opening a second front, let's look at the filter code. My conf/nutch-site.xml defines the plugins like this:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(domain|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|htmlparsefilter-youku</value>
    </property>

    There are two URL filters here: urlfilter-domain and urlfilter-regex. Let's focus on urlfilter-regex!

    The code for this plugin lives under $...nutch-1.7/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex

    in the file RegexURLFilter.java.

    So our task is to study this class's filter function.

    The class itself doesn't implement filter, so we look at its parent class:

    import org.apache.nutch.urlfilter.api.RegexURLFilterBase;

    Where does this class live?

    src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java

    The most important function is filter:

    public String filter(String url) {
      for (RegexRule rule : rules) {
        if (rule.match(url)) {
          return rule.accept() ? url : null;
        }
      }
      return null;
    }
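    A minimal standalone sketch of the same first-match logic (RuleEntry is a hypothetical stand-in for Nutch's RegexRule; if memory serves, the java.util.regex-based plugin matches with find(), which is why the stock rules anchor with ^):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class FirstMatchDemo {
      // Hypothetical stand-in for RegexRule: a sign ('+'/'-') plus a pattern.
      static class RuleEntry {
        final boolean accept;
        final Pattern pattern;
        RuleEntry(boolean accept, String regex) {
          this.accept = accept;
          this.pattern = Pattern.compile(regex);
        }
      }

      // Mirrors RegexURLFilterBase.filter: the FIRST matching rule decides.
      static String filter(List<RuleEntry> rules, String url) {
        for (RuleEntry rule : rules) {
          if (rule.pattern.matcher(url).find()) {
            return rule.accept ? url : null;
          }
        }
        return null; // no rule matched: reject by default
      }

      public static void main(String[] args) {
        List<RuleEntry> rules = new ArrayList<>();
        rules.add(new RuleEntry(false, "^(file|ftp|mailto):")); // "-^(file|ftp|mailto):"
        rules.add(new RuleEntry(true, "."));                    // "+."
        System.out.println(filter(rules, "http://example.com/")); // accepted
        System.out.println(filter(rules, "mailto:a@b.com"));      // null (rejected)
      }
    }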

    Conclusion:

    given a URL, it is matched against the regular expressions in conf/regex-urlfilter.txt one by one, in file order.

    The first rule that matches decides the outcome: if that rule's first character is '+', the URL passes; if it is '-', the URL is rejected. And if no rule matches at all, the URL is rejected too, per the final return null.
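    For illustration, a cut-down regex-urlfilter.txt in the stock format (order matters, since the first match wins):

    # reject file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # reject URLs containing characters that usually indicate queries or sessions
    -[?*!@=]
    # accept anything else
    +.

    So if query-string URLs keep getting dropped, the -[?*!@=] line has to be removed or moved below a more specific + rule.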

    That wraps up the analysis of the regex filtering rules!
