Reading the source: when Solr uses the filterCache

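It is well known that Solr has four caches: queryResultCache, documentCache, filterCache and fieldValueCache. This post takes a close look at filterCache. It is usually described as the cache for the doc ids matched by an fq: once all the doc ids for the query behind an fq have been found, the result is cached for later reuse, which saves further I/O. To reach a more accurate conclusion I read the code again carefully, set up a Solr node with the 4.10.4 version we use at work, and ran a number of experiments. I think I now have a reasonable grasp of when filterCache is used.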

When is filterCache used? Start with SolrIndexSearcher.getDocListC and the case where a cache is hit (the cache here means queryResultCache). Here is the code:

// the query result cache may be consulted and/or filled
if (queryResultCache != null && cmd.getFilter() == null && (flags & (NO_CHECK_QCACHE | NO_SET_QCACHE)) != ((NO_CHECK_QCACHE | NO_SET_QCACHE))) {
	key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags); // build the key used to look up queryResultCache
	if ((flags & NO_CHECK_QCACHE) == 0) { // check again that reading the cache is allowed
		superset = queryResultCache.get(key); // look up queryResultCache
		if (superset != null) { // cache hit
			if ((flags & GET_SCORES) == 0 || superset.hasScores()) { // this request needs no scores, or the cached entry has scores
				// take the slice this request needs from the cached list, based on start and rows;
				// there is a result if the cached list covers start + rows, and none otherwise
				out.docList = superset.subset(cmd.getOffset(), cmd.getLen());
			}
		}
		if (out.docList != null) { // we got a result from the cache
			// The key is flags: (flags & GET_DOCSET) != 0 holds when faceting is used,
			// i.e. faceting requires the docSet to be returned as well
			if (out.docSet == null && ((flags & GET_DOCSET) != 0)) {
				if (cmd.getFilterList() == null) { // this request has no fq
					// getDocSet(Query) returns the docSet of the query parsed from q: it checks filterCache first and,
					// on a miss, searches Lucene and puts the result into filterCache. So filterCache also holds the docSet of q.
					out.docSet = getDocSet(cmd.getQuery());
				} else { // this request has fq
					List<Query> newList = new ArrayList<>(cmd.getFilterList().size() + 1);
					newList.add(cmd.getQuery());
					newList.addAll(cmd.getFilterList());
					// this overload also reads docSets from filterCache (via getPositiveDocSet) and then intersects or
					// subtracts them; afterwards the docSets of q and of every fq query are all in filterCache
					out.docSet = getDocSet(newList);
				}
			}
			return;
		}
	}
	// (rest of this block omitted)
}

(First, how I found that faceting sets the GET_DOCSET flag: org.apache.solr.search.SolrIndexSearcher.QueryCommand.setNeedDocSet(boolean) sets the GET_DOCSET bit in flags, so that (flags & GET_DOCSET) != 0. That method is called from org.apache.solr.handler.component.ResponseBuilder.getQueryCommand(), with org.apache.solr.handler.component.ResponseBuilder.isNeedDocSet() as its argument. The matching setter, org.apache.solr.handler.component.ResponseBuilder.setNeedDocSet(boolean), is called from org.apache.solr.handler.component.FacetComponent.prepare(ResponseBuilder) with true. So whenever faceting is turned on, (flags & GET_DOCSET) != 0 holds.)
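To make the flag check concrete, here is a minimal sketch of how a single bit in an int bitmask is set, cleared and tested. FlagDemo is a hypothetical class written for this post, and the constant values are illustrative assumptions rather than values copied from SolrIndexSearcher; only the bit-twiddling pattern is the point.

```java
// Illustrative sketch only: FlagDemo is a hypothetical class, not Solr code.
final class FlagDemo {
    // Assumed values for illustration; what matters is that each flag has its own bit.
    static final int GET_SCORES = 0x01;
    static final int GET_DOCSET = 0x40000000;

    // Same idea as a setNeedDocSet(boolean) setter: set or clear exactly one bit.
    static int setNeedDocSet(int flags, boolean needDocSet) {
        return needDocSet ? (flags | GET_DOCSET) : (flags & ~GET_DOCSET);
    }

    // The test used in getDocListC: is the GET_DOCSET bit present?
    static boolean needsDocSet(int flags) {
        return (flags & GET_DOCSET) != 0;
    }

    public static void main(String[] args) {
        int flags = GET_SCORES;                    // a request that only wants scores
        System.out.println(needsDocSet(flags));    // false
        flags = setNeedDocSet(flags, true);        // what turning faceting on amounts to
        System.out.println(needsDocSet(flags));    // true
        System.out.println((flags & GET_SCORES) != 0); // true, other bits are untouched
    }
}
```

Note that the parentheses in (flags & GET_DOCSET) != 0 are required in Java, because != binds more tightly than &.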

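The code above shows that when the cache is hit and faceting is on, getDocSet is called, with either a single Query or a List<Query> as its argument, to obtain all the doc ids that faceting needs. The single-argument getDocSet looks up the docSet in filterCache; on a miss it calls getDocSetNC (NC stands for "not cached") to search the Lucene index and then puts the result into filterCache. At that point the docSet of the q query is in filterCache. The List<Query> variant also reads from filterCache, but it looks up each query separately. (The implementation is getProcessedFilter, which calls getPositiveDocSet to fetch each docSet from filterCache and then intersects or subtracts them; the docSets of all the queries passed as its second argument end up in filterCache.) At that point the docSets of every fq and of q are all in filterCache. So when the cache is hit (again, the cache here is queryResultCache) and faceting is on, docSets are looked up in filterCache, and the docSets of the queries built from every fq and from q are put into filterCache. (This also suggests that the name filterCache is not quite accurate, because the docSet of q goes in as well.)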
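The lookup pattern described above can be sketched roughly as follows. This is a simplified model and not Solr's implementation: FilterCacheSketch is a hypothetical class, queries are plain strings, docSets are java.util.BitSet, and the indexSearch function stands in for getDocSetNC. It assumes every query is known to the fake index and ignores negative queries.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of "check the cache, search on a miss, then cache"; not Solr code.
final class FilterCacheSketch {
    private final Map<String, BitSet> filterCache = new HashMap<>();
    private final Function<String, BitSet> indexSearch; // stand-in for getDocSetNC
    int indexHits = 0; // how many times the "index" was really searched

    FilterCacheSketch(Function<String, BitSet> indexSearch) {
        this.indexSearch = indexSearch;
    }

    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) b.set(id);
        return b;
    }

    // Cache lookup first; on a miss, search the index and store the result.
    private BitSet lookup(String query) {
        BitSet cached = filterCache.get(query);
        if (cached != null) return cached;
        indexHits++;
        BitSet computed = indexSearch.apply(query);
        filterCache.put(query, computed);
        return computed;
    }

    // Single-query variant; returns a copy so callers cannot modify the cached entry.
    BitSet getDocSet(String query) {
        return (BitSet) lookup(query).clone();
    }

    // List variant: each query is resolved and cached separately, then intersected.
    // Returns null for an empty list.
    BitSet getDocSet(List<String> queries) {
        BitSet result = null;
        for (String q : queries) {
            BitSet ds = lookup(q);
            if (result == null) result = (BitSet) ds.clone();
            else result.and(ds);
        }
        return result;
    }

    int cachedEntries() {
        return filterCache.size();
    }

    public static void main(String[] args) {
        Map<String, BitSet> fakeIndex = new HashMap<>();
        fakeIndex.put("type:book", bits(1, 2, 3, 5));   // plays the role of q
        fakeIndex.put("inStock:true", bits(2, 3, 4));   // plays the role of an fq
        FilterCacheSketch searcher = new FilterCacheSketch(fakeIndex::get);

        BitSet both = searcher.getDocSet(Arrays.asList("type:book", "inStock:true"));
        System.out.println(both);               // {2, 3}
        searcher.getDocSet("type:book");        // served from the cache this time
        System.out.println(searcher.indexHits); // 2
    }
}
```

In this model the entry for "type:book" is cached just like the entry for the filter, which is the same observation as above: the cache is keyed by query, whether the query came from q or from fq.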

What if the cache is not hit? The relevant code is another part of SolrIndexSearcher.getDocListC:

if (useFilterCache) { // ignore this branch for now, it is explained separately later
	// now actually use the filter cache.
	// for large filters that match few documents, this may be
	// slower than simply re-executing the query.
	if (out.docSet == null) {
		out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
		DocSet bigFilt = getDocSet(cmd.getFilterList());
		if (bigFilt != null)
			out.docSet = out.docSet.intersection(bigFilt);
	}
	// todo: there could be a sortDocSet that could take a list of
	// the filters instead of anding them first...
	// perhaps there should be a multi-docset-iterator
	sortDocSet(qr, cmd);
} else {
	// do it the normal way... that is, search Lucene.
	if ((flags & GET_DOCSET) != 0) { // again check GET_DOCSET: as shown above it is set when faceting, and not otherwise
		// this currently conflates returning the docset for the base query vs the base query and all filters.
		// getDocListAndSetNC also calls getProcessedFilter with all the fq queries as its second argument,
		// so the docSets of all fq are put into filterCache.
		DocSet qDocSet = getDocListAndSetNC(qr, cmd);
		// the docSet of the query itself (without filters) is also put into filterCache,
		// because the docSet obtained here corresponds to the query
		if (qDocSet != null && filterCache != null && !qr.isPartialResults())
			filterCache.put(cmd.getQuery(), qDocSet);
	} else {
		// without faceting, fq still goes through getProcessedFilter: filterCache is checked first and,
		// on a miss, Lucene is searched and the result is put into filterCache
		getDocListNC(qr, cmd);
	}
}

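Both methods above, getDocListAndSetNC and getDocListNC, call getProcessedFilter with the queries that the fq parameters represent, and the result is the intersection of all fq. So for fq, filterCache is used to combine the per-fq doc id lists even when faceting is off. In other words, when queryResultCache is missed, filterCache is used whether or not faceting is on, to merge the fq results. With faceting on, however, the docSet of q is still obtained by searching Lucene first (because the cache was missed).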

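From the code above, covering both the cache-hit and the cache-miss case, we can draw one conclusion: filterCache serves two purposes. The first is merging doc id lists, which implements the intersection of multiple fq. The second is providing the docSet needed for faceting. Put more abstractly, filterCache stores the docSet of a query, and the query does not have to be an fq: the docSet of q is stored there too.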

filterCache actually has one more use: the if (useFilterCache) branch in the code above. The logic is simple. Here is the code:

boolean useFilterCache = false;
// this request does not need scores, useFilterForSortedQuery=true is set in solrconfig,
// the request has a sort, and filterCache is not null
if ((flags & (GET_SCORES | NO_CHECK_FILTERCACHE)) == 0 && useFilterForSortedQuery && cmd.getSort() != null && filterCache != null) {
	useFilterCache = true;
	SortField[] sfields = cmd.getSort().getSort();
	for (SortField sf : sfields) {
		if (sf.getType() == SortField.Type.SCORE) { // any sort field that uses score rules out the filter cache
			useFilterCache = false;
			break;
		}
	}
}
if (useFilterCache) { // the code below answers the request using filterCache
	if (out.docSet == null) {
		// docSet of the query restricted by cmd.getFilter()
		// (note: cmd.getFilter() is not fq; fq is cmd.getFilterList())
		out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
		DocSet bigFilt = getDocSet(cmd.getFilterList()); // looked up in filterCache; on a miss, searched in Lucene and then cached
		if (bigFilt != null)
			out.docSet = out.docSet.intersection(bigFilt); // intersect the two
	}
	sortDocSet(qr, cmd); // sort the resulting set
} else { /* same as above, omitted */ }

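Why the separate emphasis on not using scores? The reason is simple: sorting by score may need tf, position information, or payloads, and filterCache has none of these. It holds only doc ids, so filterCache cannot be used when score is involved. If the sort does not use score, that is, it sorts by a field or a function, the values can be looked up in FieldCache by doc id, and the ids that filterCache provides are enough.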
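As a rough illustration of why doc ids alone are enough for a non-score sort, the sketch below sorts a set of doc ids by a per-document field value and then takes the requested page. It is a simplified model, not Solr's sortDocSet: SortDocSetSketch is a hypothetical class, the long[] array indexed by doc id is an assumed stand-in for a FieldCache-style lookup, and ties are broken by doc id as an arbitrary choice.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: sort cached doc ids by a field value; no scores are needed.
final class SortDocSetSketch {
    // fieldValues[docId] plays the role of a per-document value lookup.
    static int[] sortByField(BitSet docSet, long[] fieldValues, int offset, int len) {
        Integer[] ids = docSet.stream().boxed().toArray(Integer[]::new);
        // Ascending by field value; ties broken by doc id (an assumption of this sketch).
        Arrays.sort(ids, (a, b) -> {
            int c = Long.compare(fieldValues[a], fieldValues[b]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        // Apply start/rows style paging to the sorted ids.
        int from = Math.min(offset, ids.length);
        int to = Math.min(from + len, ids.length);
        int[] page = new int[to - from];
        for (int i = from; i < to; i++) page[i - from] = ids[i];
        return page;
    }

    public static void main(String[] args) {
        BitSet docSet = new BitSet();
        for (int id : new int[] {0, 2, 3, 5}) docSet.set(id); // ids as a filter cache would supply them
        long[] price = {30, 99, 10, 20, 99, 10};              // value per doc id
        System.out.println(Arrays.toString(sortByField(docSet, price, 0, 10))); // [2, 5, 3, 0]
        System.out.println(Arrays.toString(sortByField(docSet, price, 1, 2)));  // [5, 3]
    }
}
```

Nothing in this sort depends on how well a document matched the query, which is why the branch is only taken when no sort field is of type SCORE.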

So, to sum up, besides the uses described earlier, filterCache can also serve requests that sort without scores, although this path is probably rarely taken.
