Basic Collecting

Simon Proctor edited this page Oct 1, 2013 · 5 revisions

Lucene lets you change how search results are collected. This has a number of uses, and we'll use a real-world example here to describe the current API.

In a number of news- and blog-related applications we've worked on, we have needed to provide summary views and dynamic navigation based on the data currently present. Depending on the project, we don't always have access to a simple relational database and have to aggregate the data differently. One example is Sitecore, where we had to generate the site navigation from the news story data available: if there were no stories for a month or year, that month or year wasn't shown in the navigation.

The sample below, taken from the test suite, is a simple version of this. It shows how to count the documents that match a search term and group them according to their day of the week. This is very similar to an aggregate function in SQL, but with the power of Lucene:

		IQueryBuilder queryBuilder = new QueryBuilder();
		queryBuilder.Setup
			(
				x => x.WildCard(BBCFields.Description, "food"),
				// Explicit dates avoid culture-dependent parsing of "dd/MM/yyyy" strings
				x => x.Filter(DateRangeFilter.Filter(BBCFields.PublishDateObject, new DateTime(2013, 2, 1), new DateTime(2013, 2, 28)))
			);

		DateCollector collector = new DateCollector();
		luceneSearch.Collect(queryBuilder.Build(), collector);

		Assert.Greater(collector.DailyCount.Keys.Count, 0);
		foreach (String day in collector.DailyCount.Keys)
		{
			Console.Error.WriteLine("Day: {0} had {1} documents", day, collector.DailyCount[day]);
		}

		Console.WriteLine();

Here we search for all documents that contain 'food' in the description field and use a filter to restrict the date range to February 2013. The collector then counts the matching documents by day of the week, so we find how many documents mentioned food on each weekday during that month.
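The loop above prints the days in whatever order the dictionary enumerates them. As a minimal sketch, assuming only that the keys are the names produced by `DayOfWeek.ToString()`, the counts could be reported in week order instead. `DailyCountReport` is a hypothetical helper and is not part of the test suite.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the library or its test suite.
public static class DailyCountReport
{
    // Returns one line per weekday, Sunday through Saturday,
    // treating days missing from the dictionary as zero.
    public static List<string> InWeekOrder(IDictionary<string, int> dailyCount)
    {
        var lines = new List<string>();
        foreach (DayOfWeek day in Enum.GetValues(typeof(DayOfWeek)))
        {
            int count;
            if (!dailyCount.TryGetValue(day.ToString(), out count))
            {
                count = 0;
            }
            lines.Add(String.Format("Day: {0} had {1} documents", day, count));
        }
        return lines;
    }
}
```

Calling `DailyCountReport.InWeekOrder(collector.DailyCount)` after the search would give seven lines, including days on which nothing matched.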

The collector, also included in the test suite, looks like this:

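using System;
using System.Collections.Generic;
using System.Globalization;
using Lucene.Net.Index;
using Lucene.Net.Search;

public class DateCollector : Collector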
{
	public int Count { get; private set; }

	private String[] dates;

	public Dictionary<String, int> DailyCount { get; set; }

	public DateCollector()
	{
		// Placeholder only; SetNextReader replaces this with the cached field values
		dates = new String[10];
		DailyCount = new Dictionary<String, int>();
	}

	public void Reset()
	{
		// Clear both totals so the collector can be reused for another search
		Count = 0;
		DailyCount.Clear();
	}

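	/// <summary>
	/// Called once for each matching document; counts it under the day of the week it was published.
	/// </summary>
	/// <param name="docId">The document id, relative to the reader passed to SetNextReader</param>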
	public override void Collect(int docId)
	{
		Count = Count + 1;

		// Stored values look like "20130220-060258-38"
		String temp = dates[docId];
		DateTime date = DateTime.ParseExact(temp, "yyyyMMdd-HHmmss-ff", CultureInfo.InvariantCulture);
		String day = date.DayOfWeek.ToString();

		if (!DailyCount.ContainsKey(day))
		{
			DailyCount[day] = 1;
		}
		else
		{
			DailyCount[day]++;
		}
	}

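	// Scores are not needed for counting, so the scorer is ignored
	public override void SetScorer(Scorer scorer) { }

	// Called for each index segment; loads that segment's publish dates from the field cache
	public override void SetNextReader(IndexReader reader, int docBase)
	{
		dates = FieldCache_Fields.DEFAULT.GetStrings(reader, BBCFields.PublishDateString);
	}

	// Counting does not depend on the order in which documents arrive
	public override bool AcceptsDocsOutOfOrder
	{
		get { return true; }
	}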
}

This is a very simple implementation that relies on the DateTime being stored in a particular format, so it is specific to this index and our use case. However, a collector like this is trivial to implement.
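Because the collector depends on that stored format, a missing or malformed value makes `DateTime.ParseExact` throw part-way through a search. The following is a minimal sketch of a more defensive parse-and-count step, assuming the same `yyyyMMdd-HHmmss-ff` format. `DayOfWeekTally` and its `Skipped` counter are hypothetical names used for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper showing the parse-and-count step in isolation.
public class DayOfWeekTally
{
    // Assumed to match the format the index stores, e.g. "20130220-060258-38"
    private const string StoredFormat = "yyyyMMdd-HHmmss-ff";

    public Dictionary<string, int> DailyCount { get; private set; }

    // Number of values that were null or did not match the format
    public int Skipped { get; private set; }

    public DayOfWeekTally()
    {
        DailyCount = new Dictionary<string, int>();
    }

    // Adds one stored date string to the tally; returns false (and counts
    // the value as skipped) when it is null or does not match the format.
    public bool Add(string stored)
    {
        DateTime date;
        if (!DateTime.TryParseExact(stored, StoredFormat, CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out date))
        {
            Skipped++;
            return false;
        }

        string day = date.DayOfWeek.ToString();
        int current;
        DailyCount.TryGetValue(day, out current); // current stays 0 when the day is absent
        DailyCount[day] = current + 1;
        return true;
    }
}
```

A collector could hold one of these and call `Add(dates[docId])` from `Collect`, so that a document without a publish date is skipped rather than aborting the whole search.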