Hack 20 Index and Search Local PDF Collections on Windows

Teach Windows XP or 2000 how to search the full text of your PDFs along with your other documents. Or use Adobe Reader to search PDFs only.

Search is essential for getting value from document archives, and it can find things where you might not have thought to look. The problem is that, by default, Windows search doesn't know how to read PDF files. We present a couple of solutions.

2.7.1 Search PDF with Adobe Reader

The free Adobe Reader 6.0 provides the easiest solution. It enables you to perform searches across your entire PDF collection (Edit → Search). Its detailed query results include links to individual PDF pages and snippets of the text surrounding your query, as shown in Figure 2-5. Its Fast Find setting, enabled by default, caches the results of your searches, so subsequent searches go much faster. View or change the Reader search preferences by selecting Edit → Preferences → Search.

Figure 2-5. Collection search results in Reader linking directly into the documents

The downside to Adobe Reader search is that it searches PDF documents only.

2.7.2 Index and Search PDF with Windows XP and 2000

It makes sense to search across all file types from a single interface. Newer versions of Windows let you extend their built-in search feature to include PDF documents. With Windows 2000, all you need to do is install the freely available PDF IFilter from Adobe. With Windows XP, you must also apply a couple of workarounds. In both cases, you can use the Windows Indexing Service to speed up searches.

The Windows Indexing Service is powerful but needs to be configured for best performance. The next section introduces you to the Indexing Service. We then discuss installing and troubleshooting Adobe's PDF IFilter.

2.7.3 Windows Indexing Service: Installation, Configuration, and Documentation

You don't need Indexing Service to search your computer, but it can be handy. Queries run much faster, and you can use advanced search features such as Boolean operators (e.g., AND, OR, and NOT), metadata searches (e.g., @DocTitle Contains "pdf"), and pattern matching. The downside is that the Indexing Service always runs in the background, using resources to index new or updated documents. A little configuration ensures that you get the best performance.
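To make the query syntax concrete, here is a minimal Python sketch that assembles query strings in the style shown above. The function names `field_contains` and `combine` are our own inventions for illustration, not part of Windows or the Indexing Service. The sketch only builds text that you would paste into a query form; it does not talk to the Indexing Service itself.

```python
def field_contains(field, phrase):
    """Build a metadata clause in the form used in this hack,
    for example: @DocTitle Contains "pdf"
    (hypothetical helper, for illustration only)."""
    if '"' in phrase:
        raise ValueError("phrase must not contain double quotes")
    return '@{0} Contains "{1}"'.format(field, phrase)


def combine(operator, *clauses):
    """Join clauses with a Boolean operator (AND or OR).
    Hypothetical helper; it performs no validation of the clauses."""
    operator = operator.upper()
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return " {0} ".format(operator).join(clauses)


if __name__ == "__main__":
    # A title match combined with a plain full-text term.
    print(combine("AND", field_contains("DocTitle", "earnings"), "quarterly"))
```

Check the Indexing Service documentation, described later in this section, before relying on any particular combination of operators.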

First off, do you have Indexing Service? If not, how do you install it? Both questions are answered in the Windows Components Wizard window. In Windows XP or 2000, open this wizard by selecting Start → Settings → Control Panel → Add or Remove Programs and clicking the Add/Remove Windows Components button on the left. Find the Indexing Service component and place a check in its box, if it is empty, as shown in Figure 2-6. Click Next and proceed through the wizard.

Figure 2-6. Adding the Indexing Service component to XP or 2000

Access Indexing Service configuration and documentation from the Computer Management window, shown in Figure 2-7. Right-click My Computer and select Manage. In the left pane, expand Services and Applications and then Indexing Service.

Figure 2-7. The Computer Management window, where you configure the Indexing Service

Sometimes you must stop or start the Indexing Service. Right-click the Indexing Service node and select Stop or Start from the context menu.

Under the Indexing Service node you'll find index catalogs, such as System. Add, delete, and configure these catalogs so that they index only the directories you need. For details on how to do this, I highly recommend the documentation under Help → Help Topics → Indexing Service. This documentation also details the advanced query language.

You can fine-tune your Indexing Service with the registry entries located at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex. These are documented at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/ixrefreg_192r.asp.

You can still search the directories you do not index by selecting Start → Search → For Files or Folders, so don't feel compelled to index your entire computer.

Before installing the PDF IFilter, create a special catalog for testing purposes. Put a few PDFs in its directory. Disable indexing on all other catalog directories by double-clicking these directories and selecting "Include in Index? No." This will simplify testing because indexing many documents can take a long time.

Download our indexing test PDF from http://www.pdfhacks.com/ifilter/. During testing, search this PDF for guidelines.
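If you would rather script the setup of your test directory, the following Python sketch copies a handful of PDFs into it. The function name `copy_sample_pdfs` and the default limit of five files are assumptions made for illustration. You still need to create the catalog and add the directory to it by hand in the Computer Management window.

```python
import shutil
from pathlib import Path


def copy_sample_pdfs(source_dir, test_dir, limit=5):
    """Copy up to `limit` PDFs from source_dir into test_dir.
    Returns the names of the files that were copied.
    (Hypothetical helper; the limit of 5 is an arbitrary choice.)"""
    test_path = Path(test_dir)
    test_path.mkdir(parents=True, exist_ok=True)

    # Match the .pdf extension regardless of case.
    pdfs = sorted(
        p for p in Path(source_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

    copied = []
    for pdf in pdfs[:limit]:
        shutil.copy2(pdf, test_path / pdf.name)
        copied.append(pdf.name)
    return copied
```

Keeping the test set small means each full rescan finishes quickly.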

2.7.4 Prepare to Install PDF IFilter 5.0

On Windows XP and 2000, you have two kinds of searches: indexed and unindexed. An indexed search relies on the Indexing Service, as we have discussed. An unindexed search takes a brute-force approach, scanning all files for your queried text, as shown in Figure 2-8. In both cases, the system uses filters to handle the numerous file types. These filters use the IFilter API to interface with the system.

Figure 2-8. An unindexed search

A PDF IFilter is freely available from Adobe. Visit http://www.adobe.com/support/salesdocs/1043a.htm and download ifilter50.exe. Adobe's web page states that this PDF IFilter works only on servers. In fact, it works on XP Home Edition, too.

If you run Windows 2000, you can install the PDF IFilter and it will work for both indexed and unindexed PDF searching.

If you run Windows XP Home Edition and install the PDF IFilter (Version 5.0), you might need to disable the PDF IFilter for unindexed PDF searches. Unindexed searching of PDFs on XP Home Edition with the PDF IFilter can leave open file handles lying around, which can prevent you from moving or deleting those PDFs until you reboot. Visit http://www.pdfhacks.com/ifilter/ and download PDFFilt_FileHandleLeakFix.reg. We will use it in our installation instructions, later in this hack. This registry hack ensures that only the Indexing Service uses the PDF IFilter. After you apply this hack, PDFs will be treated like plain-text files during unindexed searches. You can undo this registry hack with PDFFilt_FileHandleLeakFix.uninstall.reg.

Unindexed searching of PDFs on XP with the PDF IFilter can leave open file handles lying around.

If you perform an unindexed search in a folder of PDFs and then find you can't move or delete these PDFs, you have open file handles. Reboot Windows to close them.

Download Process Explorer from http://www.sysinternals.com and follow the explorer.exe process to see these open file handles. Use our PDFFilt_FileHandleLeakFix.reg registry hack as a workaround, as we describe next.
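For a quick check without Process Explorer, a short script can test whether the PDFs in a folder appear to be locked. The Python sketch below assumes that Windows refuses to rename a file while another process holds it open without sharing delete access, which matches the "can't move or delete" symptom described above. We have not confirmed that every leaked handle behaves this way, so treat an empty result as a hint rather than proof. The function name `find_locked_pdfs` is ours.

```python
import os
from pathlib import Path


def find_locked_pdfs(folder):
    """Return the names of PDFs in `folder` that cannot be renamed.
    Renaming a file to its own name changes nothing on disk, but on
    Windows it is expected to fail while another process holds the
    file open (an assumption; see the text above)."""
    locked = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            os.rename(pdf, pdf)
        except OSError:
            locked.append(pdf.name)
    return locked
```

Run it against a folder right after an unindexed search. Any names it returns are candidates for the handle leak.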

2.7.5 Install and Troubleshoot Adobe PDF IFilter 5.0

On XP, installing the PDF IFilter might require a couple of registry hacks. First we'll install it, then we'll troubleshoot.

  1. In the Computer Management window (right-click My Computer and select Manage), right-click Services and Applications → Indexing Service and select Stop.

  2. Run the Adobe PDF IFilter installer through to completion.

  3. Windows XP Home users: install PDFFilt_FileHandleLeakFix.reg by double-clicking it and selecting Yes to confirm installation. (If you need to undo this registry hack, run PDFFilt_FileHandleLeakFix.uninstall.reg.)

  4. Start Indexing Service back up again (right-click Services and Applications → Indexing Service and select Start).

  5. Rescan your test catalog. Do this by selecting the catalog's Directories node, right-clicking your test directory, and selecting All Tasks → Rescan (Full).

  6. Wait for the rescan to complete.

Follow the Indexing Service's progress by selecting Services and Applications → Indexing Service in the Computer Management window. Watch the pane on the right. The service is done indexing a catalog when Docs to Index goes to zero.

If a PDF is open in Acrobat, it won't get indexed. Be sure your test document is closed.

To test your index, don't select Start → Search. Instead, in the Computer Management window, select the Query Catalog node listed under your test catalog. Submit a few queries that would work only on the full text of your PDFs. Avoid using document headings or titles. Did it work? If so, you're done! If you get no results, as shown in Figure 2-9, work through the next section, which explains a common workaround for Windows XP.

Figure 2-9. Testing your index with negative results

PDF IFilter doesn't work with XP Indexing Service: workaround

PDF IFilter and Indexing Service don't see eye to eye on Windows XP. If queries against indexed PDFs return no results, give this a try:

  1. In the Computer Management window (right-click My Computer and select Manage), right-click Services and Applications → Indexing Service and select Stop.

  2. Open the Registry Editor (Start → Run..., type regedit in the Open: box, and click OK).

  3. Select HKEY_CLASSES_ROOT and then search for pdffilt.dll in the registry data (Edit → Find..., Find what: pdffilt.dll, Look at: Data, Find Next).

  4. You should hit upon an InprocServer32 key that references pdffilt.dll and specifies its ThreadingModel. Double-click the ThreadingModel value and change it from Apartment to Both.

  5. Select HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex and double-click the DLLsToRegister value to edit it.

  6. In the list of DLLs, delete the following line:

    C:\Program Files\Adobe\PDF IFilter 5.0\PDFFilt.dll

  7. Click OK, and then close the Registry Editor.

  8. Start the Indexing Service back up (right-click Services and Applications → Indexing Service and select Start).

  9. Rescan your test catalog. Do this by opening the catalog's Directories node, right-clicking your test directory, and selecting All Tasks → Rescan (Full).

  10. Wait for the rescan to complete.

Your test query should now work, as shown in Figure 2-10.

Figure 2-10. PDF indexed search success

Adobe documents this workaround on its web site at http://www.adobe.com/support/techdocs/333ae.htm.
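The DLLsToRegister edit in the workaround above can also be scripted. The following Python sketch uses the standard winreg module and assumes that DLLsToRegister is stored as a multi-string value, which winreg returns as a list of strings. The function names are ours, and the DLL path is the default install location given in the steps above. This is a sketch, not a tested installer: stop the Indexing Service first, back up your registry, and run it with administrator rights. It does not touch the ThreadingModel value, which you still change by hand.

```python
import sys

CONTENT_INDEX_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex"
PDF_FILTER_DLL = r"C:\Program Files\Adobe\PDF IFilter 5.0\PDFFilt.dll"


def without_dll(entries, dll_path=PDF_FILTER_DLL):
    """Return `entries` minus any line naming dll_path.
    The comparison ignores case and surrounding whitespace."""
    target = dll_path.casefold()
    return [e for e in entries if e.strip().casefold() != target]


def remove_pdf_filter_from_registry():
    """Rewrite DLLsToRegister without the PDF IFilter line.
    Windows only. Assumes the value is a multi-string list."""
    if sys.platform != "win32":
        raise RuntimeError("The registry is available only on Windows")
    import winreg  # imported here so the helper above works anywhere

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                        CONTENT_INDEX_KEY, 0, access) as key:
        entries, value_type = winreg.QueryValueEx(key, "DLLsToRegister")
        updated = without_dll(entries)
        if updated != entries:
            # Write back with the same type the value already had.
            winreg.SetValueEx(key, "DLLsToRegister", 0, value_type, updated)
    return updated
```

Separating the list filtering from the registry access lets you check the filtering logic on any machine before touching a real registry.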

2.7.6 Using Start Search For Files and Folders

When searching PDFs by selecting Start → Search → For Files and Folders, don't search for Documents. Search All Files and Folders instead. The Documents search overlooks PDFs.

If you indexed a specific folder instead of an entire drive, that folder (or one of its subfolders) must be given in the Look In: field when using Start → Search → For Files and Folders. Otherwise, the index won't be consulted; an unindexed search will be performed instead, even within the indexed folder. Set the Look In: field to a specific folder by clicking the drop-down box and selecting Browse..., as demonstrated in Figure 2-11.

Figure 2-11. Similar results produced by the traditional Start → Search → For Files and Folders interface when searching within indexed folders
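The Look In: rule above comes down to a path test: the folder you search must be the indexed folder or sit somewhere beneath it. Here is a small Python sketch of that test. The function name `uses_index` is ours, and the sketch only models the rule as this hack describes it; it does not ask Windows which folders are actually indexed.

```python
from pathlib import PureWindowsPath


def uses_index(look_in, indexed_folder):
    """Return True if `look_in` is the indexed folder or one of its
    subfolders. PureWindowsPath compares paths without regard to
    case, as Windows does, and works on any operating system."""
    look = PureWindowsPath(look_in)
    root = PureWindowsPath(indexed_folder)
    return look == root or root in look.parents


if __name__ == "__main__":
    # A subfolder of the indexed folder qualifies.
    print(uses_index(r"C:\Docs\PDF\2004", r"C:\Docs\PDF"))
    # A parent of the indexed folder does not.
    print(uses_index(r"C:\Docs", r"C:\Docs\PDF"))
```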

When searching within an indexed folder, you can use advanced search terms (e.g., @DocTitle Contains "earnings"). Consult the Indexing Service online documentation, described earlier, for details.

2.7.7 Searching PDF Using Windows 98 and NT System Tools

Using the older Windows search tool on PDF can still be useful, even if it doesn't access the full text of your document. If the PDF documents are not encrypted, their metadata (Title, Author, etc.) and bookmarks are visible to the search tool as plain text. PDF shortcut titles [Hack #17] also are searched.
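You can preview what a plain-text search tool is likely to see by scanning a PDF's raw bytes yourself. The Python sketch below looks for a few metadata keys followed by a simple parenthesized string. It is a rough approximation under several assumptions: the entries are stored uncompressed, they use plain literal strings, and the strings contain no nested or escaped parentheses. It is not a PDF parser, and the function name `visible_metadata` is ours. Because bookmark entries also carry a Title key, bookmark titles tend to show up alongside the document title.

```python
import re

# Matches, for example:  /Title (Quarterly Earnings)
# Strings with nested or escaped parentheses are skipped on purpose.
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Subject|Keywords)\s*\(([^()\\]*)\)"
)


def visible_metadata(pdf_bytes):
    """Collect plain-text metadata-like entries from raw PDF bytes.
    Returns a dict mapping each key to the list of strings found."""
    found = {}
    for match in INFO_ENTRY.finditer(pdf_bytes):
        key = match.group(1).decode("ascii")
        value = match.group(2).decode("latin-1")
        found.setdefault(key, []).append(value)
    return found


if __name__ == "__main__":
    # Made-up sample bytes, for illustration only.
    sample = b"<< /Title (Quarterly Earnings) /Author (A. Writer) >>"
    print(visible_metadata(sample))
```

To try it on a real file, read the file in binary mode and pass the bytes to `visible_metadata`.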