13.1. The Indexing Service

Windows Server 2003 includes Version 3.0 of the Indexing Service, which catalogs files stored on network drives, corporate intranets, and Internet sites, and provides a web-based query form for easy search and retrieval of those cataloged resources. The service is part of Internet Information Services (see Chapter 8 for a complete and detailed walkthrough of IIS).

Part of the power behind the Indexing Service is its ability to catalog documents without needing them converted to a special, proprietary format. The service understands most Microsoft Office file formats, including Word and Excel documents. This makes the service very useful, even beyond its basic premise of indexing plain web sites.

The Indexing Service works by identifying unique words within a document and establishing their location within that document, and then reporting that information back to a central database (the "index," as it were). You, as the administrator, can specify certain documents to either be indexed or be excluded from indexing. You also can include additional properties of a document in the catalog to expand the criteria on which your users can search, including title, author, creation date, date of last edit, and similar bits of information.

Of course, in some instances you might not want the Indexing Service installed. For example, on regular client workstations with no special needs, there would be no reason to have this service installed, only to occupy resources needlessly and present an additional security risk (that's not to say that the service is insecure, but that you should reduce the attack surface of a machine as much as possible). On fileservers, however, the Indexing Service adds value and provides a service for your user community.

You can install the Indexing Service through the Add/Remove Programs applet within the Control Panel. Click Add/Remove Windows Components, and then check the box next to Indexing Service and click Next.
That's all it takes to install it, a very easy process.

To confuse you further: by default, the Indexing Service is already installed, but it's just not started. If you open the Services console from the Administrative Tools menu and select Indexing Service, then set its startup type to Automatic and click Start, the service starts and functions properly even though it still doesn't show up as installed in Add/Remove Windows Components. However, the only way to fully uninstall the service is to uncheck the box within Add/Remove Programs and click Next.

13.1.1. How the Indexing Service Works

The Indexing Service uses filters to extract information from documents. The CiDaemon process, which is initiated by the Indexing Service, runs in the background and filters documents for later indexing. Filters are DLLs that extract words or property information from specific types of files, such as Word documents or HTML pages. The Indexing Service comes with a standard set of filters that can index text, HTML, Microsoft Office documents created in Versions 95, 97, 2000, XP, and 2003, and Internet Mail and News posts. Filters are extensible and can be created by third-party vendors for their specific data types.

After using filters to extract data, the service compares the filtered data against an exception list, which mainly contains commonly used prepositions, pronouns, articles, and other nonessential words. The exception list is called NOISE.<XXX>, where XXX represents the language of the document being indexed. Once the words that match entries on the exception list have been removed, the remaining data is moved to word lists, which are small, temporary, and volatile stores of index information that serve as holding bins.
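The filter-then-exception-list step described above can be illustrated with a short Python sketch. This is not the service's actual implementation: the noise-word set, the tokenizing rule, and the function name `build_word_list` are all assumptions made for illustration, and a real NOISE.<XXX> file contains far more entries.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a NOISE.<XXX> exception list (assumption:
# the real lists shipped with Windows are much longer).
NOISE_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "it", "is"}


def build_word_list(doc_id, text, noise_words=NOISE_WORDS):
    """Return {word: [(doc_id, position), ...]} for non-noise words.

    Positions count every word in the text, including noise words,
    so they still point at the right place in the original document.
    """
    word_list = defaultdict(list)
    words = re.findall(r"[a-z0-9']+", text.lower())
    for position, word in enumerate(words):
        if word in noise_words:
            continue  # dropped, as the exception list would drop it
        word_list[word].append((doc_id, position))
    return dict(word_list)


if __name__ == "__main__":
    entries = build_word_list(
        "report.doc", "The index of the report is in the catalog"
    )
    for word, hits in sorted(entries.items()):
        print(word, hits)
```

The resulting dictionary plays the role of a word list: a small in-memory structure that maps each remaining word to the places it occurs, ready to be merged into something more permanent.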
About once a day, a process called a shadow merge takes place to aggregate the information from word lists into shadow indexes and empty the "holding bin," both to free up memory occupied by volatile word lists and to make filtered data persistent by saving it on disk. Shadow indexes are created when word lists and other shadow indexes are combined into a single index.

At a separate time, the Indexing Service initiates master merges, in which individual shadow indexes are combined with the current master index to create a single new master index. The master index is a permanent index of a larger collection of documents. In a truer sense, the master index is the only index, containing pointers to resources within the corpus (a technical term for the body of work that is being indexed), much like the index of this book points you to certain words and phrases at specific points within the body. Picture a set of indexes, each for a certain chapter of this book. One could take these individual indexes (the "shadow indexes") and combine them into a master index, which would be placed at the back of this book: this is the process of master merging.

These indexes are stored in the catalog, a specific folder that contains all indexes, whether temporary word lists or more permanent shadow and master indexes.

Here are some additional terms you might run across while administering the Indexing Service:
13.1.2. Performance Considerations

Obviously, the single largest requirement of any indexing service is disk space: the service needs room to store its indexing files. Microsoft recommends that you allocate about 35% of the size of your corpus for the indexing service; I would allocate about 45%, simply to give your service room to grow. As more electronic information hits your disks, you'll want to have ample space to index that data optimally. Master merges typically require large amounts of disk space on a temporary basis, as much as 50% of the corpus size.

Memory is also an important consideration. Table 13-1 shows the Microsoft minimum memory amounts and my recommended memory amounts for certain corpus sizes. Keep in mind that these recommendations are in addition to the memory the machine already needs for Windows Server 2003's general use: add the amount of memory you have to the appropriate recommended amount from the table to obtain the correct total amount of memory for your machine.
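As a quick worked example of the disk-space rules of thumb in this section, the following Python sketch turns a corpus size into the three figures mentioned: Microsoft's 35% guideline, the more generous 45% suggestion, and the temporary space a master merge can need (up to 50%). The function name and the choice to report the three numbers separately rather than add them are assumptions; the text does not say whether the merge space is on top of the steady-state allocation.

```python
def index_disk_estimates(corpus_gb):
    """Estimate Indexing Service disk needs, in GB, for a corpus size in GB.

    Percentages come from the guidelines in the text; integer percents
    are used so the arithmetic stays exact for whole-number inputs.
    """
    return {
        "microsoft_guideline_gb": corpus_gb * 35 / 100,
        "suggested_allocation_gb": corpus_gb * 45 / 100,
        "master_merge_temporary_gb": corpus_gb * 50 / 100,
    }


if __name__ == "__main__":
    for name, size in index_disk_estimates(200).items():
        print(f"{name}: {size:.0f} GB")
```

For a 200 GB corpus, this works out to 70 GB under the Microsoft guideline, 90 GB under the 45% suggestion, and as much as 100 GB of temporary space during a master merge.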
Perhaps the greatest demand on your machine's CPU from the Indexing Service comes from master merges, which are very intensive and require large amounts of CPU time. Because of this, the Indexing Service schedules master merges automatically for midnight local time. However, if there is a better time when your machine's CPU load is low, you can change the time at which master merges will begin by doing some Registry editing. The MasterMergeTime value, located in HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\ContentIndex, allows you to specify the number of minutes after midnight local time that the master merge should commence. For this value, you can enter any number between 0 and 1439.
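Because MasterMergeTime is expressed as minutes after midnight rather than as a clock time, a small conversion helps avoid arithmetic slips. The Python sketch below only computes and validates the number; it does not touch the Registry, and the function name is a hypothetical helper, not part of any Windows tool.

```python
def master_merge_minutes(clock_time):
    """Convert a 24-hour 'HH:MM' local time to minutes after midnight.

    The result is the kind of number the MasterMergeTime value expects
    (0 through 1439). Raises ValueError for times outside 00:00-23:59.
    """
    hours_text, minutes_text = clock_time.split(":")
    hours, minutes = int(hours_text), int(minutes_text)
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"not a valid 24-hour time: {clock_time!r}")
    return hours * 60 + minutes


if __name__ == "__main__":
    print(master_merge_minutes("03:00"))  # 3:00 a.m. -> 180
```

A merge scheduled for 3:00 a.m. therefore corresponds to a value of 180, and the latest possible setting, 11:59 p.m., corresponds to 1439.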
13.1.3. Common Administrative Tasks

In this section, I'll go through some common administrative tasks you will encounter with the Indexing Service. When performing most of these tasks, you'll find it easier to create a custom view within the MMC to access the Indexing Service controls because a clean default view of these options is not built into Windows Server 2003. To create a custom view, follow these steps:
Now you're ready to manage the service, as described in the next section.

13.1.3.1. Administering a catalog

As you read previously in this chapter, a catalog is a specific folder that contains all indexes, both temporary word lists and more permanent shadow and master indexes. When you first add the Indexing Service to a computer, the service creates a default catalog, named System, that includes all directories on all local drives attached to the system, and another default catalog, named Web, for any IIS-based web sites that might be running on that particular machine.

For security reasons, I recommend deleting both of these default catalogs. They are too all-encompassing, particularly for web servers. It's best to create your own catalogs that index only certain data on your disk and not every file the service can find. However, you might have a completely sanitized system and find that the defaults work well for you; if this is the case, by all means go for it. But for most people, I recommend deleting the default catalogs and enabling more specific, focused, and restrictive catalogs.

Creating a catalog. To create a custom catalog, use the custom MMC that you created with the Indexing Service snap-in and highlight the Indexing Service node in the left pane. Then, select the Action menu and choose Catalog from the New menu. The Add Catalog screen displays, as shown in Figure 13-3.

Figure 13-3. Creating a custom catalog

Enter a name for the new catalog in the Name box and then enter the path to the folder that will house the contents of the catalog in the Location box. You can use the Browse button to graphically navigate your directory structure. Click OK when you've entered this information. Keep in mind that if you're managing the service on a remote computer, that remote computer must have the default administrative shares (i.e., C$, D$, and the others, as discussed in Chapter 3) intact; otherwise, the operation will fail.
Before the new catalog becomes active, you must restart the Indexing Service. To quickly restart the service, right-click the Indexing Service node within the Console window and select Stop. Once the service has stopped, right-click in the same place again and choose Start.

Configuring a catalog. After catalogs are added, they need to be configured to act as you want. Within the Indexing Service console, right-click the catalog to be configured and select Properties. The screen shown in Figure 13-4 appears.

Figure 13-4. Adjusting the properties of an individual catalog

A discussion of the features available on each tab follows:
Figure 13-6. The Generation tab

Aside from the abstract generation setting, all the options on these tabs require a restart of the catalog to be recognized. To restart the catalog, right-click the appropriate catalog within the Indexing Service snap-in and select Stop and then Start. If you change the abstract generation setting, you need to stop and restart the Indexing Service itself; see the previous instructions for a procedure to do that.

Selecting a directory and location. Upon adding a new catalog and configuring its properties, you also need to define the directories to be included in or excluded from its indexing activities. Specifying an "included directory" also encompasses any subdirectories of that directory. You can choose to exclude individual directories within an included parent directory, but you cannot include individual directories within an excluded parent directory: the directory will appear to be included, but it will not be indexed.

How does security play into the indexing process? The Indexing Service is completely compatible with any NTFS permissions you apply to files and folders; if a user's current security privileges won't allow him to see a file that is stored on a local NTFS volume, the Indexing Service won't return that file within the results of a query. If a catalog is configured to index a remote UNC share, it will show the protected files in the results of a search, but the user won't be able to access them. This is a moderate security risk, of course, since the user learns of the existence of a file containing the information he searched for. Additionally, encrypted files are not indexed at all. If a file included in a catalog is encrypted after it is indexed, it will be removed from the index.

You can block the service from indexing a particular file or folder by adjusting that object's attributes. Right-click the appropriate file or folder, choose Properties, and then click the Advanced button on the General tab.
This opens the Advanced Attributes dialog box, as shown in Figure 13-7.

Figure 13-7. The Advanced Attributes dialog box

Under Archive and Index attributes, uncheck the second option, and the folder won't be indexed by the service.

Also, note that the operating system that hosts the drives being indexed affects the operation of the Indexing Service in the following ways:
To include or exclude directories from a catalog's indexing processes, follow these steps:
Figure 13-8. Specifying included and excluded directories

The property cache. The property cache is where the Indexing Service stores file property information for all documents and pages within each catalog. The cache is a dual-level cache, with the primary level containing property information that is accessed fairly regularly, and the secondary level holding information that is not accessed very often. Table 13-2 shows the property values stored in the cache by default and their respective levels.
You might find that you would like to track and include other properties within the index. For instance, your users might often search on the date a document was created, a property that is not tracked by default. You can add properties to either level of the property cache and track them, but adding values to either level degrades the overall performance of the service, and the effect is even more pronounced if you add a value to the primary level. Adding properties of variable length also dramatically increases the size of the cache, something to be aware of if disk space is at a premium. Finally, after you've restarted the Indexing Service, the levels to which you assigned any new properties are finalized and cannot be changed.

You can see all the available properties to track by opening the Indexing Service console and clicking the appropriate catalog in the left pane. In the right pane, all the available properties will be listed. To add a property to be saved in the property cache, follow these steps:
The property has now been enabled for inclusion in the property cache. You will need to restart the Indexing Service for these changes to take effect. Also, only new documents added to the index will have these properties tracked and added to the cache; to include these properties for documents already in the index, you'll need to perform a full scan of the index (see the next section for details on that process).

To remove a property from being tracked, repeat the preceding process for the appropriate property, and on the Properties sheet, remove the check mark in the Cached checkbox. Then, restart the Indexing Service and again perform a full scan of the index to remove all traces of the property from the property cache.

Figure 13-9. Adding a property to the property cache

Initiating scans. Full scans involve making a complete list of all documents contained in a catalog. When the Indexing Service is first installed it of course conducts a full scan, but these types of scans also are conducted when directories are added to a catalog and as part of the error recovery process. On the other hand, incremental scans, which only look for changed documents within a catalog, are done automatically upon a restart of the Indexing Service to determine what documents have changed while it was inactive.

If you have a heavy load on your server from a large number of modified files, you might want to manually initiate either a full or an incremental scan. Here are the steps:
The scan will proceed. Indexing new web sites. When you create a new web site with IIS, it isn't indexed automatically when you create a catalog for it. If you want the contents of the web site to be indexed, follow these steps:
The new catalog is active and will begin indexing the site you specified. I'll cover how to query this new catalog later in this chapter.

Indexing PDF files. Although the Indexing Service and Windows Server 2003 do not come bundled with a filter that can index the contents and properties of PDF files, Adobe (the manufacturer of Acrobat) has made available a free filter that you can install to enable that functionality. You can find this filter at http://www.adobe.com/support/salesdocs/1043a.htm; you will need to have a login and password for the Adobe web site (both of which are free) to download it.
To install the PDF filter, follow these steps:
If, for some reason, PDF files still are not being indexed after this procedure, check the Registry to make sure the Indexing Service knows the PDF filter is present and where it can find it. Stop the Indexing Service, and then open the Registry Editor and navigate to the key:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex

In the right pane, double-click the DLLsToRegister value. Look to see whether PDFFILT.DLL is present, and make sure the path is correct. (If you accepted the default entries during the filter installation process, the path is C:\Program Files\Adobe\PDF IFilter 5.0.)

13.1.3.2. Controlling merges

At some point within your organization a significant number of documents within your corpus might be modified. In this instance, it might be beneficial to initiate a master merge yourself, instead of waiting for the scheduled automatic master merge to occur. To initiate a master merge manually, follow these steps:
Figure 13-10. Initiating a master merge manually

You also might find it convenient to change the scheduled time for master merges to occur. Perhaps your lowest CPU load occurs at 3:00 a.m. and not at midnight, when the service is preconfigured to perform merges. To change this time, you'll need to edit the Registry. Follow these steps:
13.1.3.3. Running and configuring queries

The Indexing Service has several interfaces. Perhaps the easiest and most accessible is simply through the Search command in the Start Menu, as shown in Figure 13-11. When using this interface, choose the option to search for files and folders, and then enter a filename, a word, a string of text from a file, or some other criterion in the box provided. Then the Indexing Service will work its magic, displaying results as much as 10 times faster than a search done without the Indexing Service present.

Figure 13-11. Accessing the Indexing Service via the Windows user interface

The Indexing Service console also contains a Query the Catalog interface, as shown in Figure 13-12. The main advantage of the Query the Catalog form is the wider availability of search criteria. Using this page, you can search for words and phrases, search for words and phrases that are near other words and phrases, search for strings within text properties (such as a document summary in Microsoft Word), search within certain document formats, use operators such as <, <=, =, >=, >, and != against a fixed data point (useful for comparing against a date, a time, a size, or the like), use Boolean operators, use wildcard operators, use regular expressions, and rank results by how close the match is to the query. It's certainly quite a list.

If you want to create your own custom query form, that's simple to do as well. A basic form might consist of the following:

    <h1>Indexing Service Query</h1>
    <p>Enter the term for your search, and then press Submit.</p>
    <form method="POST" action="/scripts/querydemo.idq">
    <p><input type="text" name="CiRestriction" size="75">
    <input type="submit" value="Submit" name="B1">
    <input type="reset" value="Reset" name="B2"></p>
    </form>

A custom query form has one requirement: it must post back to the Internet data query (IDQ) file, which simply configures the correct query parameters for a search.
(Head over to http://msdn.microsoft.com and search for "format IDQ" for a detailed reference on the formatting for these files.) The following code is a standard format for an IDQ file:

    [Query]
    #CiCatalog=d:\ <= COMMENTED OUT - default registry value used
    CiColumns=filename,size,rank,characterization,vpath,DocTitle,write
    CiRestriction=%CiRestriction%
    CiMaxRecordsInResultSet=200
    CiMaxRecordsPerPage=35
    CiScope=/
    CiFlags=DEEP
    CiTemplate=/iissamples/issamples/ixtourqy.htx
    CiSort=rank[d]
    CiForceUseCi=true

Figure 13-12. Accessing the Indexing Service via the Query the Catalog page

Let's take a closer look at each part of the IDQ file:
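When a custom query form misbehaves, it can help to see exactly which settings an IDQ file declares. The Python sketch below is a rough reading aid, not a faithful reimplementation of how IIS parses these files: it assumes a simple `name=value` layout under a `[Query]` header and assumes that lines beginning with `#` are comments. The function names are hypothetical.

```python
def parse_idq(text):
    """Collect name=value settings from the [Query] section of IDQ text."""
    settings = {}
    in_query_section = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a commented-out line
        if line.startswith("[") and line.endswith("]"):
            in_query_section = line[1:-1].strip().lower() == "query"
            continue
        if in_query_section and "=" in line:
            name, value = line.split("=", 1)
            settings[name.strip()] = value.strip()
    return settings


def catalog_warning(settings):
    """Return a warning string if no catalog is named, else None."""
    if "CiCatalog" not in settings:
        return "CiCatalog is not set; the default catalog will be searched."
    return None


if __name__ == "__main__":
    sample = "[Query]\n#CiCatalog=d:\\\nCiScope=/\nCiFlags=DEEP\n"
    parsed = parse_idq(sample)
    print(parsed)
    print(catalog_warning(parsed))
```

Run against a file whose CiCatalog line is commented out, the sketch reports that the default catalog will be used, which is a common reason a custom catalog appears to return no results.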
If you are receiving an error such as "No documents matched the query" when using a custom query form, you can try a few things. For one, check the .IDQ file that is being used for the query, and make sure the CiCatalog line is pointing to the correct catalog location. If you are using a custom catalog, be sure this entry points to it; otherwise, you are searching in the default catalog, which isn't what you want.

Also, if you are trying to search content on an IIS-hosted web site, make sure the Index This Resource checkbox is checked for that particular site. Open the IIS Manager console (see Chapter 8 for detailed instructions on administering IIS), and right-click the relevant web site in the left pane. Then select Properties from the context menu. Navigate to the Home Directory tab, check the Index This Resource box, and then click OK. Finally, restart the Indexing Service.

You also might be impeded from viewing some documents because of permissions. The Indexing Service scans and indexes using the local System account and must have at least Read permissions on the files you want indexed; otherwise, the service can't read them and they're not indexed. The service also needs Full Control permissions for the root folder of the drive that houses the catalog, and it needs Full Control on the CATALOG.WCI directory, which is located within the catalog directory. Additionally, if your users are attempting to search for documents, they might not be allowed to access them and thus those documents would not show up in the search results (if those documents are hosted on an NTFS volume).

13.1.3.4. Adjusting performance options

Trying to adjust performance for the Indexing Service and to issue recommendations is tantamount to aiming at a moving target: several variables significantly affect the performance of the service, including the obvious ones: corpus size, amount of memory available, and amount of physical disk space present.
Testing on an informal basis has revealed that indexes with 150,000 documents or fewer tend not to require a special hardware emphasis: the stock hardware that runs Windows Server 2003 should be a sufficient base for such a small corpus. Above that "magic" number, however, you might need to look at expanding hardware on the machine running the Indexing Service to improve performance.

Configuring performance within the Indexing Service. You can adjust a couple of knobs within the Indexing Service to configure a certain level of performance based on system load; these adjustments are sometimes a quick fix to avoid needing a hardware upgrade. However, it's important to realize that in the majority of cases, the service works in the background and configures itself to consume resources appropriately; these options will make a noticeable difference only in either very high- or very low-load situations.

With that disclaimer out of the way, let's turn to adjustments. For one, you can adjust the usage level the Indexing Service assumes for the server; sometimes this can make the service a better player among the other processes jockeying for CPU time on your machine. To try this, do the following:
Figure 13-13. Adjusting Indexing Service usage

On this screen, simply select the options that adequately fit this machine's usage profile. Your options are as follows:
The Desired Performance screen allows you to individually adjust the indexing and querying settings for the service to use. You can choose between lazy, moderate, and instant indexing, and low load, moderate load, and high load querying.

Figure 13-14. Customizing Indexing Service performance

Monitoring performance using the Performance Monitor. You might find that using the Performance Monitor bundled with Windows Server 2003 provides you with data on how the Indexing Service is performing. To call up the Performance Monitor, load the application from the Administrative Tools menu off the Start menu. Then, click the "+" icon in the middle of the toolbar in the right pane to open the Add Counters screen. Select the appropriate performance object and, on the right side of the screen, the appropriate counters, using Table 13-3 as a guide; the table lists the relevant counters you can use to track this performance.