This site is for demonstration purposes and has been scrubbed for proprietary data.

Frequently Asked Questions

If you don't find the answer to your question on this page, feel free to send us an email.

Current Events


Crawling and Indexing


Searching


Relevance and Optimization


Development


General



Current Events

I am getting redirected to my Internet Service Provider's search page when trying to use Enterprise Search. How do I fix this?

When telecommuting from outside the Boeing network, you may have trouble accessing search.boeing.com. This is due to recent changes by Comcast and other ISPs in how they handle search traffic. The Network Team is actively testing solutions to remedy this issue.

In the meantime, the quickest fix is to ensure you have selected "Tunnel All" when signing on (as opposed to "Split Tunnel"). If you have Comcast, you can also call and ask to be removed from their newly implemented "Domain Helper" service.

When I include secure content, I get a blank page instead of search results. How do I fix this?

The most common solution to this problem is to clear cookies, browser history, and other temporary files from your web browser. Use the following browser-specific instructions to fix this issue.

Internet Explorer 6

  1. From an Internet Explorer browser window, open the Internet Options menu by selecting:
    Tools > Internet Options
  2. Select the options to Delete Cookies, Delete Files, and Clear History. This will clear the old temporary data from your browser.
  3. Close all instances of Internet Explorer and re-open.
  4. Go to http://search.boeing.com, enter your query and click the checkbox to "Include Secure Content".
Internet Options Menu with buttons highlighted

Internet Explorer 8

  1. From an Internet Explorer browser window, open the Delete Browsing History menu by selecting:
    Tools > Delete Browsing History
  2. Select the checkboxes for the following options and click Delete:
    • Temporary Internet Files
    • Cookies
    • History
    • Form Data
    • Passwords
  3. Close all instances of Internet Explorer and re-open.
  4. Go to http://search.boeing.com, enter your query and click the checkbox to "Include Secure Content".
Delete Browsing History menu with checkboxes selected

My content does not appear in Enterprise Search. What is wrong?

There are a host of reasons why a web site or web page may not be indexed. Here are some of the possible causes and what you might do to correct the situation:

Cause: My pages are not in the register.boeing.com list.
Possible solutions: Submit your web site via http://register.boeing.com/

Cause: My pages are protected with authentication and authorization mechanisms like Web Single Sign-on (WSSO).
Possible solutions: Secure search is available for NTLM and basic authentication.

WSSO-protected content in general is not presently supported. There are several pilots underway that allow us to crawl content behind WSSO and serve it either as non-secure (with no authorization) or as secure content (with authorization). For more information, contact us.

There is also a work-around available for giving a site behind WSSO some level of visibility so it can be found.

Cause: My pages use Flash.
Possible solutions: The search engine can index some Flash-enabled pages.
Alternatively, provide a sitemap that lists all the links within your site.

Cause: My pages use JavaScript.
Possible solutions: The search engine cannot index links embedded within JavaScript.
Alternatively, provide a sitemap that lists all the links within your site. The sitemap can be set to noindex, follow with the robots metatag, if needed.
Some authors provide both the links embedded within JavaScript and straight URL links to assure accessibility. This is the author's choice.

Cause: My pages use Frames.
Possible solutions: The search engine does not handle framed web pages well. Providing a sitemap that lists all the links within your site can help. The sitemap can be set to noindex, follow with the robots metatag, if needed.

Cause: My pages are dynamically generated from a database.
Possible solutions: Currently, the search engine does not support dynamic database query indexing.
Alternatively, you could use static links that generate web pages from the database.

Cause: My pages are dynamically generated using ASP (asp?), Coldfusion (cfm?), Java Server Pages (jsp?) or PHP (php?).
Possible solutions: Currently, the ESS Team has chosen to disallow these types of pages. We have found that they sometimes generate what could be termed a "black hole". A black hole in this context is a set of web pages with an infinite number of dynamically generated result pages. This unfortunately exhausts the document license count of the Boeing search engine.
Alternatively, you can contact the ESS Team to have your dynamic site allowed after the Team reviews it and confirms that the site will not become a "black hole".

Cause: I added new pages to the web but they do not show up in search.
Possible solutions: Be sure to have a link to the new pages. Google does not discover new pages on its own; it only follows known links.

Or be sure to have used register.boeing.com to tell Google that your pages have become available. You do not need to perform both.

Cause: My pages are not indexed and I have no clue why.
Possible solutions: Check the basics:
  • Check the settings in YOUR robots.txt file to see if the site is restricted (display http://mysite.boeing.com/robots.txt)
  • Check for any robots metatags on the web pages
  • Check with Enterprise Search to see if the site is in the disallowed URL patterns list
  • Check for an HTTP error code, such as 404 (page not found) or 403 (forbidden access)
  • Check that the URLs of your pages are fully qualified, e.g. http://mysite.boeing.com/index.html rather than http://mysite/index.html
  • If none of those were helpful then contact us.
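
The robots.txt check in the list above can be scripted. The following is a minimal sketch using Python's standard urllib.robotparser module. It assumes the crawler identifies itself with a user agent beginning with "googlebot", and the robots.txt content shown is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "googlebot") -> bool:
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_crawlable(SAMPLE_ROBOTS, "http://mysite.boeing.com/index.html"))
print(is_crawlable(SAMPLE_ROBOTS, "http://mysite.boeing.com/private/report.html"))
```

If the second call reports that a page you expect to be searchable is blocked, the Disallow rule is the likely reason it is missing from the index.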

Why do documents show up with future dates in the search results?

There are many reasons why documents show a future date in the search results, but the two primary reasons are:


  • There is no date found as expected in the metadata of the document
  • The document's date is not correctly formatted as yyyy-mm-dd, according to the Intranet Style Guide

The date associated with a given document should be taken from the metadata of that document. When the date is not found where it is expected, the web crawler will proceed to look through the document for anything that looks like a date. For example, let's say that one of the first dates found in a document is "Jan 23". When the crawler only sees a month name with one other number it interprets it as the Month and Year, and records that as the date for the document. Hence, the recorded date for this document is 2023-01-01.

The simple solution to this problem is for the content owners of these documents to ensure that their documents have the correct metadata, according to the Intranet Style Guide.
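
To make the behavior above concrete, here is a rough Python sketch. It is not the crawler's actual algorithm; guess_month_year only imitates the fallback described above (a month name plus a two-digit number read as month and year, with 2000 added to the number as an assumption), and is_style_guide_date checks the yyyy-mm-dd format. Both function names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
MONTH_AND_NUMBER = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{2})\b")


def is_style_guide_date(value: str) -> bool:
    """True if value is a real calendar date written as yyyy-mm-dd."""
    if not ISO_DATE.match(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def guess_month_year(text: str) -> Optional[date]:
    """Imitate the fallback: read 'Mon NN' as a month and a two-digit year."""
    for match in MONTH_AND_NUMBER.finditer(text):
        month = MONTHS.get(match.group(1).lower())
        if month:
            # Assumption: a two-digit number is taken as a year in the 2000s.
            return date(2000 + int(match.group(2)), month, 1)
    return None


print(guess_month_year("Status as of Jan 23"))  # 2023-01-01
print(is_style_guide_date("2009-06-15"))        # True
```

A document whose metadata date passes a check like is_style_guide_date never reaches the guessing step.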


How often are web sites indexed?

The crawler continuously indexes content. On average, content is re-indexed every 7 days. To request an urgent re-index of your site, please contact us.


How soon does the search engine find my updated pages?

The crawler now runs continuously. Search results should therefore be updated at least once per week. If you need the changes to appear sooner, please contact us.


Why does the crawler execute commands on my web page?

Web-based administration pages can be activated by the crawler when the administration pages are not password protected. The crawler will follow all links on a page unless directed otherwise by the robots.txt file or a robots meta tag. It attempts to access each of the URLs it finds on the page. If the URL returns a valid response code, the page is indexed.

When the URL is a command to perform an administrative task, the task will be executed if valid access is available.

To prevent this, be sure to password protect any web pages that provide administrative access to software, servers or databases. In addition, you can contact us to remove index entries for these URLs.


How do I get my blog, wiki, or Sharepoint site to show up in a search or when I select the drop down for the specified content type in Enterprise Search?

If your blog or wiki resides under a wiki.web.boeing.com, *.wiki.boeing.com, blog.web.boeing.com, or *.blog.boeing.com address, it should automatically be included, as long as the wiki or blog has not been restricted to certain users or groups. If the content resides elsewhere, contact us and specify the location of the content.


How can I find out when the crawls run?

There is currently no reporting utility to allow you to view when the last crawl was completed. However, you can enter the following into the search box to retrieve the pages that are indexed for your site:

site:your-site.boeing.com inurl:boeing

where your-site = the name of YOUR web site's domain
(e.g. library.web)
{The inurl:boeing is just a way to provide a term for the site: restrictor, that will provide the most results for the site.}

Another way to find out when the crawler visits a site is to look at the web server logs and find instances where the user agent equals "googlebot-***".
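
As a sketch of the log-based approach, the following Python snippet pulls the timestamps of crawler requests out of web server log lines. It assumes your server writes the common "combined" log format (timestamp in square brackets, user agent as the last quoted field). The sample lines and the "googlebot-example" agent string are made up for illustration.

```python
import re
from typing import Iterable, List

# Combined log format: timestamp in [...], user agent as the final quoted field.
LOG_LINE = re.compile(r'\[(?P<time>[^\]]+)\].*"(?P<agent>[^"]*)"\s*$')


def crawler_visits(lines: Iterable[str], agent_marker: str = "googlebot") -> List[str]:
    """Return the timestamps of requests whose user agent contains agent_marker."""
    visits = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_marker in match.group("agent").lower():
            visits.append(match.group("time"))
    return visits


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [12/Jun/2009:04:15:02 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "googlebot-example"',
    '10.0.0.2 - - [12/Jun/2009:04:16:40 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

print(crawler_visits(SAMPLE_LOG))
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG.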


How do PDF documents get "indexed"?

The appliance uses pdftohtml (http://pdftohtml.sourceforge.net/) to convert PDFs to HTML.

The simplest way to see what content has been indexed for a file is to click on the cached link. Some of the issues that can be caused by PDF files are:

  • PDF files don't contain any text, e.g., they only contain graphics (#10625).
  • PDF files don't send a Content-length header in the HTTP response (#10791).

In addition, there are two reasons why a PDF page may get recrawled even though its content has been indexed correctly.

  • PDF files are not recrawled from cache if you are using NTLM authentication (#10809).
  • If the page was modified within the past few hours, it may be recrawled because the appliance sends its If-modified-since header in the local timezone, but the web server calculates the date in GMT.
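
The timezone mismatch in the last item is easier to see with an example. HTTP dates such as If-Modified-Since are expressed in GMT, so a local time must be converted before it is compared. Below is a minimal Python sketch; the timestamp and the -07:00 offset are arbitrary illustrative values.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def http_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


# Illustrative local time, seven hours behind GMT.
local_time = datetime(2009, 6, 12, 4, 15, 2, tzinfo=timezone(timedelta(hours=-7)))
print(http_date(local_time))  # Fri, 12 Jun 2009 11:15:02 GMT
```

If the local clock value (04:15:02) were sent as though it were GMT, the server would see a time seven hours earlier than intended, and a page modified in that window would look newer than the crawler's copy.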

The pdftohtml converter inserts a noarchive robots meta tag into the HTML file if the PDF has security enabled, meaning that these PDF files will not show a cached link.

Links inside PDF documents will be crawled unless you have a robots.txt file or crawler patterns to stop these links being followed.

The appliance extracts the document properties from a PDF file, which become meta tags in the HTML document. Here is an example of the meta tags which are created:

<meta http-equiv="Content-Type" content="text/html; charset=Latin1">
<meta name="Producer" content="Acrobat Distiller 4.05 for Windows">
<meta name="ModDate" content="D:20011129112148-06'00'">
<meta name="Author" content="Charles Dickens">
<meta name="CreationDate" content="D:20011129112114">
<meta name="Creator" content="Microsoft Word 9.0">
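
The ModDate and CreationDate values above use the PDF date string format (D:YYYYMMDDHHmmSS followed by an optional UTC offset). If you need to read them in your own tooling, a small Python sketch like the following may help. It handles only the two full-length shapes shown in the example, and the function name is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'(\d{2})'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a full-length PDF date string, with or without a UTC offset."""
    match = PDF_DATE.match(value)
    if not match:
        raise ValueError(f"unsupported PDF date: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(part) for part in groups[:6])
    sign, offset_hours, offset_minutes = groups[6:]
    tzinfo = None
    if sign:
        offset = timedelta(hours=int(offset_hours), minutes=int(offset_minutes))
        tzinfo = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)


print(parse_pdf_date("D:20011129112148-06'00'").isoformat())  # 2001-11-29T11:21:48-06:00
print(parse_pdf_date("D:20011129112114").isoformat())         # 2001-11-29T11:21:14
```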

The appliance uses the PDF's document title property as the title in its index. However, if the title is the same as the filename, the appliance uses the first text in a large font from within the document.


How can I prevent the crawler from following links from a particular page or archiving a copy of a page?

The crawler obeys the noindex, nofollow, and noarchive meta tags. If you place these tags in the head of your HTML document, you can tell the crawler not to index, follow, and/or archive particular documents on your site. The tags to include and their effects are:

Meta Tag Description
<META NAME="robots" CONTENT="noindex"> Googlebot will retrieve the document, but it will not index the document.
<META NAME="robots" CONTENT="nofollow"> Googlebot will not follow any links that are present on the page to other documents.
<META NAME="robots" CONTENT="noarchive"> Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document.

The "robots" tag is obeyed by many different web robots. If you'd like to specify some of these restrictions just for googlebot, you may use "googlebot" in place of "robots". You can also combine any or all of these tags into a single meta tag. For example:

<META NAME="robots" CONTENT="noarchive,nofollow">
-- or --
<META NAME="googlebot" CONTENT="noarchive,nofollow">
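
To check which directives a page actually carries, you can parse its head section. The following Python sketch uses the standard html.parser module and collects directives from both "robots" and "googlebot" meta tags. The class and function names are hypothetical, and this is only a checking aid, not how the crawler itself reads the tags.

```python
from html.parser import HTMLParser
from typing import Set


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots-style meta tags."""

    def __init__(self, agent: str = "googlebot"):
        super().__init__()
        self.names = {"robots", agent.lower()}
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        if attributes.get("name", "").lower() in self.names:
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html: str, agent: str = "googlebot") -> Set[str]:
    """Return the set of robots directives found in the HTML text."""
    parser = RobotsMetaParser(agent)
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="noarchive,nofollow"></head></html>'
print(sorted(robots_directives(page)))  # ['noarchive', 'nofollow']
```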


How does the crawler treat frames?

The crawler does not crawl content embedded in frames. If this content must be crawled, there must be direct links to the frame pages. Search results will then link directly to the frame pages, so the navigation and other contextual content around the frame will be lost.


How can the crawler find my pages if I am using JavaScript for navigation?

The crawler does not follow links embedded in JavaScript. Unless you provide an HTML link to pages within your site, your site will not be indexed.

In general, Google's crawler does not execute JavaScript. Pages which meet the handicapped-accessibility requirements of the ADA will also have links which work without JavaScript. Those will work fine for the crawler.

The ADA doesn't forbid JavaScript but it does require that users can use the pages without it. The crawler has the same requirement. It doesn't have to follow every link on the page, but it does need one readable link to each document.
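
A quick way to see a page roughly as a non-JavaScript crawler would is to list only its plain href links. The sketch below is a simplified approximation in Python (standard html.parser), not the crawler's real link extraction. The sample markup and the go() function in it are invented.

```python
from html.parser import HTMLParser
from typing import List


class PlainLinkParser(HTMLParser):
    """Collect anchor targets that work without executing JavaScript."""

    def __init__(self):
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Skip script-only links: javascript: URLs and bare "#" placeholders.
        if href and href != "#" and not href.lower().startswith("javascript:"):
            self.links.append(href)


def crawlable_links(html: str) -> List[str]:
    """Return the href values a crawler could follow without JavaScript."""
    parser = PlainLinkParser()
    parser.feed(html)
    return parser.links


# Invented sample navigation, for illustration only.
SAMPLE_NAV = (
    '<a href="javascript:go(\'page2.html\')">Page 2</a>'
    '<a href="#" onclick="go(\'page3.html\')">Page 3</a>'
    '<a href="page4.html">Page 4</a>'
)
print(crawlable_links(SAMPLE_NAV))  # ['page4.html']
```

In this sample only Page 4 is reachable; Pages 2 and 3 would need a plain link somewhere, such as on a sitemap page.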

There are good resources for making your sites comply with the ADA guidelines at the following site: http://www.w3.org/WAI/.


What specific file formats are supported?

More than 220 file formats/versions are supported including HTML, PDF, text, PostScript, Microsoft Office, IBM Office, and many more. A complete list of supported file formats is available on Google's website.


What are the inSite results for and how do I manage when my profile appears?

inSite results are returned in search results to connect you with people in addition to documents and web pages. Name or BEMS ID-based searches will return direct matches for people. Other types of searches will return subject matter experts for your search as well as people who have tagged themselves with a particular skill that matches the search submitted. The custom relevancy algorithm also takes a person's contribution to the inSite community into account when determining the order of inSite results. For more help on managing your profile, visit inSite's Managing Your Profile help page.


What does the "include secure content" option mean?

By default, your search only returns content from sources that do not require a login. When the "include secure content" option is selected, content with access control that has been registered or indexed by Enterprise Search will be included in the search results as well. You will only see the secure content you have access to. Secure content includes SharePoint sites and websites that require a login or user credentials. The login you use is your Windows account (Domain\username and password).


What is the maximum number of results that can be returned in a single request?

1000 results. The system is optimized for the task of fast retrieval of search results and requires certain engineering trade-offs.


How do I add a toolbar for Enterprise Search to my browser?

Both the Boeing-provided Firefox and Internet Explorer browsers come packaged with a specific search box for enterprise search. If it is not there anymore, reinstall the browser. The Google Toolbar should not be installed on your Boeing-provided computer.


Why do some results pages display this text: "... to show you the most relevant results, we have omitted some entries very similar to the already displayed."

There is an algorithm which compares the snippets generated for each page. Pages with near-duplicate snippets are considered similar. Clicking the "repeat the search with the omitted results included" link turns off this feature by adding the following parameter to the query string:
&filter=0
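
If you build search URLs in your own code, the parameter can be appended like this. This is a sketch only: the filter=0 parameter comes from the answer above, while the /search path and the q parameter name are assumptions that you should confirm against the protocol document.

```python
from urllib.parse import urlencode


def search_url(base: str, query: str, include_similar: bool = False) -> str:
    """Build a search URL, optionally turning off duplicate-snippet filtering."""
    params = {"q": query}  # Assumption: the query parameter is named "q".
    if include_similar:
        params["filter"] = "0"
    return base + "?" + urlencode(params)


# Assumption: the search endpoint lives at /search on search.boeing.com.
print(search_url("http://search.boeing.com/search", "wing design", include_similar=True))
```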


The search tips say that Google search is not case sensitive, but when I enter a URL into a link: search, it is case sensitive. Can you explain?

Link searches are an exception to the case sensitivity rule as URLs are, by specification, case sensitive. The server name in a link search is case-insensitive but the path and document name are case sensitive.
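
In practice this means you can safely lowercase the server name before a link: search, but must leave the rest of the URL alone. A small Python sketch of that normalization (the function name is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_for_link_search(url: str) -> str:
    """Lowercase the scheme and server name; keep path and document case as-is."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


print(normalize_for_link_search("http://MySite.Boeing.com/Docs/Report.HTML"))
# http://mysite.boeing.com/Docs/Report.HTML
```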


It appears that the search engine does some interpretation of search terms. For example, if I search for council, I get council and counsel. What other interpretations are happening and can we control/add/remove them?

Spelling suggestions are automatically generated in a context-sensitive manner from similar words contained in the index. There is not a mechanism to control, add, or remove them.

The appliance uses a corpus-based Bayesian algorithm to generate spelling suggestions from the content in your index. This makes it good at providing relevant spelling suggestions for proper nouns such as names of employees and product names. The spelling system is fully automated, so it is not possible to manually edit the spelling dictionary.


It appears that Google does not keep the fully qualified URL if it ends in index.html or index.htm. Users cannot access the non-fully qualified URL for my site. How can I get Google to index my site?

Unfortunately, this is a limitation of Google's page rank algorithm. The URL gets stored internally with a trailing "/" rather than the "/index.html".

If possible, create a special jump page for the crawler to start from, for example by renaming index.html to jump_page.html. The other option is to ensure your web server configuration redirects requests for "/" to index.html or another appropriate page.


What are "Starting Points" and why do they appear at the top of search results?

Starting Points are pages that have been deemed authoritative and/or most relevant for certain searches. The Boeing Library is responsible for managing all Starting Points. Anyone can suggest a Starting Point by clicking the "Suggest a Starting Point" link on the search results page. As of June 2009, there are approximately 8,000 Starting Points.


How do I find documents of a particular type, such as .ppt, .doc, .pdf or the like?

To find specific document types, enter a term and a filetype: restrictor. For example:

  • boeing filetype:ppt
  • boeing filetype:pdf
  • boeing filetype:doc

A new addition to our search results page creates an easier way to filter by document type. It does not require you to memorize commands and instead allows you to select document types from the left-hand column.

Known Issues

Since this filter-by-click functionality is not truly dynamic, it does have a few quirks.

  • When applying a filter (xyz) you will see the "filetype:xyz" show up in the search box. If the "filetype:xyz" statements are removed from the search box, the filter will not work (even though the link in the filters column may still be selected).
  • After applying a filter (xyz), if the search terms are changed in the search box, and another filter (abc) is applied, no results will be returned. This is because the initial filter (xyz) is still applied, so by selecting a second filter (abc) the search engine looks for results that are both type (xyz) AND (abc), which will never be true. This issue only occurs if the search term is modified.
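
If you generate filtered queries yourself, you can sidestep the second quirk by making sure a query never carries more than one filetype: term. The Python sketch below shows one way to do that. It illustrates the idea only and is not how the results page itself is implemented.

```python
import re

FILETYPE_TERM = re.compile(r"\s*\bfiletype:\S+", re.IGNORECASE)


def apply_filetype(query: str, extension: str) -> str:
    """Replace any existing filetype: term with a single new one."""
    cleaned = FILETYPE_TERM.sub("", query).strip()
    return f"{cleaned} filetype:{extension}".strip()


print(apply_filetype("boeing", "ppt"))                  # boeing filetype:ppt
print(apply_filetype("boeing 787 filetype:ppt", "pdf"))  # boeing 787 filetype:pdf
```

Without the replacement step, the second query would become "boeing 787 filetype:ppt filetype:pdf", which asks for documents that are both types at once and therefore returns nothing.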

If a site appears in a search, does it still need to be registered at register.boeing.com?

If your site appears in a search then adding it again is probably unnecessary. There is no negative effect on the search engine from registering your site again.


When I do a search, the number of documents found (Results n - nn of about m) doesn't match with the actual number of documents returned. Why doesn't it match?

The Google search engine does not guarantee the ability to return a particular number of results for any given search query. The total number of results provided by Google in the search results is an estimate of the actual number of results for the query. This number can be higher or lower than the actual number of results available.

Behavior: When a search request is made to Google, the following behavior occurs:

  • If Google has results to satisfy the search request, then the requested number of results will be returned.
  • If Google has results and the search request is for results beyond what is available, the last page of results will be returned. The last page of results is determined by dividing the total number of results into pages based on the number of results requested.
  • If no results are available for the search request, then an empty result set will be returned.

In order to determine if a particular results page is the last page of available results, check for any of the following conditions:

  • The first result number returned does not match the first result number requested.
  • The number of results returned is less than the number of results requested.
  • The results returned do not contain a link to the next result set.
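
For anyone paging through results in code, the three last-page conditions translate directly into a check like the following Python sketch. The parameter names are hypothetical; how you obtain each value depends on how you read the search response.

```python
def is_last_page(
    requested_start: int,
    requested_count: int,
    returned_start: int,
    returned_count: int,
    has_next_link: bool,
) -> bool:
    """Return True if any of the three last-page conditions holds."""
    return (
        returned_start != requested_start
        or returned_count < requested_count
        or not has_next_link
    )


# Asked for results 21-30 but only 21-25 came back: this is the last page.
print(is_last_page(21, 10, 21, 5, has_next_link=False))  # True
# A full page with a link to the next set: more results remain.
print(is_last_page(1, 10, 1, 10, has_next_link=True))    # False
```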

In addition, Google support says:

The Google system is optimized for speed. It tries to get the first 10 results as quickly as possible to return them to the user. The algorithm used for approximating results number can be very inaccurate for results numbers greater than 1000.

The thinking behind this is that most users will find what they want in the first 10 results. For most people, the estimated number of results will not be that important. It will be far more important to get the first 10 as quickly as possible.

Several good examples of this situation can be demonstrated with the searches below:

search      estimated   filtered   actual
aardvark           87         48       62
cranfill          212         19       22
CoABS              29         18       22
bethany            82         53       68
agents          28800        810      934

How can I search for and/or display meta tag information using Google?

ESS has implemented the partialfield meta tag search capability. This allows the user to specify a meta tag and associated single term (format is metatag:term). The term can appear anywhere in the specified meta tag.

For example the meta tag search for creator:rich will find all of the following occurrences:

  • creator = Mitch Fritschle/Rich Satow
  • creator = 8386-Rich Anthony & 873-Kim Little & 8334-Greg Miller
  • creator = Rich North

To look for a phrase, separate the words in the phrase with a minus sign. For example the meta tag search for creator:richard-f-hand will find the following:

  • creator = 7147-Richard F. Hand
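
If you build these queries programmatically, the metatag:term format and the minus-sign phrase rule can be wrapped in a small helper. This Python sketch is an illustration only. The function name is hypothetical, and dropping punctuation (such as the period in "F.") is an assumption inferred from the creator:richard-f-hand example.

```python
import re


def meta_tag_query(tag: str, value: str) -> str:
    """Build a metatag:term query, joining the words of a phrase with minus signs."""
    words = re.findall(r"[A-Za-z0-9]+", value)
    if not words:
        raise ValueError("value must contain at least one word")
    return f"{tag}:{'-'.join(word.lower() for word in words)}"


print(meta_tag_query("creator", "rich"))             # creator:rich
print(meta_tag_query("creator", "Richard F. Hand"))  # creator:richard-f-hand
```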

You can also choose to display meta tags within the search results. This feature can be used in conjunction with the meta tag search or a regular search. To display all meta tags, visit the advanced search page and select "Display all meta tags".

Note that the Boeing Meta Tag Standard includes the fields title, description, subject, creator, owner, date and validuntil.

More information about the meta tag search is available in the protocol document.


Is there a way to use wildcards in the site: restrictor, something like site:*.blog.boeing.com?

This is possible using the normal "site:" operator, but no asterisk is required. You can search by any subset of the site name. So, in the case of "iscfp.blog.boeing.com", all of the following are valid searches:

  • site:iscfp.blog.boeing.com
  • site:blog.boeing.com
  • site:boeing.com
  • site:.com
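
The list above follows a simple pattern: drop one leading label of the host name at a time. A Python sketch that generates the same list for any host name is shown below. The function name is hypothetical, and the leading dot on the final entry simply mirrors the "site:.com" example.

```python
from typing import List


def site_restrictors(hostname: str) -> List[str]:
    """List the site: searches that match a host name, most specific first."""
    labels = hostname.lower().split(".")
    queries = []
    for index in range(len(labels)):
        suffix = ".".join(labels[index:])
        # Mirror the example above: a bare top-level label is written with a leading dot.
        if index == len(labels) - 1 and len(labels) > 1:
            suffix = "." + suffix
        queries.append("site:" + suffix)
    return queries


for query in site_restrictors("iscfp.blog.boeing.com"):
    print(query)
```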

When I do a search, is the search engine case sensitive?

The search engine is not case sensitive. Entering either capital or lower-case keywords will return the same results.


Since Google does not search meta tags, why should I use them? I thought using subject and keyword meta tags would increase the number of hits on my page.

Google does in fact index meta tags, and it also allows you to search them.

Metatags are indexed but are not a primary factor in determining the ranking of your web pages. Metatags are good! However, don't rely on metatags. Good content, descriptive titles and many other pages pointing at your site are excellent ways of increasing your search results ranking.

There is a Boeing standard for meta tags. They are required and can be used by categorization and content management functions.


How can I customize the Google Appliance search results for my site?

So you have your pages indexed and you have set up a local search box on your site, but you don't want to use the Enterprise Search results format. Since the Google appliance returns the search results in XML format, you can use an XML stylesheet to make your search results and the input form appear exactly how you wish.

You should have some experience in coding XML stylesheets.

Contact us if you're interested in developing your own custom search interface.


Can I access Google search from a Perl application?

Yes, a fellow Boeing employee developed a Perl module for easy access to the Boeing Google search appliance. Since XML-formatted results are available, it would be possible to do the same with any other language that supports translation of XML, including Python or Java.
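
As a rough idea of what the Python route could look like, the sketch below parses an XML result set with the standard xml.etree.ElementTree module. The element names (GSP, RES, R, U, T, S) are assumptions about the appliance's XML output and the sample document is invented, so check both against the protocol document before relying on them.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample response. The element names are assumed, not confirmed.
SAMPLE_XML = """<GSP>
  <RES>
    <R N="1"><U>http://mysite.boeing.com/index.html</U><T>Example page</T><S>First snippet</S></R>
    <R N="2"><U>http://mysite.boeing.com/about.html</U><T>About</T><S>Second snippet</S></R>
  </RES>
</GSP>"""


def parse_results(xml_text: str) -> List[Dict[str, str]]:
    """Extract the URL, title, and snippet of each result element."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("R"):
        results.append(
            {
                "url": item.findtext("U", default=""),
                "title": item.findtext("T", default=""),
                "snippet": item.findtext("S", default=""),
            }
        )
    return results


for result in parse_results(SAMPLE_XML):
    print(result["title"], "-", result["url"])
```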

More information can be found at:


What are the recent changes to Enterprise Search about? (9/2009)

Search Homepage
We have made several updates to the search homepage and search results page. Starting with the homepage, you will notice a centered search box with fewer options. Our metrics showed that fewer than 1% of users made use of the "Number of Results" and "Content Collection" options. Therefore, we simplified the design and now present a design similar to the search experience you'll find on the internet. If you still desire additional fields, these options will continue to be available on the advanced search page.

In addition to the search box, we have also added important links across the top of all pages. This is intended to bring visibility to our support pages and also emphasize new features as they are introduced. Earlier designs of our search pages did not place an emphasis on these types of links. The search homepage also has a drop-down menu that allows you to jump to other popular sites on the intranet.

Search Results Pages
Moving to the search results page, the first thing you'll notice is that it closely resembles the new homepage. The additional little-used options are removed and continue to be available on the advanced search page. We also have optimized the use of screen real estate and aligned the search box higher up on the page. This enables valuable content to appear earlier on the page than in prior releases.

The simplified design also calls increased attention to the secure search feature. We found that the majority of users did not notice the secure search option in prior releases. By clicking on the checkbox labeled "include secure results", you will be prompted for your Windows username and password. This option will return results from content on the intranet that requires a username and password and has been registered with Enterprise Search.

Search Labs
One of the featured links at the top of all pages is Search Labs. This is a new offering that allows you to try out new search features that are intended to help you do your job. If a feature is deemed valuable based on your feedback, we will integrate it into the default search pages for all to benefit from. We encourage you to bookmark Search Labs pages for you to use as you go about your daily tasks.

For questions and feedback, please contact us at GRP Enterprise Search.


Does Enterprise Search support Web Single Sign-on (WSSO) protected sites? Is there a workaround?

WSSO-protected content in general is not presently supported by Enterprise Search. There are several pilots underway that allow us to crawl content behind WSSO and serve it either as non-secure (with no authorization) or as secure content (with authorization). To learn more about these pilots, contact us.

If you have a site protected in this manner, you can use the following work-around to get an initial front page indexed by Enterprise Search. Your entire site will not be indexed, only the unprotected front page.

Steps:

  1. Create a front page that does not require WSSO protection. Include significant information about your overall site.
  2. On this unprotected page, you can place a link to enter the protected area or you can code a redirect. If you use the redirect be sure to place a 10 to 15 second delay so that the search engine can index the page.
  3. Tell Enterprise Search about this new page by registering it. Wait approximately 12 hours for the site to get picked up and indexed.
  4. If you have any problems, contact us for assistance.

Why doesn't Enterprise Search have the same capabilities as Google?

There are several aspects of the internet-based Google search that are unique and will not be incorporated into the Google Search Appliance that is provided to corporate customers. When you see a feature on Google that appears to have benefit for us internally, feel free to let us know. We will in turn forward the requests for enhancements to Google, and they can then choose which features will be in their product.


Why doesn't Enterprise Search have a "find similar" capability like Google.com?

The "find similar" or "related" feature on google.com relies on information from the open directory project http://dmoz.org/. This directory is maintained through voluntary submissions of sites into categories. It is also the basis of http://directory.google.com.

Due to this dependence, Google does not have any current plans to offer this feature with the Google Search Appliance.


How can I just display meta tag information using Google without performing a meta tag search?

Meta tag information can only be displayed from the advanced search screen. It is not required to enter any term in the metatag search box. You can simply select the meta tag information you want to display, such as description, etc.

The steps are:

  • click on the advanced search link
  • on the advanced search page, enter search term(s) desired in the Find Results area
  • Below, select the "display all meta tags" option
  • click on the search button

The meta tag(s) should be displayed in the lower portion of each result.


How can I search a particular web site directory and only that directory, even if it has sub-directories?

On the Site Search page there are examples of how to use the site: restrictor to search including directories. These examples will search starting at a particular directory level and continue to search into sub-directories.

To limit the search to just a particular directory, you need to add an ending / (slash) to the site: restrictor term.


How long will it take for a URL to come out of the index if it has been removed from the server?

The index is refreshed approximately once per week. After the indexing has completed, the pages will no longer appear in the search results.

If you need the file to be removed sooner, you can submit a remove url request.


Where do I find documents that begin with a prefix like D6?

We have integrated Enterprise Search with the Boeing Library Catalog so searching for a document number should return relevant results. If you do not find what you expected, contact us.


Why are external sites appearing in the search results?

Normally internet/external sites are not indexed by Enterprise Search. However, we will include external sites that are related to Boeing but may not have a presence on the Boeing intranet.


How can I use meta tags to make my content appear higher in search results?

Generally, meta tags are not a primary factor in the algorithm that ranks results. Stuffing meta tag fields with keywords is acceptable but will not greatly influence the rank of a page. The key elements for making the page rank higher are the page title and actual content.