
Developer Help

The following resources have been compiled to help you leverage the enterprise search infrastructure and services for your development needs. The term "developers" is used broadly here and encompasses content publishers, authors, and anyone else who generates content on the intranet.



Best Practices for Developers


Make web pages for users, not for search engines

Create a useful, information-rich content site. Write pages that clearly and accurately describe your content. Don't load pages with irrelevant words. Think about the words users would type to find your pages, and make sure that your site actually includes those words within it.

Focus on text

Focus on the text on your site. Make sure that your TITLE and ALT tags are descriptive and accurate. Since the Google crawler doesn't recognize text contained in images, avoid using graphical text; instead, place the information in the ALT text of images and in anchor text. When linking to non-HTML documents, use descriptive anchor text that accurately characterizes the linked document.
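For example, descriptive ALT and anchor text (the file names here are hypothetical) give the crawler real words to index:

<IMG SRC="/images/wiring-diagram.gif" ALT="Wiring diagram for the main cabin lighting circuit">
<A HREF="/docs/wiring-standards.pdf">Wiring standards manual (PDF)</A>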

Make your site easy to navigate

Make a site with a clear hierarchy of hypertext links. Every page should be reachable from at least one hypertext link. Offer a site map to your users with hypertext links that point to the important parts of your site. Keep the links on a given page to a reasonable number (fewer than 100).

Ensure that your site is linked

Ensure that your site is linked from all relevant sites within your network. Interlinking between sites and within sites gives the Google crawler additional paths for finding content and improves the quality of the search.

Make sure that the crawler can read your content

Validate all HTML content to ensure that the HTML is well-formed. If extra features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine crawlers may have trouble crawling your site.

Allow crawlers to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in multiple copies of the same document being indexed for your site, as crawl robots will see each unique URL (including session ID) as a unique document.

Ensure that your site's internal link structure provides a hypertext link path to all of your pages. The crawler follows hypertext links from one page to the next, so pages that are not linked to by others may be missed. Additionally, you should consult the administrator of your Google Search Appliance to ensure that your site's home page is accessible to the search engine.

Use robots standards to control search engine interaction with your content

Make use of the robots.txt file on your web server. This file tells crawlers which files and directories can or cannot be crawled, including various file types. If the search engine receives an error when requesting this file, no content on that server will be crawled. The robots.txt file will be checked on a regular basis, but changes may not have immediate results. Each port (including HTTP and HTTPS) requires its own robots.txt file.

Use robots meta tags to control whether individual documents are indexed, whether the links on a document should be crawled, and whether the document should be cached. The "NOARCHIVE" value for robots meta tags is supported by the Google search engine to block cached content, even though it is not mentioned in the robots standard.

For information on how robots.txt files and ROBOTS meta tags work, see the Using Meta Tags section below.

If the search engine is generating too much traffic on your site during peak hours, contact your Google Search Appliance administrator to customize the traffic.

Let the search engine know how fresh your content is

Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell the crawler whether your content has changed since it last crawled your site. Supporting this feature saves you bandwidth and overhead.
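The exchange looks like the following sketch (the URL and dates are hypothetical): the crawler sends the date of its last visit, and a server that supports the header can answer with a 304 status instead of resending the whole page.

GET /docs/index.html HTTP/1.1
Host: myserver.web.boeing.com
If-Modified-Since: Tue, 12 Apr 2005 08:00:00 GMT

HTTP/1.1 304 Not Modified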

Understand why some documents may be missing from the index

Each time that the search engine updates its database of web pages, the documents in the index can change. Here are a few examples of reasons why pages may not appear in the index.

  • Your content pages may have been intentionally blocked by a robots.txt file or ROBOTS meta tags.
  • Your web site was inaccessible when the crawl robot attempted to access it, due to network or server outage. If this happens, the Google Search Appliance will retry multiple times; but if the site cannot be crawled, it will not be included in the index.
  • The Google crawl robot cannot find a path of links to your site from the starting points it was given.
  • Your content pages may not be considered relevant to the query you entered. Ensure that the query terms exist on your target page.
  • Your content pages contain invalid HTML code.
  • Your content pages were manually removed from the index by the Google Search Appliance administrator.

If you still have questions, contact us to get more information.

Avoid using frames

The search engine supports frames to the extent that it can. Frames tend to cause problems with search engines, bookmarks, e-mail links and so on, because frames don't fit the conceptual model of the web (where every document corresponds to a single URL).

Searches that return framed pages will most likely only produce hits against the "body" HTML page and present it back without the original framed "Menu" or "Header" pages. Google recommends that you use tables or dynamically generate content into a single page (using ASP, JSP, PHP, etc.), instead of using FRAME tags. This will ultimately maintain the content owner's originally intended look and feel, as well as allow most search engines to properly index your content.

Avoid placing content and links in script code

Most search engines do not read any information found in SCRIPT tags within an HTML document. This means that content within script code will not be indexed, and hypertext links within script code will not be followed when crawling. When using a scripting language, make sure that your content and links are outside SCRIPT tags. Investigate alternative HTML techniques for building dynamic web pages, such as HTML layers.
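For example, a link written out by script is invisible to the crawler; duplicating it in plain HTML (a sketch with a hypothetical URL) keeps the content crawlable:

<SCRIPT type="text/javascript">
// The crawler will not see or follow this link.
document.write('<A HREF="/reports/index.html">Reports</A>');
</SCRIPT>
<NOSCRIPT>
<!-- Plain HTML that the crawler can index and follow. -->
<A HREF="/reports/index.html">Reports</A>
</NOSCRIPT>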

Ensure your development server is not being crawled

If the crawler finds two documents with exactly the same <title></title> or content, it will not index the second one. So if the crawler finds your development server before your production server, the development server is indexed first, and the production server is treated as a duplicate and left out of the index. To prevent this, put a robots.txt file at the root level of your development server blocking all access to the crawler.
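For example, the same two-line file shown in the Removing Content section below blocks all crawlers from the development server:

User-Agent: *
Disallow: /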

Edit robots.txt to block specific sub directories

If you added a directive telling the crawler to stay out of a specific directory while it was in development or testing, remember to change the robots.txt file to allow the crawler in once the directory is ready for release.
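For example, with a hypothetical /beta/ directory, releasing it means removing its Disallow line:

# While /beta/ is still in development:
User-agent: *
Disallow: /beta/

# After release, allow the crawler in:
User-agent: *
Disallow: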

Removing Content from the Enterprise Search Index

Remove your website

If you wish to exclude your entire website or a specific section (directory) of your server from the index, you can place a file at the root of your server called robots.txt.

To prevent the enterprise search crawler and other search engines from crawling your site, place the following robots.txt file in your server root:

User-Agent: *
Disallow: /

This is the standard protocol that most web crawlers observe for excluding a web server or directory from an index. More information on robots.txt is available here: http://www.robotstxt.org/wc/norobots.html.

Change the URL of your site

Since Google's crawler associates the content of a page with its URL, there is no way to manually change the URL that is displayed for your website. The URL will be updated the next time we crawl your site. The crawler revisits each site according to an automatic schedule and we cannot manually accelerate the date on which your site will be recrawled.

If the URL of your website has changed since we last crawled it, you may use the URL submission form and the URL removal methods described below. However, the URL submission form does not take effect immediately, so using the URL removal feature may leave your website inaccessible until we crawl your site again.

Instead of requesting a change, we recommend that you ask the sites currently linked to your old site to update their links (to point to your new site). Finally, if your old URLs redirect to your new site using HTTP 301 (permanent) redirects, our crawler will know to use the new URL. Changes made in this way will take 6-8 days to be reflected in enterprise search.
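For example, on an Apache web server (an assumption; other servers have equivalent directives), a single mod_alias line issues the 301 redirects for a hypothetical moved site:

Redirect permanent /oldsite http://newserver.web.boeing.com/newsite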

Remove individual pages

If you want to prevent all robots from indexing individual pages on your site, then you can place the following meta tag element into the page's HTML code:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

If you want to allow other robots to index individual pages on your site, preventing only Google's robots from indexing the pages, use the following tag:

<META NAME="googlebot" CONTENT="NOINDEX, NOFOLLOW">

More information on this standard meta tag element is available here: http://www.robotstxt.org/wc/exclusion.html#meta.

Remove snippets

A snippet is a text excerpt from the returned result page that has all query terms bolded. The excerpt allows users to see the context in which search terms appear on a web page, before clicking on the result. Users are more likely to click on a search result if it has a corresponding snippet.

If you wish to prevent Google from displaying snippets for your pages, use the following tag:

<META NAME="googlebot" CONTENT="NOSNIPPET">

Note: removing snippets also removes cached pages.

Remove cached pages

Google keeps the text of the many documents it crawls available in a cache. This allows an archived, or "cached", version of a web page to be retrieved for your end users if the original page is ever unavailable (due to temporary failure of the page's web server). The cached page appears to users exactly as it looked when Google last crawled it. The cached page also includes a message (at the top of the page) to indicate that it's a cached version of the page.

If you want to prevent all robots from archiving content on your site, use the NOARCHIVE meta tag. Place this tag in the <HEAD> section of your documents as follows:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

If you want to allow other indexing robots to archive your page's content, preventing only Google's robots from caching the page, use the following tag:

<META NAME="googlebot" CONTENT="NOARCHIVE">

Note: this tag only removes the "cached" link for the page. Google continues to index the page and display a snippet.

Remove an outdated link

The Google Appliance updates its entire index automatically on a regular basis. When we crawl the web, we find new pages, discard dead links, and update links automatically. Links that are outdated now will most likely "fade out" of our index during our next crawl.

Tips for Optimizing Content - Search Engine Optimization


Virtual hosting: Virtual hosting allows web sites to move to different machines without changing the URL; moving without it changes the URL and leads to bad links. When calculating the scores returned from the search engine, we also add points for shorter URLs, under the assumption that home pages have shorter URLs and are usually at the top of the collection.

You can check whether a 'name' you would like to use is already taken in Boeing's DNS namespace, and then contact your NAMS focal to request it:

NAMS Lookup

Use descriptive and relevant titles for every unique page: The title is what the user sees on the results page. If the title isn't descriptive, the user won't even click on it. If the title doesn't match the contents, the user will learn to mistrust the search results. Use the TITLE tag to define a web page's title.

The search engine also puts an importance weighting on Title, but other metatags are not weighted heavily. The search engine uses the other metatags as a last resort to help find information.

Term weighting: Terms in the title are weighted much higher than terms in the body. Terms in the keywords and description meta tags are considered in the weighting but have less impact than the title tags and body content.

Have text on your home page: Search engines index the text read from the various web pages they visit. If a page lacks descriptive text, then there is little chance that page will come up in the results of a search engine query.

It's not enough for that text to be in graphics. It must be HTML text. Some search engines will catalog ALT text and text in comment and meta tags. To be safe, a plain HTML description is recommended.

Have text high on your page: Tables are one way of pushing your text further down the page. The engine reads the table on the left-hand side, then works over to the text in the next column. Get your text up higher, either through meta tag use or smart design, when possible.

Pick your keywords: Focus on the two or three keywords that you think are most crucial to your site, then ensure those words appear in your title, in your description and keywords META tags, and early on your web page. Generally, most people will already have those words present on their pages but may not also have them in page titles.

Keep in mind that the keywords you consider crucial may not be exactly what users enter. The addition of just one extra word can suddenly make a site appear more relevant, and it can be impossible to anticipate what that word will be. The best bet is to focus on your chosen keywords but to also have a complete description.

Have links to inside pages: If there are no links to inside pages from the home page, some search engines will not fully catalog a site. Unfortunately, the most descriptive, relevant pages are often inside pages rather than the home page.

Make links to your top-level pages. Some search engines will enter your web site in the middle. If there are no links back to your home page, the search engine may never reach the top.

Frames can kill: The crawler cannot follow frame links, because a frameset page has no text for the search engine to index. Make sure there is an alternative way for crawlers to enter and index your site, either through meta tags or smart design.
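One such design is a NOFRAMES section: it gives crawlers (and non-frame browsers) text and a link path even though the frameset itself carries none. A minimal sketch, with hypothetical file names:

<FRAMESET cols="25%,75%">
<FRAME SRC="menu.html">
<FRAME SRC="body.html" NAME="body">
<NOFRAMES>
<!-- Text and links here are visible to crawlers and non-frame browsers. -->
<P>Site contents: <A HREF="body.html">prices, history, and trading information</A>.</P>
</NOFRAMES>
</FRAMESET>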

The Meta Myth: Meta tags will help you control your site's description in engines that support them. They will NOT guarantee that your site appears first. Adding some meta description code is not a magic bullet that cures your site of dismal rankings. For more information, see the tips on using meta tags.

Using Meta Tags


Before we start, let's make it clear:
Meta tags are not a magic solution.

Please note that the information below is for general metatag reference. The Google Search Appliance within Boeing DOES NOT use metatags as an overriding mechanism for page content. Therefore, DO NOT rely on metatags to affect the search ranking of your web pages.

Meta tags provide a useful way to control your summary in some search engines.

Meta tags can also help you provide keywords and descriptions on pages that for various reasons lack text. Examples are splash pages and frames pages. They might also boost your page's relevancy. However, simply including a meta tag is not a guarantee that your page should suddenly leap to the top of every search engine listing. They are a useful tool but, as said above, not a magic solution.

These are the META tags that have been adopted as Boeing standard.

  • title
  • date
  • description
  • subject & keywords
  • creator & author
  • owner & publisher
  • validuntil

Keywords are free-form terms that authors provide, whereas subject is a smaller set of controlled keywords chosen from a list. The smaller the controlled set, the easier it is to find something.

creator & author

Boeing chose creator, but author is also used heavily in authoring packages.

Although an author writes a document, they do not author a photograph, so it was thought that creator might apply better to more than just documents.

owner & publisher

Boeing chose owner, but publisher is displayed in the output of the search engine.
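Putting the standard tags together, a page head might look like the following sketch (all values, and the date and validuntil formats, are hypothetical illustrations, not a Boeing specification):

<HEAD>
<TITLE>Stamp Collecting Home</TITLE>
<META name="date" content="2005-04-12">
<META name="description" content="Prices, history, and trading information for stamp collectors.">
<META name="subject" content="stamps, collecting">
<META name="keywords" content="stamp, collecting, stamp collecting, stamps for sale">
<META name="creator" content="Jane Doe">
<META name="owner" content="Jane Doe">
<META name="validuntil" content="2006-04-12">
</HEAD>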

There are several meta tags, but the most important for search engine indexing are the description and keywords tags. The description tag returns a description of the page in place of the summary the engine would ordinarily create. The keywords tag provides keywords for the engine to associate with your page.

Before getting into further specifics, let's assume you have a page without the tags. The page is titled "My World", with a header that says "Welcome to My World", then a giant graphic image, then a link at the bottom that says "enter". Engines that index this daring creation will probably return a listing like this:

My World
Welcome to My World

Now let's fix it. Let's assume that within "My World" is a site chock full of information about stamp collecting. Here visitors can find out about stamp prices, stamp conventions, stamps for sale and trade, the history of stamps and much more. We'll use the meta tags to communicate this without destroying the image you've worked so hard to create. The tags go inside the header tags, so that everything looks like this:

<HEAD>
<TITLE>My World</TITLE>
<META name="description" content="Everything you wanted to know about stamps, from prices to history.">
<META name="keywords" content="stamp, collecting, stamp collecting, stamps for sale">
<META name="subject" content="stamp, collecting, stamp collecting, stamps for sale">
</HEAD>

Now your listing will look something like this in search engines that support the descriptions tag:

My World
Everything you wanted to know about stamps, from prices to history.

Notice how the description matches what's in the description tag? That's exactly what the tag does. It lets you return the exact description you want to appear.

What about the keywords tag? It now gives your page a chance to come up if someone types in any of the words listed. For example, someone might enter "stamp collecting," which will match with one of the keywords in the keywords tag. Without that tag, there would be no chance at all, since "stamp collecting" doesn't appear on the page or in the description tag.

Should you have different variations for keywords, such as shown in the example? Having "stamp collecting" together as a word vs. "stamp" and "collecting" can help if someone is searching for the exact phrase "stamp collecting."

Remember, you are using these tags to help make up for the lack of text on your pages, not as a way to successfully anticipate every keyword variation a person might enter into a search engine. The only hope you have of ever doing that is to have good, descriptive pages with good titles and text that is not buried at the bottom of the page by JavaScript, frame tags, or tables. The meta tags are a tool to get around these problems.

One other meta tag worth mentioning is the robots tag. This is different from the robots.txt file. It lets you specify that a particular page should not be indexed, and/or that its links should not be followed by a search engine. The format is like this:

<META NAME="robots" CONTENT="directive">

where directive is one of the following (the default is all):

  • noindex - do not index this page
  • nofollow - do not follow the links on this page
  • noarchive - do not cache the page
  • none - do not index or follow the links on this page (Not officially supported by the Google search appliance, but tested and works)
  • all - index and follow the links on this page (Not officially supported by the Google search appliance, but tested and works)
  • index - index this page (Not currently supported by the Google search appliance)
  • follow - follow the links on this page (Not currently supported by the Google search appliance)

For example, to index a page but not follow its links or archive it:

<META NAME="robots" CONTENT="noarchive, nofollow">

To sum up, meta tags are not a magic solution. They are not the secret method that some people will tell you assures success. They are more or less another design element you can tap into, a crutch for helping information-poor pages better be acknowledged by the search engines.

More information:

Using a robots.txt file to control how your site is indexed

A robots.txt file controls how crawlers interact with your site. It is usually maintained by the web administrator, who has access to the root or top level of the web server.

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or the URL "/foo.html":

# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "googlebot":

# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# googlebot knows where to go.
User-agent: googlebot
Disallow:

This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /

This example "/robots.txt" file specifies that you want Google production intra-Boeing as the only spider, but to block test subdirectory:

User-agent: googlebot
Disallow: /test/

User-agent: *
Disallow: /

This example "/robots.txt" file specifies that you want Google development intra-Boeing as the only spider, but to block test subdirectory:

User-agent: googlebot-dev
Disallow: /test/

User-agent: *
Disallow: /

Finally, some crawlers now support an additional field called "Allow:". As its name implies, "Allow:" lets you explicitly dictate which files and folders can be crawled. However, this field is not currently part of the "robots.txt" protocol, so use it only if absolutely needed, as it might confuse other crawlers. The following is the preferred way to disallow all crawlers from your site while still allowing them to index the /test/ directory (note that the Allow line belongs in the same record as the Disallow):

User-agent: *
Allow: /test/
Disallow: /

To check your site's robots.txt file (if your site has already established one), enter http://your-site-address/robots.txt (e.g., http://search.boeing.com/robots.txt).

For more information about robots.txt files, see robotstxt.org.

Implementing Site Search

The following modules of html code provide a simple way to implement a search feature into your website. If you have not registered your site or it is not showing up when you conduct a search, please register your site first before trying to implement this code. Copy and paste the code that best fits your requirements and modify as necessary. If you have any questions, contact Enterprise Search Services.
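A note on the common parameters, as they appear in the code below: q (or the advanced-search variants as_q, as_oq, and as_eq) carries the query terms; sitesearch restricts results to a single host; and the hidden client, output, proxystylesheet, and site inputs select the Boeing front end and collection, so they should normally be copied as-is. Modify only the visible inputs and the restriction values.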

Site Search Options


Site Search without site displayed

<form name="sites" method="GET" action="http://googleweb.cs.boeing.com/search">
<input type="text" name="q" size="20" maxlength="256" value=""> 
<input type="hidden" name="sitesearch" value="search.boeing.com">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="submit" name="btnG" value="Search">
</form>

Google multi-URL Site Search Code:

<form name="gs" method="GET" action="http://googleweb.cs.boeing.com/search">
Search the Gartner Group and IntraGiga web sites <br />
<input type="text" name="as_q" size="40" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="hidden" name="as_oq" value="site:librarypubs.web.boeing.com/gartner site:librarypubs.web.boeing.com/intragiga">
<br />
<input type="submit" name="btnG" value="Search">
<input type="reset" VALUE="Clear">
</form>


Google site only Site Search code:

<form name="gs" method="GET" action="http://googleweb.cs.boeing.com/search">
Search the Library main site <br />
<input type="text" name="as_q" size="40" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="hidden" name="as_oq" value="site:library.web.boeing.com">
<br />
<input type="submit" name="btnG" value="Search">
<input type="reset" VALUE="Clear">
</form>


Google site-URL code
uses both URL and site restrictions

<form name="gs" method="GET" action="http://googleweb.cs.boeing.com/search">
Search the Library main and IntraGiga web sites <br />
<input type="text" name="as_q" size="40" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="hidden" name="as_oq" value="site:library.web.boeing.com inurl:librarypubs.web.boeing.com/intragiga">
<br />
<input type="submit" name="btnG" value="Search">
</form>


Google +URL -URL code
uses include and exclude URL restrictions
Note that the as_oq parameter carries the include (OR) terms and as_eq carries the exclude terms.

<form name="gs" method="GET" action="http://googleweb.cs.boeing.com/search">
<input type="text" name="as_q" size="10" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="hidden" name="as_oq" value="+inurl:catdev.web.boeing.com/docs/cleardoc/">
<input type="hidden" name="as_eq" value="inurl:&quot;catdev.web.boeing.com/docs/cleardoc/cpf_3.2.1">
<input type="submit" name="btnG" value="Search"></td>
</form>
<input type="reset" VALUE="Clear">

Search the Boeing Web with Google Code (frames)

<form name="gbs" method="GET" action="http://googleweb.cs.boeing.com/search"
target="body">
Search the Boeing Web with Google<br>
<input type="text" name="q" size="40" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<br />
<input type="submit" name="btnG" value="Search">
<input type="reset" VALUE="Clear">
</form>


Search the Boeing Web with Google Code

<form name="gbs" method="GET" action="http://googleweb.cs.boeing.com/search">
Search the Boeing Web with Google<br>
<input type="text" name="q" size="40" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<br />
<input type="submit" name="btnG" value="Search">
<input type="reset" VALUE="Clear">
</form>


Search the Boeing Web with Google Code (result count and sort options)

<form name="gbs" method="GET" action="http://googleweb.cs.boeing.com/search">
Search the Boeing Web with Google<br>
<input type="text" name="q" size="40" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<p>
<select name="num">
<option value="10" selected>10 results</option>
<option value="20">20 results</option>
<option value="30">30 results</option>
<option value="50">50 results</option>
<option value="100">100 results</option>
</select>
<p>Sort 
<input type="radio" name="sort" value="date:D:L:d1" checked>by Relevance
<input type="radio" name="sort" value="date:D:S:d1">by Date
<br />
<input type="submit" name="btnG" value="Search">
<input type="reset" VALUE="Clear">
</form>


Google site search with drop-down list choice code

<FORM name="gbpd" ACTION="http://googleweb.cs.boeing.com/search" METHOD="GET">
Enter Search term(s) for selected location<br>
<input type="text" name="as_q" size="40" maxlength="256" value="">
<br />Select your location<br>
<SELECT NAME="as_oq" size="1">
<option VALUE="site:catdev.web.boeing.com/">Search all documents on CATDEV
<option VALUE="site:aes.web.boeing.com/">Application Environment Support (/aes)
<option VALUE="site:catdev.web.boeing.com/dev">CASCADE System Management (/dev)
<option VALUE="site:catdev.web.boeing.com/dev/carp">CATIA Application Release Process (/dev/carp)
<option VALUE="site:catdev.web.boeing.com/cm">Change Management (/cm)
<option VALUE="site:catdev.web.boeing.com/cmc">Computing Management Center (/cmc)
<option VALUE="site:catdev.web.boeing.com/oop">Object Oriented Technology (/oop)
<option VALUE="site:catdev.web.boeing.com/docs/porting_docs">Porting Documents (/docs/porting_docs)
<option VALUE="site:catdev.web.boeing.com/tools">Tools (/tools)
<option VALUE="site:catdev.web.boeing.com">Web Server Root (/)
</select>
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<br />
<input type="submit" name="btnG" value="Search">
<input type="reset" VALUE="Clear">
</form>


Google site/Boeing web search radio button choice code

<form name="msrb" method="GET" action="http://googleweb.cs.boeing.com/search">
<strong>Search:</strong>
<br />
<input type="radio" name="as_oq" value="+site:search.boeing.com +site:richhand.web.boeing.com" checked>Rich's Sites </input>
<br />
<input type="radio" name="as_oq" value="">The Boeing Web </input>
<br />
<input type="text" name="q" size="20" maxlength="256" value="">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="submit" name="btnG" value="Search">

Customizing Your Search and Results Pages

<form name="msrb" method="GET" action="http://googleweb.cs.boeing.com/search">
<br />Search:<br>
<input type="radio" name="sitesearch" value="search.boeing.com" checked>search.boeing.com
<br />
<input type="radio" name="sitesearch" value="" >The Boeing Web
<br />
<input type="text" name="q" size="20" maxlength="256" value="">&nbsp;
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="http://search.boeing.com/xslt-template.xsl">
<input type="hidden" name="site" value="boeing">
<input type="submit" name="btnG" value="Search">
</form>

Site Search on a specific server port

<form name="msrb" method="GET" action="http://googleweb.cs.boeing.com/search">
<input type="text" name="q" size="20" maxlength="256" value=""> 
<input type="hidden" name="as_q" value="inurl:9090">
<input type="hidden" name="sitesearch" value="bearprod.ca.boeing.com:9090">
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<input type="submit" name="btnG" value="Search">
</form>

Search restricted to a particular type of file

<form name="sites" method="GET" action="http://googleweb.cs.boeing.com/search">
<input type="text" name="q" size="20" maxlength="256" value=""> 
<input type="hidden" name="client" value="boeing">
<input type="hidden" name="output" value="xml_no_dtd">
<input type="hidden" name="proxystylesheet" value="boeing">
<input type="hidden" name="site" value="boeing">
<SELECT NAME="as_q" size="-1">
<option VALUE="" selected>Select a file suffix
<option VALUE="filetype:doc">.doc
<option VALUE="filetype:xls">.xls
<option VALUE="filetype:pdf">.pdf
<option VALUE="filetype:ppt">.ppt
</select>
<br />
<input type="submit" name="btnG" value="Search">
</form>


About Secure Search

Secure search enables secure repositories to be crawled, indexed, and served by the Google Search Appliance. Secure results are mixed with public search results when the "public and secure" radio button is selected (the buttons underneath the Boeing Search search box). Secure repositories that are supported include Microsoft SharePoint and secure content served from web servers using Basic or NTLM Authentication. If you are unsure of how your content is being protected, ask your web server administrator. At this time, Web Single Sign On (WSSO) protected content is not supported.

Be sure to review the End User Security Handbook or check with Computing Security about issues on protecting secure information.

Enabling Secure Search for Microsoft SharePoint

To add secure content from a Microsoft SharePoint site to Boeing Search, first register the URL of the site. The confirmation e-mail will contain the name of the Boeing Search crawler account. Grant this account read access to the registered site. Then allow approximately one day for SharePoint content to appear alongside public results when the secure content is relevant to the search query.

Enabling Secure Search for Web Content

Web content includes web pages and/or documents served by a web server and protected using Basic or NTLM Authentication. A common use for secure search with web content is to enable documents in a file share on a server to be crawled, indexed, and served. The share can be located on a separate server from the web server that is serving the content.

To enable secure search for web content, follow these instructions for configuring the web server:

  • Set up a web page that has links pointing to the actual document(s), OR ...
  • If you have many documents and don't want to create a long list of links on a single web page, set up a virtual directory and point its physical location to the directory that contains all the files. Also, turn on "directory browsing" for that virtual directory.
  • Turn off "Anonymous" access.

Configuring a web server in this way passes authorization through to the share or folder's Access Control List. There are then two levels at which the documents can be protected:

  • Protect the documents at the directory level: Set the permissions at the directory level, and let the sub-directories and files inherit the permissions from the parent directory.
  • Protect the documents at the file level: Set the permissions down to the file level.

The final step is to get the secure content crawled and indexed. If it is a new website, register the URL and follow the instructions in the confirmation e-mail to grant the crawler access. If it is a site that is already indexed, add a link to the repository somewhere on the website and then ask the ESS Team for the crawler account. This account will need to be granted read access to the relevant directories. Allow approximately one day for newly crawled content to appear in Boeing Search.

Potential Issues

Reasons why secure documents may not be showing up in the results list:

  • You do not have access to the documents that are not showing up.
  • The mechanism for determining access is an HTTP HEAD request issued at search time, passing the searching user's credentials to the web server where the documents are located. Some web servers have HEAD requests disabled (sometimes by default) and only allow GET requests. Check the web server to see whether HEAD requests are enabled.
  • Since the Google Search Appliance checks access rights at search time, it must respond quickly to the searching user. When the system performs the HTTP HEAD request, it can only wait about 2-3 seconds for the web server to respond. If the web server cannot respond in the allotted time, the search interface does not display the results for which it did not get valid responses. An overloaded or slow web server will therefore cause results to be left out of the display.

Frequently Asked Questions

Does Google flag secure documents as being secured in the index? If so, how is that done?

The Google Search Appliance does mark the documents as secure in the index. Documents flagged as secure are checked at query time by issuing a request to the web server with the credentials of the user doing the search.

What is the timeout for obtaining access rights? Is that timeout per result or per result page?

That timeout is per result. The operation is multi-threaded so that the system can check several results at the same time to speed up the time it takes to display the results to the user. If the web server does not respond within the timeout, the result will not be shown to the user.

Does Google keep any Access Control Lists (ACLs) in its cache or index? If a document's security changes after it is indexed, how does that affect current access? For example, a document is indexed while public, then is set up as secure but not reindexed. Does it appear in public results?

Since the document's ACLs are not kept in the search index, the Google Search Appliance does not suffer from this problem. Some caching is done to ensure that the system does not have to issue requests for every result if the same results reappear in the user's subsequent queries. However, once the cache expires, the appliance will again go to the web server to verify the user's authorization to view the result; if the security level has been updated, the user will no longer be able to see it.

Support

If you have questions or need support, please contact us.