Posts Tagged 'Data Mining'

Word Order Matters

The following is reposted from the ResearchBuzz blog for January 13, 2010 by Tara Calishain.


Doing Real Time Search? Watch Your Word Order

Posted: 13 Jan 2010 04:36 AM PST

If you’ve been reading ResearchBuzz for a while, you probably know that the way you enter your search terms in Google makes a difference.  If you enter words in one order, you may very well get a different result count and a different order to the results you get back.  (Try searching Google for scratching post and post scratching to get an idea of what I’m talking about.)

I have used this knowledge to my benefit over the years, when I needed to narrow down search results or just get a different perspective on what was available.  When Google’s new real-time search came out, I assumed word order would no longer make a difference.  After all, real-time search is just that: the latest and greatest material that Google is adding to its index.  The stream should be the stream, right?  No matter what kind of word order you use.

Turns out that’s incorrect; Google does change the real time search results based on your word order.  That’s okay, but it does mean if you’re looking for real-time data you may want to play around with your word order, especially if you’re searching for words that don’t make a common phrase.

Let’s take an example.  I’m interested in a Ford Taurus, and I want to see what kind of real-time buy/sell activity there is out there.  I do a Google search for Ford Taurus and pay attention to the latest results.


I’m getting the “latest” results, and the list looks very much like a Google search result except the results show how recently the content was indexed.  The result count for this search, at this writing, is 4,250,000.  You’ll also notice that the left nav gives you related searches, mostly other car models.

Now take that search and turn it around.  Just turn it and do a search for Taurus Ford.  Your search results now look like this:


You’ll note that the related searches are gone, the result count has shot up to about 6,670,000, and the order of the search results has shifted a little bit.

Now, is this bad?  No, of course not.  But if you’re really working in the live search and you want to make sure you get as many search results as you can, you’re going to have to run multiple searches of multi-word queries.
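
If you want to try every word order without typing each one by hand, a short script can list them for you.  The following is only a minimal sketch in Python, not something from the original post; the function name word_order_variants is made up for illustration, and the script only prints the query variants; it does not contact Google.

```python
from itertools import permutations


def word_order_variants(query):
    """Return every distinct ordering of the words in a query."""
    words = query.split()
    variants = []
    for ordering in permutations(words):
        variant = " ".join(ordering)
        if variant not in variants:  # skip repeats when a word occurs twice
            variants.append(variant)
    return variants


# Print each ordering so it can be pasted into a search box.
for variant in word_order_variants("Ford Taurus"):
    print(variant)
```

Keep in mind that the number of orderings grows quickly: a three-word query has six, and a four-word query has twenty-four, so this is most practical for short queries.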

Word order shows a lot of difference when the words make up a phrase.  If you do a search for search engine, at this writing you’ll get about 315,000,000 results along with some Twitter tweets.  If you change the search to engine search, the result count drops to 109,000,000, the results shift around a lot, and only one tweet appears, way down at the bottom of the page.

I remember being astonished when search engines hit a billion pages of indexed content, but that’s nothing these days.  The name of the game continues to be narrowing down your results to get the information you need and approaching a search problem from different angles.  You can create a different angle just by changing the word order in your query, even in Google’s real-time search; try it!

The following is reposted from the Scout Report for December 4, 2009:

Wading through the tremendous online resource that is the BioMed archive can be a bit tricky at times. This process just got much easier through the creation of the BioMedSearch feature. The goal of this work is “to make these important works available to the community in a way that is fast and easy, while still offering the advanced features demanded by power users such as portfolios, collaboration features, bibliographical citation export, alerts, and more.”  Their search engine contains all of the data in PubMed/MEDLINE, along with additional full-text documents, and a large database of theses and dissertations. Many users will find the “Clusters” section of the site most useful. Here, visitors can view “clusters” of documents grouped together thematically into topics such as clinical trials, exercises, diet and cholesterol, and medical imaging. The homepage contains a basic search engine, and visitors may also wish to use the “Search Tutorial” to gain a better understanding of how best to use the archive.

This research tool will be particularly useful for biomedical engineers.  Access BioMedSearch at

Info Tool of the Week: Alerting Services

Do you feel overwhelmed by all the information available in your subject area?  Are you frustrated by never having enough time to scan the journals in your field — or even to know when new issues become available?  Do you wish there was a way to get all this information organized and delivered to your desktop?  Well — there is!

Many journal publishers and journal article database providers provide alerting services for their products.  These services allow you to have the tables of contents (TOC) of each new issue of a journal delivered to your E-mail inbox or RSS feed as soon as it becomes available — sometimes even before the print version of the journal hits the library shelves.  This allows you to keep up with all the journals you read regularly or whose TOCs you want to scan for useful material — all without ever leaving your lab or office.

While it is possible to set up such alerts individually at a publisher’s web site, many researchers are using an alert service that aggregates thousands of titles into a single location.  One such service is called ticTOCs (see )  This British-based alerting service offers you nearly 13,000 scholarly journals to choose from, along with links to view TOCs at their site or to set up RSS feeds for ones of particular interest, so you’ll always know when a new issue of your favorite journal becomes available.  Check the site link above for more information or come by the Library for a demo or to explore other options.

Another kind of alert you may want to consider is a subject alert.  For this type of alert you would construct a subject search in the online database(s) of your choice, then save it as an alert.  After that, the database provider will automatically run your saved search weekly or monthly and send to you via E-mail or RSS feed any new results that have been posted to the database since your search was last run.  Most of the library’s major databases offer this service — check with a librarian for assistance in constructing and setting up your alerts or if you have questions about subject alerts.
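
The idea behind a subject alert, sending only what is new since the last run, can be pictured with a few lines of Python.  This is just an illustrative sketch, not how any particular database provider implements its alerts; the function name new_results and the record IDs are invented for the example.

```python
def new_results(current_ids, seen_ids):
    """Return the IDs in the current run that earlier runs did not return."""
    return [rid for rid in current_ids if rid not in seen_ids]


seen = {"rec-101", "rec-102"}               # hypothetical IDs from the last run
latest = ["rec-101", "rec-102", "rec-103"]  # hypothetical IDs from this run

fresh = new_results(latest, seen)
print(fresh)         # only these would be sent in the alert
seen.update(fresh)   # remember them so the next run skips them
```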

TIP:  You can learn more about alerting services from the Library’s Keeping Up with Current Scholarship subject guide.

TIP: You can set up alerts for books, as well as for journal articles!  VIRGO, the Library’s online catalog, provides a way to set up alerts for authors and subjects that interest you, so that you can be notified when new materials are added to our collections.  Use the “Login to VIRGO” option under “Services”, do a search, then use the options on the left side of the page to set up your alerts.  Ask at any library for assistance or details!

TIP:  Did you know you can set up alerts for web searches, too?  Absolutely!  Web search engines such as Google, and Intute all offer some kind of alerting service for searches conducted in their databases.  So now you can keep up with new material in web form as well as published articles!

FINAL TIP:  Even with alerts you can become overwhelmed with information.  One key is to limit your TOC alerts to just those key journals you find consistently most useful.  Then use subject alerts to keep up with everything else, but make sure your subject alerts are precisely crafted to return only the most useful items.  For help constructing searches or using databases effectively, please contact a librarian or come by any UVa library for assistance!

Upcoming RCL Short Course

“Intro to Logistic Regression”

Kathy Gerber
Research Computing Support Specialist
Thursday, October 15, 2009, from 12:00 p.m. – 1:00 p.m.
In the Brown Science and Engineering Library Electronic Classroom

This session will provide an overview of logistic regression, as it pertains to quantitative statistics.

You can register for this course at

This session is part of the Fall 2009 Research Computing Lab Short Course Series

Canadian Scientific and Technical Data

The National Research Council of Canada recently launched a new Gateway to Scientific Data which provides access to Canadian scientific, technical and medical data sets as well as information for scientists on best practices for managing data.  The Scientific Data Sets cover a wide range of subjects including Aerospace, Biochemistry, Environment, Geosciences, Physics, and Thermodynamics among others.  To learn more or search the data sets visit

Dating Google

The following is taken from The Internet Tourbus, Vol. 15, No. 57, September 17, 2009:

When you search for something in Google, you normally have no idea how old the results are.  The highest-ranked results may be obsolete or not relevant to you, because of Google’s page-ranking criteria.  Wouldn’t it be cool if Google let you find documents based on their age, or search for Web pages that were created in a specific date range?

Thanks to a relatively new Google search feature, now you can search Google by date and time. Here’s the scoop on date-based searching…


Mapping Virginia Communities Workshop: An Introduction to GIS and Community Analysis

Richmond: October 9th, 2009

Computer Services and Training – 1516 Willow Lawn Drive, Suite 100 Richmond, VA, 23230

More Info/Registration: 

Audience:  Beginners, students, anyone interested in mapping their community. 
Already taken this workshop?  Now offering ArcGIS Training: Refresher and Advanced Classes (see website for more information)

Participants will learn to use ArcGIS 9.3.1

Research Computing Lab Short Courses

The Brown Library Research Computing Lab is pleased to announce its schedule of fall short courses and events.  These include (but are not limited to) topics such as Matlab, parallel computing, tools for discovering patterns in data, LaTeX, SAS and managing the research data lifecycle.  You can find more information on the RCL Short Courses web page.

The purpose of this site is to increase public access to high value, machine readable datasets generated by the Executive Branch of the U.S. Federal Government.  The site includes searchable data catalogs providing access to data in three ways: through the “raw” data catalog, the tool catalog and the geodata catalog.  It links to several hundred data sets in all subject areas, including science and technology.  The “Raw” Data Catalog provides an instant download of machine readable, platform-independent datasets, while the Tools Catalog provides hyperlinks to tools that allow you to mine datasets.
