How to improve product search quality -seminar II v0.5 Leon Lee

2. Review and Summary Last Seminar
3. Principle of Engineering and Research
4. About Semantic Web
5. Current Hot, Advanced Search Engine Site
6. Possible Direction of combination of AI and SE
7. Identify Our Position
8. Improving Search Quality Plan draft v0.5
10. Other Topic
2. Review and summary of last seminar
1. Basics of information retrieval
Why we need dynamic relevance scoring/ranking.
2. Scoring and boosting in Lucene.
How to boost an important field.
3. No significant improvement from thesaurus-based query expansion
May decrease precision, particularly with ambiguous terms.
4. Metadata/search strategy requirement from Bin
abstracted as "select * from index where property1 = word1
and property2 = word2"
example: ibm t61?
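A minimal sketch of items 2 and 4 above, written against a toy in-memory index rather than Lucene itself. The tf and idf shapes follow Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(N / (df + 1))); the documents, field names, boost values, and function names are all hypothetical and only illustrate how a boosted field changes the ranking for a query such as "ibm t61".

```python
import math
from collections import Counter

# Hypothetical toy documents; field names and contents are illustrative only.
DOCS = [
    {"brand": "ibm", "model": "t61", "description": "thinkpad t61 laptop from ibm"},
    {"brand": "lenovo", "model": "t400", "description": "successor to the ibm t61"},
    {"brand": "ibm", "model": "x60", "description": "compact ibm notebook"},
]

# Assumed boosts: a match in brand or model counts more than one in description.
FIELD_BOOSTS = {"brand": 3.0, "model": 3.0, "description": 1.0}


def tokenize(text):
    return text.lower().split()


def idf(term, field):
    """Classic Lucene-style idf: 1 + ln(N / (df + 1))."""
    df = sum(1 for d in DOCS if term in tokenize(d[field]))
    return 1.0 + math.log(len(DOCS) / (df + 1))


def score(query, doc):
    """Sum boost * sqrt(tf) * idf^2 over every matching (term, field) pair."""
    total = 0.0
    for term in tokenize(query):
        for field, boost in FIELD_BOOSTS.items():
            freq = Counter(tokenize(doc[field]))[term]
            if freq:
                total += boost * math.sqrt(freq) * idf(term, field) ** 2
    return total


def search(query):
    """Return (brand, model, score) tuples, best match first."""
    ranked = sorted(DOCS, key=lambda d: score(query, d), reverse=True)
    return [(d["brand"], d["model"], round(score(query, d), 3)) for d in ranked]


def fielded_match(**constraints):
    """The 'select * from index where property1 = word1 and ...' abstraction."""
    return [d for d in DOCS if all(d.get(f) == v for f, v in constraints.items())]
```

With these assumed boosts, search("ibm t61") ranks the document whose brand and model both match ahead of documents that only mention the terms in the description, and fielded_match(brand="ibm", model="t61") returns only the exact property match.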
3. Principle of Engineering and Research
1. KISS: keep it simple, stupid
2. Simple Intuition
by naive examples and solid foundation of theory and
3. Step by Step
4. Neutral academic viewpoint.
5. We Are Our Own Worst Enemy.
6. Communication
4. About Semantic Web
1. What is the Semantic Web?
Discussion on smth bbs
2. DBpedia and Wikipedia
3. Cyc
4. GoodRelations
GoodRelations: An ontology for linking product descriptions
and business entities on the Web
5. HowNet, Hierarchical Network of Concepts in China
XML, RDF, RDFa, Rules, Inference ...
Semantic != Semantic Web/Net
Semantic nowadays = understanding content by Machine
Learning, Natural Language Processing, Information Retrieval,
Data Mining ... techniques + Linked Data?
5. Current Hot, Advanced Search Engine Site & Techniques
What kinds of techniques do the following sites use?
(1). Wolfram|Alpha
What is the core technology of Wolfram|Alpha?
A computational knowledge engine.
Four key general areas are the data curation pipeline, the
algorithmic computation system, the linguistic processing
system, and the automated presentation system.
Wolfram|Alpha computes answers to specific questions using
its built-in knowledge base and algorithms.
n-grams "it was the best of times it was the worst of times"
unicode 8900 to 8915
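The n-grams example query above can be reproduced locally with a few lines of Python. This is only a sketch of the general notion of word n-grams; the function name is hypothetical, and nothing here reflects how Wolfram|Alpha computes or presents its answer.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text, as tuples, in reading order."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "it was the best of times it was the worst of times"

# Count how often each bigram (n = 2) occurs in the sentence.
bigram_counts = Counter(word_ngrams(sentence, 2))
```

The sentence has 12 words, so it yields 11 bigrams; repeated pairs such as ("it", "was") and ("of", "times") each occur twice.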
Instead of the common “listing” of Web search queries, Yebol
automatically clusters and categorizes search terms, Web sites,
pages and contents.
Yebol allows for a multi-dimensional search result instead of
the normal one-dimensional search seen by most web search
engines today. This provides a more accurate summary of top
sites and categories; a wider array of related search terms; a
longer and richer expansion for query results; and a deeper
base of links and keywords in search result pages.
Unlike current search platforms, Yebol provides hundreds of
easily identified and accessibly categorized results in one easily
navigable page.
Yebol uses a combination of algorithms and human knowledge
to build a revolutionary web directory for each search term.
The Yebol engine clusters search results into groups of term-specific categories.
(2). Powerset
In the search box, you can express yourself in keywords,
phrases, or simple questions. On the search results page,
Powerset gives more accurate results, often answering
questions directly, and aggregates information from across
multiple articles
Powerset is working on building a natural language search
engine that can find targeted answers to user questions (as
opposed to keyword based search).
For example, when confronted with a question of the form
'which U.S. state has the highest income tax?', conventional
search engines ignore the question and instead do a search on
the keywords 'state, income and tax'. Powerset on the other
hand, attempts to use natural language processing to
understand the nature of the question and return pages
containing the answer.
(3). Freebase
Freebase is an open database of the world’s information. It’s built
by the community and for the community – free for anyone to
query, contribute to, build applications on top of, or integrate
into their websites.
It contains structured information on many popular topics,
including movies, music, people and locations – all reconciled
and freely available via an open API.
Under the hood Freebase is a graph database. This means that
instead of using tables and keys to define data structures,
Freebase defines its data structure as a set of nodes and a set
of links that...
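To make the nodes-and-links idea concrete, here is a minimal sketch of a graph store in Python. It is not Freebase's API or storage format; the class name, method names, relation name, and sample facts are hypothetical and chosen only to contrast link traversal with table-and-key lookups.

```python
from collections import defaultdict


class TinyGraph:
    """Stores data as labeled links between nodes instead of rows in tables."""

    def __init__(self):
        # (source node, relation) -> set of target nodes
        self.links = defaultdict(set)

    def add_link(self, source, relation, target):
        self.links[(source, relation)].add(target)

    def targets(self, source, relation):
        """Follow a link forward: which nodes does source point to?"""
        return sorted(self.links.get((source, relation), set()))

    def sources(self, relation, target):
        """Follow a link backward: which nodes point to target?"""
        return sorted(
            s for (s, r), ts in self.links.items() if r == relation and target in ts
        )


# Illustrative data only.
g = TinyGraph()
g.add_link("Blade Runner", "directed_by", "Ridley Scott")
g.add_link("Alien", "directed_by", "Ridley Scott")
```

New kinds of relations can be added at any time by calling add_link with a new relation name, without changing a schema, which is the main practical difference from a fixed table layout.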
(4). Evri
You'll find profiles on people, places, books, movies,
companies, events, organizations, teams, products -- all kinds
of stuff. And we're adding new ones all the time. You can find
quick facts and summary information, as well as recommended
articles, news, blog posts, photos, tweets, and videos. On each
page, we'll also highlight the top connections for each topic,
and make it easy to browse and discover more relevant
Evri’s technology automates connections between Web content
by applying a more human-like understanding of the words on
the page.
News, Tweets, Images, Quotes, Videos, Description from
Wikipedia, Products from, ...
(5). TextRunner search (relation extraction from web pages)
(6). Opinion Analyzer. Search, Rate and Compare
(7). ProductWiki.
ProductWiki is the resource for free, unbiased product reports
written by a dedicated community.
(8). Online Dictionary, Encyclopedia
(9). Wikipedia, DBpedia
(10). Google Base
a place where you can easily submit all types of online and offline content,
which we'll make searchable on Google (if your content isn't online yet, we'll put
it there). You can describe any item you post with attributes, which will help
people find it when they do related searches.
(11). Yahoo SearchMonkey
Share structured data with Yahoo! Search to display a standard enhanced
(XML, RDFa, feed, GoodRelations, ...)
(12). Google AdWords, AdSense.
HTM technology has the potential to solve many difficult problems in
machine learning, inference, and prediction. Some of the application
areas we are exploring with our customers include recognizing objects
in images, recognizing behaviors in videos, identifying the gender of a
speaker, predicting traffic patterns, doing optical character recognition
on messy text, evaluating medical images, and predicting click through
patterns on the web.
(15). Product search engines:
Basic functions:
Filter by category, brand, price range, stores.
Additional information: reviews.
Advanced product search: phrase queries, boolean queries, search for words that
occur in specific fields.
SafeSearch: many users prefer not to have adult sites included in search
results (especially if kids use the same computer).
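The basic filter functions listed above can be sketched as a single function over a product list. This is an illustration only: the Product fields, the filter_products signature, and the catalog entries (including prices and store names) are assumptions, not the schema of any real product search engine.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Product:
    name: str
    category: str
    brand: str
    price: float
    store: str


def filter_products(
    products: Iterable[Product],
    category: Optional[str] = None,
    brand: Optional[str] = None,
    min_price: Optional[float] = None,
    max_price: Optional[float] = None,
    stores: Optional[Set[str]] = None,
) -> List[Product]:
    """Keep products that satisfy every filter that was actually supplied."""
    result = []
    for p in products:
        if category is not None and p.category != category:
            continue
        if brand is not None and p.brand != brand:
            continue
        if min_price is not None and p.price < min_price:
            continue
        if max_price is not None and p.price > max_price:
            continue
        if stores is not None and p.store not in stores:
            continue
        result.append(p)
    return result


# Hypothetical catalog, for illustration only.
CATALOG = [
    Product("ThinkPad T61", "laptop", "ibm", 899.0, "store-a"),
    Product("ThinkPad X60", "laptop", "ibm", 1199.0, "store-b"),
    Product("PowerShot A590", "camera", "canon", 149.0, "store-a"),
]
```

For example, filter_products(CATALOG, category="laptop", max_price=1000) keeps only the first entry. Filters left as None are simply not applied, so the same function serves both the basic and the narrower searches.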
6. Possible Direction of combination of AI and SE
(1). ML, IR, NLP, DM, IE techniques we can use:
Semantic Relatedness
Text Classification.
Text Clustering
Named Entity Recognition
HTML Main Content Extraction
Collaborative Filtering
Sentiment Analysis
Opinion Mining
Language Models
Relevance Feedback
Query Expansion
Query Segmentation
Relevance Ranking
User Behavior Analysis
Machine Translation for Cross-Language Retrieval
7. Identify Our Position
Current problems with our search engine:
1. ranking/scoring
2. product information extraction
3. taxonomy and product classification
4. lacks filtering and advanced search functions
5. lacks relevant information to enhance user
6. performance
7. no reliable distributed data store architecture
A product search engine.
goal: extract more product information.
more accurate ranking order.
more intelligent relevant information.
(connected, structured, categorized, recent, ranked
with meaning)
More specifically:
goal: extract more product information.
(unstructured web pages to structured information with a unified taxonomy)
more accurate ranking order.
(state of the art information retrieval model incorporated with more
more intelligent relevant information.
(connected, structured, categorized, recent, ranked with meaning)
8. Improving Search Quality Plan draft v0.5
(I). baseline
1. ranking module redesign
experimental search environment setup
Luke setup
boost important fields
parameter adjustment
scoring in Lucene and Solr
2. search code refactoring
easy to modify and optimize
delete the old query parser that no one maintains
3. a plan for improving product information extraction from web pages.
4. a scheme for improving taxonomy & classification
5. add session & click-through records to the query log
6. redesign the tokenizer workflow in Solr
7. research on Hadoop, Pig, Hive, HBase.
plan for migrating from MySQL to a distributed data store
8. a plan to integrate and simplify the cache scheme
(II). Improving
1. ranking module improvement
integrate product information quality into the ranking module
modify the scoring formula following the TREC paper
add more parameters to tune
add a simple UI to observe & adjust the formula
2. add advanced search functions
3. search result filtering (distinguishing adult & kids information)
4. extract full product properties from the web (not only name, price, description)
apply wrapper techniques
identify page types
5. taxonomy integration & product classification
6. snippets / dynamic summaries
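As an illustration of item 6, a dynamic (query-biased) summary can be approximated by picking the fixed-size word window that contains the most query terms. This is a deliberately simple sketch; the function name, the window size, and the sample text are assumptions, and a production snippet generator would also handle sentence boundaries and term highlighting.

```python
def dynamic_snippet(text, query, window=8):
    """Return the window of `window` words containing the most query terms."""
    words = text.split()
    terms = {t.lower() for t in query.split()}

    def hits(start):
        # Count query terms inside the window, ignoring case and edge punctuation.
        return sum(
            1
            for w in words[start:start + window]
            if w.lower().strip(".,;:!?") in terms
        )

    last_start = max(len(words) - window, 0)
    # max() returns the earliest start among equally good windows.
    best = max(range(last_start + 1), key=hits)
    snippet = " ".join(words[best:best + window])
    prefix = "... " if best > 0 else ""
    suffix = " ..." if best + window < len(words) else ""
    return prefix + snippet + suffix


# Hypothetical product page text, for illustration only.
PAGE_TEXT = (
    "Free shipping on all orders. The ThinkPad T61 is a 14 inch business "
    "laptop with a sturdy keyboard. Returns accepted within 30 days."
)
```

Unlike a static summary taken from the start of the page, the returned window depends on the query, so dynamic_snippet(PAGE_TEXT, "t61 laptop") skips the shipping notice and shows the sentence fragment that mentions both terms.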
(III). Incorporating advanced techniques
1. ranking module improvement
research state-of-the-art IR models such as language models.
2. add relevant categorized information.
2.1 latest product information from RSS news articles
2.2 related product url/information from,,, ... (RDF database or other way?)
2.3 review opinion analysis/mining
2.4 collaborative filtering
3. query segmentation, extract concepts from the query and refine
4. query log analysis (using click-through records)
5. simplify the cache scheme to speed up response.
6. use a distributed data store (Hadoop, Pig ...)
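For item 1, one well-known language-model approach is query-likelihood ranking. The sketch below uses Jelinek-Mercer smoothing, which is a choice made here for illustration (the plan only says "language model"); the function names, the smoothing weight lam, and the sample documents are assumptions.

```python
import math
from collections import Counter


def build_stats(docs):
    """Return per-document term counts and collection-wide term counts."""
    doc_counts = [Counter(d.lower().split()) for d in docs]
    collection = Counter()
    for counts in doc_counts:
        collection.update(counts)
    return doc_counts, collection


def query_log_likelihood(query, doc_count, collection, lam=0.5):
    """log P(query | doc) with P(t|d) = (1 - lam) * tf/|d| + lam * cf/|C|."""
    doc_len = sum(doc_count.values())
    coll_len = sum(collection.values())
    total = 0.0
    for term in query.lower().split():
        if collection[term] == 0:
            continue  # term unseen in the whole collection: ignored in this sketch
        p_doc = doc_count[term] / doc_len if doc_len else 0.0
        p_coll = collection[term] / coll_len
        total += math.log((1 - lam) * p_doc + lam * p_coll)
    return total


def rank(query, docs, lam=0.5):
    """Return document indices ordered from most to least likely."""
    doc_counts, collection = build_stats(docs)
    scores = [query_log_likelihood(query, c, collection, lam) for c in doc_counts]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical product titles, for illustration only.
TITLES = [
    "ibm thinkpad t61 laptop",
    "lenovo thinkpad t400 laptop",
    "canon digital camera",
]
```

The smoothing term lets a document that lacks a query word still receive a small, nonzero probability, so partial matches are ranked rather than dropped; lam is exactly the kind of parameter the plan proposes to expose for tuning.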
9. Other Topics
Wrong viewpoint held by most people.
multi index: reduce I/O, make it fit in memory or a big cache.
multi FieldQueryParser
multi Searcher
Information Extraction
Wrapper / template
Algorithm to identify HTML page types on commerce websites
Algorithm to extract full product information.
Web noise detection and elimination
Data store using distributed MapReduce techniques.
Projects based on Hadoop: subprojects and related projects, including
Hive, Avro, Pig, HBase, Cascading etc.
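《... Retrieval Technology》 (title truncated), translated by Sun Jianjun, Cheng Ying et al., Science Press
《... Technology and Systems》 (title truncated), Li Xiaoming, Yan Hongfei, Wang Jimin, Science Press
《Modern Information Retrieval》, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999
Slides of "Modern Information Retrieval", Instructor: Zhang Li, Tsinghua University
Slides of "Advanced Topics in Information Retrieval", Zhang Min, Tsinghua University
《Develop Your Own Search Engine: Lucene 2.0 + Heritrix》, Qiu Zhe, Fu Taotao
《Lucene Analysis and Application》, Wu Zhongxin, Shen Jiali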
Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti,Morgan-Kaufmann Publishers
Introduction to Information Retrieval,Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University
Press. 2008.
Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999
Lucene and Juru at TREC 2007: 1-Million Queries Track, page 449 E. Amitay, D. Carmel, D. Cohen, IBM Haifa Research Lab
Inverted index - Wikipedia
Lucene / Solr development experience (repost) !F6DD25E77539E5DD!358.entry?wa=wsignin1.
Analyzers, Tokenizers, and Token Filters, Solr Wiki
Semantic web discussion on smth bbs
Semantic web discussion on douban
Better Search with Apache Lucene and Solr
Crowdsourcing for relevance evaluation, Alonso, O.; Rose, D. E. & Stewart, B. SIGIR Forum,Vol. 42,pp. 9-15,2008
Relevance judgments between TREC and Non-TREC assessors, Al-Maskari, A.; Sanderson, M. & Clough, P.SIGIR '08:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval,pp.
Debugging Relevance Issues in Search
Optimizing Findability in Lucene and Solr