Data Driven Approaches for
Large-scale Knowledge Graph
Construction
Yanghua Xiao
Fudan University
Knowledge Works at Fudan (kw.fudan.edu.cn)
Knowledge Graph
• A knowledge graph is a large-scale semantic network
consisting of entities/concepts as well as the semantic
relationships among them
• Higher coverage of entities and concepts
• Richer semantic relationships
• Usually organized as RDF
• Quality assurance by crowdsourcing
• Why Knowledge Graphs?
• Understanding the semantics of text needs background
knowledge
• A robot brain needs a knowledge base to understand the
world
• YAGO, WordNet, Freebase, Probase, NELL, Cyc, DBpedia, ...
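As a rough illustration of the "organized as RDF" point above, a knowledge graph can be viewed as a set of (subject, predicate, object) triples. The sketch below uses plain Python tuples rather than a real RDF library, and the helper name `index_by_subject` is hypothetical; the facts are taken from examples elsewhere in these slides.

```python
from collections import defaultdict

# Each fact is a (subject, predicate, object) triple, mirroring the RDF data model.
# Plain tuples are an assumption made for illustration; no RDF library is used.
triples = [
    ("Michael Jordan", "isA", "basketball player"),
    ("cat", "isA", "domestic animal"),
    ("dog", "isA", "domestic animal"),
    ("China", "isA", "developing country"),
]


def index_by_subject(facts):
    """Group (predicate, object) pairs under their subject for quick lookup."""
    index = defaultdict(list)
    for subject, predicate, obj in facts:
        index[subject].append((predicate, obj))
    return dict(index)


graph = index_by_subject(triples)
print(graph["cat"])
```

Indexing by subject is only one possible layout; a real store would typically index predicates and objects as well.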
Data Driven vs Hand Crafted
• Manually constructed knowledge graph
• Examples: WordNet, Cyc
• Size: Small (Huge human cost)
• Quality: Almost perfect (Each relation is checked by experts)
• Auto-constructed knowledge graph
• Automatically extracted from huge web corpus
• Examples: Probase, WikiTaxonomy, etc.
• Size: Huge (From huge corpus)
• Quality: Good (The accuracy can’t reach 100%)
• Because of the huge size, there are many wrong facts
Pipeline of KG construction
• Extraction (Cost: Costly human efforts)
• End-to-end
• Domain specific
• Correction (Quality: Wrong data)
• Graph structure based correction
• Completion (Quality: Missing data)
• Collaborative filtering based completion
• Transitivity inference based completion
Jiaqing Liang, Yanghua Xiao, et al., Probase+:
Inferring Missing Links in Conceptual Taxonomies,
to be published in TKDE 2017
Probase
• A web-scale taxonomy derived
from web pages by Hearst
linguistic patterns
• “…famous basketball players such as
Michael Jordan …”
• domestic animals such as cats and
dogs ...
• China is a developing country.
• Life is a box of chocolate.
• 10M concepts, and 16M isA relations
Hearst pattern
NP such as NP, NP, ..., and|or NP
such NP as NP,* or|and NP
NP, NP*, or other NP
NP, NP*, and other NP
NP, including NP,* or|and NP
NP, especially NP,* or|and NP
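To make the first pattern concrete, here is a minimal sketch of extracting isA pairs from "NP such as NP, NP, ..., and|or NP". It is an illustration only, not the Probase extractor: it assumes the input is a short fragment whose whole text before "such as" is the hypernym noun phrase, and it uses a plain regular expression where a real system would use a noun-phrase chunker. The function name `extract_such_as` is hypothetical.

```python
import re

# Matches "<hypernym> such as <list of hyponyms>" inside a short text fragment.
SUCH_AS = re.compile(r"(?P<hyper>[\w ]+?)\s+such as\s+(?P<hypos>[^.;]+)")

# Splits a hyponym list on ", and", ", or", ",", " and ", " or ".
LIST_SEP = re.compile(r"\s*,\s*(?:and|or)\s+|\s*,\s*|\s+(?:and|or)\s+")


def extract_such_as(fragment):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    match = SUCH_AS.search(fragment)
    if match is None:
        return []
    hypernym = match.group("hyper").strip()
    hyponyms = [h.strip() for h in LIST_SEP.split(match.group("hypos"))]
    return [(h, hypernym) for h in hyponyms if h]


pairs = extract_such_as("domestic animals such as cats and dogs")
print(pairs)
```

The other patterns ("and other", "including", "especially") could be handled the same way with one regular expression each; resolving which noun phrase is the true hypernym in a full sentence is the hard part and is not attempted here.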
Missing isA relationships in Probase
• “car” and “automobile” are
synonyms
• They should share hypernyms
• “automobile” should be a “wheelbase
vehicle”
• Missing isA relations hurt the
understanding of the concepts of
entities
• Is Lincoln Zephyr a car?
Solution idea: CF based Missing isA inference
• User-based collaborative filtering!
• Hypernyms --- Items
• Concepts --- Users
• Synonyms or Siblings --- Similar users
• Concepts with similar meanings tend to
share hypernyms/hyponyms in an isA
taxonomy
• To find missing hypernyms for a
concept c
• First find c’s synonyms and siblings
• Then transfer their hypernyms to c
Idea:
if most similar terms of c have h as the hypernym, c is
likely to have the hypernym h.
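The idea above can be sketched as a simple majority vote among similar concepts. This is a minimal illustration under stated assumptions, not the Probase+ algorithm itself: the similar concepts are given as input rather than computed, every neighbor has an equal vote, and the function name `infer_missing_hypernyms`, the `min_support` threshold, and the toy data (including "motorcar" and "truck") are hypothetical.

```python
from collections import Counter


def infer_missing_hypernyms(concept, hypernyms, similar, min_support=0.5):
    """Suggest hypernyms that more than `min_support` of the similar concepts share.

    hypernyms: dict mapping a concept to its set of known hypernyms
    similar:   dict mapping a concept to a list of its synonyms/siblings
    """
    neighbors = similar.get(concept, [])
    if not neighbors:
        return []
    # Each similar concept ("similar user") votes for its hypernyms ("items").
    votes = Counter()
    for neighbor in neighbors:
        votes.update(hypernyms.get(neighbor, set()))
    known = hypernyms.get(concept, set())
    candidates = [
        (h, count / len(neighbors))
        for h, count in votes.items()
        if h not in known and h != concept and count / len(neighbors) > min_support
    ]
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Toy data, assumed for illustration only.
hypernyms = {
    "car": {"vehicle", "wheelbase vehicle"},
    "motorcar": {"vehicle", "wheelbase vehicle"},
    "truck": {"vehicle"},
    "automobile": {"vehicle"},
}
similar = {"automobile": ["car", "motorcar", "truck"]}

suggestions = infer_missing_hypernyms("automobile", hypernyms, similar)
print(suggestions)
```

Here two of the three similar concepts have "wheelbase vehicle" as a hypernym, so it is proposed for "automobile"; "vehicle" is skipped because it is already known. Weighting votes by a similarity score would be a natural refinement.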