THE GDELT EVENT DATABASE DATA FORMAT CODEBOOK V2.0
2/19/2015
http://gdeltproject.org/

INTRODUCTION

This codebook provides a quick overview of the fields in the GDELT Event file format and their descriptions. GDELT Event records are stored in an expanded version of the dyadic CAMEO format, capturing two actors and the action performed by Actor1 upon Actor2. A wide array of variables breaks out the raw CAMEO actor codes into their respective fields to make it easier to interact with the data. The Action codes are broken out into their hierarchy, the Goldstein ranking score is provided, a unique array of georeferencing fields offers estimated landmark-centroid-level geographic positioning of both actors and the location of the action, and a new “Mentions” table records the network trajectory of the story of each event “in flight” through the global media system.

At present, only records from February 19, 2015 onwards are available in the GDELT 2.0 file format; however, in late Spring 2015 the entire historical backfile back to 1979 will be released in the GDELT 2.0 format. Records are stored one per line, separated by a newline (\n), and are tab-delimited (note that files have a “.csv” extension, but are actually tab-delimited).

With the release of GDELT 2.0, the daily GDELT 1.0 Event files will still be generated each morning at least through the end of Spring 2015 to enable existing applications to continue to function without modification. Please note that at present, since GDELT 2.0 files are only available for events beginning February 19, 2015, you will need to use GDELT 1.0 to examine longitudinal patterns (since it stretches back to January 1, 1979) and use GDELT 2.0 moving forward for realtime events.

There are now two data tables created every 15 minutes for the GDELT Event dataset. The first is the traditional Event table. This table is largely identical to the GDELT 1.0 format, but does have several changes as noted below.
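Because the files carry a “.csv” extension but are tab-delimited, a reader has to be told the delimiter explicitly. The following is a minimal sketch of one way to do that in Python, not an official loader. The function name read_events and the sample values are hypothetical, and the sample is truncated to the first four fields for brevity. It also assumes that fields are not quoted.

```python
import csv
import io


def read_events(handle):
    """Yield each record as a list of strings, one entry per tab-delimited field."""
    # QUOTE_NONE treats quote characters as ordinary text
    # (assumption: the files do not quote their fields).
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip empty lines
            yield row


# Hypothetical two-record sample holding only the first four fields
# (GlobalEventID, Day, MonthYear, Year); real records have many more.
sample = (
    "410412347\t20150219\t201502\t2015\n"
    "410412348\t20150219\t201502\t2015\n"
)

rows = list(read_events(io.StringIO(sample)))
for row in rows:
    print(row)
```

For a real file, the same function can be passed a handle from open(path, newline=""). Every value is kept as a string here, which leaves type conversion as a separate, deliberate step.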
In addition to the Event table there is now a new Mentions table that records all mentions of each event. As an event is mentioned across multiple news reports, each of those mentions is recorded in the Mentions table, along with several key indicators about that mention, including the location within the article where the mention appeared (in the lead paragraph versus being buried at the bottom) and the “confidence” of the algorithms in their identification of the event from that specific news report. The Confidence measure is a new feature in GDELT 2.0 that makes it possible to adjust the sensitivity of GDELT towards specific use cases. Those wishing to find the earliest glimmers of breaking events, or reports of very small-bore events that tend to appear only as part of periodic “round up” reports, can use the entire event stream, while those wishing to find only the largest events with strongly detailed descriptions can filter the Event stream to keep only those events with the highest Confidence measures. This allows the GDELT Event stream to be dynamically filtered for each individual use case (learn more about the Confidence measure below). It also makes it possible to identify the “best” news report to return for a given event, by filtering all mentions of an event for those with the highest Confidence scores, the most prominent positioning within the article, and/or a specific source language (such as Arabic coverage of a protest versus English coverage of that protest).
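As an illustration of picking the “best” news report for an event, the sketch below ranks the mentions of one event by confidence and then by position in the article. It is only a sketch under stated assumptions: the dictionary keys (url, confidence, sentence_id, language), the function name best_mention, and the sample values are all hypothetical and are not the Mentions table's actual column names.

```python
def best_mention(mentions, min_confidence=0, language=None):
    """Return the preferred mention of one event, or None if nothing qualifies.

    Preference: highest confidence first, then the earliest position in the
    article (a smaller sentence_id is treated as more prominent).
    """
    candidates = [
        m for m in mentions
        if m["confidence"] >= min_confidence
        and (language is None or m["language"] == language)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: (m["confidence"], -m["sentence_id"]))


# Hypothetical mentions of a single event; keys and values are illustrative only.
mentions = [
    {"url": "http://example.com/a", "confidence": 40, "sentence_id": 7, "language": "eng"},
    {"url": "http://example.com/b", "confidence": 100, "sentence_id": 3, "language": "ara"},
    {"url": "http://example.com/c", "confidence": 100, "sentence_id": 1, "language": "eng"},
]

overall = best_mention(mentions)
arabic = best_mention(mentions, language="ara")
print(overall["url"], arabic["url"])
```

Raising min_confidence narrows the stream to strongly detailed reports, while leaving it at 0 keeps every mention, including early or passing ones.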
EVENT TABLE

EVENTID AND DATE ATTRIBUTES

The first few fields of an event record capture its globally unique identifier number, the date the event took place on, and several alternatively formatted versions of the date designed to make it easier to work with the event records in different analytical software programs that may have specific date format requirements. The parenthetical after each variable name gives the datatype of that field.

Note that even though GDELT 2.0 operates at a 15 minute resolution, the date fields in this section still record the date at the daily level, since this is the resolution that event analysis has historically been performed at. To examine events at the 15 minute resolution, use the DATEADDED field (the second from the last field in this table at the end).

• GlobalEventID. (integer) Globally unique identifier assigned to each event record that uniquely identifies it in the master dataset. NOTE: While these will often be sequential with date, this is NOT always the case and this field should NOT be used to sort events by date: the date fields should be used for this. NOTE: There is a large gap in the sequence between February 18, 2015 and February 19, 2015 with the switchover to GDELT 2.0. These are not missing events: the ID sequence was simply reset at a higher number so that it is possible to easily distinguish events created after the switchover to GDELT 2.0 from those created using the older GDELT 1.0 system.
• Day. (integer) Date the event took place in YYYYMMDD format. See DATEADDED field for YYYYMMDDHHMMSS date.
• MonthYear. (integer) Alternative formatting of the event date, in YYYYMM format.
• Year. (integer) Alternative formatting of the event date, in YYYY format.
• FractionDate. (floating point) Alternative formatting of the event date, computed as YYYY.FFFF, where FFFF is the percentage of the year completed by that day.
This collapses the month and day into a fractional range from 0 to 0.9999, capturing the 365 days of the year. The fractional component (FFFF) is computed as (MONTH * 30 + DAY) / 365. This is an approximation and does not correctly take into account the differing numbers of days in each month or leap years, but offers a simple single-number sorting mechanism for applications that wish to estimate the rough temporal distance between dates.

ACTOR ATTRIBUTES

The next fields describe attributes and characteristics of the two actors involved in the event. This includes the complete raw CAMEO code for each actor, its proper name, and associated attributes. The raw CAMEO code for each actor contains an array of coded attributes indicating geographic, ethnic, and religious affiliation and the actor’s role in the environment (political elite, military officer, rebel, etc). These 3-character codes may be combined in any order and are concatenated together to form the final raw actor CAMEO code. To make it easier to utilize this information in analysis, this section breaks these codes out into a set of individual fields that can be separately queried.

NOTE: all attributes in this section other than CountryCode are derived from the TABARI ACTORS dictionary and are NOT supplemented from information in the text. Thus, if the text refers to a group as “Radicalized terrorists,” but the TABARI ACTORS dictionary labels that group as “Insurgents,” the latter label will be used. Use the GDELT Global Knowledge Graph to enrich actors with additional information from the rest of the article.

NOTE: the CountryCode field reflects a combination of information from the TABARI ACTORS dictionary and text, with the ACTORS dictionary taking precedence. Thus, if the text states that “French Assistant Minister Smith was in Moscow,” the CountryCode field will list France, while the geographic fields discussed at the end of this manual may list Moscow as his/her location.

NOTE: One of the two actor fields may be blank in complex or single-actor situations or may contain only minimal detail for actors such as “Unidentified gunmen.”

GDELT currently uses the CAMEO version 1.1b3 taxonomy. For more information on what each specific code in the fields below stands for and the complete available taxonomy of the various fields below, please see the CAMEO User Manual [1] or the GDELT website for crosswalk files [2].

[1] http://gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf
[2] http://gdeltproject.org/

• Actor1Code. (string) The complete raw CAMEO code for Actor1 (includes geographic, class, ethnic, religious, and type classes). May be blank if the system was unable to identify an Actor1.
• Actor1Name. (string) The actual name of the Actor1. In the case of a political leader or organization, this will be the leader’s formal name (GEORGE W BUSH, UNITED NATIONS), for a geographic match it will be either the country or capital/major city name (UNITED STATES / PARIS), and for ethnic, religious, and type matches it will reflect the root match class (KURD, CATHOLIC, POLICE OFFICER, etc). May be blank if the system was unable to identify an Actor1.
• Actor1CountryCode. (string) The 3-character CAMEO code for the country affiliation of Actor1. May be blank if the system was unable to identify an Actor1 or determine its country affiliation (such as “UNIDENTIFIED GUNMEN”).
• Actor1KnownGroupCode. (string) If Actor1 is a known IGO/NGO/rebel organization (United Nations, World Bank, al-Qaeda, etc) with its own CAMEO code, this field will contain that code.
• Actor1EthnicCode. (string) If the source document specifies the ethnic affiliation of Actor1 and that ethnic group has a CAMEO entry, the CAMEO code is entered here. NOTE: a few special groups like ARAB may also have entries in the type column due to legacy CAMEO behavior. NOTE: this behavior is highly experimental and may not capture all affiliations properly. For more comprehensive and sophisticated identification of ethnic affiliation, it is recommended that users use the GDELT Global Knowledge Graph’s ethnic, religious, and social group taxonomies and post-enrich actors from the GKG.
• Actor1Religion1Code. (string) If the source document specifies the religious affiliation of Actor1 and that religious group has a CAMEO entry, the CAMEO code is entered here. NOTE: a few special groups like JEW may also have entries in the geographic or type columns due to legacy CAMEO behavior. NOTE: this behavior is highly experimental and may not capture all affiliations properly. For more comprehensive and sophisticated identification of religious affiliation, it is recommended that users use the GDELT Global Knowledge Graph’s ethnic, religious, and social group taxonomies and post-enrich actors from the GKG.
• Actor1Religion2Code. (string) If multiple religious codes are specified for Actor1, this contains the secondary code. Some religion entries automatically use two codes, such as Catholic, which invokes Christianity as Code1 and Catholicism as Code2.
• Actor1Type1Code. (string) The 3-character CAMEO code of the CAMEO “type” or “role” of Actor1, if specified. This can be a specific role such as Police Forces, Government, Military, Political Opposition, Rebels, etc, a broad role class such as Education, Elites, Media, Refugees, or organizational classes like Non-Governmental Movement. Special codes such as Moderate and Radical may refer to the operational strategy of a group.
• Actor1Type2Code. (string) If multiple type/role codes are specified for Actor1, this returns the second code.
• Actor1Type3Code. (string) If multiple type/role codes are specified for Actor1, this returns the third code.

The set of fields above is repeated for Actor2, with each field name prefaced with “Actor2” instead of “Actor1”. The definitions and values of each field are the same as above.

EVENT ACTION ATTRIBUTES

The following fields break out various attributes of the event “action” (what Actor1 did to Actor2) and offer several mechanisms for assessing the “importance” or immediate-term “impact” of an event. NOTE: the various fields in this section recording the amount of coverage an event has received are included solely for legacy purposes. The new Mentions table should be used instead in most cases.

• IsRootEvent. (integer) The system codes every event found in an entire document, using an array of techniques to dereference and link information together. A number of previous projects such as the ICEWS initiative have found that events occurring in the lead paragraph of a document tend to be the most “important.” This flag can therefore be used as a proxy for the rough importance of an event to create subsets of the event stream. NOTE: this field refers only to the first news report to mention an event and is not updated if the event is found in a different context in other news reports. It is included for legacy purposes. For more precise information on the positioning of an event, see the Mentions table.
• EventCode. (string) This is the raw CAMEO action code describing the action that Actor1 performed upon Actor2.
NOTE: it is strongly recommended that this field be stored as a string instead of an integer, since the CAMEO taxonomy can include event codes with leading zeros, and storing them as integers can make it more difficult to distinguish between certain event types.
• EventBaseCode. (string) CAMEO event codes are defined in a three-level taxonomy. For events at level three in the taxonomy, this yields its level two root node. For example, code “0251” (“Appeal for easing of administrative sanctions”) would yield an EventBaseCode of “025” (“Appeal to yield”). This makes it possible to aggregate events at various resolutions of specificity. For events at levels two or one, this field will be set to EventCode. NOTE: it is strongly recommended that this field be stored as a string instead of an integer, since the CAMEO taxonomy can include event codes with leading zeros, and storing them as integers can make it more difficult to distinguish between certain event types.
• EventRootCode. (string) Similar to EventBaseCode, this defines the root-level category the event code falls under. For example, code “0251” (“Appeal for easing of administrative sanctions”) has a root code of “02” (“Appeal”). This makes it possible to aggregate events at various resolutions of specificity. For events at levels two or one, this field will be set to EventCode. NOTE: it is strongly recommended that this field be stored as a string instead of an integer, since the CAMEO taxonomy can include event codes with leading zeros, and storing them as integers can make it more difficult to distinguish between certain event types.
• QuadClass. (integer) The entire CAMEO event taxonomy is ultimately organized under four primary classifications: Verbal Cooperation, Material Cooperation, Verbal Conflict, and Material Conflict. This field specifies this primary classification for the event type, allowing analysis at the highest level of aggregation. The numeric codes in this field map to the Quad Classes as follows: 1=Verbal Cooperation, 2=Material Cooperation, 3=Verbal Conflict, 4=Material Conflict.
• GoldsteinScale. (floating point) Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. This is known as the Goldstein Scale. This field specifies the Goldstein score for each event type. NOTE: this score is based on the type of event, not the specifics of the actual event record being recorded. Thus two riots, one with 10 people and one with 10,000, will both receive the same Goldstein score. This can be aggregated to various levels of time resolution to yield an approximation of the stability of a location over time.
• NumMentions. (integer) This is the total number of mentions of this event across all source documents during the 15 minute update in which it was first seen. Multiple references to an event within a single document also contribute to this count. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents and the density of events within them vary over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. This field is actually a composite score of the total number of raw mentions and the number of mentions extracted from reprocessed versions of each article (see the discussion for the Mentions table). NOTE: this field refers only to the first news report to mention an event and is not updated if the event is found in a different context in other news reports.
It is included for legacy purposes. For more precise information on the positioning of an event, see the Mentions table.
• NumSources. (integer) This is the total number of information sources containing one or more mentions of this event during the 15 minute update in which it was first seen. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of sources varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. NOTE: this field refers only to the first news report to mention an event and is not updated if the event is found in a different context in other news reports. It is included for legacy purposes. For more precise information on the positioning of an event, see the Mentions table.
• NumArticles. (integer) This is the total number of source documents containing one or more mentions of this event during the 15 minute update in which it was first seen. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. NOTE: this field refers only to the first news report to mention an event and is not updated if the event is found in a different context in other news reports. It is included for legacy purposes. For more precise information on the positioning of an event, see the Mentions table.
• AvgTone. (numeric) This is the average “tone” of all documents containing one or more mentions of this event during the 15 minute update in which it was first seen. The score ranges from -100 (extremely negative) to +100 (extremely positive).
Common values range between -10 and +10, with 0 indicating neutral. This can be used as a method of filtering the “context” of events as a subtle measure of the importance of an event and as a proxy for the “impact” of that event. For example, a riot event with a slightly negative average tone is likely to have been a minor occurrence, whereas if it had an extremely negative average tone, it suggests a far more serious occurrence. A riot with a positive score likely suggests a very minor occurrence described in the context of a more positive narrative (such as a report of an attack occurring in a discussion of improving conditions on the ground in a country and how the number of attacks per day has been greatly reduced). NOTE: this field refers only to the first news report to mention an event and is not updated if the event is found in a different context in other news reports. It is included for legacy purposes. For more precise information on the positioning of an event, see the Mentions table. NOTE: this provides only a basic tonal assessment of an article and it is recommended that users interested in emotional measures use the Mentions and Global Knowledge Graph tables to merge the complete set of 2,300 emotions and themes from the GKG GCAM system into their analysis of event records.

EVENT GEOGRAPHY

The final set of fields add a novel enhancement to the CAMEO taxonomy, georeferencing each event along three primary dimensions to the landmark-centroid level. To do this, the fulltext of the source document is processed using fulltext geocoding and automatic disambiguation to identify every geographic reference [3]. The closest reference to each of the two actors and to the action reference are then encoded in these fields. The georeferenced location for an actor may not always match the Actor1CountryCode or Actor2CountryCode field, such as in a case where the President of Russia is visiting Washington, DC in the United States, in which case the Actor1CountryCode would contain the code for Russia, while the georeferencing fields below would contain a match for Washington, DC. It may not always be possible for the system to locate a match for each actor or location, in which case one or more of the fields may be blank. The Action fields capture the location information closest to the point in the event description that contains the actual statement of action and are the best location to use for placing events on a map or in other spatial context.
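Looking back at the EventCode, EventBaseCode, and EventRootCode fields described under EVENT ACTION ATTRIBUTES, the hierarchy and the leading-zero warning can be illustrated with a short sketch. The function names are hypothetical. Taking string prefixes is an assumption that reproduces the “0251” example above, not a statement of how GDELT computes these fields, and it presumes that level one, two, and three codes are two, three, and four characters long.

```python
# Quad class labels as listed in the QuadClass field description.
QUAD_CLASSES = {
    1: "Verbal Cooperation",
    2: "Material Cooperation",
    3: "Verbal Conflict",
    4: "Material Conflict",
}


def event_base_code(event_code: str) -> str:
    """Roll a four-character (level three) code up to its three-character parent.

    Shorter codes come back unchanged, because slicing never pads a string.
    """
    return event_code[:3]


def event_root_code(event_code: str) -> str:
    """Take the two-character prefix (assumption: this is the root category)."""
    return event_code[:2]


code = "0251"                  # "Appeal for easing of administrative sanctions"
base = event_base_code(code)   # "025"
root = event_root_code(code)   # "02"

# Storing the code as an integer silently drops the leading zero,
# so the same prefix logic no longer finds the right category.
as_integer = int(code)                # 251
damaged_root = str(as_integer)[:2]    # "25", not "02"

print(base, root, damaged_root, QUAD_CLASSES[4])
```

Keeping the three code fields as strings from the moment they are read avoids this class of error entirely.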
To find all events located in or relating to a specific city or geographic landmark, the Geo_FeatureID column should be used, rather than the Geo_Fullname column. This is because the Geo_Fullname column captures the name of the location as expressed in the text and thus reflects differences in transliteration, alternative spellings, and alternative names for the same location. For example, Mecca is often spelled Makkah, while Jeddah is commonly spelled Jiddah or Jaddah. The Geo_Fullname column will reflect each of these different spellings, while the Geo_FeatureID column will resolve them all to the same unique GNS or GNIS feature identification number. For more information on the GNS and GNIS identifiers, see Leetaru (2012) [4].

When looking for events in or relating to a specific country, such as Syria, there are two possible filtering methods. The first is to use the Actor_CountryCode fields in the Actor section to look for all actors having the SYR (Syria) code. However, conflict zones are often accompanied by high degrees of uncertainty in media reporting and a news article might mention only “Unidentified gunmen stormed a house and shot 12 civilians.” In this case, the Actor_CountryCode fields for Actor1 and Actor2 would both be blank, since the article did not specify the actor country affiliations, while their Geo_CountryCode values (and the ActionGeo_CountryCode for the event) would specify Syria. This can result in dramatic differences when examining active conflict zones. The second method is to examine the ActionGeo_CountryCode for the location of the event. This will also capture situations such as the United States criticizing a statement by Russia regarding a specific Syrian attack.

[3] http://www.dlib.org/dlib/september12/leetaru/09leetaru.html
[4] http://www.dlib.org/dlib/september12/leetaru/09leetaru.html
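The difference between the two country filtering methods can be seen on a handful of records. This is a minimal sketch with hypothetical data: the records are invented for illustration, only three fields are shown, and the action location field is assumed to follow the “Action” prefix naming described for the geography fields. Actor country codes use CAMEO 3-character codes (SYR), while the geographic code uses the 2-character FIPS10-4 code (SY).

```python
def by_actor_country(events, cameo_code):
    """Method 1: keep events where either actor carries the CAMEO country code."""
    return [
        e for e in events
        if cameo_code in (e["Actor1CountryCode"], e["Actor2CountryCode"])
    ]


def by_action_location(events, fips_code):
    """Method 2: keep events whose action location resolves to the FIPS10-4 country."""
    return [e for e in events if e["ActionGeo_CountryCode"] == fips_code]


# Hypothetical records, reduced to the three fields used here.
events = [
    # "Unidentified gunmen stormed a house": no actor country, located in Syria.
    {"Actor1CountryCode": "", "Actor2CountryCode": "", "ActionGeo_CountryCode": "SY"},
    # A Syrian actor acting inside Syria.
    {"Actor1CountryCode": "SYR", "Actor2CountryCode": "", "ActionGeo_CountryCode": "SY"},
    # The United States criticizing Russia over an attack located in Syria.
    {"Actor1CountryCode": "USA", "Actor2CountryCode": "RUS", "ActionGeo_CountryCode": "SY"},
    # An unrelated exchange located in the United States.
    {"Actor1CountryCode": "USA", "Actor2CountryCode": "RUS", "ActionGeo_CountryCode": "US"},
]

actor_matches = by_actor_country(events, "SYR")
location_matches = by_action_location(events, "SY")
print(len(actor_matches), len(location_matches))
```

On this sample the actor-based filter finds one event while the location-based filter finds three, which mirrors the gap described above for active conflict zones.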
• Actor1Geo_Type. (integer) This field specifies the geographic resolution of the match type and holds one of the following values:
    1=COUNTRY (match was at the country level)
    2=USSTATE (match was to a US state)
    3=USCITY (match was to a US city or landmark)
    4=WORLDCITY (match was to a city or landmark outside the US)
    5=WORLDSTATE (match was to an Administrative Division 1 outside the US, roughly equivalent to a US state)
  This can be used to filter events by geographic specificity, for example, extracting only those events with a landmark-level geographic resolution for mapping. Note that matches with codes 1 (COUNTRY), 2 (USSTATE), and 5 (WORLDSTATE) will still provide a latitude/longitude pair, which will be the centroid of that country or state, but the FeatureID field below will be blank.
• Actor1Geo_Fullname. (string) This is the full human-readable name of the matched location. In the case of a country it is simply the country name. For US and World states it is in the format of “State, Country Name”, while for all other matches it is in the format of “City/Landmark, State, Country”. This can be used to label locations when placing events on a map. NOTE: this field reflects the precise name used to refer to the location in the text itself, meaning it may contain multiple spellings of the same location. Use the FeatureID column to determine whether two location names refer to the same place.
• Actor1Geo_CountryCode. (string) This is the 2-character FIPS10-4 country code for the location.
• Actor1Geo_ADM1Code. (string) This is the 2-character FIPS10-4 country code followed by the 2-character FIPS10-4 administrative division 1 (ADM1) code for the administrative division housing the landmark. In the case of the United States, this is the 2-character shortform of the state’s name (such as “TX” for Texas).
• Actor1Geo_ADM2Code. (string) For international locations this is the numeric Global Administrative Unit Layers (GAUL) administrative division 2 (ADM2) code assigned to each global location, while for US locations this is the two-character shortform of the state’s name (such as “TX” for Texas) followed by the 3-digit numeric county code (following the INCITS 31:200x standard used in GNIS). For more detail on the contents and computation of this field, please see the following footnoted URL [5]. NOTE: This field may be blank/null in cases where no ADM2 information was available, for some ADM1-level matches, and for all country-level matches. NOTE: this field may still contain a value for ADM1-level matches depending on how they are codified in GNS.
• Actor1Geo_Lat. (floating point) This is the centroid latitude of the landmark for mapping.
• Actor1Geo_Long. (floating point) This is the centroid longitude of the landmark for mapping.
• Actor1Geo_FeatureID. (string) This is the GNS or GNIS FeatureID for this location. More information on these values can be found in Leetaru (2012) [6]. NOTE: When Actor1Geo_Type has a value of 3 or 4 this field will contain a signed numeric value, while it will contain a textual FeatureID in the case of other match resolutions (usually the country code or country code and ADM1 code). A small percentage of small cities and towns may have a blank value in this field even for Actor1Geo_Type values of 3 or 4: this will be corrected in the 2.0 release of GDELT. NOTE: This field can contain both positive and negative numbers, see Leetaru (2012) for more information on this.

These codes are repeated for Actor2 and Action, using those prefixes.

[5] http://blog.gdeltproject.org/global-second-order-administrative-divisions-now-available-from-gaul/
[6] http://www.dlib.org/dlib/september12/leetaru/09leetaru.html

DATA MANAGEMENT FIELDS

Finally, a set of fields at the end of the record provide additional data management information for the event record.

• DATEADDED. (integer) This field stores the date the event was added to the master database in YYYYMMDDHHMMSS format in the UTC timezone. For those needing to access events at 15 minute resolution, this is the field that should be used in queries.
• SOURCEURL. (string) This field records the URL or citation of the first news report in which the system found this event. In most cases this is the first report in which the event was seen, but due to the timing and flow of news reports through the processing pipeline, this may not always be the very first report, though it is at least among the first few.

MENTIONS TABLE

The Mentions table is a new addition to GDELT 2.0 and records each mention of the events in the Event table, making it possible to track the trajectory and network structure of a story as it flows through the global media system. Each mention of an event receives its own entry in the Mentions table. Thus an event which is mentioned in 100 articles will be listed 100 times in the Mentions table. Mentions are recorded irrespective of the date of the original event, meaning that a mention today of an event from a year ago will still be recorded, making it possible to trace discussion of “anniversary events” or historical events being recontextualized into present actions. If a news report mentions multiple events, each mention is recorded separately in this table. For translated documents, all measures below are based on the English translation.

Several of the new measures recorded in the Mentions table make it possible to better filter events based on how confident GDELT was in its extraction of that event. When trying to understand news media spanning the entire globe, one finds that journalism is rife with ambiguities, assumed background knowledge, and complex linguistic structures.
Not every event mention will take the form of “American President Barack Obama met with Russian President Vladimir Putin yesterday at a trade summit in Paris, France.” Instead, an event mention might more commonly appear as “Obama and Putin were in Paris yesterday for a trade summit. The two leaders met backstage where he discussed his policy on Ukraine.” To which of the two leaders do “he” and “his” refer? Is Obama discussing Obama’s policy on Ukraine, or is Obama discussing Putin’s policy on Ukraine, or is it Putin discussing Putin’s policy or perhaps Putin discussing Obama’s policy? While additional cues may be available in the surrounding text, ambiguous event mentions like this are exceptionally common across the world’s media. Similarly, it would be difficult indeed to maintain an exhaustive list of every single political figure in the entire world, and thus context is often critical for disambiguating the geographic affiliation of an actor. Even in the case of more senior political leadership, a reference to “Renauld’s press conference this afternoon in Port-au-Prince” most likely refers to Lener Renauld, the Minister of Defense of Haiti, but this disambiguation still carries with it some degree of ambiguity.

GDELT makes use of an array of natural language processing algorithms like coreference and deep parsing using whole-of-document context. While these enormously increase GDELT’s ability to understand and extract ambiguous and linguistically complex events, such extractions also come with a higher potential for error. Under GDELT 1.0, the NumMentions field was designed as a composite score of the absolute number of unique documents mentioning an event and the number of revisions to the text required by these various algorithms, up to six revision passes. Under GDELT 2.0, the Mentions table now separates these, with each record in the Mentions table recording an individual mention of an event in an article, while the new Confidence field