# Emotion Stimulus Detection in German News Headlines 

The data in this file belongs to the paper:

Bao Minh Doan Dang, Laura Oberlaender, and Roman Klinger:
Emotion Stimulus Detection in German News Headlines.
KONVENS 2021.


## Introduction

The GERSTI dataset contains 2006 German news headlines, 811 of which carry emotion stimulus annotations. We used this resource to analyse emotion stimuli in German and to evaluate machine learning models.

The emotion classes are "happiness", "sadness", "fear", "anger", "disgust", "positive surprise", "negative surprise", "shame", "hope", "other" and "no emotion". Furthermore, we annotated whether an instance mentions emotion roles such as experiencer and cue (but not where in the instance they occur).

Those instances that received an emotion label were annotated for the emotion stimulus on the token level (span annotations).

This data folder contains the following items:

- Annotation guidelines (`annotation_guidelines.pdf`)
- List of keywords and regular expressions for preprocessing and filtering data
  (`filter_terms.txt`)
- A JSONL file for the whole corpus called `gersti.jsonl`
- The folder `intermediate_annotation_files/`, which contains four CSV
  files (datasets):
    - `phase_1a.csv`: Annotating mentions of emotion cue, experiencer and
      emotion (This file was used for emotion annotation)
    - `phase_1b.csv`: Aggregating items for phase 2
    - `phase_2a.csv`: Annotating emotion stimuli (This file was used for
      annotating stimuli on token-level)
    - `phase_2b.csv`: Dataset with aggregated emotion causes
- The folder `rss/`, which contains a text file listing the RSS feeds that were used to collect the data


## Description

### `gersti.jsonl`

Each line of the JSONL file represents one headline. An entry contains the headline `text` itself, a unique `id`, the `tokens` of the text, the media `source`, the `gold` annotation for `emotion` and `stimulus`, as well as `annotations`, which holds the labels of the first (`anno1`) and second (`anno2`) annotator for `cue`, `experiencer`, `emotion` and `stimulus`.

#### Example: 

```
{
  "id": 1749,
  "source": "BUZZFEED",
  "text": "Noch mehr Eltern erzählen von den unheimlichen Dingen, die ihr Kind mal gesagt hat",
  "tokens": ["Noch", "mehr", "Eltern", "erzählen", "von", "den", "unheimlichen", "Dingen", ",", "die", "ihr", "Kind", "mal", "gesagt", "hat"],
  "gold": {
    "emotion": "fear",
    "stimulus": ["O", "O", "O", "O", "B", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I"]
  },
  "annotations": {
    "anno1": {
      "cue": 1,
      "experiencer": 1,
      "emotion": "fear",
      "stimulus": ["O", "O", "O", "O", "B", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I"]
    },
    "anno2": {
      "cue": 1,
      "experiencer": 0,
      "emotion": "no emotion",
      "stimulus": ["O", "O", "O", "O", "B", "I", "I", "I", "O", "O", "O", "O", "O", "O", "O"]
    }
  }
}
```
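
As a minimal sketch of how an entry might be read, the following Python snippet parses JSONL lines and extracts the stimulus tokens from the IOB tags. The helper names `read_jsonl` and `stimulus_spans` are assumptions made for illustration and are not part of the dataset; the sample line is abridged from the example above.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines (e.g. an open file) into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def stimulus_spans(tokens, tags):
    """Collect the token spans marked with B/I tags (hypothetical helper)."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I" without a preceding "B") ends any open span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


# One line as it might appear in gersti.jsonl, abridged from the example above.
sample_line = (
    '{"id": 1749, "tokens": ["Noch", "mehr", "Eltern", "erzählen", "von", "den", '
    '"unheimlichen", "Dingen", ",", "die", "ihr", "Kind", "mal", "gesagt", "hat"], '
    '"gold": {"emotion": "fear", "stimulus": ["O", "O", "O", "O", "B", "I", "I", '
    '"I", "I", "I", "I", "I", "I", "I", "I"]}}'
)

entries = read_jsonl([sample_line])
entry = entries[0]
gold_spans = stimulus_spans(entry["tokens"], entry["gold"]["stimulus"])
print(entry["gold"]["emotion"], [" ".join(span) for span in gold_spans])
```

With the real file, an open file handle such as `open("gersti.jsonl", encoding="utf-8")` could be passed to `read_jsonl` in place of the sample list.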

All of the following files are redundant and are shared only for transparency about the creation process.

### `phase_1a.csv` 

This file includes the emotion annotations of both annotators. It contains 2006 data rows and the following eight columns (five during the annotation procedure):

    1. headlines
    2. source
    3. emotion_cue
    4. experiencer
    5. emotion
    -----------------
    6. emotion_cue2
    7. experiencer_2
    8. emotion_2


The `headlines` and `source` columns contain a headline and its source.

`emotion_cue`, `experiencer` and `emotion` indicate the annotation of the first annotator. Accordingly, `emotion_cue2`, `experiencer_2` and `emotion_2` show the results of the second annotator.

Emotion cue and experiencer were annotated as binary values, since annotators only had to mark whether an emotion cue and/or an experiencer is present in the text: 1 indicates YES and 0 indicates NO.

The concrete emotions mentioned above are encoded as integers: 1: Happiness, 2: Sadness, 3: Fear, 4: Anger, 5: Disgust, 6: pos. Surprise, 7: neg. Surprise, 8: Shame, 9: Hope and 10: Other Emotion. 0 corresponds to the category "No Emotion".

If there is no experiencer and no cue in the headline, the category "No Emotion" (numeric code 0) should be assigned.
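
The integer coding can be turned into readable labels with a small lookup table. The following Python sketch assumes that the label strings match the emotion class names given in the introduction; `EMOTION_CODES` and `decode_emotion` are hypothetical names chosen for illustration.

```python
# Integer codes as listed above. The label spellings follow the class names
# from the introduction and are an assumption, not a guaranteed file format.
EMOTION_CODES = {
    0: "no emotion",
    1: "happiness",
    2: "sadness",
    3: "fear",
    4: "anger",
    5: "disgust",
    6: "positive surprise",
    7: "negative surprise",
    8: "shame",
    9: "hope",
    10: "other",
}


def decode_emotion(code):
    """Translate an integer code (int or numeric string) into its label."""
    return EMOTION_CODES[int(code)]


print(decode_emotion(3))
print(decode_emotion("0"))
```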


### `phase_1b.csv` 

This table includes the headlines that were annotated with a concrete emotion by at least one annotator. It has the following six columns:

    1. sentenceID
    2. headlines
    3. source
    4. emotion
    5. emotion_2
    6. emotion_aggr

The columns `emotion` and `emotion_2` show the emotion annotations of the two annotators, while `emotion_aggr` holds the aggregated emotion class. Matching labels were copied automatically to the `emotion_aggr` column.

The annotators jointly discussed the cases in which their emotion classes did not match and decided which emotion the headline actually expresses.

For sentences that were annotated with both "No Emotion" and an emotion, it was checked semi-heuristically whether the annotations for experiencer or cue (in `phase_1a.csv`) matched. If so, the annotated emotion was adopted.

The class "No Emotion" could also be chosen when the adopted emotion was not appropriate for the headline.
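
The aggregation rules described above can be approximated by the following Python sketch. It is not the procedure actually used: the function name `aggregate_emotion`, the return value `None` for cases left to manual discussion, and the treatment of "No Emotion" pairs whose cue and experiencer labels both disagree are assumptions. The manual override towards "No Emotion" is not modelled.

```python
NO_EMOTION = 0  # numeric code for "No Emotion" as used in phase_1a.csv


def aggregate_emotion(emotion_1, emotion_2, cue_1, cue_2, exp_1, exp_2):
    """Return a provisional aggregated label, or None if discussion is needed."""
    if emotion_1 == emotion_2:
        # Matching labels are taken over automatically.
        return emotion_1
    if NO_EMOTION in (emotion_1, emotion_2):
        # One annotator chose "No Emotion": adopt the concrete emotion if the
        # cue or the experiencer annotations of the two annotators match.
        if cue_1 == cue_2 or exp_1 == exp_2:
            return emotion_2 if emotion_1 == NO_EMOTION else emotion_1
        return None  # assumption: otherwise left to the annotators
    # Two different concrete emotions: resolved by joint discussion.
    return None


print(aggregate_emotion(3, 3, 1, 1, 1, 0))
print(aggregate_emotion(3, 0, 1, 1, 1, 0))
print(aggregate_emotion(3, 4, 1, 1, 1, 1))
```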

NOTE: The sentence IDs in this file do not correspond to `phase_1a.csv`.


### `phase_2a.csv` 

This file contains five (four during annotation) columns:

    1. sentenceID
    2. tokensID
    3. tokens
    4. BIO
    -------------
    5. BIO2

The file contains only headlines that had been labeled with a concrete emotion.

Each sentence was tokenised and is represented vertically in the `tokens` column, one token per row. The `sentenceID` column therefore only separates headlines from each other and is not related to the `phase_1a` and `phase_1b` data tables. The column `tokensID` gives the position of each token in its headline. `BIO` and `BIO2` hold the two annotators' labels for every token in the IOB scheme. Given the aggregated emotions from the `phase_1b.csv` file, the annotators marked the stimulus span.
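
To work with whole headlines again, the vertical rows can be regrouped by `sentenceID`. The Python sketch below assumes a comma-separated file with a header row that uses the column names listed above; the function name `group_by_sentence` and the sample rows are made up for illustration and are not taken from `phase_2a.csv`.

```python
import csv
import io
from collections import defaultdict


def group_by_sentence(rows):
    """Group vertical token rows into one ordered row list per sentenceID."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sentenceID"]].append(row)
    for sentence_rows in grouped.values():
        # Restore the token order within each headline.
        sentence_rows.sort(key=lambda r: int(r["tokensID"]))
    return dict(grouped)


# Made-up rows for illustration only; the delimiter is an assumption.
sample_csv = """sentenceID,tokensID,tokens,BIO,BIO2
1,1,Kind,O,O
1,2,sagt,O,O
1,3,unheimliche,B,B
1,4,Dinge,I,I
2,1,Wetter,B,O
2,2,bleibt,I,O
2,3,schlecht,I,B
"""

sentences = group_by_sentence(csv.DictReader(io.StringIO(sample_csv)))
for sentence_id, rows in sentences.items():
    tokens = [r["tokens"] for r in rows]
    tags_1 = [r["BIO"] for r in rows]
    tags_2 = [r["BIO2"] for r in rows]
    print(sentence_id, tokens, tags_1, tags_2)
```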

### `phase_2b.csv` 

This dataset was created by aggregating the token-level annotations of emotion stimuli. It also has a `sentenceID` column, which corresponds to the `phase_2a.csv` file. Accordingly, `tokensID` and `tokens` columns present each token and its position in the text.

To aggregate the emotion causes from `phase_2a.csv`, the token spans annotated by both annotators were extracted for each sentence whenever they at least partially matched. Otherwise, the annotators jointly discussed the following cases:

* no overlapping tokens
* less meaningful aggregations

The `tags` column presents aggregated IOB labels for all tokens in the corpus.
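
The partial-match check that precedes the aggregation can be sketched as follows in Python. The sketch only detects whether two annotations share tokens; how partially matching spans were merged into the final `tags` is not specified here and is therefore not modelled. The function names are hypothetical, and the tag sequences are those of `anno1` and `anno2` from the `gersti.jsonl` example.

```python
def stimulus_indices(tags):
    """Return the set of token positions that carry a B or I tag."""
    return {i for i, tag in enumerate(tags) if tag in ("B", "I")}


def needs_discussion(tags_1, tags_2):
    """True if the two annotations have no overlapping stimulus tokens."""
    return not (stimulus_indices(tags_1) & stimulus_indices(tags_2))


# Tag sequences of anno1 and anno2 from the gersti.jsonl example.
anno1 = ["O"] * 4 + ["B"] + ["I"] * 10
anno2 = ["O"] * 4 + ["B"] + ["I"] * 3 + ["O"] * 7

shared = sorted(stimulus_indices(anno1) & stimulus_indices(anno2))
print(shared, needs_discussion(anno1, anno2))
```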

