About • Configuration Requirements • Graphs and Findings • Conclusion
Crowdsourcing has experienced a big boom with the increasing interest in obtaining labelled data. It is a powerful way of obtaining data cheaply and quickly. However, annotator biases and spammers can affect the final quality of the models trained on this data.
In this project:
- I used end-to-end methods such as the Latent Truth Network (LTNet) and Fast Dawid-Skene to model annotator-specific biases as bias matrices (sketched below). These models help estimate the ground truth from the singly-labelled Organic dataset.
- I also propose a new method that combines both models: the predictions of one are used to convert the dataset into a multi-labelled one.
- I modelled the biases of each annotator to assess whether their annotations are reliable and to detect possible spammers.
- I clustered the bias matrices to discover groups of annotators that approach the labelling task in the same way. I was able to find these groups, and by adding noise to one of the annotators I was also able to cluster it as a spammer.
Even though the dataset was quite small, which made training the bias matrices hard, I was able to show that the approach can indeed model biases for annotator clustering and spammer detection.
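The core idea behind the LTNet-style bias modelling can be illustrated with a small PyTorch sketch. This is an illustrative assumption of the setup, not the project's exact architecture: the class name, the simple bag-of-embeddings encoder, and all dimensions below are mine. A text encoder produces a latent truth distribution over sentiment classes, and each annotator owns a learnable bias (confusion) matrix that maps this latent truth to the distribution over the labels that annotator would assign.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentTruthSketch(nn.Module):
    """Minimal LTNet-style model: a latent truth plus per-annotator bias matrices.

    Illustrative sketch only; encoder, dimensions, and initialisation are
    assumptions, not the project's exact architecture.
    """

    def __init__(self, embedding_dim=100, num_classes=3, num_annotators=10):
        super().__init__()
        # Simple encoder from an averaged GloVe sentence embedding
        # to unnormalised scores over sentiment classes.
        self.encoder = nn.Sequential(
            nn.Linear(embedding_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )
        # One learnable bias (confusion) matrix per annotator,
        # initialised close to the identity ("mostly reliable").
        init = torch.eye(num_classes).repeat(num_annotators, 1, 1)
        self.bias_matrices = nn.Parameter(init + 0.01 * torch.randn_like(init))

    def forward(self, sentence_embedding, annotator_id):
        # Latent truth distribution p(true label | text).
        latent_truth = F.softmax(self.encoder(sentence_embedding), dim=-1)
        # Softmax each row of the annotator's bias matrix to get a confusion matrix.
        confusion = F.softmax(self.bias_matrices[annotator_id], dim=-1)
        # p(observed label | text, annotator) = latent_truth @ confusion.
        observed = torch.bmm(latent_truth.unsqueeze(1), confusion).squeeze(1)
        return observed, latent_truth


# Toy usage: 4 sentences, each labelled by a single annotator.
model = LatentTruthSketch()
emb = torch.randn(4, 100)              # averaged GloVe sentence embeddings
ann = torch.tensor([0, 0, 3, 7])       # which annotator labelled each sentence
labels = torch.tensor([0, 2, 1, 1])    # the single observed label per sentence
observed, _ = model(emb, ann)
loss = F.nll_loss(torch.log(observed + 1e-8), labels)
loss.backward()                        # trains encoder and bias matrices end to end
```

Because the loss is taken against the observed (annotator-given) labels while the latent truth stays inside the model, both the encoder and the per-annotator bias matrices are learned jointly from the singly-labelled data.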
What is the required configuration for running this code?
- Jupyter Notebook
- Libraries used - Pandas, NumPy, Matplotlib, math, PyTorch (torch), scikit-learn (sklearn.metrics), SciPy
- GloVe encoding (see the loading sketch after this list)
- Available Latent Network Architecture
- Ingenious Hybrid Architecture
- Clustering Similar Annotators
- Spammer Detection
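Since the sentences are represented with GloVe embeddings, a small loading sketch may help reproduce the setup. The file name below assumes the standard pre-trained `glove.6B.100d.txt` release from Stanford NLP; replace it with whichever GloVe file is actually used.

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    """Load GloVe vectors into a {word: vector} dict.

    The file name is an assumption (the standard 100-d Stanford release);
    adjust it to the embedding file used in your setup.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

def sentence_embedding(sentence, embeddings, dim=100):
    """Average the GloVe vectors of the known tokens in a sentence."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype=np.float32)
```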
- Using the Latent Truth Network architecture and bias modelling on the singly-labelled crowdsourced data, I showed how an end-to-end model can be created for finding the bias of the annotators.
- I worked out how the two approaches, LTNet and Fast Dawid-Skene (FDS), differ in the way they model bias.
- LTNet is trained end to end on the actual text: it learns attention vectors during training and infers a latent truth, which then becomes the common reference for all annotators' bias matrices.
- In Fast Dawid-Skene, the ground-truth sentiment of each sentence is found from a multi-labelled dataset. We found that the multi-labelled predictions produced in experiment 1 can be chained into the FDS input, because the output of experiment 1 and the input of FDS are aligned (see the Dawid-Skene sketch after this list). This approach can thus help find the ground truth in a singly-labelled dataset, which is normally considered very difficult.
- The bias and confusion matrices produced by our improved architecture were able to precisely detect the spammer among the annotators.
- Clustering the annotators based on their bias matrices also worked with our architecture (a clustering sketch follows this list). Moreover, the learned biases remain robust under very noisy conditions, making the approach potentially usable outside lab conditions.
- Since I worked only on a singly-labelled dataset, the takeaway is: if finding the ground truth is not strictly necessary, the chaining step can be skipped and annotator bias can be found with the end-to-end latent truth model alone; if ground-truth labels are needed, chaining the two architectures is the promising route.
- I believe more experiments on datasets from different sources are needed to solidify our conclusions about the hybrid approach of finding ground truths, since the singly-labelled crowdsourcing use case was performed on a very small dataset. Furthermore, there may be many use cases other than sentiment analysis that could be explored.
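To make the chaining idea above concrete, here is a compact sketch of the classical Dawid-Skene EM aggregation applied to a multi-labelled vote tensor such as the per-annotator predictions produced by the hybrid model. This is a plain Dawid-Skene implementation written for clarity, not the project's Fast Dawid-Skene code; the function name, variable names, and tensor layout are assumptions.

```python
import numpy as np

def dawid_skene(votes, n_iter=50, tol=1e-6):
    """Classical Dawid-Skene EM on a vote tensor.

    votes: array of shape (n_items, n_annotators, n_classes); each slice
    votes[i, a] is a (possibly soft) distribution over the label that
    annotator a gave to item i, or all zeros if a did not label i.
    Returns (ground-truth posteriors, per-annotator confusion matrices).
    """
    # Initialise the ground-truth posteriors with a (soft) majority vote.
    truth = votes.sum(axis=1)
    truth = truth / (truth.sum(axis=1, keepdims=True) + 1e-12)

    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        priors = truth.mean(axis=0)
        confusion = np.einsum("ik,iaj->akj", truth, votes)   # (annotator, true, observed)
        confusion /= confusion.sum(axis=2, keepdims=True) + 1e-12

        # E-step: recompute the posterior over the true label of each item.
        log_post = np.log(priors + 1e-12) + np.einsum(
            "iaj,akj->ik", votes, np.log(confusion + 1e-12)
        )
        new_truth = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        new_truth /= new_truth.sum(axis=1, keepdims=True)

        if np.abs(new_truth - truth).max() < tol:
            truth = new_truth
            break
        truth = new_truth

    return truth, confusion
```

In the chained setup described above, the per-annotator label distributions predicted by the end-to-end model would be stacked into `votes` alongside the single real annotation of each sentence, which is exactly what turns the singly-labelled dataset into a multi-labelled input for the aggregation step.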
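Similarly, the annotator clustering and spammer detection findings can be illustrated with a short sketch: flatten each learned bias matrix, cluster the flattened matrices with k-means, and flag annotators whose confusion matrices are close to uniform (a typical random-labelling signature). The cluster count, the scoring rule, and the toy matrices are illustrative assumptions, not the project's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_annotators(bias_matrices, n_clusters=3):
    """Group annotators by the shape of their (row-normalised) bias matrices."""
    flat = bias_matrices.reshape(len(bias_matrices), -1)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(flat)

def spammer_scores(bias_matrices):
    """Score each annotator by how far their confusion matrix is from uniform.

    An annotator who labels at random has rows close to the uniform distribution,
    so their score is low; reliable annotators score higher.
    """
    n_classes = bias_matrices.shape[-1]
    uniform = np.full(n_classes, 1.0 / n_classes)
    return np.abs(bias_matrices - uniform).mean(axis=(1, 2))

# Toy example with 5 annotators and 3 classes; the last one labels at random.
rng = np.random.default_rng(0)
reliable = np.stack([np.eye(3) * 0.8 + 0.1 for _ in range(4)])
spammer = np.full((1, 3, 3), 1.0 / 3)
bias = np.concatenate([reliable, spammer]) + 0.01 * rng.random((5, 3, 3))

print(cluster_annotators(bias, n_clusters=2))   # the random annotator falls into its own cluster
print(spammer_scores(bias))                     # the last score is clearly the lowest
```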