Chat Record Image Dataset

#optical character recognition #sentiment analysis #text classification #language model training #text recognition #natural language processing #image text recognition #chat analysis
  • 500 records
  • 1.3 GB
  • JPG
  • CC-BY-NC-SA 4.0
  • MOBIUSI INC
Updated: 2026-03-05

AI Analysis & Value Prop

With the rapid development of communication technology, chat records have become a common form of data in daily life and work, and parsing chat record images effectively matters for information processing efficiency. However, existing text recognition and natural language processing techniques often struggle with low recognition accuracy, complex backgrounds, and diverse fonts when handling varied image text. This dataset collects diverse chat record images to help researchers extract text from images and to improve the accuracy and efficiency of automated recognition.

Chat screenshots were captured on various mobile devices under different lighting and background conditions to ensure diversity. For quality control, we use a three-round annotation process to ensure accuracy and consistency; the annotation team consists of 50 language technology experts. The data undergoes OCR preprocessing to generate structured text, which speeds up analysis. Images are stored in JPG format and organized by conversation topic for easy retrieval.

The core advantages of the dataset are the accuracy and diversity of its annotations, with annotation accuracy exceeding 95%. We introduce a self-supervised learning annotation method, combined with data augmentation techniques, to support more comprehensive language model training. The dataset improves overall performance in chat record analysis, for example a 15% increase in recognition accuracy. Compared with similar datasets on the market, it offers higher annotation quality and richer scene diversity. It also provides scarce corpora, a valuable resource for low-resource language research.

The dataset scales well, suits a range of natural language processing tasks, and can support cross-domain applications and innovative research.
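The description says the JPG images are organized by conversation topic but does not document the actual layout. As a minimal sketch, assuming one subdirectory per topic (this layout and the function name `group_images_by_topic` are assumptions, not part of the published dataset), the images could be indexed like this:

```python
from collections import defaultdict
from pathlib import Path


def group_images_by_topic(root):
    """Map each topic to its sorted JPG paths.

    Assumption: the topic is the name of the folder that directly
    contains the image. The real dataset layout may differ.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        # Accept .jpg / .jpeg regardless of letter case.
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            groups[path.parent.name].append(path)
    return dict(groups)
```

Calling `group_images_by_topic("path/to/dataset")` would return a dictionary keyed by folder name; if the download uses a manifest file instead of folders, the grouping key would need to come from that manifest.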

Dataset Insights

Sample Examples

9b2b71ae**.jpg|1034*1937|121.02 KB

dd659e3c**.jpg|1047*2011|152.04 KB

5440758b**.jpg|1000*1920|150.17 KB

1a102906**.jpg|1050*1960|159.82 KB

79cdf4bb**.jpg|1046*1979|143.74 KB

07c2f324**.jpg|1044*1980|132.74 KB

a05b3d2f**.jpg|1047*1993|176.59 KB
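Each sample entry above follows the pattern `name|A*B|size KB`. A minimal sketch for turning such a line into a structured record follows; reading the pixel pair as width then height is an assumption, and the names `SampleEntry` and `parse_entry` are illustrative only:

```python
import re
from typing import NamedTuple


class SampleEntry(NamedTuple):
    filename: str   # masked name exactly as listed, e.g. "9b2b71ae**.jpg"
    width: int      # assumed: first number of the pixel pair
    height: int     # assumed: second number of the pixel pair
    size_kb: float  # file size in KB as listed


# name | digits*digits | number KB
ENTRY_RE = re.compile(
    r"^(?P<name>[^|]+)\|(?P<w>\d+)\*(?P<h>\d+)\|(?P<size>\d+(?:\.\d+)?) KB$"
)


def parse_entry(line):
    """Parse one listing line; raise ValueError if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample entry: {line!r}")
    return SampleEntry(
        match["name"], int(match["w"]), int(match["h"]), float(match["size"])
    )


entry = parse_entry("9b2b71ae**.jpg|1034*1937|121.02 KB")
print(entry.width, entry.height, entry.size_kb)  # prints: 1034 1937 121.02
```

The masked `**` in the file names is kept as-is, since the full names are not shown in the listing.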

Technical Specifications

Compliance Statement

Authorization Type: CC-BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike)
Commercial Use: Requires an exclusive subscription or authorization contract (billed monthly or per invocation)
Privacy and Anonymization: No PII, no real company names; simulated scenarios follow industry standards
Compliance System: Compliant with China's Data Security Law and EU GDPR; supports enterprise data access logs

Cite this Work

@dataset{Mobiusi,
  title={Chat Record Image Dataset},
  author={Mobiusi},
  url={https://www.mobiusi.com/datasets/5c012d4f901038f1b591c5cf27b692dd?dataset_scene_cate_type=9}
}

Using this in research? Please cite us.
