Social Media Lab

We use easyocr for optical character recognition (OCR); see its documentation for more complex configurations. On CPU only, this step takes anywhere from minutes to hours, depending on the number of images. OCR can also be outsourced, e.g. to the Google Vision API; see future sessions (and Memespector) for this.
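Because a CPU-only run can take hours, it may help to store each result as soon as it is available, so that an interrupted session does not have to start over. The following is only a minimal sketch of that idea, not part of easyocr: the helper name `ocr_with_cache`, the `ocr_fn` parameter and the JSON cache file are assumptions made for illustration. `ocr_fn` stands for any function that maps an image path to the recognized text.

```python
import json
import os
from typing import Callable, Dict, Iterable


def ocr_with_cache(paths: Iterable, ocr_fn: Callable[[str], str], cache_file: str) -> Dict[str, str]:
    """Run ocr_fn once per image path and persist the results after every image."""
    cache: Dict[str, str] = {}
    # Reload results from an earlier (possibly interrupted) run, if any.
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as fh:
            cache = json.load(fh)

    for path in paths:
        # Skip missing values (e.g. NaN for rows without an image)
        # and images that were already processed.
        if not isinstance(path, str) or path in cache:
            continue
        cache[path] = ocr_fn(path)
        # Write the cache after each image so little work is lost on a crash.
        with open(cache_file, "w", encoding="utf-8") as fh:
            json.dump(cache, fh, ensure_ascii=False)
    return cache
```

The returned dictionary maps each image path to its text, so it could be mapped back onto a DataFrame column of image paths. Rewriting the whole JSON file after every image is simple but not efficient for very large collections.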

In [6]:

!pip -q install easyocr


In [7]:

# Imports for OCR
import easyocr
reader = easyocr.Reader(['de','en'])

WARNING:easyocr.easyocr:Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
WARNING:easyocr.easyocr:Downloading detection model, please wait. This may take several minutes depending upon your network connection.

Progress: |██████████████████████████████████████████████████| 100.0% Complete

WARNING:easyocr.easyocr:Downloading recognition model, please wait. This may take several minutes depending upon your network connection.

Progress: |██████████████████████████████████████████████████| 100.0% Complete

We define a very simple function that returns all recognized text as a single string. The readtext method returns a list of text areas; in this example we simply concatenate them, so the order of the words is sometimes incorrect.

We also write the DataFrame to a CSV file on Google Drive to keep our results.

In [8]:

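def run_ocr(image_path):
    """Return all text recognized in one image as a single string."""
    # detail=0 makes readtext return only the text of each area (no boxes, no scores)
    ocr_result = reader.readtext(image_path, detail=0)
    ocr_text = " ".join(ocr_result)
    return ocr_text

# df is the stories DataFrame from the previous steps; image_file holds the image paths
df['ocr_text'] = df['image_file'].apply(run_ocr)

# Saving Results to Drive
# (to_csv also writes the index as an unnamed first column unless index=False is passed)
df.to_csv('/content/drive/MyDrive/2022-11-09-Stories-Exported.csv')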
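The word order problem mentioned above can sometimes be reduced by using the positions of the text areas. With its default `detail=1`, `readtext` returns one `(bounding box, text, confidence)` entry per area, where the bounding box is a list of four `[x, y]` corner points. The sketch below sorts such entries top to bottom and then left to right within a line. The function name `join_in_reading_order` and the `line_tolerance` value (in pixels) are assumptions chosen for illustration, and the heuristic will not suit every layout.

```python
from typing import List, Sequence, Tuple

# One detection in the shape returned by readtext with detail=1:
# (four [x, y] corner points, recognized text, confidence).
Detection = Tuple[Sequence[Sequence[float]], str, float]


def join_in_reading_order(detections: List[Detection], line_tolerance: float = 10.0) -> str:
    """Join detected text top to bottom, then left to right within a line."""
    # Reduce each box to its top edge (smallest y) and left edge (smallest x).
    boxes = [
        (min(p[1] for p in bbox), min(p[0] for p in bbox), text)
        for bbox, text, _conf in detections
    ]
    boxes.sort(key=lambda b: b[0])  # order by top edge first

    lines = []       # each line is a list of (left, text) pairs
    line_top = None  # top edge of the first box in the current line
    for top, left, text in boxes:
        # Start a new line when the box lies clearly below the current line.
        if line_top is None or top - line_top > line_tolerance:
            lines.append([])
            line_top = top
        lines[-1].append((left, text))

    # Within each line, order the boxes by their left edge.
    return " ".join(text for line in lines for _left, text in sorted(line))
```

It would be called as `join_in_reading_order(reader.readtext(image_path))`. easyocr's `readtext` also accepts `paragraph=True`, which lets the library merge nearby areas itself and may be enough in many cases.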

In [27]:

df.head()

	Unnamed: 0.1	Unnamed: 0	ID	Time of Posting	Type of Content	video_url	image_url	Username	Video Length (s)	Expiration	...	Is Verified	Stickers	Accessibility Caption	Attribution URL	video_file	audio_file	duration	sampling_rate	image_file	ocr_text
0	0	0	3234500408402516260_1383567706	2023-11-12 15:21:53	Image	NaN	NaN	news24	NaN	2023-11-13 15:21:53	...	True	[]	Photo by News24 on November 12, 2023. May be a...	https://www.threads.net/t/CzjB80Zqme0	NaN	NaN	NaN	NaN	media/images/3234500408402516260_1383567706.jpg	Keee WEEKEND NEWS24 PLUS: TESTING FORDS RANGER...
1	1	1	3234502795095897337_8537434	2023-11-12 15:26:39	Image	NaN	NaN	bild	NaN	2023-11-13 15:26:39	...	True	[]	Photo by BILD on November 12, 2023. May be an ...	NaN	NaN	NaN	NaN	NaN	media/images/3234502795095897337_8537434.jpg	Dieses Auto ist einfach der Horror Du glaubst ...
2	2	2	3234503046678453705_8537434	2023-11-12 15:27:10	Image	NaN	NaN	bild	NaN	2023-11-13 15:27:10	...	True	[]	Photo by BILD on November 12, 2023. May be an ...	NaN	NaN	NaN	NaN	NaN	media/images/3234503046678453705_8537434.jpg	Touchdown bei Taylor Swift und Travis Kelce De...
3	3	3	3234503930728728807_8537434	2023-11-12 15:28:55	Image	NaN	NaN	bild	NaN	2023-11-13 15:28:55	...	True	[]	Photo by BILD on November 12, 2023. May be an ...	NaN	NaN	NaN	NaN	NaN	media/images/3234503930728728807_8537434.jpg	Horror-Diagnose für Barton Cowperthwaite Netfl...
4	4	4	3234504185910204562_8537434	2023-11-12 15:29:25	Image	NaN	NaN	bild	NaN	2023-11-13 15:29:25	...	True	[]	Photo by BILD on November 12, 2023. May be an ...	NaN	NaN	NaN	NaN	NaN	media/images/3234504185910204562_8537434.jpg	3v Bilde GG JJ Besorgniserregende Ufo-Aktivitä...

5 rows × 21 columns

OCR using easyocr