OCR using easyocr

We use easyocr for text recognition; see the easyocr documentation for more advanced configurations. On CPU only, this step takes anywhere from minutes to hours, depending on the number of images. OCR can also be outsourced to an external service (e.g. the Google Vision API); see future sessions (and Memespector) for this.

In [6]:
!pip -q install easyocr
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.9/2.9 MB 29.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 908.3/908.3 kB 57.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 29.6 MB/s eta 0:00:00
In [7]:
# Imports for OCR
import easyocr
reader = easyocr.Reader(['de','en'])
WARNING:easyocr.easyocr:Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
WARNING:easyocr.easyocr:Downloading detection model, please wait. This may take several minutes depending upon your network connection.
Progress: |██████████████████████████████████████████████████| 100.0% Complete
WARNING:easyocr.easyocr:Downloading recognition model, please wait. This may take several minutes depending upon your network connection.
Progress: |██████████████████████████████████████████████████| 100.0% Complete
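The warning above shows that easyocr fell back to the CPU. As a minimal sketch, one could check for a GPU before building the reader and pass the result to the `gpu` argument of `easyocr.Reader`. The helper name `gpu_available` is hypothetical, and the check only covers CUDA (not MPS); in Colab a GPU runtime has to be selected first.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed, so there is nothing to check
    import torch
    return torch.cuda.is_available()


use_gpu = gpu_available()
print("GPU available:", use_gpu)

# Assumed usage with the reader from the cell above:
# reader = easyocr.Reader(['de', 'en'], gpu=use_gpu)
```

If this prints `False`, the reader works as before, only slower.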

We define a very simple function that returns all recognized text as a single string. The readtext method returns a list of text areas; in this example we simply concatenate their strings, so the order of words is sometimes incorrect.

We also write the resulting DataFrame to a CSV file on Google Drive so that the results are preserved.

In [8]:
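def run_ocr(image_path):
    # detail=0 returns only the recognized strings (no bounding boxes or confidence scores)
    ocr_result = reader.readtext(image_path, detail=0)
    # Join all text fragments into one string, in the order easyocr returns them
    ocr_text = " ".join(ocr_result)
    return ocr_text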

df['ocr_text'] = df['image_file'].apply(run_ocr)

# Saving Results to Drive
df.to_csv('/content/drive/MyDrive/2022-11-09-Stories-Exported.csv')
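As noted above, simple concatenation can scramble the word order. A rough sketch of one way to reduce this: call readtext with its default detail=1, which returns (bounding box, text, confidence) entries, then sort the fragments top-to-bottom and left-to-right. The function name `join_in_reading_order`, the `line_tolerance` parameter, and the sample data are assumptions for illustration, not part of easyocr or of the original notebook.

```python
def join_in_reading_order(results, line_tolerance=0.5):
    """Join OCR fragments top-to-bottom, then left-to-right.

    `results` is a list of (bbox, text, confidence) entries, where bbox is a
    list of four [x, y] corner points.
    """
    boxes = []
    for bbox, text, _conf in results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        boxes.append({
            "x": min(xs),                    # left edge
            "y": (min(ys) + max(ys)) / 2,    # vertical center
            "h": max(ys) - min(ys),          # box height
            "text": text,
        })

    # Group boxes into lines: a box joins the current line if its vertical
    # center is close to the center of the first box of that line.
    boxes.sort(key=lambda b: b["y"])
    lines = []
    for box in boxes:
        if lines:
            first = lines[-1][0]
            limit = line_tolerance * max(box["h"], first["h"])
            if abs(box["y"] - first["y"]) <= limit:
                lines[-1].append(box)
                continue
        lines.append([box])

    # Within each line, order fragments by their left edge.
    words = [b["text"] for line in lines for b in sorted(line, key=lambda b: b["x"])]
    return " ".join(words)


# Hand-made sample data (not real OCR output), deliberately out of order
sample = [
    ([[120, 10], [200, 10], [200, 30], [120, 30]], "Swift", 0.90),
    ([[10, 50], [90, 50], [90, 70], [10, 70]], "und", 0.80),
    ([[10, 12], [110, 12], [110, 32], [10, 32]], "Taylor", 0.95),
]
print(join_in_reading_order(sample))

# Assumed usage with the reader defined earlier:
# text = join_in_reading_order(reader.readtext(image_path))
```

This heuristic assumes roughly horizontal text in a single column; rotated or multi-column layouts would need a different grouping rule.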
In [27]:
df.head()
|   | Unnamed: 0.1 | Unnamed: 0 | ID | Time of Posting | Type of Content | video_url | image_url | Username | Video Length (s) | Expiration | ... | Is Verified | Stickers | Accessibility Caption | Attribution URL | video_file | audio_file | duration | sampling_rate | image_file | ocr_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3234500408402516260_1383567706 | 2023-11-12 15:21:53 | Image | NaN | NaN | news24 | NaN | 2023-11-13 15:21:53 | ... | True | [] | Photo by News24 on November 12, 2023. May be a... | https://www.threads.net/t/CzjB80Zqme0 | NaN | NaN | NaN | NaN | media/images/3234500408402516260_1383567706.jpg | Keee WEEKEND NEWS24 PLUS: TESTING FORDS RANGER... |
| 1 | 1 | 1 | 3234502795095897337_8537434 | 2023-11-12 15:26:39 | Image | NaN | NaN | bild | NaN | 2023-11-13 15:26:39 | ... | True | [] | Photo by BILD on November 12, 2023. May be an ... | NaN | NaN | NaN | NaN | NaN | media/images/3234502795095897337_8537434.jpg | Dieses Auto ist einfach der Horror Du glaubst ... |
| 2 | 2 | 2 | 3234503046678453705_8537434 | 2023-11-12 15:27:10 | Image | NaN | NaN | bild | NaN | 2023-11-13 15:27:10 | ... | True | [] | Photo by BILD on November 12, 2023. May be an ... | NaN | NaN | NaN | NaN | NaN | media/images/3234503046678453705_8537434.jpg | Touchdown bei Taylor Swift und Travis Kelce De... |
| 3 | 3 | 3 | 3234503930728728807_8537434 | 2023-11-12 15:28:55 | Image | NaN | NaN | bild | NaN | 2023-11-13 15:28:55 | ... | True | [] | Photo by BILD on November 12, 2023. May be an ... | NaN | NaN | NaN | NaN | NaN | media/images/3234503930728728807_8537434.jpg | Horror-Diagnose für Barton Cowperthwaite Netfl... |
| 4 | 4 | 4 | 3234504185910204562_8537434 | 2023-11-12 15:29:25 | Image | NaN | NaN | bild | NaN | 2023-11-13 15:29:25 | ... | True | [] | Photo by BILD on November 12, 2023. May be an ... | NaN | NaN | NaN | NaN | NaN | media/images/3234504185910204562_8537434.jpg | 3v Bilde GG JJ Besorgniserregende Ufo-Aktivitä... |

5 rows × 21 columns
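Some of the ocr_text values above contain stray fragments (for example "Keee" or "3v Bilde GG JJ"). One possible way to reduce such noise, sketched here under assumptions, is to use the confidence score that readtext returns with its default detail=1 and keep only fragments above a threshold. The function name `filter_by_confidence`, the threshold of 0.4, and the sample values are illustrative only; a suitable threshold would have to be chosen by inspecting real results.

```python
def filter_by_confidence(results, min_confidence=0.4):
    """Keep only fragments whose confidence is at least `min_confidence`.

    `results` is a list of (bbox, text, confidence) entries.
    """
    kept = [text for _bbox, text, conf in results if conf >= min_confidence]
    return " ".join(kept)


# Hand-made sample data (not real OCR output); the bounding boxes are unused here
sample_results = [
    (None, "HELLO", 0.93),
    (None, "x7#", 0.12),
    (None, "WORLD", 0.88),
]
print(filter_by_confidence(sample_results))

# Assumed usage with the reader defined earlier:
# text = filter_by_confidence(reader.readtext(image_path), min_confidence=0.4)
```

A stricter threshold removes more noise but can also drop correctly recognized words, so it is worth comparing a few images by hand.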