π° Assignment: OCR & Digital Analysis of a Historical Newspaper Page
Case Study: El Martillo (Chiclayo, 1903β1919)
Source: https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/
β° Deadline
December 6 β until 11:59 PM (local time)
Late submissions will not be accepted.
π¦ Repository Requirement (MANDATORY)
You must create your own GitHub repository for this assignment.
Your repo must contain:
README.md
- The Python notebook
- The CSV dataset
- Your short report (Markdown)
- The image/PDF of the selected newspaper page
Name your repository:
el-martillo-ocr-[yourname]
π― Objective
Using Claude API (vision/OCR), digitize and analyze one single scanned page from the historical Peruvian newspaper El Martillo.
Your goal is to transform that page into structured data and produce a short exploratory insight.
π Required Data
Select ONE newspaper page from El Martillo (any year between 1903β1919).
Official source (required):
π https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/
Save your file as:
/data/el_martillo/page_01.png
π¦ Deliverables (all must be in your GitHub repo)
-
Python Notebook (.ipynb)
- Load the selected page
- Extract text with Claude API (vision)
- Structure the extracted output (CSV/JSON)
-
Structured Dataset (.csv)
Required columns:
date
issue_number
headline
section
type (article / advertisement / other)
text_excerpt
-
Short Report (.md)
- Explain why you selected the page
- Describe OCR challenges or distortions
- Include one simple chart
- Provide 2β3 brief insights
-
Raw media file:
- The scanned newspaper page you used (
.png or .jpg)
π§ Tasks to Complete
- Choose and download one page from the newspaper.
- Run Claude OCR to extract titles, sections, and content.
- Normalize and clean the extracted text.
- Build a CSV file with structured information.
- Write a short summary of insights.
- Upload everything to your GitHub repository.
π Evaluation (20 points)
| Criterion |
Points |
| Claude OCR extraction |
11 |
| Dataset structure & quality |
7 |
| Report clarity |
2 |
π° Assignment: OCR & Digital Analysis of a Historical Newspaper Page
Case Study: El Martillo (Chiclayo, 1903β1919)
Source: https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/
β° Deadline
December 6 β until 11:59 PM (local time)
Late submissions will not be accepted.
π¦ Repository Requirement (MANDATORY)
You must create your own GitHub repository for this assignment.
Your repo must contain:
README.mdName your repository:
el-martillo-ocr-[yourname]
π― Objective
Using Claude API (vision/OCR), digitize and analyze one single scanned page from the historical Peruvian newspaper El Martillo.
Your goal is to transform that page into structured data and produce a short exploratory insight.
π Required Data
Select ONE newspaper page from El Martillo (any year between 1903β1919).
Official source (required):
π https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/
Save your file as:
/data/el_martillo/page_01.png
π¦ Deliverables (all must be in your GitHub repo)
Python Notebook (
.ipynb)Structured Dataset (
.csv)Required columns:
dateissue_numberheadlinesectiontype(article / advertisement / other)text_excerptShort Report (
.md)Raw media file:
.pngor.jpg)π§ Tasks to Complete
π Evaluation (20 points)