[Computers and Vision] Paper Presentation: Detection & Segmentation

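Object Detection

: The task of finding what is in an image or video and where it is
The output is typically a **bounding box** and a class label

Autonomous driving: detecting pedestrians, vehicles, and traffic lights
Video surveillance: detecting anomalous behavior

Segmentation

: The task of classifying objects at the pixel level

Performs a finer-grained analysis than detection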
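
To make the difference between the two outputs concrete, here is a minimal sketch in plain Python. The toy mask and the helper name `mask_to_bbox` are illustrative assumptions, not taken from any of the papers below. It derives a bounding box (the detection-style output) from a pixel mask (the segmentation-style output) and compares how many pixels each one covers.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy segmentation mask: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

bbox = mask_to_bbox(mask)
x0, y0, x1, y1 = bbox
mask_pixels = sum(sum(row) for row in mask)       # pixels labeled by segmentation
box_pixels = (x1 - x0 + 1) * (y1 - y0 + 1)        # pixels covered by the box
print(bbox, mask_pixels, box_pixels)
```

The box covers more pixels than the mask, which is the sense in which segmentation is the finer-grained of the two.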

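1. Chargrid-OCR: End-to-end Trainable Optical Character Recognition for Printed Documents using Instance Segmentation (2019)

: An OCR approach for printed documents that uses a character grid (chargrid) representation to combine segmentation (pixel level) with object detection (character boxes)

Character instance-level detection
- Treats each character as one object (instance) and predicts its bounding box.
Chargrid-based segmentation
- Represents the document image at the pixel level by the region each character occupies, and performs semantic segmentation on it.
End-to-end trainable architecture
- Detection and segmentation are fully integrated, so the result is more consistent than the traditional OCR pipeline (detection → recognition).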

https://arxiv.org/abs/1909.04469

2. VISTA-OCR: Towards generative and interactive end to end OCR models (2025)

: A recent study that handles text detection and text recognition in a single lightweight generative (Transformer-based) model

https://arxiv.org/abs/2504.03621

3. OCR-free Document Understanding Transformer (WIP)

: Proposes a fully end-to-end, OCR-free document understanding model intended to replace conventional OCR-based Visual Document Understanding (VDU)

https://arxiv.org/abs/2111.15664

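1. Key summary of the paper (Chargrid-OCR)

Problem definition	Conventional OCR is built as a complex detection → recognition pipeline, which leads to error accumulation and limited domain adaptability
Key idea	Recasts OCR as a character instance segmentation problem
Key outputs	(1) Chargrid (character segmentation map): classifies each pixel into a character class (2) Character boxes (detection result): predicts a bounding box for each character
Model architecture	U-Net-based encoder-decoder CNN with dual decoders (segmentation / detection)
Key technical elements	① Chargrid representation ② Graphcore algorithm (filtering of ultra-dense character boxes) ③ Character-to-word clustering (word grouping)
Main results	116× faster than Tesseract v4, with accuracy (WRR) of up to 82%
Datasets	- Synthetic (Wiki-based, clean labels) - Real (EDGAR financial documents, noisy labels) → combining the two datasets gives the best performance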

2. Structural analysis: integrated Detection + Segmentation design

2.1 Pipeline overview

Input document image
       ↓
Encoder (Feature Extraction)
       ↓
 ┌──────────────────────┬────────────────────────────────┐
 │ Segmentation Decoder │ Detection Decoder              │
 │ (generates chargrid) │ (Box, Center, Size prediction) │
 └──────────────────────┴────────────────────────────────┘
       ↓
Post-processing (Graphcore + Word Clustering)
       ↓
Final OCR output (characters, words, positions)

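2.2 Chargrid (character grid representation)

Goal: predict which character class each pixel in the document belongs to
Method: use semantic segmentation to classify each pixel as a character (89 classes in total)
Example:
- A→1, B→2, ..., Z→26, plus digits and special characters
Result: the document becomes a pixel-level character segmentation map, which also reflects the visual layout structure

💡 Tip:
Chargrid is not just character recognition; it is a representation that can be extended to 2D document structure understanding (Visual Document Understanding).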
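
As a rough illustration of the chargrid idea, the sketch below rasterizes character boxes into a 2D grid of class indices. This is how a training target could be built; in the paper the model predicts such a map from the image. The function name `build_chargrid`, the box format `(x0, y0, x1, y1)` with exclusive upper bounds, and the use of 0 for background are assumptions made for the example. Only the A→1 ... Z→26 mapping follows the text above.

```python
def build_chargrid(height, width, chars, class_of):
    """Rasterize (character, box) pairs into a height x width grid of class indices."""
    grid = [[0] * width for _ in range(height)]  # 0 = background (assumed)
    for ch, (x0, y0, x1, y1) in chars:
        cls = class_of[ch]
        # Clip the box to the grid and fill it with the character's class index.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = cls
    return grid


# A→1, B→2, ..., Z→26 as in the example above (digits and symbols omitted here).
class_of = {c: i + 1 for i, c in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ")}

# Two characters side by side on a tiny 3 x 6 "page".
chars = [("A", (0, 0, 2, 3)), ("B", (3, 0, 5, 3))]
grid = build_chargrid(3, 6, chars, class_of)

for row in grid:
    print(row)
```

Because every pixel keeps its position, the grid preserves the page layout as well as the character identities.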

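2.3 Object Detection module (character box detection)

Goal: detect each character as an object
Predicted outputs:
- Bc: box mask (binary)
- (Xc, Yc): offset from each pixel to the character center
- (Wc, Hc): width and height of the character box
Technical note: borrows the SSD (Single Shot Detector) architecture (anchor-free approach)

💡 Tip:
Detection in Chargrid-OCR is "ultra-dense" detection.
Unlike object detection in natural images, a single page contains thousands of objects (characters), so efficient filtering is key.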
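
The predicted outputs listed above can be turned into candidate boxes roughly as follows. This is a minimal sketch under stated assumptions: the maps are nested lists indexed as `[y][x]`, offsets and sizes are in pixel units, and the 0.5 threshold on the box mask is a placeholder rather than a value from the paper. `decode_boxes` is a hypothetical helper name.

```python
def decode_boxes(box_mask, x_off, y_off, widths, heights, threshold=0.5):
    """Decode per-pixel predictions into (x0, y0, x1, y1, score) candidate boxes."""
    boxes = []
    for y, row in enumerate(box_mask):
        for x, score in enumerate(row):
            if score < threshold:          # Bc: skip pixels not on a character
                continue
            cx = x + x_off[y][x]           # (Xc, Yc): offset to the character center
            cy = y + y_off[y][x]
            w, h = widths[y][x], heights[y][x]   # (Wc, Hc): box size
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes


# One row of three pixels; the first two lie on the same character.
box_mask = [[0.9, 0.8, 0.1]]
x_off = [[1.0, 0.0, 0.0]]
y_off = [[0.0, 0.0, 0.0]]
widths = [[2.0, 2.0, 2.0]]
heights = [[1.0, 1.0, 1.0]]

candidates = decode_boxes(box_mask, x_off, y_off, widths, heights)
print(candidates)
```

Note that both foreground pixels emit the same box. With every character pixel voting, the number of candidates grows far beyond the number of characters, which is why the filtering step matters.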

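2.4 Graphcore (efficient box filtering)

Problem: thousands of characters → hundreds of thousands of predicted boxes → NMS (non-maximum suppression) costs O(N²) and is inefficient
Solution:
- Treat each pixel as a node in a graph
- If pixel A predicts pixel B as a character center, add an edge A→B
- Apply the k-core operation (k=1) to keep only the loops (center pixels) → linear-time O(N) filtering
- Then apply NMS → only 1-2 boxes remain per character

💡 Tip:
Graphcore is an OCR-specific linear-time object filtering algorithm that removes the practical bottleneck of dense detection.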
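
A simplified sketch of the filtering idea follows. It assumes each candidate pixel's predicted center has already been rounded to integer pixel coordinates, and it reads "loop" as a self-loop, i.e. a pixel that predicts itself as the center. That is a simplification for illustration and may not match the paper's exact k-core formulation. `graphcore_filter` is a hypothetical name.

```python
def graphcore_filter(predicted_center):
    """Keep only pixels whose edge points back to themselves (self-loops).

    predicted_center maps each candidate pixel (x, y) to the pixel (x, y)
    it predicts as its character center, i.e. one outgoing edge per node.
    A single pass over the nodes suffices, so the cost is O(N).
    """
    return [pixel for pixel, center in predicted_center.items() if pixel == center]


# Five candidate pixels belonging to two characters.
predicted_center = {
    (0, 0): (1, 0),
    (1, 0): (1, 0),   # center pixel of the first character
    (2, 0): (1, 0),
    (4, 0): (5, 0),
    (5, 0): (5, 0),   # center pixel of the second character
}

centers = graphcore_filter(predicted_center)
print(centers)
```

Here five candidates shrink to two before any pairwise comparison happens, so the O(N²) NMS step only has to run on a handful of boxes.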

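2.5 Word clustering

Input: character-level boxes
Output: word-level clusters
Method:
- Predict a "word center" at each character pixel
- Using each character box as the reference, compute its reflection region about the word center
- Character boxes that overlap each other by 50% or more → clustered into the same word

💡 Tip:
This approach can also handle rotated text naturally,
because the orientation of the characters is computed from the word center.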
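
One possible reading of the steps above is sketched below, for axis-aligned boxes only. Several details are assumptions and not taken from the paper: the "reflection region" is taken to be the box spanning a character box and its mirror image about the predicted word center, "overlap" is measured as intersection area divided by the smaller region's area, and grouping uses a simple union-find with a pairwise loop. All function names are hypothetical.

```python
def reflect_box(box, center):
    """Mirror a box (x0, y0, x1, y1) through a center point (cx, cy)."""
    x0, y0, x1, y1 = box
    cx, cy = center
    return (2 * cx - x1, 2 * cy - y1, 2 * cx - x0, 2 * cy - y0)


def span(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_ratio(a, b):
    """Intersection area over the smaller box's area (an assumed overlap measure)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(area(a), area(b))
    return area(inter) / smaller if smaller > 0 else 0.0


def cluster_words(char_boxes, word_centers, min_overlap=0.5):
    """Group character indices whose reflection regions overlap by >= min_overlap."""
    regions = [span(b, reflect_box(b, c)) for b, c in zip(char_boxes, word_centers)]
    parent = list(range(len(regions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            if overlap_ratio(regions[i], regions[j]) >= min_overlap:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(regions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Three characters of one word (center x=15) and two of another (center x=50).
char_boxes = [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10),
              (40, 0, 50, 10), (50, 0, 60, 10)]
word_centers = [(15, 5), (15, 5), (15, 5), (50, 5), (50, 5)]

words = cluster_words(char_boxes, word_centers)
print(words)
```

The pairwise loop is quadratic in the number of characters, which is acceptable for a toy example but would need a spatial index on a real page.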


Learning_EunBi
