[Python] Extracting Information from PDF Files with Python

데이터 세상 2022. 1. 9. 10:32

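This post introduces Python libraries for extracting information from PDF files.

Choose the library according to the structure of the data you want to extract from the PDF (text, table data, etc.) and the output format you need (image file, DataFrame, etc.).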

PyPDF2

※ Korean text is not extracted correctly.

Related post: [Python/Document Data Analysis] - PyPDF2 (yumdata.tistory.com)

Documentation: https://pythonhosted.org/PyPDF2/ (PyPDF2 1.26.0 documentation)

Installation: pip install PyPDF2

Extracting file information with PyPDF2 (preview truncated in the source): from PyPDF2 import PdfFileR..
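The note about Korean text can be checked on a specific file. The following is a minimal sketch, assuming PyPDF2 2.x or later, where the reader class is named PdfReader (the 1.26.0 release referenced above exposes PdfFileReader instead). The helper names hangul_ratio and extract_page_texts and the file name sample.pdf are hypothetical, introduced only for illustration.

```python
def hangul_ratio(text):
    """Return the share of non-whitespace characters that are Hangul syllables."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for ch in chars if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(chars)


def extract_page_texts(path):
    """Extract text page by page (assumption: PyPDF2 >= 2.0 is installed)."""
    # Imported lazily so hangul_ratio can be used without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Fall back to an empty string when a page yields no text.
    return [page.extract_text() or "" for page in reader.pages]


# Usage (hypothetical file name):
# for number, text in enumerate(extract_page_texts("sample.pdf"), start=1):
#     print(f"page {number}: Hangul ratio {hangul_ratio(text):.2f}")
```

A ratio near zero on a page that visibly contains Korean suggests the text was not extracted correctly and that another library from this post may be a better fit.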

PyMuPDF

※ Korean text is extracted correctly, but tabular data is hard to identify in the output.

Related post: [Python/Document Data Analysis] - PyMuPDF (yumdata.tistory.com)

Installation: pip install PyMuPDF

Extracting file information with PyMuPDF (preview truncated in the source):

    import fitz
    pdf_doc = fitz.open("sample.pdf")
    # number of pages
    print(f"Total pages: {pdf_doc.page_count}")
    # Get the first page
    pag..
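To illustrate why tables are hard to recover from PyMuPDF output, the sketch below reads word-level tuples with page.get_text("words"), whose first five fields are x0, y0, x1, y1 and the word itself, and groups them into rows by vertical position. The helper names, the tolerance value, and sample.pdf are assumptions made for illustration; this is a rough heuristic, not a table parser.

```python
def extract_words(path, page_number=0):
    """Return word tuples for one page (assumption: a recent PyMuPDF is installed)."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return doc[page_number].get_text("words")


def group_words_into_rows(words, y_tolerance=3.0):
    """Group word tuples into rows when their top y coordinates are close."""
    rows = []  # each entry: (y0 of the first word in the row, list of word tuples)
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(word[1] - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(word)
        else:
            rows.append((word[1], [word]))
    # Within each row, order words left to right and keep only the text.
    return [[w[4] for w in sorted(ws, key=lambda w: w[0])] for _, ws in rows]


# Usage (hypothetical file name):
# for row in group_words_into_rows(extract_words("sample.pdf")):
#     print(row)
```

Rows can be approximated this way, but column boundaries and merged cells are not recovered, which is consistent with the limitation noted above.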

tabula-py

※ Table data inside a PDF file can be extracted as pandas DataFrames.

Related post: [Python/Document Data Analysis] - tabula-py (yumdata.tistory.com)

Repository: https://github.com/chezou/tabula-py (a simple wrapper of tabula-java that extracts tables from PDF into pandas DataFrames)
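A minimal sketch of table extraction with tabula-py follows. It assumes tabula-py and a Java runtime are installed (tabula-py wraps tabula-java) and that read_pdf returns a list of DataFrames, as it does in tabula-py 2.x. The helper names read_tables and non_empty and the file name sample.pdf are hypothetical.

```python
def read_tables(path, pages="all"):
    """Read every table tabula can detect on the given pages."""
    import tabula  # imported lazily; needs Java at call time

    return tabula.read_pdf(path, pages=pages, multiple_tables=True)


def non_empty(tables):
    """Drop tables without rows or columns; relies only on the .empty attribute."""
    return [table for table in tables if not table.empty]


# Usage (hypothetical file name):
# for index, table in enumerate(non_empty(read_tables("sample.pdf"))):
#     print(index, table.shape)
```

Filtering on .empty is a small convenience for pages where a table region is detected but no cells are parsed.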

Tika-python

PDF file information can also be extracted with Tika-python, as described in an earlier post.

Related post: [Python/Document Data Analysis] - tika-python (yumdata.tistory.com)

Repository: GitHub, chrismattmann/tika-python (a Python binding to the Apache Tika™ REST services)
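As a rough sketch of the Tika-python route, the code below calls parser.from_file, which returns a dictionary with "metadata" and "content" keys. It assumes the tika package and a Java runtime are available, since the library talks to an Apache Tika server. The helper names parse_pdf and content_or_empty and the file name sample.pdf are hypothetical.

```python
def parse_pdf(path):
    """Parse a file through Apache Tika and return the raw result dictionary."""
    from tika import parser  # imported lazily; starts or contacts a Tika server

    return parser.from_file(path)


def content_or_empty(parsed):
    """Return the extracted text, or an empty string when Tika found none."""
    # "content" may be None, for example for image-only PDFs.
    return (parsed.get("content") or "").strip()


# Usage (hypothetical file name):
# parsed = parse_pdf("sample.pdf")
# print(parsed.get("metadata"))
# print(content_or_empty(parsed)[:200])
```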

Textract

Related post: [Python/Document Data Analysis] - Textract (yumdata.tistory.com)

Textract extracts text from Word, PowerPoint, PDF, and other files.

Repository: https://github.com/deanmalmgren/textract (extract text from any document. no muss. no fuss.)
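A short sketch of text extraction with Textract is given below. It assumes the textract package and its system dependencies are installed, and that textract.process returns UTF-8 encoded bytes, which is its default. The helper names extract_text and collapse_blank_lines and the file name sample.pdf are hypothetical.

```python
import re


def extract_text(path):
    """Extract text from a document; the parser is chosen from the file extension."""
    import textract  # imported lazily

    raw = textract.process(path)  # bytes
    return raw.decode("utf-8")


def collapse_blank_lines(text):
    """Reduce runs of three or more newlines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# Usage (hypothetical file name):
# print(collapse_blank_lines(extract_text("sample.pdf")))
```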

References

http://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/

License: Attribution, NonCommercial, NoDerivatives
