How to Read Large Text Files in Python

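Python's file object offers several ways to read a text file. A popular one is the readlines() method, which returns a list of all the lines in the file. However, it is not suitable for large text files, because the entire file content is loaded into memory.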
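
To make the difference concrete, here is a minimal sketch that compares readlines() with line-by-line iteration over the file object (covered in the next section). It measures Python-level allocations with the standard tracemalloc module. The helper names and the 200,000-line sample file are assumptions made up for this illustration, and the exact numbers will vary by machine.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Return the peak number of bytes Python allocated while func(path) ran."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_with_readlines(path):
    with open(path) as f:
        return len(f.readlines())  # builds a list holding every line


def count_with_iteration(path):
    with open(path) as f:
        return sum(1 for _ in f)  # keeps only the current line alive


# hypothetical sample file, created only so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(200_000))
    sample_path = tmp.name

readlines_peak = peak_bytes(count_with_readlines, sample_path)
iteration_peak = peak_bytes(count_with_iteration, sample_path)
os.remove(sample_path)

print(f"readlines(): {readlines_peak / 1024:.0f} KiB peak")
print(f"iteration:   {iteration_peak / 1024:.0f} KiB peak")
```

The readlines() peak grows with the size of the file, while the iteration peak should stay roughly constant.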

Reading Large Text Files in Python

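A file object can be used as an iterator. The iterator returns one line at a time, and each line can be processed as it arrives. This never reads the whole file into memory, so it is suitable for reading large files in Python. The following code snippet reads a large file by treating the file object as an iterator.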

import resource
import os

file_name = "/Users/pankaj/abcdef.txt"

print(f'File Size is {os.stat(file_name).st_size / (1024 * 1024)} MB')

txt_file = open(file_name)

count = 0

for line in txt_file:
    # process each line here; for simplicity, this example only counts the lines
    count += 1

txt_file.close()

print(f'Number of Lines in the file is {count}')

print('Peak Memory Usage =', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
print('User Mode Time =', resource.getrusage(resource.RUSAGE_SELF).ru_utime)
print('System Mode Time =', resource.getrusage(resource.RUSAGE_SELF).ru_stime)

Running this program produces the following output.

File Size is 257.4920654296875 MB
Number of Lines in the file is 60000000
Peak Memory Usage = 5840896
User Mode Time = 11.46692
System Mode Time = 0.09655899999999999

The os module is used to print the size of the file.
The resource module, which is available on Unix-like systems, is used to check the memory and CPU time usage of the program.
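
One detail to watch when reading the peak memory figure: ru_maxrss is reported in bytes on macOS and in kilobytes on Linux. The sketch below normalizes the value to megabytes. The helper name peak_memory_mb is hypothetical, and the code assumes a Unix-like system, since the resource module is not available on Windows.

```python
import resource
import sys


def peak_memory_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"Peak Memory Usage = {peak_memory_mb():.1f} MB")
```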

We can also open the file with a with statement. In that case, we don't have to close the file object explicitly.

with open(file_name) as txt_file:
    for line in txt_file:
        # process the line
        pass
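
As a quick check of that behavior, the following sketch counts lines inside a with block and then inspects the file object's closed attribute. The sample file is a hypothetical temporary file, created only so the example runs on its own.

```python
import os
import tempfile

# hypothetical sample file, created only so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    sample_path = tmp.name

line_count = 0
with open(sample_path) as txt_file:
    for line in txt_file:
        line_count += 1

# the with block has already closed the file at this point
print(txt_file.closed)  # True
print(line_count)       # 3

os.remove(sample_path)
```

The file is also closed if the loop body raises an exception, which is the main reason to prefer with over a manual close() call.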

What if the Large File Has No Line Breaks?

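The code above works well when the content of the large file is split into many lines. But if a single line holds a large amount of data, iterating line by line will still use a lot of memory, because each line is read in full. In that case, we can read the file content into a fixed-size buffer and process it piece by piece.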

with open(file_name) as f:
    while True:
        data = f.read(1024)
        if not data:
            break
        print(data)

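The while loop with f.read(1024) reads the file in chunks of at most 1024 characters (the file is opened in text mode, so the count is in characters rather than bytes) and prints each chunk to the console. Once the whole file has been read, read() returns an empty string and the break statement ends the while loop. The same approach is useful for binary files such as images, PDFs, and Word documents, provided the file is opened in binary mode ('rb'), in which case read(1024) returns up to 1024 bytes. Here is a simple code snippet that makes a copy of a file.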

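with open(destination_file_name, 'wb') as out_file:
    with open(source_file_name, 'rb') as in_file:
        # binary mode copies the bytes unchanged; fixed-size chunks keep memory bounded
        while True:
            chunk = in_file.read(64 * 1024)
            if not chunk:
                break
            out_file.write(chunk)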
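
The copy can also be delegated to the standard library: shutil.copyfileobj() reads from one file object and writes to another in chunks. The sketch below is one way to wrap it. The function name copy_file_chunked and the 1 MiB default chunk size are assumptions chosen for illustration, not part of the original article.

```python
import shutil


def copy_file_chunked(source_path, destination_path, chunk_size=1024 * 1024):
    """Copy a file in binary mode, reading chunk_size bytes at a time."""
    with open(source_path, "rb") as in_file, open(destination_path, "wb") as out_file:
        # copyfileobj loops over read(chunk_size) / write() until the source is exhausted
        shutil.copyfileobj(in_file, out_file, chunk_size)
```

Because both files are opened in binary mode, this works for text and binary files alike, and memory use does not depend on whether the file contains line breaks.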

Reference: StackOverflow question