week2_day3

Building a RAG System and Preprocessing: Python Classes and Document Loading

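Overview

This lecture covers the preprocessing steps and core Python code design needed to implement a Retrieval Augmented Generation (RAG) workflow.

Using the OpenAI API
File and folder access; date and time libraries
Embeddings and cosine similarity
The DocumentStore class for storing and searching documents
Document loading functions (load_text_files, load_csv_files, load_json_file)
Implementing the RAG pipeline (retrieve → provide context → generate)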

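Key Concepts

Concept	Description
RAG	Retrieval + Augmented + Generation. Retrieves external knowledge (documents) and passes it to the LLM as context so that it can generate more accurate answers.
Embedding	Converts text into a high-dimensional vector. Uses the `text-embedding-3-small` model.
Cosine similarity	Measures the similarity between two vectors; used to rank search results.
Document store	Stores document records (`Document`) and manages their ID, content, and metadata. Supports adding, counting, keyword search, and similarity-based retrieval.
Class structure	Methods such as `__init__`, `add`, `count`, `search`, and `retrieve` encapsulate the store's behavior; in these notes generation is handled by the separate `rag_chain` function rather than a `generate` method.
File loading	Reads text, CSV, and JSON files and converts them into `Document` records.
Fine-tuning vs. RAG	Fine-tuning retrains the model (costly in money and time); RAG retrieves documents and supplies them as context (cheaper and faster).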
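
To make the cosine similarity row concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up toy values rather than real model output, and `cosine_sim` is a hypothetical helper name; the notes themselves use scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values for illustration only)
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

# Rank document names from most to least similar to the query
ranked = sorted(docs, key=lambda name: cosine_sim(query, docs[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Because cosine similarity depends only on direction, `doc_a` scores 1.0 even though it is twice as long as the query.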

Detailed Notes

1. Environment Setup and Library Imports

import os
from pathlib import Path
import csv
import json
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
from sklearn.metrics.pairwise import cosine_similarity

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment

Load OPENAI_API_KEY from the .env file
ChatOpenAI model: gpt-4o-mini
OpenAIEmbeddings model: text-embedding-3-small
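
A missing key otherwise tends to surface only at the first API call, so it may help to fail fast right after the environment is loaded. The sketch below uses only the standard library; `require_env` is a hypothetical helper, not part of the lecture code.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before running."
        )
    return value

# Intended usage, after load_dotenv() has run:
#   api_key = require_env("OPENAI_API_KEY")
```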

2. Creating the Sample Data Folder

sample_dir = Path("sample_data")
sample_dir.mkdir(exist_ok=True)

Example files such as companypolicy.txt and airreport.txt are saved here.

3. Dictionary of Sample Documents

documents = {
    "companypolicy.txt": "content...",
    "airreport.txt": "content..."
}
for filename, content in documents.items():
    with open(sample_dir / filename, "w", encoding="utf-8") as f:
        f.write(content)

4. The `DocumentStore` Class

class DocumentStore:
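    def __init__(self):
        self.documents = []   # list of dicts: id, content, source, vector
        self.next_id = 1
        # Name the model explicitly; the notes specify text-embedding-3-small
        self.embedding = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o-mini")  # used by rag_chain via store.llm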

    def add(self, content, source="default"):
        doc = {
            "id": self.next_id,
            "content": content,
            "source": source,
            "vector": self.embedding.embed_query(content)
        }
        self.documents.append(doc)
        self.next_id += 1

    def count(self):
        return len(self.documents)

    def search(self, keyword):
        return [d for d in self.documents if keyword.lower() in d["content"].lower()]

    def retrieve(self, query, top_k=3):
        query_vec = self.embedding.embed_query(query)
        scores = []
        for doc in self.documents:
            score = cosine_similarity(
                [query_vec], [doc["vector"]]
            )[0][0]
            scores.append((doc, score))
        scores.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scores[:top_k]]

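add: stores a document and generates its vector
search: simple keyword filter (case-insensitive substring match)
retrieve: returns the top k documents ranked by cosine similarity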
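
The add/retrieve logic can be exercised without any API calls by swapping in a fake embedding function. The sketch below is one way to do that: `ToyStore`, `toy_embed`, and the small vocabulary are all hypothetical stand-ins (word counts instead of a real embedding model), so it only illustrates the ranking mechanics, not real semantic search.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyStore:
    """Same add/retrieve shape as DocumentStore, with an injectable embed function."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.documents = []
        self.next_id = 1

    def add(self, content, source="default"):
        # The vector is computed once, at insertion time
        self.documents.append({
            "id": self.next_id,
            "content": content,
            "source": source,
            "vector": self.embed_fn(content),
        })
        self.next_id += 1

    def retrieve(self, query, top_k=3):
        # Only the query is embedded at search time
        q = self.embed_fn(query)
        ranked = sorted(self.documents, key=lambda d: cosine(q, d["vector"]), reverse=True)
        return ranked[:top_k]

vocab = ["vacation", "policy", "air", "quality", "report"]
toy_store = ToyStore(lambda text: toy_embed(text, vocab))
toy_store.add("vacation policy for employees", source="companypolicy.txt")
toy_store.add("air quality report for march", source="airreport.txt")

top = toy_store.retrieve("what is the vacation policy", top_k=1)
print(top[0]["source"])  # companypolicy.txt
```

Injecting the embedding function this way also makes it straightforward to unit-test the store before wiring in `OpenAIEmbeddings`.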

5. The `WordCounter` Class

class WordCounter:
    def __init__(self):
        self.texts = []

    def add_text(self, txt):
        self.texts.append(txt)

    def count_word(self):
        all_words = " ".join(self.texts).split()
        return len(all_words)
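
A short usage sketch for the class above. The class is repeated so the example runs on its own; `most_common` is a hypothetical extra method that is not in the notes, added only to show how per-word frequencies could be layered on the same stored texts.

```python
from collections import Counter

class WordCounter:
    def __init__(self):
        self.texts = []

    def add_text(self, txt):
        self.texts.append(txt)

    def count_word(self):
        # str.split() with no arguments splits on any run of whitespace
        return len(" ".join(self.texts).split())

    def most_common(self, n=3):
        # Hypothetical extension: top-n lowercase tokens by frequency
        words = " ".join(self.texts).lower().split()
        return Counter(words).most_common(n)

counter = WordCounter()
counter.add_text("RAG retrieves documents")
counter.add_text("then RAG generates an answer")
print(counter.count_word())    # 8
print(counter.most_common(1))  # [('rag', 2)]
```

Note that whitespace splitting leaves punctuation attached to words, so "answer." and "answer" would be counted as different tokens.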

6. Document Loading Functions

def load_text_files(directory: Path):
    docs = []
    for fp in directory.glob("*.txt"):
        content = fp.read_text(encoding="utf-8")
        doc = {
            "source": fp.name,
            "content": content,
            "vector": None  # later embed
        }
        docs.append(doc)
    return docs

def load_csv_files(directory: Path):
    docs = []
    for fp in directory.glob("*.csv"):
        with fp.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                content = row.get("content", "")
                doc = {
                    "source": fp.name,
                    "content": content,
                    "vector": None
                }
                docs.append(doc)
    return docs

def load_json_file(file_path: Path):
    with file_path.open(encoding="utf-8") as f:
        data = json.load(f)
    content = json.dumps(data, ensure_ascii=False, indent=2)
    doc = {
        "source": file_path.name,
        "content": content,
        "vector": None
    }
    return doc
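
One thing to watch with the CSV loader: a row with no "content" value, or a file with no "content" column at all, still produces a document with an empty string. The sketch below shows a variant that skips such rows; `load_csv_rows` and the sample file contents are hypothetical, written to a temporary directory purely for illustration.

```python
import csv
import tempfile
from pathlib import Path

def load_csv_rows(directory: Path, column: str = "content"):
    """Like load_csv_files, but skips rows whose target column is empty or missing."""
    docs = []
    for fp in sorted(directory.glob("*.csv")):
        with fp.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                content = (row.get(column) or "").strip()
                if content:
                    docs.append({"source": fp.name, "content": content, "vector": None})
    return docs

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # One usable row and one row with an empty content cell
    (tmp_dir / "faq.csv").write_text(
        "id,content\n1,Remote work is allowed on Fridays.\n2,\n", encoding="utf-8"
    )
    # A file that has no "content" column at all
    (tmp_dir / "other.csv").write_text(
        "id,body\n1,no content column here\n", encoding="utf-8"
    )
    loaded = load_csv_rows(tmp_dir)

print(len(loaded))          # 1
print(loaded[0]["source"])  # faq.csv
```

Skipping empty rows avoids spending embedding calls on blank strings once these records are passed to `DocumentStore.add`.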

7. Implementing the RAG Pipeline

def rag_chain(query, store: DocumentStore):
    # 1. Retrieve
    retrieved = store.retrieve(query, top_k=2)
    # 2. Build the context
    context = "\n\n".join([doc["content"] for doc in retrieved])
    # 3. Call the LLM
    messages = [
        SystemMessage(content="Answer accurately based on the provided documents."),
        HumanMessage(content=f"Documents: {context}\n\nQuestion: {query}")
    ]
    response = store.llm.invoke(messages)
    return response.content

`store.llm` is a ChatOpenAI instance.
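
The context-building step can be isolated as a pure function, which makes it easy to inspect the exact prompt before any LLM call. The sketch below is one possible variant that also labels each passage with its source; `build_context`, `build_prompt`, the `max_chars` cap, and the sample records are assumptions for illustration, not part of the lecture code.

```python
def build_context(retrieved, max_chars=2000):
    """Join retrieved documents into one string, labelling each with its source."""
    parts = []
    for i, doc in enumerate(retrieved, start=1):
        parts.append(f"[{i}] (source: {doc['source']})\n{doc['content']}")
    context = "\n\n".join(parts)
    # Crude character cap; a token-based limit would be more precise
    return context[:max_chars]

def build_prompt(query, retrieved):
    """Assemble the user message in the same documents-then-question order as rag_chain."""
    return f"Documents:\n{build_context(retrieved)}\n\nQuestion: {query}"

# Made-up records in the same dict shape that DocumentStore.retrieve returns
sample_retrieved = [
    {"source": "companypolicy.txt", "content": "Kim Cheolsu belongs to the development team."},
    {"source": "airreport.txt", "content": "Air quality was moderate."},
]
prompt = build_prompt("Which team is Kim Cheolsu on?", sample_retrieved)
print(prompt)
```

Keeping the source label in the context lets the model (or a reviewer reading the prompt) see which file each passage came from.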

8. Usage Example

store = DocumentStore()
store.add("Kim Cheolsu belongs to the development team.", source="companypolicy.txt")
store.add("Our company policy is...", source="companypolicy.txt")

print(store.count())                     # 2
print(store.search("development team"))  # returns the matching document

answer = rag_chain("Which department is Kim Cheolsu in?", store)
print(answer)

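9. Pros and Cons Summary

Item	Fine-tuning	RAG
Cost	High (model retraining)	Low (API calls only)
Time	Slow	Fast
Flexibility	Fixed once trained	Supports real-time retrieval and updates
Accuracy	Depends on the training data	Depends on retrieval quality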

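Key point: RAG retrieves external documents and supplies them to the LLM as context, which reduces hallucination and costs less time and money than fine-tuning.
Practical tip: Include metadata (source, length, etc.) when loading documents, and store vectors in the DocumentStore ahead of time so that each search only needs to embed the query instead of re-embedding every document.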
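
One way to act on this tip across program runs is to persist the document records, vectors included, so a restart does not need to re-embed anything. The sketch below writes them to a JSON file; `with_metadata`, `save_store`, `load_store`, the file name, and the three-number vector are all hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path

def with_metadata(content, source, vector):
    """Build a document record that carries source and length metadata."""
    return {"content": content, "source": source, "length": len(content), "vector": vector}

def save_store(documents, path: Path):
    # ensure_ascii=False keeps non-ASCII text (e.g. Korean) readable in the file
    path.write_text(json.dumps(documents, ensure_ascii=False), encoding="utf-8")

def load_store(path: Path):
    return json.loads(path.read_text(encoding="utf-8"))

# The vector here is a made-up placeholder, not real embedding output
records = [with_metadata("Our company policy is...", "companypolicy.txt", [0.1, 0.2, 0.3])]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "store.json"
    save_store(records, cache)
    restored = load_store(cache)

print(restored == records)  # True
```

JSON is convenient for small collections; for larger ones a binary format or a vector database would likely be a better fit.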
