🍎 Studying AI in Food

RAG (Retrieval-Augmented Generation): An AI Technique for Providing Accurate Information

FoodAI 2025. 4. 7. 23:00

💡 Introduction

With recent advances in AI, large language models (LLMs) such as ChatGPT have shown a remarkable ability to generate information. However, these models sometimes "hallucinate", producing information that is simply wrong. RAG (Retrieval-Augmented Generation) is a technique that emerged to address this problem. RAG retrieves relevant information from an external knowledge database and uses it to augment the model's response, which yields more accurate and trustworthy results. Because accurate information matters so much in health and nutrition, RAG is especially valuable in this field.


I. Background and Need for RAG

Limitations of general-purpose language models

The hallucination problem

  • General-purpose language models such as BERT, BART, GPT, and T5 store knowledge learned from large corpora in their parameters, but they often generate information that is factually wrong.
  • They can underperform models specialized for a specific task.
  • Their knowledge base is limited, and they cannot know anything that appeared after their training data was collected.

Characteristics of autoregressive LLMs

  • They work by looking at the preceding words and predicting the most probable next word.
  • They can generate different answers depending on the context.
  • Even when nothing in the training data covers a question, they may still produce a plausible-sounding answer based on the patterns they learned.
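To make the next-word behavior concrete, here is a minimal sketch of greedy autoregressive decoding. The probability table is made up for illustration, and it conditions only on the single previous word, whereas a real LLM conditions on the whole preceding context. The point is that the loop always emits the most probable continuation, whether or not the resulting sentence is true.

```python
# Toy next-word probabilities (made-up numbers for illustration only).
NEXT_WORD_PROBS = {
    "<s>": {"vitamin": 0.6, "salt": 0.4},
    "vitamin": {"d": 0.7, "c": 0.3},
    "d": {"supports": 0.8, "</s>": 0.2},
    "supports": {"bones": 0.9, "</s>": 0.1},
    "bones": {"</s>": 1.0},
}


def greedy_decode(start="<s>", max_len=10):
    """Repeatedly append the most probable next word until </s> or max_len."""
    words, current = [], start
    for _ in range(max_len):
        candidates = NEXT_WORD_PROBS.get(current)
        if not candidates:
            break  # no known continuation for this word
        current = max(candidates, key=candidates.get)
        if current == "</s>":
            break  # end-of-sentence marker
        words.append(current)
    return words


print(" ".join(greedy_decode()))  # prints: vitamin d supports bones
```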

Retrieval models (non-parametric method)

  • This approach uses a knowledge database that sits outside the model and can be accessed on demand.
  • To answer a question properly, it pulls in external knowledge such as Wikipedia.
  • It performs well on information retrieval tasks.

Comparison of the two approaches:

Parametric                                                | Non-parametric
----------------------------------------------------------+-------------------------------------------------------
Learns a large amount of knowledge from a large corpus    | Can access an external knowledge database
Stored knowledge is hard to expand, revise, or interpret  | Knowledge is easy to expand, revise, and interpret
Specialized for generation (e.g., BART, GPT-3)            | Specialized for retrieval (e.g., BM25, search engines)
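Since BM25 is named above as a typical retrieval-specialized method, here is a small sketch of BM25 scoring in plain Python. It uses a common variant of the formula with whitespace tokenization; the documents, the query, and the default values of k1 and b are assumptions chosen for illustration, not something taken from this post's pipeline.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization penalizes longer-than-average documents.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "sodium intake and blood pressure in adults",
    "vitamin d status in obese men",
    "whole grains and heart health",
]
scores = bm25_scores("vitamin d", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # prints: vitamin d status in obese men
```

Nothing here is learned: the index is just term counts over the documents, so adding or correcting a document immediately changes what can be retrieved.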

 


II. Overview of RAG (Retrieval-Augmented Generation)

RAG searches external documents for information related to the user's question, passes the results to a language model, and returns the generated response to the user.

Main components of RAG

  • Retriever: makes up for the weakness of the parametric approach, its lack of access to external knowledge, by retrieving factually grounded information from external sources.
  • Generator: draws on the strength of the parametric approach, training on a large corpus, to generate answers in a variety of forms.

The figure below shows how RAG works:

 

 

In this process, the Retriever finds documents related to the user's question in the external knowledge database, and the Generator produces a natural-sounding response based on the retrieved information.
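The retrieve-then-generate flow can be sketched in a few lines. Everything below is a toy stand-in: the retriever ranks by shared words, the prompt wording is my own, and `fake_generate` is a hypothetical placeholder where a real system would call an LLM.

```python
def retrieve(question, documents, top_k=2):
    """Rank documents by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def rag_answer(question, documents, generate):
    """Retrieve first, then let the generator answer from the augmented prompt."""
    passages = retrieve(question, documents)
    return generate(build_prompt(question, passages)), passages


def fake_generate(prompt):
    """Hypothetical stand-in for an LLM call."""
    return f"[LLM output for a {len(prompt)}-character prompt]"


documents = [
    "reducing sodium intake lowers blood pressure",
    "weight loss may improve vitamin d status",
    "whole grains are a source of fiber",
]
answer, sources = rag_answer("does sodium raise blood pressure", documents, fake_generate)
print(sources[0])  # prints: reducing sodium intake lowers blood pressure
```

Returning the passages alongside the answer is what later makes it possible to show the user where a response came from.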

III. Structure of the RAG Model

Retriever model

The Retriever typically uses a bi-encoder:

  • It computes the similarity between the input sequence x and each candidate document y to find the most relevant documents.
  • It consists of two independent BERT encoders.
  • The question and the documents are fed into separate BERT encoders to produce embeddings.
  • The similarity score is the dot product of the two embeddings.
  • The highest-scoring documents are selected.
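The scoring step above comes down to a dot product followed by picking the maximum. In this sketch the three-dimensional vectors are invented for illustration; in a real bi-encoder they would come from the question encoder and the document encoder.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical embeddings; real encoders output hundreds of dimensions.
question_embedding = [0.9, 0.1, 0.3]      # would come from the question encoder
document_embeddings = {                    # would come from the document encoder
    "doc_hypertension": [0.8, 0.2, 0.1],
    "doc_vitamin_d": [0.1, 0.9, 0.2],
    "doc_whole_grains": [0.3, 0.3, 0.3],
}

scores = {
    name: dot(question_embedding, emb)
    for name, emb in document_embeddings.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # prints: doc_hypertension 0.77
```

Because the two encoders are independent, the document embeddings can be computed once ahead of time, and only the question has to be encoded when a query arrives.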

Generator model

The Generator combines the retrieved documents with the question to produce the final response:

  • It uses a generative model such as BART or GPT-3.
  • It consults several retrieved passages and selects the response that is most probable overall.
  • It composes natural sentences from the retrieved information.
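One way to read "most probable overall" is to weight each passage by how relevant the retriever thinks it is, and then sum the generator's answer probabilities across passages. The sketch below assumes that interpretation; all probabilities and candidate answers are made up.

```python
# Made-up retriever probabilities p(z | x) for two retrieved passages.
passage_probs = {"passage_a": 0.7, "passage_b": 0.3}

# Made-up generator probabilities p(y | x, z) for two candidate answers.
answer_probs = {
    "passage_a": {"limit salt": 0.8, "eat more salt": 0.2},
    "passage_b": {"limit salt": 0.4, "eat more salt": 0.6},
}


def marginal_answer_probs(passage_probs, answer_probs):
    """Sum p(z|x) * p(y|x,z) over passages z for every candidate answer y."""
    totals = {}
    for passage, p_z in passage_probs.items():
        for answer, p_y in answer_probs[passage].items():
            totals[answer] = totals.get(answer, 0.0) + p_z * p_y
    return totals


totals = marginal_answer_probs(passage_probs, answer_probs)
print(max(totals, key=totals.get))  # prints: limit salt
```

Here "limit salt" wins with 0.7 × 0.8 + 0.3 × 0.4 = 0.68, even though the less relevant passage on its own would have favored the other answer.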

IV. A Practical RAG Implementation Example

The steps for implementing RAG in practice are as follows:

1. Document preparation and processing

I had saved the abstracts of papers found on PubMed with specific search terms as txt files in a folder named 'abstracts', and loaded the documents from that folder.

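# Import paths follow the legacy LangChain layout used in this post;
# newer releases relocate these classes to separate packages.
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every .txt file in the 'abstracts' folder as one document each.
loader = DirectoryLoader('abstracts', glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split into chunks of up to 1000 characters with a 200-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)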

This code splits the documents into chunks of up to 1,000 characters and makes adjacent chunks overlap by 200 characters, so that context is not lost at chunk boundaries.
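To see what the two numbers mean, here is a simplified fixed-window chunker written from scratch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and newlines, so its chunks vary in length); the function name `chunk_text` and the sample text are my own.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of sample text
chunks = chunk_text(text)
print([len(c) for c in chunks])              # prints: [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])   # prints: True
```

The last 200 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary still shows up whole in at least one chunk.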

2. Embedding generation and vector database storage

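# Legacy LangChain import paths, as above.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

persist_directory = 'db'
# OpenAIEmbeddings() reads the API key from the OPENAI_API_KEY environment variable.
embedding = OpenAIEmbeddings()
# Embed every chunk and build the Chroma index in the 'db' folder.
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory=persist_directory)

# Write the index to disk, then drop the in-memory object.
vectordb.persist()
vectordb = None

# Reload the persisted index from disk.
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding)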

This uses OpenAI's embedding model to vectorize the documents and stores the vectors in ChromaDB.

3. Building the retrieval-generator model

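import os

# Legacy LangChain import paths, as above.
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# "stuff" inserts all retrieved chunks into a single prompt for the LLM.
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectordb.as_retriever(),
    return_source_documents=True)

def process_llm_response(llm_response):
    """Return the answer text and a formatted list of its source documents."""
    result = llm_response['result']
    source_info = []
    for source in llm_response["source_documents"]:
        # File names are assumed to look like abstract<PMID>_<article type>.txt.
        # Use the base name so the 'abstracts' folder in the path is not parsed.
        name = os.path.basename(source.metadata['source']).replace(".txt", "")
        pubmed_id = name.split("_")[0].replace('abstract', '')
        article_type = " ".join(name.split("_")[1:])
        source_info.append(f'Sources: {pubmed_id}\nArticle Type: {article_type}')
    return result, '\n\n'.join(source_info)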

This code sets up RetrievalQA and returns the source information together with the response.

Example nutrition and health queries

query = "Recommend diet foods for management of hypertension in adults."
llm_response = qa_chain(query)
process_llm_response(llm_response)

'Foods that may help manage hypertension in adults include fruits and vegetables, whole grains, low-fat dairy products, lean proteins, and nuts and seeds. Foods to avoid or limit include foods high in salt, saturated fat, and sugar.'

 

'Source PMID: 16003449 Source PMID: 19583632' 


query = "Is 25-hydroxyvitamin D good for obese men?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

'This study suggests that weight loss is associated with a marginally improved vitamin D status. Additional studies in unsupplemented individuals are needed to confirm these findings, but it may be beneficial for obese men to consider weight loss as a way to improve their vitamin D status.'

 

'Source PMID: 27604772' 

 

V. Conclusion 🎯

RAG (Retrieval-Augmented Generation) is an effective way to compensate for the limitations of AI language models. It is especially valuable in areas such as health and nutrition, where accurate information matters. RAG:

  • Reduces hallucination and provides fact-based responses.
  • Can access up-to-date information, which overcomes the limits of the model's own knowledge.
  • Cites supporting sources, which makes responses more trustworthy.
  • Can be used in a range of healthcare applications, such as nutrition counseling, diet recommendation, and health information services.

As it matures, RAG is expected to play an important role in healthcare, in areas such as personalized nutrition counseling, medical information retrieval, and health management guidance.

