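[Overseas DS] Salesforce releases an LLM specialized for long documents, and it dominates the benchmarks

[Overseas DS] presents the views of industry experts as reported in leading overseas data science publications. Our Data Science Management Research Institute (MDSA R&D) runs this content under a partnership that requires the English original to be published alongside it.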

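Tools such as ChatGPT and Bard can summarize long documents and process customer data quickly and simply. With these capabilities, companies can obtain insights that are useful to their business.

There is a practical obstacle, however: the large-scale training these AI systems require is expensive. In response, companies have turned to smaller, more efficient models, but these have fallen short of delivering business results.

Open source models such as Meta’s LLaMA, Falcon-7B and MPT-7B were trained with a maximum sequence length of about 2,000 tokens. That is known to be somewhat short for processing long unstructured data such as documents.

In response, Salesforce released XGen-7B, a large language model (LLM). XGen-7B was trained on up to 1.5 trillion tokens at sequence lengths of up to 8,000 tokens. The model is said to handle long document inputs more easily thanks to training with what is called “standard dense attention.”

Salesforce researchers trained a series of seven-billion-parameter models using JaxFormer, Salesforce’s in-house library, and public-domain instructional data. The resulting XGen-7B performs comparably to or better than open source models such as LLaMA, Falcon and Redpajama.

The researchers also said that training the model on 1 trillion tokens on Google Cloud’s TPU-v4 cloud computing platform cost only $150,000.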

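XGen-7B performance: top benchmark scores in three categories

Salesforce’s XGen made an impressive showing, scoring higher than well-known open source LLMs on several benchmarks.

On the Measuring Massive Multitask Language Understanding (MMLU) benchmark, XGen achieved the best score in three of the four tested categories, as well as the best weighted average. Only Meta’s LLaMA scored higher than XGen on the humanities portion of MMLU.

On the same benchmark’s zero-shot test, XGen achieved similar results, again trailing LLaMA in the humanities.

Across the general zero-shot tests, XGen outscored every other model only on the TruthfulQA benchmark. Meta’s LLaMA recorded better results on benchmarks including ARC_ch, Hella Swag and Winogrande.

On code generation, however, XGen scored 14.20 on the HumanEval benchmark’s pass@1 metric, ahead of LLaMA and the other models. LLaMA managed only 10.38.

Long-sequence tasks were where XGen shone most, with remarkably high scores on the QMSum and GovReport datasets of the SCROLLS benchmark.

Salesforce’s researchers noted, however, that because the models are not trained on the same instructional data, “they are not strictly comparable.”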

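The XGen-7B series

Salesforce’s researchers presented three models: XGen-7B-4K-base, XGen-7B-8K-base and XGen-7B-inst.

XGen-7B-4K-base was trained on 800 billion tokens at a sequence length of 2,000 tokens and then on a further 400 billion tokens at a sequence length of 4,000.

XGen-7B-8K-base starts from the 4K version and adds a further 300 billion tokens of training at a sequence length of 8,000, bringing the total training data to 1.5 trillion tokens.

XGen-7B-inst was fine-tuned on public-domain instructional data, including databricks-dolly-15k, oasst1, Baize and GPT-related datasets. It was trained at both 4,000- and 8,000-token sequence lengths and has been released for research purposes only.

To train the models, the researchers adopted a two-stage training strategy in which each stage used a different data mixture.

The researchers explained: “For C4, we processed 6 Common Crawl dumps with C4 pipeline, and deduplicated the documents across different dumps by only keeping the newest timestamp for the documents with the same URL. We trained a linear model, which classifies the C4 data as a Wikipedia-like document vs. a random document. We then chose the top 20% Wikipedia-like documents.”

Next, to support code generation, the researchers added data from StarCoder, a code-generation model developed in collaboration with Hugging Face. The core StarCoder data was then combined with the data from the earlier stage.

The data was then tokenized with OpenAI’s tiktoken. Additional tokens for consecutive whitespaces and tabs were added afterwards.

The XGen training process produced a series of powerful AI models, but they are not without flaws. The researchers said the models still suffer from hallucination.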

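Context is key

Models that can understand long inputs could be a major advantage for businesses. In chatbot applications, longer context means longer conversations. Salesforce’s researchers said a large context “allows a pre-trained LLM to look at customer data and respond to useful information-seeking queries.”

Salesforce is not alone in studying how to extend and apply context. Anthropic, an AI startup founded by engineers who left OpenAI, recently expanded the context length of its flagship application, Claude. Claude can now be used to retrieve information from multiple long business documents or books.

Long context remains a difficult challenge for many current models. As applications such as ChatGPT and Bing’s AI chat became popular, a problem also surfaced: the longer a model was used in a single conversation, the more erratic its responses became. The most common explanation is that the model could not handle the context length, became confused and eventually hallucinated.

Because of incidents such as Bing’s inappropriate responses reported in May, Microsoft had to limit how much conversation users could have with the app, since it could not handle long conversations.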

With greater demand for AI tools comes greater demand for systems to do more. Prompts for tools like ChatGPT that started out as a sentence or two are becoming increasingly complex. And the data being input for systems to analyze is unstructured.

Businesses could benefit from having a chat interface like ChatGPT or Bard capable of summarizing lengthy documents or sifting through customer data for insights. But to perform tasks like these, models would need to be trained on large amounts of data. And businesses have instead opted for smaller, more cost-effective models – which cannot handle such tasks well.

Open source models such as Meta’s LLaMA, Falcon-7B and MPT-7B have been trained with a maximum sequence length of around 2,000 tokens (basic units of text or code), which makes it difficult for them to handle lengthy unstructured data such as a document.

Enter XGen-7B, a family of large language models from Salesforce that can handle lengthy document inputs more easily thanks to their training with “standard dense attention” on up to an 8,000 sequence length for up to 1.5 trillion tokens.

Salesforce researchers trained a series of seven-billion-parameter models using Salesforce’s in-house library, JaxFormer, and public-domain instructional data.

The resulting model achieves results comparable to or better than those of open source models like LLaMA, Falcon and Redpajama.

AI researchers at Salesforce said the model cost just $150,000 to train on 1 trillion tokens using Google Cloud’s TPU-v4 cloud computing platform.
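To put the quoted figure in per-token terms, the short calculation below divides the reported $150,000 by the reported 1 trillion tokens. It is only arithmetic on the two numbers in the article; the helper name `cost_per_unit` is made up for this illustration, and nothing here says how the cost would scale to other token counts or sequence lengths.

```python
def cost_per_unit(total_cost_usd: int, total_tokens: int, unit_tokens: int) -> float:
    """Return the training cost attributable to `unit_tokens` tokens."""
    return total_cost_usd * unit_tokens / total_tokens


# Figures as reported in the article.
TRAIN_COST_USD = 150_000
TRAIN_TOKENS = 1_000_000_000_000  # 1 trillion

per_billion = cost_per_unit(TRAIN_COST_USD, TRAIN_TOKENS, 1_000_000_000)
per_million = cost_per_unit(TRAIN_COST_USD, TRAIN_TOKENS, 1_000_000)

print(f"${per_billion:,.2f} per billion tokens")
print(f"${per_million:,.2f} per million tokens")
```

That works out to roughly $150 per billion training tokens.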

XGen-7B results: Benchmark bulldoze

Salesforce’s model achieved some impressive results – scoring higher than popular open source large language models on a host of benchmarks.

When tested on the Measuring Massive Multitask Language Understanding (MMLU) benchmark, XGen achieved the best score in three out of the four tested categories, as well as in the weighted average. Only Meta’s LLaMA scored higher than XGen in the MMLU test covering the humanities.

On the same benchmark’s zero-shot test, XGen achieved similar results – again losing out to LLaMA on humanities.

In terms of overall zero-shot tests, XGen outscored every other model only on the TruthfulQA benchmark. Meta’s LLaMA recorded better results on benchmarks including ARC_ch, Hella Swag and Winogrande.

However, on code generation tasks, XGen outclassed LLaMA and other models, scoring 14.20 on the HumanEval benchmark’s pass@1 metric. LLaMA could only muster 10.38.
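For readers unfamiliar with pass@1: pass@k is the probability that at least one of k generated samples for a problem passes its unit tests. The sketch below uses the unbiased estimator commonly applied to HumanEval, where n samples are drawn per problem and c of them pass. The article does not say how many samples Salesforce drew, so the numbers in the example are hypothetical, and scores such as 14.20 are presumably this quantity averaged over problems and expressed as a percentage.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 of them pass.
print(f"pass@1 = {pass_at_k(10, 3, 1):.2f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.2f}")
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass.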

Long-sequence tasks were where Salesforce’s new AI model shone the most, scoring incredibly well on the SCROLLS benchmark’s QMSum and GovReport datasets.

However, Salesforce’s researchers did note that since the models being compared are not trained on the same instructional data, “they are not strictly comparable.”

The XGen-7B family

Salesforce’s researchers created three models – XGen-7B-4K-base, XGen-7B-8K-base and XGen-7B-inst.

XGen-7B-4K-base was trained on 800 billion tokens at a sequence length of 2,000 tokens and then on a further 400 billion tokens at a sequence length of 4,000. It has been released under an Apache-2.0 license, meaning derivative works can be distributed under a different license, although all unmodified components must remain under Apache 2.0.

XGen-7B-8K-base takes the earlier mentioned model and trains it on a further 300 billion tokens at a sequence length of 8,000, bringing its total training data to 1.5 trillion tokens. This model was also released under Apache 2.0.

XGen-7B-inst was fine-tuned on public-domain instructional data, including databricks-dolly-15k, oasst1, Baize and GPT-related datasets. The model was trained at both 4,000- and 8,000-token sequence lengths and has been released solely for research purposes.

To train the models, Salesforce’s researchers employed a two-stage training strategy, where each stage used a different data mixture.

The team explained: “For C4, we processed 6 Common Crawl dumps with C4 pipeline, and deduplicated the documents across different dumps by only keeping the newest timestamp for the documents with the same URL. We trained a linear model, which classifies the C4 data as a Wikipedia-like document vs. a random document. We then chose the top 20% Wikipedia-like documents.”
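The two filtering steps in that quote can be sketched as follows. This is a minimal illustration, not Salesforce’s pipeline: the `Doc` record, the `wiki_score` field (standing in for the linear classifier’s output) and both function names are hypothetical, and the classifier itself is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    timestamp: int      # crawl time; larger means newer
    text: str
    wiki_score: float   # hypothetical "Wikipedia-likeness" score


def dedup_newest(docs: list[Doc]) -> list[Doc]:
    """Keep only the newest document for each URL."""
    newest: dict[str, Doc] = {}
    for doc in docs:
        kept = newest.get(doc.url)
        if kept is None or doc.timestamp > kept.timestamp:
            newest[doc.url] = doc
    return list(newest.values())


def top_fraction(docs: list[Doc], fraction: float = 0.2) -> list[Doc]:
    """Return the highest-scoring `fraction` of documents."""
    if not docs:
        return []
    ranked = sorted(docs, key=lambda d: d.wiki_score, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


docs = [
    Doc("a", 1, "old copy", 0.9),
    Doc("a", 2, "new copy", 0.1),
    Doc("b", 1, "x", 0.8),
    Doc("c", 1, "y", 0.5),
    Doc("d", 1, "z", 0.3),
    Doc("e", 1, "w", 0.2),
]
unique = dedup_newest(docs)
selected = top_fraction(unique, 0.2)
print(len(unique), [d.url for d in selected])
```

Deduplicating before ranking matters here: the stale copy of URL "a" has the highest score, but only the newest copy is allowed to compete for the top 20%.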

To support code-generation tasks, the team then added data from StarCoder, the code-generation model created by Hugging Face and ServiceNow through the BigCode project. This core StarCoder data was mixed with the data from the earlier stage.

OpenAI’s tiktoken was then used to tokenize the model’s data. Additional tokens for consecutive whitespaces and tabs were later added.
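Dedicated tokens for whitespace runs matter mostly for code, where indentation would otherwise cost one token per space or tab. The toy sketch below shows the idea by replacing each run with a single marker. It does not use tiktoken and does not reproduce XGen’s vocabulary: the marker format `<|spacexN|>`, the `MAX_RUN` cap and the function name are all assumptions made for illustration.

```python
import re

MAX_RUN = 8  # hypothetical longest run covered by a single marker

# Runs of two or more spaces, or two or more tabs.
_RUN = re.compile(r" {2,}|\t{2,}")


def mark_whitespace_runs(text: str) -> str:
    """Replace each run of spaces or tabs with one marker per chunk."""

    def repl(match: re.Match) -> str:
        run = match.group(0)
        kind = "space" if run[0] == " " else "tab"
        pieces = []
        remaining = len(run)
        while remaining > 0:
            n = min(remaining, MAX_RUN)
            # A leftover single character stays as it is.
            pieces.append(f"<|{kind}x{n}|>" if n > 1 else run[0])
            remaining -= n
        return "".join(pieces)

    return _RUN.sub(repl, text)


print(mark_whitespace_runs("def f():\n    return 1"))
```

In this sketch a four-space indent becomes one unit instead of four, which is the kind of saving such tokens are meant to provide on long code inputs.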

While the XGen training process results in a series of powerful AI models, it is not without its flaws. Salesforce noted that the model still suffers from hallucinations.

For more on XGen-7B, Salesforce posted a detailed blog post on the model. The codebase for the model can be found on GitHub and the model checkpoints can be found on Hugging Face.

Context is key

Models that are able to understand longer inputs could be a huge benefit for businesses.

Salesforce’s researchers said a large context “allows a pre-trained LLM to look at customer data and respond to useful information-seeking queries.”

For chatbot applications, more context means more conversation. And Salesforce isn’t the only organization looking into this concept. Anthropic, the fast-rising AI startup founded by OpenAI alumni, recently expanded the context length of its flagship application, Claude.

Claude can now be used to recover information from multiple lengthy business documents or books, with users able to prompt the bot for questions on the data.

Current models struggle with increasing context lengths. As applications like ChatGPT and Bing’s AI chat began to emerge, users found the longer they used the model in a single conversation, the more unhinged its responses became. This was due to the model being unable to handle the context length, causing it to become confused and subsequently hallucinate.

Instances like Bing’s inappropriate responses reported in May forced Microsoft to limit the number of conversations users could have with the application as it simply could not handle long context conversations.