Highlights:

  • According to the companies, StarCoder is the most sophisticated model of its kind in the open-source community.
  • The Stack dataset, which contains program code from 358 different programming languages, served as the training ground for StarCoderBase.

Hugging Face Inc. and ServiceNow Inc. have unveiled StarCoder, an open-source AI model that can generate code in several programming languages.

According to the companies, StarCoder is the most sophisticated model of its kind in the open-source community. It grew out of a research project that ServiceNow and Hugging Face started last year. Engineers from the two organizations contributed to the project, as did hundreds of additional AI specialists.

Leandro von Werra, Co-lead at BigCode, said, “The joint efforts led by Hugging Face and ServiceNow enable the release of powerful base models that empower the community to build a wide range of applications more efficiently than a single company could come up with. This endeavor is a testament to the potential of open‑source as we work toward democratizing AI.”

StarCoder comes in several versions. The core edition, StarCoderBase, has 15.5 billion parameters. Parameters are the learned settings that determine how an AI model carries out tasks such as writing code.

StarCoderBase was trained on The Stack, a dataset that contains program code in 358 different programming languages. ServiceNow and Hugging Face did not use the whole dataset, only code samples written in 86 of those languages.

The companies also fed StarCoderBase software documentation and related technical data during training. In total, the AI model was trained on roughly one trillion tokens. A token is a unit of data made up of one or more words, word fragments, or numerals.
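To make the idea of a token more concrete, here is a minimal sketch of a toy splitter. It is not StarCoder's actual tokenizer; the function name `toy_tokenize` and the `MAX_PIECE` fragment length are hypothetical and chosen only for illustration. It cuts words into fragments, emits digits one at a time, and treats every other symbol as its own token.

```python
import re

# Toy splitter for illustration only; real models learn their vocabulary
# from data rather than cutting words at a fixed length.
MAX_PIECE = 4  # hypothetical fragment length


def toy_tokenize(text: str) -> list[str]:
    """Split text into word fragments, single digits, and single symbols."""
    tokens = []
    # Runs of letters/underscores, single digits, or any other non-space character.
    for chunk in re.findall(r"[A-Za-z_]+|\d|[^\sA-Za-z_\d]", text):
        if chunk[0].isalpha() or chunk[0] == "_":
            # Cut long words into fragments of at most MAX_PIECE characters.
            tokens.extend(
                chunk[i:i + MAX_PIECE] for i in range(0, len(chunk), MAX_PIECE)
            )
        else:
            tokens.append(chunk)
    return tokens


print(toy_tokenize("total = count + 42"))
# ['tota', 'l', '=', 'coun', 't', '+', '4', '2']
```

Even this short line of code becomes eight tokens under the toy scheme, which gives a rough sense of how a code corpus can add up to a trillion of them.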

ServiceNow and Hugging Face trained StarCoderBase on a cluster of 64 servers equipped with A100 graphics cards. Until last year, when Nvidia Corp. unveiled its newer H100 processor, the A100 was the company’s top-tier data center AI accelerator. According to the companies, the cluster contained 512 graphics cards in total.
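The figures quoted so far can be combined in a quick back-of-the-envelope calculation. The sketch below uses only numbers from the article; the assumption that the graphics cards were spread evenly across the servers is ours, and the token count is the article's rounded "roughly one trillion."

```python
# Figures reported in the article.
SERVERS = 64
TOTAL_GPUS = 512
PARAMETERS = 15.5e9       # 15.5 billion
TRAINING_TOKENS = 1e12    # "roughly one trillion", so treat results as approximate

# Assumption: GPUs are distributed evenly across the servers.
gpus_per_server, leftover = divmod(TOTAL_GPUS, SERVERS)

# Approximate number of training tokens seen per model parameter.
tokens_per_parameter = TRAINING_TOKENS / PARAMETERS

print(f"GPUs per server: {gpus_per_server} (leftover: {leftover})")
print(f"Tokens per parameter: about {tokens_per_parameter:.1f}")
```

Under that even-split assumption, each server would hold eight A100 cards, and the model would have seen on the order of 65 training tokens for every parameter.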

The companies say the AI model can not only produce code in numerous languages but also does so more quickly than many competing models. In an internal test, they compared StarCoderBase with a number of open-source alternatives. They found that the model, which has built-in support for several programming languages, surpassed all the other open-source code generation models.

According to ServiceNow and Hugging Face, the model can also outperform an early iteration of OpenAI LLC’s Codex model. Codex powers GitHub Copilot, the AI coding assistant offered by Microsoft Corp.’s GitHub division.

StarCoderBase itself is available in several versions. In addition to the core version, there is an edition trained on Python, Java, and JavaScript code samples, which should give it better support for those three languages. There is also a version tailored to producing Python code.