Highlights:

  • According to the firms, StarCoder2 can be fine-tuned and integrated into enterprise systems to carry out tasks including text summarization, workflow creation, and source code generation.
  • The firms said that StarCoder2’s 3-billion-parameter model, trained on a broader corpus of languages, can make more accurate predictions and performs comparably to the original StarCoder’s 15-billion-parameter model.

Hugging Face Inc., Nvidia Corp. and ServiceNow Inc. have released StarCoder2, the latest iteration of their open-source StarCoder family of large language models for code generation.

According to the firms, StarCoder2 is faster and more adaptable than its predecessor and includes features that guard against intellectual property infringement.

StarCoder2, which is proficient in 619 programming languages, was created in collaboration with the BigCode community as part of a research project overseen by Hugging Face and ServiceNow. The initial StarCoder was introduced last year. A new code dataset known as Stack v2, which is more than seven times larger than Stack v1, serves as the basis for the model. In addition to code, the new dataset incorporates other types of information, including mathematics and discussions of program source code, along with new training methodologies for low-resource programming languages such as COBOL.

According to the firms, StarCoder2 can be fine-tuned and integrated into enterprise applications to carry out tasks including text summarization, workflow creation, and source code generation. Features such as code completion, code summarization, and code snippet retrieval enable developers to write code more quickly.
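For developers who want to try the model themselves, the minimal sketch below shows code completion through the Hugging Face transformers library. The checkpoint name "bigcode/starcoder2-3b" and the generation settings are illustrative assumptions, not details from the announcement; check the Hugging Face hub for the published model names.

```python
# Minimal code-completion sketch with Hugging Face transformers.
# Assumes the 3B checkpoint is published as "bigcode/starcoder2-3b";
# adjust the name if the hub listing differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # half precision keeps memory modest
    device_map="auto",           # place weights on a GPU if one is available
)

# Give the model the start of a function and let it complete the body.
prompt = "def quicksort(items: list) -> list:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```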

Size Preference

The model comes in three sizes: a 3-billion-parameter version trained by ServiceNow, a 7-billion-parameter version trained by Hugging Face, and a 15-billion-parameter version built by Nvidia with its NeMo generative AI framework and trained on Nvidia infrastructure. The smaller variants reduce computing costs, since fewer parameters require less computation during the inferencing phase, when models draw conclusions from their training data. They can also run on consumer-grade graphics processing units.
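As a rough illustration of why the smaller variants fit on consumer hardware, the back-of-envelope calculation below estimates the memory needed just to hold each variant's weights at 16-bit precision. It is a simplifying sketch: real usage is higher once activations, the KV cache, and runtime overhead are included.

```python
# Rough weight-memory estimate: parameters × 2 bytes (16-bit precision).
# Actual memory use is higher once activations and the KV cache are included.
for params_billions in (3, 7, 15):
    gib = params_billions * 1e9 * 2 / 2**30
    print(f"{params_billions}B parameters ≈ {gib:.1f} GiB of weights in fp16")

# Output:
# 3B parameters ≈ 5.6 GiB of weights in fp16
# 7B parameters ≈ 13.0 GiB of weights in fp16
# 15B parameters ≈ 27.9 GiB of weights in fp16
```

By this estimate, only the 3-billion-parameter variant fits comfortably in the 8 to 12 gigabytes of memory typical of consumer graphics cards, which is broadly consistent with how the firms position the smaller models.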

StarCoder2’s 3-billion-parameter model matches the performance of the original StarCoder’s 15-billion-parameter model and, having been trained on a larger corpus of languages, can make more accurate predictions, according to the companies. That broader and deeper training, they said, lets the model generate predictions that better account for context.

AI has found widespread use in software development, thanks in part to early successes such as GitHub Inc.’s Copilot and Amazon Web Services Inc.’s CodeWhisperer. A recent GitHub poll found that 91% of developers in the United States use AI coding tools. Nevertheless, a CoderPad Inc. poll found that 28% of developers said their workplace forbids the use of AI, and roughly 25% are skeptical of its usefulness.

Role of Transparency

Much of that reluctance stems from concerns that code generators may reproduce copyrighted material from their training data, potentially exposing users to intellectual property claims, and that they may introduce security vulnerabilities. Recent research from Stanford University found that AI assistants produce insecure code in laboratory settings.

To allay these concerns, the three sponsoring corporations are emphasizing transparency. StarCoder2 was built on ethically sourced data licensed from Software Heritage, which claims to host the largest public archive of source code. The model’s supporting code will be hosted on the BigCode project’s GitHub page, and the model is being released under the BigCode OpenRAIL-M license, which allows royalty-free access and use.

The license isn’t strictly an open-source license, though it includes many open-source features. RAIL-M places restrictions on what licensed software can do, such as prohibiting its use to administer justice or give medical advice. The license has also drawn criticism for being overly ambiguous.

Hugging Face will also offer downloads of all StarCoder2 models, and the 15-billion-parameter model is supported by the Nvidia AI Foundation models.