About SixGPT
SixGPT is an agentic data generation platform built on the Vana network. SixGPT leverages a decentralized network of miners to generate high-fidelity synthetic data. That synthetic data is evaluated by a separate set of agents, and the best performing synthetic data is combined with real data to drive the next generation of AI models.
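The generate, evaluate, and combine loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` type, the `evaluate` scoring rule (lexical variety), and the `keep_fraction` parameter are all hypothetical stand-ins, not SixGPT's actual evaluators or API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    miner_id: str      # hypothetical identifier for the contributing miner
    score: float = 0.0


def evaluate(sample: Sample) -> float:
    """Toy evaluator: fraction of distinct words (a stand-in for evaluator agents)."""
    words = sample.text.split()
    return len(set(words)) / len(words) if words else 0.0


def select_best(samples: list[Sample], keep_fraction: float = 0.5) -> list[Sample]:
    """Score every sample and keep the top-scoring fraction."""
    for s in samples:
        s.score = evaluate(s)
    ranked = sorted(samples, key=lambda s: s.score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


def build_training_set(real: list[str], synthetic: list[Sample],
                       keep_fraction: float = 0.5) -> list[str]:
    """Combine real data with the best-performing synthetic samples."""
    return list(real) + [s.text for s in select_best(synthetic, keep_fraction)]


synthetic = [
    Sample("the the the the", "miner-a"),
    Sample("cats chase red lasers", "miner-b"),
]
training = build_training_set(["a real sentence"], synthetic)
```

Here the repetitive sample from `miner-a` is filtered out, and the more varied sample from `miner-b` is kept alongside the real data.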
The problem
AI models are ultimately only as good as their training data.
AI researchers have run into a problem: we are out of training data. We’ve used up all of the publicly available data (roughly 15 trillion tokens, the size of the Llama 3 training dataset). To make AI smarter and more capable, we need more data. This is what AI researchers call the “data wall”.
SixGPT breaks the data wall and accelerates model development, aiming for a Moore’s law of AI through the generation of synthetic data.
Why synthetic data?
A dataset is composed of a set of observations and their labels. In real-world datasets, observations are sourced from actual environments, and labels are manually generated by humans, which is an extremely labor-intensive process. As models improve, human labeling becomes a bottleneck, limiting the pace of progress.
Synthetic datasets eliminate this bottleneck by using models, not humans, to generate labels. This shifts the workload from human labor to compute. Moreover, synthetic datasets can generate both labels and observations, fully automating the data generation process.
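As a rough illustration of the idea, the sketch below builds a tiny dataset in which both the observations and the labels come from code rather than from people. The two `toy_model_*` functions are hypothetical stand-ins for real model calls (a template sampler and a keyword rule), chosen only so the example runs on its own.

```python
import random


def toy_model_generate(rng: random.Random) -> str:
    """Stand-in for a generative model: samples a templated observation."""
    subject = rng.choice(["The battery", "The screen", "The support team"])
    verdict = rng.choice(["is great", "is good", "is slow", "keeps failing"])
    return f"{subject} {verdict}"


def toy_model_label(text: str) -> str:
    """Stand-in for a labeling model: a simple keyword rule."""
    positive_words = {"great", "good"}
    words = set(text.lower().split())
    return "positive" if words & positive_words else "negative"


def build_synthetic_dataset(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate n (observation, label) pairs with no human in the loop."""
    rng = random.Random(seed)
    observations = [toy_model_generate(rng) for _ in range(n)]
    return [(obs, toy_model_label(obs)) for obs in observations]


dataset = build_synthetic_dataset(5)
```

In a real pipeline each stand-in would be an inference call, which is where the workload moves from human labor to compute.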
Recent research has shown how synthetic data can amplify the value of human-generated data:
- "Textbooks Are All You Need" (Gunasekar et al., 2023) showed that synthetic textbook-style data can improve model performance across tasks by up to 30%.
- "Scaling Synthetic Data Creation with 1,000,000,000 Personas" (Chan et al., 2024) demonstrated how persona-driven synthesis using a billion diverse personas can generate high-quality data for mathematical reasoning, knowledge tasks, and interactive agents.
This approach not only accelerates model development but also offers numerous advantages over real-world datasets:
- Data Privacy: Synthetic data protects customer privacy since it does not contain real user data, helping organizations comply with business regulations.
- Bias Reduction: Real-world data often covers only specific parts of the data distribution, leading to biases. Synthetic data can fill in these gaps and reduce bias.
- Scalability: Synthetic data can be generated in large volumes, enabling efficient testing and training of machine learning models.
- Diversity: It allows for the creation of diverse data that mirrors a wide range of real-world scenarios.
- Speed and Cost Efficiency: By accelerating data generation and reducing reliance on human labeling, synthetic data speeds up projects and cuts development costs.
Still, synthetic data is not a panacea. Two bottlenecks remain:
- Synthetic data remains fixed to a probability distribution derived from the underlying model. No AI model perfectly captures the real world, so the data generated from it will not be perfect either, and errors compound.
- Inference for multi-billion-parameter models is expensive.
SixGPT solves these bottlenecks, unlocking the full potential of synthetic data and enabling the continuation of exponential AI progress.
Enabling data creation
SixGPT leverages a decentralized network of miners to generate a synthetic dataset that covers a much larger (and more realistic) part of the probability distribution than any single actor could produce.
Each miner contributes data sourced from a model they select, based on prompts they can curate, and underlying randomness that they generate.
Even if the data contributed by a single miner covers only a small section of the real-world data distribution, tens of thousands of miners together can cover the deficiencies of any individual model. This approach is similar to how OpenAI and Anthropic train their flagship models, by synthesizing the outputs of several best-in-class models together, except done on a much larger scale.
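The coverage argument can be illustrated with a toy simulation. It assumes the real-world distribution is split into `NUM_BINS` discrete regions and that each miner's combination of model, prompts, and randomness reaches only a small window of them. These names and numbers are hypothetical and chosen for illustration, not taken from the SixGPT network.

```python
import random

NUM_BINS = 100  # hypothetical discretization of the real-world data distribution


def miner_coverage(seed: int, width: int = 5) -> set[int]:
    """Bins reachable by one miner's model, prompts, and randomness (toy model)."""
    rng = random.Random(seed)        # each miner supplies its own randomness
    start = rng.randrange(NUM_BINS)  # stands in for the miner's model and prompt choices
    return {(start + i) % NUM_BINS for i in range(width)}


def network_coverage(num_miners: int) -> float:
    """Fraction of bins covered by the union of all miners' contributions."""
    covered: set[int] = set()
    for miner_id in range(num_miners):
        covered |= miner_coverage(seed=miner_id)
    return len(covered) / NUM_BINS


single = network_coverage(1)   # one miner reaches only `width` bins
many = network_coverage(200)   # the union over many miners can only grow
```

A single miner here covers 5% of the bins, while the union over many miners can only grow as more join, which is the intuition behind pooling contributions.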
The Agentic Future
SixGPT leverages a decentralized network of miners to generate high-fidelity synthetic data.
Documentation
Welcome to the SixGPT docs. To run a miner, go to docs.sixgpt.xyz/miner. For more help, DM us on X.