Generative AI needs synthetic data, and we need to be able to trust it

Today’s generative models, such as those behind ChatGPT and Gemini, have been trained on reams of real-world data, but even all the content on the internet is not enough to prepare a model for every possible situation.

To continue to grow, these models need to be trained on simulated or synthetic data: scenarios that are plausible, but not real. AI developers need to do so responsibly, experts said at a South by Southwest panel, or things could go haywire quickly.

The use of simulated data in training artificial intelligence models has gained new attention this year since the launch of DeepSeek AI, a new model produced in China that was trained using more synthetic data than other models, saving money and processing power.

But experts say it’s about more than saving on data collection and processing. Synthetic data, often generated by AI itself, can teach a model about scenarios that don’t exist in the real-world information it has been fed but that it could face in the future. That one-in-a-million possibility doesn’t have to come as a surprise to an AI model if it has seen a simulation of it.

“With simulated data, you can get rid of the idea of edge cases, assuming you can trust it,” said Oji Udezue, who has led product teams at Twitter, Atlassian, Microsoft and other companies. He and the other panelists were speaking Sunday at the SXSW conference in Austin, Texas. “We can build a product that works for 8 billion people, in theory, as long as we can trust it.”

The hard part is ensuring you can trust it.

The problem with simulated data

Simulated data has a lot of benefits. For one, it costs less to produce. You can crash-test thousands of simulated cars using some software, but to get the same results in real life, you have to actually smash cars, which costs a lot of money, Udezue said.

If you’re training a self-driving car, for example, you’d need to capture some less common scenarios that a vehicle might encounter on the road, even if they aren’t in the training data, said Tahir Ekin, a professor of business analytics at Texas State University. He used the case of the bats that make spectacular emergences from Austin’s Congress Avenue Bridge. That may not show up in the training data, but a self-driving car will need some sense of how to respond to a swarm of bats.
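As a toy illustration of that idea (my own sketch, not anything shown on the panel), a synthetic-data generator can be tuned to surface a rare scenario far more often than real-world logs ever would, so a model trains on it repeatedly. All names and rates here are hypothetical.

```python
import random

random.seed(1)

def real_world_sample():
    # In real driving logs, the edge case shows up ~0.1% of the time.
    return "edge_case" if random.random() < 0.001 else "normal"

def synthetic_sample():
    # A synthetic generator deliberately produces it 20% of the time.
    return "edge_case" if random.random() < 0.20 else "normal"

real = [real_world_sample() for _ in range(10_000)]
synthetic = [synthetic_sample() for _ in range(10_000)]

real_rate = real.count("edge_case") / len(real)
synthetic_rate = synthetic.count("edge_case") / len(synthetic)
```

The oversampled synthetic stream gives the model hundreds of encounters with the rare event per ten thousand examples, instead of a handful.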

The risks come from how a machine trained using synthetic data responds to real-world changes. It can’t exist in an alternate reality, or it becomes less useful, or even dangerous, Ekin said. “How would you feel,” he asked, “getting into a self-driving car that wasn’t trained on the road, that was only trained on simulated data?” Any system using simulated data needs to “be grounded in the real world,” he said, including feedback on how its simulated reasoning matches up with what’s actually happening.

Udezue compared the problem to the creation of social media, which began as a way to expand communication around the world, a goal it achieved. But social media has also been misused, he said, noting that “now despots use it to control people, and people use it to tell jokes at the same time.”

As AI tools grow in scale and popularity, a scenario made easier by the use of synthetic training data, the potential real-world impacts of untrustworthy training, and of models becoming detached from reality, grow more significant. “The burden is on us builders, scientists, to be double, triple sure that the system is reliable,” Udezue said. “It’s not a fantasy.”

How to Keep Simulated Data in Check

One way to ensure models are trustworthy is to make their training transparent, so users can choose which model to use based on their evaluation of that information. The panelists repeatedly used the analogy of a nutrition label, which is easy for a user to understand.

Some transparency exists, such as the model cards available through the developer platform Hugging Face that break down the details of the different systems. That information needs to be as clear and transparent as possible, said Mike Hollinger, director of product management for enterprise generative AI at Nvidia. “Those types of things need to be in place,” he said.
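To make the nutrition-label analogy concrete, here is a minimal sketch of what a machine-readable label for training data might record, including how much of it was synthetic. The field names are purely illustrative assumptions, not any real model-card standard.

```python
# Hypothetical "nutrition label" for a model's training data.
# Every field name here is illustrative, not part of a real standard.
training_label = {
    "model": "example-model-v1",
    "total_examples": 1_000_000,
    "synthetic_examples": 400_000,
    "synthetic_sources": ["simulator-a", "self-generated"],
}

synthetic_share = (
    training_label["synthetic_examples"] / training_label["total_examples"]
)
summary = f"{synthetic_share:.0%} of training data is synthetic"
```

A user comparing two models could read such labels the way a shopper compares ingredient lists, without needing to inspect the data itself.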

Hollinger said that, ultimately, it won’t just be AI developers but also AI users who will define the industry’s best practices.

The industry also needs to keep ethics and risks in mind, Udezue said. “Synthetic data will make a lot of things easier to do,” he said. “It will drive down the cost of building things. But some of those things will change society.”

Udezue said observability, transparency and trust need to be built into models to ensure their reliability. That includes updating the training models so they reflect accurate data and don’t magnify the errors in synthetic data. One concern is model collapse, when an AI model trained on data produced by other AI models drifts further and further from reality, to the point of becoming useless.

“The more you shy away from capturing the real-world diversity, the more the responses may be unhealthy,” Udezue said. The solution is error correction, he said. “These don’t feel like unsolvable problems if you combine the idea of trust, transparency and error correction into them.”
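Model collapse can be demonstrated with a toy experiment of my own devising (an illustrative assumption, not the panelists’ example): treat a simple Gaussian fit as the “model,” train each generation only on samples from the previous generation, and watch the diversity of the data decay.

```python
import random
import statistics

# Toy sketch of model collapse: the "model" is just a Gaussian fit.
# Each generation trains only on samples drawn from the previous fit,
# so estimation bias and noise compound and the spread decays.
random.seed(0)
n = 50

def fit_and_sample(data, count):
    mu = statistics.fmean(data)      # re-estimate the mean
    sigma = statistics.pstdev(data)  # re-estimate the spread (biased MLE)
    return [random.gauss(mu, sigma) for _ in range(count)]

data = [random.gauss(0.0, 1.0) for _ in range(n)]  # "real-world" data
spread_start = statistics.pstdev(data)
for _ in range(500):  # 500 generations of training on self-generated data
    data = fit_and_sample(data, n)
spread_end = statistics.pstdev(data)
```

Because the spread estimate is biased low and the sampling noise compounds across generations, `spread_end` lands far below `spread_start`: the “model” ends up describing a world far narrower than the one it started from. Real model collapse is the same dynamic in a vastly higher-dimensional setting, which is why grounding and error correction matter.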
