ActivityPub Viewer

A small tool to view real-world ActivityPub objects as JSON! Enter a URL or username from Mastodon or a similar service below, and we'll send a request with the right Accept header to the server to view the underlying object.
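The actual implementation of this tool isn't shown here, but the request it describes can be sketched with only the Python standard library. The key detail is the Accept header, which asks the server for the underlying ActivityPub object rather than the HTML page (the content types below are the ones the ActivityPub spec defines for this purpose):

```python
import json
import urllib.request

# Media types the ActivityPub spec says servers should respond to.
ACCEPT = ('application/activity+json, '
          'application/ld+json; profile="https://www.w3.org/ns/activitystreams"')

def build_request(url: str) -> urllib.request.Request:
    # Without this Accept header, most servers return the HTML view instead.
    return urllib.request.Request(url, headers={"Accept": ACCEPT})

def fetch_object(url: str) -> dict:
    # Fetch and parse the ActivityPub JSON representation of the object.
    with urllib.request.urlopen(build_request(url)) as resp:
        return json.load(resp)
```

Calling `fetch_object("https://rukii.net/users/tero/statuses/113486632677250928")` would return an object like the one shown below.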

{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "ostatus": "http://ostatus.org#",
      "atomUri": "ostatus:atomUri",
      "inReplyToAtomUri": "ostatus:inReplyToAtomUri",
      "conversation": "ostatus:conversation",
      "sensitive": "as:sensitive",
      "toot": "http://joinmastodon.org/ns#",
      "votersCount": "toot:votersCount"
    }
  ],
  "id": "https://rukii.net/users/tero/statuses/113486632677250928",
  "type": "Note",
  "summary": null,
  "inReplyTo": null,
  "published": "2024-11-15T11:03:09Z",
  "url": "https://rukii.net/@tero/113486632677250928",
  "attributedTo": "https://rukii.net/users/tero",
  "to": [
    "https://www.w3.org/ns/activitystreams#Public"
  ],
  "cc": [
    "https://rukii.net/users/tero/followers"
  ],
  "sensitive": false,
  "atomUri": "https://rukii.net/users/tero/statuses/113486632677250928",
  "inReplyToAtomUri": null,
  "conversation": "tag:rukii.net,2024-11-15:objectId=319620414:objectType=Conversation",
  "content": "<p>Multi-epoch training is not just taking successive steps in the loss landscape. It is fundamentally different from single-epoch training.</p><p>Before the first epoch, the model knows nothing about the data coming in. During the first epoch, large models are typically capable of learning the training examples verbatim from a single sample only.</p><p>The task is: Learn to predict the given example when you know only about other stuff, not this particular example. This trains circuits in a large model which are generalizable, for predicting new, previously unseen data, with only the memorized knowledge from other training data, but not this one.</p><p>When the same data is seen again in the next epoch, the model doesn&#39;t learn to derive it from other data, but from its imprinted memory of the same exact thing. This trains competing circuitry in the model, fetching verbatim from memory.</p><p>Multi-epoch training is thus very harmful for the generalization capability of large models. Validation set performance might increase, but this is a mirage; the validation set is typically from within the same distribution as the training set and so it doesn&#39;t sound an alarm when out-of-distribution generalization capability erodes. Additionally, the loss functions give better scores for more certain outputs, and the outputs become excessively certain for cases where the ground truth was already seen before.</p><p>That&#39;s why LLMs are typically trained only one epoch, or possible just a bit over one epoch, to compensate for the initial warm up.</p><p>The same thing applies to fine-tuning existing models. Try to rather add more data than add more epochs.</p>",
  "contentMap": {
    "en": "<p>Multi-epoch training is not just taking successive steps in the loss landscape. It is fundamentally different from single-epoch training.</p><p>Before the first epoch, the model knows nothing about the data coming in. During the first epoch, large models are typically capable of learning the training examples verbatim from a single sample only.</p><p>The task is: Learn to predict the given example when you know only about other stuff, not this particular example. This trains circuits in a large model which are generalizable, for predicting new, previously unseen data, with only the memorized knowledge from other training data, but not this one.</p><p>When the same data is seen again in the next epoch, the model doesn&#39;t learn to derive it from other data, but from its imprinted memory of the same exact thing. This trains competing circuitry in the model, fetching verbatim from memory.</p><p>Multi-epoch training is thus very harmful for the generalization capability of large models. Validation set performance might increase, but this is a mirage; the validation set is typically from within the same distribution as the training set and so it doesn&#39;t sound an alarm when out-of-distribution generalization capability erodes. Additionally, the loss functions give better scores for more certain outputs, and the outputs become excessively certain for cases where the ground truth was already seen before.</p><p>That&#39;s why LLMs are typically trained only one epoch, or possible just a bit over one epoch, to compensate for the initial warm up.</p><p>The same thing applies to fine-tuning existing models. Try to rather add more data than add more epochs.</p>"
  },
  "attachment": [],
  "tag": [],
  "replies": {
    "id": "https://rukii.net/users/tero/statuses/113486632677250928/replies",
    "type": "Collection",
    "first": {
      "type": "CollectionPage",
      "next": "https://rukii.net/users/tero/statuses/113486632677250928/replies?only_other_accounts=true&page=true",
      "partOf": "https://rukii.net/users/tero/statuses/113486632677250928/replies",
      "items": []
    }
  }
}
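When a username rather than a URL is entered, there is one extra step before the fetch above: a Mastodon-style handle such as @tero@rukii.net is resolved to the object's URL via WebFinger (RFC 7033). A minimal sketch of building that lookup URL (the function name is illustrative, not part of this tool):

```python
from urllib.parse import quote

def webfinger_url(handle: str) -> str:
    # "@tero@rukii.net" -> WebFinger query on the user's home server,
    # using the acct: URI scheme that Mastodon and similar servers expect.
    user, _, domain = handle.lstrip("@").partition("@")
    resource = quote(f"acct:{user}@{domain}", safe="")
    return f"https://{domain}/.well-known/webfinger?resource={resource}"
```

The JSON returned by that endpoint contains a link of type application/activity+json pointing at the actor, which can then be fetched with the Accept header described above.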