2 Comments

This is a super detailed and interesting review. Could you speak a little more on the nature of the data behind this multimodality capability? Is it necessary that these data be static in the language context, 2-D in the image context, etc.? Where do the limitations of the data input lie at this point?

Secondly, if I were to develop an AI system for which there are no data yet available to feed into it, what are the best types (or quality) of data we should strive to collect? Generally speaking, of course, which leads to my question: criteria-wise, what would the ideal dataset look like?

Apologies for the long questions and thanks again for this well-thought-out piece!


My overall recommendation is to work with ChatGPT to explore machine learning tools, or to find a reliable vendor, before using generative or other deep learning models. If there is enough interest, I may put together a tutorial on training a model.

There are some limitations to the types of data that can be processed effectively:

Language context: AI models like GPT-4 primarily work with static data from the pre-training process, so unless LLMs are connected to another API or tool, such as LangFlow, or perform transfer learning, there will be some limitations (see the first sketch after this list for one way to pass in fresh data at query time).

Image context: AI models can process 2D images effectively, but their performance with 3D data or dynamic scenes (e.g., videos) might not be as strong. This is still a major research area in AI, where we will likely see rapid improvements. They can still generate some understanding of the content, but the accuracy and contextual understanding may be limited.

For other data, there may be base models you can use (the second sketch below shows one way to reuse a pretrained base model), but a brand-new deep learning model requires an extensive dataset, $$$$, and software engineering expertise to train.
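To make the language-context point concrete, here is a minimal sketch of retrieving fresh context at query time and passing it into the prompt, so the model is not limited to its static pre-training data. The `fetch_context` helper and its return value are hypothetical placeholders for whatever retrieval step you use (vector database, search API, etc.); this assumes the pre-1.0 `openai` Python package and an API key in the environment.

```python
# Minimal retrieval-augmented prompting sketch (hypothetical helper names).
# Assumes: `pip install openai` (pre-1.0 API) and OPENAI_API_KEY set in the environment.
import openai

def fetch_context(question: str) -> str:
    """Hypothetical stand-in for a retrieval step (vector DB, search API, etc.)."""
    # In practice this would query an external source so the model sees data
    # newer than its pre-training cutoff.
    return "Relevant documents retrieved at query time go here."

def answer_with_fresh_data(question: str) -> str:
    context = fetch_context(question)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(answer_with_fresh_data("What changed in the dataset this week?"))
```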
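And to illustrate the base-model route rather than training from scratch, here is a rough transfer-learning sketch using PyTorch and torchvision: start from a pretrained ResNet, freeze the backbone, and train only a new classification head. The class count and the training loader are placeholders, and this assumes a recent torchvision with the weights API.

```python
# Transfer-learning sketch: reuse a pretrained base model instead of training from scratch.
# Assumes: torch and torchvision installed; `train_loader` yielding (images, labels) is a placeholder.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # placeholder: number of classes in your own dataset

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained base model
for param in model.parameters():
    param.requires_grad = False  # freeze the backbone

model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(train_loader):
    """Train only the new head for one pass over the (placeholder) training data."""
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```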

Regarding the development of an AI system with no existing data, you'll need to collect a dataset that is representative, diverse, and balanced. Here are some criteria to consider when collecting an ideal dataset:

Representative: The data should accurately represent the problem you're trying to solve. This means including examples from various scenarios and contexts relevant to the task.

Diverse: The dataset should include a wide range of examples, covering different aspects of the problem. This helps ensure that the AI system can generalize well to new and unseen situations.

Balanced: Make sure the dataset is not biased towards certain classes or features. Imbalanced datasets can lead to biased AI models, as the model will tend to perform better on the overrepresented classes or features (a quick balance check is sketched after this list).

High quality: The data should be accurate, clean, and properly labeled (if needed). This ensures that the AI system learns from correct and reliable examples.

Sufficient size: The dataset should be large enough to provide a solid foundation for the AI system to learn from. The required size may vary depending on the complexity of the problem and the model being used.

IT privacy concerns: When using models like GPT, you need to consider keeping proprietary information safe. Your data will be collected by these companies unless you have spoken with the company about restrictions or are running a local copy. Meta’s LLM, LLaMA, was leaked a month ago, and you can train and use it locally on the new Apple MacBooks (see the local-model sketch after this list).

Ethical and legal considerations: Ensure that the data collection process follows ethical guidelines and legal regulations, such as privacy protection and informed consent.
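On the "balanced" criterion above, a quick sanity check like the sketch below can reveal imbalance before training; the labels list is a placeholder for your own data, and inverse-frequency class weights are one common way to compensate.

```python
# Quick class-balance check for a labeled dataset (labels list is a placeholder).
from collections import Counter

labels = ["cat", "dog", "dog", "dog", "bird", "dog", "cat", "dog"]  # placeholder labels

counts = Counter(labels)
total = len(labels)

print("Class distribution:")
for cls, n in counts.most_common():
    print(f"  {cls}: {n} ({n / total:.1%})")

# Inverse-frequency class weights, one common way to compensate for imbalance
# (many loss functions, e.g. CrossEntropyLoss, accept per-class weights).
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}
print("Suggested class weights:", weights)
```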
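On the privacy point, one way to keep prompts and proprietary data on your own machine is to run an open model locally. Below is a minimal sketch using the Hugging Face `transformers` pipeline; the model path is a placeholder for whatever locally downloaded, appropriately licensed weights you choose.

```python
# Sketch of running a text-generation model locally so prompts never leave your machine.
# Assumes: `pip install transformers torch` and weights already downloaded to the path below.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="/path/to/local/model",  # placeholder: any locally stored, appropriately licensed weights
)

result = generator("Summarize our internal report:", max_new_tokens=100)
print(result[0]["generated_text"])
```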
