Philosophical Approach II

Semantic Search

Philosophical approach - from ideas to concepts - II

The World of Concepts. Cepts and Keywords

Dr. Juan Chamero, Darwin Architect, Buenos Aires, 10 February 2009




    Finally we arrive at the “concept”, for many authorities the basic unit of knowledge.

I’m inclined to see concepts as units of meaning instead. In the Darwin ontology a concept corresponds to a definition located “hierarchically within” a given knowledge domain, and “hierarchically within” implies that it sits at the end of a unique “semantic path” within that domain. The words “unit” and “basic unit” can be misleading because they suggest members of a set that share the same or a similar hierarchical level.

    Since in the Darwin vision formal knowledge structures itself over inverted logical trees, every path can be considered a unit within a hierarchically structured topology. Each path has two ends: the “head”, pointing up the tree, and the “tail”, pointing down the tree. If we imagine a given domain of human knowledge structured as a tree, the topmost head is the “root”, usually drawn at the top, while derived subjects are drawn downstream, down to the most specific subjects: the “leaves”.


Note 1: Common Expressions are long-established concepts that universally bring to mind well-defined situations, scenarios, conclusions and flashes of wisdom: Latin locutions like “Res non verba” and “sine qua non”; proverbs, citations and sayings like “first come, first served”, “a burnt child dreads fire” and “a journey of a thousand miles starts with the first step”; and even short quotations like “we burn daylight” (Shakespeare).

Note 2: epistemology, from Wikipedia

Note 3: idea, a discussion about idea and ideals from Wikipedia; Baruch Spinoza, from Wikipedia

Note 4: Take care: the “innate” of Kant is not the same as the “innate” of Plato. Humans have mental restrictions and are “condemned” to see reality through tinted glasses.

[Figure: ART tree]

    The figure above shows several “semantic paths” of ART as_it_is on the Web as of August 2008, a knowledge domain unveiled by Darwin agents: 7,570 themes along thirteen levels holding up to 300,000 keywords. “Rigoletto” is a single-word keyword hosted at the “end” of path [] as its “tail”, with 0, the ART root, as its “head”.


Creation of a new concept hypothesis

    When we humans “create” a new concept, it is understood that we have arrived at a “precise enough” definition of an “ideal”, following a collective mental process of thinking in the neighborhood of Kant’s “regulative ideas”. This definition makes sense only when referred to its precise context within the semantic space of knowledge. So if something like a fictitious “Programmers Collective Authority” agrees on the meaning of “parallel processing”, that definition makes sense along a path of the form


[Information Technologies and Communications => Information Technologies => Computing => Software => Programming => Programming Languages ……]

That is a semantic path with its head on the IT&C root and its tail pointing to a specific subject, for example “parallel processing” in Operating Systems. Tails end in tree nodes, including the roots. And it is highly probable that the same name, “parallel processing”, will be used by other people to define agreed meanings on other semantic paths, for example in chemistry, economics, the pharmaceutical industry, etc.
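The head-to-tail reading of a semantic path can be sketched in a few lines of Python. This is only an illustration, not the Darwin implementation: the `Node` class and the `semantic_path` function are hypothetical names, and only the levels quoted in the path above are used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One subject in an inverted knowledge tree (hypothetical structure)."""
    name: str
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, name: str) -> "Node":
        """Create a more specific subject directly below this one."""
        return Node(name, parent=self)


def semantic_path(tail: Node) -> List[str]:
    """Climb from the tail to the root and return the names head first."""
    names = []
    node: Optional[Node] = tail
    while node is not None:
        names.append(node.name)
        node = node.parent
    return names[::-1]


# Only the levels quoted in the text are used; the elided remainder of the
# path is deliberately not filled in.
levels = [
    "Information Technologies and Communications",
    "Information Technologies",
    "Computing",
    "Software",
    "Programming",
    "Programming Languages",
]
root = Node(levels[0])
tail = root
for name in levels[1:]:
    tail = tail.add(name)

path = semantic_path(tail)
print(" => ".join(path))
```

Because each node stores only its parent, every node has exactly one path to the root, which matches the idea that a concept sits at the end of a unique semantic path.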



    Now let’s face the meaning of this rather confusing concept, the keyword. In the search engine industry, keywords are words or chains of words that “magically” open Pandora’s boxes hiding the pieces of knowledge we humans are looking for, returning document references that “prima facie” satisfy our cognitive needs.

    We have to take into account that Darwin technology, working in the Web space, deals with two realms: the “K Realm” of the “Established Knowledge”, represented by all Web sites, and the “K´ Realm” of people navigating the “Web Ocean”. In fact, Darwin technology is based on two interacting ontologies, one for each realm.

    Actual Web keywords are in fact “created” by “authorities”, most of them initially as neologisms: new words formed by combining sequences of pre-existing single words in a precise way. Examples are the rather old “outlet”, which was probably first imagined as “out-let”; “up-stream” and “down-stream”; “well formed formulae”, which ended up as WFF; and some of recent creation such as "quality of education assessment". These keywords can be considered “new words”, new ideograms created to facilitate human communication and to make it more universal and precise. Perhaps they should be written by concatenating their components with an underscore (_) or with upper-case letters, for instance WellFormedFormulae as equivalent to well_formed_formulae, or AsItIs as equivalent to as_it_is.
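The two proposed spellings can be converted into each other mechanically. The sketch below is illustrative only; the function names are hypothetical, and it assumes that the components of a keyword are separated by spaces, hyphens or underscores.

```python
import re


def words_of(keyword: str) -> list:
    """Split a keyword into lowercase components (spaces, hyphens, underscores)."""
    return [w.lower() for w in re.split(r"[\s_\-]+", keyword.strip()) if w]


def to_underscore(keyword: str) -> str:
    """'well formed formulae' -> 'well_formed_formulae'."""
    return "_".join(words_of(keyword))


def to_camel(keyword: str) -> str:
    """'as it is' -> 'AsItIs'."""
    return "".join(w.capitalize() for w in words_of(keyword))


def camel_to_underscore(token: str) -> str:
    """Insert '_' before every upper-case letter except the first, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", token).lower()


print(to_underscore("well formed formulae"))
print(to_camel("as it is"))
print(camel_to_underscore("WellFormedFormulae"))
```

Either spelling turns a chain of common words into a single token, which is what lets an index treat the keyword as one “new word”.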


Keywords versus Common Words domains

    Keyword domains are very limited, restricted to a specific subject within a given discipline, whereas the domain of Common Words and Expressions is the full language to which they belong. For instance, the common words “well”, “formed” and “formulae” are valid in the English language whatever the subject dealt with. By contrast, the keyword name “parallel processing” exists in at least 100 domains (subjects), as the tail of each of its 100 associated semantic paths.

    In languages such as Chinese, keywords are represented as new ideograms because that is what they are: representations of new specific ideas. Over time some keywords become Common Words or Common Expressions, and many finally die, usually through obsolescence. The same happens, but more slowly, to Common Words and Expressions.
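One way to picture this difference is to record a keyword not by its name alone but by its name together with each semantic path it closes. The sketch below is an assumption-laden illustration: the registry, the function names and the short two-level paths are all invented for the example.

```python
from collections import defaultdict

# name -> set of semantic paths (as tuples) on which the name appears as tail.
# Hypothetical registry; the paths below are invented, shortened examples.
keyword_paths = defaultdict(set)


def register(name, path):
    """Record that `name` is the tail of the given semantic path."""
    keyword_paths[name].add(tuple(path))


def domains(name):
    """Number of distinct semantic paths on which the name appears as tail."""
    return len(keyword_paths.get(name, ()))


register("parallel processing", ["Computing", "Operating Systems"])
register("parallel processing", ["Human Brain", "Perception"])
register("multitasking", ["Computing", "Operating Systems"])

print(domains("parallel processing"))
print(domains("multitasking"))
```

A common word such as “well” never enters this registry: it belongs to the whole language rather than to any particular path.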


The semantic space of existent documents (within the Web space)

    Primitive indexes of this space are located in the databases of the main search engines, basically structured as virtual two-dimensional arrays of documents versus words. Current search engines do not classify by keywords, only by accepted “words”. These arrays are huge, on the order of 20 million “columns”, one for each document hosted, and one million “rows”, one for each common word or expression, including brand names, geographical names, names of personalities, and well-known acronyms.
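A toy version of such a words-versus-documents array can make the limitation visible. The sketch below stores the array sparsely, as one set of documents per word; the three-document corpus and the `search` function are invented for illustration and do not describe how any real search engine is built.

```python
from collections import defaultdict

# Tiny invented corpus standing in for Web documents.
documents = {
    "doc1": "parallel processing in operating systems",
    "doc2": "opera and music in art",
    "doc3": "parallel lines in art",
}

# Sparse form of the array: one "row" per word, holding the set of
# documents ("columns") in which that word occurs.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)


def search(query):
    """Return the documents that contain every single word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


print(search("parallel processing"))
print(search("processing parallel"))
```

Both queries return the same document, because the index knows only single words: “parallel processing” is not stored anywhere as a unit, let alone as a concept tied to a subject.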

    However, something is missing: concepts. Since they are not detected and, when declared by the documents’ authors, are either neglected or considered not credible, they have to be “unveiled”. The Darwin ontology of the K side guides Darwin agents in performing this important task. Let’s suppose that we were able to unveil the main subject of each document (provided it is unique) and reorder the documents-words arrays (at least one per language), putting together documents that share the same or a similar main subject. By studying these document clusters from the point of view of literary concordance we will detect combinations of words that tend to appear regularly in most documents of each cluster, that are at the same time too “rare” within the whole Web universe to be considered a Common Word or a Common Expression, and that persistently tend to appear associated with others belonging to the same set of rare combinations.

    This characteristic means that the same combination may exist in other clusters, belonging to different disciplines or even to the same discipline, but never associated with the same semantic neighborhood. For example, the combination “parallel processing” related to Computing may appear in another cluster of the same discipline, for instance with “n-tier”, “multitasking”, “interleaving” and “distributed computing” as neighbors, but it may also appear related to Human Brain Processing, associated with “neural network”, “incoming stimuli” and “computer vision”. Of course the two definitions, parallel processing in Computing and parallel processing in Human Brain Processing, are completely different, and surely they are documented on more than one Web site, but they are not easy to find at this early stage of Knowledge Discovery. One of the Darwin K-side conjectures says that parallel processing located at the tail of a path of the Human Brain tree should be a different concept from parallel processing located at the tail of a path of Computing. In the current state of the art of Knowledge Discovery using our Darwin technology, we accept significant differences in concept neighborhood as discriminators.
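The first two criteria (regular within a cluster, rare across the whole collection) can be approximated with plain document frequencies. The sketch below is a simplification under stated assumptions, not the Darwin agents' procedure: it looks only at two-word combinations, the thresholds and the six-document corpus are invented, and the third criterion (association with other rare combinations) is not implemented.

```python
from collections import Counter


def bigrams(text):
    """Set of adjacent two-word combinations occurring in a text."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


def candidate_keywords(cluster_docs, all_docs,
                       min_cluster_share=0.9, max_global_share=0.5):
    """Bigrams found in most cluster documents but in few documents overall."""
    in_cluster = Counter(b for d in cluster_docs for b in bigrams(d))
    in_all = Counter(b for d in all_docs for b in bigrams(d))
    found = []
    for bigram, count in in_cluster.items():
        cluster_share = count / len(cluster_docs)
        global_share = in_all[bigram] / len(all_docs)
        if cluster_share >= min_cluster_share and global_share <= max_global_share:
            found.append(bigram)
    return sorted(found)


# Invented documents: one cluster on a computing subject plus unrelated texts.
cluster = [
    "parallel processing is used in the kernel",
    "parallel processing is used in the scheduler",
    "parallel processing is used in the compiler",
]
others = [
    "food processing is used in the factory",
    "the painting is used in the museum",
    "the opera is used in the film",
]

candidates = candidate_keywords(cluster, cluster + others)
print(candidates)
```

Combinations such as “is used” or “in the” are just as regular inside the cluster, but they are discarded because they are equally frequent everywhere else, which is the behavior expected of Common Expressions.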
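A neighborhood comparison of this kind can be sketched with a set-overlap measure. The choice of the Jaccard index and of a 0.5 threshold is an assumption made for this example, not something stated for Darwin; the first two neighborhoods come from the text, while the third one (including “threads”) is invented.

```python
def jaccard(a, b):
    """Share of neighbors that two neighborhoods have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_concept(neigh_a, neigh_b, threshold=0.5):
    """Treat two occurrences of a name as one concept if neighborhoods overlap enough."""
    return jaccard(neigh_a, neigh_b) >= threshold


# Neighborhoods of "parallel processing" quoted in the text.
computing = {"n-tier", "multitasking", "interleaving", "distributed computing"}
brain = {"neural network", "incoming stimuli", "computer vision"}
# Invented neighborhood, close to the Computing one, for comparison.
computing_like = {"multitasking", "interleaving", "distributed computing", "threads"}

print(jaccard(computing, brain), same_concept(computing, brain))
print(jaccard(computing, computing_like), same_concept(computing, computing_like))
```

With no neighbor in common, the Computing and Human Brain occurrences are kept apart as different concepts, while two occurrences with largely overlapping neighbors would be merged.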



Something of Imagery: The Web as a man-machine Teaching-Learning system


[Figure: e-learning machine]


    This figure shows a vision of cyberspace as a coupled Teaching-Learning system. We have two actors, users and authors, who may exchange roles. The Web is represented by the circle inside which documents are hosted. Within it there exists a restricted and always evolving EK, Established Knowledge. Users may navigate the Web space through search engines, which may be functionally imagined as the exterior light green circle, behaving like a World Search Membrane.

    Users may query the whole Web, obtaining information plus some subtle forms of intelligence. They may query either conventionally or semantically. This second form is not yet enabled, but it will be shortly. Conventional queries follow a sort of random path, as depicted in white inside the circle. Semantic queries could point directly to EK and optionally to the remaining Not Yet EK pages, the majority of the Web. EK may evolve continuously. The search membrane could be enabled to register all world queries, a crucial data asset because it keeps, hidden inside but retrievable, “Users Behavior Patterns”! And in the long run it would enable knowledge of the People’s Thesaurus, in fact of how people learn!

    On the “other side”, authors inject more information and knowledge into the Web in a World Teaching role, receiving in return “traffic information”, querying patterns and direct user interactions in the form of demands, suggestions, registrations and even offers. We may also imagine a virtual EK Membrane that in the near future may control the best teaching and the best EK evolution.



