Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLite DB with SQLAlchemy #190

Draft
wants to merge 6 commits into
base: db-migration
Choose a base branch
from

Conversation

JoshuaMaddy
Copy link

A quick mock-up of an SQLite database using SQLAlchemy as the ORM.
My reasoning for suggesting using SQLAlchemy:

  1. The ability to use many different connectors (PostgreSQL, MariaDB, SQLite, etc.) to support different database dialects with minimal adaptations.
  2. The convenience of translating between rows and Python objects. All classes are fully typed, mypy compliant.
  3. SQLAlchemy supports execution of raw SQL if the ORM is not desired.
  4. Simplification of some methods, like Entry.has_tag/Entry.remove_tag

This is heavily inspired by #187 , but not with the exact same table definitions.

src.database.table_declarations.* are individual files declaring each table/Python object. Adapted from classes in src.core.libarary.
src.database.manage is the bare minimum to create an engine, and to add/drop tables from declarations.
src.make_db is a quick, dirty script to demonstrate how to make an SQLite database from the declarations, then add records as objects and query records as objects.

This is just a proof of concept, so there are some notable issues.

  1. Collations are not implemented.
  2. Path objects are stored as strings, but they can be auto-casted from Pathlib Path -> varchar-> Pathlib Path.
  3. Multi-value attributes, like crop, dimensions, etc are not implemented, but could be casted, stored as a pickle blob, or as JSON.

If people like this direction, I'd be glad to flesh it out more! I also get that some devs dislike ORMs, so no worries if that's the case.

@CyanVoxel CyanVoxel added Type: Refactor Code that needs to be restructured or cleaned up TagStudio: Library Relating to the TagStudio library system labels May 18, 2024
@CyanVoxel CyanVoxel added this to the SQLite Database Migration milestone May 18, 2024
@yedpodtrzitko
Copy link
Collaborator

Nice job. Do you think it would make sense to add Alembic as well for in case the DB schema will change among versions?

@JoshuaMaddy
Copy link
Author

Nice job. Do you think it would make sense to add Alembic as well for in case the DB schema will change among versions?

Alembic would definitely be a good addition once the schema has settled down. Things are still very, very unstable.

@JoshuaMaddy
Copy link
Author

JoshuaMaddy commented May 20, 2024

The most recent commit adds some of the basic functionality: adding tags to entries, creating tags, and modifying tags. Adding/editing/removing tag box/text line fields should also work. It works with multi-selection. Many changes were made to several Qt widgets/modals to support the database types, and there is a significant overhaul of search results to leverage dataclasses instead of ambiguous tuples/lists. There's still a lot left to do to get it back to functional parity.

A lot of the codebase relies on mutating/reading what was an always accessible database. That's a lot harder to do now that there is a required interface layer with stricter access. I am still working on how best to resolve that.

In many cases, querying the database from Qt is done through the Library object. In some specific cases, like writing preview panel field widgets, a database session is opened in the logic to allow for lazy-loading joins deep in callbacks. I don't like mixing direct database access into the Qt code, but for now, it's the best option due to its simplicity.

I'd appreciate any thoughts on whether it's acceptable to open database sessions inside Qt classes. From an MVC viewpoint, it feels wrong - but this codebase isn't really following an MVC pattern.

@yedpodtrzitko
Copy link
Collaborator

yedpodtrzitko commented May 21, 2024

I really like that instead of just talking what needs to be done and how it needs to be done, you just go and write the code. And that's much easier to reason about than some abstract plans. So in this regard I fully support this PR. However I dont really have any decisive power in this project, so this doesnt mean that much :)
Just word of advice - I dont like all the code which is commented out, but touching more things will lead to more merge conflicts and it's more things to review, so maybe open another PR just for doing that?

@CyanVoxel - since git exists, it's should be easy to keep a separate branch for bugfixes of the existing release, and this could get merged into the development branch where more people will be exposed to it, and it can be gradually polished until it will fully work as expected.
SQLAlchemy in combination with Alembic allows to seamlessly change DB schema. That means even if this PR is currently missing implementation of Collations etc., it can be done later without losing any user's data.
So my 2 cents would be to focus on this and make it a working MVP just with Tags (since no other things are implemented in the current versions anyway, afaik) and then merge it. New features can be added later. Perfect is enemy of good enough, and that perfect version might never get done.

@JoshuaMaddy
Copy link
Author

JoshuaMaddy commented May 21, 2024

Just word of advice - I dont like all the code which is commented out, but touching more things will lead to more merge conflicts and it's more things to review, so maybe open another PR just for doing that?

Yeah, I wasn't too sure if removing commented out code was a good idea for basically that reason. I tried to only touch comments that were in areas I was actively working in because they sort of got in the visual way, but I probably overreached in that aspect, apologies.

... this could get merged into the development branch where more people will be exposed to it, and it can be gradually polished until it will fully work as expected.

This is still very much a work in progress, a lot of my changes have been sort of ad-hoc. Before considering merging into dev, I would like to continue refining some key aspects, like the Library methods and Session usage. As I refactored, their purposes became slightly muddied and I'd like to make them more unified. Also, a lot of dead code is still sitting in src.core, and there are some globals I need to move to the newer src.core.constants. Small things, but lots of them.

@yedpodtrzitko
Copy link
Collaborator

This is still very much a work in progress, a lot of my changes have been sort of ad-hoc. Before considering merging into dev, I would like to continue refining some key aspects, like the Library methods and Session usage. As I refactored, their purposes became slightly muddied and I'd like to make them more unified. Also, a lot of dead code is still sitting in src.core, and there are some globals I need to move to the newer src.core.constants. Small things, but lots of them.

Of course, I didnt mean to merge it now. I briefly tried to run the code from this branch, and it's not ready for that. I meant this should get merged when it will have feature parity with the current state of main branch, rather than postponing it because there are missing some things from roadmap. I bet if this would keep staying unmerged for 3 months, you'd eventually lose interest to keep working on it.

@CyanVoxel
Copy link
Member

@CyanVoxel - since git exists, it's should be easy to keep a separate branch for bugfixes of the existing release, and this could get merged into the development branch where more people will be exposed to it, and it can be gradually polished until it will fully work as expected.ng any user's data. So my 2 cents would be to focus on this and make it a working MVP just with Tags (since no other things are implemented in the current versions anyway, afaik) and then merge it. New features can be added later. Perfect is enemy of good enough, and that perfect version might never get done.

So do you think the best approach going forward would be to open up a db-migration branch and start by pulling this in there? Or were you thinking of a more general dev branch where all these big changes can go from now on? I agree this should be out of a PR and onto a separate branch until it reaches parity.

@yedpodtrzitko
Copy link
Collaborator

So do you think the best approach going forward would be to open up a db-migration branch and start by pulling this in there? Or were you thinking of a more general dev branch where all these big changes can go from now on? I agree this should be out of a PR and onto a separate branch until it reaches parity.

I'm on the fence regarding the general dev branch, so I would rather say dont change it if the current setup works.

Regarding "dev branch" for the DB migration...
It's a bit unfortunate that the effort is split into two concurrent PRs. Even though #187 was created first, I consider the SQLAlchemy approach (in comparison with the vanilla sqlite) superior for various reasons.

No matter which one will get chosen, I'd suggest to merge it when Tags will be working, and the rest can be added later *. However merging it will cause the same scenario as above. Meaning the main branch will basically become a "development" branch unless it will be 100% ready for release. From my point of view it's still more preferable approach than trying to get implemented everything (Collations etc.) before merging it. Because that's doing two things at the same time - migrating from JSON to DB, and then also adding new features which are not currently implemented even in JSON, and looking at the size of both competing PRs, they're already pretty big as they are.

* the DB schema will change sooner or later no matter what. And at that point SQLAlchemy(+Alembic) provides better way how to deal with it than anything sqlite can provide (afaik)

yedpodtrzitko added a commit to yedpodtrzitko/TagStudio that referenced this pull request May 25, 2024
@CyanVoxel CyanVoxel changed the base branch from main to db-migration May 31, 2024 09:30
@CyanVoxel
Copy link
Member

Just wanted to touch base with this PR. A db-migration branch has been opened so this migration can be worked on bit-by-bit without worrying about getting the entire thing functional in a single PR. #187, which had a non-SQLAlchemy approach but a newer planned schema, has been closed in favor of this SQLAlchemy approach.

In essence, the planned direction we have is to move forward with SQLAlchemy, the schema shown in #187 (ideally), and on the db-migration branch. Assuming the conflicts are resolved here, I wouldn't mind pulling this PR and using it as a jumping off point. Alternatively, this could be split up into smaller PRs working off the current codebase in order to manage things better.

I just wanted to get everyone's thoughts before getting this moving along 🤞

Loran425 and others added 3 commits June 3, 2024 21:47
Add item delegate for Ignored File Extension to add leading `.` if left off extension
* Added various file formats to constants.py

* Update tagstudio/src/core/constants.py

Co-authored-by: Travis Abendshien <46939827+CyanVoxel@users.noreply.github.com>

* Update tagstudio/src/core/constants.py

Co-authored-by: Travis Abendshien <46939827+CyanVoxel@users.noreply.github.com>

---------

Co-authored-by: Travis Abendshien <46939827+CyanVoxel@users.noreply.github.com>
@JoshuaMaddy
Copy link
Author

I went ahead and reverted the majority of the changes for two reasons:

  1. If the schema is going to follow Switch to SQLite DB storage #187, then the usage of Field types (TextField, DatetimeField, TagBoxField) in this PR's schema will be replaced by entry_attributes. A lot of the work in this PR's src.qt.* relied on these Field classes. More on that below.
  2. A lot of commits were very, very messy. Either removing too many unrelated things, or making changes that I later regretted.

I feel it's best to mostly wipe things, reevaluate/update the schema in this PR, and then reimplement the prior work in following PRs. I don't anticipate reimplementation to take long.

One of the major issues I have with #187's schema is the entry_attributes table. Defining a table that is a catch all for a mixture of datatypes does not sit well with me. In my opinion, there should be a Field class, with subclasses for each Field purpose, hence TextField for things like Note, Description, etc., DatetimeField for all date fields, and TagBoxField for meta tags and user tags, or any other tag boxes.

Using multiple tables/objects for different field types allows for more concrete typing, which eliminates the need to fuzzily check the field type. For example this pattern used in src.qt.widgets.preview_panel:
if self.lib.get_field_attr(field, "type") == "tag_box": -> if isinstance(field, TagBoxField):
It also avoids the need for null checking row values that entry_attributes would introduce. For example, a TextField is deterministic, it will always have a text value column, and there would be no erroneous datetime or number columns.

The other tables I did not implement from #187, entry_page and location, are simple.
entry_page, from my understanding, is for collations which are outside of this PR's current scope.
location can be emulated with this PR's Entry.path:
path: Mapped[Path] = mapped_column(unique=True)
This takes a Pathlib.Path, converts to str on DB commit, and converts back to Path on DB read. You can then access the Path.name property, eliminating the need for storing names directly in the DB. The assumption being that the DB "name" would be the same as the Path.name, and not a user determined alias.

As it is now, the table declarations are functionally equivalent to #187, with the major difference in Field outlined above. I would be fine with merging this into the db-migration branch as is, if the schema is deemed acceptable. I'd like to hear any thoughts on the schema, and I've added an ER diagram for those unfamiliar with SQLAlchemy below!

TagStudio

@Loran425
Copy link
Collaborator

Loran425 commented Jun 6, 2024

I'll have to do some re-reading on past discussions to confirm but I believe the main goal of the mashup was that basically everything was a tag, so the field Author was really a tag with the name Author that had an associated value for a given entry, and the Meta Tags box was a tag that indicated to the UI that it should render a TagBox with all tags included that had Meta Tags as a preference. Like I mentioned though I'll need to do some re-reading to verify.

The locations table was intended to support a multiple directories feature which should probably be excluded from the first set of changes.

Rough Idea for Locations Table

For the locations table the original intent was to support multiple directories for a single Library.
So for each directory you would have a location such as C:\Users\Loran425\Pictures or C:\Users\Loran425\Downloads then each entry was storing the relative path to its location, I don't think that structure needs to stay necessarily but it was a convenient way to be able to let TagStudio know which directories it should be looking at.

Location = {
    "id": 0,
    "path": "C:\Users\Loran425\Pictures",
    "name": "Pictures"
}
Entry = {
    "location": 0,
    "path": "test.png"
    ...
}

Would refer to the file "C:\Users\Loran425\Pictures\test.png" in my original PR.

@yedpodtrzitko
Copy link
Collaborator

Using multiple tables/objects for different field types allows for more concrete typing, which eliminates the need to fuzzily check the field type. For example this pattern used in src.qt.widgets.preview_panel:
if self.lib.get_field_attr(field, "type") == "tag_box": -> if isinstance(field, TagBoxField):
It also avoids the need for null checking row values that entry_attributes would introduce. For example, a TextField is deterministic, it will always have a text value column, and there would be no erroneous datetime or number columns.

@JoshuaMaddy I dont know if having the field split among multiple tables is the best approach. I understand the issue with type-guessing, but what would you think about having a single table with multiple columns (for each type), and then there would be a column field_type which would determine with which column to work. Having everything in a single table could simplify some aspects.

There's also needed a column like field_position for saving information at which position the column should be displayed in the UI, so it will be the same every time.

class FieldType(enum.Enum):
    STRING = "string"
    INTEGER = "integer"
    DATETIME = "datetime"
    TAG = "tag"


class FieldValue(Base):
    
    id = Column(Integer, primary_key=True)
    entry_id = Column(Integer, ForeignKey('entry.id'))
    name = Column(String(255))
    field_type = Column(Enum(FieldType))
    field_position = Column(Integer)
    # different values
    string_value = Column(String(255), nullable=True)
    integer_value = Column(Integer, nullable=True)
    datetime_value = Column(Date, nullable=True)
    tags = relationship('Tag')

@JoshuaMaddy
Copy link
Author

I'll have to do some re-reading on past discussions to confirm but I believe the main goal of the mashup was that basically everything was a tag, so the field Author was really a tag with the name Author that had an associated value for a given entry, and the Meta Tags box was a tag that indicated to the UI that it should render a TagBox with all tags included that had Meta Tags as a preference. Like I mentioned though I'll need to do some re-reading to verify.

To me, this sounds like a common image board approach to tagging, in that each tag has a 'category'. I personally prefer this approach, but I was attempting to emulate the JSON structure which iirc does not treat said fields as tags.
From what I've read, it seems like there are three very clear tag categories:

  1. Meta Tags - ex: Favorite, Archived
  2. User Defined Tags - user populated, optional hierarchy
  3. Artist Tags - user populated, arguably meta tag

Other image board implementations also use copyright and character categories, but I'm hesitant to include these - feels a bit opinionated as to how the program should be used.

There are some optional fields that carry per-entry data, for example date published and notes, that I feel do not lend themselves to a tag representation and should stay as fields.

I will change TagBox / Meta TagBox to function as you've described, with the intention of using Tag categories as the filter when displayed.

The locations table was intended to support a multiple directories feature which should probably be excluded from the first set of changes.

Thank you for the clarification, I misunderstood the purpose of that table. I agree that as of now it is probably outside of the scope, but should be added in the future. I'd be glad to add it now if desired.

@JoshuaMaddy
Copy link
Author

JoshuaMaddy commented Jun 6, 2024

@JoshuaMaddy I dont know if having the field split among multiple tables is the best approach. I understand the issue with type-guessing, but what would you think about having a single table with multiple columns (for each type), and then there would be a column field_type which would determine with which column to work. Having everything in a single table could simplify some aspects.

I personally do not see a benefit to having a monolithic field table, but there is a nice middle ground that SQLAlchemy provides for this exact problem.
I suggest using single table inheritance, aka polymorphic, on Field. This keeps the DB representation of Field and its subclasses as a single table, but on the ORM side, separates entries into classes that are strongly typed. The lower level implementation is actually quite similar to what you've suggested, it's just more tightly integrated with the rest of SQLAlchemy. The reason I did not do this in the initial schema is that it does not follow normal form and as discussed mixes types.

There's also needed a column like field_position for saving information at which position the column should be displayed in the UI, so it will be the same every time.

This will be implemented with the Ordering List extension on the Entry class.

I'll get these changes made today and push a revised schema for further comments!

@JoshuaMaddy
Copy link
Author

I've updated the Schema to use a single table for Fields, Fields are ordered, Tags are joined to Entries, and Tags have categories.
I've also added a temporary example file, database_example.py, which should create an sqlite DB to browse, and output this:

Entry information:
        Meta tags: ['Archived', 'Favorited']
        User tags: ['User Made Tag']
                User Tags' Subtags: [('User Made Tag', ['Subtag'])]
        Is archived: True
        Is favorited: True
        Ordered Fields:
                (0, 'Example Note', 'Example text value')
                (1, 'Example Integer', 100)
                (2, 'Example Float', 100.1)
                (3, 'Example Datetime', datetime.datetime(2024, 6, 6, 11, 52, 37, 310918))
                (4, 'Meta Tag Box', <TagCategory.meta_tag: 'meta_tag'>)
                (5, 'User Tag Box', <TagCategory.user_tag: 'user_tag'>)

Moving first field to end, reordering, and committing to DB...

Entry information:
        Ordered Fields:
                (0, 'Example Integer', 100)
                (1, 'Example Float', 100.1)
                (2, 'Example Datetime', datetime.datetime(2024, 6, 6, 11, 52, 37, 310918))
                (3, 'Meta Tag Box', <TagCategory.meta_tag: 'meta_tag'>)
                (4, 'User Tag Box', <TagCategory.user_tag: 'user_tag'>)
                (5, 'Example Note', 'Example text value')

The updated schema:
ERD

@CyanVoxel CyanVoxel added the Priority: High An important issue requiring attention label Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: High An important issue requiring attention TagStudio: Library Relating to the TagStudio library system Type: Refactor Code that needs to be restructured or cleaned up
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants